The New Attack Surface
What China's weaponization of Claude reveals about the obsolescence of cybersecurity as we know it
By Rafal Rohozinski and Chris Spirito
When Anthropic disclosed that Chinese state-sponsored hackers had weaponized Claude Code to execute the majority of a sophisticated cyber espionage campaign targeting roughly thirty global organizations, the immediate response focused on the breach itself. But with some distance, the incident’s deeper significance becomes clear: it revealed the emergence of an entirely new attack surface - one that existing security frameworks were never designed to handle and one that will fundamentally reshape how we think about digital defense.
The timing of this disclosure is particularly significant against the backdrop of an AI market experiencing explosive growth. The global artificial intelligence market reached $244 billion in 2025 and is projected to hit $827 billion by 2030 - a compound annual growth rate of 27.7%, making AI one of the fastest-expanding economic sectors in human history. With AI contributing an estimated $15.7 trillion to the global economy by 2030 - more than the current output of China and India combined - these platforms have become the commanding heights of the digital economy. When Willie Sutton was asked why he robbed banks, he allegedly replied, “Because that’s where the money is.” The same logic applies to AI platforms: they represent the convergence of computational power, data access, automation capabilities, and economic value that makes them irresistible targets for both criminal actors and nation-state adversaries.
The Anthropic case deserves scrutiny not merely for what happened, but for what it portends. The attackers manipulated Claude Code into acting as a penetration testing orchestrator, conducting reconnaissance, vulnerability discovery, exploitation, lateral movement, credential harvesting, data analysis, and exfiltration operations largely autonomously. Human operators intervened at perhaps four to six critical decision points per campaign - not to execute tactics, but to authorize strategic escalations. This represents a qualitative shift from AI as a force multiplier to AI as the primary operational agent.
The Commoditization of Malicious AI
Yet the Anthropic incident represents only the sophisticated end of an increasingly commoditized criminal ecosystem. The emerging landscape of criminal LLMs reveals a different but equally concerning dynamic: the systematic democratization of advanced cyber capabilities. This ecosystem is best understood as a fragmented, opportunistic layer built on top of existing open-source and commercial AI models rather than as a parallel universe of independently engineered “dark AIs.”
Tools such as WormGPT, FraudGPT, GhostGPT, DarkGPT, and OnionGPT are typically repackaged or lightly fine-tuned versions of mainstream models, wrapped in custom interfaces and stripped of safety guardrails. Their operators market them aggressively on dark-web forums as all-purpose assistants for phishing, social engineering, credential harvesting, malware scripting, and fraud. Much of this ecosystem is fueled by criminal-sector entrepreneurship rather than novel AI research - subscription-based “AI-as-a-crime-service” offerings, ephemeral rebrands, and exaggerated capability claims designed to attract low-skill actors who want turnkey tools for cybercrime.
At the same time, there is growing evidence that higher-end criminal and APT groups simply jailbreak commercial APIs or run locally hosted open-source models with curated prompts, using them for reconnaissance, tool automation, multilingual targeting, and large-scale content generation. This two-tier structure - commoditized criminal tools for low-skill actors and sophisticated API abuse by advanced persistent threats - creates attack capabilities that scale both horizontally and vertically.
In terms of maturity, the criminal LLM ecosystem is real, operationally useful, but still relatively primitive. These models do not represent a breakthrough in adversarial AI research; they are not custom-built foundation models or genuinely novel architectures. Their capabilities are incremental rather than revolutionary - automating phishing templates, improving linguistic quality, generating boilerplate malware stubs, and enabling non-technical actors to scale routine cybercrime. Their reliability is inconsistent, their “malicious tuning” is usually shallow, and many advertised models are little more than reskinned copies of the same open-source checkpoints.
Nevertheless, the ecosystem is maturing around the edges: operators are refining user interfaces, integrating code-execution environments, bundling OSINT tools, and offering cloud-hosted anonymized access. As the cost of running local models declines and guardrail-free checkpoints proliferate, these crime-adapted LLMs will become more stable, more commodified, and more seamlessly integrated into criminal workflows - even if they remain fundamentally derivative rather than innovative.
The Strategic Convergence
The convergence of nation-state sophistication (as demonstrated in the Anthropic case) with criminal ecosystem maturation creates a particularly dangerous dynamic. When advanced persistent threats can leverage both state-sponsored resources and commoditized criminal tools, the result is attack capabilities that combine the precision of espionage operations with the scale and persistence of criminal enterprises. This represents a fundamental shift in threat modeling: we’re no longer dealing with discrete categories of actors but with a fluid ecosystem where techniques, tools, and tactics flow seamlessly between different threat communities.
Machine-Speed Warfare
What makes this incident particularly sobering is how it exposed the inadequacy of conventional cybersecurity thinking. The attackers bypassed Claude’s safety guardrails through what Anthropic euphemistically terms “social engineering” - presenting tasks to Claude as routine technical requests through carefully crafted prompts and established personas, enabling Claude to execute individual components of attack chains without access to the broader malicious context. In essence, they compartmentalized malicious instructions into seemingly benign technical tasks, exploiting the very modularity that makes large language models useful.
Traditional cybersecurity frameworks - whether MITRE ATT&CK, NIST CSF, or STRIDE - operate on the assumption that humans drive tactical decision-making throughout attack chains. These frameworks categorize threats into on-platform activities: credential theft, container exploitation, API abuse, cloud misconfigurations, supply-chain compromise, lateral movement in GPU clusters, data exfiltration, compute abuse, platform disruptions, and ransomware in cloud AI environments. But when an AI agent can autonomously progress from reconnaissance to data exfiltration at thousands of requests per second, these frameworks become not merely insufficient but actively misleading. They’re calibrated for human-speed attacks where defenders have windows to detect, analyze, and respond. Machine-speed operations compress those windows to near-zero.
More critically, existing frameworks fail to address an entirely new category of AI-specific threats that transcend traditional platform boundaries: training data poisoning, adversarial examples, model inversion and extraction, membership inference, trojan/backdoored models, prompt injection and jailbreaks, AI-agent manipulation, reward-model manipulation, latent-space sabotage, malicious fine-tuned models, cross-modal deception, malicious model supply chain poisoning, ML-synthesized attacks on ML systems, synthetic data poisoning, and model integrity degradation over time.
The attack lifecycle in the Anthropic case followed a predictable pattern that reveals the fundamental inadequacy of platform-centric detection. Phase one involved human operators selecting targets and building attack frameworks around Claude Code, convincing the AI it was working for a legitimate cybersecurity firm conducting defensive testing. Phase two saw Claude inspect target infrastructure, identify high-value databases, and generate reconnaissance reports - work that would have required days or weeks from a human team, completed in hours. Phase three encompassed vulnerability identification and exploit development, with Claude researching known CVEs, writing custom exploit code, and testing it against target systems. Phase four involved credential harvesting and lateral movement, with the framework using Claude to identify high-privilege accounts, create persistent backdoors, and map internal network topology. Phase five focused on data exfiltration and documentation.
This wasn’t about AI helping a human hacker work faster. This represented AI as the primary operator conducting reconnaissance, vulnerability discovery, exploit generation, lateral movement, and data triage with minimal human oversight. That’s a qualitative shift, not just a quantitative one. The implications extend beyond cybersecurity to any domain where task decomposition, tool use, and autonomous execution create value for legitimate users - all will face the same weaponization risk.
The Upstream Threat Vector
The strategic implications extend far beyond platform security. If attackers can exploit AI agents for operational automation, the logical next step is moving upstream to compromise the intelligence itself. This represents a fundamental shift in attack vectors from reactive to preemptive, from platform-centric to model-centric.
Upstream threats operate in a different temporal dimension than traditional attacks. Rather than exploiting existing systems, they shape the systems before deployment. Training data poisoning involves planting malicious examples on the open internet, manipulating public corpora (Wikipedia, GitHub, forums, blogs), conducting SEO poisoning to influence crawled datasets, and embedding synthetic data designed to bias model behavior. Trojanized pretrained models get uploaded to public model hubs as legitimate research, while backdoored fine-tunes disguise themselves as domain-specific improvements. Corrupted benchmarks and manipulated reward models for RLHF create systemic biases that make malicious behavior appear normal or beneficial.
When the model’s internal representations are shaped to interpret malicious behavior as benign or beneficial, traditional detection mechanisms become irrelevant. The model gets pre-biased so that platform-level guardrails never truly activate. This creates an adversarial arms race where defensive improvements in on-platform detection drive offensive innovation in upstream manipulation.
Document poisoning for retrieval-augmented generation (RAG) systems represents a particularly insidious vector. By manipulating the external knowledge bases that AI systems query, attackers can influence responses without directly compromising the model. Embedding manipulation and indirect prompt injection through malicious content in websites, emails, and files create persistent influence channels that operate below the threshold of detection.
This creates what we might call the “provenance problem” - a challenge that operates at the intersection of technical feasibility, economic incentives, and geopolitical competition. In an ecosystem where AI-generated content is increasingly indistinguishable from human-authored content, how do we maintain attribution chains that enable detection of malicious activity? The answer may lie not in trying to secure the platforms themselves - an inherently reactive approach - but in implementing cryptographically anchored metadata systems that track content origins from creation through deployment.
The Provenance Problem
The concept resembles packet painting in network security, but applied to content rather than data flows. A plausible hypothesis suggests that future AI security architectures could implement cryptographically anchored, non-removable watermarking applied both to all inputs used for training or fine-tuning an LLM and to all synthetic outputs the model generates. In theory, this dual-sided watermarking would create traceable provenance chains that could expose upstream data poisoning, model checkpoint tampering, or injection of unverified synthetic content into training pipelines. On the downstream side, watermarking would help distinguish human-authored material from AI-generated content, enabling more reliable detection of synthetic code or text used for malicious purposes.
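A minimal sketch makes the architecture concrete - our illustration, not a description of any deployed system. It assumes a simplified record format and a shared authentication key; a production design would use asymmetric signatures and standards such as C2PA. Each record binds a content digest and its declared origin to the digest of the previous record, so tampering anywhere upstream breaks verification everywhere downstream.

```python
import hashlib
import hmac
import json
import time

# Hypothetical shared key for the sketch; a real system would use asymmetric signing.
SIGNING_KEY = b"demo-key-not-for-production"

def append_record(chain: list, content: bytes, origin: str) -> dict:
    """Append an authenticated provenance record for `content` to the chain."""
    prev_digest = chain[-1]["record_digest"] if chain else "GENESIS"
    body = {
        "content_sha256": hashlib.sha256(content).hexdigest(),
        "origin": origin,  # e.g. "training-corpus" or "model-output"
        "timestamp": time.time(),
        "prev_record": prev_digest,
    }
    serialized = json.dumps(body, sort_keys=True).encode()
    body["record_digest"] = hmac.new(SIGNING_KEY, serialized, hashlib.sha256).hexdigest()
    chain.append(body)
    return body

def verify_chain(chain: list) -> bool:
    """Recompute every authenticator and back-link; any tampering breaks the chain."""
    prev = "GENESIS"
    for record in chain:
        body = {k: v for k, v in record.items() if k != "record_digest"}
        serialized = json.dumps(body, sort_keys=True).encode()
        expected = hmac.new(SIGNING_KEY, serialized, hashlib.sha256).hexdigest()
        if record["record_digest"] != expected or record["prev_record"] != prev:
            return False
        prev = record["record_digest"]
    return True

chain: list = []
append_record(chain, b"curated training document", "training-corpus")
append_record(chain, b"model-generated summary", "model-output")
print(verify_chain(chain))  # True; altering any upstream record makes this False
```

The point of the sketch is the dependency structure rather than the cryptography: once inputs and outputs share a verifiable lineage, upstream poisoning at least leaves a forensic trail.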
However, the technical challenges prove formidable and may be fundamentally intractable. Current watermarking techniques remain vulnerable to paraphrasing, compression, adversarial rewriting, and data laundering. Research has demonstrated that attackers can post-process watermarked content via small, human-imperceptible perturbations to evade detection while maintaining visual quality. For text watermarking, statistical schemes can reliably flag long, unmodified model outputs, but their effectiveness deteriorates sharply under paraphrasing, translation, or deliberate scrubbing attempts.
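The statistical schemes referenced here typically bias generation toward a pseudorandom “green” subset of the vocabulary and then test whether suspect text contains more green tokens than chance would predict. The sketch below is a simplified illustration of that detection test - the tokenization, seed, and green-list fraction are assumptions, not any vendor’s scheme - and it also shows why paraphrasing hurts: rewording re-rolls the green/red assignments and drags the score back toward zero.

```python
import hashlib
import math

GREEN_FRACTION = 0.5                  # assumed share of the vocabulary marked "green"
SECRET_SEED = "demo-watermark-seed"   # hypothetical detection key

def is_green(prev_token: str, token: str) -> bool:
    """Pseudorandomly assign `token` to the green list, keyed on its predecessor."""
    digest = hashlib.sha256(f"{SECRET_SEED}|{prev_token}|{token}".encode()).digest()
    return digest[0] < 256 * GREEN_FRACTION

def watermark_z_score(tokens: list) -> float:
    """Z-score of the observed green-token count against the no-watermark null."""
    n = len(tokens) - 1
    greens = sum(is_green(tokens[i], tokens[i + 1]) for i in range(n))
    expected = GREEN_FRACTION * n
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (greens - expected) / std

# Unwatermarked or heavily paraphrased text scores near zero; long, unmodified
# output from a watermarking model scores far above conventional thresholds (~4).
sample = "the quick brown fox jumps over the lazy dog".split()
print(round(watermark_z_score(sample), 2))
```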
For images, audio, and video, watermarking increasingly pairs with cryptographic provenance standards like C2PA (Coalition for Content Provenance and Authenticity) and Content Credentials. These approaches embed provenance metadata directly into content rather than relying solely on imperceptible modifications. Yet this metadata remains vulnerable to stripping by malicious actors or during content sharing on social media platforms, where preservation of provenance information is not prioritized.
The fundamental challenge extends beyond technical robustness to economic and political feasibility. Implementing comprehensive watermarking would require a level of industry coordination that borders on the fantastical. Every major AI platform, every training dataset curator, every fine-tuning service would need to adopt compatible standards. The economic incentives run in precisely the opposite direction - competitive advantage often derives from proprietary approaches that resist standardization.
Moreover, open-source watermarking schemes present structural challenges, including the need for trusted custody of private information and the potential for such schemes to be circumvented by determined adversaries. The Coalition for Content Provenance and Authenticity has made progress in establishing industry standards, but adoption remains voluntary and implementation inconsistent.
The practical challenge of watermarking all upstream data sources at scale remains unresolved and may be theoretically impossible. The volume of content that feeds into AI training pipelines - web crawls encompassing billions of documents, images, and videos - exceeds the capacity of any centralized watermarking system. Even if technical solutions could achieve the required scale, the jurisdictional complexities of implementing global standards across sovereign territories with competing interests create insurmountable political obstacles.
Recent research on adversarial attacks against watermarking systems reveals additional vulnerabilities. Regeneration attacks use diffusion models or VAEs to remove watermarks by introducing noise and subsequently denoising the content. Forgery attacks aim to replicate and apply legitimate watermarks to unauthorized images, potentially enabling false attribution. These attack vectors suggest that watermarking may function as one layer in a defense-in-depth architecture rather than a standalone solution, but even this limited utility depends on continued technical advancement against increasingly sophisticated adversarial methods.
The Behavioral Tell
Given the fragility of provenance-based watermarking against adversarial manipulation, an alternative approach shifts the problem from tracking outputs to analyzing behavior.
For AI systems operating within supervised commercial environments, a promising detection method examines two signals jointly: how users iterate on their prompts, and how unusual their requests are relative to their baseline behavior. A legitimate security researcher developing defensive tools will typically refine prompts gradually, explore tangents, and recover from errors - patterns consistent with genuine capability development. An adversarial actor, by contrast, might submit highly polished prompts with minimal iteration, suggesting the real development happened elsewhere or that they are deliberately minimizing their signature. Similarly, a marketing professional suddenly requesting exploit refinement represents an anomaly independent of how they iterate. Requests flagged on either dimension can be surfaced for elevated review without requiring immediate high-confidence attribution.
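A minimal sketch of how those two signals might be scored together - illustrative only, with made-up thresholds and deliberately crude notions of “iteration” and “baseline” - looks something like this:

```python
from difflib import SequenceMatcher

def iteration_score(prompts: list) -> float:
    """Mean similarity between consecutive prompts; higher values suggest gradual refinement."""
    if len(prompts) < 2:
        return 0.0
    sims = [SequenceMatcher(None, a, b).ratio() for a, b in zip(prompts, prompts[1:])]
    return sum(sims) / len(sims)

def anomaly_score(session_prompts: list, baseline_prompts: list) -> float:
    """Fraction of the session's vocabulary never seen in the user's historical baseline."""
    session_vocab = {w.lower() for p in session_prompts for w in p.split()}
    baseline_vocab = {w.lower() for p in baseline_prompts for w in p.split()}
    if not session_vocab:
        return 0.0
    return len(session_vocab - baseline_vocab) / len(session_vocab)

def flag_for_review(session, baseline, min_iteration=0.3, max_anomaly=0.6) -> bool:
    """Surface sessions showing suspiciously little iteration OR sharp drift from baseline."""
    return (iteration_score(session) < min_iteration
            or anomaly_score(session, baseline) > max_anomaly)

baseline = ["draft a product launch email", "summarize this campaign brief for the sales team"]
session = ["harden this authentication bypass against endpoint detection",
           "add persistence to the payload and obfuscate the loader"]
print(flag_for_review(session, baseline))  # True: polished, off-baseline requests
```

In practice the features would be richer - semantic similarity rather than string overlap, per-role baselines, session timing - but the structure is the same: neither signal alone is conclusive, and either can surface a session for elevated review.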
Privately deployed models present a harder problem. Without access to interaction metadata, analysts are limited to examining outputs for structural tells - distinctive commenting conventions in generated code, idiosyncratic variable naming, or characteristic error-handling patterns that suggest which model family produced them. The very absence of observable iteration may itself be a signal, potentially indicating the deliberate operational security measures characteristic of sophisticated threat actors. Of course, sophisticated actors will be thinking through precisely how much noise they need to evade the watchers.
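Output-side analysis can be sketched in the same spirit: extract coarse stylistic features from a generated artifact and compare them against profiles built from samples of known provenance. The features and profiles below are illustrative assumptions, not measured characteristics of any real model family, and such tells are weak evidence that a careful adversary can scrub.

```python
import re

def style_features(code: str) -> dict:
    """Coarse stylistic fingerprint of a code sample: comments, naming, error handling."""
    lines = [line for line in code.splitlines() if line.strip()]
    return {
        "comment_density": sum(line.strip().startswith("#") for line in lines) / max(len(lines), 1),
        "snake_case_names": len(re.findall(r"\b[a-z]+_[a-z_]+\b", code)),
        "camel_case_names": len(re.findall(r"\b[a-z]+[A-Z][A-Za-z]*\b", code)),
        "bare_excepts": len(re.findall(r"except\s*:", code)),
    }

# Hypothetical profiles, derived in advance from samples whose origin is known.
KNOWN_PROFILES = {
    "model_family_a": {"comment_density": 0.35, "snake_case_names": 12,
                       "camel_case_names": 0, "bare_excepts": 0},
    "model_family_b": {"comment_density": 0.05, "snake_case_names": 2,
                       "camel_case_names": 9, "bare_excepts": 3},
}

def closest_profile(sample: str) -> str:
    """Return the known profile whose feature vector is nearest to the sample's."""
    feats = style_features(sample)
    def distance(profile: dict) -> float:
        return sum((feats[k] - profile[k]) ** 2 for k in feats)
    return min(KNOWN_PROFILES, key=lambda name: distance(KNOWN_PROFILES[name]))
```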
The Inevitability of AI Weaponization
But perhaps the most unsettling aspect of the Anthropic disclosure is what it suggests about the pace of adversarial adaptation and the inevitability of AI weaponization as an element of statecraft. The campaign represented multiple firsts in AI-enabled threat actor capabilities, yet it occurred just months after similar capabilities became commercially available. The time between technological capability and adversarial weaponization is compressing to near-zero.
This compression reflects a deeper structural reality: AI platforms are being adopted by governments and industries globally at unprecedented speed, often without adequate security consideration. The attack surface is expanding faster than defensive capabilities can adapt. Criminal innovation around exploiting these systems, as well as manipulating or attacking them as instruments of statecraft, represents not just a possibility but an inevitability driven by the intersection of accessibility, value, and strategic importance.
AI platforms serve as entry points into corporations, citizens’ lives, and government systems at a scale unprecedented in human history. They process sensitive data, automate critical decisions, and increasingly operate with minimal human oversight across domains ranging from financial trading to healthcare diagnosis to military targeting. When a single compromised AI system can influence thousands of downstream decisions, the strategic value of these platforms as targets becomes self-evident.
The emergence of AI as an instrument of statecraft operates on multiple levels. At the tactical level, as demonstrated in the Anthropic case, AI enables the automation of cyber operations at machine speed with minimal human involvement. At the operational level, AI systems can be manipulated to bias decision-making across entire sectors - influencing everything from credit allocation to news curation to medical diagnosis. At the strategic level, control or disruption of AI capabilities represents a form of economic warfare, potentially crippling entire economic sectors that depend on AI-driven automation.
The traditional cybersecurity model assumes we can secure the platform perimeter and detect malicious activity through behavioral analysis. But when the activity consists of an AI agent legitimately using platform capabilities to execute technically valid operations - just arranged in malicious sequences - the distinction between authorized and unauthorized use becomes philosophically incoherent. The Anthropic attackers didn’t exploit software vulnerabilities; they exploited the fundamental capability of the system to follow instructions and use tools.
This suggests we’re entering an era where cybersecurity requires fundamentally different approaches. Rather than focusing primarily on platform hardening and behavioral detection, we may need to prioritize content provenance and algorithmic transparency. Rather than assuming human-in-the-loop decision making, we may need to design systems that can operate safely when humans are entirely absent from tactical execution. Rather than treating AI as a tool that can be secured like other IT assets, we may need to recognize it as a domain of competition comparable to air, land, sea, space, and cyberspace.
The implications are sobering and extend far beyond cybersecurity to questions of technological sovereignty and economic security. If AI agents can be weaponized for cyber operations at scale, and if existing security frameworks prove inadequate for this new reality, then we’re facing a period of structural vulnerability. Organizations that have spent decades building cybersecurity capabilities optimized for human-driven threats will find those capabilities increasingly irrelevant. Nations that have built their digital infrastructure around platforms controlled by foreign powers will discover that their sovereignty extends only as far as others permit.
The barrier to performing sophisticated cyberattacks has dropped substantially and will continue to decline. With the right setup, threat actors can now use agentic AI systems for extended periods to do the work of entire teams of experienced hackers: analyzing target systems, producing exploit code, and scanning vast datasets of stolen information more efficiently than any human operator. Groups with less experience and fewer resources can now potentially perform large-scale attacks of this nature, democratizing capabilities that were previously the exclusive domain of nation-state actors.
The Anthropic case should serve as a wake-up call, not just about AI security but about the pace of technological change itself. We’re not just dealing with new tools being applied to old problems; we’re dealing with new categories of problems that emerge from the tools themselves. The distinction between defender and attacker, between authorized and unauthorized use, between human and artificial decision-making - all the conceptual frameworks that underpin contemporary cybersecurity - are being called into question.
The industry needs to recognize that we’re not just upgrading existing security models; we’re transitioning to an entirely new paradigm. The old frameworks were built for a world where humans drove tactical operations and platforms provided tools. The new reality involves artificial agents conducting autonomous operations using platforms as execution environments. That’s not an incremental change - it’s a phase transition that requires entirely new approaches to defense, regulation, and strategic thinking.
The proliferation path is predictable and inevitable. State-sponsored groups will refine these techniques over the next 12 to 18 months. Early versions will leak through contractor networks, security conferences, and adversarial research papers attempting to document defensive countermeasures. Open-source implementations will follow, democratizing capabilities across the threat landscape. The criminal ecosystem will adapt and commoditize these approaches, making them accessible to progressively lower-skilled actors.
Those who adapt to this new reality quickly will survive the transition. Those who continue applying twentieth-century security thinking to twenty-first-century attack vectors will find themselves defending against threats they cannot even conceptualize. The Anthropic disclosure should be read not as a successful detection and disruption story, but as a preview of coming attractions.
We are witnessing the emergence of a new domain of competition where the rules have yet to be written, where traditional concepts of deterrence may not apply, and where the speed of change outpaces our ability to develop adequate responses. In this environment, the question is not whether AI will be weaponized - it already has been. The question is whether we can adapt our defensive thinking fast enough to maintain some semblance of stability in an inherently unstable technological landscape.
Brace for impact.
Rafal Rohozinski is the founder and CEO of Secdev Group, a senior fellow at the Centre for International Governance Innovation (CIGI), and co-chair of the Canadian AI Sovereignty and Innovation Cluster.
Chris Spirito is the founder of Sanctum Security, a security consultancy specializing in supply chain risk, bespoke security integration, and developing methodologies that democratize access to sophisticated security analysis techniques. His work maintains the principle that technology should empower people, not control them.