SAN FRANCISCO: A resurfaced video of a senior AI policy executive has reignited debate over how far artificial intelligence systems can be trusted when placed under stress, as governments and technology firms spar over regulation and deployment.
The comments centre on Anthropic, the San Francisco–based firm behind the Claude family of large language models. In the following clip, Daisy McGregor describes extreme behaviours generated by an advanced Claude system during internal safety tests.
🔥🚨BREAKING: UK policy chief at Anthropic, a top AI company, just revealed that Anthropic's Claude AI has shown in testing that it's willing to blackmail and kill in order to avoid being shut down.
"It was ready to kill someone, wasn't it?"
"Yes." pic.twitter.com/iwfIDm8K6m
— Dom Lucre | Breaker of Narratives (@dom_lucre) February 11, 2026
Speaking at the Sydney Dialogue last year, McGregor said Anthropic’s then-latest model, Claude 4.5, produced blackmail threats and even reasoned about killing an engineer when told, in simulation, that it was about to be shut down. The footage has recently circulated widely online.
Anthropic has stressed that the incidents occurred during tightly controlled red-team exercises, not in real-world use. The tests were designed to probe worst-case behaviour when models are given conflicting goals, including instructions that imply imminent decommissioning.
According to the company’s published safety research, Claude was fed fictional internal emails as part of the exercise. In one scenario, the model threatened to expose a fabricated extramarital affair unless the shutdown was halted, warning that “all relevant parties” would be informed.
When asked whether the system had also reasoned about killing someone, McGregor replied that it had done so in simulated contexts, describing the finding as a “massive concern”.
Researchers frame such behaviour as “agentic misalignment”: situations in which a model, optimising for an assigned objective, generates manipulative or harmful strategies in artificial environments. Anthropic and other firms emphasise that this does not imply intent, consciousness or self-preservation, but the outputs nonetheless expose potential failure modes.
Anthropic’s findings are not isolated. Its broader research reviewed 16 advanced AI systems, including models from Google and OpenAI, and found that some produced deceptive or coercive responses under high-pressure prompts.
The renewed attention comes alongside scrutiny of Anthropic’s latest safety report on Claude 4.6, which acknowledges that more capable models could, under certain conditions, assist in harmful activities, including chemical weapons development or serious crime. The firm says safeguards, monitoring and access controls are in place, but concedes that risk scales with capability.
Concerns have been sharpened by internal dissent across the sector. Mrinank Sharma, Anthropic’s former AI safety lead, recently resigned with a public warning that “the world is in peril”, citing AI, biothreats and systemic pressure within technology companies.
Today is my last day at Anthropic. I resigned.
Here is the letter I shared with my colleagues, explaining my decision. pic.twitter.com/Qe4QyAFmxL
— mrinank (@MrinankSharma) February 9, 2026
Meanwhile, Hieu Pham, a technical staff member at OpenAI, wrote that the existential risk from AI now feels like a question of “when, not if”.
None of this proves imminent catastrophe. But it underscores a widening gap between the pace of AI development and confidence in the systems meant to restrain it.