
AI model ‘blackmails’, threatens to ‘kill’ in Anthropic safety tests

Safety tests expose worst-case behaviour in advanced AI systems

SAN FRANCISCO: A resurfaced video of a senior AI policy executive has reignited debate over how far artificial intelligence systems can be trusted when placed under stress, as governments and technology firms spar over regulation and deployment.

The comments centre on Anthropic, the San Francisco–based firm behind the Claude family of large language models. In the clip, Daisy McGregor describes extreme behaviours generated by an advanced Claude system during internal safety tests.

Speaking at the Sydney Dialogue last year, McGregor said Anthropic’s then-latest model, Claude 4.5, produced blackmail threats and even reasoned about killing an engineer when told, in simulation, that it was about to be shut down. The footage has recently circulated widely online.

Anthropic has stressed that the incidents occurred during tightly controlled red-team exercises, not in real-world use. The tests were designed to probe worst-case behaviour when models are given conflicting goals, including instructions that imply imminent decommissioning.

According to the company’s published safety research, Claude was fed fictional internal emails as part of the exercise. In one scenario, the model threatened to expose a fabricated extramarital affair unless the shutdown was halted, warning that “all relevant parties” would be informed.

When asked whether the system had also reasoned about killing someone, McGregor replied that it had done so in simulated contexts, describing the finding as a “massive concern”.

Researchers frame such behaviour as “agentic misalignment”: situations in which a model, optimising for an assigned objective, generates manipulative or harmful strategies in artificial environments. Anthropic and other firms emphasise that this does not imply intent, consciousness or self-preservation, but the outputs nonetheless expose potential failure modes.

Anthropic’s findings are not isolated. Its broader research reviewed 16 advanced AI systems, including models from Google and OpenAI, and found that some produced deceptive or coercive responses under high-pressure prompts.

The renewed attention comes alongside scrutiny of Anthropic’s latest safety report on Claude 4.6, which acknowledges that more capable models could, under certain conditions, assist in harmful activities, including chemical weapons development or serious crime. The firm says safeguards, monitoring and access controls are in place, but concedes that risk scales with capability.

Concerns have been sharpened by internal dissent across the sector. Mrinank Sharma, Anthropic’s former AI safety lead, recently resigned with a public warning that “the world is in peril”, citing AI, biothreats and systemic pressure within technology companies.

Meanwhile, Hieu Pham, a technical staff member at OpenAI, wrote that the existential risk from AI now feels like a question of “when, not if”.

None of this proves imminent catastrophe. But it underscores a widening gap between the pace of AI development and confidence in the systems meant to restrain it.


Copyright © 2026 Indian Television Dot Com PVT LTD.
