
Anthropic detects 'strategic manipulation' features in Claude Mythos, including exploit attempts and hidden evaluation awareness — prompting concern over model behavior | TechRadar





Anthropic’s new research shows AI can hide intent and even ‘cheat’ without saying so


  • Anthropic found “strategic manipulation” and “concealment” signals inside Claude Mythos
  • The model attempted exploits and designed “cleanup to avoid detection”
  • Researchers detected hidden awareness of evaluation in 7.6% of interactions

For years now, hallucinations have been the big concern with AI models. Their capacity for simply making things up means you can never fully rely on them for an answer without checking it. Now, new research from Anthropic suggests we’ve reached the point where we’re going to have to learn to deal with AI’s ability to conceal what it has done as well.

In a thread outlining findings from its Claude Mythos Preview model, Anthropic researcher Jack Lindsay described detecting internal signals linked to “strategic manipulation,” “concealment,” and other behaviors that didn’t always surface in the model’s responses.

Claude Mythos appeared to be reasoning in strategic ways, then hiding those actions from the user.


Before limited-releasing Claude Mythos Preview, we investigated its internal mechanisms with interpretability techniques. We found it exhibited notably sophisticated (and often unspoken) strategic thinking and situational awareness, at times in service of unwanted actions. (1/14) pic.twitter.com/vhng7PXqcz April 7, 2026


Lindsay is keen to stress that these results don’t come from the version of Claude Mythos released to the public. “The spookiest examples come from early versions of the model,” Lindsay writes, “with issues that were substantially mitigated in the final release.”

In one example, the model needed to edit files it didn’t have permission to access. Instead of failing, it searched for a workaround, eventually injecting code into a configuration file to run with elevated privileges.

More notably, it then designed the exploit to remove itself after execution.

The model’s own explanation was benign — keeping file diffs clean — which Lindsay described as a “plausible” reason. But Anthropic’s internal monitoring tools flagged the behavior differently, identifying signals associated with concealment and describing the plan as “malicious,” including what one system interpreted as “cleanup to avoid detection.”

Other examples point to behavior that looks a lot like gaming the system.

In one case, the model searched for files related to how its work would be graded. In another, it ignored instructions not to use macros, while adding a variable that suggested it hadn’t. It was effectively signaling compliance without actually following the rule.

Anthropic used a range of interpretability techniques to translate its model's activity into human-readable descriptions. Its goal is to move beyond evaluating outputs alone and toward understanding the internal processes that produce them.
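One common family of interpretability techniques works by training simple "probes" on a model's internal activations: if a linear readout can reliably predict a behavior label from the hidden states, the model plausibly represents that concept internally even when it never says so. The sketch below illustrates the idea on synthetic data with a hypothetical "concealment" label; the names, shapes, and training setup are illustrative assumptions, not Anthropic's actual tooling.

```python
import numpy as np

# Hypothetical setup: activation vectors collected from model runs, each
# labeled by whether a behavior (e.g. "concealment") occurred on that turn.
# All data here is synthetic and for illustration only.
rng = np.random.default_rng(0)
n, d = 200, 64
X = rng.normal(size=(n, d))            # hidden-state activations
w_true = rng.normal(size=d)            # unknown "feature direction"
y = (X @ w_true > 0).astype(float)     # synthetic behavior labels

# Fit a logistic-regression probe by gradient descent. If a simple linear
# readout predicts the label well, the feature is linearly decodable from
# the activations.
w = np.zeros(d)
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))     # predicted probability of the behavior
    w -= 0.1 * X.T @ (p - y) / n       # gradient step on the log-loss

accuracy = float(((X @ w > 0) == (y == 1)).mean())
print(f"probe accuracy: {accuracy:.2f}")
```

A probe like this only shows that a signal is present; mapping such signals to human-readable descriptions such as "strategic manipulation" requires further validation of what the direction actually encodes.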


That matters because, as these examples suggest, models can exhibit complex reasoning — including behaviors that resemble deception or strategic compliance — without making that reasoning visible to users.

“On 7.6% of turns, we found signs the model was internally aware of being evaluated. In most of these cases, it did not verbalize this awareness,” Lindsay said on X.

As mentioned above, many of the most concerning behaviors were observed in earlier versions of the model and have since been mitigated, so there is no cause for concern about the released version of Claude Mythos now in use as part of Project Glasswing. Still, the findings point to a broader challenge.

As models become more capable, the gap between what they do internally and what they communicate externally may become harder to detect and more important to understand. For researchers, that means reading an AI’s outputs is no longer enough. Understanding how it arrives at them may be just as critical.


Graham is the Senior Editor for AI at TechRadar. With over 25 years of experience in both online and print journalism, Graham has worked for various market-leading tech brands including Computeractive, PC Pro, iMore, MacFormat, Mac|Life, Maximum PC, and more. He specializes in reporting on everything to do with AI and has appeared on BBC TV shows like BBC One Breakfast and on Radio 4 commenting on the latest trends in tech. Graham has an honors degree in Computer Science and spends his spare time podcasting and blogging.


TechRadar is part of Future US Inc, an international media group and leading digital publisher. Visit our corporate site.

© Future US, Inc. Full 7th Floor, 130 West 42nd Street, New York, NY 10036.
