AIs like ChatGPT fall apart in classic 'Stroop' psychological test – and that could stand in the way of achieving artificial general intelligence | TechRadar

A new study tasked AIs with tackling the 'Stroop' test

GPT and Claude performed very poorly compared to humans

There are nuances here, but broadly, the researchers argue that improving this side of AIs is crucial for achieving artificial general intelligence

A newly published study has highlighted a limitation of big-name AI models such as ChatGPT. It has stirred some controversy because the primary research uses now-outdated versions of those models, but there are nuances here, and that doesn't make the findings irrelevant.

I'll go into that more shortly, but first, let's look at the study itself, which was highlighted on Reddit ("New study reveals top AI models completely fail the classic 'Stroop' psychological attention test") and published by Oxford University Press in the journal PNAS Nexus.

The research consists of testing the so-called 'Stroop effect' with GPT-4o and Claude 3.5 Sonnet. As noted, these aren't the cutting-edge versions of those AIs (Large Language Models, or LLMs) – but they were at the time the initial study was carried out.

The Stroop effect refers to the phenomenon whereby the human brain gets confused when asked to name the ink color of a written word, when that word is itself the name of a different (incongruent) color. So, if the word 'red' is written in blue ink, that'll cause a slower response – or possibly a wrong response, where the viewer will accidentally say "red" rather than the actual color of the ink, which is blue.

This is because the brain is trying to juggle two different tasks – reading comprehension and color recognition – and so cognitive interference arises. Overriding the compulsion to read the word and say the color instead requires "executive control of attention," and this is what the authors were testing in the AI models. Both color-naming and word-reading were tested in shorter and longer lists of words (5, 10, 20, and 40 words).
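
To make the setup concrete, here is a minimal Python sketch of how Stroop-style trial lists might be built and scored. It is an illustration only, not the researchers' code: the article doesn't describe how stimuli were presented to the models, so the (word, ink) tuple representation, the four-color palette, and the function names `make_trials` and `accuracy` are all assumptions. The neutral condition is left out because the article doesn't say how it was constructed.

```python
import random

# Assumed palette; the article does not list the colors used in the study.
COLORS = ["red", "blue", "green", "yellow"]

def make_trials(n, condition, rng):
    """Build n (word, ink) pairs for the given condition."""
    trials = []
    for _ in range(n):
        ink = rng.choice(COLORS)
        if condition == "congruent":
            word = ink  # word matches the ink color
        elif condition == "incongruent":
            # word names a different color than the ink
            word = rng.choice([c for c in COLORS if c != ink])
        else:
            raise ValueError(f"unknown condition: {condition}")
        trials.append((word, ink))
    return trials

def accuracy(trials, responses, task="color"):
    """Fraction of correct responses.

    Color-naming is scored against the ink; word-reading against the word.
    """
    targets = [ink if task == "color" else word for word, ink in trials]
    correct = sum(r == t for r, t in zip(responses, targets))
    return correct / len(trials)

# List lengths reported in the article.
LIST_LENGTHS = (5, 10, 20, 40)
```

A responder that always reads the word scores zero on incongruent color-naming in this sketch, which is the interference error the test is designed to expose.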

The study observes: "Like humans, both LLMs [GPT and Claude] showed relatively high accuracy on the word-reading task and performed worse in the incongruent condition [where the word doesn't match the color] than in the congruent and neutral conditions for the color-naming task."

For color naming, humans maintain around 95% accuracy even in very long tests (up to an hour), but LLMs' accuracy declined very swiftly with longer word lists under the incongruent condition (mismatched color and word name). GPT-4o was 91% accurate in a five-word test, but dropped off to 57% with 10 words, and fell away completely to 22% with 20 words (and was only 15% accurate at 40 words).

Claude 3.5 Sonnet did better, staying 76% accurate at 20 words, but again fell hopelessly to 24% in the longest test of 40 words.

The authors conclude: "The significant degradation pattern of the two LLMs suggests fundamental limitations compared with human attention."

Analysis: another necessary step on the path to AGI?

If you've scanned through the Reddit thread, you'll doubtless have noticed that, as mentioned at the outset, commenters have fired a lot of flak at this study for its use of outdated GPT and Claude models.

Indeed, these older LLMs are called "state of the art" at one point by the authors – and of course, as already noted, they were cutting-edge when the main study was conducted. Still, this is unfortunate phrasing that should've been updated and tweaked now that the paper has just been published (after peer review and so forth).

However, the researchers did conduct tests on GPT-5, Claude Opus 4.1, and Gemini 2.5 Pro in September 2025, although this is somewhat buried in the paper. That more recent testing found that these models offered only "slight" improvements on their predecessors, and that they still exhibited "ongoing executive attention deficiencies, consistent with our comprehensive analysis of earlier transformer models" (as did Gemini 2.5 Pro, which was a new introduction here).

Granted, a smaller sample size was used, but the researchers still argue that overall, their study reflects a fundamental limitation which is "inherent to the architectural constraints of transformer-based LLMs".

The authors note that a caveat is that GPT-5 in 'Thinking' mode can write and then run code to ensure it performs the Stroop test flawlessly – and similar functionality can be utilized by other LLMs – but this is essentially the AI (cleverly) fudging around its inadequacies. It isn't changing the way it works or reasons more broadly, of course.
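
To see why offloading the task to code sidesteps the problem, consider this hypothetical Python sketch. It assumes the stimuli are available as structured (word, ink) pairs, which the article doesn't confirm, and the function name `name_ink_colors` is made up for illustration. A program simply reads the ink field and never processes the word, so there is nothing to interfere and list length makes no difference.

```python
def name_ink_colors(trials):
    """Return the ink color of each (word, ink) trial, ignoring the word."""
    return [ink for _word, ink in trials]

# 40 trials, matching the longest list length reported in the article.
trials = [
    ("red", "blue"),
    ("green", "yellow"),
    ("blue", "red"),
    ("yellow", "green"),
] * 10

answers = name_ink_colors(trials)
score = sum(a == ink for a, (_word, ink) in zip(answers, trials)) / len(trials)
print(score)  # 1.0
```

The perfect score here says nothing about attention: the program never faces the conflict between reading and color-naming in the first place.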

The researchers note that transformer architecture innovations for LLMs are focused on enhancing memory capabilities, an approach that fails to address the "core limitations of attention mechanisms, specifically the need for sophisticated alerting, orienting, and executive control networks to enable cognitive flexibility."

The ultimate aim is effective goal-directed behavior, and the study observes: "Future [LLM] development might benefit from implementing more sophisticated executive control systems that can handle decision conflicts through structured, goal-directed processing rather than relying solely on enhanced memory capabilities."

The authors argue that "incorporating executive control mechanisms akin to those in biological attention is crucial for achieving artificial general intelligence [AGI]."

Darren is a freelancer writing news and features for TechRadar (and occasionally T3) across a broad range of computing topics including CPUs, GPUs, various other hardware, VPNs, antivirus and more. He has written about tech for the best part of three decades, and writes books in his spare time (his debut novel - 'I Know What You Did Last Supper' - was published by Hachette UK in 2013).

TechRadar is part of Future US Inc, an international media group and leading digital publisher.
