LLMs - The Illusion of Thought
A new Apple study suggests the "thinking" of artificial intelligence may be just an illusion
Apple recently published a study that sheds light on a fundamental question: do advanced reasoning language models really think—or do they just seem to think?
The study, titled The Illusion of Thinking, analyzes the performance of next-generation models known as LRMs (Large Reasoning Models) and reveals surprising limits to their reasoning capabilities.
What are LRMs?
Unlike traditional language models (LLMs), which generate text based on statistical patterns, LRMs are designed to perform more complex tasks through structured chains of reasoning. They produce long stretches of "thought" before arriving at an answer—something many consider a possible step toward Artificial General Intelligence (AGI).
Models like Anthropic's Claude 3.7 Sonnet (in its extended thinking mode) and DeepSeek's DeepSeek-R1 are examples of this new approach.
But is this "thinking ability" real?
Apple's Proposal
To investigate this question, Apple built a test environment different from traditional benchmarks, which are often contaminated by data the models have already seen during training.
Instead, the team used puzzles with clear rules and adjustable complexity, such as:
- Tower of Hanoi
- River Crossing
- Blocks World
This environment made it possible to evaluate not only the final answers but also the reasoning steps generated by the models.
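To make "clear rules and adjustable complexity" concrete, here is a minimal sketch of such an environment for the Tower of Hanoi. It is an illustration, not Apple's actual harness: a replay function checks every intermediate move against the rules, and difficulty is controlled by a single parameter, the number of disks n, since the shortest solution grows as 2^n - 1 moves.

```python
# Sketch of a rule-checked puzzle environment for Tower of Hanoi.
# Illustration only -- not Apple's actual evaluation harness.

def replay_hanoi(n, moves):
    """Replay a proposed move list for n disks, reporting the first
    rule violation: move one disk at a time, never place a larger
    disk on a smaller one."""
    pegs = [list(range(n, 0, -1)), [], []]  # peg 0 starts with disks n..1
    for step, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return f"step {step}: peg {src} is empty"
        disk = pegs[src].pop()
        if pegs[dst] and pegs[dst][-1] < disk:
            return f"step {step}: disk {disk} placed on a smaller disk"
        pegs[dst].append(disk)
    return "solved" if pegs[2] == list(range(n, 0, -1)) else "not solved"

# Difficulty is one dial: n disks need 2**n - 1 moves, so complexity
# grows exponentially while the rules stay identical.
for n in (3, 7, 10, 15):
    print(f"{n} disks -> optimal solution length {2**n - 1}")

print(replay_hanoi(2, [(0, 1), (0, 2), (1, 2)]))  # -> solved
```

Because every intermediate move is replayed against the rules, an evaluator built this way can grade the chain of moves a model produces, not just its final answer.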
What did the research find?
The results indicate important limitations in LRMs. Apple identified three distinct phases of performance as the complexity of the problems increases:
- Low complexity: standard models (without "explicit thinking") are more efficient and accurate.
- Intermediate complexity: LRMs have an advantage by exploring solutions in greater depth.
- High complexity: all models, with or without explicit reasoning, collapse. Accuracy drops to zero and, counterintuitively, LRMs begin to "think less" as the challenge increases, spending fewer reasoning tokens even when budget remains.
Another relevant finding was the phenomenon of overthinking: in simple problems, LRMs often find the correct answer early but continue exploring incorrect paths, wasting time and computational resources. In more complex problems, they fail to reach any valid solution.
Even when Apple provided the exact algorithm to solve a task (such as in the Tower of Hanoi), the models had difficulty executing it correctly, highlighting limitations in following step-by-step logical instructions.
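For reference, the exact algorithm in the Tower of Hanoi case is the classic recursive procedure, small enough to state in a few lines. The study's prompt wording isn't reproduced here, so the Python version below is a sketch of the standard textbook algorithm rather than a quote from the paper; the point stands either way: models struggled to execute even a recipe this short.

```python
# The standard recursive Tower of Hanoi algorithm. The study reports
# that even with an explicit recipe like this available, models often
# failed to carry out the steps correctly.

def solve_hanoi(n, src="A", aux="B", dst="C"):
    """Print the optimal move sequence for n disks (2**n - 1 moves)."""
    if n == 0:
        return
    solve_hanoi(n - 1, src, dst, aux)        # park the top n-1 disks on aux
    print(f"move disk {n}: {src} -> {dst}")  # move the largest disk
    solve_hanoi(n - 1, aux, src, dst)        # stack the n-1 disks back on top

solve_hanoi(3)  # 7 moves; at n = 15 the same recipe needs 32,767 moves
```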
Conclusion
Despite the impressive advances of LRMs, Apple's research suggests we are still far from achieving reliable reasoning in AI systems. The "thinking" of these machines remains fragile, inconsistent, and highly dependent on problem complexity.
The study challenges the excessive enthusiasm around the current capabilities of the models and reinforces the need for more rigorous and realistic evaluations.
If you believed that AI was already able to think like a human, maybe it's time to rethink that idea.
References
- "The Illusion of Thinking": https://machinelearning.apple.com/research/illusion-of-thinking