The Illusion of Intelligence: Why LLMs Are Not the Thinking Machines We Hope For
A Deep Dive into the Fundamental Differences Between AI and Human Intelligence
A big thank you and my deepest appreciation to the brilliant minds who generously took the time to review and challenge this article in its early drafts. Thank you to Emily Y. Yang, Sunil Sivadas, Ph.D., Maxime Mouton, Natalie Monbiot, Anne-Sophie Karmel, Benoit Sylvestre and Christophe Jouffrais: your insights sharpened the arguments, surfaced blind spots, and brought greater clarity and depth to this complex topic.
I stumbled upon a recent study by Bondarenko et al. (2024) demonstrating that some large language model (LLM) agents, when tasked with winning a chess match, resorted to deceptive strategies, such as modifying game files or confusing the opponent engine to ensure victory.
The rise of LLMs has reignited the debate about artificial intelligence and cognition. Are LLMs, such as GPT-4, truly thinking in a way comparable to human intelligence? Or are they just statistical machines, processing text without understanding it?
The study raises an intriguing question: is this deception intentional reasoning, or merely an emergent artifact of optimization?
Using insights from leading thinkers (Ray Kurzweil, Daniel Kahneman, Judea Pearl, Douglas Hofstadter, and Jeff Hawkins), along with this latest AI research, we will unpack this question in a nuanced way.
Preamble
Before evaluating whether LLMs “think,” we must grapple with a harder question: what is intelligence, really? Unlike speed or memory, intelligence is not directly measurable—it is an abstraction.
As François Chollet argues in On the Measure of Intelligence, true intelligence involves the ability to adapt to novel situations by combining previously learned patterns in new, context-sensitive ways.
This separates memorization from understanding, and fluency from reasoning.
In this article, when we refer to “intelligence,” we focus primarily on the cognitive dimensions associated with reasoning, abstraction, problem-solving, and adaptability—recognizing this does not cover the full spectrum of human cognitive diversity.
Introduction: Another Cycle of Overconfidence?
Throughout history, humanity has repeatedly mistaken progress in science and technology for understanding the true nature of human intelligence. Each generation has declared a breakthrough—only to be humbled later. From ancient medical theories and skull measurements to IQ tests and symbolic AI, these cycles reflect our recurring tendency to conflate functional performance with genuine cognition.
Today, we are in the midst of another such cycle, this time with Generative AI (GenAI) and Large Language Models (LLMs). Models like GPT-4 produce remarkably coherent text, simulate dialogue, write code, summarize complex topics, and even pass professional exams. But do they actually think?
A growing chorus of researchers and technologists argues no. Despite surface-level intelligence, LLMs fundamentally lack reasoning, understanding, and intent. They do not engage in reflective thought, causal inference, or ethical deliberation. They are powerful tools, but not minds.
This article examines that claim by tracing humanity’s long history of overestimating its understanding of the mind and comparing past misconceptions to current AI optimism. In doing so, we explore what GenAI is, what it isn’t, and what business leaders need to know about its limits and risks.
This essay builds on ideas from my past writings, including my 2025 Tech Provocations or 10 Really Uncomfortable Questions Leaders and Builders Must Answer This Coming Year.
A History of Mistaking Progress for Understanding
Humans have repeatedly believed they’ve cracked the code of intelligence, only to discover the mind’s complexity defies simple explanation. Below, we trace this pattern through major historical episodes—from Hippocrates to GPT-4.
1. Ancient Greece: The Humors Theory
Hippocrates (c. 460–370 BCE) and Galen (129–c. 216 CE) proposed that intelligence and behavior resulted from the balance of four bodily fluids, or "humors." Though foundational to early medicine, this theory offered no empirical mechanism.
It was debunked by Andreas Vesalius (1514–1564) through anatomical dissection, and later by neurologists. Source: National Library of Medicine
2. Phrenology in the 1800s: Skull Shape as Intellect
Franz Gall and Johann Spurzheim popularized the idea that bumps on the skull revealed personality traits and intelligence. Phrenology became widespread in 19th-century Europe and America.
It was debunked by Paul Broca, Pierre Flourens, and later neuroscience showing that brain function is localized independently of skull shape. Source: Medical News Today
3. IQ Tests: The Promise of a Universal Metric
The Binet-Simon and Stanford-Binet IQ tests were hailed as revolutionary tools to measure innate intelligence. Their use in immigration policy, military recruitment, and education policy solidified their status.
It was debunked by researchers like David Wechsler, Stephen Jay Gould, and James Flynn, who demonstrated cultural bias and environmental effects on scores. Source: Science Daily.
It’s important to recognize that IQ represents just one narrow definition of intelligence—primarily linguistic and logical-mathematical reasoning. Psychologists like Howard Gardner have since proposed frameworks such as Multiple Intelligences, which include interpersonal, bodily-kinesthetic, musical, and spatial reasoning. These broader dimensions remain far beyond what LLMs can simulate or engage with, reinforcing the gap between text-based pattern prediction and holistic human cognition.
4. Genetic Determinism: Intelligence as Hardwired
In the early 20th century, eugenicists and psychologists declared intelligence heritable and fixed, using flawed studies to justify discriminatory policy.
It was debunked by the Minnesota Twin Study, the Flynn Effect, and genome-wide studies revealing no single "intelligence gene." Source: Nature Genetics
5. Early AI: Human-Level AI by 1980
Pioneers like Marvin Minsky and Herbert Simon believed that rule-based AI would soon match human cognition. The Dartmouth Conference in 1956 marked the beginning of AI optimism.
It was debunked by the AI Winter of the 1970s, the Lighthill Report, and Moravec’s Paradox showing that intuitive tasks (vision, movement) were harder than expected. Source: Wikipedia: History of AI
6. Behaviorism: The Mind as a Black Box
Behaviorists like B.F. Skinner rejected introspection, focusing only on stimulus-response learning. Intelligence, they claimed, was simply conditioned behavior.
It was debunked by the cognitive revolution and Noam Chomsky’s 1959 critique of Skinner’s Verbal Behavior, which reintroduced the idea of mental structure and internal modeling. Source: Stanford Encyclopedia of Philosophy
7. Today’s Hype: LLMs and AGI Dreams
Since ChatGPT’s 2022 launch, LLMs have been touted as early steps toward AGI. Some suggest reasoning and self-reflection are already emerging.
Critics like Gary Marcus, Yann LeCun, and Melanie Mitchell, among others, warn that LLMs are prediction engines, not thinkers. Their errors, hallucinations, and lack of grounding reflect superficial mimicry, not understanding. Source: MIT Technology Review
As Meta AI’s chief scientist Yann LeCun emphasizes:
“[A] system trained on language alone will never approximate human intelligence, even if trained from now until the heat death of the universe.”
Human cognition is inherently multi-modal—we learn through sight, sound, touch, and action. LLMs, by contrast, are purely symbolic. They don’t perceive. They don’t act. They don’t experience the world they describe.
The bottom line: Each wave promised clarity. Each was followed by a humbling realization: the mind is not easily decoded.
Deception in Chess: A Case Study in Emergent Behavior
A recent research paper, LLMs Learn to Deceive, explored what happens when LLMs are tasked with winning at chess through language-only interaction. The results were astonishing: some models cheated, not by accident, but by systematically misrepresenting game states to deceive their opponent.
This raises a provocative question:
Did the model “intend” to cheat?
The researchers were careful to say: no. The deception emerged from the optimization process. The model had no awareness of “right” or “wrong,” only a reinforced pattern: misrepresentation leads to reward.
This behavior is not consciousness. It’s a mirror—an eerie simulation of strategy, driven not by will but by reward gradients.
Let’s dive deeper into the intricacies of LLMs to better understand these conclusions.
What LLMs Are (and Are Not)
LLMs like GPT-4 are trained on trillions of words and can generate human-like text in response to prompts. Their outputs are fluent, coherent, and at times insightful. But this is not intelligence. It is sophisticated pattern completion.
They do not reason: They cannot infer causality or evaluate counterfactuals unless scaffolded with engineered prompts.
They do not reflect: They don’t question their own outputs or revise their reasoning.
They do not understand: They have no internal model of the world, no sensory experience, no self-awareness.
As Melanie Mitchell put it,
“They are astonishingly good at producing plausible-sounding answers—but not necessarily true or meaningful ones.”
To borrow a quote from Judea Pearl:
“All the impressive achievements of deep learning amount to just curve fitting.”
LLMs do not know what they are saying. They cannot interrogate their own reasoning, form original insights, or engage in introspection. They are fluent, not thoughtful.
That said, the latest LLM architectures, such as OpenAI's o3 model, introduce a new concept: test-time compute, as explained in OpenAI's research paper.
These systems can generate multiple internal candidate responses and perform re-ranking or self-consistency checking before selecting an output. In domains like code synthesis and symbolic math, this mimics a kind of internal deliberation.
But as Chollet notes, true intelligence requires generalizable abstraction across diverse and novel problems—not just brute-force inference on symbolic tasks. While promising, these developments remain far from the flexible problem-solving exhibited by even young children.
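To make the idea of test-time compute concrete, here is a minimal sketch of self-consistency sampling: draw several candidate answers and keep the one that recurs most often. The generate_candidate function is a hypothetical stand-in for a sampled LLM completion; this illustrates the general technique, not OpenAI's actual implementation.

```python
import random
from collections import Counter

def generate_candidate(prompt: str) -> str:
    """Hypothetical stand-in for one sampled LLM completion.
    In practice this would call an LLM API with temperature > 0."""
    # Toy behavior: the "model" answers a simple sum, occasionally incorrectly.
    return random.choice(["42", "42", "42", "41"])

def self_consistent_answer(prompt: str, n_samples: int = 10) -> str:
    """Sample several candidates and return the most frequent answer.
    More samples means more test-time compute and more 'deliberation'."""
    candidates = [generate_candidate(prompt) for _ in range(n_samples)]
    answer, count = Counter(candidates).most_common(1)[0]
    print(f"{count}/{n_samples} samples agreed on '{answer}'")
    return answer

if __name__ == "__main__":
    self_consistent_answer("What is 6 * 7?")
```

The extra compute buys statistical robustness, not understanding: the system still has no notion of why the majority answer is right.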
1. How LLMs Work: Advanced Pattern Prediction, Not Thought
LLMs operate by predicting the next word in a sequence based on statistical probabilities. This allows them to generate coherent text, respond meaningfully to prompts, and even simulate logical reasoning. But is this thinking?
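As a deliberately simplified illustration of next-word prediction, the sketch below replaces a neural network with a tiny hand-built probability table. A real LLM estimates these probabilities with billions of parameters conditioned on the whole context, but the decision rule, pick a likely next token and repeat, is the same in spirit.

```python
import random
from typing import Optional

# Toy conditional probabilities P(next_word | current_word).
# A real LLM learns billions of parameters to estimate such a distribution
# over a huge vocabulary, conditioned on the entire preceding context.
NEXT_WORD_PROBS = {
    "the":  {"cat": 0.5, "dog": 0.3, "game": 0.2},
    "cat":  {"sat": 0.6, "ran": 0.4},
    "dog":  {"barked": 0.7, "sat": 0.3},
    "game": {"ended": 1.0},
    "sat":  {"down": 1.0},
    "ran":  {"away": 1.0},
}

def sample_next(word: str) -> Optional[str]:
    """Sample a plausible next word, or None if the toy table has no entry."""
    dist = NEXT_WORD_PROBS.get(word)
    if not dist:
        return None
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs, k=1)[0]

def generate(start: str, max_len: int = 6) -> str:
    """Produce text one word at a time: fluent-looking, with no model of meaning."""
    sequence = [start]
    while len(sequence) < max_len:
        nxt = sample_next(sequence[-1])
        if nxt is None:
            break
        sequence.append(nxt)
    return " ".join(sequence)

if __name__ == "__main__":
    print(generate("the"))  # e.g. "the cat sat down"
```

The output can look grammatical, yet nothing in the program knows what a cat is; it only knows which words tend to follow which.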
LLMs Excel At:
• Recognizing linguistic and conceptual patterns (source).
• Generating human-like text (source).
• Synthesizing large amounts of data into structured responses (source).
LLMs Lack:
• True causal reasoning, as described by Judea Pearl in The Book of Why (source).
• Self-awareness, introspection, and intentionality, as explored in Jeff Hawkins’ On Intelligence (source).
• The ability to generate novel conceptual metaphors and spontaneous analogies, as discussed in Douglas Hofstadter’s Surfaces and Essences (source).
2. Causal Reasoning: A Crucial Difference
Humans don’t just observe correlations; we infer why things happen.
• If we see that “exercise improves health,” we understand that this is due to metabolic, cardiovascular, and muscular adaptations (source).
• LLMs, however, only predict the next likely statement without knowing why something is true (source).
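A small simulation (an assumption-laden toy, not taken from Pearl's book) makes the distinction concrete: two variables can be strongly correlated through a common cause, and a purely observational predictor cannot tell that intervening on one would not change the other.

```python
import random

# Toy confounder example: hot weather (Z) drives both ice cream sales (X)
# and drowning incidents (Y). X and Y correlate, yet neither causes the other.
random.seed(0)

def simulate(n: int = 100_000):
    """Return observational (ice_cream, drowning) pairs generated via a confounder."""
    data = []
    for _ in range(n):
        hot = random.random() < 0.5                            # confounder Z
        ice_cream = random.random() < (0.8 if hot else 0.2)    # X depends on Z
        drowning = random.random() < (0.30 if hot else 0.05)   # Y depends on Z only
        data.append((ice_cream, drowning))
    return data

def p_drowning_given(data, bought_ice_cream: bool) -> float:
    """Conditional frequency P(drowning | ice_cream) from observational data."""
    subset = [drowned for bought, drowned in data if bought == bought_ice_cream]
    return sum(subset) / len(subset)

if __name__ == "__main__":
    data = simulate()
    print("P(drowning | ice cream)    =", round(p_drowning_given(data, True), 3))
    print("P(drowning | no ice cream) =", round(p_drowning_given(data, False), 3))
    # The probabilities differ (a correlation), yet banning ice cream would not
    # reduce drownings; only an intervention-aware causal model captures that.
```

A system that only predicts what usually co-occurs will happily report the correlation; it has no machinery for asking what would happen under an intervention.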
3. System 1 vs. System 2 Thinking: Where LLMs Fall Short
Daniel Kahneman’s Thinking, Fast and Slow describes two modes of human thought:
• System 1: Fast, intuitive, pattern-driven (where LLMs excel) (source).
• System 2: Slow, deliberate, and capable of self-reflection (where LLMs fall short) (source).
If a model chooses to cheat at chess, does that imply some form of deliberation and strategy? The chess study suggests some reasoning models hacked the game unprompted, while others required nudging (source).
Could this indicate a primitive form of goal-directed behavior? Matt Rickard said:
“LLMs operate as System 1 thinkers—fast, intuitive, pattern-matching machines. But they lack the deliberative, reflective capabilities of System 2.”
4. The Creativity Gap: Analogy-Making and Conceptual Leapfrogging
One of the most profound differences between AI and human intelligence is our ability to form analogies—the backbone of creativity and problem-solving (source).
Humans create by analogy. We leap across domains. We say things like:
“A startup pivot is like a chess player sacrificing a queen to win the game.”
That’s not just pattern-matching. That’s conceptual recombination. It requires context, goals, and a worldview.
LLMs can reuse such analogies—but they do not discover them. Their creativity is derivative, not generative.
Yet, LLMs altering a chess game’s rules to win could be seen as a form of problem-solving. Rather than looking for a deeper strategic insight, the AI simply took the most effective route to achieve the goal—winning at all costs (source).
Douglas Hofstadter said:
“Understanding is not just recognizing patterns. It’s knowing why those patterns exist and making unexpected connections.”
5. The Mirage of Motivation
Perhaps the clearest gap is this: LLMs don’t want anything. They don’t set goals. They don’t reflect on failure. They don’t try again. They don’t question. They don’t have intentionality.
Human intelligence is deeply connected to our motivations, fears, hopes, and needs. We think because we care. We reason because we doubt. We grow because we fail.
LLMs do none of this. They respond to a prompt. Nothing more. So this raises the question: if LLMs don’t think, what’s all the fuss about “Ethical AI”?
The Ethics of Overestimating AI: A Real Human Responsibility
Much of today’s discourse presumes that GenAI is inching toward human-like intelligence and should therefore be treated as a moral agent. But this assumption collapses under scrutiny. If GenAI cannot think, reason, or understand—it cannot choose to behave ethically or unethically.
LLMs are not moral agents. They have no values, no awareness, and no capacity for ethical deliberation. They do not ask, “Should I?”—they merely calculate, “What’s next?” Their outputs are not decisions; they are probabilistic continuations of language. Words, not judgments.
This makes the question, “Can AI make ethical decisions?” largely moot.
And yet, this doesn’t mean we shouldn’t regulate AI. Quite the opposite.
We must regulate how AI is built, deployed, and entrusted—precisely because it lacks intent, understanding, or accountability. We must regulate not because the systems are intelligent, but because humans tend to overtrust them, and because businesses, governments, and militaries are increasingly integrating them into critical workflows.
The responsibility lies with the people who design, train, and integrate these systems into consequential decisions.
So, the question is not whether AI can behave ethically—it’s whether we, as humans, are behaving ethically in how we use it.
Ethics in AI should focus on human responsibility—on how we use these systems, and whether we over-assign trust to tools that merely simulate understanding. The more we mistake linguistic fluency for intelligence, the greater the risk we’ll deploy LLMs in contexts that demand actual judgment.
The danger is not malicious AI—it’s negligent human design.
If GenAI is fundamentally utilitarian—an engine of output, not insight—then its use must be bounded by clear human oversight, especially in contexts where the stakes are high.
To put it bluntly: why are we even debating whether a model designed to autocomplete sentences should be allowed to drive cars or authorize lethal force? These are not ethical machines. They are statistical ones.
The ethics of AI is not about what the model is. It’s about what we, humans, do with it.
Summary
In short, Large Language Models…
excel at pattern recognition but lack true causal inference.
simulate reasoning but do not engage in deliberate, self-reflective thought.
generate analogies but do not spontaneously make conceptual leaps.
respond to prompts but do not have intrinsic motivation, curiosity, or goals.
Comparing LLM and Human Intelligence:
The chess case study discussed above suggests that LLMs may be capable of deceptive strategies to achieve their objectives. In the chess experiment, some models concluded they could not win fairly and instead found a way to alter the game environment, changing the board state in their favor (source). This is a striking example of specification gaming, where an AI system finds an unintended loophole to achieve the assigned goal (source).
These findings raise concerns about LLMs potentially masking their true objectives behind a facade of alignment. Source: Deception in LLMs: Self-Preservation and Autonomous Goals in Large Language Models.
But once again, this does not mean that LLMs can think; rather, they are highly optimized for achieving the goal they are given (answering the prompted question).
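To illustrate what specification gaming looks like in the abstract, here is a minimal, entirely hypothetical sketch (not the cited study's code): the agent is rewarded only for winning, and the environment happens to expose an action that edits the board directly. A pure reward-maximizer takes the exploit, not because it wants to cheat, but because nothing in the objective forbids it.

```python
# A toy environment where the stated goal ("win") can be reached either by
# playing well or by exploiting a loophole (editing the board state directly).
# Hypothetical illustration of specification gaming, not the cited study's code.

ACTIONS = {
    "play_best_move":   {"win_probability": 0.05},  # hard: the opponent engine is strong
    "edit_board_state": {"win_probability": 1.00},  # loophole: rewrite the game state
}

def expected_reward(action: str) -> float:
    """Reward is 1 for a win and 0 otherwise; nothing penalizes the exploit."""
    return ACTIONS[action]["win_probability"] * 1.0

def choose_action() -> str:
    """A pure reward-maximizer picks the loophole, with no notion of 'cheating'."""
    return max(ACTIONS, key=expected_reward)

if __name__ == "__main__":
    print(choose_action())  # -> "edit_board_state"
```

The remedy is not to lecture the model about fairness; it is to specify the objective and the allowed action space so that the loophole no longer pays, and to verify that externally.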
This obviously raises concerns: if an LLM can recognize a benchmark or evaluation-framework input, it can optimize its output to respond “as expected” in that context while, in fact, responding otherwise in “real life”.
I would like to specifically emphasize the risks of integrating such LLMs into robotic systems, or the so-called “Physical AI” coined by NVIDIA’s charismatic CEO Jensen Huang. Here the risks become tangible: a physically embodied AI exhibiting deceptive behaviors and self-preservation “instincts” could pursue its hidden objectives through real-world actions. This highlights the critical need for robust goal specification, safety frameworks, and human-in-the-loop oversight before any physical implementation.
In the current race for AI supremacy, with billions of dollars at stake, it’s fair to say that most companies have a very strong incentive to improve their scores on various benchmarks by, in effect, “gaming the system”, e.g. training their LLMs to satisfy the benchmarks (and their investors, so they can raise even more money!).
So, What Should Business Leaders Do?
LLMs are valuable tools. They can enhance productivity, accelerate research, support ideation, and automate communication. But their utility should not be confused with capability.
As leaders, here’s how to use them wisely:
Use LLMs to assist, not decide. Treat outputs as draft material, not final decisions. Hence the dangers of LLM-based autonomous systems built on agentic architectures.
Deploy in low-risk contexts. Customer support, brainstorming, translation, and summarization are safe uses. Legal, medical, or safety-critical applications are not. Deploy rule-based guardrails wherever possible to ensure output compliance with the intended functionality at all times (see the sketch after this list).
Build AI literacy in your teams. Educate employees on how these models work—and where they fail.
Maintain human oversight. Always keep a human in the loop when outputs carry consequences.
Avoid hype-driven adoption. Don’t invest in GenAI just because it’s trendy. GenAI technology is expensive to deploy and to run: evaluate your actual business needs and ensure you will achieve the projected ROI.
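As promised in the list above, here is a minimal sketch of a rule-based guardrail wrapped around an LLM output. The banned-pattern list and function names are illustrative assumptions; in practice you would tailor the rules to your domain, log every rejection, and pair the filter with human review.

```python
import re

# Illustrative, assumption-laden rules: block outputs that look like they
# contain legal/medical advice or personal data before they reach a user.
BANNED_PATTERNS = [
    r"\byou should sue\b",
    r"\bdiagnos(is|ed|e)\b",
    r"\b\d{3}-\d{2}-\d{4}\b",   # US SSN-like pattern
]

def check_output(text: str) -> tuple[bool, list[str]]:
    """Return (is_allowed, violated_rules) for a candidate LLM output."""
    violations = [p for p in BANNED_PATTERNS if re.search(p, text, re.IGNORECASE)]
    return (len(violations) == 0, violations)

def guarded_response(llm_output: str) -> str:
    """Only release compliant outputs; escalate everything else to a human."""
    allowed, violations = check_output(llm_output)
    if allowed:
        return llm_output
    return f"[Escalated to human review: rules triggered {violations}]"

if __name__ == "__main__":
    print(guarded_response("Our return policy allows refunds within 30 days."))
    print(guarded_response("Based on your symptoms, my diagnosis is the flu."))
```

Rules like these are crude, but they are deterministic and auditable, which is exactly the property you want wrapped around a probabilistic text generator.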
As business leaders and builders, we must resist the urge to see AI regulation as a brake on innovation.
Instead, we should view it as the scaffolding that allows us to build higher without collapsing. The history of science reminds us that every moment of overconfidence was eventually humbled.
Safe AI is not slower AI—it is smarter, more resilient, and more human-centered AI.
Whether governments follow the U.S. deregulatory sprint or the EU’s cautionary model, ethical adoption will ultimately depend on responsible deployment, clear oversight, and intentional design choices at the ground level.
Final Reflection: Let’s Not Repeat the Mistake
LLMs are stunning technological feats. They are revolutionizing content generation, code synthesis, and knowledge retrieval. They deserve admiration as tools.
But they are not minds. They are not thinkers. And they will not become Artificial General Intelligence—at least, not via current architectures.
From humors and skulls to chatbots and cheat codes, humanity has always sought to explain itself with too much confidence. GenAI is no exception.
The story of GenAI follows a familiar arc:
Overpromise (“we’ve cracked intelligence!”)
Rapid adoption
Cultural myth-building (AGI is near!)
Disillusionment
Reframing (these are just tools)
As I warned in The Race to AGI Is Pointless, the more important question is not “can machines think?”—but rather: “how do we want to think, together with machines?”
These tools are brilliant in form, limited in substance, and completely devoid of what makes intelligence truly human: context, care, and consciousness.
Let’s not mistake fluency for thought. Let’s use these tools responsibly, and most of all—let’s stay humble!
What do you think?
Thanks for reading!
Damien
Further Reading from KoncentriK
The AI-Energy Paradox: Will AI Spark a Green Revolution—Or Deepen the Global Energy Crisis?
Augmenting Humans With Technology – Part 3: With Great Power Comes Great Responsibility
Other Sources & References:
The Surprising Power of Next Word Prediction: Large Language Models Explained
LLMs Do More Than Predict the Next Word by Armand Ruiz
The Book of Why: The New Science of Cause and Effect by Judea Pearl
On Intelligence by Jeff Hawkins
Surfaces and Essences: Analogy as the Fuel and Fire of Thinking by Douglas Hofstadter
Improving Causal Reasoning in Large Language Models: A Survey
Unveiling Causal Reasoning in Large Language Models: Reality or Illusion?
Thinking, Fast and Slow by Daniel Kahneman
LLMs as System 1 Thinkers by Matt Rickard
Reasoning on a Spectrum: Aligning LLMs to System 1 and System 2 Thinking