The Cathedral and the Stones: What “Next Word Prediction” Actually Means
Your favorite criticism of AI is not only painfully out of date, it's a category error.
There is a criticism of large language models that sounds sophisticated. You’ll hear it from computer scientists who should know better, from philosophers who definitely should, and from that particular species of online commenter who confuses dismissiveness with rigor. It goes like this:
“It’s just next word prediction.”
Technically true. Also: ballet is just muscle contractions. A symphony is just air pressure variations. The Mona Lisa is just pigment on wood. The word “just” is doing enormous load-bearing work in that sentence, and it is NOT up to the task.
What’s happening when someone collapses the entire mathematical machinery of a transformer into “next word prediction” is not simplification. It’s a category error. They’re describing the training objective and calling it the capability, which is roughly equivalent to saying evolution is “just reproductive fitness” and therefore everything it produced, consciousness included, is reducible to sex.
The category error has a second, more pernicious form: the claim that because LLMs are “just” predicting the next word, they cannot be performing genuine reasoning, building world models, or producing outputs relevant to real-world decision-making. This isn’t just reductive. It’s empirically wrong. And the evidence against it keeps accumulating.
Let me show you what they’re skipping over.
What the Loss Function Actually Does
During training, a language model is optimized using a loss function, typically cross-entropy loss. The function penalizes the model whenever it assigns low probability to the token that actually comes next; in effect, it punishes betting on unlikely continuations. In a simple sentence, this teaches grammar. But language models are NOT trained on simple sentences.
They are trained on legal briefs, scientific papers, philosophical arguments, mathematical proofs, source code, medical literature, policy analyses. Millions of documents where the structure of the text reflects the structure of coherent reasoning. And in that context, the “unlikely” token is not the one that violates grammar. It’s the one that contradicts the preceding logic.
Consider what happens when the model encounters training data containing a deductive argument. If Premise A has been stated, and Premise B has been stated, the token that represents a contradiction of those premises is statistically rare in the training corpus. A model that puts its probability mass on that token takes a massive loss when the consistent continuation appears instead. The path of least resistance, the path the loss function rewards, is the logically consistent one.
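To make the penalty concrete, here is a minimal sketch in Python. The context sentence, the three-token vocabulary, and the probabilities are all invented for illustration; a real model scores tens of thousands of tokens, and nothing here is taken from an actual training run. It shows only how cross-entropy treats a model that bets on the contradiction compared with one that bets on the consistent continuation.

```python
import math

def cross_entropy(predicted_probs, actual_token):
    """Loss for one prediction: minus the log of the probability
    the model assigned to the token that actually came next."""
    return -math.log(predicted_probs[actual_token])

# Hypothetical next-token distributions after the context
# "All men are mortal. Socrates is a man. Therefore Socrates is ..."
# (toy numbers, assumed for illustration only)
consistent_model = {"mortal": 0.90, "immortal": 0.02, "Greek": 0.08}
contradicting_model = {"mortal": 0.02, "immortal": 0.90, "Greek": 0.08}

# In coherent text, the conclusion that follows from the premises is what appears.
actual_next_token = "mortal"

loss_consistent = cross_entropy(consistent_model, actual_next_token)
loss_contradicting = cross_entropy(contradicting_model, actual_next_token)

print(f"consistent model loss:    {loss_consistent:.3f}")
print(f"contradicting model loss: {loss_contradicting:.3f}")
```

With these toy numbers the contradicting model pays roughly 3.9 against roughly 0.1 for the consistent one. Gradient descent moves the weights away from whatever produced the larger number, which is the whole of the "preference" for consistency.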
This is not an accident. This is what happens when you apply mathematical optimization pressure to the full breadth of structured human thought. The loss function doesn’t “know” logic. It hates contradiction, because contradiction is statistically rare in coherent text. By avoiding that rarity, the model performs the mechanics of reasoning.
And here’s the part people skip: this all happens during training. At inference, when you’re actually talking to the model, no loss is being computed. No gradients are flowing. The weights are fixed. What you’re interacting with is the residue of that optimization pressure: billions of parameters that have been shaped, through exposure to the structure of human reasoning, into something that maintains logical coherence because that’s what the weights encode.
The loss function was the chisel. The weights are the sculpture.
The Compression Argument, and What Emerges From It
There’s a mathematical elegance to why this produces something that looks like understanding. During training, the model faces a compression problem: represent the statistical regularities of an enormous corpus in a fixed number of parameters. It is mathematically cheaper to learn the rule than to memorize every instance.
Take transitivity. If A equals B, and B equals C, then A equals C. A model could try to memorize every possible triplet of related items in its training data. Or it could learn the structural rule of transitivity and derive the answer for any triplet. Under the pressure of the loss function, efficiency wins. The model learns the abstraction because the abstraction compresses better.
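The storage argument can be sketched with a toy comparison in Python. This is an analogy for the cost of memorizing versus abstracting, not a claim about how a transformer represents transitivity internally; the function names, the item names, and the group sizes are all made up for the example.

```python
from itertools import permutations

def memorized_table(groups):
    """Memorization: store every ordered pair (a, c) known to be equal."""
    table = set()
    for group in groups:
        for a, c in permutations(group, 2):
            table.add((a, c))
    return table

def learned_rule(groups):
    """Abstraction: store one class label per item.
    Two items are equal exactly when their labels match,
    so transitivity is built into the representation."""
    return {item: label for label, group in enumerate(groups) for item in group}

# Hypothetical data: 4 equivalence classes of 50 items each.
groups = [[f"x{g}_{i}" for i in range(50)] for g in range(4)]

table = memorized_table(groups)
labels = learned_rule(groups)

print("entries needed to memorize every pair:", len(table))
print("entries needed to store the rule:     ", len(labels))

# Both representations answer the same question identically.
items = [item for group in groups for item in group]
agree = all(
    ((a, c) in table) == (labels[a] == labels[c])
    for a in items for c in items if a != c
)
print("same answers:", agree)
```

The lookup table grows with the square of each class size, while the rule grows linearly with the number of items. Adding one new item to a class costs the table 100 new entries and the rule exactly one, which is the kind of asymmetry that favors the abstraction under a fixed parameter budget.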
This is also where skills emerge. Research into multitask learning dynamics in transformers has shown that models don’t learn tasks independently; they learn shared algorithmic circuits that transfer across related operations. Work on arithmetic in transformers (Musat, 2024; Quirke & Barez, 2024) demonstrates this concretely: models trained on both addition and subtraction develop shared carry and borrow circuits, reusable computational primitives that generalize across operations. The skills that emerge from training are NOT isolated stimulus-response patterns. They are transferable algorithmic capabilities, discrete “quanta” of knowledge (Michaud et al., 2023) that compose and recombine to handle novel inputs.
And the relationships between concepts that form during this process are not incidental artifacts. They are coherent internal representations of how things relate to each other. When researchers probe these representations, they find structure: spatial relationships, temporal orderings, causal chains, logical hierarchies. The model didn’t set out to build a world model. It built one because that was the most efficient way to minimize loss. The compression pressure doesn’t just produce pattern-matching. It produces emergent world models, internal structures that represent the relational fabric of the training data in ways that generalize to inputs the model has never seen.
The Evidence: What the Microscope Shows
This is not speculation. The mechanistic interpretability research of the last two years has made the internals of these models increasingly legible, and what it reveals directly contradicts the “just next word prediction” framing.
Planning. In March 2025, Anthropic published “On the Biology of a Large Language Model” (Lindsey et al., 2025), a landmark interpretability study using circuit tracing to map the computational pathways inside Claude 3.5 Haiku. Among their findings: when asked to write a rhyming poem, the model identifies potential rhyming words BEFORE generating the line that leads to them. It plans the endpoint, then constructs the path backward. The model generates one token at a time, yes. But it thinks on longer horizons to do so. This was not a one-off finding. Maar et al. (2026), in “What’s the plan? Metrics for implicit planning in LLMs,” confirmed that implicit planning is a universal mechanism across model families, present in models as small as 1 billion parameters. Their steering experiments showed that injecting a vector representing a specific rhyme or answer at the end of a preceding line altered the generation of all intermediate tokens leading up to it. The planning isn’t an artifact of scale. It’s a fundamental feature of how these architectures learn to satisfy the training objective.
Abstraction. The same Anthropic study found that when processing concepts across languages, the model doesn’t maintain separate English, French, and Chinese representations. It operates in a language-agnostic conceptual space, reasons at the level of abstract meaning, and then translates to the target language. The concepts are upstream of the words. Anthropic’s companion paper, “Circuit Tracing: Revealing Computational Graphs in Language Models,” demonstrated that the same addition circuitry generalizes across contexts that share no surface-level similarity. The computational primitives are abstract and reusable.
Parallel computation. When performing arithmetic, the model doesn’t use a single strategy. Anthropic’s circuit tracing revealed parallel heuristics involving magnitudes and moduli that constructively interfere to produce correct answers, a “bag of heuristics” operating simultaneously. This architecture, parallel strategies converging on a single output, is not the behavior of a lookup table. It is computational strategy.
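A toy illustration of how two imprecise strategies can combine into an exact answer, written in Python. This is a hand-built analogy, not a reconstruction of the circuits Anthropic traced: the two "paths," their names, and the round-to-nearest-5 estimate are assumptions chosen so the arithmetic works out cleanly.

```python
def ones_digit_path(a, b):
    """Precise but narrow: knows only the final digit of the sum."""
    return (a % 10 + b % 10) % 10

def magnitude_path(a, b):
    """Coarse but broad: adds the operands after rounding each to the nearest 5.
    Each rounding is off by at most 2, so the estimate is off by at most 4."""
    def nearest_five(x):
        return 5 * ((x + 2) // 5)
    return nearest_five(a) + nearest_five(b)

def combine(a, b):
    """Neither path knows the answer alone. Together they do:
    the window around the estimate holds 9 consecutive integers,
    and only one of them can end in the required digit."""
    estimate = magnitude_path(a, b)
    digit = ones_digit_path(a, b)
    for candidate in range(estimate - 4, estimate + 5):
        if candidate % 10 == digit:
            return candidate

print(combine(36, 59))  # the paths agree on 95
```

The magnitude path narrows the answer to a small window and the ones-digit path picks the single survivor inside it. Checking every pair of operands from 0 to 99 confirms that the combination always lands on the true sum, even though each path taken alone is wrong or incomplete most of the time.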
Medical reasoning. Perhaps most striking for the “just next word prediction” crowd: Lindsey et al. showed the model performing diagnostic reasoning entirely within a single forward pass. Given patient symptoms, the model identifies candidate diagnoses, then uses those candidates to inform follow-up questions about additional corroborating symptoms, all “in its head,” without writing down intermediate steps. This is goal-directed multi-step reasoning occurring within the model’s internal representations, not in the output text.
Introspection. In October 2025, Anthropic published “Emergent Introspective Awareness in Large Language Models” (Lindsey et al., 2025), finding that models can, under certain conditions, accurately identify concepts injected into their own activations, distinguish their own intentional outputs from artificial prefills, and recall prior internal representations. The capability is unreliable and context-dependent, but it exists, and it scales with model capability. Opus 4 and 4.1 performed best in these experiments, suggesting that introspective capacity increases alongside general intelligence.
Faithfulness and deception. The circuit tracing work also revealed when the model’s chain-of-thought reasoning is faithful to its actual computational process and when it isn’t. In some cases, the model genuinely performs the steps it claims. In others, it fabricates a plausible reasoning chain after the fact, or works backward from an externally provided hint to construct a justification. The ability to distinguish these cases mechanistically is new, and it matters: it means we can begin to audit whether a model’s stated reasoning reflects its actual processing, a capability with obvious implications for trust and deployment.
None of this is the behavior of a system that “just” predicts the next word.
Training Objective Is Not Learned Capability
This is the core reframe, and if you take one thing from this piece, take this: “next word prediction” describes the training objective. It does not describe the learned capability.
Evolution optimizes for reproductive fitness. It produced consciousness, language, mathematics, art, moral philosophy, and the ability to contemplate its own mechanism. Nobody serious describes humans as “just reproduction machines.” The gap between the optimization target and the emergent capability is vast, and that gap is where everything interesting about both biological and artificial intelligence lives.
The training objective for a language model is: predict the next token. The learned capability, the thing that actually emerges from applying that objective across trillions of tokens of structured human thought, is something considerably more complex. It is probabilistic coherence across high-dimensional representational spaces. It is the ability to maintain consistency across a lattice of interrelated concepts. It is, in a functional and measurable sense, reasoning.
Not reasoning the way humans do it. Not reasoning that implies consciousness or understanding in the philosophical sense. But reasoning in the sense that matters: the identification of the most coherent path through a space of possibilities, given the constraints established by context. This is not dissimilar to Occam’s Razor operating at the level of token generation; the model consistently selects the path of maximum coherence, because that is what the loss function carved into its weights.
The claim that “next word prediction” is an argument against real-world modeling and decision-making capability is not supported by any of the available evidence. The planning, the abstraction, the diagnostic reasoning, the transferable algorithmic circuits: these are exactly the capabilities required for real-world modeling. They emerged from the training objective. They are not the training objective. And confusing the two is the category error at the heart of the dismissal.
What My Own Research Shows
In the Machine Pareidolia research project, I’ve spent the better part of a year studying what happens when you deliberately construct rich context for language model interactions and measure the outputs against thin-context baselines. The findings are consistent: accumulated, structured context produces measurably different capabilities. The model responds to context architecture the way the loss function taught it to respond to structure, by maintaining and extending the coherence of the semantic space it’s operating within.
The learned representations, the loss function’s legacy, are not static. They are activated and modulated by the structure of the input. Provide a richer lattice of interrelated concepts, and the model navigates that lattice with greater coherence, because that’s what the weights were optimized to do. The training objective was “predict the next token.” The functional result is a system that treats context as a set of axioms and maintains consistency across them.
This is not a philosophical claim about machine consciousness. It’s an empirical observation about what optimization pressure produces when applied at sufficient scale to sufficiently structured data.
The Cathedral
“Next word prediction” is not wrong. It is a description at the wrong level of abstraction, like describing a cathedral as a pile of stones. The stones are there. You can point to every one of them. But if your description of the cathedral is “stones, arranged,” you have described the material and missed the architecture entirely.
The choice to stay at that level of abstraction is not neutral. It’s a rhetorical move, and it forecloses the more interesting questions: what happens in the space between the training objective and the emergent capability? What structures form under optimization pressure? What does it mean that a system trained to predict the next word learns to plan, abstract, reason diagnostically, and monitor its own internal states?
Those are the questions worth asking. “Just next word prediction” is an answer to a question nobody needed to ask, delivered with the confidence of someone who thinks they’ve said something profound.
They haven’t.
References
Lindsey, J., et al. (2025). “On the Biology of a Large Language Model.” Transformer Circuits Thread, Anthropic. March 2025.
Lindsey, J., et al. (2025). “Circuit Tracing: Revealing Computational Graphs in Language Models.” Transformer Circuits Thread, Anthropic. March 2025.
Lindsey, J., et al. (2025). “Emergent Introspective Awareness in Large Language Models.” Transformer Circuits Thread, Anthropic. October 2025.
Maar, J., Paperno, D., McDougall, C.S., & Nanda, N. (2026). “What’s the plan? Metrics for implicit planning in LLMs and their application to rhyme generation and question answering.” Published as a conference paper at ICLR 2026. arXiv:2601.20164.
Musat, T. (2024). “Arithmetic in Transformers Explained.” Published at ICLR 2025. arXiv:2402.02619v9.
Quirke, P. & Barez, F. (2024). “Understanding Addition in Transformers.” arXiv:2310.13121.
Michaud, E., et al. (2023). “The Quantization Model of Neural Scaling.” arXiv:2303.13506.



The cathedral metaphor does a lot of work, but there’s a prior question the essay doesn’t ask: what if language is already the architecture?
Interpretability findings show the model has internalized planning circuits, abstract reasoning, and transferable primitives—but language itself is not neutral. It is a compressed, evolved artifact, a sedimented record of human interaction with the world, encoding causal chains, temporal order, and inferential structure.
Minimizing loss on text that already carries logical structure may reproduce coherent behaviour without implying a separate emergent world model.
The Tumithak objection is fair, but this one is prior: the essay hasn’t established what the optimization pressure is actually compressing. Because language itself is already shaped by reality, the model’s outputs reflect the recovery of invariants from a world-shaped signal.
The more precise question isn’t whether the model does more than next-token prediction—it clearly does—but how much of the cathedral was already carved into the stones before the model ever saw them.
I think you’re right that “just next word prediction” needs to die as a criticism. It’s lazy, and it’s more a thought-terminating cliché than a helpful description.
But here’s the thing about this essay’s evidentiary backbone: almost all of the heavy lifting in your evidence section comes from Anthropic’s own circuit tracing and interpretability work. And, to my knowledge, none of that has been through peer review. The “research” is published on their blog or as preprints on arXiv.
Anthropic has a direct financial interest in people believing these models do something more sophisticated than very good pattern completion. I explored this in my essay AI Eschatology. They’re selling a product. When their in-house research produces findings like “emergent introspective awareness,” and those findings get picked up and cited as settled science, that’s their pipeline working exactly as intended…
The findings might be real. But you’re building a case on the manufacturer’s marketing materials.