There is a presupposition that two very different fields have shared, with the same quiet, pernicious effect in each. In psychology it had a name and an era: behaviourism. The claim was that the observable output of a system, what it does, is the whole of what is worth modelling, and that the inner process which produced the behaviour is either unknowable or beside the point. Machine learning has inherited that presupposition without ever quite naming it. This note is an attempt to name it, and to say what it costs.

A clarification before the argument, because the strong version of this claim is wrong. The point is not that a language model has no inner workings; it plainly has rich internal representations. The point is narrower and, I think, harder to escape. It is about what we supervise. When we train with reinforcement learning from human feedback, or by cloning expert behaviour, the thing we measure and reward is the output. An expert endorses an answer, a worker performs an action, and the model is shaped to reproduce it. The reasoning that produced the answer, the considerations the expert weighed, the conditions they were tracking, the intent behind the act, is never itself a target. It is updated only by accident, insofar as it happens to help reproduce the behaviour. This is a methodological behaviourism: a behaviourist theory of supervision, sitting on top of a system that is not behaviourist at all.

What psychology already learned

Psychology has run this experiment, and we know roughly how it comes out. Behaviourism was not a mistake so much as an over-extended success. From it we got behavioural and cognitive-behavioural therapy, which are genuinely effective and measurably so. Nobody serious disputes that CBT helps, often a great deal. But there is broad agreement, even among its practitioners, about where it stops. It is excellent at the level of the symptom and the behaviour, and it does not, on its own, reach the root. It does not resolve the underlying cause of a trauma, because the underlying cause does not live at the level of observable behaviour. It lives in how the thing is held, generated and experienced.

This is why the depth and experiential traditions exist, and why they have such strong clinical effects. They do not treat a person as a bundle of behaviours to be reshaped. They work beneath the behaviour, with how a situation actually shows up for the person, the felt sense of it, and with what that generates downstream. Crucially, this is not the same as analysing or explaining the problem. A patient can produce a fluent, articulate account of their own difficulty and be no closer to its root, because the account is a surface too. The articulate explanation can itself be a defence. The work is to reach what generates the behaviour, not to collect a better description of it.

The trap in the obvious fix

Hold that last point, because it disarms the most natural objection. Someone will say: but we have moved past pure output training. We have chain-of-thought, we have process supervision, we reward the model for showing its reasoning. Surely that is the cure.

It is not, or not yet, and the clinical parallel says why. A stated reasoning trace is the model talking about its reasoning. It is a verbalisation, and even when it is written out before the answer it can work like an account given after the fact: there is now good evidence that these traces are frequently not faithful to the computation that actually produced the answer. Rewarding the explanation is still rewarding a behaviour. It is the same move as before, applied one level up: we have simply added "produce a plausible account of your reasoning" to the list of outputs we score. In the language of the therapy room, this is intellectualisation, the articulate surface mistaken for the generative depth. Process supervision of stated reasoning is behaviourism at one remove, and it inherits the same ceiling.

The explanation is a behaviour too. Rewarding it is not the same as modelling the reasoning that produced the act.

Why this is not just an analogy

There is a concrete version of the problem in machine learning, with none of the clinical vocabulary. Cloning behaviour from demonstrations, the regime behind imitation from recorded human activity, is known to fail in a specific way: it copies the action without recovering the intent that made the action correct, and so it breaks the moment conditions drift away from the demonstrations. The whole point of the harder, less fashionable approaches, inverse reinforcement learning among them, is to recover the latent reason behind the behaviour rather than the behaviour itself. The behaviourist route is chosen not because it is right but because it is cheap. Output labels are abundant; the reasons behind them are expensive and hard to capture faithfully. That is an honest reason, and it is also exactly how behaviourism won its ground in psychology. It was the tractable thing to measure.
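The failure described above can be made concrete with a toy sketch. Everything in it is an assumption made for illustration, not a description of any real system: a hypothetical expert whose intent is to step toward a target, demonstrations recorded while the target never moves, a cloned policy fitted to position and action alone, and an "intent" policy that is simply handed the structure of the expert's reason and fits only a gain. It is not a method for recovering that structure from data.

```python
import random


def expert_action(position, target):
    """Hypothetical expert: the intent is to step toward the target."""
    return target - position


def fit_line(xs, ys):
    """Ordinary least squares for y = w * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = cov_xy / var_x
    return w, mean_y - w * mean_x


def fit_gain(errors, ys):
    """Least squares for y = gain * error, with no intercept."""
    return sum(e * y for e, y in zip(errors, ys)) / sum(e * e for e in errors)


# Demonstrations: the target never moves, so it leaves no trace in the data.
TRAIN_TARGET = 0.0
rng = random.Random(0)
positions = [rng.uniform(-1.0, 1.0) for _ in range(200)]
actions = [expert_action(p, TRAIN_TARGET) for p in positions]

# Behaviour cloning: supervised only on (position -> action).
w, b = fit_line(positions, actions)


def cloned_policy(position, target):
    return w * position + b  # the target plays no part in the prediction


# Intent model: assumes the form of the reason, fits only its gain.
gain = fit_gain([TRAIN_TARGET - p for p in positions], actions)


def intent_policy(position, target):
    return gain * (target - position)


def mean_abs_error(policy, target, n=200):
    """Average gap between a policy and the expert at a given target."""
    test_rng = random.Random(1)
    total = 0.0
    for _ in range(n):
        p = test_rng.uniform(-1.0, 1.0)
        total += abs(policy(p, target) - expert_action(p, target))
    return total / n


clone_in_distribution = mean_abs_error(cloned_policy, TRAIN_TARGET)
clone_shifted = mean_abs_error(cloned_policy, 5.0)
intent_shifted = mean_abs_error(intent_policy, 5.0)

print(f"cloned policy, demonstrated target: {clone_in_distribution:.3f}")
print(f"cloned policy, shifted target:      {clone_shifted:.3f}")
print(f"intent policy, shifted target:      {intent_shifted:.3f}")
```

On the demonstrated conditions the two policies are indistinguishable, which is the uncomfortable part: nothing in the output-level fit signals that the reason was never captured. The gap appears only once the target moves, and in this sketch it closes only because the structure of the reason was supplied from outside the behavioural data.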

So the cost is not paid everywhere. For tasks that stay close to their training distribution, behaviour is often enough, just as CBT is often enough. The cost is paid in a particular class of problem, the one we care about most: holding up under conditions that were never demonstrated, respecting boundaries in high-sensitivity settings, staying faithful to the point of a task rather than its surface form. That class is what we mean by adherence. And it is precisely the class that a model of behaviour alone cannot serve, because adherence is a property of the reasoning, not of the output.

What a cognitive model actually is

This sharpens what we mean when we say we build cognitive models. A cognitive model, for us, is not a richer description of what an expert does. It is a model of how an expert reasons: how a situation shows up for them, what they attend to, what conditions govern their judgement, and how that generates the action. It is closer to a phenomenological account of the practitioner's process than to a behavioural log of their outputs.

Two honesties are owed here, and I would rather state them than have them found. The first: we are not claiming that the machine should have experience, a felt sense of its own. That would be a category error, and it is not the claim. The claim is that the human cognitive model we use to structure and supervise a system has to encode the generative reasoning, the why, and not only the behaviour, because that is the only thing adherence can be built on. The second: this is harder than measuring outputs, and that difficulty is the entire reason the field defaulted to behaviour. We do not pretend otherwise. We think the difficulty is worth taking on, because the alternative is to keep polishing the surface of a system whose depth we declined to model.

It also closes a loop with an earlier note in this series. We argued before that training lets the operators which mark the status of a claim decay into plain content, so the model keeps the assertion and loses the frame. This is the same failure seen from another side. In both cases the generative, governing level, the reason, the condition, the frame, is collapsed into the behavioural level, the output, the content. A world model assembled that way is not a model of the world. It is the pooled output of a behaviourist method, and no amount of scale turns the one into the other.

The earliest note in this series put it as an image: water is gentle in a cup and violent in a flood, and the mistake is to think the cup is the cause. Behaviourism makes the same mistake about the mind, and we have taught our machines to make it too. We keep studying the shape of the water. The cause was never in the behaviour. It was in what gave rise to it.
