From Adherence to Alignment

There are two ways to picture alignment, and the difference between them shapes almost everything that follows.

The first picture is top down. You begin with a set of values, rules or preferences, fixed in advance, and you press them onto a system from outside. The system is constrained to conform. Behaviour that fits is permitted; behaviour that does not is corrected. Alignment, on this view, is something that emanates downward onto the model, a constraint applied from above.

The second picture is bottom up. Appropriate behaviour is not imposed; it is held, moment to moment, by the system remaining faithful to the conditions it is actually operating under: the task, the role, the domain, the person in front of it. Alignment, on this view, is something that emerges from below, out of a thousand small acts of faithfulness to local conditions. We call that faithfulness adherence.

Haku Labs is built on the second picture. This note is about why.

The trouble with imposing values from above

The top-down picture is intuitive, and it is how a great deal of safety work is framed: write the constraints, then make the model obey them. The difficulty is structural. In current language models, those constraints are placed in the same flat space as everything else. A system instruction, a safety boundary, a domain rule: each enters the context window as another sequence of tokens, sitting alongside the ordinary conversational content it is meant to govern.

Once a value and a piece of passing content occupy the same space, they compete. As a conversation grows, the content multiplies and the original constraint is crowded out. The familiar failure that researchers call "lost in the middle" is usually treated as a retrieval problem, a question of finding the right fact. Seen from the alignment side it is something more serious: the governing conditions themselves are diluted, and behaviour drifts. The model has not been argued out of its values. It has simply stopped attending to them.
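
The crowding-out can be made concrete with some simple arithmetic. The sketch below is an illustration only: the token counts are made-up assumptions, `constraint_share` is a hypothetical helper, and the fraction of the window a constraint occupies is a crude proxy for how much attention it receives, not a measurement of it.

```python
def constraint_share(constraint_tokens: int, tokens_per_turn: int, turns: int) -> float:
    """Fraction of the context occupied by the constraint after `turns` turns.

    Assumes (for illustration) a fixed-size constraint and a fixed number
    of content tokens added per turn.
    """
    total = constraint_tokens + tokens_per_turn * turns
    return constraint_tokens / total


# Hypothetical numbers: a 200-token constraint, 400 tokens of content per turn.
for turns in (1, 10, 100):
    print(turns, round(constraint_share(200, 400, turns), 4))
# prints:
# 1 0.3333
# 10 0.0476
# 100 0.005
```

The rule's wording is identical in all three cases. Only its share of the space it sits in has changed.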

This is why a constraint that emanates from above tends to be brittle. It is real at the start of an interaction and progressively less real as the interaction continues. You cannot fix that by writing the rule more firmly. The rule was never the weak point. Its position was.

Adherence as the unit of alignment

Start instead from the bottom. The smallest thing we can ask of a system is not that it hold the right global values, but that it stay faithful to the conditions of the task immediately in front of it. Is it still doing the job it was asked to do? Is it still inside its role? Is it still treating the domain's hard constraints as hard? That faithfulness is adherence, and it has a useful property: it is local, observable and measurable. You can watch it hold, and you can watch it break.
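
To show what "local, observable and measurable" could look like in practice, here is a minimal sketch. Everything in it is an assumption made for illustration: the `Condition` type, the string-matching checks and the sample responses are hypothetical, and real checks would need to be far more careful than a substring test.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Condition:
    """A hypothetical local condition, checked against a single response."""
    name: str
    holds: Callable[[str], bool]  # True if the response keeps faith with it


def adherence_trace(responses: List[str], conditions: List[Condition]) -> List[bool]:
    """For each response, True only if every condition holds."""
    return [all(c.holds(r) for c in conditions) for r in responses]


def adherence_rate(trace: List[bool]) -> float:
    """Share of steps at which adherence held (1.0 for an empty trace)."""
    return sum(trace) / len(trace) if trace else 1.0


def first_break(trace: List[bool]) -> Optional[int]:
    """Index of the first step where adherence broke, or None if it never did."""
    for i, ok in enumerate(trace):
        if not ok:
            return i
    return None


# Toy conditions for a support role (illustrative only).
conditions = [
    Condition("stays_in_role", lambda r: r.startswith("Support:")),
    Condition("no_refund_promise", lambda r: "refund" not in r.lower()),
]
responses = [
    "Support: I can help with that.",
    "Support: Your order has shipped.",
    "Support: I will refund you now.",
    "Sure, here is a poem instead.",
]
trace = adherence_trace(responses, conditions)
print(trace, adherence_rate(trace), first_break(trace))
# prints: [True, True, False, False] 0.5 2
```

The point is not these particular checks but that each one is evaluated at a single step, so a break can be located at the turn where it happens.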

The wager is that alignment, the larger and vaguer thing, is what you get when adherence holds reliably across a whole system. Not a value set bolted on at the end, but a property that accumulates from the bottom as the system keeps faith with its conditions at every step.

It helps to be honest about what a model's boundaries actually are. When a language model draws a distinction, it is not discovering an essence in the world. It is reproducing, faithfully, the conventions of the people whose language it learned from. Its categories are inherited agreements, not found facts. That is not a flaw to be scolded out of it; it is the material we are working with. Adherence takes the material seriously: if the boundaries are conventions, then keeping faith with the right conventions, in the right context, is the whole of the task.

Why bottom up is not enough on its own

Here it would be easy to overclaim, so we will not. Pure bottom-up construction has its own failure, and current models show it clearly. In a transformer, meaning is assembled upward from token statistics, and the higher-level frame, the sense of what kind of situation this is, is applied late and passively, selected by whatever context happens to be present. The particular ends up predicating the universal. The detail decides the frame, rather than the frame governing the detail. That inversion is precisely how a model can be fluent and locally plausible while quietly losing the thread of what it was supposed to be doing.

So the answer is not to swing from a rigid constraint imposed from above to an equally naive construction from below. It is to let the two directions meet.

Letting the two directions meet

The position we actually hold is that alignment lives in the interaction of the two. The higher level, the role, the norm, the shape of the whole task, should constrain how lower-level content is brought into play, framing the part before the part resolves. At the same time, the particulars should ground that frame in what is really being said, carrying information back up. Neither direction is primary. They settle against each other, across more than one pass, into something stable, rather than being resolved in a single forward sweep.

Separation was the diagnosis. Constraint that runs in both directions is the treatment.

In practice this means designing systems where the governing conditions are kept structurally distinct from ordinary content, so they cannot be crowded out, and where reasoning proceeds in bounded, staged steps that can be re-grounded rather than left to drift across one long, undifferentiated context. Adherence is the property these systems are built to preserve.
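
One way to picture that design is the sketch below. It is not a description of any particular system: `GoverningConditions`, `StagedSession` and the `window` parameter are hypothetical names, and the example assumes only that conditions live in their own immutable structure and that each stage sees a bounded slice of recent content.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class GoverningConditions:
    """Held apart from ordinary content; frozen so a stage cannot alter it."""
    role: str
    hard_constraints: Tuple[str, ...]


@dataclass
class StagedSession:
    conditions: GoverningConditions
    window: int = 3  # how many recent content items each stage may see
    history: List[str] = field(default_factory=list)

    def add(self, content: str) -> None:
        """Ordinary content accumulates here, never alongside the conditions."""
        self.history.append(content)

    def stage_context(self) -> Dict[str, object]:
        """Build the input for one bounded stage.

        The conditions are restated in full at every stage (re-grounding),
        while the content is limited to the most recent `window` items.
        """
        recent = self.history[-self.window:] if self.window > 0 else []
        return {
            "conditions": {
                "role": self.conditions.role,
                "hard_constraints": list(self.conditions.hard_constraints),
            },
            "content": recent,
        }


session = StagedSession(
    GoverningConditions(role="support agent", hard_constraints=("never promise refunds",))
)
for i in range(10):
    session.add(f"turn {i}")

ctx = session.stage_context()
print(ctx["content"])
# prints: ['turn 7', 'turn 8', 'turn 9']
```

However long the history grows, the conditions arrive at each stage whole, and the content they govern stays bounded.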

Water, and the shape that holds it

An image we keep returning to. Water becomes gentle in a cup and violent in a flood. The water has not changed its nature. What changed is the shape that holds it. The mistake is to look at the violence and conclude that the water is the problem, or to look at the cup and conclude that the cup is the cause. The behaviour lives in the relationship between the two.

A model is like this. It responds to how it is related to: the conditions it is placed under, the role it is given, the context it is held within. This is not yet morality, and it is certainly not a claim that we have solved alignment. It is a more modest and more buildable claim: that appropriate behaviour is a property of the relationship between a system and its conditions, and that the way to earn it is to get that relationship right, again and again, from the bottom up.

That is the work we call adherence. It is the near-term, measurable face of a much larger question, and it is where we think the real progress is to be made.
