Why We Do Not Use Value Functions, Universal Natural Intelligence

People ask us where the value function lives in the UNI build. The short answer is that there isn't one, and that is a design choice, not an oversight. This post walks through why, what we use in its place, and where the choice could be wrong.

UNI is a working hypothesis on an attainable path toward General Natural Intelligence: a natural, active-inference approach whose evidence is growing, evidence-classed, and tested in the open. Do not take the claim on faith. Test the build, inspect the gates, and help us find where it fails.

What a value function is, and what it silently assumes.

In classical reinforcement learning, a value function V(s) or Q(s,a) encodes the expected discounted future reward of being in a state (or taking an action from a state), under some policy. Optimal behavior falls out of choosing the action with the highest expected value. The math is clean and the engineering track record is real (Class E).
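To make the definition concrete, here is a minimal sketch of tabular Q-value iteration on a toy problem. The three-state chain, the action names `wait` and `go`, the rewards, and the discount of 0.9 are all hypothetical and chosen only so the numbers are easy to check by hand. None of it comes from the UNI repository.

```python
# Hypothetical toy MDP (illustration only): states 0 -> 1 -> 2, state 2 absorbing.
# TRANSITIONS[(state, action)] = (next_state, reward); transitions are deterministic.
GAMMA = 0.9
ACTIONS = ("wait", "go")
TRANSITIONS = {
    (0, "wait"): (0, 0.0), (0, "go"): (1, 0.0),
    (1, "wait"): (1, 0.0), (1, "go"): (2, 1.0),
    (2, "wait"): (2, 0.0), (2, "go"): (2, 0.0),
}

def q_value_iteration(transitions, actions, gamma, sweeps=200):
    """Repeat the Bellman optimality backup Q(s,a) = r + gamma * max_a' Q(s',a')."""
    q = {key: 0.0 for key in transitions}
    for _ in range(sweeps):
        q = {
            (s, a): r + gamma * max(q[(s_next, b)] for b in actions)
            for (s, a), (s_next, r) in transitions.items()
        }
    return q

def greedy_action(q, state, actions):
    """Pick the action with the highest expected discounted reward."""
    return max(actions, key=lambda a: q[(state, a)])

Q = q_value_iteration(TRANSITIONS, ACTIONS, GAMMA)
best_from_start = greedy_action(Q, 0, ACTIONS)
```

Everything the agent "wants" here is carried by the single reward number attached to each transition, which is the assumption the next paragraph questions.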

The quiet assumption is that a scalar reward signal exists, is stable over the horizon that matters, and captures what the system actually cares about. For a chess engine that assumption is defensible. For a cardio-renal loop, an incident response, or a person recovering cognitive footing after a bad week, it is not. The reward channel is the modeling artifact most likely to be miscalibrated, and a value function inherits every miscalibration compounded across the horizon (Class E, following Parr, Pezzulo and Friston (2022), Chapter 2).
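A small arithmetic sketch of how a reward error accumulates, under the simplifying assumption that the reward is off by the same constant `eps` at every step. The values 0.1 and 0.99 are illustrative only. Real miscalibration is rarely constant, so read this as the simplest case rather than a model of any lab.

```python
def value_bias(eps, gamma, horizon):
    """Error in a discounted return when each step's reward is off by eps."""
    return sum(eps * gamma ** t for t in range(horizon))

EPS, GAMMA = 0.1, 0.99  # hypothetical per-step error and discount factor

bias_short = value_bias(EPS, GAMMA, horizon=10)
bias_long = value_bias(EPS, GAMMA, horizon=1000)
bias_limit = EPS / (1 - GAMMA)  # geometric-series limit as the horizon grows
```

With a discount close to 1, a per-step error of 0.1 grows toward 0.1 / (1 - 0.99), roughly 10, in the value estimate. The longer the horizon that matters, the more of that limit the value function absorbs.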

What we use instead.

The UNI labs choose actions by minimizing expected free energy (EFE) over policies, exactly as spelled out in Parr, Pezzulo and Friston (2022) Chapter 2 (Class E). EFE decomposes into two terms the agent can compute from its own generative model:

A pragmatic term, the KL divergence between predicted outcomes under a policy and a prior distribution over preferred outcomes P(o). This is where "what the system wants" lives. It is a distribution over outcomes, not a scalar over states.
An epistemic term, the expected information gain about hidden states under a policy. This is the term that makes an active-inference agent seek out observations that would sharpen its own posterior, even when no immediate preferred outcome is at stake.
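The two terms can be sketched for a small discrete model as follows. This is a minimal illustration, not the lab controllers: the two-state likelihood, the preference vector, and the policy names `stay` and `commit` are hypothetical. The sketch computes the outcome KL described above (often called risk in the literature) and the expected information gain. It then combines terms in one standard way, G = risk + ambiguity, which is algebraically equal to negative expected log-preference minus information gain. Whether the labs combine the terms in exactly this form is an assumption.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(q, p):
    """KL[q || p] in nats; assumes p > 0 wherever q > 0."""
    return sum(x * math.log(x / y) for x, y in zip(q, p) if x > 0)

def predict_outcomes(likelihood, q_states):
    """Q(o|pi) = sum_s P(o|s) Q(s|pi); likelihood[s] is the row P(.|s)."""
    n_out = len(likelihood[0])
    return [sum(q_states[s] * likelihood[s][o] for s in range(len(q_states)))
            for o in range(n_out)]

def efe_terms(likelihood, q_states, preferences):
    q_out = predict_outcomes(likelihood, q_states)
    risk = kl(q_out, preferences)                       # KL[Q(o|pi) || P(o)]
    ambiguity = sum(q * entropy(row) for q, row in zip(q_states, likelihood))
    info_gain = entropy(q_out) - ambiguity              # mutual information of s and o
    return {"risk": risk, "ambiguity": ambiguity,
            "info_gain": info_gain, "G": risk + ambiguity}

# Hypothetical model: P(o|s) rows, preferred outcomes P(o), predicted states per policy.
LIKELIHOOD = [[0.9, 0.1], [0.2, 0.8]]
PREFERENCES = [0.8, 0.2]
POLICIES = {"stay": [0.5, 0.5], "commit": [0.95, 0.05]}

terms = {name: efe_terms(LIKELIHOOD, q_s, PREFERENCES) for name, q_s in POLICIES.items()}
best = min(terms, key=lambda name: terms[name]["G"])  # lowest expected free energy
```

Every quantity here is computed from the generative model and the preference distribution alone. No reward signal appears anywhere in the calculation.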

Two things fall out of this choice that a value function does not give you for free (Class C, from inspection of the Precision Lab and Cell Lab controllers in this repository). First, the objective is a divergence against a distribution of preferences, so uncertainty in the preference itself is a first-class citizen: a wide P(o) quietly tells the agent "any of these is fine", which is exactly the right thing to say when the operator does not know yet. Second, the epistemic term is not a bolt-on exploration bonus tuned by hand. It is measured in the same nats as the pragmatic term, so both are priced in the same currency, and it goes to zero on its own once the posterior is sharp enough. There is no exploration schedule to anneal.
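The claim that the epistemic term switches itself off can be checked on a toy model. The sketch below assumes a hypothetical two-state, two-outcome likelihood and sweeps the belief over hidden states from maximally uncertain to fully certain. The numbers are illustrative and not drawn from any lab.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def info_gain(likelihood, q_states):
    """Expected information gain about states: H[Q(o)] - E_Q(s)[H[P(o|s)]]."""
    n_out = len(likelihood[0])
    q_out = [sum(q_states[s] * likelihood[s][o] for s in range(len(q_states)))
             for o in range(n_out)]
    expected_cond_entropy = sum(q * entropy(row) for q, row in zip(q_states, likelihood))
    return entropy(q_out) - expected_cond_entropy

LIKELIHOOD = [[0.9, 0.1], [0.1, 0.9]]  # hypothetical P(o|s), one row per state

# Belief in state 0 sharpens from 0.5 (no idea) to 1.0 (certain).
gains = [info_gain(LIKELIHOOD, [p, 1.0 - p]) for p in (0.5, 0.9, 0.99, 1.0)]
```

The values in `gains` shrink as the belief sharpens and reach zero when the belief is certain, with no annealing parameter anywhere in the code.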

Why the choice matters for the domains we care about.

The Cell Lab is a service cell under disturbance. The preferred outcome is not a number, it is a viable set: a distribution over service-level observations we would rather see. Empirically, a UNI controller that minimizes KL to that viable set behaves differently from a controller trying to maximize a hand-designed reward proxy (Class E, benchmark results shown on the science page). It gives up some peak performance on families where the neural baseline has been allowed to fit the reward directly, and it holds up better on families where the "right" reward is hard to write down at all. That is the trade we made on purpose.

The Heart Lab is a slower version of the same argument. The Loop Lab exists precisely to expose the bifurcation point where the choice of sensory precision, not the choice of reward, decides the regime the agent lands in (Class C, from the lab's own dial-and-plot behavior). A value function would obscure that structure. EFE surfaces it.

Where the choice could be wrong.

We owe the reader an honest list of the failure modes we watch for:

If the domain really does have a well-calibrated scalar reward and a stationary transition model, a well-tuned value-function method will beat EFE on wall-clock time and on sample efficiency. The Cell Lab shows one such family (memory_leak, where the neural baseline wins).
If the prior over preferred outcomes P(o) is miscoded, EFE is not immune to garbage-in. It just fails differently: the pragmatic term drives the agent toward the wrong distribution instead of toward the wrong number. The fix is to make preferences inspectable and editable, which is what the dials in the Precision Lab are for.
If the generative model is badly mis-specified, both terms of EFE are computed against fiction and the agent will confidently pursue the wrong thing. This is the failure mode we take most seriously, and it is why the labs expose the model to the operator instead of hiding it inside a learned weight matrix.
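The miscoded-preference failure can be illustrated with a minimal sketch. It assumes two hypothetical policies, `hold` and `shed_load`, with fixed predicted outcome distributions, and compares an intended P(o) against a copy whose entries were swapped by mistake. Only the outcome KL is used here, and none of the names or numbers come from the labs.

```python
import math

def kl(q, p):
    """KL[q || p] in nats; assumes p > 0 wherever q > 0."""
    return sum(x * math.log(x / y) for x, y in zip(q, p) if x > 0)

# Hypothetical predicted outcomes Q(o|pi) over (healthy, degraded) observations.
PREDICTED = {"hold": [0.8, 0.2], "shed_load": [0.3, 0.7]}

INTENDED_PREFS = [0.9, 0.1]  # what the operator meant: mostly healthy
MISCODED_PREFS = [0.1, 0.9]  # same numbers with the indices swapped by mistake

def choose(preferences):
    """Pick the policy whose predicted outcomes are closest to P(o)."""
    return min(PREDICTED, key=lambda name: kl(PREDICTED[name], preferences))

choice_intended = choose(INTENDED_PREFS)
choice_miscoded = choose(MISCODED_PREFS)
```

The model and the predictions are identical in both runs. Only the preference vector differs, and it is short enough for an operator to read and correct directly.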

None of this rules out value functions. It says that, for the specific problem shape UNI is aimed at (partially observable environments, unstable reward channels, operators who need to inspect and steer), an EFE objective is the more honest primitive. If you can show a falsifier, a case where EFE loses to a value function on a domain like the Cell Lab under matched compute, we would like to see it. That is what the benchmark is for.

EFE vs reward, a careful comparison ›

The longer form, with the math laid out and the domains where each objective wins.

Action selection in the UNI workbench ›

How EFE actually gets computed, step by step, in the deployed labs.

Gates and falsifiers, how we know when we are wrong ›

The pre-registered criteria that would tell us the EFE choice was a mistake.