Policy Selection, A Conceptual Walkthrough, Universal Natural Intelligence

At the moment of choice, an active-inference agent does not consult a value table. It scores each candidate policy with a single number, the expected free energy G(pi), and lets a softmax turn those scores into a probability over what to do next. This post walks through that score, end to end, on a deliberately small example.

For the frame around G(pi), see the pillar, expected free energy and goal-directed action. For the two components in detail, see epistemic versus pragmatic value. This walkthrough sits between them: the mechanics of picking one policy.

The setup

An agent lives in a small POMDP whose single hidden state variable takes one of two values, s = left or s = right. Its preference distribution P(o | C) over future observations puts most of its mass on the goal tile (Parr, Pezzulo and Friston, 2022, ch. 7) Class E. Its current belief over the hidden state is roughly 50 / 50.

Two policies are on the table over the next two timesteps.

Policy A, direct. Head straight for what the agent currently guesses is the goal tile, given its uncertain belief.
Policy B, probe then commit. Take one step through a room whose observation would sharply distinguish left from right, then head to the goal on the second step.

Scoring each policy

For every policy, the agent rolls its generative model forward and computes two expectations under the predicted future Class C: the expected log evidence of preferred observations (the pragmatic term), and the expected information gain about hidden states (the epistemic term). Both are measured in nats, in the same currency as inference itself, so they add cleanly.
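The two expectations can be sketched for a single predicted step. This is a minimal illustration, not the labs' implementation: the belief, the likelihood matrices (indexed as likelihood[o][s] = P(o | s)), and the log preferences are all hypothetical numbers chosen for the example.

```python
import math

def predicted_observations(q_s, likelihood):
    """q(o) = sum over s of P(o | s) q(s), with likelihood[o][s] = P(o | s)."""
    return [sum(row[s] * q_s[s] for s in range(len(q_s))) for row in likelihood]

def pragmatic_value(q_s, likelihood, log_pref):
    """Expected log preference, E_q(o)[ln P(o | C)], in nats."""
    q_o = predicted_observations(q_s, likelihood)
    return sum(q * lp for q, lp in zip(q_o, log_pref))

def epistemic_value(q_s, likelihood):
    """Expected information gain about s: mutual information of s and o, in nats."""
    q_o = predicted_observations(q_s, likelihood)
    gain = 0.0
    for o, row in enumerate(likelihood):
        for s, p_o_given_s in enumerate(row):
            joint = p_o_given_s * q_s[s]
            if joint > 0.0:  # skip zero-probability pairs to avoid log(0)
                gain += joint * math.log(p_o_given_s / q_o[o])
    return gain

# Hypothetical example values (assumptions, not lab measurements).
belief = [0.5, 0.5]                      # q(s = left), q(s = right)
probe = [[0.95, 0.05], [0.05, 0.95]]     # observation sharply separates the states
blank = [[0.5, 0.5], [0.5, 0.5]]         # observation carries no information
log_pref = [math.log(0.9), math.log(0.1)]

print(epistemic_value(belief, probe))    # positive: the probe is informative
print(epistemic_value(belief, blank))    # zero: nothing to learn
print(pragmatic_value(belief, blank, log_pref))
```

Log preferences over a discrete outcome are never positive, so pragmatic values computed this way sit at or below zero. Adding the same constant to every policy's score leaves the softmax unchanged, so only the differences between policies matter.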

Concretely, and with round numbers chosen to be legible rather than authoritative:

Policy	Pragmatic value	Epistemic value	Negative G(pi)
A, direct	+1.8	+0.1	+1.9
B, probe then commit	+1.1	+1.4	+2.5

Policy A wins on immediate preference (+1.8 against +1.1): half of the belief already sits on the correct configuration, so walking toward the current best guess lands on the preferred outcome often enough. Policy B pays a pragmatic cost of 0.7 nats for the detour and more than earns it back by resolving the ambiguity, gaining 1.3 nats of epistemic value. The negative G column is what feeds the next step.

From scores to a distribution: the softmax

The policy posterior is not the argmax of negative G. It is a softmax, q(pi) = sigma( gamma times negative G(pi) ), where gamma is a precision weight on the policy prior (Parr, Pezzulo and Friston, 2022) Class E. That one knob sets how greedy the choice is. With negative G of 1.9 for A and 2.5 for B, the gap is 0.6 nats, so q(B) = 1 / (1 + exp(-0.6 gamma)). At low gamma (say 0.5), the two policies come out roughly 57 percent B and 43 percent A, exploratory. At medium gamma (2.0), B rises to about 77 percent. At high gamma (5.0), B is about 95 percent, and the softmax collapses to a near-hard decision on the argmin of G.
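The softmax step is small enough to check by hand. The sketch below takes the two negative G values from the table and prints the policy posterior at the three gamma settings; the function name policy_posterior is a label chosen for this example, not an identifier from the labs.

```python
import math

def policy_posterior(negative_G, gamma):
    """Softmax over gamma * (-G); the max is subtracted for numerical stability."""
    scaled = [gamma * g for g in negative_G]
    top = max(scaled)
    weights = [math.exp(x - top) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]

negative_G = [1.9, 2.5]  # Policy A, Policy B (illustrative table values)

for gamma in (0.5, 2.0, 5.0):
    q_A, q_B = policy_posterior(negative_G, gamma)
    print(f"gamma={gamma}: A={q_A:.2f}, B={q_B:.2f}")
```

Run as written, B comes out at about 0.57, 0.77, and 0.95 across the three settings.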

Precision is the reason the same generative model can look decisive at one setting and ambivalent at another. It is not a display temperature; it is the inverse variance of the policy prior in the same variational apparatus that governs perception Class E. If it looks unfamiliar, the sister post KL divergence and Bayesian inference in active inference walks through where it comes from.

What changes when you move the dials

The numbers above are illustrative; the durable pattern is what matters. Sharpen P(o | C) around one outcome and Policy A rises. Widen the initial belief, or raise sensory precision on the probe observation, and Policy B rises. Turn gamma up and the argmin wins more decisively; turn it down and the agent hedges. These are the same three dials exposed in the Precision Lab Class C, so the behavioral regimes on the site map onto the same logic the table above shows.
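Two of these dials act on the epistemic term and can be sketched in isolation. The example assumes a symmetric probe that reports the true state with probability accuracy, a stand-in for sensory precision; both that assumption and the function names are hypothetical, chosen only to show the direction of the effect.

```python
import math

def binary_entropy(p):
    """Entropy of a two-outcome distribution, in nats."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def probe_information_gain(belief_left, accuracy):
    """Mutual information between the hidden state and a symmetric probe
    observation: H(o) - H(o | s), in nats."""
    p_obs_left = accuracy * belief_left + (1.0 - accuracy) * (1.0 - belief_left)
    return binary_entropy(p_obs_left) - binary_entropy(accuracy)

# Wider initial belief (closer to 50 / 50) means more to gain from probing.
print(probe_information_gain(0.9, 0.95), probe_information_gain(0.5, 0.95))

# Higher precision on the probe observation also raises the gain.
print(probe_information_gain(0.5, 0.6), probe_information_gain(0.5, 0.95))
```

In both comparisons the second value is the larger one, which is the sense in which Policy B rises.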

The action-selection loop inside the labs implements this schedule directly: posterior update, forward roll under each policy, G(pi) as pragmatic plus epistemic, softmax with a precision, sample. For where the generative model that gets rolled forward lives in code, see encoding a generative model in Elixir.
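One tick of that schedule can be sketched as follows. This is a simplified stand-in and not the labs' Elixir code: it assumes a one-step lookahead with no state transitions, and it represents each policy by the hypothetical likelihood matrix (likelihood[o][s] = P(o | s)) of the observation the agent would see under it.

```python
import math
import random

def bayes_update(prior, likelihood, obs):
    """Posterior over hidden states after seeing observation index obs."""
    unnorm = [likelihood[obs][s] * prior[s] for s in range(len(prior))]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def negative_G(q_s, likelihood, log_pref):
    """Pragmatic plus epistemic value of one predicted step, in nats."""
    q_o = [sum(row[s] * q_s[s] for s in range(len(q_s))) for row in likelihood]
    pragmatic = sum(q * lp for q, lp in zip(q_o, log_pref))
    epistemic = sum(
        row[s] * q_s[s] * math.log(row[s] / q_o[o])
        for o, row in enumerate(likelihood)
        for s in range(len(q_s))
        if row[s] * q_s[s] > 0.0
    )
    return pragmatic + epistemic

def select_policy(q_s, policies, log_pref, gamma, rng):
    """Score each policy, apply a softmax with precision gamma, sample one."""
    names = list(policies)
    scores = [gamma * negative_G(q_s, policies[n], log_pref) for n in names]
    top = max(scores)
    weights = [math.exp(x - top) for x in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    choice = rng.choices(names, weights=probs, k=1)[0]
    return choice, dict(zip(names, probs))

# Hypothetical toy values for one tick.
rng = random.Random(0)
belief = bayes_update([0.5, 0.5], [[0.7, 0.3], [0.3, 0.7]], obs=0)
policies = {
    "direct": [[0.5, 0.5], [0.5, 0.5]],     # uninformative observation
    "probe": [[0.95, 0.05], [0.05, 0.95]],  # sharply informative observation
}
log_pref = [math.log(0.5), math.log(0.5)]   # flat preferences isolate the epistemic term
choice, probs = select_policy(belief, policies, log_pref, gamma=2.0, rng=rng)
print(belief, probs, choice)
```

With flat preferences the pragmatic term is identical for both policies, so the probe's higher probability here comes entirely from its information gain.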

What would break this walkthrough Class F. If, at fixed generative model and precision, the labs consistently sample a policy whose G(pi) is higher than an unselected alternative, the story on this page fails. The Precision Lab and Cell Lab both log G per candidate per tick, so a reader can check the claim without asking us to. If the sampled frequencies fall far outside the band the softmax predicts, then either the model is not the one the agent is actually using, or G is not the quantity being minimized.
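A reader's check against the logs could look something like the sketch below. It assumes the logged G values and gamma stay fixed across the ticks being counted, and it defines the band as a number of binomial standard errors around the softmax prediction; that definition, the z_max threshold, and the input format are assumptions made for this example, not the labs' log schema.

```python
import math

def softmax_probs(G, gamma):
    """Policy probabilities from expected free energies: softmax of -gamma * G."""
    scaled = [-gamma * g for g in G]
    top = max(scaled)
    weights = [math.exp(x - top) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def outside_band(counts, G, gamma, z_max=3.0):
    """Indices of policies whose sampled frequency departs from the softmax
    prediction by more than z_max binomial standard errors."""
    n = sum(counts)
    probs = softmax_probs(G, gamma)
    flagged = []
    for i, (c, p) in enumerate(zip(counts, probs)):
        se = math.sqrt(p * (1.0 - p) / n)
        if se > 0.0 and abs(c / n - p) > z_max * se:
            flagged.append(i)
    return flagged

G = [-1.9, -2.5]  # G for Policy A and Policy B (negated table values)

print(outside_band([230, 770], G, gamma=2.0))  # consistent with the softmax
print(outside_band([600, 400], G, gamma=2.0))  # A sampled far too often
```

An empty list means the counts sit inside the band; any flagged index is the kind of discrepancy that would count against the account on this page.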

Keep going

Expected free energy and goal-directed action ›

The pillar post that frames G(pi), pragmatic plus epistemic.

Epistemic versus pragmatic value ›

The two terms in detail, and why exploration is not a bolt-on.

Encoding a generative model in Elixir ›

Where the model that gets rolled forward under each policy lives in the codebase.

The Precision Lab ›

Move sensory precision, transition precision, and policy temperature yourself.

References. Parr, T., Pezzulo, G., and Friston, K. (2022). Active Inference: The Free Energy Principle in Mind, Brain, and Behavior. MIT Press. Chapter 7, policy selection and expected free energy; chapter 4, precision. UNI preprint, DOI 10.5281/zenodo.19785799, unrefereed. Numbers in the table are illustrative round values, not measurements from a specific lab run.