At the moment of choice, an active-inference agent does not consult a value table. It scores each candidate policy with a single number, the expected free energy G(pi), and lets a softmax turn those scores into a probability over what to do next. This post walks the score, end to end, on a deliberately small example.
For the frame around G(pi), see the pillar, expected free energy and goal-directed action. For the two components in detail, see epistemic versus pragmatic value. This walkthrough sits between them: the mechanics of picking one policy.
The setup
An agent lives in a small POMDP whose single hidden state variable takes one of two values, s = left or s = right. Its preference distribution P(o | C) over future observations puts most of its mass on the goal tile (Parr, Pezzulo and Friston, 2022, ch. 7). Its current belief over the hidden state is roughly 50 / 50.
Two policies are on the table over the next two timesteps.
- Policy A, direct. Head straight for what the agent currently guesses is the goal tile, given its uncertain belief.
- Policy B, probe then commit. Take one step through a room whose observation would sharply distinguish left from right, then head to the goal on the second step.
Scoring each policy
For every policy, the agent rolls its generative model forward and computes two expectations under the predicted future: the expected log probability of the predicted observations under the preference distribution (the pragmatic term), and the expected information gain about hidden states (the epistemic term). Both are measured in nats, in the same currency as inference itself, so they add cleanly.
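As a rough illustration of the two terms, here is a sketch for a single forward step in plain Python. It is not the labs' implementation: the function names, the `likelihood[o][s]` layout, and the example numbers are assumptions made for illustration, and the outputs are not meant to reproduce the round figures used in this post.

```python
import math


def predicted_observations(belief, likelihood):
    """q(o | pi): mix the likelihood over states by the current belief.

    likelihood[o][s] = P(o | s) along the policy being scored (assumed layout).
    """
    return [sum(p * b for p, b in zip(row, belief)) for row in likelihood]


def pragmatic_value(belief, likelihood, preferences):
    """Expected log preference, E_q(o|pi)[ln P(o | C)], in nats.

    Assumes every entry of `preferences` is strictly positive.
    """
    q_o = predicted_observations(belief, likelihood)
    return sum(q * math.log(c) for q, c in zip(q_o, preferences))


def epistemic_value(belief, likelihood):
    """Expected information gain about the hidden state, in nats.

    Computed as the mutual information between state and observation.
    """
    q_o = predicted_observations(belief, likelihood)
    gain = 0.0
    for o, row in enumerate(likelihood):
        for s, p_o_given_s in enumerate(row):
            joint = p_o_given_s * belief[s]
            if joint > 0.0:
                gain += joint * math.log(p_o_given_s / q_o[o])
    return gain


# Hypothetical numbers: a 50 / 50 belief and two candidate observation models.
belief = [0.5, 0.5]
uninformative = [[0.5, 0.5], [0.5, 0.5]]    # observation says nothing about s
sharp_probe = [[0.95, 0.05], [0.05, 0.95]]  # observation nearly reveals s
preferences = [0.9, 0.1]                    # P(o | C), most mass on outcome 0

for name, likelihood in [("uninformative", uninformative), ("sharp probe", sharp_probe)]:
    print(
        name,
        round(pragmatic_value(belief, likelihood, preferences), 3),
        round(epistemic_value(belief, likelihood), 3),
    )
```

Because both functions return nats, their outputs can be summed directly to give the negative G for that step. The uninformative model yields zero information gain, while the sharp probe approaches the ln 2 ceiling for a binary state.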
Concretely, and with round numbers chosen to be legible rather than authoritative:
| Policy | Pragmatic value | Epistemic value | Negative G(pi) |
|---|---|---|---|
| A, direct | +1.8 | +0.1 | +1.9 |
| B, probe then commit | +1.1 | +1.4 | +2.5 |
Policy A wins on immediate preference: half of the belief already sits on the correct configuration, so walking toward the current best guess lands often enough on the preferred outcome. Policy B pays a pragmatic cost for the detour and more than earns it back by resolving the ambiguity. The negative G column is what feeds the next step.
From scores to a distribution: the softmax
The policy posterior is not the argmax of negative G. It is a softmax, q(pi) = sigma( gamma times negative G(pi) ), where gamma is a precision weight on the policy prior (Parr, Pezzulo and Friston, 2022). That one knob sets how greedy the choice is. With the table's scores, negative G favors B by 0.6 nats, so with only two policies the probability of B is the logistic function of 0.6 times gamma. At low gamma (say 0.5), the two policies come out roughly 57 percent B and 43 percent A, exploratory. At medium gamma (2.0), B rises to about 77 percent. At high gamma (5.0), B is about 95 percent, and the softmax collapses to a near-hard decision on the argmin of G.
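The softmax step is small enough to check by hand or in a few lines of Python. The sketch below takes the pragmatic and epistemic values from the table, sums them into negative G, and applies the softmax at three precisions. The function name `policy_posterior` and the dictionary layout are illustrative choices, not anything taken from the labs.

```python
import math


def policy_posterior(neg_g, gamma):
    """Softmax over gamma * (negative G), returned as {policy: probability}."""
    scaled = {name: gamma * score for name, score in neg_g.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {name: math.exp(value - peak) for name, value in scaled.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# (pragmatic, epistemic) values copied from the table, in nats.
table = {"A": (1.8, 0.1), "B": (1.1, 1.4)}
neg_g = {name: pragmatic + epistemic for name, (pragmatic, epistemic) in table.items()}

for gamma in (0.5, 2.0, 5.0):
    q = policy_posterior(neg_g, gamma)
    print(f"gamma={gamma}: A={q['A']:.3f}, B={q['B']:.3f}")
```

Policy B's share comes out near 0.57, 0.77 and 0.95 at the three settings. Only the gap in negative G between the policies matters here. Adding the same constant to both scores leaves the distribution unchanged.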
Precision is the reason the same generative model can look decisive at one setting and ambivalent at another. It is not a display temperature; it is the inverse variance of the policy prior in the same variational apparatus that governs perception. If it looks unfamiliar, the sister post KL divergence and Bayesian inference in active inference walks through where it comes from.
What changes when you move the dials
The numbers above are illustrative; the durable pattern is what matters. Sharpen P(o | C) around one outcome and Policy A rises. Widen the initial belief, or raise sensory precision on the probe observation, and Policy B rises. Turn gamma up and the argmin wins more decisively; turn it down and the agent hedges. These are the same three dials exposed in the Precision Lab, so the behavioral regimes on the site map onto the same logic the table above shows.
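To see why a flatter belief and a sharper probe both favor Policy B, it helps to look at the probe's expected information gain on its own. The sketch below assumes a binary hidden state and a probe with symmetric noise, where `accuracy` is a hypothetical stand-in for sensory precision. Under those assumptions the information gain is the entropy of the predicted report minus the entropy of the noise.

```python
import math


def binary_entropy(p):
    """Entropy of a coin with bias p, in nats."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def probe_information_gain(p_left, accuracy):
    """Expected information gain from one probe observation, in nats.

    p_left:   current belief that s = left.
    accuracy: probability that the probe reports the true state
              (symmetric noise is an assumption of this sketch).
    """
    p_report_left = p_left * accuracy + (1.0 - p_left) * (1.0 - accuracy)
    return binary_entropy(p_report_left) - binary_entropy(accuracy)


for p_left in (0.5, 0.7, 0.9):
    row = [round(probe_information_gain(p_left, a), 3) for a in (0.6, 0.8, 0.95)]
    print(f"belief in left = {p_left}: gain at accuracy 0.6 / 0.8 / 0.95 = {row}")
```

Reading down a column, the gain shrinks as the belief moves away from 50 / 50. Reading across a row, it grows with probe accuracy. Those are the two routes by which Policy B's epistemic term gets larger.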
The action-selection loop inside the labs implements this schedule directly: posterior update, forward roll under each policy, G(pi) as pragmatic plus epistemic, softmax with a precision, sample. For where the generative model that gets rolled forward lives in code, see encoding a generative model in Elixir.
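For readers who want the loop's shape in one place, here is a minimal Python sketch of a single pass: posterior update, scoring, softmax, sample. It is not the labs' Elixir code. The forward roll is replaced by a hypothetical `score_policies` callable, and the likelihood values are made up for illustration.

```python
import math
import random


def bayes_update(belief, likelihood, observation):
    """Posterior over hidden states after one observation.

    likelihood[o][s] = P(o | s) (assumed layout); the observation is assumed
    to have nonzero probability under the current belief.
    """
    unnormalized = [likelihood[observation][s] * b for s, b in enumerate(belief)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def softmax(scores, gamma):
    """Softmax of gamma * scores, shifted by the max for stability."""
    peak = max(scores)
    weights = [math.exp(gamma * (s - peak)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def select_policy(belief, observation, likelihood, score_policies, gamma, rng):
    """One pass of the loop: update, score, softmax, sample.

    score_policies is a hypothetical callable standing in for the forward
    roll; it maps a belief to {policy_name: negative G}.
    """
    posterior = bayes_update(belief, likelihood, observation)
    neg_g = score_policies(posterior)
    names = list(neg_g)
    probs = softmax([neg_g[name] for name in names], gamma)
    choice = rng.choices(names, weights=probs, k=1)[0]
    return posterior, dict(zip(names, probs)), choice


def table_scores(_belief):
    """Stand-in scorer: returns the table's negative G regardless of belief."""
    return {"A": 1.9, "B": 2.5}


rng = random.Random(0)  # seeded so repeated runs draw the same sample
posterior, probs, choice = select_policy(
    belief=[0.5, 0.5],
    observation=0,
    likelihood=[[0.8, 0.2], [0.2, 0.8]],
    score_policies=table_scores,
    gamma=2.0,
    rng=rng,
)
print(posterior, probs, choice)
```

Passing the scorer in as a callable keeps the sampling logic separate from whatever model does the forward roll, so the same skeleton works whether the scores come from a table or from a full generative model.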