Active inference does not choose what to do by summing rewards. It chooses by minimizing the expected free energy of a policy. That single quantity carries both the pull of a goal and the pull of information, and once you see the decomposition, a lot of goal-directed behavior stops looking mysterious.
This post is the pillar entry for how policies get selected in the framework UNI sits inside. It covers the definition conceptually, walks a tiny example, notes how the UNI action loop wires the same math in code, and closes with a falsifier: the behavioral pattern that would break the story.
Where EFE lives in active inference
Perception in active inference minimizes variational free energy over beliefs about hidden states. That quantity is an upper bound on surprise about observations, exceeding it by a KL divergence between the approximate and true posteriors (Parr, Pezzulo and Friston, 2022) Class E. Action selection is the same principle projected into the future: for each candidate policy, the agent computes the free energy it expects to encounter if it follows that policy, then applies a softmax to negative expected free energy across policies. The lower a policy's expected free energy, the higher its probability.
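The softmax step can be sketched in a few lines. This is a minimal illustration, not UNI's code: the function name `policy_posterior`, the parameter `gamma` (standing in for policy precision), and the example G values are all assumptions made for the example.

```python
import math

def policy_posterior(G, gamma=1.0):
    """Softmax over negative expected free energy: q(pi) is proportional to exp(-gamma * G(pi)).

    G is a list of expected free energies (one per policy, in nats).
    gamma is a hypothetical precision parameter; larger values make selection greedier.
    """
    scores = [-gamma * g for g in G]
    m = max(scores)  # subtract the max before exponentiating, for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative EFE values for three candidate policies (made-up numbers)
G = [1.2, 0.7, 2.5]
q = policy_posterior(G, gamma=2.0)
```

The policy with the lowest G (the second one here) receives the most probability mass, and raising `gamma` sharpens that preference.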
Expected free energy (EFE), often written G(pi) for a policy pi, is the quantity that lets a Bayesian agent trade off two things at once: staying close to what it prefers, and finding out what it does not yet know Class E.
The two terms
The standard decomposition breaks G(pi) into a pragmatic term and an epistemic term. Both are expectations under the agent's predictive model of what future observations and states will look like under the policy.
- Pragmatic value is the expected log probability, under the agent's prior preferences, of the observations the policy is predicted to produce. Concretely, the agent has prior preferences over outcomes, written as a distribution P(o | C), and the pragmatic term rewards policies whose predicted future observations place mass in that preferred region. This is how goals enter without a reward signal (Parr, Pezzulo and Friston, 2022, ch. 7) Class E. For deeper mechanics of how a preference prior shapes behavior, see the companion post, prior preferences and goal-directed behavior.
- Epistemic value is the expected information gain about hidden states under the policy. Formally, it is the expected KL divergence between the posterior over states after observing and the prior before observing. Policies that resolve ambiguity about the world get a bonus. The unpacking of these two components has its own pillar post, epistemic versus pragmatic value.
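Written out, the two bullets above correspond to one common form of the decomposition. The notation is an assumption made here for illustration: Q(o | pi) and Q(s | pi) are the predicted outcomes and states under the policy, Q(s | o, pi) is the posterior the agent would hold after seeing o, and the time index is suppressed.

```latex
G(\pi) =
  \underbrace{-\,\mathbb{E}_{Q(o \mid \pi)}\big[\ln P(o \mid C)\big]}_{\text{negative pragmatic value}}
  \;\underbrace{-\,\mathbb{E}_{Q(o \mid \pi)}\Big[ D_{\mathrm{KL}}\big[\, Q(s \mid o, \pi) \,\big\|\, Q(s \mid \pi) \,\big] \Big]}_{\text{negative epistemic value}}
```

Both values enter with a minus sign, so a policy lowers G(pi) either by making preferred outcomes more probable or by promising more information gain.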
The KL and posterior machinery come from the same variational inference apparatus that governs perception. If you have not seen how the divergence sits inside the free energy bound, the sister post on KL divergence and Bayesian inference in active inference walks through it slowly.
Prior preferences instead of rewards
The preference distribution P(o | C) is a soft target, not a scalar reward. It admits multi-modal goals, satisficing regions, and constraints, all in the same currency as inference: log probability, measured in nats Class E. In practice this means an active-inference agent can prefer a range of outcomes, weight some more strongly than others, and never confuse the value of information with the value of a state, which is what a reward function tends to do.
That is the difference from reinforcement learning value functions. A Q-value collapses the future into one number, expected discounted reward, and epistemic behavior only appears if you add exploration bonuses by hand. In active inference, exploration is not a bolt-on; it falls out of the same G(pi) that the pragmatic term does. Epistemic and pragmatic value are commensurate because both are measured in nats of the same generative model Class E.
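To make "both measured in nats" concrete, here is a minimal sketch for a single-step discrete model. Everything in it is an assumption for illustration: the likelihood matrix `A` (with `A[o][s]` standing for P(o | s)), the predicted state distribution `qs`, the preference vector `C`, and the function names. The preference vector is assumed strictly positive so its logarithm is defined.

```python
import math

def predicted_obs(A, qs):
    """Q(o|pi) = sum_s P(o|s) Q(s|pi), with A[o][s] = P(o|s)."""
    return [sum(A[o][s] * qs[s] for s in range(len(qs))) for o in range(len(A))]

def pragmatic_value(A, qs, C):
    """Expected log preference E_{Q(o|pi)}[ln P(o|C)], in nats (never positive)."""
    qo = predicted_obs(A, qs)
    return sum(p * math.log(c) for p, c in zip(qo, C) if p > 0)

def epistemic_value(A, qs):
    """Expected information gain E_{Q(o)}[KL(Q(s|o) || Q(s))], in nats (never negative)."""
    qo = predicted_obs(A, qs)
    gain = 0.0
    for o, p_o in enumerate(qo):
        if p_o == 0:
            continue
        for s, p_s in enumerate(qs):
            joint = A[o][s] * p_s
            if joint > 0:
                post = joint / p_o  # Bayes posterior Q(s|o)
                gain += p_o * post * math.log(post / p_s)
    return gain

def expected_free_energy(A, qs, C):
    """G = -(pragmatic value) - (epistemic value)."""
    return -pragmatic_value(A, qs, C) - epistemic_value(A, qs)

# Hypothetical two-state, two-outcome example
A = [[0.9, 0.1],
     [0.1, 0.9]]
qs = [0.5, 0.5]   # maximally uncertain about the hidden state
C = [0.8, 0.2]    # soft preference for outcome 0

prag = pragmatic_value(A, qs, C)
epi = epistemic_value(A, qs)
G = expected_free_energy(A, qs, C)
```

Because both terms are expectations of log probabilities under the same model, they can be added directly without a hand-tuned exchange rate between "reward" and "exploration bonus".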
A tiny policy-selection walkthrough
Consider an agent with two policies over the next two timesteps in a small POMDP. Policy A moves toward a well-known preferred location. Policy B detours through a room that would resolve ambiguity about which of two possible world configurations is real, then heads for the preferred location.
- Under Policy A, the pragmatic value is high (preferred observations are highly probable), and the epistemic value is small (the agent learns little).
- Under Policy B, the pragmatic value is somewhat lower (the detour delays arrival at preferred outcomes), but the epistemic value is larger (the ambiguous state gets disambiguated).
Whether the agent picks A or B is not a matter of taste. It is decided by which policy has the lower G(pi) under the agent's current generative model and precision. When the prior on preferences is sharp and the world is well-known, A wins. When ambiguity is high and preferences are diffuse, B wins. Policy precision (an inverse temperature on the softmax over policies) tunes how greedy this argmin is: high precision concentrates probability on the lowest-G policy, low precision spreads it across policies. This is the same knob that separates deliberate from compulsive behavior in the labs on this site. For a full step-by-step, see policy selection, a conceptual walkthrough.
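The flip between A and B can be reproduced with a toy calculation. This sketch collapses each two-step policy into a single predicted outcome distribution, which is a simplification; the likelihood matrices `A_DIRECT` and `A_DETOUR`, the belief vectors, and the preference vectors are made-up numbers chosen only to illustrate the two regimes.

```python
import math

def efe(A, qs, C):
    """G = -(pragmatic value) - (epistemic value), in nats. A[o][s] = P(o|s)."""
    n_obs, n_states = len(A), len(qs)
    qo = [sum(A[o][s] * qs[s] for s in range(n_states)) for o in range(n_obs)]
    pragmatic = sum(p * math.log(c) for p, c in zip(qo, C) if p > 0)
    epistemic = 0.0
    for o in range(n_obs):
        for s in range(n_states):
            joint = A[o][s] * qs[s]
            if joint > 0:
                # mutual information between outcome and world configuration
                epistemic += joint * math.log(joint / (qo[o] * qs[s]))
    return -pragmatic - epistemic

# Outcomes: 0 = at preferred location, 1 = cue for config 1, 2 = cue for config 2.
# Columns are the two possible world configurations.
A_DIRECT = [[0.90, 0.90],   # Policy A: likely at the goal, outcomes ignore the config
            [0.05, 0.05],
            [0.05, 0.05]]
A_DETOUR = [[0.50, 0.50],   # Policy B: less time at the goal, but cues reveal the config
            [0.45, 0.05],
            [0.05, 0.45]]

def compare(qs, C):
    """Return (G for Policy A, G for Policy B)."""
    return efe(A_DIRECT, qs, C), efe(A_DETOUR, qs, C)

# Regime 1: sharp preferences, world configuration nearly known
sharp_known = compare(qs=[0.99, 0.01], C=[0.98, 0.01, 0.01])
# Regime 2: diffuse preferences, world configuration ambiguous
diffuse_ambiguous = compare(qs=[0.5, 0.5], C=[0.34, 0.33, 0.33])
```

With these numbers, Policy A has the lower G in the first regime and Policy B has the lower G in the second, matching the qualitative pattern described above.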
How UNI wires the action-selection loop
Inside the Precision Lab and its siblings, the loop reads compactly Class B: on each tick, the agent (a) updates its posterior over hidden states from the current observation, (b) rolls the generative model forward under each policy for a small horizon, (c) computes G(pi) as pragmatic plus epistemic components from those rollouts, (d) applies a softmax scaled by policy precision (an inverse temperature) to get a policy posterior, and (e) samples the next action. Nothing in that loop is a reward. The dials you can move in the browser (sensory precision, transition precision, policy temperature) enter at the inference and softmax steps, not as external rewards.
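Steps (a) through (e) can be sketched as a single function. This is not UNI's implementation: the names `tick`, `A`, `B`, `C`, and `gamma`, the two-state model, and the two candidate policies are all hypothetical, and the sensory and transition precision dials are not modelled here.

```python
import math
import random

def normalize(v):
    total = sum(v)
    return [x / total for x in v]

def update_posterior(A, prior, obs):
    """(a) Bayes rule: Q(s) is proportional to P(obs|s) * prior(s); A[o][s] = P(o|s)."""
    return normalize([A[obs][s] * prior[s] for s in range(len(prior))])

def step_belief(B_a, qs):
    """Push beliefs through one transition; B_a[s_next][s] = P(s_next | s, action)."""
    return [sum(B_a[t][s] * qs[s] for s in range(len(qs))) for t in range(len(B_a))]

def efe_one_step(A, qs, C):
    """Negative pragmatic value minus epistemic value for one predicted step, in nats."""
    qo = [sum(A[o][s] * qs[s] for s in range(len(qs))) for o in range(len(A))]
    pragmatic = sum(p * math.log(c) for p, c in zip(qo, C) if p > 0)
    epistemic = 0.0
    for o, p_o in enumerate(qo):
        for s, p_s in enumerate(qs):
            joint = A[o][s] * p_s
            if joint > 0:
                epistemic += joint * math.log(joint / (p_o * p_s))
    return -pragmatic - epistemic

def tick(A, B, C, prior, obs, policies, gamma, rng):
    qs = update_posterior(A, prior, obs)                  # (a) perceive
    G = []
    for policy in policies:                               # (b) roll forward
        belief, total = qs, 0.0
        for action in policy:
            belief = step_belief(B[action], belief)
            total += efe_one_step(A, belief, C)           # (c) accumulate G(pi)
        G.append(total)
    m = min(G)                                            # shift for numerical stability
    q_pi = normalize([math.exp(-gamma * (g - m)) for g in G])     # (d) softmax
    chosen = rng.choices(range(len(policies)), weights=q_pi)[0]   # (e) sample
    return policies[chosen][0], q_pi, qs

# Hypothetical model: two states, two outcomes, actions 0 = stay, 1 = switch
A = [[0.9, 0.1], [0.1, 0.9]]
B = {0: [[1.0, 0.0], [0.0, 1.0]], 1: [[0.0, 1.0], [1.0, 0.0]]}
C = [0.9, 0.1]                      # soft preference for outcome 0
policies = [(0, 0), (1, 0)]         # stay-stay versus switch-stay
action, q_pi, qs = tick(A, B, C, prior=[0.5, 0.5], obs=1,
                        policies=policies, gamma=4.0, rng=random.Random(0))
```

In this toy run the agent has just seen the non-preferred outcome, so the switch-then-stay policy carries the lower G and most of the posterior mass. No reward appears anywhere: the only inputs are the model, the preferences, and the precision.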
The action loop is discrete-time and horizon-limited on purpose: it is a concrete instantiation of the Parr, Pezzulo and Friston (2022) POMDP formulation, close enough that a reader who has worked through their chapters can point at each variable in the code Class B, Class E. What UNI adds on top is the benchmark, which asks whether this loop actually holds under adversarial disturbance families, not just in a toy maze. That is where the Stratified Palimpsest benchmark lives.
Falsifier
What this post does not claim
UNI is a working hypothesis on an attainable path toward General Natural Intelligence: a natural, active-inference approach whose evidence is growing, evidence-classed, and tested in the open. Do not take the claim on faith. Test the build, inspect the gates, and help us find where it fails. Nothing here is medical advice or a diagnosis. The Parr, Pezzulo and Friston mapping is a citation, not a certification, and any equation-level correspondence between UNI internals and that textbook is documented where it exists and left as open work where it does not.