Expected Free Energy and Goal-Directed Action, Universal Natural Intelligence

Active inference does not choose what to do by summing rewards. It chooses by minimizing the expected free energy of a policy. That single quantity carries both the pull of a goal and the pull of information, and once you see the decomposition, a lot of goal-directed behavior stops looking mysterious.

This post is the pillar entry for how policies get selected in the framework UNI sits inside. It covers the definition conceptually, walks a tiny example, notes how the UNI action loop wires the same math in code, and closes with a falsifier: the behavioral pattern that would break the story.

Where EFE lives in active inference

Perception in active inference minimizes variational free energy over beliefs about hidden states. That free energy is an upper bound on surprise about observations, and the gap between the two is a KL divergence between the approximate and the true posterior (Parr, Pezzulo and Friston, 2022) Class E. Action selection is the same principle projected into the future: for each candidate policy, the agent computes the free energy it expects to encounter if it follows that policy, then applies a softmax over negative expected free energy to get a probability for each policy. The lower a policy's expected free energy, the higher its probability.
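As a minimal sketch of that softmax step, assuming expected free energies are already available as a list of numbers in nats: the function name policy_posterior, the gamma parameter, and the three G values below are illustrative, not taken from the UNI code.

```python
import math

def policy_posterior(G, gamma=1.0):
    """Softmax over negative expected free energy.

    G     : expected free energy per policy, in nats (hypothetical inputs)
    gamma : policy precision, used here as an inverse temperature (assumed default)
    """
    logits = [-gamma * g for g in G]
    # Subtracting the largest logit avoids overflow and leaves the result unchanged.
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Three hypothetical policies; the second has the lowest G and so the highest probability.
q_pi = policy_posterior([2.0, 1.0, 3.0])
```

Raising gamma concentrates probability on the lowest-G policy, and lowering it flattens the distribution.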

Expected free energy (EFE), often written G(pi) for a policy pi, is the quantity that lets a Bayesian agent trade off two things at once: staying close to what it prefers, and finding out what it does not yet know Class E.

The two terms

The standard decomposition breaks G(pi) into a pragmatic term and an epistemic term. Both are expectations under the agent's predictive model of what future observations and states will look like under the policy.

Pragmatic value is the expected log evidence of preferred observations under the policy. Concretely, the agent has prior preferences over outcomes, written as a distribution P(o | C), and the pragmatic term rewards policies whose predicted future observations place mass in that preferred region. This is how goals enter without a reward signal (Parr, Pezzulo and Friston, 2022, ch. 7) Class E. For deeper mechanics of how a preference prior shapes behavior, see the companion post, prior preferences and goal-directed behavior.
Epistemic value is the expected information gain about hidden states under the policy. Formally, it is the expected KL divergence between the posterior over states after observing and the prior before observing. Policies that resolve ambiguity about the world get a bonus. The unpacking of these two components has its own pillar post, epistemic versus pragmatic value.
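
One common way to write the two terms together is shown here as a sketch in the notation of this post. Q denotes the agent's predictive beliefs under the policy, and the per-timestep sum is omitted for brevity; check signs and indices against Parr, Pezzulo and Friston (2022) before relying on it.

```latex
G(\pi) =
  - \underbrace{\mathbb{E}_{Q(o \mid \pi)}\Big[ D_{\mathrm{KL}}\big[\, Q(s \mid o, \pi) \,\|\, Q(s \mid \pi) \,\big] \Big]}_{\text{epistemic value}}
  \; - \; \underbrace{\mathbb{E}_{Q(o \mid \pi)}\big[ \ln P(o \mid C) \big]}_{\text{pragmatic value}}
```

Both values enter with a minus sign, so a policy that gains more information or lands on more preferred observations has a lower G(pi).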

The KL and posterior machinery come from the same variational inference apparatus that governs perception. If you have not seen how the divergence sits inside the free energy bound, the sister post on KL divergence and Bayesian inference in active inference walks through it slowly.

Prior preferences instead of rewards

The preference distribution P(o | C) is a soft target, not a scalar reward. It admits multi-modal goals, satisficing regions, and constraints, all in the same currency as inference: log probability, measured in nats Class E. In practice this means an active-inference agent can prefer a range of outcomes, weight some more strongly than others, and never confuse the value of information with the value of a state, which is what a reward function tends to do.
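
A small sketch of what a soft, multi-modal target can look like, assuming preferences are given as unnormalized log weights over four discrete outcomes. The helper preference_prior and the numbers are hypothetical.

```python
import math

def preference_prior(log_prefs):
    """Normalize unnormalized log preferences and return log P(o | C) in nats."""
    m = max(log_prefs)
    log_z = m + math.log(sum(math.exp(c - m) for c in log_prefs))
    return [c - log_z for c in log_prefs]

# Hypothetical outcomes 0..3: outcomes 0 and 3 are both acceptable goals (two modes),
# outcome 3 is weighted more strongly, and outcomes 1 and 2 are equally dispreferred.
log_C = preference_prior([2.0, 0.0, 0.0, 3.0])
P_C = [math.exp(c) for c in log_C]
```

Nothing here is a scalar reward: the agent is handed a distribution, and how strongly it cares about each outcome is read off in nats.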

That is the difference from reinforcement-learning value functions. A Q-value collapses the future into one number, expected discounted reward, and epistemic behavior only appears if you add exploration bonuses by hand. In active inference, exploration is not a bolt-on; it falls out of the same G(pi) that carries the pragmatic term. Epistemic and pragmatic value are commensurate because both are measured in nats of the same generative model Class E.

A tiny policy-selection walkthrough

Consider an agent with two policies over the next two timesteps in a small POMDP. Policy A moves toward a well-known preferred location. Policy B detours through a room that would resolve ambiguity about which of two possible world configurations is real, then heads for the preferred location.

Under Policy A, pragmatic value is high (preferred observations are highly probable) and epistemic value is low (the agent learns little).
Under Policy B, pragmatic value is somewhat lower (the detour takes time away from preferred outcomes), but epistemic value is higher (the ambiguous state gets disambiguated).

Whether the agent picks A or B is not a matter of taste. It is decided by which policy has the lower G(pi) under the agent's current generative model and precision. When the prior on preferences is sharp and the world is well-known, A wins. When ambiguity is high and preferences are diffuse, B wins. Policy precision (an inverse temperature on the softmax over policies) tunes how greedy this selection is: high precision approaches a hard argmin, low precision spreads probability across policies. It is the same knob that separates deliberate from compulsive behavior in the labs on this site. For a full step-by-step, see policy selection, a conceptual walkthrough.
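
To put numbers on the two regimes, here is a sketch under stated assumptions: two hidden world configurations, three outcomes (at the goal, cue pointing to configuration 0, cue pointing to configuration 1), a cue room that is 90% reliable, and a goal location that reveals nothing about the configuration. All matrices, names, and values are invented for illustration. As a simplification, the belief at the second step is not updated after the cue, which does not change the result here because the goal step carries no information.

```python
import math

def log_softmax(xs):
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0.0)

def step_G(qs, A, log_C):
    """One-step expected free energy: -(pragmatic value) - (epistemic value)."""
    n_s, n_o = len(qs), len(A)
    qo = [sum(A[o][s] * qs[s] for s in range(n_s)) for o in range(n_o)]
    pragmatic = sum(qo[o] * log_C[o] for o in range(n_o))
    # Expected information gain about s is the mutual information between s and o:
    # H[Q(o)] minus the expected entropy of P(o | s).
    ambiguity = sum(qs[s] * entropy([A[o][s] for o in range(n_o)]) for s in range(n_s))
    epistemic = entropy(qo) - ambiguity
    return -pragmatic - epistemic

# A[o][s] = P(o | s). Outcomes: 0 = at goal, 1 = cue says config 0, 2 = cue says config 1.
A_GOAL = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]]  # goal location is uninformative
A_CUE = [[0.0, 0.0], [0.9, 0.1], [0.1, 0.9]]   # cue room is 90% reliable

def compare(qs, goal_log_pref):
    """Return (G_A, G_B) for belief qs and a given strength of preference for the goal."""
    log_C = log_softmax([goal_log_pref, 0.0, 0.0])
    G_A = step_G(qs, A_GOAL, log_C) + step_G(qs, A_GOAL, log_C)  # goal, then goal
    G_B = step_G(qs, A_CUE, log_C) + step_G(qs, A_GOAL, log_C)   # cue room, then goal
    return G_A, G_B

sharp_known = compare([0.99, 0.01], 3.0)       # sharp preferences, world nearly known
diffuse_ambiguous = compare([0.5, 0.5], 0.2)   # diffuse preferences, world ambiguous
```

With these made-up numbers, Policy A has the lower G in the first regime and Policy B has the lower G in the second, which is the pattern described above.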

For readers of the resource map by Themesis, the point of contact is direct. Themesis lists SolutionWright as one of five pathways into active inference in Where to Start with Active Inference, A Resource Map for 2026. In our voice: an external map that names SWU among several pathways, which we treat as a fact of listing, not an endorsement. If you are here from that map, this post is the family's canonical entry for EFE, and the labs let you move the dials on the same math yourself.

How UNI wires the action-selection loop

Inside the Precision Lab and its siblings, the loop reads compactly Class B: on each tick, the agent (a) updates its posterior over hidden states from the current observation, (b) rolls the generative model forward under each policy for a small horizon, (c) computes G(pi) from the pragmatic and epistemic components of those rollouts, (d) applies a softmax over negative G(pi), sharpened by policy precision (an inverse temperature), to get a policy posterior, and (e) samples the next action. Nothing in that loop is a reward. The dials you can move in the browser (sensory precision, transition precision, policy temperature) enter at the inference and softmax steps, not as external rewards.
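
Steps (a) through (e) can be sketched for a small discrete model as follows. This is not the UNI source: the function names, the A[o][s] and B[a][s'][s] layouts, and the toy two-state model are assumptions made for illustration, and the sensory and transition precision dials are left out.

```python
import math
import random

def normalize(v):
    total = sum(v)
    return [x / total for x in v]

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0.0)

def update_posterior(prior, A, obs):
    # (a) Bayes rule on the current observation: Q(s) is proportional to P(o | s) * prior(s).
    return normalize([A[obs][s] * prior[s] for s in range(len(prior))])

def predict(qs, B_a):
    # (b) One-step rollout: Q(s') = sum over s of P(s' | s, a) * Q(s).
    n = len(qs)
    return [sum(B_a[s2][s] * qs[s] for s in range(n)) for s2 in range(n)]

def step_G(qs, A, log_C):
    # (c) One-step expected free energy: -(pragmatic value) - (epistemic value).
    n_s, n_o = len(qs), len(A)
    qo = [sum(A[o][s] * qs[s] for s in range(n_s)) for o in range(n_o)]
    pragmatic = sum(qo[o] * log_C[o] for o in range(n_o))
    ambiguity = sum(qs[s] * entropy([A[o][s] for o in range(n_o)]) for s in range(n_s))
    return -pragmatic - (entropy(qo) - ambiguity)

def tick(prior, obs, A, B, log_C, policies, gamma, rng):
    qs = update_posterior(prior, A, obs)
    G = []
    for policy in policies:
        q, g = qs, 0.0
        for action in policy:
            q = predict(q, B[action])
            g += step_G(q, A, log_C)
        G.append(g)
    # (d) Softmax over -gamma * G, where gamma is the policy precision.
    m = max(-gamma * g for g in G)
    q_pi = normalize([math.exp(-gamma * g - m) for g in G])
    # (e) Sample a policy and return its first action.
    chosen = rng.choices(range(len(policies)), weights=q_pi)[0]
    return policies[chosen][0], qs, G, q_pi

# Toy model (hypothetical): two states, action 0 stays, action 1 switches state.
A = [[0.9, 0.1], [0.1, 0.9]]
B = {0: [[1.0, 0.0], [0.0, 1.0]], 1: [[0.0, 1.0], [1.0, 0.0]]}
log_C = [math.log(0.1), math.log(0.9)]  # the agent prefers observation 1
policies = [(0, 0), (0, 1), (1, 0), (1, 1)]
action, qs, G, q_pi = tick([0.5, 0.5], 0, A, B, log_C, policies,
                           gamma=4.0, rng=random.Random(0))
```

In this toy run the agent observes outcome 0, infers it is probably in state 0, and puts most of its policy probability on switching and then staying, because that rollout spends both steps where the preferred observation is likely. No reward variable appears anywhere in the loop.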

The action loop is discrete-time and horizon-limited on purpose: it is a concrete instantiation of the Parr, Pezzulo and Friston (2022) POMDP formulation, close enough that a reader who has worked through their chapters can point at each variable in the code Class B, Class E. What UNI adds on top is the benchmark, which asks whether this loop actually holds under adversarial disturbance families, not just in a toy maze. That is where the Stratified Palimpsest benchmark lives.

Video companion. Themesis has a two-part talk, Deep Learning Did It. Transformers Did It. Active Inference Just Did It Again (Part 1). In our voice: it is a general explainer for why the active-inference research program keeps showing up, useful background for anyone new to the vocabulary this post uses. We link it as a resource, not as a claim about our own work.

Falsifier

What would break this story Class F. If UNI agents in the labs consistently select policies whose G(pi) is higher than that of an unselected alternative, holding the generative model and precisions fixed, the EFE-minimization account fails. Concretely: run the same disturbance families in the Cell Lab with the pragmatic and epistemic terms logged per candidate policy per tick. If the sampled policy misses the argmin of G(pi) more often than the softmax temperature predicts, and retuning precision does not close the gap, then either the model is not the one the agent is actually using, or EFE is not what is driving action here. That would not save reinforcement learning, but it would sink the specific claim this pillar rests on. We publish the logs; readers can check.
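
One way a reader might run that check, sketched under assumptions: the log format (one tuple of per-policy G values and the chosen policy index per tick) is hypothetical and not the published log schema, ticks are treated as independent, and the z-score uses a normal approximation.

```python
import math

def softmax_neg(G, gamma):
    m = max(-gamma * g for g in G)
    w = [math.exp(-gamma * g - m) for g in G]
    total = sum(w)
    return [x / total for x in w]

def argmin_gap(log, gamma):
    """Compare how often the argmin policy was chosen with what the softmax predicts.

    log   : list of (G_per_policy, chosen_index) tuples, one per tick (assumed format)
    gamma : the policy precision the agent was run with
    Returns (observed_rate, predicted_rate, z).
    """
    hits, expected, variance = 0, 0.0, 0.0
    for G, chosen in log:
        best = min(range(len(G)), key=G.__getitem__)
        p_best = softmax_neg(G, gamma)[best]
        hits += int(chosen == best)
        expected += p_best
        variance += p_best * (1.0 - p_best)
    z = (hits - expected) / math.sqrt(variance) if variance > 0.0 else 0.0
    return hits / len(log), expected / len(log), z

# Synthetic logs for illustration only: one agent always picks the argmin,
# the other always picks the higher-G policy.
consistent = argmin_gap([([1.0, 2.0], 0)] * 50, gamma=1.0)
violating = argmin_gap([([1.0, 2.0], 1)] * 50, gamma=1.0)
```

A strongly negative z means the argmin was chosen less often than the softmax predicts. If that persists after refitting gamma, it is the pattern the falsifier describes.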

What this post does not claim

UNI is a working hypothesis on an attainable path toward General Natural Intelligence: a natural, active-inference approach whose evidence is growing, evidence-classed, and tested in the open. Do not take the claim on faith. Test the build, inspect the gates, and help us find where it fails. Nothing here is medical advice or a diagnosis. The Parr, Pezzulo and Friston mapping is a citation, not a certification, and any equation-level correspondence between UNI internals and that textbook is documented where it exists and left as open work where it does not.

Keep going

Policy selection, a conceptual walkthrough

Step through G(pi) on a tiny POMDP, with numbers you can follow.

Epistemic versus pragmatic value

Why an active-inference agent explores without a bonus term bolted on.

Prior preferences and goal-directed behavior

How P(o | C) replaces a reward function, and what that buys you.

Cell Lab, the falsification benchmark

Run the EFE-minimization claim against random, rule-based, and neural baselines.

References. Parr, T., Pezzulo, G., and Friston, K. (2022). Active Inference: The Free Energy Principle in Mind, Brain, and Behavior. MIT Press. Chapter 7 covers policy selection and expected free energy. Namjoshi (2026), UNI preprint, DOI 10.5281/zenodo.19785799, is the unrefereed technical companion. Themesis resource map at themesis.com.