Reinforcement learning tells the agent to chase reward. Active inference tells the agent to minimize expected free energy. The two objectives sit next to each other on the page and look like siblings, yet they optimize different things. This post walks the seams carefully.
What reward-based value captures
In standard reinforcement learning the agent selects actions to maximize the expected sum of scalar rewards over some horizon. Value functions estimate that expected return. The reward signal is exogenous to the agent: something outside it (a designer, an environment) supplies the signal. Exploration is bolted on with heuristics like epsilon-greedy, entropy bonuses, or intrinsic-motivation auxiliaries. (Class E, textbook citation)
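A minimal sketch of the two pieces named above: a return computed from a reward sequence, and epsilon-greedy selection over action-value estimates. The discount factor `gamma`, the function names, and the example numbers are assumptions for illustration; they are not taken from any UNI code.

```python
import random

def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma**t * rewards[t]; gamma is an assumed discount factor."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))          # 1.75
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng))  # 1
```

Note that exploration lives entirely in `epsilon_greedy`. Nothing in `discounted_return` asks for it.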
What expected free energy captures
Expected free energy (EFE) is a single objective with two additive terms. One term penalizes the divergence between predicted outcomes under a policy and the agent's prior preferences over outcomes. The other term rewards information gain about hidden states. Parr, Pezzulo and Friston (2022) present the decomposition in terms of pragmatic value (how well predicted outcomes align with preferences) and epistemic value (how much a policy is expected to reduce posterior uncertainty over states). Both enter EFE with a negative sign, so the policy that minimizes EFE is the one that maximizes pragmatic plus epistemic value. (Class E, Parr, Pezzulo, Friston 2022)
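The decomposition can be sketched for a one-step, discrete model. The code below assumes a likelihood table `A[o][s] = P(o | s)`, a predicted state distribution `qs` under one policy, and strictly positive preferences `C`. It takes pragmatic value as expected log preference and epistemic value as the mutual information between states and outcomes. The function name and the toy tables are hypothetical, and this is one common textbook form rather than the UNI implementation.

```python
import math

def expected_free_energy(A, qs, C):
    """One-step EFE in nats. Returns (G, pragmatic_value, epistemic_value).

    A[o][s] = P(o | s), qs[s] = Q(s | policy), C[o] = P(o | C) with C[o] > 0.
    """
    n_obs, n_states = len(A), len(qs)
    # Predicted outcomes under the policy: Q(o) = sum_s P(o | s) Q(s)
    qo = [sum(A[o][s] * qs[s] for s in range(n_states)) for o in range(n_obs)]
    # Pragmatic value: expected log preference of predicted outcomes
    pragmatic = sum(qo[o] * math.log(C[o]) for o in range(n_obs) if qo[o] > 0)
    # Epistemic value: expected information gain about states (mutual information)
    epistemic = 0.0
    for o in range(n_obs):
        for s in range(n_states):
            joint = A[o][s] * qs[s]
            if joint > 0:
                epistemic += joint * math.log(A[o][s] / qo[o])
    return -pragmatic - epistemic, pragmatic, epistemic

qs = [0.5, 0.5]                      # uncertain about the hidden state
C = [0.5, 0.5]                       # no outcome preferred over the other
A_clear = [[1.0, 0.0], [0.0, 1.0]]   # outcomes reveal the state
A_blurry = [[0.5, 0.5], [0.5, 0.5]]  # outcomes reveal nothing

G_clear, _, info_clear = expected_free_energy(A_clear, qs, C)
G_blurry, _, info_blurry = expected_free_energy(A_blurry, qs, C)
print(G_clear < G_blurry)  # True
```

With flat preferences the pragmatic term is identical in both cases, so the lower EFE of the informative observation comes entirely from the epistemic term.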
In active inference the generative model already carries the agent's preferences as a prior distribution over outcomes, written P(o | C). There is no external reward channel. Preference and belief share a language. (Class C, model inspection)
Where they overlap
When epistemic uncertainty is small and prior preferences are peaked on a small outcome set, EFE reduces to a reward-shaped objective. The pragmatic term dominates, and if the log-preference ln P(o | C) is read as a reward, minimizing the KL divergence between predicted outcomes and preferences comes close to maximizing expected reward. In this regime an EFE agent and a well-tuned reinforcement learner will often select the same policy. (Class E, review literature; Class C, code inspection of UNI Precision Lab)
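A small sketch of that overlap regime, under two stated assumptions: each policy predicts a single outcome with certainty (so there is nothing left to learn and the epistemic term is zero), and the reward is defined as the log preference. The policy names and numbers are hypothetical.

```python
import math

def risk(qo, C):
    """KL[Q(o | policy) || P(o | C)] in nats."""
    return sum(q * math.log(q / c) for q, c in zip(qo, C) if q > 0)

def expected_reward(qo, reward):
    """Expected scalar reward under the predicted outcome distribution."""
    return sum(q * r for q, r in zip(qo, reward))

C = [0.90, 0.05, 0.05]             # preferences peaked on outcome 0
reward = [math.log(c) for c in C]  # assumption: reward is the log preference

# Each hypothetical policy predicts one outcome with certainty (one-hot Q(o)).
policies = {
    "left":  [0.0, 1.0, 0.0],
    "right": [1.0, 0.0, 0.0],
    "stay":  [0.0, 0.0, 1.0],
}

efe_choice = min(policies, key=lambda p: risk(policies[p], C))
rl_choice = max(policies, key=lambda p: expected_reward(policies[p], reward))
print(efe_choice, rl_choice)  # right right
```

For one-hot predictions the entropy of Q(o) is zero, so the KL term equals the negative expected reward exactly. Once predictions are spread over several outcomes, that entropy term reappears and the two rankings can drift apart.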
Where they diverge
Three seams matter.
- Exploration is native, not added. EFE's epistemic term is a first-class part of the objective. A reinforcement learner needs to be told to explore. An EFE agent explores because information gain is priced into the same objective that selects the policy. (Class E, Parr et al. 2022)
- Preferences are distributions, not scalars. Prior preferences P(o | C) can express "I want to end near these outcomes with this shape of uncertainty." Reward is a scalar per step. Distributions carry structure that scalars flatten. (Class C, model inspection)
- The units are nats, not utils. EFE is a variational free energy quantity, measured in nats. It is not a thermodynamic quantity, and it is not commensurate with reward utility. Mixing the two invites category errors. (Class E, Parr et al. 2022, chapter 2)
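The second and third seams can be made concrete with a toy calculation. The sketch below scores one predicted outcome distribution against two preference distributions that share the same most-preferred outcome but differ in spread. A reward that only marks the preferred outcome would treat the two alike; the KL divergence does not. All numbers and names are illustrative assumptions.

```python
import math

def kl_nats(q, p):
    """KL[q || p] using the natural log, so the result is in nats."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def nats_to_bits(x):
    """Convert an information quantity from nats to bits."""
    return x / math.log(2)

predicted = [0.6, 0.3, 0.1]      # Q(o | policy)
sharp_pref = [0.90, 0.05, 0.05]  # "outcome 0, and little else"
broad_pref = [0.50, 0.30, 0.20]  # "outcome 0, but the others are tolerable"

risk_sharp = kl_nats(predicted, sharp_pref)
risk_broad = kl_nats(predicted, broad_pref)
print(risk_sharp > risk_broad)  # True: the same prediction is penalized more
print(nats_to_bits(risk_sharp))  # same quantity expressed in bits
```

The result is an information quantity, and converting nats to bits only changes the log base. Nothing in the calculation turns it into a utility.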
Why the distinction is practical
In the UNI labs the same maze looks different under the two objectives. A reward-shaped agent that finds a corridor to the goal early tends to stop probing. An EFE agent with the same generative model keeps probing corridors whose hidden-state uncertainty is still high, then commits when posterior entropy has dropped. On the Cell Lab benchmark the epistemic term is what lets UNI hold RecoveryScore up under disturbance families it has never seen in training, because the agent updates beliefs while it acts. (Class C, code inspection; benchmark data on the Science page)
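The "probe, then commit" pattern can be sketched as a Bayesian belief update with an entropy check. This is a toy under loud assumptions: two hypothetical hidden states (corridor open or blocked), a made-up cue likelihood, the same cue observed at every step, and an arbitrary commit threshold. It is not the UNI maze code.

```python
import math

def bayes_update(prior, likelihood):
    """Posterior over hidden states; likelihood[s] = P(observed cue | s)."""
    unnorm = [p * l for p, l in zip(prior, likelihood)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def entropy_nats(dist):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(p * math.log(p) for p in dist if p > 0)

belief = [0.5, 0.5]       # [corridor open, corridor blocked]
open_cue = [0.8, 0.2]     # assumed P(cue looks open | state)
COMMIT_THRESHOLD = 0.3    # nats; assumed value

steps = 0
while entropy_nats(belief) > COMMIT_THRESHOLD:
    belief = bayes_update(belief, open_cue)  # keep probing
    steps += 1

print(steps)  # 2
```

The agent in this sketch stops probing only when its posterior entropy falls below the threshold, regardless of whether a rewarding path was already known.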
What this does not prove
EFE is not a universal winner. On memory-leak and cpu-noisy-neighbor families the neural baseline still edges UNI. On database-flaky the rule-based controller does. The point is not that active inference beats reinforcement learning. The point is that the two objectives optimize different quantities, and the choice of objective shapes what the agent will do when the world stops matching its training distribution. (Class E, benchmark results)
Themesis has written on the lineage of scaling shifts in deep learning, transformers, and SeedIQ: "Deep Learning, Transformers, and SeedIQ, Three Industry Breakthroughs". Our reading, in our own voice: the pattern of successive scaling regimes sets the terrain UNI is building on, without claiming UNI has arrived at any of them. UNI is a working hypothesis on an attainable path toward General Natural Intelligence: a natural, active-inference approach whose evidence is growing, evidence-classed, and tested in the open. Do not take the claim on faith. Test the build, inspect the gates, and help us find where it fails.