Cluster: KL Divergence and Bayesian Inference

Variational Inference: A Conceptual Walkthrough

By Michael Polzin. Published 2026-07-01. Evidence classes present in this post: E (expert citation), C (configuration and integration in our workbench).

Exact Bayesian inference is often intractable. Variational inference makes approximate inference tractable: pick a simpler family of distributions, then adjust a member of that family until it is as close as possible to the posterior you cannot compute directly. This post walks through the idea end to end.

The problem: the posterior is out of reach

In active inference, an organism carries a generative model of hidden states s and observations o (Class E, after Parr, Pezzulo and Friston, 2022). Given an observation, the ideal thing to compute is the posterior P(s given o), which is the joint P(o, s) divided by the marginal likelihood P(o). The marginal requires summing (or integrating) over every possible hidden state. For most realistic models, the number of terms in that sum grows combinatorially with the number of hidden variables, and the integral has no closed form.
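As a minimal sketch of what exact inference involves, the snippet below computes a posterior by brute-force enumeration. The two-state model, its probabilities, and the names PRIOR, LIKELIHOOD, exact_posterior and num_joint_states are all hypothetical and chosen only for illustration; they are not taken from our workbench.

```python
# Hypothetical toy model: one hidden variable with two states, one observation.
PRIOR = {"rain": 0.3, "dry": 0.7}                        # P(s)
LIKELIHOOD = {                                           # P(o | s)
    "rain": {"wet": 0.9, "not_wet": 0.1},
    "dry": {"wet": 0.2, "not_wet": 0.8},
}


def exact_posterior(o):
    """Return (P(s | o), P(o)) by summing the joint over every hidden state."""
    joint = {s: PRIOR[s] * LIKELIHOOD[s][o] for s in PRIOR}   # P(o, s)
    evidence = sum(joint.values())                             # P(o)
    posterior = {s: p / evidence for s, p in joint.items()}
    return posterior, evidence


def num_joint_states(n_factors, levels_per_factor=2):
    """Number of terms in the marginal when hidden factors are enumerated jointly."""
    return levels_per_factor ** n_factors


posterior, evidence = exact_posterior("wet")
print(posterior, evidence)
print(num_joint_states(30))  # terms needed for 30 binary hidden factors
```

With two states the sum is trivial. The second function shows why the same approach stops scaling: every added binary factor doubles the number of terms in P(o).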

Variational inference sidesteps the marginal by never computing it directly. Instead it replaces the true posterior with a chosen approximate distribution Q(s), and turns inference into optimization.

The approximation and its scorecard

The scorecard for how good Q(s) is at standing in for P(s given o) is the KL divergence, KL(Q(s) parallel P(s given o)). Lower is better. Zero means Q equals the true posterior. Because the posterior itself is what we cannot compute, the KL divergence cannot be evaluated directly either. The variational move is to relate it to a quantity we can evaluate.

log P(o) = ELBO(Q) + KL(Q(s) parallel P(s given o))

The log-evidence log P(o) is a constant with respect to Q. The right-hand side splits into the ELBO and a KL term that is never negative, so the ELBO can never exceed log P(o) (Class E, standard variational identity). Since the sum is fixed, maximizing the evidence lower bound (ELBO) with respect to Q is exactly the same operation as minimizing the KL divergence to the true posterior. The KL term needs the intractable posterior; the ELBO needs only Q and the joint P(o, s), both of which we can evaluate. That is the entire trick.
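The identity can be checked numerically on a model small enough that the posterior is still computable. The sketch below assumes a hypothetical two-state joint (the JOINT values and the function names are illustrative only) and confirms that the ELBO and the KL term add up to log P(o) for several choices of Q.

```python
import math

# Hypothetical joint P(o, s) for one fixed observation o; values are illustrative.
JOINT = {"rain": 0.27, "dry": 0.14}


def log_evidence(joint):
    """log P(o), available here only because the model is tiny."""
    return math.log(sum(joint.values()))


def elbo(q, joint):
    """E_Q[log P(o, s) - log Q(s)], skipping states where Q is zero."""
    return sum(qs * (math.log(joint[s]) - math.log(qs))
               for s, qs in q.items() if qs > 0)


def kl_to_posterior(q, joint):
    """KL(Q(s) || P(s | o)); needs the evidence, so it is intractable in general."""
    evidence = sum(joint.values())
    return sum(qs * (math.log(qs) - math.log(joint[s] / evidence))
               for s, qs in q.items() if qs > 0)


for p_rain in (0.1, 0.5, 0.9):
    q = {"rain": p_rain, "dry": 1.0 - p_rain}
    gap = log_evidence(JOINT) - (elbo(q, JOINT) + kl_to_posterior(q, JOINT))
    print(f"Q(rain)={p_rain:.1f}  ELBO={elbo(q, JOINT):+.4f}  "
          f"KL={kl_to_posterior(q, JOINT):.4f}  gap={gap:+.1e}")
```

The gap column should be zero up to floating-point error for every Q, while the ELBO and KL columns trade off against each other.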

Free energy: the same object, flipped

Variational free energy F is defined as the negative of the ELBO:

F(Q) = E_Q[log Q(s) - log P(s, o)] = KL(Q(s) parallel P(s given o)) - log P(o)
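The step from the joint form of F to the posterior form can be spelled out as a short derivation. It writes F as the expectation under Q of log Q(s) minus log P(s, o), then uses only the product rule P(s, o) = P(s given o) P(o) and the fact that log P(o) does not depend on s. In the notation below, \| stands for "parallel" and \mid for "given".

```latex
\begin{aligned}
F(Q) &= \mathbb{E}_{Q(s)}\big[\log Q(s) - \log P(s, o)\big] \\
     &= \mathbb{E}_{Q(s)}\big[\log Q(s) - \log P(s \mid o)\big] - \log P(o) \\
     &= \mathrm{KL}\big(Q(s) \,\|\, P(s \mid o)\big) - \log P(o) \\
     &\ge -\log P(o)
\end{aligned}
```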

Minimizing F with respect to Q is therefore approximate Bayesian inference. F is also an upper bound on surprise, minus log P(o): because the KL term is never negative, F is at least as large as minus log P(o). Active inference generalizes this by letting policies (sequences of actions) also be selected to minimize expected free energy over the future (Class E, after Parr, Pezzulo and Friston, 2022, chapters 2 and 4). Perception adjusts Q. Action adjusts what the organism observes next. Both serve the same objective, applied in different directions.
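To make "inference as optimization" concrete, the sketch below minimizes F by plain gradient descent on a hypothetical two-state joint. The softmax parameterization, the step size, the step count, and all names are assumptions made for illustration. This is not how our Elixir workbench performs its updates.

```python
import math

# Hypothetical joint P(o, s) for one fixed observation o; values are illustrative.
JOINT = {"rain": 0.27, "dry": 0.14}


def softmax(theta):
    z = sum(math.exp(t) for t in theta.values())
    return {s: math.exp(t) / z for s, t in theta.items()}


def free_energy(q, joint):
    """F(Q) = E_Q[log Q(s) - log P(s, o)]."""
    return sum(qs * (math.log(qs) - math.log(joint[s]))
               for s, qs in q.items() if qs > 0)


def minimize_free_energy(joint, steps=500, lr=0.5):
    """Gradient descent on logits theta, with Q = softmax(theta)."""
    theta = {s: 0.0 for s in joint}          # start from a uniform Q
    for _ in range(steps):
        q = softmax(theta)
        g = {s: math.log(q[s]) - math.log(joint[s]) for s in joint}
        mean_g = sum(q[s] * g[s] for s in joint)
        # dF/dtheta_s = q_s * (g_s - E_Q[g])
        theta = {s: theta[s] - lr * q[s] * (g[s] - mean_g) for s in joint}
    return softmax(theta)


q_star = minimize_free_energy(JOINT)
surprise = -math.log(sum(JOINT.values()))
print(q_star)
print(free_energy(q_star, JOINT), surprise)
```

Because this Q is unconstrained over two states, the minimizer should coincide with the exact posterior and F should meet the surprise bound. With a restricted family, F would settle above the bound instead.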

Why the mean-field factorization matters

A common further simplification is the mean-field assumption: Q(s) factorizes across hidden variables, so each factor can be updated in turn while the others are held fixed. That is what makes the update rules cheap enough to run inside our POMDP labs on a single browser tab (Class C, our Precision and Echo lab agents both use a factorized approximate posterior with precision-weighted updates over sensory and transition factors). The mean-field shape is a modeling choice, not a truth claim. It biases the approximate posterior toward independence between the factors, and the price shows up in behavior when hidden variables are actually coupled.
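The cost of the mean-field shape can be seen on a two-variable example. The sketch below runs standard coordinate-ascent updates for a factorized Q(s1, s2) = q1(s1) q2(s2) against two hypothetical joint tables, one coupled and one independent. The tables, the sweep count, and the function names are illustrative assumptions, and the code is pure Python rather than the precision-weighted Elixir updates used in our labs.

```python
import math


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def mean_field_cavi(joint, sweeps=200):
    """Fit Q(s1, s2) = q1(s1) * q2(s2) to a table joint[i][j] = P(s1=i, s2=j, o)."""
    n1, n2 = len(joint), len(joint[0])
    q1 = normalize([1.0] * n1)
    q2 = normalize([1.0] * n2)
    for _ in range(sweeps):
        # Update q1 with q2 held fixed: log q1(i) = E_q2[log joint[i][j]] + const
        q1 = normalize([
            math.exp(sum(q2[j] * math.log(joint[i][j]) for j in range(n2)))
            for i in range(n1)
        ])
        # Update q2 with the new q1 held fixed
        q2 = normalize([
            math.exp(sum(q1[i] * math.log(joint[i][j]) for i in range(n1)))
            for j in range(n2)
        ])
    return q1, q2


def free_energy(q1, q2, joint):
    """F for the factorized Q against the joint table."""
    return sum(
        q1[i] * q2[j] * (math.log(q1[i] * q2[j]) - math.log(joint[i][j]))
        for i in range(len(q1)) for j in range(len(q2))
    )


# Hypothetical tables, all entries strictly positive.
COUPLED = [[0.20, 0.02], [0.04, 0.14]]
INDEPENDENT = [[0.5 * a * b for b in (0.6, 0.4)] for a in (0.3, 0.7)]

for name, joint in (("coupled", COUPLED), ("independent", INDEPENDENT)):
    q1, q2 = mean_field_cavi(joint)
    surprise = -math.log(sum(sum(row) for row in joint))
    gap = free_energy(q1, q2, joint) - surprise   # equals KL(Q || posterior)
    print(f"{name}: q1={q1} q2={q2} gap={gap:.4f}")
```

For the independent table the factorized Q can match the posterior, so the gap should vanish. For the coupled table no product distribution can represent the correlation between s1 and s2, so a positive gap remains however long the updates run.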

Complementary hands-on stack

The variational updates in our workbench are implemented in Elixir, against precision-weighted POMDP models. For a different route into the same math, Building Active Inference in Python (Themesis) is a complementary hands-on course built on Python and pymdp. It is useful if you want to code the belief updates yourself. Linked as a factual reference, not an endorsement of our work.

What this post does and does not claim

This is a conceptual walkthrough grounded in the variational identity that appears in every standard treatment (Class E). The tie to our labs is a configuration claim about how the approximate posterior is shaped and updated in code (Class C). It is not a runtime benchmark, not a claim that free-energy minimization solves any specific mental-health or engineering problem, and not a claim that a mean-field factorization is correct for your generative model. UNI is a working hypothesis on an attainable path toward General Natural Intelligence: a natural, active-inference approach whose evidence is growing, evidence-classed, and tested in the open. Do not take the claim on faith. Test the build, inspect the gates, and help us find where it fails.

Evidence classes present: E (Parr, Pezzulo and Friston, 2022, standard variational identity), C (our Elixir workbench uses a factorized approximate posterior with precision-weighted updates). Falsifier: if the mean-field factorization is imposed on a strongly coupled hidden state, behavioral regressions should appear in the Cell Lab benchmark under specific disturbance families. That is a check we can run, and the code and cache are public.