Cluster: KL divergence and Bayesian inference

The Evidence Lower Bound (ELBO) in Active Inference

What the ELBO bounds, why the bound matters for tractable Bayesian inference, and how it becomes variational free energy once you flip the sign. Conceptual math, honest citations, and notes from the UNI workbench.

By Michael Polzin, 1 July 2026. Evidence classes present: E, C

Exact Bayesian inference is often intractable in the models an agent actually cares about. The ELBO is the trick that turns that computational dead end into a working optimisation problem. It is the same object active inference calls variational free energy, wearing a different coat.

UNI is a working hypothesis on an attainable path toward General Natural Intelligence: a natural, active-inference approach whose evidence is growing, evidence-classed, and tested in the open. Do not take the claim on faith. Test the build, inspect the gates, help us find where it fails.

What the ELBO actually bounds

Given a generative model P(o, s) over observations o and hidden states s, the log-evidence log P(o) is what a Bayesian agent would ideally compute. To get it you have to marginalise the joint over every possible hidden state, P(o) = Σ_s P(o, s), which is often intractable for realistic state spaces (Class E; Parr, Pezzulo and Friston, 2022, chapter 4).

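Variational inference sidesteps this by picking an approximate posterior Q(s) from a tractable family and writing:

$$
\log P(o) = \underbrace{\mathbb{E}_{Q(s)}\big[\log P(o, s) - \log Q(s)\big]}_{\mathrm{ELBO}(Q)} + \mathrm{KL}\big[\,Q(s)\,\|\,P(s \mid o)\,\big]
$$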

The KL term is non-negative, so ELBO(Q) is a lower bound on log P(o). Because log P(o) does not depend on Q, maximising the ELBO over Q does two things at once: it tightens the bound, and it pulls Q(s) toward the true posterior P(s | o). That is the whole game (Class E; standard result in variational Bayes).

Read the identity carefully: the ELBO is not an approximation of the log-evidence, it is a bound. When the approximate posterior matches the true posterior, the bound is tight and the ELBO equals the log-evidence exactly.
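The identity can be checked numerically on a model small enough to marginalise by hand. The sketch below is a minimal illustration, not code from the UNI core: the three-state prior, the likelihood table, and the candidate `q_guess` are made-up values, and the identity being checked is that log P(o) equals the ELBO plus the KL divergence from Q(s) to the true posterior.

```python
import numpy as np

# Hypothetical toy model: 3 hidden states, 2 possible observations.
prior = np.array([0.5, 0.3, 0.2])        # P(s)
likelihood = np.array([[0.9, 0.1],       # P(o | s), rows = states, cols = observations
                       [0.4, 0.6],
                       [0.2, 0.8]])
o = 1                                    # index of the observation actually seen

joint = prior * likelihood[:, o]         # P(o, s) as a vector over s
log_evidence = np.log(joint.sum())       # log P(o); computable only because the toy is tiny
posterior = joint / joint.sum()          # true posterior P(s | o)

def elbo(q):
    """E_Q[log P(o, s) - log Q(s)] for a strictly positive q."""
    return float(np.sum(q * (np.log(joint) - np.log(q))))

def kl(q, p):
    """KL[q || p] for strictly positive distributions."""
    return float(np.sum(q * (np.log(q) - np.log(p))))

q_guess = np.array([0.6, 0.2, 0.2])      # an arbitrary, deliberately wrong Q(s)

print("log P(o)        :", log_evidence)
print("ELBO(q_guess)   :", elbo(q_guess))
print("KL to posterior :", kl(q_guess, posterior))
print("ELBO(posterior) :", elbo(posterior))
```

For the wrong guess, the ELBO sits below log P(o) by exactly the KL term. Plugging in the true posterior closes the gap.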

The sign flip: ELBO and variational free energy

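Active inference works with variational free energy F(Q, o) instead of the ELBO. The relationship is a sign flip:

$$
F(Q, o) = -\mathrm{ELBO}(Q) = \mathbb{E}_{Q(s)}\big[\log Q(s) - \log P(o, s)\big] = \mathrm{KL}\big[\,Q(s)\,\|\,P(s \mid o)\,\big] - \log P(o)
$$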

Maximising the ELBO is minimising variational free energy. Because F is always greater than or equal to surprise, -log P(o), minimising F tightens an upper bound on surprise, the mirror image of tightening a lower bound on log-evidence (Class E; Parr et al., 2022, chapter 2). The literature uses both framings. Machine learning tends to speak ELBO; active inference tends to speak free energy. The math is the same object.
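A quick way to convince yourself of the upper bound is to throw random approximate posteriors at it. In this sketch the joint vector is a hypothetical P(o, s) for one fixed observation, and `free_energy` is just the negated ELBO. Nothing here is taken from a particular library or from the UNI code.

```python
import numpy as np

rng = np.random.default_rng(0)

joint = np.array([0.05, 0.18, 0.16])     # hypothetical P(o, s) for one fixed o
surprise = -np.log(joint.sum())          # -log P(o)
posterior = joint / joint.sum()          # P(s | o)

def free_energy(q):
    """F(Q, o) = E_Q[log Q(s) - log P(o, s)], i.e. the negated ELBO."""
    return float(np.sum(q * (np.log(q) - np.log(joint))))

# Sample 1000 random distributions over the 3 states and measure F - surprise.
samples = rng.dirichlet(np.full(3, 2.0), size=1000)
excess = np.array([free_energy(q) - surprise for q in samples])

print("smallest F - surprise over samples:", excess.min())
print("F at the true posterior - surprise:", free_energy(posterior) - surprise)
```

Every sampled Q leaves F at or above surprise, and the excess vanishes only at the true posterior.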

Why the bound matters for tractability

Two things become possible once you have the ELBO in hand. First, you can optimise Q by gradient ascent on the ELBO (equivalently, gradient descent on free energy) inside a chosen variational family (mean-field, structured, or amortised), instead of trying to integrate an intractable joint. Second, you can stop early: any Q gives you a valid bound, so partial optimisation still yields a usable estimate of belief, just a looser one. Perception under time pressure is exactly this: a partial ascent up the ELBO, cashed in as an approximate posterior (Class E; general variational Bayes). For a fuller conceptual pass on the inference machinery, see "Variational inference, a conceptual walkthrough".
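Both points can be seen in a few lines. The sketch below assumes a categorical Q(s) parameterised by softmax logits and uses plain gradient ascent with a hand-derived gradient. The joint vector, step size, and step counts are illustrative choices, not anything prescribed by the cited sources.

```python
import numpy as np

joint = np.array([0.05, 0.18, 0.16])     # hypothetical P(o, s) for one fixed o
log_evidence = np.log(joint.sum())

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def elbo(q):
    return float(np.sum(q * (np.log(joint) - np.log(q))))

def ascend(steps, lr=1.0):
    """Gradient ascent on the ELBO over softmax logits, starting from uniform Q."""
    theta = np.zeros_like(joint)
    trace = [elbo(softmax(theta))]
    for _ in range(steps):
        q = softmax(theta)
        h = np.log(joint) - np.log(q)
        # d ELBO / d theta_k = q_k * (h_k - E_q[h]) for a softmax-parameterised q
        theta = theta + lr * q * (h - np.sum(q * h))
        trace.append(elbo(softmax(theta)))
    return softmax(theta), trace

q_early, trace_early = ascend(steps=3)    # stopped under "time pressure"
q_late, trace_late = ascend(steps=200)    # run close to convergence

print("log P(o)            :", log_evidence)
print("ELBO after 3 steps  :", trace_early[-1])
print("ELBO after 200 steps:", trace_late[-1])
```

The early-stopped run still returns a legitimate lower bound and a usable belief `q_early`. It is simply looser than the one obtained by running longer.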

Two decompositions, one identity

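The ELBO has two decompositions used constantly in the active-inference literature. The first is the accuracy-complexity decomposition:

$$
\mathrm{ELBO}(Q) = \underbrace{\mathbb{E}_{Q(s)}\big[\log P(o \mid s)\big]}_{\text{accuracy}} - \underbrace{\mathrm{KL}\big[\,Q(s)\,\|\,P(s)\,\big]}_{\text{complexity}}
$$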

The second is the evidence-KL decomposition already given above, which is the one that establishes the bound. Both drop out of the same identity by rearranging the log-joint. If either derivation feels shaky, slow down on what KL divergence actually measures before pushing further.
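A compact sketch of that rearrangement, using only the symbols already defined (this is a reading aid, not a full derivation). Start from the definition of the ELBO as an expectation of the log-joint minus log Q, then factor the joint in the two possible ways:

$$
\begin{aligned}
\mathrm{ELBO}(Q) &= \mathbb{E}_{Q(s)}\big[\log P(o, s) - \log Q(s)\big] \\
&= \mathbb{E}_{Q(s)}\big[\log P(o) + \log P(s \mid o) - \log Q(s)\big]
 = \log P(o) - \mathrm{KL}\big[\,Q(s)\,\|\,P(s \mid o)\,\big] \\
&= \mathbb{E}_{Q(s)}\big[\log P(o \mid s) + \log P(s) - \log Q(s)\big]
 = \mathbb{E}_{Q(s)}\big[\log P(o \mid s)\big] - \mathrm{KL}\big[\,Q(s)\,\|\,P(s)\,\big]
\end{aligned}
$$

The second line uses P(o, s) = P(s | o) P(o) and gives the evidence-KL form. The third uses P(o, s) = P(o | s) P(s) and gives accuracy minus complexity. Since log P(o) does not depend on s, it passes through the expectation unchanged.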

What the ELBO looks like inside UNI

The UNI labs run a discrete-time POMDP active-inference core in the browser. Static inspection of that core (Class C) shows the ELBO appearing as variational free energy at each perceptual tick: an accuracy term against the current observation, and a complexity term against the transition-propagated prior from the previous tick. The precision dials on the Precision Lab modulate how much weight the accuracy term carries relative to the complexity term. That is a behavior you can watch: the same observation, at different sensory precisions, yields different posteriors because the ELBO is being ascended on a differently weighted surface. Falsifier posture applies (Class F): if moving precision does not shift behavior as the decomposition predicts, the implementation or the theory is wrong in a way you can see. See our companion piece, "Generative models, the organism model of its world", for how UNI chooses the model pieces the ELBO is bounding.
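To make the precision effect concrete, here is a minimal sketch of one perceptual tick. It is not the UNI core. The matrices `A` and `B`, the previous belief `q_prev`, and the choice to model sensory precision as a scalar `gamma` multiplying the accuracy term are all assumptions made for illustration. Under that assumption the minimiser of the weighted free energy has a closed form, Q(s) proportional to prior(s) times P(o | s) raised to the power gamma.

```python
import numpy as np

# Hypothetical two-state, two-observation POMDP pieces (not taken from UNI).
A = np.array([[0.8, 0.3],                # P(o | s): rows = observations, cols = states
              [0.2, 0.7]])
B = np.array([[0.9, 0.2],                # P(s_t | s_{t-1}): each column sums to 1
              [0.1, 0.8]])
q_prev = np.array([0.5, 0.5])            # belief at the previous tick
o = 0                                    # current observation index

prior = B @ q_prev                       # transition-propagated prior for this tick

def posterior(gamma):
    """Minimiser of complexity - gamma * accuracy (assumed precision weighting)."""
    log_q = np.log(prior) + gamma * np.log(A[o])
    q = np.exp(log_q - log_q.max())
    return q / q.sum()

def weighted_free_energy(q, gamma):
    complexity = np.sum(q * (np.log(q) - np.log(prior)))
    accuracy = np.sum(q * np.log(A[o]))
    return float(complexity - gamma * accuracy)

for gamma in (0.25, 1.0, 4.0):
    print("gamma =", gamma, "posterior =", posterior(gamma))
```

With `gamma = 1` this reduces to ordinary Bayes. Lower precision keeps the posterior near the propagated prior, while higher precision lets the same observation move it further, which is the qualitative pattern the Precision Lab is meant to expose.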

What this post is not

It is not a derivation, it is not a clinical instrument, and it is not the claim that our system has general intelligence. It is a conceptual map of one identity, cited to Parr, Pezzulo and Friston (2022), grounded in code inspection of the UNI core. The Zenodo preprint is unrefereed. Behavioral labels in the labs are hypotheses, not diagnoses.

The Workshop ›
The tightly qualified, publish-gate-backed working session where these pieces are taught and stress-tested.
Variational inference, a walkthrough ›
The inference machinery the ELBO drives: mean-field, structured, and amortised families, in plain language.
KL divergence, what it measures ›
The non-symmetric ruler that shows up in both terms of the ELBO decomposition, made concrete.
Generative models ›
Where the assumptions live: states, observations, transitions, and preferences, made explicit.