Cluster: KL divergence and Bayesian inference

KL Divergence: What It Actually Measures

KL divergence is a number you assign to a pair of probability distributions. It answers a very specific question, and if you get the direction wrong, the number answers a different question than the one you meant to ask.

The quantity, in one sentence

The Kullback-Leibler divergence from Q to P, written D_KL(Q || P), is the expected extra number of nats you pay to describe samples drawn from Q if you encode them using an optimal code built for P (Class E, Parr, Pezzulo, Friston, 2022, appendix on information theory).

D_KL(Q || P) = E_Q[ log Q(x) - log P(x) ]

It is measured in nats when the log is natural, in bits when the log is base 2. It is zero exactly when Q and P agree wherever Q puts mass, and strictly positive otherwise, so it is never negative. That is the whole definition, plus the one property you will lean on most.
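To make the definition concrete, here is a minimal sketch for discrete distributions. The function name kl_divergence, the list-of-probabilities representation, and the example values of q and p are assumptions chosen for illustration, not something taken from the cited sources.

```python
import math

def kl_divergence(q, p, base=math.e):
    """D_KL(Q || P) for discrete distributions given as lists of probabilities."""
    total = 0.0
    for q_i, p_i in zip(q, p):
        if q_i == 0.0:
            continue  # a term with q_i = 0 contributes 0 by convention
        if p_i == 0.0:
            return math.inf  # Q puts mass where P has none
        total += q_i * math.log(q_i / p_i, base)
    return total

# Hypothetical example distributions over two outcomes.
q = [0.5, 0.5]
p = [0.9, 0.1]

kl_nats = kl_divergence(q, p)          # natural log: about 0.51 nats
kl_bits = kl_divergence(q, p, base=2)  # base-2 log: about 0.74 bits
kl_same = kl_divergence(q, q)          # identical distributions: 0.0
```

The nats and bits values differ only by the constant factor ln 2, which is why the choice of log base changes the unit and nothing else.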

What it is not

KL divergence is not a distance. A distance is symmetric: the distance from A to B equals the distance from B to A. KL divergence is not (Class E). In general:

D_KL(Q || P) ≠ D_KL(P || Q)

People sometimes call KL a "distance" as informal shorthand. Do not carry that intuition into any equation. The asymmetry is not a rough edge to smooth over. It is the entire reason the quantity is useful for inference.
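The asymmetry is easy to check numerically. The sketch below assumes two made-up distributions over three outcomes (a peaked q and a uniform p) and a small helper named kl; none of these come from the original text.

```python
import math

def kl(a, b):
    """D_KL(A || B) in nats for discrete distributions with b_i > 0."""
    return sum(a_i * math.log(a_i / b_i) for a_i, b_i in zip(a, b) if a_i > 0)

# Hypothetical example: q is peaked, p is uniform.
q = [0.8, 0.1, 0.1]
p = [1 / 3, 1 / 3, 1 / 3]

kl_q_p = kl(q, p)  # about 0.46 nats
kl_p_q = kl(p, q)  # about 0.51 nats

print(f"D_KL(Q || P) = {kl_q_p:.4f}, D_KL(P || Q) = {kl_p_q:.4f}")
```

Swapping the arguments changes the answer even in this tiny example, so the order in which you write the two distributions is part of the quantity.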

Why the direction matters

Look at the definition again. The expectation is taken under Q, not P. That means the value of D_KL(Q || P) is sensitive to what Q says about the world, weighted by how badly P disagrees at the places Q puts mass. Regions where Q is near zero contribute almost nothing to the sum, no matter how wrong P is there.

Flip the order and you flip the weighting. D_KL(P || Q) cares about the places P puts mass, and is nearly blind to what happens where P is near zero.

This asymmetry gives the two directions distinct behavior when one distribution is used to approximate another.
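The weighting described above can be seen term by term. This sketch splits each divergence into its per-outcome contributions; the helper kl_terms and the numbers in q and p are illustrative assumptions. The third outcome is where the two distributions disagree most in log terms, and q puts almost no mass there.

```python
import math

def kl_terms(a, b):
    """Per-outcome contributions a_i * (log a_i - log b_i) to D_KL(A || B)."""
    return [a_i * (math.log(a_i) - math.log(b_i)) for a_i, b_i in zip(a, b)]

# Hypothetical example: Q nearly ignores the third outcome, P favors it.
q = [0.49, 0.50, 0.01]
p = [0.25, 0.25, 0.50]

terms_q_p = kl_terms(q, p)  # weighted by Q: third term is about -0.04
terms_p_q = kl_terms(p, q)  # weighted by P: third term is about 1.96

kl_q_p = sum(terms_q_p)  # about 0.64 nats
kl_p_q = sum(terms_p_q)  # about 1.61 nats
```

Under Q's weighting the large disagreement on the third outcome barely registers. Under P's weighting it dominates the total.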

Mode-seeking versus mode-covering

If you fit a simpler Q to a complex P by minimizing D_KL(Q || P), Q pays a penalty whenever it places mass where P has none. Q therefore prefers to sit inside one of P's modes and ignore the rest. Practitioners call this "mode-seeking" (Class E).

If instead you minimize D_KL(P || Q), Q pays a penalty whenever P has mass where Q is thin. Q spreads out to cover every mode of P, at the cost of putting mass in the low-probability valleys between them. This is "mode-covering" (Class E).

Same two distributions, opposite fitted shapes, entirely because of which side of the double-bar the reference distribution sits on.
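A small experiment shows both behaviors. The sketch below is an illustration under stated assumptions, not a recommended fitting procedure: the target P is a made-up two-mode Gaussian mixture (modes at -3 and +3), Q is a single Gaussian, both are discretized on a grid, and the fit is a brute-force search over a coarse grid of means and standard deviations. All names and constants are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def kl(a, b):
    """D_KL(A || B) in nats for discretized distributions."""
    total = 0.0
    for a_i, b_i in zip(a, b):
        if a_i == 0.0:
            continue
        if b_i == 0.0:
            return math.inf
        total += a_i * math.log(a_i / b_i)
    return total

# Grid over x and a bimodal target P with modes at -3 and +3.
xs = [-8 + 0.1 * i for i in range(161)]
p = normalize([0.5 * normal_pdf(x, -3, 0.7) + 0.5 * normal_pdf(x, 3, 0.7) for x in xs])

def fit(objective):
    """Brute-force search for the single Gaussian Q minimizing objective(q)."""
    best = (math.inf, None, None)
    for mu in (-4 + 0.25 * k for k in range(33)):
        for sigma in (0.3 + 0.1 * k for k in range(38)):
            q = normalize([normal_pdf(x, mu, sigma) for x in xs])
            score = objective(q)
            if score < best[0]:
                best = (score, mu, sigma)
    return best[1], best[2]

mu_seek, sigma_seek = fit(lambda q: kl(q, p))    # minimize D_KL(Q || P)
mu_cover, sigma_cover = fit(lambda q: kl(p, q))  # minimize D_KL(P || Q)

print(f"min D_KL(Q || P): mu = {mu_seek:.2f}, sigma = {sigma_seek:.2f}")
print(f"min D_KL(P || Q): mu = {mu_cover:.2f}, sigma = {sigma_cover:.2f}")
```

Under these assumptions the first fit should land on one of the two modes with a narrow width, and the second should center between the modes with a width broad enough to span both. Which of the two modes the first fit picks is arbitrary, since the target is symmetric.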

The direction variational inference picks

Variational inference approximates an intractable posterior P(z | x) with a tractable Q(z). The standard objective minimizes D_KL(Q(z) || P(z | x)) (Class E, Parr et al., 2022, chapter on variational inference). The direction is a modeling choice with real consequences.

That choice buys tractability. The expectation is taken under Q, which we designed to be easy to sample and to score. It also buys the mode-seeking behavior: the approximate posterior concentrates on a plausible explanation rather than hedging across all of them. In active inference, this is what a generative model does when it commits to an inferred cause of its sensations before selecting a policy (Class C, from the POMDP labs on this site).
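The tractability point can be sketched with a Monte Carlo estimate: because the expectation is under Q, you only need samples from Q and the ability to score log Q and log P at those samples. The example below is a toy under loud assumptions. Both distributions are one-dimensional Gaussians with made-up parameters, and P stands in for a posterior only so that a closed-form answer exists to compare against. In real variational inference the posterior is typically known only up to a normalizing constant.

```python
import math
import random

def log_normal_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

# Hypothetical parameters: Q is the approximation, P is the stand-in target.
q_mu, q_sigma = 0.0, 1.0
p_mu, p_sigma = 1.0, 2.0

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 20000
samples = [rng.gauss(q_mu, q_sigma) for _ in range(n)]  # draw from Q only

# Average of log Q(z) - log P(z) over samples from Q estimates D_KL(Q || P).
kl_estimate = sum(
    log_normal_pdf(z, q_mu, q_sigma) - log_normal_pdf(z, p_mu, p_sigma)
    for z in samples
) / n

# Closed form for two univariate Gaussians, used only as a check.
kl_exact = (
    math.log(p_sigma / q_sigma)
    + (q_sigma ** 2 + (q_mu - p_mu) ** 2) / (2 * p_sigma ** 2)
    - 0.5
)

print(f"Monte Carlo: {kl_estimate:.3f}, closed form: {kl_exact:.3f}")
```

Estimating the other direction, D_KL(P || Q), would require samples from P, which is exactly what an intractable posterior does not give you.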

The trade is real. A mode-seeking Q can under-report uncertainty by ignoring other plausible explanations. That is a limitation to know, name, and design around, not a bug to hide.

The free energy connection, briefly

The variational free energy F that active inference minimizes decomposes into a KL term against the prior minus an expected log-likelihood term, often called complexity and accuracy:

F = D_KL(Q(z) || P(z)) - E_Q[ log P(x | z) ]

The first term penalizes the approximate posterior for straying from the prior. The second rewards it for explaining the observation. Rearranging with Bayes' rule gives F = D_KL(Q(z) || P(z | x)) - log P(x). The log-evidence log P(x) does not depend on Q, so minimizing F over Q is equivalent to minimizing D_KL(Q(z) || P(z | x)) (Class E). The direction of the KL is the direction of the inference (Class C).
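The relationship between F and the posterior KL can be checked numerically in a model small enough to compute the exact posterior. The sketch below assumes a hypothetical three-state latent z, a made-up prior, and made-up likelihood values for a single observed x. It verifies that F equals D_KL(Q(z) || P(z | x)) minus log P(x), so the two objectives differ only by a term that does not depend on Q.

```python
import math

def kl(a, b):
    """D_KL(A || B) in nats for discrete distributions with b_i > 0."""
    return sum(a_i * math.log(a_i / b_i) for a_i, b_i in zip(a, b) if a_i > 0)

# Hypothetical generative model over three latent states.
prior = [0.5, 0.3, 0.2]        # P(z)
likelihood = [0.1, 0.6, 0.3]   # P(x | z) for the one observed x

evidence = sum(pz * lx for pz, lx in zip(prior, likelihood))          # P(x)
posterior = [pz * lx / evidence for pz, lx in zip(prior, likelihood)]  # P(z | x)
neg_log_evidence = -math.log(evidence)

def free_energy(q):
    """F = D_KL(Q(z) || P(z)) - E_Q[log P(x | z)]."""
    complexity = kl(q, prior)
    accuracy = sum(q_i * math.log(l_i) for q_i, l_i in zip(q, likelihood))
    return complexity - accuracy

q = [0.2, 0.5, 0.3]  # an arbitrary approximate posterior
f_q = free_energy(q)

# F and the posterior KL differ by -log P(x), which does not depend on Q.
gap = f_q - (kl(q, posterior) + neg_log_evidence)

# At Q = exact posterior the KL vanishes and F equals -log P(x).
f_post = free_energy(posterior)
```

Here gap should be zero up to floating-point error, and f_post should match neg_log_evidence, the lowest value F can take for this observation.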

A sanity check for practitioners

Before you write down any KL term in a loss, ask yourself: which distribution am I taking the expectation under, and what behavior does that force on the approximator? If you cannot answer both, you do not yet know what your objective is asking the model to do.

Further reading, and one honest recommendation

For learners who want the information-theory scaffolding underneath this post, Themesis Top Ten Terms in Statistical Mechanics for AI covers the vocabulary (entropy, partition function, log-evidence) that KL divergence is built from. We recommend it as preparation for the UNI Workshop for math-hungry learners. This is a factual recommendation, not an endorsement in either direction.