Bayesian Model Comparison for Practitioners

By Michael Polzin. Published 2026-07-01.

You have three candidate generative models of the same process. Which one should the agent use? Bayesian model comparison gives a principled answer: pick the model with the highest evidence, the marginal likelihood of the data. Evidence already penalises a model for how much probability mass it spends on hypotheses that turned out not to matter.

In active inference the same quantity appears everywhere. Variational free energy, which the agent is already minimising during perception and action, is an upper bound on the negative log evidence (Class E, Parr, Pezzulo, Friston 2022, chapters 2 and 4). So model comparison is not a separate, exotic step. It is the same math, applied to a discrete set of whole models rather than to states inside one model.

The quantity that matters

For a model M and observed data y, the model evidence is p(y given M), obtained by integrating (or, for discrete states, summing) the joint p(y, x given M) over all hidden states x. Two models are compared by their Bayes factor, the ratio of their evidences, or equivalently by the difference of their log evidences. Log evidence differences above roughly 3 nats are conventionally called strong (Class E, standard Bayesian reference, Kass and Raftery 1995). Under a uniform prior over models the posterior over models is proportional to the evidence, so ranking models by posterior probability is the same as ranking them by p(y given M).
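As a minimal sketch of the two operations just described, the snippet below turns a set of log evidences into log Bayes factors and a posterior over models. The function names and the three log evidence values are hypothetical, chosen only for illustration; the posterior is computed with the usual log-sum-exp trick so that very negative log evidences do not underflow.

```python
import math

def log_bayes_factor(log_ev_a, log_ev_b):
    """Log Bayes factor of model A over model B, in nats."""
    return log_ev_a - log_ev_b

def posterior_over_models(log_evidences, log_priors=None):
    """Posterior p(M given y) for each model, from log evidences in nats.

    With log_priors=None a uniform prior over models is assumed.
    """
    n_models = len(log_evidences)
    if log_priors is None:
        log_priors = [-math.log(n_models)] * n_models
    log_joint = [le + lp for le, lp in zip(log_evidences, log_priors)]
    # Subtract the maximum before exponentiating (log-sum-exp) for stability.
    peak = max(log_joint)
    weights = [math.exp(v - peak) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical log evidences, not taken from any real model fit.
names = ["A", "B", "C"]
log_evidences = [-12.0, -9.0, -8.5]

posterior = posterior_over_models(log_evidences)
for name, log_ev, prob in zip(names, log_evidences, posterior):
    print(f"model {name}: log evidence {log_ev:.2f} nats, posterior {prob:.3f}")

# B over A is exactly 3 nats, the conventional threshold for "strong".
print("log BF (B over A):", log_bayes_factor(log_evidences[1], log_evidences[0]))
```

Under the uniform prior the posterior ordering matches the evidence ordering, which is the point made above. Passing non-uniform log_priors would let prior model preferences shift that ordering.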

The important intuition: evidence is not a raw fit score. A model that fits the training data by allocating probability to a huge, mostly irrelevant hypothesis space pays for that flexibility. Evidence automatically balances accuracy against complexity, no separate regularisation term required. This is the Bayesian Occam's razor (Class E, MacKay 2003, chapter 28).

A small example a practitioner can hold

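Suppose an agent observes ten trials of a two-outcome signal: eight ones and two zeros. Three candidate generative models: M1 fixes the rate of ones at 0.5; M2 treats the rate as a free parameter with a flat prior; M3 treats the rate as a free parameter with an informative, expert-derived prior.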

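The log evidences work out to about minus 3.12, minus 2.40, and minus 1.90 nats respectively (Class C, closed-form Beta-Binomial marginal likelihood, with the binomial coefficient included for every model). M3 wins, but the gap between M3 and M2 is about 0.5 nats: the informative prior helps, but not overwhelmingly. The gap between M2 and M1 is about 0.7 nats: allowing the rate to be a free parameter, even with a flat prior, beats forcing it to 0.5, but not strongly either. Keep the convention the same across models: scoring M1 on one specific sequence of outcomes (minus 6.93 nats, no binomial coefficient) while scoring M2 and M3 on the count would inflate the M2 to M1 gap to over 4 nats. A practitioner reads this as "expert knowledge and rate flexibility both help, but ten trials are not enough for either gap to approach the 3 nat mark."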
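The closed-form calculation can be sketched as follows for the fixed-rate model and the flat-prior model, assuming the evidence is taken over the count of ones (so the binomial coefficient is included for both). The helper names are hypothetical. The parameters of M3's informative prior are not stated in the text, so M3 is not computed here; any Beta(a, b) prior can be passed to the same function.

```python
import math

def log_binom_coeff(n, k):
    """Log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_evidence_fixed_rate(k, n, rate):
    """Log evidence of k ones in n trials when the rate is fixed."""
    return log_binom_coeff(n, k) + k * math.log(rate) + (n - k) * math.log(1.0 - rate)

def log_evidence_beta_prior(k, n, a, b):
    """Log evidence of k ones in n trials under a Beta(a, b) prior on the rate."""
    return log_binom_coeff(n, k) + log_beta(a + k, b + n - k) - log_beta(a, b)

k, n = 8, 10  # eight ones in ten trials

log_ev_m1 = log_evidence_fixed_rate(k, n, 0.5)        # rate fixed at 0.5
log_ev_m2 = log_evidence_beta_prior(k, n, 1.0, 1.0)   # flat prior, Beta(1, 1)

print(f"M1 (rate = 0.5):  {log_ev_m1:.2f} nats")
print(f"M2 (flat prior):  {log_ev_m2:.2f} nats")
print(f"M2 minus M1:      {log_ev_m2 - log_ev_m1:.2f} nats")
```

Dropping the binomial coefficient from every model shifts all log evidences by the same constant, so the differences are unchanged. What must not happen is including it for one model and omitting it for another.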

How this fits the agent

During inference the agent minimises variational free energy for its current model. Across models, the same math tells the agent which generative model to keep. The two operations compose cleanly. This is also why comparing models is not the same as watching the KL divergence between one model's posterior and prior beliefs. That KL tells you how much a single model updated on this observation. Evidence tells you how well the model, prior and likelihood together, explained the observation at all. Different questions, related tools. See our walkthrough on variational inference for how the bound is constructed, and our note on generative models for why an organism (or an organisation) always has one whether it names it or not.
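The difference between the KL term and the evidence can be made concrete with a small sketch. For a Beta prior on a rate and binomial data, the log evidence splits exactly into an accuracy term (the expected log likelihood under the posterior) minus a complexity term (the KL from posterior to prior). The code below is an illustration under stated assumptions, not the article's own implementation: the function names are hypothetical, and the digamma helper only handles positive integer arguments, which is enough for integer Beta parameters and counts.

```python
import math

EULER_GAMMA = 0.5772156649015329

def digamma_int(n):
    """Digamma at a positive integer n: -gamma + sum of 1/j for j = 1..n-1."""
    if n < 1 or n != int(n):
        raise ValueError("this sketch only supports positive integer arguments")
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, int(n)))

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_binom_coeff(n, k):
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def decompose_log_evidence(k, n, a, b):
    """Return (accuracy, complexity, log_evidence) for k ones in n trials
    under a Beta(a, b) prior with integer a and b."""
    a_post, b_post = a + k, b + (n - k)
    psi_total = digamma_int(a_post + b_post)
    # Posterior expectations of log(rate) and log(1 - rate).
    exp_log_rate = digamma_int(a_post) - psi_total
    exp_log_one_minus = digamma_int(b_post) - psi_total
    accuracy = (log_binom_coeff(n, k)
                + k * exp_log_rate
                + (n - k) * exp_log_one_minus)
    # KL divergence from the Beta posterior to the Beta prior.
    complexity = (log_beta(a, b) - log_beta(a_post, b_post)
                  + (a_post - a) * digamma_int(a_post)
                  + (b_post - b) * digamma_int(b_post)
                  + (a - a_post + b - b_post) * psi_total)
    log_evidence = log_binom_coeff(n, k) + log_beta(a_post, b_post) - log_beta(a, b)
    return accuracy, complexity, log_evidence

accuracy, complexity, log_evidence = decompose_log_evidence(k=8, n=10, a=1, b=1)
print(f"accuracy   {accuracy:.3f} nats")
print(f"complexity {complexity:.3f} nats (KL from posterior to prior)")
print(f"evidence   {log_evidence:.3f} nats (accuracy minus complexity)")
```

For eight ones in ten trials under a flat prior, the complexity comes to roughly 0.75 nats while the log evidence is roughly minus 2.40 nats. The KL alone says how far beliefs moved; only together with the accuracy term does it say how well the model explained the data.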

Practical falsifiers

Model comparison can mislead when the candidate set is too small (the "best of a bad lot" problem, Class E, standard critique) or when the priors within a candidate model are chosen after seeing the data. A useful falsifier: hold out a slice of observations, refit the models on the rest, and check whether the ranking on held-out data agrees with the in-sample ranking. If it flips, the comparison was picking up prior artefacts, not structure. This is the same discipline we apply on the Cell Lab benchmark: every claim comes with a stated way to be shown wrong, and failures are shown as plainly as wins. Our note on gates and falsifiers unpacks that posture.
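The hold-out check can be sketched for two simple candidates, a rate fixed at 0.5 and a free rate with a flat Beta(1, 1) prior. Everything specific here is an assumption for illustration: the ordering of the ten observations, the 7 to 3 split, and the function names. "Refitting" the flat-prior model means updating its Beta prior on the training slice, then scoring the held-out slice under the resulting posterior predictive.

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_evidence_fixed(seq, rate):
    """Log probability of a specific 0/1 sequence under a fixed rate."""
    k, n = sum(seq), len(seq)
    return k * math.log(rate) + (n - k) * math.log(1.0 - rate)

def log_evidence_beta(seq, a, b):
    """Log probability of a specific 0/1 sequence under a Beta(a, b) prior."""
    k, n = sum(seq), len(seq)
    return log_beta(a + k, b + n - k) - log_beta(a, b)

def ranking(scores):
    """Model names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ordering of eight ones and two zeros, split 7 / 3.
observations = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
train, held_out = observations[:7], observations[7:]

in_sample = {
    "fixed_half": log_evidence_fixed(train, 0.5),
    "flat_prior": log_evidence_beta(train, 1.0, 1.0),
}

# Posterior after the training slice becomes the prior for the held-out slice.
k_train, n_train = sum(train), len(train)
out_of_sample = {
    "fixed_half": log_evidence_fixed(held_out, 0.5),
    "flat_prior": log_evidence_beta(held_out, 1.0 + k_train, 1.0 + n_train - k_train),
}

in_sample_ranking = ranking(in_sample)
held_out_ranking = ranking(out_of_sample)
rankings_agree = in_sample_ranking == held_out_ranking

print("in-sample ranking:", in_sample_ranking)
print("held-out ranking: ", held_out_ranking)
print("rankings agree:   ", rankings_agree)
```

With this particular split the two rankings disagree, which mostly shows how little three held-out trials can support. In practice the check is more informative when repeated over several splits, with a held-out slice large enough that the gaps are not within noise.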

UNI is a working hypothesis on an attainable path toward General Natural Intelligence: a natural, active-inference approach whose evidence is growing, evidence-classed, and tested in the open. Do not take the claim on faith. Test the build, inspect the gates, and help us find where it fails.