Gates and Falsifiers: How We Know When We Are Wrong

By Michael Polzin. Published 2026-07-01. UNI is a working hypothesis on an attainable path toward General Natural Intelligence: a natural, active-inference approach whose evidence is growing, evidence-classed, and tested in the open. Do not take the claim on faith. Test the build, inspect the gates, and help us find where it fails.

The interesting question about any research program is not what it claims, it is what would make it change its mind. UNI answers that question the same way every time: with a gate. Every claim in the public ledger carries two tags, an evidence class and a falsifier, and neither is optional.

An evidence class says what kind of thing we saw. A falsifier says what we would have to see to be wrong. The pair is the contract. Without a class the claim is a mood. Without a falsifier the claim is a slogan. With both, the claim is a bet you can settle.

The six classes we use

Our classes borrow from the standard evidence taxonomy in delivery engineering, then narrow it for a research setting (Class E). Class A is empirical, in-session: something we ran, something you can run alongside us. Class B is code or artifact inspection: repository state, config, logs, diffs. Class C is configuration and integration: what is wired to what, at what version, under which flags. Class E is expert citation into the literature: Parr, Pezzulo and Friston (2022) is a Class E anchor for the free energy principle, cited, never rehosted. Class F flags that a falsifier is present on the claim: a specific observation that would sink it. Class U marks a claim as unverified and quarantines it from any downstream use.
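As a minimal sketch of that taxonomy, assuming a Python representation that the ledger itself may not use, the classes and the two-tag contract could be written down like this. The names EvidenceClass, Claim, and quarantined are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class EvidenceClass(Enum):
    """The class labels described above; the string values are paraphrases."""
    A = "empirical, in-session"
    B = "code or artifact inspection"
    C = "configuration and integration"
    E = "expert citation"
    F = "falsifier present"
    U = "unverified"


@dataclass(frozen=True)
class Claim:
    """A claim that cannot be built without both tags (hypothetical type)."""
    text: str
    classes: frozenset
    falsifier: str

    def __post_init__(self):
        # Neither tag is optional: reject a claim with no class or no falsifier.
        if not self.classes:
            raise ValueError("a claim needs at least one evidence class")
        if not self.falsifier.strip():
            raise ValueError("a claim needs a falsifier")

    @property
    def quarantined(self) -> bool:
        # Class U keeps the claim out of any downstream use.
        return EvidenceClass.U in self.classes
```

Making the record frozen and validating it at construction is one way to express "neither is optional": a claim missing either tag never exists as an object in the first place.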

A gate has three parts, always

Every gate in the ledger takes the same shape: the claim in one sentence, the evidence classes that support it, the falsifier that would kill it. Here is the shape of a gate on the precision claim from the Loop Lab (Class E, Class C):

Claim. In the 2-state Loop Lab toy, raising sensory precision above the bifurcation threshold pushes policy selection into a different regime.
Evidence. Class E, the analytic bifurcation map derived from the expected free energy decomposition in Parr, Pezzulo and Friston (2022), chapter 7. Class C, the config in loop-lab.js that names the precision dial and pins its range.
Falsifier. If two Loop Lab runs, one with sensory precision set below the mapped threshold and one above it, produce statistically indistinguishable policy distributions across 200 episodes at planning depth 2, the claim is falsified for that region.

Notice what is missing. There is no rhetorical hedge, no "we believe", no "further work is needed". There is a bet and a way to settle it. That is the whole trick.
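The falsifier above says "statistically indistinguishable" without naming a test. One way to settle it, offered as a sketch and not as the Loop Lab's own procedure, is a permutation test on the total variation distance between the two policy distributions. The function names, the 0.05 threshold, and the idea of recording one policy label per episode are all assumptions made for this example.

```python
import random
from collections import Counter


def total_variation(a, b):
    """Total variation distance between the empirical distributions of two label lists."""
    ca, cb = Counter(a), Counter(b)
    labels = set(ca) | set(cb)
    return 0.5 * sum(abs(ca[k] / len(a) - cb[k] / len(b)) for k in labels)


def permutation_p_value(below, above, n_perm=2000, seed=0):
    """P-value for 'both runs draw policies from the same distribution'."""
    rng = random.Random(seed)
    observed = total_variation(below, above)
    pooled = list(below) + list(above)
    hits = 0
    for _ in range(n_perm):
        # Shuffle the pooled labels and split them back into two groups.
        rng.shuffle(pooled)
        shuffled = total_variation(pooled[:len(below)], pooled[len(below):])
        if shuffled >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def falsified(below, above, alpha=0.05):
    """True when the two runs cannot be told apart at level alpha (assumed threshold)."""
    return permutation_p_value(below, above) >= alpha
```

Here below and above would each hold 200 policy labels, one per episode. A different test or a different threshold would be equally defensible; what matters is that the choice is fixed before the run, not after.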

The Cell Lab gate, in flight

The pre-registered Cell Lab benchmark on the transparency page is not one gate, it is five, one per claim. The most public one reads: a UNI active-inference controller outperforms a random baseline on the 216-state hidden cell across all seven disturbance families, with a bootstrap 95% confidence interval for the median paired difference that excludes zero (Class C, Class F). Committed cache: depth 2, 6 seeds, 80 ticks.
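For readers who want to see the mechanics, here is a minimal sketch of a percentile bootstrap confidence interval for a median paired difference. It assumes one paired difference per seed (controller score minus random-baseline score) and uses the percentile variant of the bootstrap; the benchmark's own resampling scheme is not specified here, and the function names are hypothetical.

```python
import random
import statistics


def bootstrap_median_ci(diffs, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap CI for the median of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the differences with replacement and record each resample's median.
    medians = sorted(
        statistics.median(rng.choices(diffs, k=n)) for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return medians[lo_idx], medians[hi_idx]


def excludes_zero(ci):
    """True when the whole interval lies on one side of zero."""
    lo, hi = ci
    return lo > 0 or hi < 0
```

Calling bootstrap_median_ci on the six per-seed differences for one disturbance family and passing the result to excludes_zero gives the yes-or-no reading the gate asks for. With only six pairs the set of possible bootstrap medians is small, so the interval is coarse; that is a property of the sample size, not of the code.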

Today that gate reads seven wins, six with the confidence interval clearing zero, one where it does not. The gate is not marked green. The claim is provisionally supported, with the exact excursion recorded (cpu_noisy_neighbor), and the falsifier stays on the wall until a re-run either closes the gap or opens the case. This is what science in the open looks like when you are willing to be wrong on the page.

Why the ledger, not a blog post

A blog post can be edited quietly. A ledger cannot. Every gate we open, every re-run we log, every falsifier we retire lives as an append-only entry with a hash and a timestamp. If we ever want to say "the CPU noisy-neighbor gate flipped green on run number 8", the entry either exists or it does not, and the class labels stamp the claim in the same motion. The Class U marker is important here: an unverified claim can sit in the ledger, but it is quarantined and cannot be cited by any other gate until its class is upgraded.
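To make "append-only entry with a hash and a timestamp" concrete, here is a small sketch in which each entry's hash also covers the previous entry's hash, so a quiet edit to an old entry breaks verification. Chaining the hashes is an assumption of this example, as are the Ledger class, its field names, and the payload shape; the real ledger's format is not described here.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(body):
    """SHA-256 over a canonical JSON encoding of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class Ledger:
    """Hypothetical append-only log: entries can be added and checked, not edited."""

    GENESIS = "0" * 64

    def __init__(self):
        self._entries = []

    def append(self, payload, timestamp=None):
        prev_hash = self._entries[-1]["hash"] if self._entries else self.GENESIS
        body = {
            "index": len(self._entries),
            "timestamp": timestamp or datetime.now(timezone.utc).isoformat(),
            "payload": payload,
            "prev_hash": prev_hash,
        }
        entry = dict(body, hash=_digest(body))
        self._entries.append(entry)
        return entry

    def verify(self):
        """Recompute every hash and check that each entry links to the one before it."""
        prev_hash = self.GENESIS
        for i, entry in enumerate(self._entries):
            body = {k: v for k, v in entry.items() if k != "hash"}
            if (
                entry["index"] != i
                or entry["prev_hash"] != prev_hash
                or entry["hash"] != _digest(body)
            ):
                return False
            prev_hash = entry["hash"]
        return True
```

In this sketch a re-run is a new append, never an update. Anyone holding a copy of the entries can call verify and see for themselves whether the history they were shown is the history that was written.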

What the gates cost, and why we pay it

Gate discipline is slower than the alternative. It costs more to say "we ran this against a rule-based baseline, a neural baseline, and a random baseline, and we lose to the neural baseline on memory_leak and cpu_noisy_neighbor" than to say "our method beats the baselines". We pay that cost because the point of the program is not to look correct, it is to be correctable. If the Cell Lab gates all read green forever we would not trust them. The gate that sometimes closes red is the gate that is doing its job.

Where to go next

The transparency page

The full ledger: every gate, every evidence class, every falsifier still in flight.

Active inference fundamentals, a working map

The math behind the gates: generative models, KL divergence, and the expected free energy decomposition.

The Stratified Palimpsest benchmark

The falsification benchmark and the preprint, together, with the pre-registered claims that gate them.

Steer a lab yourself

Every gate on this site started as a dial you can move. Open a lab and try to break one.