Gates and falsifiers

Evidence Classes A to E: What They Mean

UNI is a working hypothesis on an attainable path toward General Natural Intelligence: a natural, active-inference approach whose evidence is growing, evidence-classed, and tested in the open. Do not take the claim on faith. Test the build, inspect the gates, and help us find where it fails.

That posture only works if every claim we publish carries a label saying how we know. This post is the reference for those labels. If you have seen a "(Class B)" or a "(Class E, F)" tag on any UNI, SolutionWright, IamHITL, or EducateWright page, this is the definition it points back to.

Why we tag at all

In active inference, an agent decides by comparing its generative model of the world with what it actually observes, and it acts to minimize variational free energy (an upper bound on surprise) through both perception and action. Parr, Pezzulo and Friston formalize this in Active Inference: The Free Energy Principle in Mind, Brain, and Behavior, MIT Press, 2022 (Class E). The mathematical machinery cares deeply about the precision (inverse variance) of each observation, because higher-precision observations pull the posterior harder.
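To make "pull the posterior harder" concrete, here is a minimal sketch of the textbook Gaussian case. It assumes one scalar prior and one scalar observation, each with a known precision. This is the standard conjugate update, not UNI's inference code, and the function name is illustrative.

```python
def precision_weighted_update(prior_mean, prior_precision, obs, obs_precision):
    """Combine a Gaussian prior with one Gaussian observation.

    Precision is inverse variance. The posterior precision is the sum of the
    two precisions, and the posterior mean is their precision-weighted
    average, so a higher-precision observation pulls the mean harder.
    """
    post_precision = prior_precision + obs_precision
    post_mean = (prior_precision * prior_mean + obs_precision * obs) / post_precision
    return post_mean, post_precision


# Same prior, same observed value, two different observation precisions.
print(precision_weighted_update(0.0, 1.0, 10.0, 1.0))  # (5.0, 2.0)
print(precision_weighted_update(0.0, 1.0, 10.0, 9.0))  # (9.0, 10.0)
```

With the observation held at 10, raising its precision from 1 to 9 moves the posterior mean from 5 to 9.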

Evidence classes are that same idea, applied to our own claims. Not everything we say deserves the same weight. A tag makes precision visible so a reader (or another agent) can compute an honest posterior instead of trusting our fluency.

The classes

Class A, empirical-in-session

Something we observed directly in a live run: a benchmark score, a lab transcript, an MCP call and its response, a screen recording of the Precision Lab under a specific dial setting. Class A is the highest precision we publish. Reproducing the run should yield the same observation within noise.
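One way to make "the same observation within noise" checkable, as a sketch: treat a rerun as reproducing a Class A observation if it lands within a stated tolerance of the recorded value. The function name, the tolerance, and the scores below are hypothetical, not values from any UNI run.

```python
import math


def reproduces(recorded: float, rerun: float, abs_tol: float) -> bool:
    """Return True if a rerun matches the recorded observation within abs_tol.

    abs_tol stands in for "noise"; how it is chosen (for example, from the
    spread of repeated runs) is an assumption left to each experiment.
    """
    return math.isclose(recorded, rerun, rel_tol=0.0, abs_tol=abs_tol)


# Hypothetical benchmark scores, for illustration only.
print(reproduces(0.812, 0.807, abs_tol=0.01))  # True
print(reproduces(0.812, 0.780, abs_tol=0.01))  # False
```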

Class B, code and inspection

Something the source shows: a function does what its name says, a config file has a specific value, a route is wired to a specific handler. Class B is strong evidence about intent and structure, but it is not evidence that the running system is currently executing that code. Static inspection proves the code is written a certain way, nothing more.
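A small illustration of what Class B evidence can and cannot tell you, assuming a Python codebase: the sketch below reads a constant by parsing source text with the standard `ast` module, without running it. The constant, the handler, and the helper name are invented for the example.

```python
import ast

# Hypothetical source file contents; the names are invented for illustration.
SOURCE = '''
REQUEST_TIMEOUT_S = 30

def handler(request):
    return {"status": "ok"}
'''


def assigned_constant(source: str, name: str):
    """Return the literal assigned to a module-level `name`, or None.

    The source is parsed into a syntax tree, never executed, so this only
    reports what the code says, not what a running process is doing.
    """
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == name:
                    return ast.literal_eval(node.value)
    return None


print(assigned_constant(SOURCE, "REQUEST_TIMEOUT_S"))  # 30
```

A deployed process could still be running an older build of this file, which is exactly why a result like this stays Class B.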

Class C, configuration and integration

Something the wiring shows: a container is running, a systemd unit is active, an MCP server is reachable, an NGINX proxy has the expected upstream. Class C is the "plumbing is connected" tier. It sits between B and A: the code is deployed and reachable, but a specific behavior has not yet been observed in this session.
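A minimal Class C style check, as a sketch: can a TCP connection be opened to a service? This assumes the service listens on a TCP port (an MCP server using stdio transport would need a different check), and the helper name is hypothetical. A successful connection shows the plumbing is there; it says nothing about whether the service behaves correctly.

```python
import socket


def reachable(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    True means something is listening. It is not evidence that the
    listener implements the behavior a claim is about.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Demonstration against a throwaway local listener on a free port.
    with socket.socket() as server:
        server.bind(("127.0.0.1", 0))
        server.listen(1)
        print(reachable("127.0.0.1", server.getsockname()[1]))  # True
```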

Class E, expert citation

Something a named external source states, with a locatable reference: a book, a peer-reviewed paper, a preprint, a public lecture. We use Class E for claims we borrow rather than generate. Parr, Pezzulo and Friston (2022) on active inference and Namjoshi (2026) on the Stratified Palimpsest benchmark are both cited as Class E. The Zenodo preprint (DOI 10.5281/zenodo.19785799) is a Class E pointer to our own unrefereed writeup.

Class F, falsifier present

A meta-tag. It says a claim ships with a written condition that would break it if observed. The Cell Lab benchmark is the flagship example: five claims were registered with their falsification criteria before the runs, and losses are recorded next to wins (Class F). A claim without an F tag is a claim we do not yet know how to disconfirm, and we owe you a falsifier for it.
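A minimal sketch of "registered with their falsification criteria before the runs", assuming each falsifier can be written as a predicate over run results that exists before any result does. The `Claim` structure, the metric names, and the claim itself are hypothetical; this is not the Cell Lab benchmark's actual format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Claim:
    name: str
    # Predicate over run results: True means the result breaks the claim.
    falsified_by: Callable[[Dict[str, float]], bool]


def evaluate(claims: List[Claim], results: Dict[str, float]) -> Dict[str, str]:
    """Score every registered claim, so losses are recorded next to wins."""
    return {
        c.name: "falsified" if c.falsified_by(results) else "survived"
        for c in claims
    }


# Registered before any run (hypothetical claim and metric names).
registry = [
    Claim("beats_baseline", lambda r: r["model_score"] <= r["baseline_score"]),
]

# Results arrive later; the falsifier is not edited to fit them.
print(evaluate(registry, {"model_score": 0.61, "baseline_score": 0.64}))
# {'beats_baseline': 'falsified'}
```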

Class U, unverified

Something we said out loud but have not yet grounded. Sometimes this is a hypothesis we are still designing the test for. Sometimes it is a legacy sentence from an older draft we have not audited. Class U is a promise to the reader: we know this one is thin, and we are not hiding it behind confident prose.

What the classes do not do

A Class A observation in one session is not proof of the general claim; it is one data point on a curve we are still drawing. A Class B inspection of well-written code is not a demonstration that the running system behaves that way. A Class E citation is not an endorsement of our project by the cited author. And a Class F falsifier that has not yet fired is not a guarantee that it never will. The classes are calibration, not victory.

How to read a tagged sentence

When you see a paragraph on this site with a "(Class B, C)" tag, read it as: the code is written this way and the plumbing is connected, but we have not shown you a live run in this piece. When you see "(Class A, F)", read it as: we ran it, we recorded it, and we told you in advance what would have counted as failure. Both are legitimate. They are not the same.
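The reading rule above can be sketched as a small lookup, assuming tags always take the "(Class X, Y)" form used on this page. The one-line summaries paraphrase the definitions in this post; the function and table names are hypothetical.

```python
import re

# One-line paraphrases of the classes defined in this post.
CLASS_MEANING = {
    "A": "we observed it in a live run",
    "B": "the source code is written this way",
    "C": "the code is deployed and reachable",
    "E": "a named external source states it",
    "F": "a falsifier was written down in advance",
    "U": "we have not grounded it yet",
}

# Matches tags such as "(Class B)" or "(Class A, F)".
TAG = re.compile(r"\(Class\s+([A-Z](?:\s*,\s*[A-Z])*)\)")


def read_tag(text: str) -> list[str]:
    """Return the meaning of each class in the first "(Class ...)" tag found."""
    match = TAG.search(text)
    if match is None:
        return []
    letters = [s.strip() for s in match.group(1).split(",")]
    # Unknown letters are flagged rather than silently dropped.
    return [CLASS_MEANING.get(letter, f"unknown class {letter}") for letter in letters]


print(read_tag("The route is wired to its handler (Class B, C)."))
# ['the source code is written this way', 'the code is deployed and reachable']
```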

The tags travel across the family. On SolutionWright they show up in client-facing writeups so a buyer can see what receipts back a claim. On IamHITL they annotate the investigation of extraction economics so a reader can separate the ledger from the interpretation. On EducateWright they help a teacher say "here is what we tested, here is what we cited, here is what is still open."

The ledger is the source of truth

Every non-trivial claim we publish should be traceable back to an entry in the UNI ledger, our append-only audit log of runs, decisions, and observations (Class C). The ledger is where a Class A observation lives before it becomes a sentence on a page. If a sentence here disagrees with the ledger, the ledger wins and the sentence gets corrected. That rule is what lets us call this "science in the open" without flinching.
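A minimal sketch of the "ledger wins" rule, assuming a ledger that supports only appending and lookup. The `Ledger` class, the entry ID, and the field names are illustrative assumptions, not the UNI ledger's actual schema.

```python
class Ledger:
    """Append-only log: entries can be added and read, but never edited."""

    def __init__(self) -> None:
        self._entries: list[tuple[str, dict]] = []  # kept in append order

    def append(self, entry_id: str, observation: dict) -> None:
        if any(eid == entry_id for eid, _ in self._entries):
            # Corrections go in as new entries; old ones are never overwritten.
            raise ValueError(f"entry {entry_id!r} already exists")
        self._entries.append((entry_id, dict(observation)))

    def get(self, entry_id: str) -> dict:
        for eid, obs in self._entries:
            if eid == entry_id:
                return dict(obs)  # a copy, so callers cannot alter the record
        raise KeyError(entry_id)


def reconcile(published_value, ledger: Ledger, entry_id: str, key: str):
    """Return (value to publish, whether the sentence needs correcting)."""
    recorded = ledger.get(entry_id)[key]
    return recorded, published_value != recorded


ledger = Ledger()
ledger.append("run-001", {"score": 0.74})  # hypothetical entry
print(reconcile(0.78, ledger, "run-001", "score"))  # (0.74, True)
```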