The Benchmark: What the Stratified Palimpsest Actually Tests

A benchmark is only honest if it can tell you when you are wrong. The Stratified Palimpsest is layered on purpose: each layer stresses a different assumption in the active-inference stack, and every layer carries a written falsifier committed before the run.

UNI is a working hypothesis on an attainable path toward General Natural Intelligence: a natural, active-inference approach whose evidence is growing, evidence-classed, and tested in the open. Do not take the claim on faith. Test the build, inspect the gates, and help us find where it fails.

Why a layered benchmark, not a single score

A single RecoveryScore column would flatten exactly the information we need. Active inference posits distinct mechanisms: a generative model over hidden states, a precision-weighted sensory update, and policy selection by expected free energy (Parr, Pezzulo, Friston, 2022) Class E. If we want to know which mechanism carries a passing run, and which mechanism a failing run indicts, the benchmark has to be stratified so those signals do not collapse into a single mean.
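To make the third mechanism concrete, here is a minimal sketch of policy selection by expected free energy in its textbook form: a softmax over negative expected free energy, scaled by a precision parameter. The function name `policy_posterior` and the parameter `gamma` are illustrative assumptions, not names from the Cell Lab code, and the expected-free-energy values are taken as given rather than computed.

```python
import math
from typing import List, Sequence


def policy_posterior(expected_free_energy: Sequence[float], gamma: float = 1.0) -> List[float]:
    """Softmax over negative expected free energy.

    gamma is the policy precision (an inverse temperature): larger values
    concentrate probability on the policy with the lowest expected free energy.
    Hypothetical helper for illustration only.
    """
    if not expected_free_energy:
        raise ValueError("need at least one policy")
    logits = [-gamma * g for g in expected_free_energy]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]
```

With `gamma = 0.0` every policy is equally likely; as `gamma` grows, the distribution sharpens toward the lowest expected free energy.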

The Cell Lab implementation instantiates the layers as a hidden 216-state service cell driven through seven disturbance families, with depth-2 planning, six seeds, and 80 ticks per episode Class C. The committed cache is the configuration a reviewer can rerun, byte for byte.
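The run shape described above can be written down as a small configuration record. This is a sketch only: the class name `CellLabConfig` and its field names are assumptions for illustration, not the schema of the committed cache, and it assumes one episode per (family, seed) pair. Only the three disturbance families named in this article are listed; the remaining four are left out rather than guessed.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CellLabConfig:
    """Hypothetical record of the run shape stated in the text."""
    n_hidden_states: int = 216
    n_disturbance_families: int = 7
    planning_depth: int = 2
    n_seeds: int = 6
    ticks_per_episode: int = 80
    # Only the families named in the article; the other four are not listed there.
    named_families: Tuple[str, ...] = (
        "memory_leak",
        "cpu_noisy_neighbor",
        "database_flaky",
    )

    def episodes(self) -> int:
        # ASSUMPTION: one episode per (family, seed) pair.
        return self.n_disturbance_families * self.n_seeds

    def total_ticks(self) -> int:
        return self.episodes() * self.ticks_per_episode
```

Under that assumption, one controller runs 42 episodes and 3,360 ticks in total.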

The layers, and what each one tests

Read the list in order: each layer builds on the ones listed before it, and a failure at a later layer only counts if the earlier layers are intact.

Perception layer. Does the agent's posterior over hidden state track ground truth under noise? Tested by the divergence between the posterior and the (hidden) generator state across ticks Class C.
Precision layer. Do the three precision dials (sensory, transition, policy temperature) produce distinct behavioral regimes, as the preprint claims? The bifurcation map is the falsifier: if the regimes collapse, the claim fails Class E.
Planning layer. Does depth-2 expected-free-energy planning outperform depth-1 on disturbance families where the reward structure is non-myopic? A negative result here means the planning horizon does not carry the win.
Viable-set layer. Does the controller keep the cell inside its viable set under sustained disturbance? RecoveryScore is the fraction of ticks inside the set, weighted by excursion depth. This is what the top-line table reports.
Comparator layer. Does UNI beat random, rule-based SRE, and neural baselines? Not everywhere. Three of the seven families are losses. The benchmark surfaces the losses on purpose Class C.
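Two of the layer metrics above can be sketched in a few lines. The first is the perception-layer divergence for a single tick: when ground truth is one known state, the KL divergence from that one-hot target to the posterior reduces to the surprisal of the true state, in nats. The second is one possible reading of RecoveryScore. The text does not give the exact weighting, so the rule used here (full credit inside the viable set, credit reduced linearly by a normalised excursion depth outside it) is an assumption, as are both function names.

```python
import math
from typing import Sequence


def posterior_surprisal_nats(posterior: Sequence[float], true_state: int) -> float:
    """KL divergence from a one-hot ground truth to the posterior, in nats.

    For a one-hot target this equals -ln q(true_state).
    """
    q = posterior[true_state]
    if q <= 0.0:
        return math.inf  # the posterior rules out the true state
    return -math.log(q)


def recovery_score(excursion_depths: Sequence[float]) -> float:
    """Hypothetical RecoveryScore: mean per-tick credit in [0, 1].

    excursion_depths[t] is 0.0 when the cell is inside its viable set at
    tick t and a positive, normalised depth when it is outside.
    ASSUMPTION: credit per tick is 1 - depth, with depth clipped to [0, 1].
    """
    if not excursion_depths:
        raise ValueError("need at least one tick")
    credits = [1.0 - min(max(d, 0.0), 1.0) for d in excursion_depths]
    return sum(credits) / len(credits)
```

A uniform posterior over 216 states gives a surprisal of ln 216 nats, which is a useful ceiling for "the agent knows nothing yet".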

What a passing run looks like

A passing run is not a headline. It is a signed row in a committed cache: the disturbance family, the six-seed RecoveryScore for UNI and each baseline, the bootstrap 95 percent confidence interval for the median paired difference, and the tag "sig" only when that interval excludes zero. A passing run also carries a link back to the exact generative-model configuration and the exact policy-selection code path that produced it. If any of those artifacts are missing, the row is unverified Class C.
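The "sig" rule above can be checked by hand. The sketch below uses a percentile bootstrap over the per-seed paired differences; the text does not say which bootstrap variant the committed cache uses, so the percentile method, the resample count, and the names `bootstrap_median_diff_ci` and `sig_tag` are all assumptions.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_median_diff_ci(
    uni: Sequence[float],
    baseline: Sequence[float],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    rng_seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-bootstrap CI for the median of paired differences (uni - baseline)."""
    if len(uni) != len(baseline) or not uni:
        raise ValueError("need equal-length, non-empty paired scores")
    diffs = [u - b for u, b in zip(uni, baseline)]
    rng = random.Random(rng_seed)
    # Resample the paired differences with replacement and take each median.
    medians = sorted(
        statistics.median(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lo = medians[int((alpha / 2) * n_resamples)]
    hi = medians[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


def sig_tag(ci: Tuple[float, float]) -> str:
    """Return "sig" only when the interval excludes zero."""
    lo, hi = ci
    return "sig" if (lo > 0.0 or hi < 0.0) else ""
```

With only six seeds the interval is coarse, which is one reason the tag is tied to the interval rather than to the point estimate.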

What a failing run tells us

Three of the seven disturbance families are UNI losses. That is the finding, not a footnote. In memory_leak and cpu_noisy_neighbor the neural baseline outperforms UNI. In database_flaky the rule-based SRE wins. Each loss maps back to a specific layer: memory_leak stresses the generative model's temporal horizon; cpu_noisy_neighbor stresses the precision layer's ability to down-weight a nuisance channel; database_flaky rewards a hard-coded runbook over an inferred one. Reading the layer that lost is how the benchmark teaches us where the next revision has to work Class C.

Related, from outside our lab. Themesis reports a non-transformer active-inference system outperforming a large model on ARC-AGI-3 on consumer hardware: "SeedIQ Just Stomped ARC-AGI 3 on a MacBook Pro". Our honest one-line frame, in our own words: third-party evidence that non-transformer, active-inference systems can outperform large models on hard tasks. We link; we do not paraphrase the author's prose Class E.

Honesty fences on the benchmark itself

Free energy here is the variational free energy of inference, measured in nats, not a thermodynamic quantity. Autopoiesis in the Cell Lab means viable-set maintenance, not life. No consciousness claim. No AGI claim. The agent never sees the hidden state. The losses are shown, not hidden. If the committed cache disagrees with the top-line table, the cache wins.
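For reference, the variational free energy named above has a standard form in the active-inference literature. The notation below is the usual textbook one and is an assumption about symbols, not a transcription of the Cell Lab code: q(s) is the agent's approximate posterior over hidden states s, o is the observation, and p(o, s) is the generative model. Using the natural logarithm is what puts the quantity in nats.

```latex
F[q, o] = \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big]
        = D_{\mathrm{KL}}\big[q(s) \,\|\, p(s \mid o)\big] - \ln p(o)
```

Because the KL term is non-negative, F is an upper bound on surprisal, -ln p(o). Nothing in this expression refers to heat, work, or temperature.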

The pillar: benchmark and paper

How the layered benchmark and the preprint fit together, with the falsifier commitments.

Gates and falsifiers

Every claim carries a written falsifier and a gate. What triggers a retraction.

Replicate the run

Step by step: committed cache, seeds, ticks, and how to inspect the deltas yourself.

The workshop

Where we teach how to hold your own work to this standard, without shortcuts.