The Benchmark and the Paper: The Stratified Palimpsest, Universal Natural Intelligence

A benchmark is a receipt, or it is decoration. This one is a receipt. Here is what the Stratified Palimpsest measures, what our runs currently show, and where the honest gaps still sit.

What the Stratified Palimpsest is

The Stratified Palimpsest is the benchmark suite described in the Zenodo preprint by Namjoshi and colleagues (2026), developed alongside our Cell Lab work. Class E It stresses one property that most single-task benchmarks quietly ignore: layered temporal structure. Real environments are palimpsests. Old causes are written under newer ones, slow disturbances run under fast ones, and a controller has to track several timescales at once without confusing them.

Concretely, the benchmark stacks disturbance families that vary in duration and depth against a hidden-state service cell (the same 216-state cell used in the Cell Lab, cf. our science page). Class B A controller sees only sensory observations. It must infer the hidden state through a Markov blanket, plan actions that minimize expected free energy, and keep the system inside a viable set across ticks, seeds, and stratified layers.
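The inference half of that loop can be sketched as a discrete Bayesian filter over the 216 hidden states. This is a minimal illustration, not the Cell Lab code: the transition matrix, the eight observation symbols, and the likelihood table are hypothetical placeholders, and the sketch covers only state inference, not planning by expected free energy.

```python
import numpy as np

N_STATES = 216  # size of the hidden-state cell named in the text


def filter_step(belief, transition, likelihood_column):
    """One predict-then-update step of a discrete Bayesian filter.

    belief: shape (N,), current posterior over hidden states
    transition: shape (N, N), transition[i, j] = P(next = j | current = i)
    likelihood_column: shape (N,), P(observation | state) for the observation seen
    """
    predicted = belief @ transition              # prior over the next state
    unnormalized = predicted * likelihood_column
    return unnormalized / unnormalized.sum()     # posterior, sums to 1


# Hypothetical toy model: a "sticky" cell that mostly stays where it is.
rng = np.random.default_rng(0)
transition = 0.9 * np.eye(N_STATES) + 0.1 / N_STATES
# Hypothetical sensor: 8 observation symbols, one likelihood row per state.
likelihood = rng.dirichlet(np.ones(8), size=N_STATES)  # shape (216, 8)

belief = np.full(N_STATES, 1.0 / N_STATES)       # start from a flat belief
for obs in [3, 3, 5, 3]:                         # made-up observation stream
    belief = filter_step(belief, transition, likelihood[:, obs])
```

The controller never reads the hidden state directly. Everything it knows about the cell arrives through `likelihood[:, obs]`, which is the role the Markov blanket plays in the description above.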

Why layered, temporally structured tasks matter for active inference

An active-inference agent maintains a generative model over hidden states and observations, and updates approximate posteriors by minimizing variational free energy (an upper bound on surprise, in nats). When tasks are flat and single-timescale, a well-tuned Bayesian filter looks a lot like a good rule-based controller or even a small neural policy. The differences show up only when a controller has to hold precision on the slow layer while responding on the fast one, without collapsing one into the other. Class B
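The "upper bound on surprise" statement can be checked numerically on a tiny discrete model. The sketch below uses a three-state prior and likelihood that are purely hypothetical. It computes F[q] = E_q[ln q(s) - ln p(o, s)] and compares it with the surprise -ln p(o).

```python
import numpy as np


def variational_free_energy(q, prior, likelihood_column):
    """F[q] = E_q[ln q(s) - ln p(o, s)], in nats, for one discrete observation."""
    joint = prior * likelihood_column            # p(o, s) for the observed o
    return float(np.sum(q * (np.log(q) - np.log(joint))))


prior = np.array([0.5, 0.3, 0.2])                # p(s), hypothetical
likelihood_column = np.array([0.9, 0.2, 0.1])    # p(o | s), hypothetical

evidence = float(np.sum(prior * likelihood_column))    # p(o)
surprise = float(-np.log(evidence))                    # -ln p(o), in nats
exact_posterior = prior * likelihood_column / evidence

# Free energy at a poor belief (the prior) and at the exact posterior.
f_at_prior = variational_free_energy(prior, prior, likelihood_column)
f_at_posterior = variational_free_energy(exact_posterior, prior, likelihood_column)
```

Here `f_at_prior` sits above `surprise`, and `f_at_posterior` equals it up to floating-point error. The gap is the KL divergence between the belief and the exact posterior, which is why minimizing free energy pulls the belief toward that posterior.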

Parr, Pezzulo and Friston (2022) frame this as precision-weighting across a hierarchical generative model. Class E The Palimpsest is our attempt to build a receipts page for that framing: a controller that claims to do active inference should show a distinctive profile on stratified tasks, not just on single-slice ones. If a UNI controller can only match rule-based baselines under stratification, that is a real update against the theory-of-mind story we tell about it.

Our current runs, evidence-classed

The Cell Lab table on our science page reports RecoveryScore across seven disturbance families for a UNI active-inference controller against random, rule-based, and neural baselines. Class B The committed cache is depth 2, six seeds, eighty ticks. Under those conditions, the UNI controller beats the random controller in seven of seven families (significant in six), the rule-based SRE in six of seven, and the neural baseline in five of seven. Class A

It also loses three times. That is not a footnote. That is the point of the benchmark. Class F A single active-inference controller is not universally best, and a benchmark that never surfaces losses is not a benchmark, it is a poster.
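The "three" follows from the win counts above. A quick tally, assuming every family UNI does not win counts as a loss (no ties):

```python
# Win counts as stated in the text (depth 2, six seeds, eighty ticks).
FAMILIES = 7
uni_wins = {"random": 7, "rule_based": 6, "neural": 5}

# Assumption: a family that is not a win is a loss (ties are not modeled).
losses = {baseline: FAMILIES - wins for baseline, wins in uni_wins.items()}
total_losses = sum(losses.values())
```

That gives zero losses to random, one to the rule-based baseline, and two to the neural baseline.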

The Stratified Palimpsest extends this in disturbance depth and timescale mix. Deeper stratification is expected to widen the gap between controllers that maintain distinct posteriors per layer and controllers that fold all disturbances into one running estimate. Our early runs at depth 3 and depth 4 look qualitatively consistent with the depth-2 story, but we have not published the full cache, so we mark those results Class U until the seeds, tick budget, and diff-from-baseline are on disk with the others.

Gates and falsifiers

Every claim on this page has a written falsifier. That is a rule we borrowed from the Cell Lab benchmark and keep across the science front. See our companion post on gates and falsifiers for the full list, and the /transparency page for the audit posture behind them. The gates specific to the Stratified Palimpsest are:

Depth gate. If UNI does not maintain a positive median paired difference against the rule-based baseline as depth increases from 2 to 4, the "distinctive on stratification" claim is falsified. Class F
Timescale gate. If precision-weighting a slow layer consistently worsens performance across seeds (beyond seed-to-seed noise), the precision-as-upstream-variable story loses one of its named supports. Class F
Replication gate. If an independent run at depth 2, six seeds, eighty ticks cannot reproduce our reported RecoveryScore within the published confidence interval, we retract the number and publish the diff. Class F
Neural-parity gate. Where a neural baseline wins outright (memory_leak, cpu_noisy_neighbor at depth 2), we do not claim a UNI advantage. We record the loss and move on. Class F
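As an illustration, the depth gate can be written as a mechanical check. This is a sketch under stated assumptions: the function name, the dict-of-arrays layout, and the numbers in the example are placeholders chosen for readability, not values from our cache.

```python
import numpy as np


def depth_gate_holds(uni_scores, baseline_scores):
    """Return (passes, medians) for the depth gate.

    uni_scores / baseline_scores: dict mapping depth -> sequence of RecoveryScore
    values, paired by index (same seed and family at the same position).
    """
    medians = {}
    for depth in sorted(uni_scores):
        diff = np.asarray(uni_scores[depth]) - np.asarray(baseline_scores[depth])
        medians[depth] = float(np.median(diff))
    # The gate holds only if the median paired difference is positive at every depth.
    passes = all(m > 0 for m in medians.values())
    return passes, medians


# Placeholder numbers, NOT results: three pairs per depth, depths 2 to 4.
uni = {2: [0.80, 0.70, 0.90], 3: [0.70, 0.60, 0.80], 4: [0.60, 0.50, 0.40]}
rule_based = {2: [0.70, 0.60, 0.85], 3: [0.65, 0.62, 0.70], 4: [0.55, 0.45, 0.50]}

passes, medians = depth_gate_holds(uni, rule_based)
```

With these placeholders the gate passes, because the median paired difference stays positive at depths 2, 3, and 4 even though individual pairs go negative. A real check would load the committed cache instead.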

We would rather retract a number than defend it. That is the posture. Class B

What would count as replication

Replication of the Palimpsest is not a vibe. It is a recipe. A replicating run needs the same hidden 216-state service cell, the same seven disturbance families, the six committed seeds, the same tick budget, the same RecoveryScore definition (fraction of ticks inside the viable set, weighted by excursion depth), and the same paired-difference bootstrap for significance. Our companion post on how to replicate our benchmark run walks through the exact commands. If your run diverges, please publish the diverging numbers. That is more useful to us than a confirming replication.
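Two pieces of that recipe can be sketched in a few lines. Both are illustrations under assumptions, not the benchmark's code. The excursion weighting in `recovery_score` is one possible reading of "weighted by excursion depth" (each tick scores 1 inside the viable set and 1 minus a normalized excursion depth outside it). The bootstrap is a plain percentile bootstrap on the median paired difference, with a resample count and interval level picked for the example.

```python
import numpy as np


def recovery_score(excursion_depth):
    """One possible reading of RecoveryScore (an assumption, not the paper's formula).

    excursion_depth: per-tick distance outside the viable set, 0 when inside,
    normalized so that 1.0 is the worst excursion considered.
    """
    depth = np.clip(np.asarray(excursion_depth, dtype=float), 0.0, 1.0)
    return float(np.mean(1.0 - depth))


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the median paired difference a - b."""
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample pairs with replacement; each row is one bootstrap replicate.
    idx = rng.integers(0, diff.size, size=(n_resamples, diff.size))
    medians = np.median(diff[idx], axis=1)
    low, high = np.quantile(medians, [alpha / 2, 1 - alpha / 2])
    return float(np.median(diff)), float(low), float(high)


# Placeholder trace: 80 ticks, 60 inside the viable set, 20 at half depth.
score = recovery_score([0.0] * 60 + [0.5] * 20)

# Placeholder paired scores for six seeds, NOT results from our runs.
a = [0.90, 0.80, 0.85, 0.95, 0.70, 0.75]
b = [0.80, 0.75, 0.70, 0.90, 0.65, 0.60]
median_diff, ci_low, ci_high = paired_bootstrap_ci(a, b)
```

Pairing matters here: the resampling draws whole (a, b) pairs, so seed-level difficulty cancels inside each difference instead of inflating the interval.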

The guided tour of the Zenodo preprint covers the mapping between the paper's notation and the code's variable names, which is where most replication attempts stall. Class B

Third-party evidence, honestly framed

Themesis, "SeedIQ Just Stomped ARC-AGI 3 on a MacBook Pro."

Our one-line frame, in our voice: third-party evidence that non-transformer active-inference systems can outperform transformer LLMs on hard, structured tasks on modest hardware. It does not endorse UNI. It does support the wider claim that active inference is a live line worth benchmarking. Class E

Themesis, "Deep Learning, Transformers, and SeedIQ, Three Industry Breakthroughs."

Our one-line frame, in our voice: a lineage view of scaling shifts in the field. Useful as context for where active inference sits historically; we do not claim UNI has arrived at that scale or that our results are of that kind. Class E

Open questions we are not hiding

Depth 3 and depth 4 caches are not yet committed. Until they are, treat the depth-2 results as the numbers of record. Class U
The preprint is not yet peer reviewed. Cite it as a preprint. Class F
We have not yet run the Palimpsest against a strong contemporary policy-network baseline trained specifically for stratified tasks. The neural baseline in the current table is a solid but not adversarial control. Class U
Cross-domain transfer (from service-cell physics to a different hidden-state family) is genuinely open. We suspect the precision story holds. We do not yet have the receipts to say so. Class U

Why this is the pillar

Everything else on the science front hangs on this page. The labs make the math steerable. The preprint states the position. The Palimpsest is where the position gets tested against structured disturbance, and where losses have to be reported next to wins. If you want to know whether UNI is worth your attention, read this page, read the preprint, then run a seed of your own and tell us what you find.

UNI is a working hypothesis on an attainable path toward General Natural Intelligence: a natural, active-inference approach whose evidence is growing, evidence-classed, and tested in the open. Do not take the claim on faith. Test the build, inspect the gates, and help us find where it fails.