How to Replicate Our Benchmark Run, Universal Natural Intelligence

If a benchmark cannot be replicated from public artifacts, it is a story, not a measurement. This post is the recipe for reproducing the Cell Lab RecoveryScore result on your own hardware, plus a plain list of what is not yet available.

UNI is a working hypothesis about an attainable path toward General Natural Intelligence: a natural, active-inference approach whose evidence is growing, labeled by evidence class, and tested in the open. Do not take the claim on faith. Test the build, inspect the gates, and help us find where it fails.

What you are reproducing

The Cell Lab is a pre-registered falsification benchmark on a hidden 216-state service cell. A UNI active-inference controller (POMDP, planning depth 2) is compared against a random controller, a rule-based SRE, and a neural baseline across seven disturbance families, with RecoveryScore as the metric: the fraction of ticks inside the viable set, weighted by excursion depth. The committed cache uses depth 2, 6 seeds, and 80 ticks per episode. Class C
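To make the shape of that metric concrete, here is a minimal Python sketch. It is not the preprint's formula. The scalar per-tick trace, the `viable_low`/`viable_high` band, the `max_depth` normalizer, and the linear penalty are all assumptions chosen for illustration; the authoritative definition is the one in the preprint and on the benchmark page.

```python
from typing import Sequence


def recovery_score_sketch(
    trace: Sequence[float],
    viable_low: float,
    viable_high: float,
    max_depth: float,
) -> float:
    """Illustrative stand-in for RecoveryScore (NOT the preprint's formula).

    A tick inside [viable_low, viable_high] contributes 1.0. A tick outside
    contributes 1 - depth / max_depth, floored at 0, where depth is the
    distance to the nearest edge of the viable band (an assumed weighting).
    """
    if not trace:
        raise ValueError("trace must contain at least one tick")
    if max_depth <= 0:
        raise ValueError("max_depth must be positive")
    total = 0.0
    for value in trace:
        if value < viable_low:
            depth = viable_low - value
        elif value > viable_high:
            depth = value - viable_high
        else:
            depth = 0.0
        # Deeper excursions cost more, up to a full tick's worth.
        total += max(0.0, 1.0 - depth / max_depth)
    return total / len(trace)
```

Under these assumptions an episode that never leaves the band scores 1.0, and shallow excursions cost less than deep ones, which is the qualitative behavior the metric description implies.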

Foundations follow Parr, Pezzulo, and Friston (2022), Active Inference: The Free Energy Principle in Mind, Brain, and Behavior, MIT Press, for the POMDP formulation, variational free energy, and expected-free-energy policy selection. Cell-as-viability framing follows Mikkilineni (2022), DOI 10.3390/info13010024. Class E

Step 1: read the preprint and the pre-registration

Read the preprint on Zenodo: DOI 10.5281/zenodo.19785799. It is unrefereed. Cite it as such. Class E
Read The Science page for the five pre-registered claims and their falsification criteria, which were written before the runs. This is the honesty spine of the exercise.
Read the benchmark page (what Stratified Palimpsest actually tests) for the disturbance families and the viable-set definition. Class C

Step 2: run the labs in your browser (no install)

The five interactive labs are static web applications with zero backend. You can open the Cell Lab in a browser and drive an episode yourself: pick a disturbance family, a seed, and a planning depth. The dashboard prints the RecoveryScore for that episode against the three baselines. This gives you a Class B artifact you observed in your own runtime, not just ours. Class C

Step 3: drive the labs from any LLM over MCP

The deployment exposes a public, anonymous Model Context Protocol server at https://universalnaturalintelligence.com/api/mcp. Point any MCP-capable client at that URL. Sixteen tools are available. The headless subset is what you want for replication: list_labs, list_mazes, describe_dial, run_episode, run_sweep, compare_labs. Class C

A minimal sweep against the committed cache looks like this. Values match the published cache (depth 2, 6 seeds, 80 ticks):

run_sweep(
  lab="cell",
  disturbances=["traffic_spike","memory_leak","bad_deploy","database_flaky","cache_down","cpu_noisy_neighbor","observability_loss"],
  seeds=[0,1,2,3,4,5],
  ticks=80,
  planning_depth=2,
  baselines=["random","rule_based","neural"]
)

The server returns a per-family, per-seed RecoveryScore array. Compare it to the table on the Science page. If your numbers match the committed cache within bootstrap noise, you have reproduced the headline result. If they diverge, we want to know: that is a Class B disconfirmation and it matters.
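If your client wants the raw message rather than a function-style call, the sweep can be expressed as an MCP tools/call request, which is a JSON-RPC 2.0 message. The Python sketch below only builds the payload. The helper name `build_tool_call` is hypothetical, the argument names are copied from the sweep shown in this step, and the server's own tool schema remains the authority on what it accepts. Sending the message, including the initialize handshake an MCP client performs first, is left to your client.

```python
import json

# Arguments copied from the sweep in this step (committed-cache settings).
SWEEP_ARGUMENTS = {
    "lab": "cell",
    "disturbances": [
        "traffic_spike",
        "memory_leak",
        "bad_deploy",
        "database_flaky",
        "cache_down",
        "cpu_noisy_neighbor",
        "observability_loss",
    ],
    "seeds": [0, 1, 2, 3, 4, 5],
    "ticks": 80,
    "planning_depth": 2,
    "baselines": ["random", "rule_based", "neural"],
}


def build_tool_call(name: str, arguments: dict, request_id: int = 1) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical helper name; the payload is built but not sent.
payload = build_tool_call("run_sweep", SWEEP_ARGUMENTS)
```

Keeping the arguments in one dictionary makes it easy to log the exact sweep parameters alongside any result you report.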

Step 4: check the falsifiers

The pre-registered claims each ship with a falsifier. A single active-inference controller is not universally best, by design: the published table shows UNI losing three of the seven families (neural wins memory_leak and cpu_noisy_neighbor, rule-based wins database_flaky). Reproducing those losses is as important as reproducing the wins. If your sweep hides a loss the paper shows, something is wrong with your run, not with the loss. Class C
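A quick way to confirm that your sweep reproduces the three losses is sketched below in Python. The nested `mean_scores` layout and the use of the mean RecoveryScore to decide a family's winner are assumptions made for illustration; the pre-registration defines the actual comparison. Only the three family-to-winner pairs come from this post.

```python
from typing import Dict, List

# Losses stated in this post: family -> controller that beats UNI there.
PUBLISHED_LOSSES = {
    "memory_leak": "neural",
    "cpu_noisy_neighbor": "neural",
    "database_flaky": "rule_based",
}


def missing_losses(mean_scores: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the published-loss families where a sweep does not show the loss.

    mean_scores maps family -> controller -> mean RecoveryScore, with higher
    being better. This layout is an assumption, not the server's format.
    """
    problems = []
    for family, expected_winner in PUBLISHED_LOSSES.items():
        by_controller = mean_scores[family]
        # The controller with the highest mean score wins the family.
        best = max(by_controller, key=by_controller.get)
        if best != expected_winner:
            problems.append(family)
    return problems
```

An empty list means all three published losses appear in your run. A non-empty list names the families to re-check before reporting anything.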

What is available

The preprint (Zenodo DOI above). Class E
The five interactive labs (browser, no install). Class C
The public MCP server (16 tools, anonymous, no auth). Class C
The committed cache: depth 2, 6 seeds, 80 ticks, seven disturbance families, four controllers. Class C
The pre-registered claims list with falsifiers, on the Science page. Class E
Machine-readable indexes for agents: llms.txt and llms-full.txt. Class C

What is not available

Peer review. The preprint is not refereed. Layer 2 expert review is pending.
The full training pipeline for the neural baseline. What is published is the frozen policy used in the committed cache, not the training code.
Sweeps beyond the committed cache (deeper planning, more seeds, longer episodes). You can generate these from the MCP server, but we have not committed them to the cache yet.
The internal generative-model math beyond the POMDP formulation in the preprint. Some parameterizations are held back pending disclosure decisions and are not part of this replication.

Honesty fences. UNI is a brand; active inference is the science. Autopoiesis here means viable-set maintenance, not life. Free energy is variational free energy of inference (nats), not thermodynamic. No consciousness claim, no claim about general-purpose reasoning systems. The controller never sees the hidden state. Losses are shown, not hidden.

If you find a break

Email Michael.Polzin@SolutionWright.com with the seed, family, sweep parameters, and observed RecoveryScore. For a win to count as significant, the bootstrap 95% confidence interval for the median paired difference should exclude zero; if yours does not, that is signal we want. The ledger records disconfirmations the same way it records confirmations. Class C
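The interval check can be sketched in Python with the standard library alone. This uses a percentile bootstrap, which is an assumption: the preprint may use a different bootstrap variant or resample count. The function names, the default of 2000 resamples, and the pairing of scores by matching family and seed are also illustrative choices.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_median_diff_ci(
    uni: Sequence[float],
    baseline: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    rng_seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-bootstrap CI for the median of paired score differences.

    uni[i] and baseline[i] must come from the same family and seed.
    """
    if len(uni) != len(baseline) or not uni:
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [u - b for u, b in zip(uni, baseline)]
    rng = random.Random(rng_seed)  # fixed seed keeps the check repeatable
    medians = sorted(
        statistics.median(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lo_index = int((alpha / 2) * n_resamples)
    hi_index = int((1 - alpha / 2) * n_resamples) - 1
    return medians[lo_index], medians[hi_index]


def excludes_zero(ci: Tuple[float, float]) -> bool:
    """True when the whole interval lies strictly on one side of zero."""
    return ci[0] > 0 or ci[1] < 0
```

With only six seeds per family the interval will be coarse, so include the resample count and random seed you used when you report a result.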
