If a benchmark cannot be replicated from public artifacts, it is a story, not a measurement. This post is the recipe for reproducing the Cell Lab RecoveryScore result on your own hardware, plus a plain list of what is not yet available.
UNI is a working hypothesis on an attainable path toward General Natural Intelligence: a natural, active-inference approach whose evidence is growing, evidence-classed, and tested in the open. Do not take the claim on faith. Test the build, inspect the gates, and help us find where it fails.
What you are reproducing
The Cell Lab is a pre-registered falsification benchmark on a hidden 216-state service cell. A UNI active-inference controller (POMDP, planning depth 2) is compared against a random controller, a rule-based SRE, and a neural baseline across seven disturbance families, with RecoveryScore as the metric: the fraction of ticks inside the viable set, weighted by excursion depth. The committed cache uses depth 2, 6 seeds, and 80 ticks per episode. Class C
Foundations follow Parr, Pezzulo, and Friston (2022), Active Inference: The Free Energy Principle in Mind, Brain, and Behavior, MIT Press, for the POMDP formulation, variational free energy, and expected-free-energy policy selection. Cell-as-viability framing follows Mikkilineni (2022), DOI 10.3390/info13010024. Class E
Step 1: read the preprint and the pre-registration
- Read the preprint on Zenodo: DOI 10.5281/zenodo.19785799. It is unrefereed. Cite it as such. Class E
- Read The Science page for the five pre-registered claims and their falsification criteria, written before the runs. This is the honesty spine of the exercise.
- Read the benchmark page (what Stratified Palimpsest actually tests) for the disturbance families and the viable-set definition. Class C
Step 2: run the labs in your browser (no install)
The five interactive labs are static web applications with zero backend. You can open the Cell Lab in a browser and drive an episode yourself: pick a disturbance family, a seed, and a planning depth. The dashboard prints the RecoveryScore for that episode against the three baselines. This gives you a Class B artifact you observed in your own runtime, not just ours. Class C
Step 3: drive the labs from any LLM over MCP
The deployment exposes a public, anonymous Model Context Protocol server at https://universalnaturalintelligence.com/api/mcp. Point any MCP-capable client at that URL. Sixteen tools are available. The headless subset is what you want for replication: list_labs, list_mazes, describe_dial, run_episode, run_sweep, compare_labs. Class C
A minimal sweep against the committed cache uses the published values: depth 2, 6 seeds, and 80 ticks per episode.
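As a rough sketch of what such a call could look like on the wire: MCP tool invocations are JSON-RPC 2.0 requests with method `tools/call`, a tool `name`, and an `arguments` object. The tool name `run_sweep` and the values (depth 2, 6 seeds, 80 ticks) come from this post. The argument keys (`lab`, `depth`, `seeds`, `ticks`) and the lab identifier `"cell"` are assumptions for illustration, not the server's confirmed schema. Ask the server for the real input schema with `tools/list` before sending anything.

```python
import json

# Endpoint as given in this post.
MCP_URL = "https://universalnaturalintelligence.com/api/mcp"


def build_sweep_request(request_id=1):
    """Build a JSON-RPC 2.0 `tools/call` request for the run_sweep tool.

    The keys inside "arguments" are hypothetical placeholders; replace
    them with the names reported by the server's `tools/list` response.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "run_sweep",
            "arguments": {
                "lab": "cell",   # hypothetical lab identifier
                "depth": 2,      # planning depth used by the committed cache
                "seeds": 6,      # number of seeds in the committed cache
                "ticks": 80,     # ticks per episode in the committed cache
            },
        },
    }


if __name__ == "__main__":
    # Print the payload; an MCP-capable client would send it to MCP_URL.
    print(json.dumps(build_sweep_request(), indent=2))
```

Most MCP clients build this envelope for you, so in practice you only supply the tool name and arguments. The payload is shown here so you can see exactly which parameters define the sweep.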
The server returns a per-family, per-seed RecoveryScore array. Compare it to the table on the Science page. If your numbers match the committed cache within bootstrap noise, you have reproduced the headline result. If they diverge, we want to know: that is a Class B disconfirmation and it matters.
Step 4: check the falsifiers
The pre-registered claims each ship with a falsifier. A single active-inference controller is not universally best, by design: the published table shows UNI losing three of the seven families (neural wins memory_leak and cpu_noisy_neighbor, rule-based wins database_flaky). Reproducing those losses is as important as reproducing the wins. If your sweep hides a loss the paper shows, something is wrong with your run, not with the loss. Class C
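One way to make that check mechanical is sketched below. It assumes you have already reduced your sweep to one aggregate RecoveryScore per family and controller; the aggregation itself, the dictionary layout, and the controller keys (`uni`, `random`, `rule_based`, `neural`) are assumptions for illustration. The three expected non-UNI winners are the ones named above.

```python
# Families where the published table shows UNI losing, and to whom.
EXPECTED_NON_UNI_WINNERS = {
    "memory_leak": "neural",
    "cpu_noisy_neighbor": "neural",
    "database_flaky": "rule_based",
}


def winners_by_family(scores):
    """Return the top-scoring controller for each family.

    `scores` maps family name -> {controller name -> aggregate score}.
    """
    return {
        family: max(by_controller, key=by_controller.get)
        for family, by_controller in scores.items()
    }


def hidden_losses(scores):
    """List families where your run does not show the published UNI loss.

    A family is flagged when it is missing from `scores` or when its
    winner differs from the controller named in the published table.
    """
    winners = winners_by_family(scores)
    return sorted(
        family
        for family, expected in EXPECTED_NON_UNI_WINNERS.items()
        if winners.get(family) != expected
    )
```

An empty list from `hidden_losses` means your run shows all three published losses. Any family it returns is a place to re-check your sweep parameters before drawing conclusions.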
What is available
- The preprint (Zenodo DOI above). Class E
- The five interactive labs (browser, no install). Class C
- The public MCP server (16 tools, anonymous, no auth). Class C
- The committed cache: depth 2, 6 seeds, 80 ticks, seven disturbance families, four controllers. Class C
- The pre-registered claims list with falsifiers, on the Science page. Class E
- Machine-readable indexes for agents: llms.txt and llms-full.txt. Class C
What is not available
- Peer review. The preprint is not refereed. Layer 2 expert review is pending.
- The full training pipeline for the neural baseline. What is published is the frozen policy used in the committed cache, not the training code.
- Sweeps beyond the committed cache (deeper planning, more seeds, longer episodes). You can generate these from the MCP server, but we have not committed them to the cache yet.
- The internal generative-model math beyond the POMDP formulation in the preprint. Some parameterizations are held back pending disclosure decisions and are not part of this replication.
If you find a break
Email Michael.Polzin@SolutionWright.com with the seed, family, sweep parameters, and observed RecoveryScore. Bootstrap 95% confidence intervals for the median paired difference should exclude zero for a claimed significant win; if yours do not, that is signal we want. The ledger records disconfirmations the same way it records confirmations. Class C
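If you want to compute that interval yourself before writing in, here is a minimal sketch of a percentile bootstrap for the median paired difference, using only the Python standard library. It assumes scores are paired by position (for example, the same seed and family for both controllers). The number of resamples and the percentile method are illustrative choices, not necessarily what the preprint uses.

```python
import random
import statistics


def bootstrap_median_diff_ci(uni_scores, baseline_scores,
                             n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the median of paired differences.

    Scores are paired by index: uni_scores[i] and baseline_scores[i]
    must come from the same episode configuration.
    """
    if not uni_scores or len(uni_scores) != len(baseline_scores):
        raise ValueError("need two non-empty score lists of equal length")

    diffs = [u - b for u, b in zip(uni_scores, baseline_scores)]
    rng = random.Random(seed)  # fixed seed so the interval is repeatable

    # Resample the paired differences with replacement and record
    # the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )

    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return medians[lo_idx], medians[hi_idx]


def excludes_zero(ci):
    """True when the whole interval lies strictly on one side of zero."""
    lo, hi = ci
    return lo > 0 or hi < 0
```

Pass in your per-episode RecoveryScores for UNI and one baseline. If `excludes_zero` is false for a win the paper claims as significant, include the interval in your email.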