What the Benchmark Does Not Measure

A benchmark is only as honest as its silences. This one has real silences, and we would rather name them than let a good RecoveryScore do work it was not designed to do.

Why "does not measure" deserves its own page

The Stratified Palimpsest is a receipts page for one specific claim: that an active-inference controller with precision-weighted layers can hold a service cell inside a viable set under stratified disturbance. That is a narrow claim on purpose. Narrow claims are the ones a benchmark can actually adjudicate. The wider a claim gets, the more silently a benchmark starts to lie for you. Class E

So this post is the inverse of the pillar. Instead of walking through what RecoveryScore captures under six seeds at eighty ticks, it walks through the parts of the world our number is silent about, and the claims we therefore refuse to make from it. If you catch us stretching, please point at this page. Class C

Named gaps, one at a time

Consciousness, phenomenology, subjective experience

RecoveryScore is the fraction of ticks a controller keeps the hidden-state cell inside a viable set, weighted by excursion depth. Nothing in that definition, and nothing in the underlying POMDP, touches phenomenal experience, qualia, or first-person report. An agent that scores well is a controller that infers a hidden Markov process well enough to act on it. A high score is not evidence, in either direction, about whether there is anything it is like to be the controller. Class E
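To make the point concrete, here is a minimal sketch of a depth-weighted viable-set fraction. It is not the benchmark's actual scoring code. The function name `recovery_score`, the `max_depth` cap, and the linear weighting are all assumptions chosen for illustration; the text above only says that the fraction is "weighted by excursion depth." The thing to notice is that the only inputs are per-tick distances from the viable set, so nothing about experience can enter the number.

```python
from typing import Sequence


def recovery_score(depths: Sequence[float], max_depth: float) -> float:
    """Hypothetical depth-weighted viable-set fraction (illustrative only).

    depths[t] is how far outside the viable set the cell sits at tick t,
    with 0.0 meaning "inside". Each tick earns full credit (1.0) inside
    the set and linearly less credit as the excursion deepens, reaching
    0.0 once the depth hits max_depth. The linear weighting is an
    assumption, not the published definition.
    """
    if not depths:
        raise ValueError("need at least one tick")
    if max_depth <= 0:
        raise ValueError("max_depth must be positive")
    credit = [1.0 - min(max(d, 0.0), max_depth) / max_depth for d in depths]
    return sum(credit) / len(credit)


# Toy 8-tick episode (made-up numbers): six ticks inside the viable set,
# one shallow excursion and one excursion at the depth cap.
episode = [0.0, 0.0, 0.5, 0.0, 2.0, 0.0, 0.0, 0.0]
score = recovery_score(episode, max_depth=2.0)
print(score)  # 0.84375
```

Under this sketch, the shallow excursion costs a quarter of a tick and the deep one costs a full tick, which is the sense in which depth weights the fraction.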

Clinical outcomes, therapy, and human wellbeing

The service cell is a synthetic 216-state process modeled after infrastructure dynamics, not a human being. A win on the Palimpsest is not evidence that active inference improves clinical outcomes, changes symptom trajectories, or belongs anywhere near a treatment pathway. Our behavioral labels remain candidate computational phenotypes, hypotheses under a modeling lens. See the "measured accuracy" note on the science page. Class C

Cross-domain generalization

All committed runs stay inside the service-cell family. The benchmark is silent about transfer to a physically different hidden-state process, for example a cardio-renal loop of the sort the Heart Lab renders. We suspect the precision story survives transfer, based on the shape of the free-energy decomposition, but suspicion is not a receipt. Until we publish a cross-domain cache with paired-difference bootstraps, treat transfer as unmeasured. Class C
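For readers who want to know what a paired-difference bootstrap would look like once a cross-domain cache exists, here is a minimal sketch using only the Python standard library. It is a generic percentile bootstrap over per-seed score differences, not the project's analysis code. The function name `paired_bootstrap_ci`, its parameters, and the six placeholder scores are assumptions for illustration, not benchmark results.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    a: Sequence[float],
    b: Sequence[float],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile bootstrap CI for the mean of paired differences a[i] - b[i].

    a and b must be scores for two controllers on the SAME seeds, in the
    same order. Returns (mean difference, lower bound, upper bound).
    """
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    # Resample the differences (not the raw scores) to keep the pairing.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return sum(diffs) / n, means[lo_idx], means[hi_idx]


# Placeholder per-seed scores for six seeds (made up, NOT benchmark data).
controller = [0.81, 0.78, 0.84, 0.80, 0.79, 0.83]
baseline = [0.74, 0.77, 0.75, 0.76, 0.73, 0.78]
point, lo, hi = paired_bootstrap_ci(controller, baseline)
print(f"mean diff = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Resampling the differences rather than each controller's scores separately is what makes the test paired: seed-to-seed variation that hits both controllers cancels before the interval is computed. With only six seeds the interval is coarse, which is one more reason to treat transfer as unmeasured until the cache is published.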

Adversarial and non-stationary environments

The disturbance families are stochastic, layered, and stratified, but they are not adversarial. Nothing inside the benchmark is trying to steer the controller into a failure mode. In real production systems, something sometimes is. Nothing in RecoveryScore lets us claim anything about security posture, red-team resilience, or robustness when the disturbance distribution shifts outside the seven committed families. Class E

Sample efficiency, wall-clock cost, energy budget

We report a viable-set fraction. We do not report tokens per tick, seconds per episode, joules per seed, or the cost of the sweep across baselines. A controller that wins on RecoveryScore at ten times the wall-clock cost of a rule-based baseline is not, for most operational settings, a controller you would deploy. The Palimpsest is silent on that trade-off. Efficiency deserves its own receipts, and we owe them. Class C

Peer review status

The Zenodo preprint is not yet peer reviewed. A benchmark run being replicable is not the same as a paper being refereed, and we do not conflate the two. Cite it as a preprint. Class E

What this means for our claim language

Because the benchmark is silent on those items, we hold our language inside its actual footprint. UNI is a working hypothesis on an attainable path toward General Natural Intelligence, natural not artificial. The Stratified Palimpsest is one of the receipts we point to, not all of them. It supports "our active-inference controller shows a distinctive profile on stratified service-cell disturbance under our committed conditions." It does not support "UNI generalizes," "UNI is safe in production," or any sentence that reaches past the cell family and the six seeds. Class C

How this fits with the wider gate list

The falsifiers named on the pillar (the depth gate, the timescale gate, the replication gate, and the neural-parity gate) cover the claims the benchmark is built to test. The gaps on this page are different. They are claims the benchmark is not built to test at all, so no gate on it can either confirm or falsify them. For the full standing list of falsifiers across the science front, see "What Would Falsify UNI, a Standing List." Class C

Read next

The Benchmark and the Paper: The Stratified Palimpsest
The pillar page for what the benchmark does measure, with the current evidence-classed table.
What the Stratified Palimpsest Actually Tests
The layered-disturbance mechanics behind RecoveryScore, in plain terms.
What Would Falsify UNI, a Standing List
The cross-cluster falsifiers, kept honest across the science front.
The Science Page
The current Cell Lab table, the honesty fences, and the measured-accuracy note in one place.