Layered Temporal Tasks and Why They Matter

By Michael Polzin. Published 2026-07-01. Working hypothesis on an attainable path toward General Natural Intelligence, natural not artificial.

A task has layered temporal structure when what happens on one timescale only makes sense if you also track a slower, hidden variable on another. Reactive systems collapse the two layers. Inference-based agents keep them apart. The benchmark is built so that difference shows up in the score, not in the marketing copy.

Consider a service cell whose viable set drifts over hours while individual requests arrive in milliseconds. A memory leak is a slow story. A traffic spike is a fast story. Any controller that only reads the fast channel will confuse the two, treat drift as noise, and land outside the viable set (Class C). That failure mode is not a bug in the controller. It is a mismatch between the controller's generative model and the world's temporal factorization.

Active inference frames the fix cleanly. A hierarchical generative model factorizes hidden states across timescales so that slow states supply priors to fast states, and fast observations update posteriors over both (Class E). Parr, Pezzulo and Friston (2022) develop this factorization for discrete-time POMDPs, with slow policies indexing over sequences of fast policies and free-energy minimization propagating up and down the hierarchy. The math is standard. Getting a benchmark to actually test it is the harder move.
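The factorization described above can be sketched as a two-timescale Bayesian filter. This is a minimal toy, not the benchmark's controller or the model in Parr, Pezzulo and Friston (2022): the state names, the transition matrix, and every probability below are illustrative assumptions. It shows only the shape of the computation, in which the slow state supplies the prior over the fast state and each fast observation updates the posterior over both.

```python
# Toy two-timescale filter. All names and numbers are illustrative assumptions.
SLOW = ("stable", "leaking")    # slow hidden regime
FAST = ("healthy", "degraded")  # fast hidden state
OBS = ("ok", "alarm")           # per-tick observation

# P(slow_t | slow_{t-1}), indexed [previous][next]: regimes are sticky.
T_SLOW = [[0.99, 0.01],
          [0.01, 0.99]]

# The slow state supplies the prior over the fast state: P(fast | slow).
PRIOR_FAST = [[0.9, 0.1],   # stable
              [0.6, 0.4]]   # leaking

# Likelihood of the observation given the fast state: P(obs | fast).
LIK = [[0.95, 0.05],   # healthy
       [0.30, 0.70]]   # degraded


def step(belief_slow, obs):
    """One filtering step; returns (posterior over slow, posterior over fast)."""
    # Predict the slow state one tick ahead.
    pred = [sum(T_SLOW[p][s] * belief_slow[p] for p in range(2))
            for s in range(2)]
    # Joint posterior over (slow, fast) given the new observation.
    joint = [[pred[s] * PRIOR_FAST[s][f] * LIK[f][obs] for f in range(2)]
             for s in range(2)]
    z = sum(sum(row) for row in joint)
    joint = [[v / z for v in row] for row in joint]
    # Marginalize to get a posterior for each layer.
    post_slow = [sum(joint[s]) for s in range(2)]
    post_fast = [joint[0][f] + joint[1][f] for f in range(2)]
    return post_slow, post_fast


def run(observations, belief_slow=(0.5, 0.5)):
    """Filter a whole observation stream; returns the final posteriors."""
    belief, fast = list(belief_slow), None
    for o in observations:
        belief, fast = step(belief, o)
    return belief, fast
```

Calling `run([1] * 10)` feeds ten consecutive alarms and moves the slow posterior toward "leaking", while `run([0] * 10)` moves it toward "stable". A reactive controller that keeps only `post_fast` discards exactly the quantity that separates the two cases.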

Why layered structure is the harder test

A flat task rewards any policy that predicts the next observation well. A layered task punishes a policy that does exactly that but ignores the slower regime shifting underneath it. The reason is aliasing: two very different hidden states can produce nearly identical short-run observations, and only the slow layer distinguishes them (Class E, after Parr, Pezzulo and Friston 2022, §2.8).
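A small numerical sketch makes the aliasing point concrete. It assumes two hypothetical regimes, A and B, that emit Gaussian observations with the same spread and slightly different means; the means, the spread, and the function names are illustrative and are not taken from the benchmark. A single tick barely moves the posterior, but the log-odds accumulate across ticks, which is what carrying the slow layer buys.

```python
import math


def gaussian_llr(x, mu_a, mu_b, sigma):
    """log p(x | B) - log p(x | A) for two equal-variance Gaussians."""
    return ((x - mu_a) ** 2 - (x - mu_b) ** 2) / (2.0 * sigma ** 2)


def posterior_b(stream, mu_a=0.0, mu_b=0.2, sigma=1.0, prior_b=0.5):
    """Posterior probability of regime B after each observation in stream."""
    log_odds = math.log(prior_b / (1.0 - prior_b))
    trace = []
    for x in stream:
        # Each tick adds a small amount of evidence to the running log-odds.
        log_odds += gaussian_llr(x, mu_a, mu_b, sigma)
        trace.append(1.0 / (1.0 + math.exp(-log_odds)))
    return trace
```

With the defaults, an observation sitting exactly at regime B's mean contributes a log-likelihood ratio of 0.02, so one tick leaves the posterior close to 0.5. Several hundred such ticks are needed before the posterior becomes decisive. An agent that resets its belief every tick never leaves the aliased zone.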

That is what makes layered temporal tasks a genuine discriminator. Reactive baselines can fit the fast channel. Neural baselines can memorize joint patterns given enough data. An inference-based agent has to earn its win by carrying a coherent posterior across both layers and using it to select policies whose expected free energy decomposes into an epistemic slow-layer term and a pragmatic fast-layer term.
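For reference, the decomposition mentioned above can be written out. The first line is the standard form of expected free energy for a policy \pi in the active inference literature, with C denoting prior preferences over outcomes. The second line is only a schematic reading of this article's claim, in which the information-gain term is attached to the slow hidden state and the preference term to the fast observations. The superscripts are notation introduced here as an assumption, not a formula taken from the paper.

```latex
% Standard form: negative expected information gain plus negative expected log preference.
G(\pi) = -\,\mathbb{E}_{Q(o \mid \pi)}\!\left[ D_{\mathrm{KL}}\!\left[ Q(s \mid o, \pi) \,\|\, Q(s \mid \pi) \right] \right]
         \;-\; \mathbb{E}_{Q(o \mid \pi)}\!\left[ \ln P(o \mid C) \right]

% Schematic layered reading (illustrative assumption):
G(\pi) \approx
  \underbrace{-\,\mathbb{E}_{Q(o \mid \pi)}\!\left[ D_{\mathrm{KL}}\!\left[ Q(s^{\mathrm{slow}} \mid o, \pi) \,\|\, Q(s^{\mathrm{slow}} \mid \pi) \right] \right]}_{\text{epistemic, slow layer}}
  \;\underbrace{-\,\mathbb{E}_{Q(o \mid \pi)}\!\left[ \ln P(o^{\mathrm{fast}} \mid C) \right]}_{\text{pragmatic, fast layer}}
```

Policies are selected to minimize G, so a policy scores well when it is expected both to resolve uncertainty about the slow regime and to keep fast observations within the preferred range.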

How the Stratified Palimpsest is designed around it

The Stratified Palimpsest benchmark is built so the layers cannot be collapsed by luck (Class C). The 216-state service cell hides a slow regime variable (which disturbance family is active) behind a fast observation stream (per-tick health signals). Seven disturbance families produce overlapping short-run signatures, so a controller that only reads the fast stream will alias them. RecoveryScore weights excursion depth so that brief blips are cheap but sustained drift is not, which forces any winning policy to represent the slow variable explicitly.
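The exact RecoveryScore formula is not given in this section, so the following is not RecoveryScore. It is a hypothetical stand-in, `toy_excursion_cost`, that illustrates one way a depth-weighted measure can make brief blips cheap and sustained drift expensive: sum, over ticks, the distance by which a signal sits outside an assumed viable band. The band limits and the example signals are assumptions chosen for illustration.

```python
def toy_excursion_cost(signal, lower, upper):
    """Sum, over ticks, of how far the signal sits outside [lower, upper].

    Hypothetical illustration only; not the benchmark's RecoveryScore.
    """
    cost = 0.0
    for x in signal:
        if x > upper:
            cost += x - upper      # depth above the viable band
        elif x < lower:
            cost += lower - x      # depth below the viable band
    return cost


# A one-tick blip and a slow drift that both peak 2.0 above the band.
blip = [0.5] * 20 + [3.0] + [0.5] * 20
drift = [1.0 + 0.05 * t for t in range(41)]

blip_cost = toy_excursion_cost(blip, lower=0.0, upper=1.0)
drift_cost = toy_excursion_cost(drift, lower=0.0, upper=1.0)
```

Both signals reach the same peak depth, yet the drift accumulates a far larger cost because it stays outside the band for many ticks. Under any measure of this shape, a policy cannot score well by reacting to peaks alone. It has to notice that the excursion is persisting.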

The design is deliberately falsifiable. UNI does not win all seven families. On memory_leak the neural baseline scores higher, and on database_flaky the rule-based SRE wins (Class B, see the science page table). Those losses are the point: a benchmark that only surfaces wins is not measuring layered structure, it is measuring authorship.

A third-party data point

Themesis, a third party, recently reported that a non-transformer active-inference system outperformed LLM baselines on ARC-AGI 3 tasks while running on a laptop, under the headline "SeedIQ Just Stomped ARC-AGI 3 on a MacBook Pro". Our read, in our own words: this is third-party evidence that non-transformer active-inference systems can outperform LLMs on hard tasks (Class E). We do not paraphrase the author's prose and we do not claim endorsement. We link, we frame the result in our own words, and we let the reader check the source.

What this does not yet prove

Layered temporal structure being hard for reactive systems does not prove active inference is the correct theory of mind, and a benchmark win in a service-cell world does not extend automatically to other domains. The commitment is narrower and more testable: on this specific benchmark, with pre-registered claims and published failure cases, a hierarchical active-inference controller carries slow-layer state where flat baselines cannot. That claim is falsifiable, and the cache is open for anyone who wants to try.

The benchmark: what Stratified Palimpsest actually tests

The full walk-through of the seven disturbance families, the viable set, and RecoveryScore.

Factorization, time, and hierarchy in generative models

How slow priors and fast likelihoods combine in a hierarchical POMDP, and why it matters for planning.

The benchmark and the paper: the Stratified Palimpsest

Where the benchmark sits in the paper, and how the pre-registered claims map to the runs.

The Science page

The preprint, the labs, the benchmark table, and the public MCP server that lets any LLM drive the runs.

Evidence classes used above: A empirical-in-session, B code or inspection, C configuration or integration, E expert citation, F falsifier present, U unverified. UNI is a working hypothesis on an attainable path toward General Natural Intelligence: a natural, active-inference approach whose evidence is growing, evidence-classed, and tested in the open. Do not take the claim on faith. Test the build, inspect the gates, and help us find where it fails.