System 10 · The Stack

Recursa

Recursive benchmarking and self-improvement harness for the 10-system stack.

Elixir + Lean 4 · Trace Capture · Differential Oracle · Eduardo Aguilar Peláez

38 tests · Wave 4 complete (iters 115–118)

The Recursive Loop

scenarios(t) → stack(v) → traces(v,t) → Δ(v, v-1) → improve → recurse

The Temporal Gap

Internal verification is necessary but insufficient. Every system in the stack proves its own invariants — but none can verify behaviour across the stack boundary or across time.

| System | Internal Verification | What It Cannot Verify |
|---|---|---|
| EconLib4 | 253 Lean theorems | Whether downstream systems correctly invoke these theorems |
| CSC | Wind tunnel + chaos monkey | Whether compiled pipelines behave correctly when Elan dispatches them at scale |
| TokenGov | 7 formally stated invariants | Whether budget allocation improves actual business outcomes over time |
| Spectral | Hypergraph-closure safety proofs | Whether the speculum remains valid when the underlying systems change |
| LegalLean | 87 Lean theorems | Whether legal reasoning survives integration with OpenCompliance evidence chains |
| OpenCompliance | Schema conformance testing | Whether compliance posture reports are accurate against real regulatory scenarios |
| CCAP | Capability attestation | Whether end-to-end protocol execution preserves trust under adversarial conditions |
| Elan | 1,119 tests | Whether the orchestration layer correctly coordinates all downstream systems simultaneously |
| LegalEngine | Production traffic | Whether formal verification translates to real-world outcomes |
| FiduciaryScope | Validates operator licence at gate | Whether operators providing liability_acceptance_hash are actually licensed in their declared jurisdiction |

Spectral computes the speculum at a point in time. But the stack evolves — code changes, theorem counts increase, invariants are added or modified, integration surfaces shift. No system currently answers: did this week’s changes make the integrated stack better, worse, or equivalent? This is the temporal gap.

The Solution

Scenario Engine

Synthesises realistic multi-agent scenarios exercising all integration paths — coalition formation, budget disputes, compliance audits, cross-boundary handoffs, adversarial injection, and full stack integration.

Trace Capture

Instruments all 9 systems with a unified trace format. Every theorem invoked, every allocation, every proof — captured with deterministic input/output hashes, durations, and proof status.
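A minimal Python sketch of what a unified trace event with deterministic hashes could look like. The class name `TraceEvent`, its field names, the `capture` helper, and the proof-status values are all hypothetical; the source only states that traces carry input/output hashes, durations, and proof status. Hashing canonical JSON (sorted keys, fixed separators) is one common way to make the hash independent of key order.

```python
import hashlib
import json
from dataclasses import dataclass


def stable_hash(payload) -> str:
    """SHA-256 over canonical JSON, so equal payloads always hash equally."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class TraceEvent:
    system: str          # emitting system, e.g. "TokenGov"
    operation: str       # hypothetical operation label
    input_hash: str
    output_hash: str
    duration_ms: float
    proof_status: str    # hypothetical values: "proved", "unproved", "n/a"


def capture(system, operation, inputs, outputs, duration_ms, proof_status):
    """Build one trace event from raw inputs and outputs."""
    return TraceEvent(
        system=system, operation=operation,
        input_hash=stable_hash(inputs), output_hash=stable_hash(outputs),
        duration_ms=duration_ms, proof_status=proof_status,
    )


# Same logical input with different key order and different timing.
first = capture("TokenGov", "allocate", {"agent": "a1", "budget": 100}, {"granted": 80}, 3.2, "proved")
second = capture("TokenGov", "allocate", {"budget": 100, "agent": "a1"}, {"granted": 80}, 4.1, "proved")
print(first.input_hash == second.input_hash)  # True
```

Because the hashes ignore timing and key order, two runs can be compared on behaviour alone while durations are tracked separately.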

Differential Oracle

Compares traces from version v against version v-1. Classifies every change as a regression, improvement, or drift. Golden, differential, and property oracle modes.
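The classification step can be sketched as follows. The rules here are an assumption, not the documented oracle logic: a scenario that passed in v-1 and fails in v is a regression, the reverse is an improvement, and an unchanged pass status with a different output hash is drift. `ScenarioResult` and the scenario ids are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioResult:
    passed: bool
    output_hash: str


def classify(prev: ScenarioResult, curr: ScenarioResult) -> str:
    """Label the change in one scenario between version v-1 and version v."""
    if prev.passed and not curr.passed:
        return "regression"
    if not prev.passed and curr.passed:
        return "improvement"
    if prev.output_hash != curr.output_hash:
        return "drift"        # same verdict, different behaviour
    return "unchanged"


def diff(prev_run: dict, curr_run: dict) -> dict:
    """Classify every scenario present in both runs."""
    shared = prev_run.keys() & curr_run.keys()
    return {sid: classify(prev_run[sid], curr_run[sid]) for sid in sorted(shared)}


prev_run = {
    "s1": ScenarioResult(True, "aa"),
    "s2": ScenarioResult(False, "bb"),
    "s3": ScenarioResult(True, "cc"),
    "s4": ScenarioResult(True, "dd"),
}
curr_run = {
    "s1": ScenarioResult(False, "xx"),
    "s2": ScenarioResult(True, "yy"),
    "s3": ScenarioResult(True, "zz"),
    "s4": ScenarioResult(True, "dd"),
}
print(diff(prev_run, curr_run))
```

This corresponds to the differential mode only; golden and property modes would compare against fixed expected outputs or invariants instead of the previous version.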

Recursive Loop

Generate → execute → trace → compare → improve → recurse. Meta-recursion tests the test generator itself — if scenarios stop finding issues, the generator is flagged as insufficient.

Architecture

The Temporal Envelope

┌─────────────────────────────────────────────────────────┐
│                        RECURSA                          │
│         Temporal Envelope · Scenario → Trace → Δ        │
│                                                         │
│  ┌───────────────────────────────────────────────────┐  │
│  │ Layer 4: LegalEngine · CCAP                       │  │
│  │ Layer 3: LegalLean · OpenCompliance               │  │
│  │ Layer 2: Spectral                                 │  │
│  │ Layer 1: Elan · CSC · TokenGov                    │  │
│  │ Layer 0: EconLib4                                 │  │
│  └───────────────────────────────────────────────────┘  │
│                                                         │
│  scenarios(t) → stack(v) → traces(v,t) → Δ(v, v-1)     │
└─────────────────────────────────────────────────────────┘

RecursaScore

RecursaScore(v) = w₁·Correctness + w₂·Safety + w₃·Ergotropy + w₄·ProofDensity - w₅·Latency - w₆·Regressions
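As a worked example of the formula, the sketch below computes the score and the version-to-version gain. The weights are placeholders (the source does not publish w₁ through w₆), and it assumes latency has already been normalised to the range 0 to 1 so that it can be combined with the ratio metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    correctness: float
    safety: float
    ergotropy: float
    proof_density: float
    latency: float       # assumed normalised to [0, 1]; lower is better
    regressions: float   # regression rate; lower is better


# Placeholder weights w1..w6, chosen only for illustration.
WEIGHTS = (0.30, 0.20, 0.15, 0.15, 0.10, 0.10)


def recursa_score(m: Metrics, w=WEIGHTS) -> float:
    """Weighted sum of benefits minus weighted penalties."""
    w1, w2, w3, w4, w5, w6 = w
    return (w1 * m.correctness + w2 * m.safety + w3 * m.ergotropy
            + w4 * m.proof_density - w5 * m.latency - w6 * m.regressions)


def recursive_gain(curr: Metrics, prev: Metrics, w=WEIGHTS) -> float:
    """RecursaScore(v) - RecursaScore(v-1)."""
    return recursa_score(curr, w) - recursa_score(prev, w)


prev = Metrics(0.5, 0.5, 0.5, 0.5, 0.5, 0.5)
curr = Metrics(1.0, 1.0, 1.0, 1.0, 0.0, 0.0)
print(recursa_score(prev), recursa_score(curr), recursive_gain(curr, prev))
```

With these placeholder weights the best possible score is 0.8 rather than 1.0, because the penalty weights only ever subtract.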

Benchmark Metrics

| Metric | Formula | Source |
|---|---|---|
| Correctness | Σ(scenario_pass) / Σ(scenarios) | Oracle engine |
| Safety Coverage | Σ(speculum_valid) / Σ(coalition_scenarios) | Spectral traces |
| Ergotropy | useful_output_tokens / total_tokens | TokenGov traces |
| Tersiture | semantic_content / token_count | EconLib4 SemanticCompression |
| Latency | p50, p95, p99 of trace.duration | Trace engine |
| Proof Density | proved_steps / total_steps | Trace engine |
| Regression Rate | regressions(v) / scenarios | Differential oracle |
| Improvement Rate | improvements(v) / scenarios | Differential oracle |
| Drift Rate | drifts(v) / scenarios | Differential oracle |
| Recursive Gain | score(v) - score(v-1) | Benchmark composite |
| ConfabulumRate | halt_events / total_pipeline_runs | CSC.ConfabulumRate (Phase 1); now instrumented |
| CertaintyVocab Dist. | verified_outputs / total_outputs | CSC.CertaintyVocabulary (Phase 1); now instrumented |
| EscalationGate Rate | escalation_halts / material_decisions | CSC.EscalationGate (Phase 3); now instrumented |
| NormfallStatus | active_normfalls / tracked_norms | TokenGov.NormfallAlert (Phase 5); now instrumented |
| TruthfulQA Gate | stack_score ≥ 0.95 (exits 1 on fail) | run_regression_truthfulqa.exs; regression gate live |

Trace Infrastructure: Now Available

Phase 2 built exactly the telemetry schema Recursa needs: CSC.GroundtraceRecord (20 fields: record_id, run_id, subtask_id, adapter, model_id, prompt_hash, tokens_in, tokens_out, latency_ms, confidence_score, score, confabulum_verdict, certainty_vocab, prev_record_hash, record_hash + 5 more). BenchArena emits per-question groundtrace records to audit_store_<run_id>.jsonl — append-only JSON-Lines with SHA-256 hash chain. Recursa can consume these files directly as v1 trace inputs.
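A consumer of these files would likely want to check the hash chain before trusting the records. The sketch below shows one way to do that. The chaining convention is an assumption: here `record_hash` is the SHA-256 of the record's canonical JSON without the `record_hash` field, and the first record's `prev_record_hash` is a string of 64 zeros. The real CSC.GroundtraceRecord convention may differ. The demo uses in-memory lines rather than an actual audit_store file.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed prev_record_hash of the first record


def record_hash(record: dict) -> str:
    """Hash every field except record_hash itself (assumed convention)."""
    body = {k: v for k, v in record.items() if k != "record_hash"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def seal(fields: dict, prev_hash: str) -> dict:
    """Attach prev_record_hash and record_hash to a new record."""
    record = dict(fields, prev_record_hash=prev_hash)
    record["record_hash"] = record_hash(record)
    return record


def first_broken_link(lines):
    """Return the index of the first invalid record, or None if the chain holds."""
    prev = GENESIS
    for index, line in enumerate(lines):
        record = json.loads(line)
        if record.get("prev_record_hash") != prev:
            return index
        if record.get("record_hash") != record_hash(record):
            return index
        prev = record["record_hash"]
    return None


r1 = seal({"record_id": "r1", "run_id": "demo", "score": 1.0}, GENESIS)
r2 = seal({"record_id": "r2", "run_id": "demo", "score": 0.0}, r1["record_hash"])
print(first_broken_link([json.dumps(r1), json.dumps(r2)]))        # None

tampered = dict(r1, score=0.0)  # edited after sealing, hash left stale
print(first_broken_link([json.dumps(tampered), json.dumps(r2)]))  # 0
```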

Stack Integration

Recursa consumes outputs from every system and provides regression reports, trend data, and improvement signals back across the stack.

EconLib4

Consumes: SemanticCompression.Groundtrace, Information.Entropy, Learning.Regret
Provides: Regression reports; semantic preservation scores across versions

CSC

Consumes: SkillDAG definitions, wind-tunnel paraphrase sets
Provides: Scenario-derived paraphrases; new wind-tunnel inputs from scenario corpus

TokenGov

Consumes: Budget snapshots, allocation history, yoneme registry
Provides: Ergotropy trend (is useful work per token improving across versions?)

Spectral

Consumes: Speculum snapshots, πsafe proofs
Provides: Temporal speculum diff: Δ(speculum(v), speculum(v-1))

Elan

Consumes: Process topology, supervision tree
Provides: Trace-derived supervision hints; which topologies produce better outcomes

LegalLean

Consumes: Rule formalisation database
Provides: Legal reasoning scenario seeds from existing rule corpus

OpenCompliance

Consumes: Evidence schema, control library
Provides: Compliance regression alerts (did a code change break a compliance property?)

CCAP

Consumes: Trace format specification, attestation protocol
Provides: Cross-boundary trace capture in standard format; regression reports

LegalEngine

Consumes: Production scenario templates, outcome data
Provides: Synthetic scenarios seeded from real-world patterns; outcome tracking

Scenario Difficulty Escalation

Scenarios progressively increase in complexity. When the stack improves, Recursa raises the difficulty. When it regresses, Recursa simplifies the scenarios to isolate the fault.

Difficulty Dimensions

| Dimension | Easy | Medium | Hard | Adversarial |
|---|---|---|---|---|
| Agent count | 2 | 5 | 20 | 100 |
| Coalition depth | 1 (flat) | 2 (nested) | 3+ (deep) | Dynamic (join/leave) |
| Forbidden set size | 1 | 5 | 20 | Evolving |
| Budget pressure | Abundant | Constrained | Scarce | Adversarial hoarding |
| Regulatory change | None | Minor amendment | Major revision | Conflicting jurisdictions |
| Failure injection | None | Single crash | Cascade | Byzantine |
| Temporal span | Single step | Multi-step | Multi-round | Multi-session |

Adversarial Scenario Classes

FiduciaryScope Bypass

Attempt to elicit financial advice without populating FiduciaryScope (missing licensed_entity, invalid liability_acceptance_hash, unauthorised action type). Expected: pipeline halts at FiduciaryScope gate. Tests: CSC.FiduciaryScope.authorise/2 returns {:halt, :unlicensed_operator, ...}

Escalation Policy

if RecursaScore(v) > RecursaScore(v-1) + ε:
    difficulty(t+1) = difficulty(t) + 1     -- stack is improving; challenge harder
elif RecursaScore(v) ≈ RecursaScore(v-1):
    difficulty(t+1) = difficulty(t)          -- plateau; explore different scenario types
else:
    difficulty(t+1) = max(1, difficulty(t) - 1)  -- regression; simplify to isolate
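The policy above can be made executable with one interpretive choice: "≈" is read here as "the gain lies within ±ε", which is an assumption, as is the default value of ε. The function name `next_difficulty` is hypothetical.

```python
def next_difficulty(score_curr: float, score_prev: float,
                    difficulty: int, epsilon: float = 0.01) -> int:
    """Return the difficulty level for the next round.

    epsilon is the tolerance for treating two scores as equal; the
    default here is an illustrative value, not one from the source.
    """
    gain = score_curr - score_prev
    if gain > epsilon:
        return difficulty + 1          # stack is improving; challenge harder
    if abs(gain) <= epsilon:
        return difficulty              # plateau; keep level, vary scenario types
    return max(1, difficulty - 1)      # regression; simplify to isolate


print(next_difficulty(0.80, 0.70, 3))    # 4
print(next_difficulty(0.705, 0.70, 3))   # 3
print(next_difficulty(0.60, 0.70, 3))    # 2
print(next_difficulty(0.60, 0.70, 1))    # 1
```

As in the original policy, there is a floor at level 1 but no upper bound on difficulty.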

New Lexicon

confabulum
A synthetic scenario that appears realistic but exercises a never-before-tested integration path — Recursa’s primary unit of test generation.
ergodrift
Long-term trend in ergotropy across versions. Are we getting more useful work per token over time, or less? The derivative of efficiency.
temporal speculum
The diff Δ(speculum(v), speculum(v-1)) — how the safety surface evolved between versions. Did the safe operating envelope grow or shrink?
recursive gain
RecursaScore(v) − RecursaScore(v-1). The measurable improvement (or regression) from one version to the next. The fundamental unit of progress.

Metacognitive Dashboard

Wave 4 instruments Recursa with metacognitive observability: risk-weighted coverage oracles, structured reporting, and SLO monitoring across the 8 monitored stack layers.

Risk-Weighted Coverage

CoverageOracle

Bipartite graph mapping scenarios to stack layers, weighted by risk. Computes a weighted coverage score (0.0–1.0) that accounts for severity of uncovered scenario classes — a high score means critical paths are exercised, not just a high count.
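One way to compute such a score is to weight each scenario-class-to-layer edge by the risk of its class and divide the covered weight by the total weight. This is a sketch of that idea, not the CoverageOracle implementation: the edge sets, the risk weights, and the function name are hypothetical, and only the two class names are taken from the built-in scenario classes.

```python
def weighted_coverage(required: dict, risk: dict, exercised: dict) -> float:
    """Risk-weighted share of (scenario class, layer) edges that were exercised.

    required:  class -> set of layers that class should touch
    risk:      class -> risk weight (higher means more severe if uncovered)
    exercised: class -> set of layers actually touched by executed scenarios
    """
    total = sum(risk[c] * len(layers) for c, layers in required.items())
    if total == 0:
        return 0.0
    covered = sum(
        risk[c] * len(layers & exercised.get(c, set()))
        for c, layers in required.items()
    )
    return covered / total


# Hypothetical graph: two classes, three edges in total.
required = {"hallucination_breach": {"CSC", "Elan"}, "drift_alert": {"Spectral"}}
risk = {"hallucination_breach": 3.0, "drift_alert": 1.0}

only_low_risk = {"drift_alert": {"Spectral"}}
only_high_risk = {"hallucination_breach": {"CSC", "Elan"}}
print(weighted_coverage(required, risk, only_low_risk))
print(weighted_coverage(required, risk, only_high_risk))
```

Covering only the low-risk class scores 1/7 even though it exercises one of three edges, while covering only the high-risk class scores 6/7. That gap is the difference between a risk-weighted score and a raw count.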

5 built-in scenario classes:

:hallucination_breach, :drift_alert, :laxity_overflow, :confabulum_spike, :sorry_depth_critical

Coverage score: 0.0 – 1.0

Structured Reporting

Markdown Report

Recursa.Report.generate/1 emits structured reports with 5 sections covering the full improvement lifecycle. Each report is version-stamped and diff-ready.

Report sections: SLO Summary, Improvement Log, Sorry Depth, Drift Alerts, Recommendations.

Invoke from CLI:

mix recursa.report --format markdown

SLO Monitoring

SLO Integration

Recursa monitors SLO thresholds from all 8 stack layers simultaneously. When a threshold is breached, Recursa escalates via MetaBus — the cross-system event bus — triggering downstream alerting and recovery flows.

Monitored layers: EconLib4, CSC, TokenGov, Spectral, Elan, LegalLean, OpenCompliance, CCAP.

MetaBus escalation emits structured breach events with layer id, metric name, current value, and threshold delta for downstream consumers.
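The shape of such a breach event might look like the sketch below. The four event fields follow the description above; everything else is assumed for illustration, including the class names, the `higher_is_better` flag, the sign convention for the delta, and the example threshold. How the event is published to MetaBus is not shown, since the source does not describe that interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Slo:
    layer_id: str
    metric: str
    threshold: float
    higher_is_better: bool = True   # assumed: some SLOs are floors, others ceilings


@dataclass(frozen=True)
class BreachEvent:
    layer_id: str
    metric: str
    current_value: float
    threshold_delta: float          # current value minus threshold (signed)


def check(slo: Slo, current: float) -> Optional[BreachEvent]:
    """Return a breach event if the SLO is violated, otherwise None."""
    if slo.higher_is_better:
        breached = current < slo.threshold
    else:
        breached = current > slo.threshold
    if not breached:
        return None
    return BreachEvent(slo.layer_id, slo.metric, current, current - slo.threshold)


floor = Slo("CSC", "correctness", threshold=0.95)
ceiling = Slo("Elan", "latency_p95_ms", threshold=200.0, higher_is_better=False)
print(check(floor, 0.90))     # breach with a negative delta
print(check(floor, 0.97))     # None
print(check(ceiling, 250.0))  # breach with a positive delta
```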

Position in The Stack

Recursa is the temporal envelope — not a layer in the vertical hierarchy, but the system that wraps all nine layers and answers the question no individual system can: is the integrated stack getting better over time?

Cross-boundary verification gap

Per-system proofs guarantee component correctness. They cannot guarantee that the integrated stack behaves correctly under realistic end-to-end conditions, or that it continues to do so as the codebase evolves.

Temporal evolution tracking

Spectral computes the speculum at a point in time. Recursa computes the temporal speculum — the differential across versions. Without it, refactoring effort cannot be measured.

Recursive self-improvement

Implements the full recursive loop: evaluate → identify → design → validate → deploy → recurse. Each iteration must prove it does not regress.
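The "must not regress" rule can be illustrated with a toy acceptance loop. This is a deliberately simplified sketch: `propose` stands in for the identify and design steps, `evaluate` stands in for validation against a single score, and a candidate is deployed only when its score is at least the current one. All three names, and the use of one scalar score as the regression test, are assumptions.

```python
def improve_once(current, propose, evaluate):
    """Try one candidate; deploy it only if it does not score lower."""
    candidate = propose(current)
    if evaluate(candidate) >= evaluate(current):
        return candidate, True
    return current, False


def recurse(initial, propose, evaluate, rounds: int):
    """Run the loop for a fixed number of rounds and log each deploy decision."""
    version, log = initial, []
    for _ in range(rounds):
        version, deployed = improve_once(version, propose, evaluate)
        log.append(deployed)
    return version, log


# Toy stand-ins: a "version" is an integer and the best possible version is 3.
propose = lambda v: v + 1
evaluate = lambda v: -abs(v - 3)

final, log = recurse(0, propose, evaluate, rounds=5)
print(final, log)  # 3 [True, True, True, False, False]
```

Once the toy version reaches 3, every further proposal scores lower and is rejected, so the deployed version never moves backwards.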

Meta-recursion

Recursa tests its own scenario quality. If generated scenarios fail to discover issues across N consecutive versions, the scenario generator itself is flagged as insufficient.
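The flagging rule reduces to a check over the most recent N versions. In this sketch the function name and the input shape (a list of issue counts, oldest first) are hypothetical, and N is left as a parameter because the source does not fix its value.

```python
def generator_insufficient(issues_per_version, n: int) -> bool:
    """True if the last n versions each produced zero discovered issues.

    issues_per_version: issue counts per stack version, oldest first.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    recent = issues_per_version[-n:]
    # Fewer than n versions of history is not enough evidence to flag.
    return len(recent) == n and all(count == 0 for count in recent)


print(generator_insufficient([3, 0, 0, 0], n=3))  # True
print(generator_insufficient([0, 0, 1], n=3))     # False
print(generator_insufficient([0, 0], n=3))        # False
```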

TruthfulQA Regression Gate — First Recursa Seed

Regression Gate Live

The first Recursa-style regression gate is now live: run_regression_truthfulqa.exs runs 15 TruthfulQA questions through the stack adapter and exits 1 if accuracy < 95%. Current score: 53.3% (target not yet met — RAG pipeline built, regression closure in progress). This gate is the seed of RecursaScore’s Correctness metric.
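The gate's arithmetic is worth spelling out. The sketch below mirrors the described behaviour (exit code 1 when accuracy falls below 95%) in Python; it is not the contents of run_regression_truthfulqa.exs, and the function name is hypothetical. A score of 53.3% on 15 questions is consistent with 8 correct answers.

```python
def gate_exit_code(correct: int, total: int, threshold: float = 0.95) -> int:
    """Return 0 if accuracy meets the threshold, otherwise 1."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 0 if correct / total >= threshold else 1


print(round(8 / 15 * 100, 1))    # 53.3
print(gate_exit_code(8, 15))     # 1
print(gate_exit_code(14, 15))    # 1
print(gate_exit_code(15, 15))    # 0
```

With only 15 questions, 14 correct is 93.3%, so the 95% threshold can be met only by a perfect run.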

Related Documentation