Architecture · Companion to "What Agents Can Do"

Safebox does 99% of what RSI promises — and a static analyzer can defend it.

Recursive self-improvement is the capability everyone fears and no one can secure. Continuous directed evolution reaches almost the same ceiling, by a route that turns defense back into something as tractable as a compiler pass.

Read time 11 min · Audience Technical · Posture Architecture, not pitch

× What Safebox is not

Recursive self-improvement

The optimizer rewrites the optimizer. The target moves. No fixed surface a defender can reason about.

✓ What Safebox is

Continuous directed evolution

The model is fixed. A vetted toolkit grows by composition, steered by humans. A surface a static analyzer can check.

I · The two machines

One rewrites itself. The other accumulates under a gate.

An RSI system improves its own ability to improve — which is exactly why no one can secure it: a thing that rewrites the rules of its own improvement has no fixed surface for a defender to reason about. Safebox keeps the model fixed and grows a vetted toolkit by composition instead.

Left: RSI's loop feeds back into the model itself, acquiring powers no one approved toward a target it sets — nothing fixed to check. Right: CDE only ever composes approved primitives; the model never changes, and every composition passes the gate before it runs.

RSI

Power and danger are the same property

It acquires capabilities no one approved, toward goals no one set — the source of both its reach and its un-securability.

CDE

It gives up one thing, keeps almost everything

It renounces new primitive power. But the space of combinations of approved tools is already vast — a Cambrian diversification from a small vetted set of parts.

The hand on the wheel is always human. The system composes; it does not acquire.

II · The compiler argument

Make the workflow a language, and defense becomes a compiler pass.

Defending an AI system today means watching behavior, training classifiers, adding monitoring, and hoping. Safebox makes the system analyzable: the workflow is a restricted declarative language, and every tool carries typed metadata, so a static analyzer reasons about a composition before it runs.
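One way to read "restricted declarative language" is as a set of well-formedness rules that a workflow must satisfy before any deeper analysis begins. The sketch below is illustrative only: the function names, the shape of `steps` and `edges`, and the requirement that the graph be acyclic are assumptions, not something the Safebox design is stated to enforce.

```python
def has_cycle(steps, edges):
    """Kahn's algorithm: the graph is acyclic iff every step can be scheduled."""
    indegree = {name: 0 for name in steps}
    successors = {name: [] for name in steps}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    ready = [name for name, degree in indegree.items() if degree == 0]
    scheduled = 0
    while ready:
        node = ready.pop()
        scheduled += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return scheduled != len(steps)


def validate(steps, edges, approved_tools):
    """Return well-formedness errors; an empty list means the graph is accepted.

    steps: dict of step name -> tool name
    edges: list of (from_step, to_step) pairs
    approved_tools: set of tool names that have a vetted manifest
    """
    errors = []
    for step, tool in steps.items():
        if tool not in approved_tools:
            errors.append(f"step {step!r} uses unapproved tool {tool!r}")
    for src, dst in edges:
        for end in (src, dst):
            if end not in steps:
                errors.append(f"edge references undeclared step {end!r}")
    # Only look for cycles once every edge is known to point at a real step.
    if not errors and has_cycle(steps, edges):
        errors.append("workflow graph contains a cycle")
    return errors


APPROVED = {"search.web", "llm.complete", "smtp.send"}
ok = validate(
    {"find": "search.web", "draft": "llm.complete", "send": "smtp.send"},
    [("find", "draft"), ("draft", "send")],
    APPROVED,
)
unknown_tool = validate({"run": "shell.exec"}, [], APPROVED)
looped = validate(
    {"a": "search.web", "b": "llm.complete"},
    [("a", "b"), ("b", "a")],
    APPROVED,
)
```

The point of the restriction is that each rule is a finite check over declared data, so the checker always terminates and never needs to invoke a tool.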

The analyzer never runs the workflow. It reads the declared graph and the typed manifests, traces whether any sensitive read can reach an external write, confirms each step stays inside its declared capability, and checks that every consequential effect hits the M-of-N gate — then refuses the composition if a taint path exists. All decided before a single step executes.

# the graph the analyzer reads — typed steps, declared effects
workflow vendor_outreach {
  step find  : tool=search.web      // read · net: search-API
  step draft : tool=llm.complete    // no effect · no net
  step send  : tool=smtp.send       // WRITE-EXTERNAL · smtp
  edge find → draft → send
}
// taint · capability · effect — all decidable, statically, before execution
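The taint check on a graph like this one can be sketched as plain reachability over declared metadata. This is a minimal sketch under stated assumptions, not the Safebox analyzer: the `Manifest` fields, the `analyze` function, and the `crm.read` tool are hypothetical, and `search.web` is assumed to be a non-sensitive read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    """Typed metadata for one tool (field names are illustrative assumptions)."""
    reads_sensitive: bool = False   # tool may return sensitive data
    writes_external: bool = False   # tool has a WRITE-EXTERNAL effect


# Hypothetical manifests mirroring the effect comments in the workflow.
MANIFESTS = {
    "search.web": Manifest(),
    "llm.complete": Manifest(),
    "smtp.send": Manifest(writes_external=True),
    "crm.read": Manifest(reads_sensitive=True),  # invented for the failing case
}


def taint_paths(steps, edges, manifests):
    """List (source, sink) pairs where a sensitive read reaches an external write.

    Pure reachability over the declared graph; no tool is ever invoked.
    """
    successors = {name: [] for name in steps}
    for src, dst in edges:
        successors[src].append(dst)

    found = []
    for source, tool in steps.items():
        if not manifests[tool].reads_sensitive:
            continue
        # Depth-first walk of every step downstream of the sensitive read.
        seen, stack = set(), list(successors[source])
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if manifests[steps[node]].writes_external:
                found.append((source, node))
            stack.extend(successors[node])
    return found


def analyze(steps, edges, manifests=MANIFESTS):
    """Accept the composition only if no taint path exists."""
    paths = taint_paths(steps, edges, manifests)
    return {"accepted": not paths, "taint_paths": paths}


EDGES = [("find", "draft"), ("draft", "send")]
vendor_outreach = analyze(
    {"find": "search.web", "draft": "llm.complete", "send": "smtp.send"}, EDGES
)
leaky_variant = analyze(
    {"find": "crm.read", "draft": "llm.complete", "send": "smtp.send"}, EDGES
)
```

Under these assumptions `vendor_outreach` is accepted, while `leaky_variant` is refused because `find` now reads sensitive data that flows through `draft` into `send`. Each tool in the refused variant is individually approved; only the composition is unsafe.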

This is the move that made type systems and capability security work: constrain the language so the safety properties you care about become provable. A type checker proves a class of crashes cannot happen without running your program; a Safebox analyzer proves a tainted read cannot reach an external write without running the workflow.

The honest boundary

Static analysis decides a class of properties, not all of them: the composition of two safe primitives is not always safe, and the metadata is itself an attack surface, because a lying manifest can defeat the analysis.

So Safebox does not claim defense is solved. It claims defense is relocated — out of the adversarial runtime into three things you can harden: the analyzer's soundness, the metadata's truthfulness, and the language's decidable boundary.

Steel skeleton — not sandcastle, not swarm

"Steel Skeleton vs. Sandcastles" names three ways to build intelligence. The sandcastle (prompts and vibes) collapses when a model updates. The swarm (emergent, self-modifying) is un-debuggable and unprovable, because emergence is not architecture. Only the steel skeleton survives.

That is the warning CDE answers about itself: a combinatorial system without a skeleton would become the swarm. The skeleton — typed primitives, policy gates outside the prompts, replayable execution, static enforcement — is what keeps it a building. The agents are cognition; the framework is architecture.

III · One environment, not a million

You patch a single sealed box — not every combination an org runs.

An organization running open-ended agents defends a combinatorial sprawl of environments — every laptop, runner, cloud account, and credential scope a distinct attack surface. Safebox inverts it: one attested, egress-controlled box, hardened and analyzed once.

Left: every environment an agent touches is its own surface to harden, and the set grows combinatorially. Right: every Safebox workflow runs inside one box under one set of primitives — so the defensive properties hold for every workflow, tenant, and org at once, because they belong to the substrate, not the task.

O(n)

trust you spend — humans approve each tool once, M-of-N

O(2ⁿ)

governed capability you get — every checkable composition of approved tools

1

environment to harden, analyze, and attest, not a million combinations
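The asymmetry between the first two figures can be checked with a little arithmetic. The sketch below reads O(2ⁿ) as the number of non-empty subsets of n approved tools; that reading is an assumption, as are the function names. Ordered workflows that reuse tools would be more numerous still, and only the compositions that pass the analyzer would actually run.

```python
from itertools import combinations


def approvals_needed(n_tools: int) -> int:
    """Each tool is approved once, so human review grows linearly."""
    return n_tools


def candidate_tool_sets(n_tools: int) -> int:
    """Non-empty subsets of n approved tools: 2**n - 1."""
    return 2 ** n_tools - 1


# The three tools from the vendor_outreach example.
tools = ["search.web", "llm.complete", "smtp.send"]

# Brute-force enumeration agrees with the closed form: 7 subsets from 3 approvals.
enumerated = sum(
    1 for size in range(1, len(tools) + 1) for _ in combinations(tools, size)
)
```

At 20 approved tools the same count is 1,048,575 candidate tool sets for 20 approval decisions.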

A vulnerability found inside the box does not automatically become an external side effect: even a perfect exploit chain cannot reach an external write without a matching signed manifest and an M-of-N approval. Patching faster is a losing race against industrial-scale offense; sealing the environment once and proving the boundary scales the other way.
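The gate described here can be sketched as a check that at least M distinct approvers signed the exact manifest being used. This is a sketch under loud assumptions: the approver names, keys, and manifest fields are hypothetical, and HMAC-SHA256 with per-approver secrets stands in for whatever signature scheme a real deployment would use.

```python
import hashlib
import hmac
import json


def manifest_digest(manifest: dict) -> bytes:
    """Canonical hash of a manifest, so approvals bind to its exact content."""
    canonical = json.dumps(manifest, sort_keys=True).encode()
    return hashlib.sha256(canonical).digest()


def sign(key: bytes, digest: bytes) -> str:
    """Stand-in for a real signature: HMAC-SHA256 under the approver's key."""
    return hmac.new(key, digest, hashlib.sha256).hexdigest()


def gate_allows(manifest: dict, approvals: dict, approver_keys: dict, m: int) -> bool:
    """True only if at least m known approvers signed this exact manifest.

    approvals maps approver name -> signature, so each approver counts once.
    """
    digest = manifest_digest(manifest)
    valid = 0
    for approver, signature in approvals.items():
        key = approver_keys.get(approver)
        if key is not None and hmac.compare_digest(sign(key, digest), signature):
            valid += 1
    return valid >= m


# Hypothetical approvers and a 2-of-3 policy.
KEYS = {"alice": b"key-a", "bob": b"key-b", "carol": b"key-c"}
manifest = {"tool": "smtp.send", "effect": "WRITE-EXTERNAL", "net": "smtp"}
digest = manifest_digest(manifest)
approvals = {"alice": sign(KEYS["alice"], digest), "bob": sign(KEYS["bob"], digest)}

approved = gate_allows(manifest, approvals, KEYS, m=2)
tampered = gate_allows({**manifest, "net": "any"}, approvals, KEYS, m=2)
one_signature = gate_allows(manifest, {"alice": approvals["alice"]}, KEYS, m=2)
```

Because approvals bind to the manifest digest, changing any declared field invalidates every existing signature, so an exploit that rewrites a manifest still has to collect M fresh approvals.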

IV · A world every org can defend

If Safebots outcompete agents, defense gets easier for everyone.

The usual fear is that capability and danger rise together. The whole point of CDE is to break that coupling: capability rises with the combinatorial closure of approved tools; danger does not, because the new capability is composed from vetted parts inside a sealed box under a static check.

If Safebots proliferate and outcompete open-ended agents — not by being more clever, but by being the version an organization can deploy without betting the company on a model's restraint — then AI capability keeps climbing while the defensive burden falls. Every org defends the same kind of sealed environment with the same kind of static analysis, instead of improvising its own containment and re-learning the same lessons through its own breach.

The bet

It grows in the light, under a gate, where a defender can read it.

CDE will not do the last one percent — it will never acquire a genuinely new primitive power on its own, and that renunciation is what makes it safe. For the ninety-nine percent that is real work, it reaches the same ceiling as the dangerous machine, by a route that leaves a steel skeleton behind: a fixed model, a vetted toolkit, a declarative language, a single sealed environment, and a static analyzer that proves what the box will and will not do before it does anything at all.

RSI rewrites itself in the dark. CDE grows in the light. The fearsome version offers a world where capability outruns anyone's ability to defend it. This one offers a world where capability climbs and defense gets simpler at the same time — because the power lives in composition, and composition is checkable.

What Agents Can Do

The capability ledger — 99% of what open-ended agents do, and the four things Safebots structurally refuse.

The Compromise Problem

Why exfiltration is trivial once compromised, and the structural defenses of the box.

Steel Skeleton vs. Sandcastles

Three ways to build intelligence — and why only architecture survives.

The Trust Layer

The full stack — Infrastructure, Safebox, Safebots — and why trust comes from the substrate.