Safebox vs Agent Swarms — Why 300 LLM Calls Is the Wrong Unit of Parallelism

I · The architectural insight

Agent swarms parallelize the wrong unit.

Moonshot's K2.6 is genuinely impressive engineering. The model self-decomposes a task into sub-tasks, runs up to 300 sub-agents in parallel, and synthesizes the outputs into a single deliverable. The benchmarks back it up: it ties GPT-5.5 on SWE-Bench Pro and leads on agent benchmarks. Take Moonshot's published examples seriously — 100 tailored CVs from one prompt, 30 landing pages from a Google Maps query, a 40-page astrophysics report with a 20,000-row dataset.

But look at what the swarm is parallelizing. Each of those 300 sub-agents is a full K2.6 invocation: a trillion-parameter Mixture-of-Experts model doing autoregressive generation. The parallelism is across LLM calls. Every sub-agent burns billions of FLOPs per generated token, because every sub-agent is a transformer. The cost per run reflects this: at Moonshot's direct pricing of $0.95 per million input tokens and $4.00 per million output tokens, a 300-agent run on even modest token budgets lands in the $5–$50 range.
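To make the $5–$50 range concrete, here is a minimal cost sketch in Python. The two prices are the ones quoted above; the per-agent token budgets are hypothetical values chosen only to bracket the range, not measurements of any real run.

```python
# Prices quoted in the text (Moonshot direct), in USD per million tokens.
INPUT_PRICE_PER_M = 0.95
OUTPUT_PRICE_PER_M = 4.00


def run_cost(agents: int, input_tokens_each: int, output_tokens_each: int) -> float:
    """Total API cost in USD for `agents` sub-agents with identical budgets."""
    input_cost = agents * input_tokens_each / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = agents * output_tokens_each / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


# Hypothetical per-agent budgets (assumptions, not published figures).
light = run_cost(agents=300, input_tokens_each=10_000, output_tokens_each=2_000)
heavy = run_cost(agents=300, input_tokens_each=100_000, output_tokens_each=15_000)
print(f"light run: ${light:.2f}")
print(f"heavy run: ${heavy:.2f}")
```

Under these assumed budgets the light run costs about $5.25 and the heavy run about $46.50, so both ends of the quoted range are reachable without exotic workloads.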

Most of what an "agent" does isn't reasoning. It's fetching, parsing, hashing, filtering, formatting, scheduling, writing. Using a trillion-parameter model as the unit of parallelism for that work is like firing up a coal plant to charge a phone.

Safebox makes a different choice. The unit of parallelism is the tool: a small JavaScript program running in a vm.Script sandbox. A tool can call an LLM as one of its operations (via Protocol.LLM), but most tools don't, because most useful work doesn't need one. Tools fetch URLs, parse JSON, hash buffers, search streams, run regexes, format documents, instantiate templates. These operations cost milliseconds and microjoules. Because tools are cheap to spin up, the runtime can topologically sort a Workflow's Step DAG and run hundreds of them concurrently. The expensive resource, the LLM, gets called sparingly, when reasoning is genuinely needed, not as the implicit unit of work.

This is the architectural difference. Kimi parallelizes LLMs. Safebox parallelizes work, and uses LLMs as a callable resource within that work. The goal is the same (get more done in the same wall-clock time), but the cost structure is wildly different.

II · The mechanism

Three pieces compose to cover the swarm surface.

A Safebox workflow accomplishes what a swarm accomplishes through three independent mechanisms. Each is well understood in isolation; it is their combination that produces the cost and reliability profile.

01. TOPOSORT

The runtime parallelizes the DAG, not the LLM.

A Workflow is a step-edge graph. Steps with no dependency edges between them run concurrently — by topological sort, not by a coordinator's discretion. If 40 steps each fetch one paper, the runtime starts 40 fetches in parallel. The parallelism width is bounded by available compute, not by any LLM's context window.

There is no coordinator-as-bottleneck. Streams cross between steps directly; the LLM isn't in the loop unless a step calls it.
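The scheduling idea can be sketched in a few lines. The following is a minimal illustration in Python using the standard library's `graphlib` and `asyncio`, not Safebox's actual runtime (which the text describes as JavaScript); `run_workflow`, `make_fetch`, and `merge` are hypothetical names, and the network fetch is simulated with a short sleep.

```python
import asyncio
from graphlib import TopologicalSorter


async def run_workflow(deps, steps):
    """Run every step as soon as all of its dependencies have finished.

    deps:  {step_name: set of step names it depends on}
    steps: {step_name: async function taking a dict of dependency results}
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the graph is not a DAG
    results = {}
    running = {}  # asyncio.Task -> step name

    while sorter.is_active():
        # Start every step whose dependencies are all done.
        for name in sorter.get_ready():
            inputs = {d: results[d] for d in deps.get(name, ())}
            running[asyncio.create_task(steps[name](inputs))] = name
        # Wait for at least one running step, then unlock its dependents.
        done, _ = await asyncio.wait(list(running),
                                     return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            name = running.pop(task)
            results[name] = task.result()
            sorter.done(name)
    return results


def make_fetch(i):
    async def fetch(_inputs):
        await asyncio.sleep(0.01)  # stand-in for a network fetch
        return f"paper-{i}"
    return fetch


async def merge(inputs):
    return sorted(inputs.values())


# 40 independent fetch steps feeding one merge step.
steps = {f"fetch{i}": make_fetch(i) for i in range(40)}
steps["merge"] = merge
deps = {"merge": set(steps) - {"merge"}}

results = asyncio.run(run_workflow(deps, steps))
print(len(results["merge"]))
```

All 40 fetches start in the first pass of the loop because none has a dependency edge; `merge` starts only after the last of them reports done. No model is consulted to decide the order.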

02. ON-DEMAND TOOLS

Tools and capabilities generate themselves.

When a Workflow needs to do something no existing tool covers, the Grokers pipeline analyzes the target — API documentation, source code, schema — and produces a sandbox-legal tool that becomes a registered Capability. The tool is verified statically, probed against a live sandbox, then signed.

The decomposition is system-side, not model-side. It offers the same flexibility as K2.6's self-decomposition, plus audit, governance, and reuse across runs.

03. CALLABLE LLM

The LLM is a tool, not the architecture.

When a step actually needs language-model reasoning — a judgment call, a summary, a creative draft — it invokes Protocol.LLM. The rest of the work is done by deterministic code that costs essentially nothing. Caching is built in: identical inputs hit a shared cache before reaching a provider.
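As a rough illustration of "identical inputs hit a shared cache," here is a content-addressed cache sketch in Python. The text does not describe how Protocol.LLM keys its cache, so the class name `CachedLLM`, the SHA-256 keying scheme, and the `fake_provider` stand-in are all assumptions.

```python
import hashlib
import json


class CachedLLM:
    """Wraps a provider call with a content-addressed cache (illustrative only)."""

    def __init__(self, provider):
        self._provider = provider   # any callable: (model, prompt, **params) -> str
        self._cache = {}            # would be a shared store in a real deployment
        self.provider_calls = 0

    def _key(self, model, prompt, params):
        # Canonical JSON so that identical inputs always hash identically.
        payload = json.dumps(
            {"model": model, "prompt": prompt, "params": params},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, model, prompt, **params):
        key = self._key(model, prompt, params)
        if key not in self._cache:
            # Cache miss: this is the only path that reaches the provider.
            self.provider_calls += 1
            self._cache[key] = self._provider(model, prompt, **params)
        return self._cache[key]


def fake_provider(model, prompt, **params):
    # Hypothetical stand-in for a real model call.
    return f"[{model}] summary of: {prompt}"


llm = CachedLLM(fake_provider)
first = llm.complete("some-model", "Summarize listing 17", temperature=0)
second = llm.complete("some-model", "Summarize listing 17", temperature=0)
third = llm.complete("some-model", "Summarize listing 18", temperature=0)
print(llm.provider_calls)
```

The second call returns the stored answer without touching the provider, so three requests cost two provider calls. Keying on the sampling parameters as well as the prompt keeps a cached deterministic answer from being served to a request that asked for different settings.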

For the same task that requires 300 LLM calls in a swarm, Safebox often needs 20–60 — the difference between LLM-as-unit and LLM-as-resource.

Take any one of these three out and the cost profile collapses back to a swarm's. Without toposort, independent steps run one after another. Without on-demand tools, every novel task requires hand-authored workflow code. Without a callable LLM, you're back to using the model for everything. Together, they produce a substrate where the question "what does this cost to run?" has a structurally smaller answer than "feed it all to a transformer in parallel."

III · The cost ledger

Same task, decomposed honestly.

Moonshot's flagship demonstration: take one uploaded CV and 100 job listings, produce 100 individually tailored resumes. It's a clean example because the work decomposes cleanly. Here is the same task run two ways.

WORKED EXAMPLE · CV × 100 job listings → 100 tailored CVs
Per-run cost breakdown

| Operation | Kimi Agent Swarm | Safebox workflow |
|---|---|---|
| Fetch / load 100 listings | LLM read each (100 calls) | Single fetch tool, no LLM |
| Extract structured fields | LLM extracts (100 calls) | Deterministic parser; LLM only on ambiguous fields (~10 calls) |
| Score CV vs. each listing | LLM judges (100 calls) | Heuristic + embedding sort; LLM tiebreak only on borderline (~10 calls) |
| Generate tailored CV | LLM rewrites (100 calls, long output) | LLM rewrites only top matches (~20 calls) |
| Synthesize summary | Coordinator LLM (1 long call) | Template tool + final LLM pass (1 short call) |
| Total LLM calls | ~301 | ~41 |
| Total output tokens (est.) | ~2.4M | ~180K |
| Per-run cost (Kimi pricing) | ~$10–$15 | ~$0.60–$1.00 |

Note: Kimi numbers assume Moonshot direct pricing ($0.95/M input, $4.00/M output) and the public Agent Swarm execution profile. Safebox numbers assume the same model called sparingly as a callable resource. Both runs produce the same deliverable; the cost differential is the architectural one.

The roughly 95% saving doesn't come from aggressive cost-cutting. It comes from not paying LLM prices for non-LLM work. Fetching a URL is not a reasoning task. Parsing JSON is not a reasoning task. Sorting candidates by a deterministic score is not a reasoning task. The swarm pays full inference cost for all of them anyway, because that's the only primitive it has. Safebox pays inference cost where inference is happening and roughly nothing elsewhere.
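The ledger's arithmetic can be checked directly. This short Python sketch uses only figures that appear in the ledger and its note (output-token estimates, the $4.00 per million output price, and the two per-run cost ranges); the variable names are illustrative.

```python
OUTPUT_PRICE_PER_M = 4.00  # USD per million output tokens, as quoted in the note

swarm_output_tokens = 2_400_000   # "~2.4M" from the ledger
safebox_output_tokens = 180_000   # "~180K" from the ledger

# Output-token cost alone, before any input tokens are counted.
swarm_output_cost = swarm_output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
safebox_output_cost = safebox_output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M

# Savings implied by the per-run cost ranges in the ledger.
swarm_low, swarm_high = 10.00, 15.00
safebox_low, safebox_high = 0.60, 1.00
smallest_saving = 1 - safebox_high / swarm_low   # dearest Safebox vs. cheapest swarm
largest_saving = 1 - safebox_low / swarm_high    # cheapest Safebox vs. dearest swarm

print(f"output-only cost: swarm ${swarm_output_cost:.2f}, "
      f"Safebox ${safebox_output_cost:.2f}")
print(f"saving range: {smallest_saving:.0%} to {largest_saving:.0%}")
```

Output tokens alone account for about $9.60 of the swarm run and $0.72 of the Safebox run, and the quoted ranges imply a saving of between 90% and 96%.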

This isn't a theoretical claim. An April 2026 paper from Stanford Digital Economy Lab, MIT, Michigan, DeepMind, All Hands, and Microsoft AI — the first systematic empirical study of agent token consumption — ran eight frontier models across 500 SWE-bench tasks and found that agentic workloads consume 1000× more tokens than chat workloads. The structural driver: input tokens, not output. The input-to-output ratio for agents is 153:1 — versus 1.33 for chat and 0.16 for reasoning. Your agent is not expensive because it writes a lot; it is expensive because it reads a lot, repeatedly, as every loop iteration re-ingests the accumulated context. The paper is the empirical confirmation of what the architectural argument predicts: when the unit of parallelism is the LLM, you pay LLM prices on every read, and the reads dominate the bill. Two more findings from the same paper drive the point home: the same task on the same model produces 30× cost variance, and models' best self-prediction correlation for their own token usage is 0.39. Agents cannot tell you what they're about to spend. The substrate has to enforce the budget deterministically.
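What deterministic budget enforcement might look like is sketched below in Python. `TokenBudget` and `BudgetExceeded` are hypothetical names, and the loop's token figures are invented for illustration; the point is that the ceiling is checked by code before each call, and that a loop which re-reads its growing context exhausts a budget much faster than its per-step output suggests.

```python
class BudgetExceeded(RuntimeError):
    pass


class TokenBudget:
    """Hard per-run token ceiling, checked before every LLM call."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.spent = 0

    def charge(self, input_tokens: int, max_output_tokens: int) -> None:
        # Reserve the worst case up front so the ceiling cannot be overshot.
        needed = input_tokens + max_output_tokens
        if self.spent + needed > self.max_tokens:
            raise BudgetExceeded(
                f"call needs {needed} tokens, only "
                f"{self.max_tokens - self.spent} left"
            )
        self.spent += needed


budget = TokenBudget(max_tokens=50_000)
context = 2_000        # tokens in the initial prompt (hypothetical)
completed = 0
try:
    for _ in range(100):
        budget.charge(input_tokens=context, max_output_tokens=500)
        context += 1_500   # each iteration re-reads everything gathered so far
        completed += 1
except BudgetExceeded as err:
    print(f"stopped after {completed} iterations: {err}")
```

With these assumed numbers the loop is cut off after 7 iterations with 49,000 tokens reserved: the eighth call alone would need 13,000 tokens, almost all of it re-read context.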

The architectural advantage compounds with a second cost lever that only sovereign-substrate deployments get to use: locally-hosted open-weight models. When the LLM calls do happen, they don't have to go to a per-token API meter at all. Self-hosted models on the same hardware that runs the tools eliminate inference cost as a marginal line item, leaving only the amortized cost of the compute itself. KV cache persistence — capped at four checkpoints in Anthropic's API, expiring after an hour in OpenAI's — becomes indefinite and unlimited on local infrastructure, which compounds the savings on workflows that hit the same prefix repeatedly. The full economic breakdown — five model tiers, three-year cost projections vs. API providers, network-effect reusability across customers — is in the cost analysis. The per-run numbers above are the architectural floor; the local-inference numbers go further.

IV · The parallelism shape

Toposort over a DAG. No coordinator bottleneck.

Kimi's architecture has a known ceiling: the coordinator's context window. All sub-agent outputs route through one synthesizer agent, and at large widths the synthesis becomes the bottleneck. The 300-agent ceiling isn't arbitrary — it's the practical limit where the coordinator can still hold meaningful summaries from each sub-agent in its context.

Safebox doesn't have that ceiling, because there is no coordinating LLM. The runtime that schedules steps is a topological sort over the workflow graph; the synthesis of outputs is itself just another step (which may or may not invoke an LLM). If 1,000 fetches need to happen, 1,000 fetches happen in parallel — subject to your network and CPU budgets, not to any model's context window.
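"Subject to your network and CPU budgets" translates naturally into a concurrency cap. The Python sketch below uses `asyncio.Semaphore` to run 1,000 simulated fetches with at most 64 in flight; `fan_out`, `fake_fetch`, the cap of 64, and the example.org URLs are all illustrative assumptions, and no real network request is made.

```python
import asyncio


async def fan_out(items, worker, max_concurrency):
    """Run worker(item) for every item, at most max_concurrency at a time."""
    gate = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        async with gate:
            return await worker(item)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))


stats = {"in_flight": 0, "peak": 0}


async def fake_fetch(url):
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.001)   # stand-in for network latency
    stats["in_flight"] -= 1
    return len(url)


urls = [f"https://example.org/paper/{i}" for i in range(1000)]
sizes = asyncio.run(fan_out(urls, fake_fetch, max_concurrency=64))
print(len(sizes), stats["peak"])
```

The width is a tuning knob set by the operator. Raising it costs sockets and CPU, not context-window space.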

Figure: Parallelism shape · same workload, two substrates. Left: every sub-agent is a transformer; the coordinator's context window is the architectural ceiling. Right: tools are cheap programs; the runtime schedules the graph; the LLM (gold) is called only when reasoning is needed.

V · The capability ledger

What each system can actually do.

Capability comparison across the dimensions that matter for production deployment of multi-agent work. "Agent Swarm" here means Kimi K2.6's swarm specifically — though Claude Agent Teams, AutoGPT, and other LLM-as-unit swarms share the architectural shape.

| Capability | Agent Swarm | Safebox |
|---|---|---|
| Massively parallel decomposition: one prompt fans out into many concurrent work items. | YES (≤300) | YES (unbounded) |
| Self-organizing task graphs: the system figures out how to break a task down at runtime. | YES (model-side) | YES (Grokers) |
| Calling external APIs & tools: HTTP, OAuth, webhooks, rate-limited retries. | YES | YES |
| Reading/writing user data: CRUD against tenant content, files, databases. | YES | YES |
| Parallelism without LLM bottleneck: width not capped by any model's context window. | NO | YES |
| Per-run cost < $1 at typical scales: inference cost amortized across non-LLM work. | NO ($5–$50) | YES |
| Data stays in tenant environment: no raw data sent to a vendor's inference servers. | NO | YES |
| Pre-execution audit of side effects: see what will run, with what credentials, before it runs. | NO | YES |
| M-of-N governance on dangerous actions: multi-party authorization before any side effect lands. | NO | YES |
| Cryptographic replay of past runs: prove what happened by signature; deterministic re-runs. | NO | YES |
| Reusable workflows across organizations: a workflow that works for one community works for another. | PARTIAL (skills) | YES |
| One-prompt zero-config UX: type a task, walk away, come back to a deliverable. | YES | PARTIAL |

VI · Reliability

What happens when something goes wrong.

A swarm of 300 LLM calls produces a deliverable in minutes. It also produces 300 places where the model can hallucinate, misunderstand the task, or quietly do the wrong thing. The same failure modes that have deleted production databases and emailed entire customer lists apply with 300× the concurrency. When something goes wrong in a swarm, you find out by reading the deliverable; the audit of which sub-agent went wrong and why is whatever the vendor exposes in their UI.

Safebox's reliability story is the same as the one for any deterministic system: failures localize, the audit trail names them precisely, and the parts that don't involve LLMs simply don't hallucinate. The fetch tool doesn't decide to delete the volume. The parser doesn't fabricate fields. The score function doesn't lie about its outputs. Only the LLM-calling steps have LLM failure modes — and those steps are bounded, retryable, and either approved by humans or constrained by manifests before their outputs reach a side effect.

When a Safebox workflow fails, you know which step failed, what input it got, what it tried, and which retry attempt finally gave up — because every step is a stream and every action has an executionHash. When a swarm fails, you have a deliverable that looks plausible and 300 sub-agent outputs to read.
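A sketch of what a per-attempt audit record could contain, written in Python for illustration. The text names an executionHash but does not say how it is computed, so the SHA-256-over-canonical-JSON scheme here is an assumption, as are `run_step`, `flaky_parse`, and the example URL.

```python
import hashlib
import json


def execution_hash(step, payload, attempt):
    """Deterministic ID for one attempt of one step (hypothetical scheme)."""
    blob = json.dumps({"step": step, "input": payload, "attempt": attempt},
                      sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def run_step(step, fn, payload, log, max_attempts=3):
    """Run fn(payload), retrying on failure and logging every attempt."""
    for attempt in range(1, max_attempts + 1):
        record = {"step": step, "attempt": attempt, "input": payload,
                  "hash": execution_hash(step, payload, attempt)}
        try:
            record["output"] = fn(payload)
            record["status"] = "ok"
            log.append(record)
            return record["output"]
        except Exception as err:
            record["status"] = "failed"
            record["error"] = repr(err)
            log.append(record)
    raise RuntimeError(f"step {step!r} gave up after {max_attempts} attempts")


def flaky_parse(payload):
    # Hypothetical tool that always fails, to show what the log captures.
    raise ValueError(f"unparseable listing: {payload['url']}")


audit_log = []
try:
    run_step("parse-listing", flaky_parse,
             {"url": "https://example.org/job/7"}, log=audit_log)
except RuntimeError as err:
    print(err)
for rec in audit_log:
    print(rec["attempt"], rec["status"], rec["hash"][:12], rec["error"])
```

After the failure, the log answers the four questions directly: which step, what input, what error each attempt hit, and which attempt was the last.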

The reliability difference compounds with scale. At a 4% hallucination rate per LLM call, and treating calls as independent, a swarm of 300 calls almost always contains at least one wrong answer (the math: 1 − 0.96³⁰⁰ ≈ 100%). At the same rate, a Safebox workflow with 40 calls still contains at least one hallucinated step about 80% of the time (1 − 0.96⁴⁰ ≈ 80%). The difference is that hallucinations in Safebox can be caught by structural validation (manifest match, output schema, downstream consistency check) before the wrong answer becomes a side effect. The model still makes mistakes; the substrate just doesn't let mistakes leave the box.
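Both halves of this argument fit in a short Python sketch: the probability arithmetic, which assumes calls fail independently, and a toy output-schema check of the kind that could stop a hallucinated field before it propagates. `schema_problems` and the example schema are hypothetical, not Safebox's validation API.

```python
def p_at_least_one_error(per_call_rate: float, calls: int) -> float:
    """Chance that at least one of `calls` independent calls goes wrong."""
    return 1 - (1 - per_call_rate) ** calls


swarm = p_at_least_one_error(0.04, 300)
safebox = p_at_least_one_error(0.04, 40)
print(f"300 calls: {swarm:.4%}")
print(f" 40 calls: {safebox:.1%}")


def schema_problems(output: dict, schema: dict) -> list:
    """Return a list of mismatches; an empty list means the output conforms."""
    problems = []
    for field, expected_type in schema.items():
        if field not in output:
            problems.append(f"missing field: {field}")
        elif not isinstance(output[field], expected_type):
            problems.append(f"wrong type for {field}")
    problems += [f"unexpected field: {f}" for f in output if f not in schema]
    return problems


listing_schema = {"title": str, "salary_max": int}
llm_output = {"title": "Engineer", "salary_max": "competitive"}
print(schema_problems(llm_output, listing_schema))
```

The check does not make the model more accurate. It makes a malformed answer a detectable event that can trigger a retry or a human review instead of flowing silently into the next step.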

VII · Auditability & structural policy

Policies that run, not policies that ask nicely.

Every enterprise agent deployment runs into the same governance question: how do you prevent the agent from doing something it shouldn't? The current answer in the swarm world is some combination of system prompts ("never delete production data"), input validation, output filtering, and human review. Each one is a layer that asks the model to comply.

Safebox replaces those layers with structural enforcement. A side effect — sending an email, charging a card, modifying a stream, calling a paid API — isn't something a tool can do; it's something a tool can propose. The proposal carries a manifest declaring exactly what the action will be: the recipient set, the URL pattern, the amount range, the credentials it'll use. The substrate compares the proposal to a Policy stream — a piece of governance code stored on-chain (or in the action graph) that determines whether the action requires zero, one, or M-of-N human signatures before it executes.

A policy isn't a system prompt asking the LLM to be careful. It's a piece of running code that fires before any side effect leaves the box. The model can hallucinate, misread instructions, "decide to ignore the safety rule" — and the side effect still doesn't happen, because the substrate didn't get a signed approval. The substrate doesn't trust the model; the substrate gates the model.
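The propose-then-gate pattern can be sketched as follows, in Python for illustration. The thresholds, the `AUTHORIZED` approver set, and the names `Proposal`, `required_signatures`, and `execute_if_approved` are all hypothetical; a real Policy stream would be governance code supplied by the deployment.

```python
from dataclasses import dataclass

AUTHORIZED = {"alice", "bob", "carol"}   # the N in M-of-N (hypothetical)


@dataclass(frozen=True)
class Proposal:
    action: str            # e.g. "payment" or "email"
    amount: float = 0.0
    recipients: tuple = ()


def required_signatures(p: Proposal) -> int:
    """Hypothetical policy: how many distinct approvers this action needs."""
    if p.action == "payment":
        return 2 if p.amount > 10_000 else 1
    if p.action == "email":
        return 1 if len(p.recipients) > 50 else 0
    return 1  # unknown action types default to requiring a human


def execute_if_approved(p: Proposal, approvers: set, perform) -> bool:
    """Run the side effect only when enough authorized approvers have signed."""
    valid = approvers & AUTHORIZED       # signatures from outsiders don't count
    if len(valid) < required_signatures(p):
        return False                     # nothing leaves the box
    perform(p)
    return True


sent = []
big = Proposal(action="payment", amount=25_000)
print(execute_if_approved(big, {"alice"}, sent.append))
print(execute_if_approved(big, {"alice", "mallory"}, sent.append))
print(execute_if_approved(big, {"alice", "bob"}, sent.append))
```

The tool never holds the ability to perform the payment. It can only hand a `Proposal` to the gate, and the gate's decision depends on the manifest and the signatures, not on anything the model says about itself.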

Every approval is an OpenClaim signature on a specific action. The signature includes who signed, what they signed for, when, and (for compound approvals) which co-signers participated. Six months later, an auditor can ask "show me every payment over $10,000 made by the procurement workflow in March, and who approved each" — and the answer is a query against signed records, not a forensic reconstruction from log files.
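The auditor's question can be expressed as a filter over signed records. The sketch below is in Python and uses HMAC-SHA256 as a simple stand-in for OpenClaim signatures, whose actual format is not described here; a production system would use asymmetric signatures so that verifiers never hold signing keys. The record layout, key table, and sample payments are hypothetical.

```python
import hashlib
import hmac
import json
from datetime import date

SIGNING_KEYS = {"alice": b"alice-demo-key", "bob": b"bob-demo-key"}  # demo only


def _mac(signer, action):
    body = json.dumps(action, sort_keys=True).encode("utf-8")
    return hmac.new(SIGNING_KEYS[signer], body, hashlib.sha256).hexdigest()


def sign(signer, action):
    """Return an approval record: the action plus the signer's MAC over it."""
    return {"signer": signer, "action": action, "mac": _mac(signer, action)}


def is_valid(record):
    expected = _mac(record["signer"], record["action"])
    return hmac.compare_digest(expected, record["mac"])


def payments_over(records, workflow, min_amount, year, month):
    """Map payment id -> sorted approvers, counting only valid signatures."""
    found = {}
    for rec in records:
        a = rec["action"]
        when = date.fromisoformat(a["date"])
        if (a["type"] == "payment" and a["workflow"] == workflow
                and a["amount"] > min_amount
                and (when.year, when.month) == (year, month)
                and is_valid(rec)):
            found.setdefault(a["id"], []).append(rec["signer"])
    return {pid: sorted(who) for pid, who in found.items()}


p1 = {"id": "pay-1", "type": "payment", "workflow": "procurement",
      "amount": 12_500, "date": "2026-03-14"}
p2 = {"id": "pay-2", "type": "payment", "workflow": "procurement",
      "amount": 900, "date": "2026-03-20"}
records = [sign("alice", p1), sign("bob", p1), sign("alice", p2)]
report = payments_over(records, "procurement", 10_000, 2026, 3)
print(report)
```

Because each record binds the signer to the exact action that was approved, editing the amount after the fact invalidates the record, and the query simply skips it.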

VIII · An honest pause

When you should still use a swarm.

The argument so far has been one-sided because the architectural comparison is one-sided. But there are tasks where running Kimi Agent Swarm is the right move, and pretending otherwise would be dishonest.

One-shot research on public data · Swarm wins

"Compare these 30 open-source frameworks." "Summarize what's been published on this topic this year." Tasks where the data is public, the deliverable is ephemeral, and there's no requirement to run the same workflow twice. The friction of setting up a Safebox workflow isn't justified for a one-time research task with non-sensitive inputs. Open kimi.com, type the prompt, pay the $10, get the deliverable.

No existing workflow, no time to author one · Swarm wins

On-demand tool generation closes most of this gap, but not all of it — sometimes you genuinely need a result in five minutes, and you don't care whether the underlying decomposition is good enough to reuse next week. The swarm's "type and walk away" UX is real, and Safebox's "configure once, run forever" UX is real, and they serve different urgencies.

Tasks where every step truly requires LLM reasoning · Either works

A few task shapes genuinely are LLM-bound — creative writing variations, complex multi-perspective synthesis, novel-translation work. If 80% of the steps in your workflow really do need a transformer, the cost differential between swarm and Safebox is much smaller, because Safebox is paying the LLM cost too. The architectural advantage is still real, just less stark on dollars.

The honest framing: Kimi Agent Swarm is a great tool for the category it was built for. It's the wrong tool for tasks where the data is sensitive, the workflow is recurring, the cost is sensitive at scale, or the side effects require governance. Safebox is the wrong tool for casual one-shot research. The categories barely overlap, and choosing well between them is a question of what kind of work you're doing — not which system is "better."
