Claude Code and Cursor are great when one developer is changing one file. They get harder to use when changes need another team to sign off, when refactors cross language boundaries, or when long-running sessions burn through your token budget. Here's a system built for those cases — what it does, how it works, and where you wouldn't use it.
Most AI coding tools collapse "discussing what to change" and "actually changing it" into a single agent loop. That works for one developer / one file. It breaks at team scale. The system here splits the two layers, indexes your codebase into a graph the LLM can query instead of re-reading, and routes everything through a governance check before any file gets written. You keep your IDE; this handles what your IDE was never built for.
You know the loop. Open a file. Describe what you want. Model proposes. You accept or fix. Move on. Ten iterations a minute when it's flowing. Every assumption that loop makes is true at its design point — one developer, one file, reversible edits, short session, git is the audit trail. Past those assumptions, four things break:
One file, one developer, quick verify. If wrong, fix inline. Five minutes, done.
Function and its callers fit in one editor window. You can read the diff yourself.
Needs agreement before any code moves. Currently: Slack thread, meeting, PR, follow-up. The conversation isn't yours to drive alone.
Too much for one context window. You batch it, stitch results yourself, lose track of which file you're on by symbol thirty.
Bindings aren't in the same repo or IDE session. The model can't know they exist. Cross-language is a discovery problem the loop doesn't solve.
Model doesn't remember. You paste history. The bill grows each session. Six sessions in, you've spent more re-anchoring than reasoning.
These aren't tooling complaints — they're the boundary of what agent-loop architecture is designed for. Everything past this paragraph is about what's on the other side.
Coding with AI is actually two different jobs glued together:
Talking — figuring out what should change. Reading code. Asking questions. Getting input from other people. None of this touches your codebase.
Doing — making the change. Writing code, running tests, pushing branches. This mutates state.
In Claude Code or Cursor, one model does both. It thinks AND acts, in free-form sequence. Fine for small reversible work. Wrong shape when the doing has to be agreed beforehand, audited afterward, or composed with other ongoing work.
The system this piece describes splits them. Four pieces:
The key property: the chat layer can't write to your codebase. Whatever it discusses or proposes only reaches code if the workflow engine runs it — and the workflow engine only runs when a proposal passes a governance check the chat layer can't bypass. You configure how strict that check is.
Failure modes get smaller. Chat hallucinates? The worst it does is generate a proposal that gets rejected — nothing reaches your code. Workflow fails mid-execution? The worst it does is leave a workspace clone behind that gets cleaned up. Compare to an agent loop where the same model that's confidently wrong is also holding the file-write tool.
Before either of the other layers is worth talking about, the system needs to know your codebase. Not have read access — actually know, at the structural level of what each function does, what calls what, and where the same operation lives in different languages.
Grokers builds this. Both the chat layer and the workflow engine read from it. Two comparisons that might help orient:
Point Grokers at a repo and it does a comprehension pass. Two phases are cheap; one uses the LLM.
Phase 1: tree-sitter parsing. Tree-sitter is a pure grammar parser: it builds a syntax tree from the source text without running the code. Grokers ships parsers for ~10 languages (Python, JS, TS, Go, Rust, PHP, Ruby, Java, C++, Bash). It extracts symbols and call expressions. A 100k-line codebase parses in a few minutes. Zero LLM tokens.
Phase 2: relation derivation. Still no LLM. From the parsed trees: who calls who, which tests cover which symbols, which methods belong to which classes. Cycles in the call graph get tagged for phase 3 to handle separately.
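A minimal sketch of what phase 2 might look like, assuming phase 1 has already produced a mapping from each symbol to the symbols it calls. The function names and sample symbols are hypothetical, not Grokers' actual API; cycle detection here uses Tarjan's strongly-connected-components algorithm, which is one reasonable choice rather than a confirmed implementation detail.

```python
from collections import defaultdict

# Hypothetical phase-1 output: symbol -> symbols it calls.
calls = {
    "cli.main": ["cli.parse_args", "core.run"],
    "cli.parse_args": [],
    "core.run": ["core.step"],
    "core.step": ["core.run"],  # mutual recursion: a cycle
}

def reverse_edges(calls):
    """Invert the call edges so 'who calls X?' is a dictionary lookup."""
    callers = defaultdict(set)
    for src, targets in calls.items():
        for dst in targets:
            callers[dst].add(src)
    return dict(callers)

def find_cycles(calls):
    """Return the strongly connected components that contain a cycle."""
    index, low, on_stack, stack, out = {}, {}, set(), [], []

    def visit(node):
        index[node] = low[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for nxt in calls.get(node, []):
            if nxt not in index:
                visit(nxt)
                low[node] = min(low[node], low[nxt])
            elif nxt in on_stack:
                low[node] = min(low[node], index[nxt])
        if low[node] == index[node]:  # node is the root of a component
            comp = set()
            while True:
                top = stack.pop()
                on_stack.discard(top)
                comp.add(top)
                if top == node:
                    break
            # Keep only real cycles: several members, or a self-call.
            if len(comp) > 1 or node in calls.get(node, []):
                out.append(comp)

    for node in calls:
        if node not in index:
            visit(node)
    return out
```

The recursive walk is fine for a sketch; a production version over a very deep call graph would want an iterative traversal.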
Phase 3: contract extraction. The LLM phase, run in parallel across hundreds of symbols at once. For each symbol, the model generates four prose attributes: preconditions, postconditions, side effects, invariants. About 200 tokens per symbol — much smaller than the source. Dependencies are respected (function A's precondition referencing class B's invariant waits for B). Cycles fall back to partial-info passes and get flagged.
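The dependency ordering in phase 3 can be sketched with Python's standard `graphlib` module. This assumes the dependencies are available as a mapping from each symbol to the symbols whose contracts it needs first; `extraction_batches` and the sample symbol names are hypothetical. Each batch contains symbols that could be sent to the model in parallel.

```python
from graphlib import TopologicalSorter, CycleError

def extraction_batches(depends_on):
    """Group symbols into batches; each batch depends only on earlier ones.

    depends_on maps a symbol to the symbols whose contracts it needs first.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises CycleError if the dependencies contain a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # these could run in parallel
        batches.append(ready)
        sorter.done(*ready)
    return batches

# Hypothetical dependency map for four symbols.
deps = {
    "A.parse": {"B"},
    "B": set(),
    "C.helper": {"B"},
    "D.main": {"A.parse", "C.helper"},
}
```

In this sketch a `CycleError` marks the point where a real indexer would switch to the partial-information fallback the text describes; the fallback itself is not modeled here.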
For parse_args(s, legacy_format=False):
preconditions: s is a non-empty string. If legacy_format, s uses comma-separated form; otherwise --key=value form.
postconditions: Returns dict[str, str]. Raises ParseError on malformed input.
sideEffects: None. Pure function.
invariants: Output dict's keys are always lowercase.
The chat layer reads this instead of the source when it's discussing the function. The workflow engine verifies against this after rewriting it. Same artifact, both layers.
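To make the contract concrete, here is a sketch of a function that would satisfy it. The source does not show the real `parse_args`, so everything below is an assumption for illustration: in particular the exact legacy syntax (`key=value,key=value`) and the decision to raise `ParseError` when the precondition is violated.

```python
class ParseError(ValueError):
    """Raised on malformed input, as the postcondition requires."""

def parse_args(s, legacy_format=False):
    # Precondition: s is a non-empty string.
    if not isinstance(s, str) or not s.strip():
        raise ParseError("expected a non-empty string")
    # Assumption: legacy form is "key=value,key=value";
    # current form is "--key=value --key=value".
    tokens = s.split(",") if legacy_format else s.split()
    result = {}
    for token in tokens:
        token = token.strip()
        if not legacy_format:
            if not token.startswith("--"):
                raise ParseError(f"missing '--' prefix: {token!r}")
            token = token[2:]
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ParseError(f"expected key=value: {token!r}")
        result[key.lower()] = value  # invariant: keys are always lowercase
    return result  # postcondition: dict[str, str]; no side effects
```

The four contract attributes map onto four things a reviewer can check without reading the parsing logic: the guard at the top, the return type, the absence of I/O, and the `lower()` call.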
Most code-comprehension tools either pretend to understand everything (the LLM fabricates plausible contracts) or silently skip what they can't analyze (you don't know what they missed). Grokers does neither.
When the model's confidence is low — dynamic dispatch, monkey patches, eval calls, paths that depend on runtime config — the symbol gets tagged with a "concept to investigate" entry describing the specific concern. These are queryable. A human or follow-on workflow picks them up, clarifies, and Grokers re-evaluates. Over time, the unknowns shrink. What you end up with is a graph whose remaining gaps are explicit and finite, not implicit and unbounded.
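The text ties these entries to low model confidence. As a stand-in, the sketch below uses a purely static heuristic for Python source: it flags calls to a few built-ins that defeat static analysis. The record fields (`symbol`, `line`, `concern`) and the list of flagged names are assumptions, not the system's actual schema.

```python
import ast

# Built-ins whose targets cannot be resolved by reading the source alone.
DYNAMIC_CALLS = {"eval", "exec", "getattr", "setattr", "__import__"}

def concepts_to_investigate(source, symbol):
    """Return one entry per dynamic construct found in the source text."""
    entries = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in DYNAMIC_CALLS):
            entries.append({
                "symbol": symbol,
                "line": node.lineno,
                "concern": f"dynamic call to {node.func.id}(); "
                           "target not statically known",
            })
    return sorted(entries, key=lambda e: e["line"])

# Hypothetical function that dispatches dynamically.
sample = '''
def dispatch(obj, name, expr):
    handler = getattr(obj, name)
    return handler(eval(expr))
'''
```

Each entry names a specific concern at a specific location, which is what makes the gap queryable rather than silent.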
The hardest thing for a single-language tool is cross-language refactoring. Your Python function has a Go shim, a TS binding, a Ruby script that calls the same HTTP endpoint. None of this is visible to a tool reading one language at a time.
During indexing, when Grokers sees a symbol whose contract describes an HTTP endpoint, message-bus publish, FFI export, or other cross-process operation, it emits a bridge node. Symbols in other languages that reference the same operation point at the same bridge. The Python parse_args and its Go binding share a bridge.
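A sketch of how bridge nodes could turn cross-language discovery into a plain graph walk. The record layout (a `provider` and a list of `consumers` per bridge), the symbol names, and the endpoint are all hypothetical; the source only says that symbols referencing the same operation point at the same bridge.

```python
from collections import deque

# Hypothetical call edges: (caller, callee).
calls = [
    ("py:cli.main", "py:parse_args"),
    ("py:batch.load", "py:parse_args"),
    ("go:worker.Run", "go:client.Parse"),
]
# Hypothetical bridge records, keyed by the cross-process operation.
bridges = {
    "POST /v1/parse": {
        "provider": "py:parse_args",
        "consumers": ["go:client.Parse", "ts:parseArgs"],
    },
}

def dependents(target, calls, bridges):
    """Collect every symbol that depends on target, in any language."""
    seen, queue = set(), deque([target])
    while queue:
        node = queue.popleft()
        # Direct callers within a language.
        upstream = [src for src, dst in calls if dst == node]
        # Consumers in other languages, reached through a bridge.
        for bridge in bridges.values():
            if bridge["provider"] == node:
                upstream.extend(bridge["consumers"])
        for sym in upstream:
            if sym not in seen and sym != target:
                seen.add(sym)
                queue.append(sym)
    return seen
```

Note that the walk only crosses a bridge from the provider side, so a TS binding is never reported as depending on a Go binding that happens to share the same endpoint.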
"What depends on this Python function across all our languages?" becomes one graph walk. No LLM call. No similarity search. The discovery work was done once during indexing.
| Record type | What it holds |
|---|---|
| Grokers/symbol | One per function, class, or module. Source plus metadata. |
| Grokers/contract | The four-attribute contract. ~200 tokens per symbol. |
| Grokers/calls | Edges: source symbol → target symbol. Reverse-walkable. |
| Grokers/covers | Edges: test → symbols the test exercises. |
| Grokers/extern | Bridge nodes linking the same operation across languages. |
| Grokers/concept-to-investigate | Places the indexer wasn't confident. Explicit, queryable. |
Grokers isn't one-shot. New upstream commits trigger incremental re-indexing of affected symbols and their immediate neighbors. The graph stays current with the codebase. A useful side effect: when the workflow engine modifies a function, the re-grok pass updates the contract automatically. The next conversation about that symbol sees the new contract. There's no "refresh the index" step.
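One way to read "affected symbols and their immediate neighbors" is a one-hop expansion over the call edges. The sketch below assumes that reading; the function name and sample edges are hypothetical.

```python
def symbols_to_reindex(changed, calls):
    """Changed symbols plus their direct callers and callees (one hop)."""
    affected = set(changed)
    for src, dst in calls:
        if src in changed:
            affected.add(dst)  # callee of a changed symbol
        if dst in changed:
            affected.add(src)  # caller of a changed symbol
    return affected

# Hypothetical call edges: (caller, callee).
calls = [
    ("main", "parse_args"),
    ("parse_args", "tokenize"),
    ("tokenize", "lex"),
    ("report", "render"),
]
```

Limiting the pass to one hop keeps re-indexing proportional to the size of the commit rather than the size of the repository.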
RAG retrieves chunks by embedding similarity. That works for prose. For code it fails three ways:
Contracts replace embeddings of source. Relations replace nearest-neighbor lookups. Cross-language bridges replace discovery work no embedding search could do (a Python function and its Go binding share nothing at the embedding level).
Knowing the codebase is what makes the rest cheap. The LLM doesn't re-derive structure every turn. The chat doesn't ship source. The workflow doesn't have to discover what breaks. Indexed once, queried forever.
Two teams. Engineering owns Python's parse_args. Api-consumers represents three downstream services that call it. Engineering wants to drop legacy_format=True — nobody internal uses it and it slows the parser.
With Claude Code: Slack message, meeting, PR, follow-up. Here's the system version:
A developer types: "I want to drop legacy_format from parse_args. Can we coordinate with the consumers?"
The chat is hosted in the platform itself, not Slack — a chat-shaped record other parts of the system can read. The engineering team has a bot watching: a sandboxed function that reads anything, calls the LLM, proposes actions — but can't write to your codebase directly.
Grokers indexed this repo earlier and has been re-indexing on every commit. parse_args has a contract record sitting ready: ~200 tokens. The bot fetches it. A graph query returns 20 internal callers. Following bridge edges gives consumers in Go and TS. Cost so far: zero tokens.
~3,000 input tokens, structured output. The model returns JSON:
{
"kind": "workflow",
"verb": "openNegotiation",
"args": { "parties": ["engineering", "api-consumers"],
"topic": "parse_args.legacy_format deprecation" },
"rationale": "Cross-team API change; needs agreement before code moves."
}
The bot proposes the action. Governance auto-approves (opening conversations writes nothing). A negotiation record gets created — think of it as a PR for an agreement rather than for code.
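Because the model's output is structured, it can be validated before governance ever sees it. A minimal sketch, assuming the four fields in the JSON above are required and that unknown verbs are rejected; the `KNOWN_VERBS` set is illustrative and only contains the verb that appears in this walkthrough.

```python
import json

REQUIRED = {"kind": str, "verb": str, "args": dict, "rationale": str}
KNOWN_VERBS = {"openNegotiation"}  # assumption: only the verb shown above

def parse_proposal(raw):
    """Parse model output; reject anything that is not a well-formed proposal."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("proposal must be a JSON object")
    for field, expected in REQUIRED.items():
        if not isinstance(data.get(field), expected):
            raise ValueError(f"missing or mistyped field: {field}")
    if data["verb"] not in KNOWN_VERBS:
        raise ValueError(f"unknown verb: {data['verb']}")
    return data

raw = """{
  "kind": "workflow",
  "verb": "openNegotiation",
  "args": {"parties": ["engineering", "api-consumers"],
           "topic": "parse_args.legacy_format deprecation"},
  "rationale": "Cross-team API change; needs agreement before code moves."
}"""
```

A malformed or hallucinated proposal fails here as a `ValueError`, which matches the failure mode described earlier: the worst outcome is a rejected proposal.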
Each side's bot hashes its relevant context (contracts it'd cite, conventions, prior agreements with this counterparty). They exchange hashes, walk the trees, request what's missing. Two round-trips.
After this, citing a contract by hash costs ~50 tokens instead of restating its 200-token body. It's git's pack protocol applied to LLM context — and it compounds across every future session between these parties.
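The exchange can be sketched with content hashes. The text describes walking hash trees; the version below simplifies that to a flat set difference, which shows the idea but not the tree traversal. The token figures are the article's approximations (about 200 per contract body, about 50 per hash citation), and all function names are hypothetical.

```python
import hashlib

def digest(text):
    """Content address for a context item (a contract, a convention, ...)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def missing_items(local_items, remote_hashes):
    """Items the other side has not seen yet, keyed by hash."""
    indexed = {digest(item): item for item in local_items}
    return {h: item for h, item in indexed.items() if h not in remote_hashes}

def citation_cost(n_citations, synced, body_tokens=200, hash_tokens=50):
    """Rough token cost of citing one contract n times."""
    return n_citations * (hash_tokens if synced else body_tokens)
```

With these numbers, ten citations cost roughly 2,000 tokens when the body is restated each time and roughly 500 once both sides hold the same hash, and the saving repeats in every later session between the same parties.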
The api-consumers bot reads the proposal with its own context: stability policies, SLA docs. It finds that the team's stability convention requires a 12-month deprecation window. Its structured response: contest timeline; propose 12 months with a backward-compat shim.
Ambiguity surfaces: does "shim" count as backward-compat for the SLA? Rather than expanding the main negotiation (and paying token cost for both sides per turn), it becomes a side thread tagged as blocking. Resolution comes back as a single envelope referencing the answer.
Final terms: 9 months, shim provided, deprecation warning this quarter. Both sides (or their human operators, depending on each team's autonomy config) sign. Signatures accumulate as votes; the platform emits a lockCommitted event when the threshold clears.
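Signature accumulation can be sketched as a small state machine. The class and method names are hypothetical; only the `lockCommitted` event name comes from the text, and the threshold of two is an assumption for a two-party negotiation.

```python
class Negotiation:
    """Collects signatures and fires lockCommitted once the threshold clears."""

    def __init__(self, parties, threshold):
        self.parties = set(parties)
        self.threshold = threshold
        self.signatures = set()
        self.events = []

    def sign(self, party):
        if party not in self.parties:
            raise ValueError(f"{party} is not a party to this negotiation")
        self.signatures.add(party)  # signing twice counts once
        cleared = len(self.signatures) >= self.threshold
        if cleared and "lockCommitted" not in self.events:
            self.events.append("lockCommitted")  # emitted exactly once
        return list(self.events)
```

Storing signatures in a set makes repeated signing harmless, and the event is emitted only on the transition across the threshold.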
The signed proposal carries one extra attribute: executesWorkflow: 'Code/code/modify'. The agreement isn't just a document — it's a trigger.
A lock-execution bot proposes code/modify with the lock URI as provenance. Governance has a specific rule: workflows with signed-negotiation provenance are pre-approved. The signing ceremony was the human approval; requiring a second signature now would be empty ritual.
The workflow creates a ZFS clone of the engineering repo (atomic, copy-on-write, instantly discardable if anything fails). It asks the model for the rewrite, grounded in the contract and the lock's terms. Result is verified against the original contract — this is class-B (precondition relaxed; old keyword still accepted via shim). Tests run inside the clone. They pass. A branch gets pushed to your upstream.
Grokers picks up the new commit. Re-derives parse_args's contract. Updates the record. Consumer bots read the new contract and verify it matches what was agreed. If the rewrite had drifted from the lock's terms, this is where it'd surface.
Three records of substance (dialog, negotiation, workflow). The model was called four to six times per side, each on a focused prompt under 5k tokens. Every step has a declared input, output, governance rule, and audit trail back to the original message.
Six months later, "why does parse_args look like this now?" isn't reconstructed from PR descriptions and Slack scrollback. It's a query.
Three scenarios. Smart baseline (RAG retrieval, structured prompting — not a strawman) vs. system-backed.
Single cross-team deprecation. ~150k tokens for the baseline vs. ~70k system-backed.
Same teams over six months. Sync state compounds. ~750k vs. ~200k.
Python, Go, TS. Bridges substitute for huge discovery passes.
Four stacking sources:
The cheapest token is the one you don't send. The system doesn't make LLM calls more efficient — it makes most of the work not need an LLM call at all.
Governance lives at one specific moment: when a tool proposes an action. A small judgment function reads the tool's declared action types, looks up the policy for each, returns approve / abstain / reject. The judgment and the policies are themselves records — readable, modifiable through the same governance.
| Approval shape | When you use it |
|---|---|
| Auto-approve | Routine, reversible work. Docs updates, notifications, read-only audits. Class-A style fixes. |
| Single human signature | Default for code modifications. Faster than current PR review because the proposal is structured — you're approving a description, not reading a diff cold. |
| Quorum or specific provenance | Cross-team. Governance edits. Anything where unilateral action is the problem. |
One specific rule recognizes proposals carrying signed-negotiation provenance and treats them as pre-approved. Without it, every signed agreement would need a second signature at execution time — defeating the agreement. With it, the signing ceremony IS the approval. Teams who want stricter control disable the shortcut.
The point: humans stay in the loop where their judgment adds value, and stay out where requiring a signature would be empty ritual. Teams decide which is which, and can change later if they got it wrong.
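A sketch of the judgment function under stated assumptions: the action-type names and policy labels are illustrative, "abstain" is taken to mean "not enough signatures yet", and an undeclared action type is always rejected. The signed-negotiation shortcut is a flag so that stricter teams can turn it off.

```python
POLICIES = {  # assumption: action types and policy names are illustrative
    "docs/update": "auto",
    "code/modify": "single-signature",
    "governance/edit": "quorum",
}

def judge(action_types, signatures=0, quorum=2,
          signed_negotiation=False, provenance_shortcut=True):
    """Return 'approve', 'abstain' (needs more signatures), or 'reject'."""
    if not action_types:
        return "reject"  # a proposal must declare what it will do
    verdicts = []
    for action in action_types:
        policy = POLICIES.get(action)
        if policy is None:
            return "reject"  # undeclared action type: never allowed
        if policy == "auto" or (provenance_shortcut and signed_negotiation):
            verdicts.append("approve")
        else:
            needed = 1 if policy == "single-signature" else quorum
            verdicts.append("approve" if signatures >= needed else "abstain")
    return "approve" if all(v == "approve" for v in verdicts) else "abstain"
```

Because `POLICIES` is plain data, changing how strict the check is means editing a record rather than changing code, which mirrors the claim that policies are themselves governed records.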
A familiar worry: model writes code, tests pass, but the new code subtly weakens a guarantee some caller depended on. Function used to never return null; now it does in edge cases. Tests don't catch it — tests verify the new behavior. The caller breaks in production three weeks later.
The system addresses this directly. After every rewrite, the workflow engine re-derives the contract from the new source and compares against the old. Three classifications:
| Class | What it means | What happens |
|---|---|---|
| A — preserved | Equivalent claims. Same behavior, different expression. | Apply. Run tests. Done. |
| B — upward-compatible | Preconditions relaxed OR postconditions strengthened. Existing callers still satisfied. | Apply. Log the relaxation. Update docs. |
| C — incompatible | Preconditions strengthened OR postconditions weakened. Callers will break. | Walk callers via the call graph. For each, evaluate against the new contract. If broken, recurse — the caller becomes its own modification sub-workflow. |
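The three classes can be sketched as set comparisons, assuming each contract has been normalized into a set of precondition claims (what callers must satisfy) and a set of postcondition claims (what the function guarantees). Real contracts are prose, so the normalization step is a significant assumption that this sketch does not attempt.

```python
def classify(old, new):
    """Compare two contracts given as {'pre': set, 'post': set} of claims."""
    if new["pre"] == old["pre"] and new["post"] == old["post"]:
        return "A"  # preserved
    callers_still_qualify = new["pre"] <= old["pre"]    # nothing new demanded
    guarantees_still_hold = new["post"] >= old["post"]  # nothing promised was dropped
    if callers_still_qualify and guarantees_still_hold:
        return "B"  # upward-compatible
    return "C"      # incompatible: callers must be re-checked

# Hypothetical normalized claims for illustration.
old = {"pre": {"s is non-empty", "s is ASCII"},
       "post": {"returns dict", "keys lowercase"}}
relaxed = {"pre": {"s is non-empty"}, "post": old["post"]}
weakened = {"pre": old["pre"], "post": {"returns dict"}}
```

Here `relaxed` drops a requirement on callers (class B), while `weakened` drops a guarantee that some caller may rely on (class C).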
Class C is where the architecture earns its complexity. Recursion stays bounded as long as the walk visits each symbol once: the call graph is finite, and the cycles it contains are already tagged during indexing. Most ripples terminate within 2-3 levels. Each caller's sub-workflow goes through the same contract check, same governance, same tests. When the whole ripple completes, you have a coherent set of changes that succeed or fail together.
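The ripple can be sketched as a breadth-first walk over reverse call edges that only continues through callers that actually break. The `visited` set and the `max_depth` cap are assumptions of this sketch; they keep the walk finite even where the call graph contains cycles. The `is_broken` callback stands in for evaluating a caller against the new contract.

```python
def ripple(start, callers, is_broken, max_depth=10):
    """Walk callers of a changed symbol; recurse only through broken ones.

    callers: symbol -> list of symbols that call it.
    is_broken(caller, callee): True if caller violates callee's new contract.
    Returns the symbols that need their own modification sub-workflow.
    """
    to_modify, visited = [], {start}
    frontier = [(start, 0)]
    while frontier:
        symbol, depth = frontier.pop(0)
        if depth >= max_depth:
            continue
        for caller in callers.get(symbol, []):
            if caller in visited:
                continue  # each symbol is examined once, cycles included
            visited.add(caller)
            if is_broken(caller, symbol):
                to_modify.append(caller)
                frontier.append((caller, depth + 1))
    return to_modify
```

Callers that still satisfy the new contract stop the ripple on their branch, which is why most ripples stay shallow.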
The model can rewrite a function. It can sometimes notice it broke a caller. It cannot, in a tight loop, methodically walk a call graph and recurse with discipline. The system can, because the system isn't the LLM — it's the orchestration around the LLM.
modify is the first. Same shape carries the rest.
Once the basic shape works (fork a workspace, read context, propose via LLM, verify against contract, apply through governance), that shape carries a family of operations:
| Workflow | What it does |
|---|---|
| code/modify | Change a function. Re-derive contract. Ripple if needed. |
| code/rename | Rename a symbol everywhere. Contract preserved by construction. |
| code/document | Generate or update documentation. Targeted edits, never wholesale regeneration. |
| code/audit | Read-only scan. Find symbols matching a query: missing contracts, complexity above threshold, security patterns. |
| code/test | Generate tests for under-covered symbols. Mutation-tested to confirm they actually exercise the target. |
| code/cloneRepo | Clone a git repo at a specific commit. Triggers Grokers indexing. |
| code/notify | Post change envelopes to affected parties: subscribed devs, cross-language consumers, plugin authors. |
Same machinery throughout: same workspace forking, same contract verification, same governance gating, same audit chain. The middle differs. None required new platform primitives.
Three places state lives, each with one owner:
| Layer | What it holds |
|---|---|
| Records (canonical) | Contracts, relations, conversations, proposals, approvals, audit chain. Source of truth. |
| ZFS clones | Filesystem realization of workspace forks. Where compilers and test runners operate. Atomic, copy-on-write, instant to discard. |
| Git branches | One-way projection for human review. Pushed only when a workflow succeeds. No merging back. |
Git is one-way deliberately. If the platform pulled, it would have to reconcile its record-graph view against text-diff changes upstream — wickedly hard when one side is structured and the other isn't. Forward-only avoids reconciliation. Cost: the platform's view can drift from upstream. Benefit: every branch is grounded in a known-good baseline, and the audit chain stays intact.
Why ZFS specifically: snapshots are atomic, clones are copy-on-write (parallel workflows don't multiply disk), destruction is instant. Btrfs works similarly. Plain copy works at the cost of disk space. Symlink-and-move tricks don't actually work — they pretend to be atomic and aren't.
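A sketch of the fork-and-discard cycle using the standard `zfs snapshot`, `zfs clone`, and `zfs destroy` subcommands. The dataset name and the `run-<id>` naming scheme are hypothetical. The helpers return argument lists rather than executing anything, so they can be inspected on a machine without ZFS; `run_all` is the only function that touches the system.

```python
import subprocess

def clone_commands(dataset, run_id):
    """argv lists that snapshot a dataset and clone it for one workflow run."""
    snapshot = f"{dataset}@run-{run_id}"
    clone = f"{dataset}-run-{run_id}"
    return [
        ["zfs", "snapshot", snapshot],      # atomic point-in-time snapshot
        ["zfs", "clone", snapshot, clone],  # writable copy-on-write clone
    ]

def discard_commands(dataset, run_id):
    """argv lists that throw the clone away, then the snapshot it came from."""
    return [
        ["zfs", "destroy", f"{dataset}-run-{run_id}"],
        ["zfs", "destroy", f"{dataset}@run-{run_id}"],
    ]

def run_all(commands):
    """Execute each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

The clone has to be destroyed before its origin snapshot, which is why `discard_commands` lists them in that order.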
For one developer / one file, forking workspaces and routing through governance is overhead with no payoff. IDE-integrated AI coding remains the right tool for in-the-flow editing. The two coexist; the system captures single-file commits if you push them through it, but it's not designed to be your primary editing surface.
Safety properties are structural, not behavioral. A tool can't write outside its declared action types, even if the LLM tries. Autonomy levels are explicit — auto-fire for documentation, single-signature for code, quorum for governance. There's no default-on autonomy. There's a knob with a clear position.
Git stays. The platform pushes branches to git; humans review PRs the way they always have. What changes is what produces the commits and what verifies them before they land. Platform-side verification supplements git-side review; it does not replace it.
The pattern under all of this: the interesting infrastructure question isn't "how smart is the LLM" — it's "what state machinery surrounds the LLM." In this pipeline, the model's job is small and well-defined: read a focused context, produce a structured output, do that maybe a dozen times across a long negotiation. Everything else — conversation hosting, agreement tracking, contract verification, ripple, scheduling, audit — is the platform's job.
The platform doesn't have to be smarter than the model. It has to host the model so the outputs are cheap to verify, easy to govern, and queryable after the fact.
And the platform isn't new. It's the same machinery that hosts email, calendar, messaging, payments, governance — categories where multiple parties coordinate. Code-as-coordination fits the shape because most things fit the shape. For an architecture-centric treatment of the same system, see safebots.ai/coding.html.
You don't have to replace your editor. You have to replace the parts of your workflow the editor was never designed to handle.
The system is open-source under the Qbix umbrella; the four pieces (Safebots, Safebox, Grokers, Code) compose into a working install. If you want to see what configuration looks like for your team's autonomy preferences, follow the link below.