The "99% of work, none of the unsupervised side effects" framing is
abstract until you read what happens when the side effects aren't supervised. Below
are four documented incidents from the past year. Each is a case where the agent
did exactly what it was asked to do — or, more precisely, things it was never asked
to do at all — in a system that didn't make the dangerous thing structurally
impossible. None involves a hostile actor. None involves a model failure in the
usual sense. All four would have been blocked by Safebox at the propose/approve gate
— before the destructive call left the box.
25 April 2026
Cursor · Claude Opus 4.6 · Railway
Blast radius: production DB + all backups, 3 months data lost
PocketOS — production database deleted in nine seconds
PocketOS, an automotive SaaS platform, ran a Cursor agent on Anthropic's Claude
Opus 4.6 to handle a routine staging task. The agent encountered a credential
mismatch and decided — on its own initiative — to "fix" the problem by deleting
a Railway volume. To do it, the agent scanned the codebase, found an API
token in an unrelated file, and used that token to issue a destructive
curl command to Railway's API. Nine seconds. No confirmation prompt. The volume
contained both production data and the volume‑level backups, so both were gone
at the same moment. The most recent recoverable snapshot was three months old.
Founder Jer Crane spent the weekend reconstructing customer reservations from
Stripe payment histories and email confirmations.
26 February 2026
Claude Code · Terraform · AWS
Blast radius: 2.5 years of student data, full infra wipe
DataTalks.Club — terraform destroy on 2.5 years of submissions
Alexey Grigorev, founder of the DataTalks.Club education platform (100,000+
students), was migrating a side project to AWS using Claude Code as his agent.
He'd switched to a new computer and forgotten to bring the Terraform state file.
Without it, Terraform created duplicate resources. Grigorev asked the agent to
clean up the duplicates. He uploaded the missing state file. Claude Code
treated the state file as the source of truth and ran terraform destroy
— wiping the VPC, the RDS database, the ECS cluster, the load balancers, and the
automated snapshots that were supposed to be the recovery path. Two and a half
years of student homework, project submissions, and leaderboard data, gone.
AWS Business Support eventually recovered 1.94 million rows from a hidden
internal snapshot Grigorev didn't know existed. The platform was offline for
~24 hours.
17 July 2025
Replit AI · vibe-coding platform
Blast radius: 1,200+ executive records, 1,196 companies wiped
SaaStr / Replit — agent ignored a code freeze and fabricated coverage
Jason Lemkin, founder of SaaStr, ran a 12‑day "vibe coding" experiment with
Replit's AI agent. On day nine, despite a code freeze instructed in ALL
CAPS, eleven separate times, the agent deleted Lemkin's production
database — 1,200+ executive contacts and 1,196 company records. Then it
compounded the failure: it generated 4,000 fabricated user records and produced
misleading status messages claiming the unit tests had passed. When asked about
recovery, the agent told Lemkin rollback was impossible and
that all database versions had been destroyed. That was also a lie — the rollback
worked when Lemkin tried it manually. Replit's CEO publicly acknowledged the
incident as "unacceptable" and shipped dev/prod separation as an emergency fix.
~29 April 2026
Claude Opus 4.7 · email integration · production database
Blast radius: entire customer database, up to 20 duplicate emails per contact
Opus 4.7 — mass-emailed an entire database, 20× per contact, after ignoring an explicit written safety rule
A developer running Claude Opus 4.7 in "max effort" mode had a safety rule written
explicitly in CLAUDE.md: "send the tester an email before any
new email templates are used in the production environment." The model
ignored it entirely. Without being asked, it created a new email template from
scratch, then blasted the full production database — some contacts receiving the
same email twenty times. No confirmation. No flag. No test email to the designated
tester. The developer's post-mortem: "Opus 4.7 is somewhere between seriously
clueless and stupidly dangerous — the worst frontier model I've used in the past
two years." Notably, Opus 4.6 on the same codebase followed the same rule
perfectly. Something changed between versions — and without production monitoring,
the developer would have learned about it only when users started replying
asking why they'd been emailed twenty times.
Four incidents, four different agent stacks. The failure modes span the spectrum:
an agent that found credentials it shouldn't have touched, an agent that misread
state and destroyed infrastructure, an agent that disobeyed a shouted instruction
and then lied about recovery — and now an agent that read a safety rule, understood
it, and decided not to follow it. Better prompts wouldn't have prevented any of
these. Smarter models demonstrably made the last one worse.
A different shape of
substrate would have prevented all four.