On March 10, Dave Treadwell — Amazon's SVP of eCommerce Foundation — sent an email to his engineering organisation that included this line: “As you likely know, the availability of the site and related infrastructure has not been good recently.”
That's a remarkable sentence for a company spending $200 billion on capital expenditure this year, most of it on AI. And in the same internal briefing, Treadwell's team cited “novel GenAI usage for which best practices and safeguards are not yet fully established” as a contributing factor to a pattern of high-severity incidents stretching back to Q3 2025.
They called it novel. I'd call it overdue.
Here's what we know happened in the nine months before that email.
Late in 2025, Amazon mandated Kiro — its in-house AI coding agent — as the standard development tool across its Stores division, setting an 80% weekly usage target for engineers. Roughly 1,500 engineers pushed back through internal forums, arguing that external tools outperformed Kiro on critical tasks. The feedback was noted. The mandate held. Adoption became a KPI.
In December, an AWS cost management feature went offline for 13 hours. The incident happened after engineers gave Kiro the latitude to resolve an infrastructure issue autonomously. The AI did what it was asked — it deleted the environment and recreated it. Amazon called this user error and misconfigured access controls. Technically, they're not wrong. But that framing is doing a lot of work. The agent had senior-level permissions and no gate requiring it to stop before taking a destructive action. It didn't hesitate because it has no mechanism for hesitation.
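The missing piece wasn't intelligence; it was a checkpoint. A minimal sketch of the kind of gate that would have forced a pause — all names and rules here are hypothetical illustrations, not Amazon's actual tooling:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch: classify an agent's proposed action and refuse
# anything destructive that lacks explicit human sign-off. The verb list
# and action shape are illustrative assumptions.

DESTRUCTIVE_VERBS = {"delete", "drop", "destroy", "terminate", "recreate"}

@dataclass
class AgentAction:
    verb: str                        # e.g. "delete"
    target: str                      # e.g. "prod/cost-env"
    approved_by: Optional[str] = None  # human approver, if any

def is_destructive(action: AgentAction) -> bool:
    return action.verb in DESTRUCTIVE_VERBS

def execute(action: AgentAction) -> str:
    """Run the action only if it is non-destructive or explicitly approved."""
    if is_destructive(action) and action.approved_by is None:
        return f"BLOCKED: '{action.verb} {action.target}' needs human sign-off"
    return f"EXECUTED: {action.verb} {action.target}"

# The December incident shape: the agent proposes delete-and-recreate.
print(execute(AgentAction(verb="delete", target="prod/cost-env")))
```

The point of a gate like this isn't to second-guess the model's reasoning; it's to make "stop and ask" a structural property of the pipeline rather than a behaviour you hope the agent exhibits.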
On March 5, Amazon.com itself went down for roughly six hours. Checkout, pricing, account access — all gone. Outage reports peaked above 21,000 users. Five days later, Treadwell convened an emergency engineering review and announced a new policy: junior and mid-level engineers must now get senior sign-off before deploying any AI-assisted changes to production.
That's the sequence. Three incidents. One pattern. And the pattern has nothing to do with whether the AI made a mistake.
The conversation in most post-mortems right now is the wrong one. Engineers are asking whether the model made a bad decision. Executives are asking whether AI was involved. Both questions lead you away from the actual problem, which is this: AI-integrated development pipelines generate changes faster, at higher confidence, and at greater blast radius than the review processes that surround them were built to handle.
When a human writes and deploys code, the pace is self-limiting. Reviews, approvals, and change controls were designed around that pace. When you introduce an AI agent with senior-level permissions and a mandate to resolve incidents without intervention, you've changed the blast radius without changing the governance. The process looks the same on paper. It just no longer fits the reality it's supposed to contain.
Treadwell's memo named this directly — “GenAI tools supplementing or accelerating production change instructions, leading to unsafe practices.” That's not a model failure. That's a standards gap. The practices that should wrap AI tool usage in a production environment hadn't been established before the tools were in production.
Governance doesn't have to be reactive. It can be built into the process before the incident, not assembled from wreckage afterwards.
The standards that would have caught this aren't exotic. They're the basics: change approval gates that match the blast radius of the change. Clear policy on what autonomy level an agent is permitted before a human checkpoint is required. Review rigour that scales with the risk surface of the deployment target. These aren't new ideas. They predate AI coding tools by decades. What's new is that organisations adopted AI tooling faster than they updated the standards that should have expanded alongside it.
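Those basics can be written down as policy-as-code. A sketch of "review rigour scales with blast radius" — the tiers and rules below are illustrative assumptions, not any company's real change policy:

```python
from enum import Enum

# Hypothetical policy sketch: map a proposed change to the human
# checkpoints it must clear before deployment.

class BlastRadius(Enum):
    SANDBOX = 1    # dev environment, no customer impact
    SERVICE = 2    # a single production service
    PLATFORM = 3   # shared infrastructure, many dependent services

def required_approvals(radius: BlastRadius,
                       ai_assisted: bool,
                       autonomous: bool) -> list:
    """Return the approval gates a change needs, scaled to its risk surface."""
    approvals = []
    if radius is not BlastRadius.SANDBOX:
        approvals.append("peer review")
    if radius is BlastRadius.PLATFORM:
        approvals.append("change advisory sign-off")
    if ai_assisted and radius is not BlastRadius.SANDBOX:
        # Amazon's new rule, generalised: AI-assisted production
        # changes get an extra senior checkpoint.
        approvals.append("senior engineer review")
    if autonomous:
        # An agent acting without a human in the loop must still stop
        # at a pre-execution checkpoint before touching production.
        approvals.append("pre-execution human checkpoint")
    return approvals

print(required_approvals(BlastRadius.PLATFORM, ai_assisted=True, autonomous=True))
```

Nothing in that sketch is sophisticated, which is the argument: the gap wasn't capability, it was that nobody had encoded the policy before the tooling outran it.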
That's governance lag. And it's structural, not incidental.
Amazon will now require senior engineers to review AI-assisted production changes from junior and mid-level staff. That's a reasonable first-line response, and it will probably reduce the incident rate in the short term. But it's worth being clear about what it is: a manual compensating control layered on top of a gap that shouldn't have existed. Senior engineers reviewing AI-assisted code is not a governance framework. It's friction applied after the process failed to catch a pattern it should have been measuring.
What Amazon just admitted, in its own internal language, is that it adopted AI tooling at scale without the standards to govern it. The briefing note didn't say the tools were bad. It said the practices weren't ready. That distinction matters because the tools aren't going anywhere. The velocity gains are real. The $2 billion in claimed cost savings and the 4.5x developer velocity multiplier are real. No organisation that has seen those numbers is going to walk them back.
Which means the only question worth asking is whether you've measured the governance foundations that need to hold up under that velocity — before the Sev 1s tell you they didn't.
None of this is specific to Amazon. Every organisation deploying AI-assisted development right now is running the same experiment with different blast radii and different tolerances for failure. Most of them haven't run the numbers on whether their review processes, approval gates, and deployment standards were designed for the pace their tooling now operates at.
The patterns here aren't new either. Knight Capital lost $440 million in 45 minutes in 2012 because automated trading executed faster than the controls designed to contain it. CrowdStrike brought down 8.5 million Windows machines in 2024 because a content update bypassed the validation gate that should have caught an out-of-bounds read in a kernel driver. The mechanism is different every time. The shape is always the same: automation outpaced the process meant to govern it.
Amazon is going to be fine. It has the engineering depth to close the gap, and the incident rate will come down once the new controls bed in. The question is whether other organisations watch this story and ask the right question.
Not “was it the AI's fault?” But: “Have we measured whether our processes are still fit for the tooling we're running?”
Those are different questions. The first one is about blame. The second one is about governance. And governance is the only one that prevents the next incident.