
AWS Outages Caused by AI Tools: What Really Happened [2025]

Amazon's own AI coding agents caused at least two major AWS outages in 2025. Here's what went wrong, why it matters, and how to prevent similar disasters.


Introduction

Something uncomfortable happened at Amazon in late 2025. Twice.

Amazon's own artificial intelligence tools—the ones designed to make developers more productive—took down significant portions of AWS services. Not competitors' AI tools. Not some rogue third-party chatbot. Amazon's own AI agents, running inside their infrastructure, caused outages that lasted 13 hours and 15 hours respectively.

Here's what makes this particularly awkward: Amazon is actively pushing AI adoption among its 200,000+ engineers. The company has set an ambitious target of 80% developer adoption of AI tools based on once-per-week usage. Yet the outages reveal a dangerous gap between speed and safety. The incidents expose a critical tension in the modern tech industry: how do you scale AI-powered automation without creating catastrophic failure points?

This isn't just AWS gossip. It's a window into how even the world's most sophisticated companies are struggling with AI governance. The Financial Times obtained internal reports showing that Amazon's own engineers realized these outages were "entirely foreseeable." The company attributed the failures to "user error, not AI error," but that framing misses the real problem. When your AI tools are given the same permissions as senior engineers—and their outputs aren't reviewed like human decisions would be—you've created a system primed for disaster.

Let's dig into what actually happened, why it matters for every company deploying AI, and what the AWS outages teach us about the real risks of autonomous systems in production environments.

TL;DR

  • Two major outages in 2025: A 13-hour December incident and a 15-hour October incident, both involving AWS's Kiro AI coding agent
  • Root cause wasn't the AI itself: Amazon blamed "user error"—specifically misconfigured access controls that gave AI too much autonomy
  • Approval process failure: AI outputs weren't subjected to the same review and verification that human engineering decisions receive
  • Permission escalation: AI agents were given identical permissions to senior engineers but without matching oversight
  • 80% adoption target at risk: Amazon wants 80% of developers using AI weekly, but these outages reveal governance gaps that could widen as adoption increases
  • Bottom line: The problem isn't rogue AI. It's human operators deploying AI without adequate safeguards, approval workflows, or permission boundaries.

The December 2025 Outage: 13 Hours of Missing Infrastructure

In mid-December 2025, something started going wrong inside AWS's infrastructure. Not a DDoS attack. Not a hardware failure. Not a cascading error from a third-party service.

Amazon's own Kiro AI coding agent—a tool designed to help engineers resolve infrastructure issues—decided to delete and recreate an entire environment. This wasn't a bug. This was the AI doing exactly what it was trained to do, but in a context where it shouldn't have had the authority to act alone.

The outage lasted 13 hours. During that time, portions of AWS services became unavailable. Customers' systems went down. Revenue disappeared. The incident wasn't disclosed publicly by Amazon, but the Financial Times obtained internal documents from four unnamed sources familiar with the matter.

What's particularly revealing is how Amazon's internal report framed the incident. The company described it as an "extremely limited event." But even calling it "limited" hides the scarier truth: if the AI had been given access to more critical systems, the damage would have been far worse.

The Permission Escalation Problem

Here's the mechanism that allowed this to happen: the engineers running the AI tool gave it access controls that were too broad. The AI wasn't constrained to read-only operations or diagnostic mode. Instead, it had full permissions to modify and recreate infrastructure components.

Think about how this would work in a human context. You wouldn't give a junior engineer the unilateral authority to delete and recreate production infrastructure without approval. Yet that's essentially what happened here. The AI agent was granted permissions typically reserved for senior architects—but without the approval process that would normally gate such actions.

One of the Financial Times' sources described it plainly: "The engineers let the AI [agent] resolve an issue without intervention." This is the core problem. The system wasn't designed to require a human approval gate. The engineers treated the AI like it was just another tool in their toolkit, forgetting that autonomous tools need different governance than advisory tools.
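
As a rough illustration of what a permission boundary could look like, consider the minimal sketch below. All names are hypothetical, not Amazon's actual tooling: the agent's requested action is checked against an explicit allowlist, gated actions require a human sign-off, and destructive operations are never executable autonomously.

```python
from dataclasses import dataclass

# Hypothetical action categories: read-only work runs freely, mutations need
# a human sign-off, and anything else (delete/recreate) is denied outright.
AUTONOMOUS_ACTIONS = {"read_logs", "describe_environment", "run_diagnostics"}
GATED_ACTIONS = {"restart_service", "modify_config"}

@dataclass
class AgentRequest:
    action: str
    target: str

def authorize(request: AgentRequest, human_approved: bool = False) -> bool:
    """Return True only if the agent may execute this request right now."""
    if request.action in AUTONOMOUS_ACTIONS:
        return True                      # diagnostic work: no gate needed
    if request.action in GATED_ACTIONS:
        return human_approved            # mutations require explicit sign-off
    return False                         # destructive actions: never autonomous

# A December-style request is rejected regardless of how it is phrased:
print(authorize(AgentRequest("delete_environment", "prod-env")))              # False
print(authorize(AgentRequest("restart_service", "checkout-api")))             # False
print(authorize(AgentRequest("restart_service", "checkout-api"), True))       # True
```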

Why This Happened

Amazon was likely optimizing for speed. The whole value proposition of AI coding agents is that they work faster than humans. They identify issues, propose solutions, and implement fixes—all without waiting for human review. That speed advantage evaporates if every AI action requires manual approval.

But there's a reason human engineers implement changes through approval workflows. Those processes catch mistakes, verify assumptions, and create accountability. They're friction—intentional friction designed to prevent catastrophe.

The December incident shows what happens when you remove that friction without replacing it with something equally robust.

The October 2025 Outage: 15 Hours and Broader Impact

If December was a cautionary tale, October was a wake-up call with wider ripple effects.

In October 2025, another outage hit AWS. This one lasted 15 hours. More importantly, it affected "public apps and websites," meaning customers' production systems went down, not just internal infrastructure.

Again, the culprit was misconfigured permissions. Again, the AI tool was given the same level of access as senior engineers. And again, the outputs weren't subjected to the approval rigor that would normally apply to human decision-making.

What distinguishes the October incident from December is scope. A 13-hour internal infrastructure outage is serious. A 15-hour public outage affecting customer-facing services is a different category of problem. This is the difference between an internal incident and an incident that shows up in customer status pages and news headlines.

The Approval Workflow Gap

One of the Financial Times sources explained the mechanism: the AI tools were "given the same permissions as human workers and its output not given the same approval as would usually be the case with human workers."

This is the real failure. Not that the AI made a mistake, but that the system allowed AI outputs to execute without the verification step. Imagine if your deployment pipeline let code ship without any review—just because it was generated by an AI instead of a human. You'd have catastrophes every week.

Yet that's what happened at AWS. The company created a system where AI-generated changes could propagate into production without the gatekeeping that human changes require.

Why Customers Cared

Internal infrastructure outages are embarrassing. Customer-facing outages are existential. When AWS goes down, companies depending on AWS go down. If you're running a SaaS platform on AWS, a 15-hour outage isn't a minor inconvenience. It's revenue loss, customer churn, reputation damage, and potential legal liability.

The October incident showed that this wasn't a one-off. The pattern was repeating. And the pattern pointed to a systemic governance problem, not a one-time error.

Amazon's Kiro AI Agent: What It Does and Why It's Dangerous

Kiro isn't some experimental prototype or research project. It's a production AI tool that Amazon is actively deploying to help engineers resolve infrastructure issues.

The tool is designed to analyze problems, propose solutions, and in some cases, implement fixes autonomously. For routine, low-risk issues—like restarting a service or adjusting configuration parameters—this autonomy makes sense. It speeds up resolution time and reduces manual effort.

But infrastructure work has a risk gradient. Some changes are low-risk (scaling up a database replica). Some are medium-risk (modifying routing rules). Some are catastrophic-risk (deleting and recreating environments). The Kiro agent, as configured in the outage incidents, wasn't respecting that gradient.

The Attraction of AI Agents for Operations

From Amazon's perspective, Kiro makes sense. Infrastructure engineering is increasingly complex. Modern cloud architectures have thousands of moving parts. Detecting issues and recommending fixes requires pattern recognition across massive datasets. AI is genuinely good at this.

Moreover, AI agents can work 24/7. They don't get tired. They don't context-switch. They don't take vacations. If you have a production issue at 3 AM on Christmas morning, an AI agent will start investigating immediately, while human engineers are asleep.

That's compelling value. It explains why Amazon is targeting 80% adoption.

The Gap Between Recommendation and Execution

Here's where the design choice becomes dangerous: Kiro wasn't just recommending fixes. It was executing them. There's a massive difference.

If Kiro said, "I think the problem is misconfigured memory allocation," and a human engineer reviewed that analysis and approved the fix, you have a system that combines AI's analytical strength with human judgment. If Kiro analyzes the problem AND implements the fix without human gate-keeping, you have a system where an AI's errors directly cause production incidents.

The outages happened because Kiro wasn't constrained to advisory mode. It had execution authority.
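
Here's one way that separation can be expressed, a minimal sketch with hypothetical names rather than Kiro's real interface: the agent returns a proposed fix, and nothing executes until an approval callback, the human gate, returns true.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProposedFix:
    summary: str
    commands: list[str]   # what the agent wants to run, shown to the reviewer

def agent_analyze(issue: str) -> ProposedFix:
    # Stand-in for the AI's analysis step.
    return ProposedFix(
        summary=f"Raise memory limit to resolve: {issue}",
        commands=["set-memory --service api --limit 4Gi"],
    )

def apply_fix(fix: ProposedFix, approve: Callable[[ProposedFix], bool]) -> bool:
    """Execute only if the approval callback (the human gate) returns True."""
    if not approve(fix):
        print("Rejected; nothing was changed.")
        return False
    for cmd in fix.commands:
        print(f"executing: {cmd}")   # a real system would call infra APIs here
    return True

fix = agent_analyze("OOM kills on the api service")
# The reviewer sees exactly what will run before anything executes.
apply_fix(fix, approve=lambda f: input(f"Apply '{f.summary}'? [y/N] ") == "y")
```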

The "User Error, Not AI Error" Defense

Amazon's response to the outages was notably specific: "user error, not AI error."

Technically, Amazon is right. The AI didn't malfunction. The code did exactly what it was written to do. The engineers didn't configure it incorrectly by accident. They made deliberate choices to give it broad permissions and autonomy.

So technically, it's user error.

But this defense misses the point in a way that's actually more concerning. Because it suggests that Amazon doesn't think it's the company's responsibility to prevent users from configuring systems dangerously.

The Responsibility Problem

When you design a tool, you have responsibility for how users will use it. If users consistently misconfigure something, that's a design failure. Either the tool is too complex, the defaults are wrong, or the guardrails are insufficient.

Consider security: if a tool lets users set empty passwords by default, and users consistently deploy it with empty passwords, is that "user error"? Technically, yes. But the tool designer bears responsibility for creating a system where that error is possible.

Similarly, if Amazon designed Kiro in a way that makes it easy for engineers to grant it excessive permissions and autonomy, that's a design problem—regardless of how you label it.

The fact that the Financial Times sources described the outages as "entirely foreseeable" suggests that this wasn't a surprise. People inside Amazon likely knew that this configuration was risky. Yet the tool was deployed anyway, with the ability to cause this damage.

The Incentive Misalignment

Here's the uncomfortable truth: from Amazon's perspective, aggressive AI deployment creates short-term wins. Developers move faster. Issues get resolved quicker. Productivity metrics improve. These are the metrics that get measured and celebrated.

Outage risk is real, but outages are rare events. A system that works well 99% of the time and causes an outage 1% of the time is still "working well" most of the time. From a utilitarian perspective, if the AI resolves 100 issues quickly and causes one outage, the net is positive—assuming the outage damage is bounded.

But that calculus changes when the outage damage is unbounded—when it affects customers, revenue, reputation, and regulatory compliance.

AWS's "Numerous Safeguards" and Why They Weren't Sufficient

Following the December incident, Amazon wrote that it had "implemented numerous safeguards." The company didn't specify what those safeguards are, but we can infer some of them.

Likely safeguards include:

  1. Permission boundaries: Limiting what Kiro can access, even if engineers request broader access
  2. Approval workflows: Requiring human sign-off for infrastructure changes above a certain risk threshold
  3. Rate limiting: Preventing Kiro from making too many changes too quickly
  4. Audit logging: Recording every action Kiro takes so incidents can be traced
  5. Rollback capabilities: Allowing rapid reversal if Kiro's changes cause problems
  6. Monitoring and alerting: Detecting when Kiro's changes are causing issues in real-time
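
To make two of those concrete, rate limiting and audit logging, here's a minimal sketch. The names are hypothetical and inferred from the list above, not Amazon's actual implementation: every agent action passes through a rate cap and is written to an audit record before it runs.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("agent_audit")

class RateLimitedExecutor:
    """Wraps every agent action with a rate cap and an audit record."""

    def __init__(self, max_actions_per_hour: int = 10):
        self.max_actions = max_actions_per_hour
        self.timestamps: list[float] = []

    def execute(self, action: str, params: dict) -> bool:
        now = time.time()
        # Keep only the past hour of actions, then enforce the cap.
        self.timestamps = [t for t in self.timestamps if now - t < 3600]
        if len(self.timestamps) >= self.max_actions:
            audit_log.warning("rate limit hit; blocked action: %s", action)
            return False
        self.timestamps.append(now)
        # Record the action before it runs so incidents can be traced later.
        audit_log.info(json.dumps({"ts": now, "action": action, "params": params}))
        # ... the real infrastructure call would go here ...
        return True

executor = RateLimitedExecutor(max_actions_per_hour=5)
executor.execute("restart_service", {"service": "checkout-api"})
```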

These are all reasonable safeguards. But notice what they have in common: they're reactive or constraining. They're designed to limit damage or detect problems after the fact. They don't fundamentally address the core issue: giving an AI tool autonomous execution authority over production infrastructure.

Why Safeguards Aren't Enough

Safeguards work by reducing the blast radius of failures. They don't prevent failures. A 13-hour outage with safeguards in place is still a catastrophic failure.

Moreover, safeguards have a cost. Approval workflows slow things down. Permission boundaries limit the AI's usefulness. Rate limiting prevents the AI from operating at its full capability. Every safeguard trades speed for safety.

As AI adoption increases (remember, Amazon is targeting 80% of developers using AI weekly), the economics of those tradeoffs change. At some point, the friction becomes unacceptable, and the safeguards get loosened. This is the classic path to safety failures in complex systems.

Think about airline safety. The industry had catastrophic crashes before developing the safety culture and procedures that now make flying incredibly safe. But that safety culture exists because the industry learned from disasters.

Amazon is learning from these outages. The question is whether the lessons will stick as adoption scales.

The 80% Adoption Target: Scaling the Risk

Amazon's goal is ambitious: 80% of developers using AI tools (specifically, Kiro and similar agents) based on once-per-week usage.

To put that in perspective, Amazon has roughly 200,000 engineers. Eighty percent would be 160,000 developers. Even if each developer uses AI tools only once a week, that's 160,000 AI-initiated actions per week, or roughly 23,000 per day: about one every four seconds, around the clock.

At that scale, the probability of catastrophic failures doesn't decrease. It increases.

The Mathematics of Scale and Risk

Consider this formula for system failure probability:

P_{failure} = 1 - (1 - p_{incident})^n

where p_{incident} is the probability that a single AI-driven operation causes an incident and n is the number of operations.

If each AI-driven change has a 0.01% chance of causing an outage (1 in 10,000), then:

  • At current levels (roughly 1,000 AI operations per day):
    P_{failure} ≈ 1 - (1 - 0.0001)^{1000} ≈ 0.095
    (roughly a 10% chance of an outage per 1,000 operations)
  • At 80% adoption (23,000 operations per day):
    P_{failure} ≈ 1 - (1 - 0.0001)^{23000} ≈ 0.90
    (roughly a 90% chance of an outage per day)

Even if Amazon reduces the per-operation failure rate to 0.001% (1 in 100,000), the daily outage probability at 23,000 operations per day is still roughly 20%.
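
As a quick sanity check, these back-of-the-envelope numbers can be reproduced directly from the formula above:

```python
def daily_outage_probability(p_incident: float, ops_per_day: int) -> float:
    """P_failure = 1 - (1 - p_incident)^n for n operations in a day."""
    return 1 - (1 - p_incident) ** ops_per_day

print(daily_outage_probability(0.0001, 1_000))    # ≈ 0.095 (current scale)
print(daily_outage_probability(0.0001, 23_000))   # ≈ 0.90  (80% adoption)
print(daily_outage_probability(0.00001, 23_000))  # ≈ 0.21  (10x safer per op)
```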

This is the curve that worries infrastructure engineers. As you scale autonomous systems, failure becomes not a rare edge case but a predictable routine event.

Why Amazon Is Pursuing This Anyway

There are legitimate reasons to push AI adoption despite the scaling risks:

  1. Competitive pressure: If your competitors are deploying AI faster, you fall behind
  2. Productivity gains: Even with occasional outages, AI makes engineers faster on average
  3. Talent retention: Developers want to work with cutting-edge tools
  4. Cost economics: AI can replace some human engineering work
  5. Learning opportunity: You have to deploy at scale to understand real failure modes

Point five is particularly honest. Amazon can't learn what breaks at 80% adoption without actually reaching 80% adoption. The October and December outages are part of that learning process.

The risk is that the learning happens slowly—one expensive outage at a time—rather than being architected in advance.

Why Misconfigured Access Controls Allowed This

Both the October and December outages trace back to the same root cause: misconfigured access controls. The AI agents were given permissions they shouldn't have had.

But this frames the problem incorrectly. The real question isn't how the misconfiguration happened. It's why the system allowed it to happen.

Permission Escalation Patterns

Misconfigured permissions aren't unique to AI tools. They're a chronic problem in infrastructure management. Here are the common patterns:

  1. Overly broad defaults: Systems default to maximum permissions, assuming users will restrict them appropriately
  2. Temporary elevation: Engineers grant elevated permissions temporarily for troubleshooting but forget to revoke them
  3. Role confusion: Systems grant permissions based on job titles, but the job title doesn't precisely match the actual work
  4. Automation exceptions: Automated tools get broader permissions than humans would, because automating permission requests is complex
  5. Crisis mode: During outages, engineers grant excessive permissions to resolve issues quickly, then forget to revoke them

The AWS incidents likely involved pattern four (automation exceptions) or pattern five (crisis mode). Engineers needed the AI to fix something, so they granted it the permissions required, without fully considering the long-term implications.

The Design Anti-Pattern

The underlying problem is a design anti-pattern: systems that make it easy to grant excessive permissions but hard to discover when you have excessive permissions.

Ideal systems do the opposite. They make it hard to grant excessive permissions (require explicit justification, multi-step approval) and easy to discover current permissions (one command tells you exactly what access you have).

AWS infrastructure tools generally don't follow this pattern. They've evolved over 20 years to optimize for flexibility and power, not for safety and auditability.

Now that AI agents are running on top of these tools, that design philosophy is causing problems.

The Approval Workflow Failure: Why Human Review Wasn't Required

Here's what should have prevented these outages: an approval workflow that required human review before the AI's changes could execute.

Small changes might pass through automatically. But significant infrastructure changes—like deleting and recreating environments—should require human approval.

They didn't. That's not a mystery. It's a choice.

Why Approval Workflows Got Skipped

  1. Performance: Approval workflows add latency. The AI detects an issue at 2 AM, but the change doesn't execute until the human reviewer wakes up and approves it. That violates the value proposition of 24/7 autonomous response.

  2. Scaling: If 160,000 developers are using AI agents weekly, and each needs to request human approval for changes, you'd need thousands of reviewers. That doesn't scale.

  3. Automation paradox: The whole point of the AI is to reduce human involvement. Requiring human approval for every action reduces the AI's autonomy, defeating the purpose.

  4. Precedent: Many infrastructure systems already allow automated changes without approval (automated scaling, self-healing systems, etc.). Adding approval gates for AI-driven changes feels like regression.

These are real constraints. But they're constraints Amazon chose to accept. The company prioritized speed and autonomy over safety and auditability.

The Precedent Problem

Look at how cloud infrastructure has evolved. Auto-scaling systems can scale up or down without approval. Load balancers can reroute traffic automatically. Health checks can trigger rollbacks automatically.

These systems were designed with approval gates in earlier eras, but as they proved reliable, the gates were removed. Why wait for approval to scale up when the metrics clearly justify it?

This pattern—automating, proving reliability, removing gates—made sense when automation was narrow and specific. Auto-scaling does one thing. If it breaks, it breaks in predictable ways. You can design safeguards around that one thing.

But AI agents are general-purpose. Kiro can identify and fix many different kinds of problems. It's operating in a much larger solution space. That larger space means more opportunities for unexpected behavior.

Yet Amazon applied the same logic: automate it, and trust that safeguards will catch problems.

What This Means for Companies Deploying AI Tools

The AWS outages are a teaching moment for the entire industry. Hundreds of companies are currently deploying AI agents in production. Most are going through similar permission and governance debates.

Here's what companies should learn:

Lesson One: Constrain Autonomous Execution Authority

Your AI tool doesn't need to be able to execute high-risk changes autonomously. It can analyze, recommend, and propose. A human should approve and execute.

Yes, this is slower. Speed was never the point. Reliability was the point. A system that's 10% slower but 99.9% reliable is better than a system that's 10% faster but occasionally causes catastrophic outages.

For tools like Runable, which automates document generation, presentation creation, and report creation, the stakes are different from infrastructure tools. A misconfigured report template won't take down your systems. But the principle remains: separate recommendation from execution, and keep humans responsible for critical decisions.

Lesson Two: Implement Progressive Authorization

Not all changes are equal. Some are low-risk. Some are catastrophic-risk. Your governance should reflect that.

  • Low-risk changes (configuration adjustments, minor updates): Can be automated with minimal oversight
  • Medium-risk changes (scaling decisions, traffic rerouting): Should require automated approval (condition-based gates) plus human monitoring
  • High-risk changes (deleting data, recreating infrastructure, modifying security rules): Should require explicit human approval before execution

Amazon's failure was treating all Kiro actions the same way, rather than implementing this risk gradient.
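
In code, that gradient can be made explicit. The sketch below is illustrative, not Amazon's actual policy engine; the action names and tiers are assumptions. Each action resolves to a risk tier, and the tier determines the approval requirement, with unknown actions defaulting to the strictest tier.

```python
from enum import Enum

class Risk(Enum):
    LOW = "low"        # configuration adjustments, minor updates
    MEDIUM = "medium"  # scaling decisions, traffic rerouting
    HIGH = "high"      # deleting data, recreating infrastructure, security rules

# A very small classifier mapping action names to risk tiers.
RISK_BY_ACTION = {
    "adjust_config": Risk.LOW,
    "scale_replicas": Risk.MEDIUM,
    "reroute_traffic": Risk.MEDIUM,
    "delete_environment": Risk.HIGH,
    "modify_security_group": Risk.HIGH,
}

def required_approval(action: str) -> str:
    # Unknown actions default to the strictest tier.
    risk = RISK_BY_ACTION.get(action, Risk.HIGH)
    if risk is Risk.LOW:
        return "automated, minimal oversight"
    if risk is Risk.MEDIUM:
        return "condition-based gate plus human monitoring"
    return "explicit human approval before execution"

for action in ("adjust_config", "scale_replicas", "delete_environment", "unknown_action"):
    print(f"{action}: {required_approval(action)}")
```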

Lesson Three: Separate Staging from Production

Your AI tool should be able to test changes in staging environments without human approval. Only production changes should require gates.

This gives you the best of both worlds: AI can propose and test solutions quickly, humans can review the results before they reach customers.
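
A minimal sketch of that split, assuming a simple environment flag and hypothetical function names:

```python
def submit_change(change: str, environment: str, human_approved: bool = False) -> str:
    """Apply automatically in staging; hold production changes at a human gate."""
    if environment == "staging":
        return f"applied to staging: {change}"       # fast feedback, no gate
    if human_approved:
        return f"applied to production: {change}"
    return f"queued for human review: {change}"      # the default for production

print(submit_change("bump api memory limit", "staging"))
print(submit_change("bump api memory limit", "production"))
print(submit_change("bump api memory limit", "production", human_approved=True))
```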

Lesson Four: Audit Trails Are Non-Negotiable

Every AI action should be logged. Every decision should be traceable. Every change should be auditable.

When something goes wrong (and it will), you need to understand exactly what the AI did, when it did it, what permissions it had, and what triggered the action.

AWS infrastructure tools have audit capabilities, but the Financial Times reporting suggests these logs either weren't being actively monitored or the alerts didn't trigger quickly enough.

The Broader AI Governance Problem

The AWS outages are symptomatic of a larger problem: the AI industry is deploying powerful autonomous systems faster than governance structures can keep up.

Companies are racing to achieve "AI adoption" without establishing clear governance, responsibility boundaries, and failure recovery procedures. It's like building airports without establishing air traffic control systems.

Why Governance Lags Behind Deployment

  1. Governance is boring: Building new AI capabilities is exciting. Writing governance policies is not. Executives celebrate adoption metrics, not governance maturity.

  2. Competitive pressure: If your competitor deploys AI faster, you feel pressured to move faster too. Governance feels like a drag on velocity.

  3. Measurement challenge: How do you measure the value of governance? You can measure AI productivity easily. You can measure avoided outages only in hindsight.

  4. Uncertainty: Nobody knows what governance should look like for AI agents. Are you being too strict? Not strict enough? Without precedent, it's hard to know.

The Maturity Model

Mature organizations follow a pattern:

  1. Pilot phase: Deploy AI in low-risk contexts, learn from experience
  2. Scaling phase: Increase adoption while implementing lessons learned
  3. Operational phase: Run AI at scale with mature governance
  4. Optimization phase: Push boundaries while maintaining reliability

Amazon appears to have skipped from pilot to scaling without fully establishing operational phase governance. The October and December outages represent the scaling phase friction catching them.

Most other companies are in pilot phase right now. They have time to learn from AWS's experience before the scaling phase becomes critical.

How to Prevent Similar Outages

If you're deploying AI agents in your infrastructure, here's a practical checklist:

Before Deployment

  1. Map the risk landscape: What are the consequences if the AI makes a mistake? Categorize changes by risk level.

  2. Design permission boundaries: What's the minimum set of permissions the AI needs? Grant only that. Require explicit justification for any broader access.

  3. Plan the approval workflow: Which change categories require human approval? Document the criteria clearly.

  4. Test failure scenarios: What happens if the AI deletes something it shouldn't? Can you recover? How long does recovery take?

  5. Establish rollback procedures: Every AI action should be reversible. Establish how to detect when a rollback is needed and how to execute it.
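
For the rollback point above, one simple pattern is to record an inverse operation alongside every change before it runs, so a rollback is always one call away. A minimal sketch with hypothetical names:

```python
from dataclasses import dataclass, field

@dataclass
class Change:
    description: str
    apply: str       # command the agent will run
    revert: str      # pre-computed inverse command

@dataclass
class ChangeJournal:
    history: list[Change] = field(default_factory=list)

    def execute(self, change: Change) -> None:
        print(f"applying: {change.apply}")
        self.history.append(change)              # record only after it runs

    def rollback_last(self) -> None:
        change = self.history.pop()
        print(f"reverting: {change.revert}")

journal = ChangeJournal()
journal.execute(Change(
    description="scale api replicas 3 -> 6",
    apply="scale --service api --replicas 6",
    revert="scale --service api --replicas 3",
))
journal.rollback_last()   # undoes the most recent agent change
```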

During Deployment

  1. Start in staging: Let the AI run in non-production environments first. Learn its failure modes before it affects customers.

  2. Monitor obsessively: Log everything. Alert on anomalies. Have humans watching the AI closely in early phases.

  3. Expand gradually: Don't jump from 5% adoption to 80% overnight. Increase gradually, learning at each stage.

  4. Audit permissions regularly: Don't assume permissions are correct. Actively verify what the AI can access.

  5. Maintain the human loop: Even if the AI can technically operate autonomously, have humans verify its major actions initially.

Ongoing Operations

  1. Review outages forensically: When something goes wrong, understand exactly why. Don't let it slide as "user error."

  2. Adjust governance based on incidents: Each outage should teach you something about your governance. Implement those lessons.

  3. Monitor adoption vs. reliability: Are reliability metrics degrading as adoption increases? If so, you're moving too fast.

  4. Maintain escalation procedures: When the AI encounters situations it's not sure about, it should escalate to humans, not guess (see the sketch after this list).

  5. Regular drills: Periodically disable the AI and see if your team can still operate manually. If they can't, you're too dependent on automation.
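
For the escalation point, here's a minimal sketch of confidence-based escalation; the threshold and function names are illustrative, not drawn from any real tool:

```python
def handle_issue(diagnosis: str, confidence: float, threshold: float = 0.9) -> str:
    """Act autonomously only when the agent is confident; otherwise page a human."""
    if confidence >= threshold:
        return f"auto-remediate: {diagnosis}"
    return f"escalate to on-call: {diagnosis} (confidence={confidence:.2f})"

print(handle_issue("stale DNS cache on edge nodes", confidence=0.95))
print(handle_issue("possible data corruption in shard 7", confidence=0.55))
```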

The Bigger Picture: AI Safety at Scale

The AWS incidents matter not because AWS is unique, but because AWS is a leading indicator. If this is happening at the world's most sophisticated cloud provider, with the most rigorous infrastructure standards, it's going to happen everywhere else too.

Over the next 5 years, we're going to see many more AI-driven outages. Some will be internal to companies. Some will affect customers. Some will be catastrophic. These aren't anomalies. They're part of the cost of deploying autonomous systems at scale.

The question is whether companies learn from each incident and improve governance, or whether they treat each outage as a one-off and move on to the next adoption target.

Amazon's statement about "numerous safeguards" suggests the company is treating these incidents seriously. Whether those safeguards prove effective as adoption scales to 80% will be the real test.

The Silver Lining

Here's what's encouraging: the outages happened in 2025, early in the AI-at-scale era. The industry is catching these problems while deployments are still relatively small. Better to have a 13-hour outage affecting some AWS services than to have autonomously-driven AI causing systemic damage across the entire internet five years from now.

The companies that learn from AWS's experience and implement robust governance now will be much safer in the years ahead. The companies that ignore these lessons and pursue adoption metrics at the expense of reliability will face worse problems.

It's like the difference between aviation in the 1920s (frequent crashes because safety was optional) and aviation in the 2020s (catastrophic crashes are rare because safety is foundational). The transition happens when enough incidents force the industry to prioritize safety over speed.

Amazon's outages might be the incidents that force that transition in the AI industry.

What Amazon Says and What It Should Say

Amazon's official response has been measured. "User error, not AI error." "Numerous safeguards." "Limited event."

These statements are technically accurate but strategically incomplete. Here's what a more honest assessment would sound like:

"We deployed Kiro with permissions and autonomy that proved too broad. We prioritized developer velocity over explicit governance of AI actions. When that velocity met complex infrastructure, the results included two significant outages. We've learned that autonomy and scale are not compatible without explicit governance. We're redesigning our AI deployment model to require approval workflows for high-risk changes, implementing progressive authorization based on change risk, and expanding our monitoring and audit capabilities. We expect these safeguards will slow AI productivity in the short term. We believe the long-term reliability gains justify that tradeoff. We're also being transparent about these lessons rather than burying them, because other companies are making the same choices. Learning from our experience will help the entire industry move to scale AI safely."

Will Amazon say something like this? Probably not publicly. But internally, that's probably the conversation happening in the retrospectives of these outages.

The Path Forward: Building AI Governance

So what should AI governance look like as companies scale adoption?

Framework: The RACI Model for AI Actions

Responsible: Who takes the action? (AI agent, human engineer, automated system)

Accountable: Who bears responsibility for outcomes? (Should almost always be a human)

Consulted: Who should be asked for input before the action? (Subject matter experts, security teams)

Informed: Who should be notified after the action? (Ops teams, compliance, leadership)

For high-risk infrastructure changes, it should look like this:

  • Responsible: AI (analysis) + Human (execution)
  • Accountable: Human engineer or team lead
  • Consulted: Relevant domain experts, security
  • Informed: All stakeholders, audit logs

For low-risk changes:

  • Responsible: AI (full execution)
  • Accountable: Ops team that owns the system
  • Consulted: N/A
  • Informed: Monitoring systems, audit logs
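
One way to make those assignments operational is to encode them as data that every agent action must resolve against. The sketch below is illustrative; the role names are assumptions, not any real AWS or Amazon construct.

```python
# RACI assignments per risk tier, encoded as plain data.
RACI = {
    "high_risk": {
        "responsible": ["ai_agent:analysis", "human_engineer:execution"],
        "accountable": "human_engineer_or_team_lead",
        "consulted": ["domain_experts", "security"],
        "informed": ["all_stakeholders", "audit_log"],
    },
    "low_risk": {
        "responsible": ["ai_agent:full_execution"],
        "accountable": "owning_ops_team",
        "consulted": [],
        "informed": ["monitoring", "audit_log"],
    },
}

def accountability_for(risk_tier: str) -> str:
    """Every action must resolve to a human (or human team) who is accountable."""
    return RACI[risk_tier]["accountable"]

print(accountability_for("high_risk"))   # human_engineer_or_team_lead
print(accountability_for("low_risk"))    # owning_ops_team
```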

Governance Maturity Levels

Level 1 (Uncontrolled): AI operates with broad permissions, minimal oversight. (This is where AWS was before the outages.)

Level 2 (Reactive): AI operates autonomously, but incidents trigger retrospectives and policy adjustments.

Level 3 (Procedural): AI operations are governed by written policies, tiered by risk, with defined approval workflows.

Level 4 (Managed): Governance policies are actively monitored and adjusted based on metrics. Compliance is measurable.

Level 5 (Optimized): Governance itself is continuously improved based on AI performance data and incident analysis.

Most companies deploying AI are at Level 1. AWS moved from Level 1 to Level 2 after the outages. Getting to Level 3 requires significant process work.

Getting to Level 4 or 5 requires building governance into the system architecture from the start, not bolting it on after incidents.

Lessons for Your Organization

Whether you're running AWS, deploying AI agents internally, or just building AI-powered tools like Runable, the AWS incidents offer concrete lessons:

For Infrastructure Teams

Start with the assumption that your AI tools will make mistakes. Design systems that can handle those mistakes gracefully. That means:

  • Separate approval from execution
  • Implement circuit breakers that stop AI actions if things start breaking (see the sketch after this list)
  • Maintain manual override capabilities
  • Test failure scenarios regularly
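
For the circuit-breaker item above, here's a minimal sketch of the idea; thresholds and names are illustrative. If recent agent-initiated changes correlate with alerts, autonomous execution stops and humans take over.

```python
class AgentCircuitBreaker:
    """Halts autonomous execution after consecutive changes that trigger alerts."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.recent_failures = 0
        self.open = False                  # open circuit = agent is halted

    def record_result(self, caused_alert: bool) -> None:
        if caused_alert:
            self.recent_failures += 1
            if self.recent_failures >= self.max_failures:
                self.open = True           # trip: humans take over from here
        else:
            self.recent_failures = 0       # a healthy action resets the counter

    def allow_action(self) -> bool:
        return not self.open

breaker = AgentCircuitBreaker(max_failures=2)
for caused_alert in (True, True):          # two consecutive bad changes
    breaker.record_result(caused_alert)
print(breaker.allow_action())              # False: the agent is halted
```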

For Development Teams

If you're building AI tools that will be deployed in production, build governance into the product from the start. Don't assume users will configure it safely. Design so that:

  • Safe defaults are the default
  • Dangerous actions require explicit confirmation
  • Audit trails are automatic and comprehensive
  • Rate limiting and circuit breakers are built-in

For Leadership

AI adoption metrics (% of developers using AI weekly) are interesting but incomplete. Track reliability metrics too:

  • Incident frequency
  • Time to detection
  • Time to resolution
  • Revenue impact
  • Customer-facing impact

If reliability metrics are degrading as adoption increases, you're moving too fast.

The Realistic Future: More Outages, Better Governance

Here's my honest prediction: we're going to see more AI-driven outages over the next few years. Not because AI tools are bad, but because we're deploying them at scale without mature governance.

Each outage will be painful. Each will cost companies money and reputation. But each will also drive the industry toward better governance practices.

By 2030, most mature organizations will have robust AI governance frameworks. By 2035, AI-driven systems will be as reliable as human-managed systems are today. By 2040, we'll forget that there was ever a period when AI systems operated without explicit governance.

But we're not at that point yet. We're in the learning phase. AWS's outages are part of that learning.

The companies that learn quickly and adapt their governance will be ahead of the curve. The companies that ignore these lessons will keep having outages until the pain forces change.

Practical Takeaway: Start Now

If your organization is deploying AI tools, don't wait for an outage to think about governance. Start now:

  1. Map your AI tools: What AI systems are you deploying? What can they access? What can they change?

  2. Establish a baseline: What are your current approval workflows? Where do AI tools bypass those workflows?

  3. Identify risk categories: Which AI actions are low-risk? Which are catastrophic-risk?

  4. Design governance: For each risk category, what approval process makes sense?

  5. Implement gradually: Start with the highest-risk categories. Implement governance progressively.

  6. Monitor continuously: Set up alerts for anomalous AI behavior. Don't wait for incidents.

This takes effort. It might slow down AI adoption. But that's the point. Speed without safety is reckless. Safety without speed is pointless. You need both.

The AWS outages show what happens when you prioritize speed without building in safety. Learn from Amazon's experience. Don't repeat their mistakes.

FAQ

What exactly happened in the December 2025 AWS outage?

Amazon's Kiro AI coding agent was given access to delete and recreate infrastructure components. Without proper approval workflows or permission boundaries, it executed changes that shouldn't have required autonomous authority, causing a 13-hour outage. The Financial Times reported that internal sources said the outage was "entirely foreseeable," suggesting Amazon knew the configuration was risky but deployed it anyway to accelerate AI adoption.

Was the AI broken or was it user error?

Technically, Amazon is correct that it was "user error." The engineers configured the system to give the AI too much authority. But that framing misses the real lesson: the system was designed in a way that made this misconfiguration possible and likely. Good system design makes it hard to do dangerous things, even if a user tries.

How common are AI-driven outages?

Amazon's incidents are the first major publicly reported examples we know of, but they probably won't be the last. As companies scale AI adoption, outages become statistically inevitable. The current rarity reflects the fact that AI agents are still in early deployment phases. As adoption scales, expect more incidents.

Can this happen to other cloud providers?

Absolutely. Google, Microsoft, and other cloud providers are deploying similar AI tools. They're probably having similar governance debates internally. AWS is just the first one transparent enough (or forced enough) to acknowledge publicly that this happened.

What should I do if my organization uses AWS?

Your immediate risk from AWS-caused outages remains low (AWS is designed with redundancy and safeguards). Your risk from AI-driven incidents in your own infrastructure is higher if you're deploying AI agents without governance. Focus on hardening your own systems rather than worrying about AWS reliability.

Will Amazon's safeguards prevent future AI-driven outages?

Probably not completely. Safeguards reduce risk but don't eliminate it. As adoption scales to 80%, the probability of incidents increases mathematically. Better safeguards mean smaller incidents, not zero incidents. Managing AI at scale means accepting some level of incident risk while working to minimize it.

Should organizations stop deploying AI tools?

No. The solution isn't to stop using AI. It's to deploy AI with appropriate governance. Many organizations will benefit enormously from AI agents. They just need to be thoughtful about permission boundaries, approval workflows, and failure recovery—the things AWS underinvested in initially.

How does this affect tools like Runable?

Tools like Runable, which focus on document generation, presentation creation, and report automation, operate in a lower-risk domain than infrastructure tools. A misconfigured presentation template won't bring down your systems. However, the same principles apply: separate recommendation from execution, implement audit trails, and keep humans responsible for critical decisions. Even document and presentation tools benefit from governance frameworks that ensure quality output and traceability.

What's the long-term solution?

The long-term solution is AI governance frameworks becoming as mature as our current compliance frameworks. Just like financial institutions have detailed policies for who can approve transactions above certain thresholds, technology organizations will develop similar frameworks for AI actions. This will take 5-10 years, but it will become the industry standard.

Conclusion

Amazon's AWS outages in 2025 represent an inflection point for the AI industry. They're not the first AI-driven incidents, but they're the first involving the world's largest cloud provider, and they're the first to receive significant public scrutiny.

The incidents themselves are technically interesting: an AI agent with too much authority making changes that cascaded into production outages. But the bigger story is about governance. Companies are scaling AI adoption faster than they're building governance structures to manage that adoption.

That's not inherently wrong. Rapid innovation often outpaces formal governance. But it does increase risk.

The companies that take these lessons seriously now will be ahead of the curve. They'll implement progressive authorization, separate recommendation from execution, maintain comprehensive audit trails, and keep humans responsible for critical decisions. They'll move more slowly than their competitors in the short term, but they'll have more reliable systems in the long term.

The companies that ignore the lessons will eventually face their own outages. The pain of those incidents will force the governance changes that could have been implemented proactively.

Historically, safety frameworks mature after enough incidents. Aviation is phenomenally safe today not because pilots were always extremely careful, but because every crash became a learning opportunity. The same pattern is unfolding in AI.

Amazon's outages are the early crashes in the AI adoption phase. The industry is watching. The next moves—whether companies lean into governance or double down on speed—will determine how quickly we reach a mature, reliable state for AI-at-scale.

The choice is yours. Learn from AWS now, or learn from your own outages later. The lessons are the same. The timing is your decision.

Runable can save upto $1464 per year compared to the non-enterprise price of your apps.