AWS Major Outage Analysis - October 20, 2025

In-depth analysis of the 15-hour AWS outage (US-EAST-1, Oct 20, 2025) that crippled global services. We compare media reports with AWS's official statement, detailing the DNS failure cascade, recurring outage patterns, and essential multi-cloud disaster recovery strategies.

Introduction

Yesterday (October 20th) around 4 PM, I was tinkering with my blog setup using Hugo, AWS S3, and CloudFront when the ACM page suddenly threw an error. I went, “Ugh! This damn internet!” and started cursing my innocent ISP, haha. Turns out it was actually an AWS outage.

After that, I saw AWS trending on Google Trends and the Virginia region lighting up red on the status map. I thought, “Well, this must be a pretty big outage.”

This was AWS’s first major outage in about two years, and it was a big one: services around the world were disrupted for roughly 15 hours.

So, let me break down how this outage happened, what news outlets reported, what AWS officially said, and what we should learn from all this.

1. News Coverage

Korean Media Reports

Korean media outlets focused mainly on the scale of the damage.

When It Happened

  • October 20th, around 4 PM Korean time (about 3 AM in the US)
  • Started in the US-EAST-1 region in Northern Virginia
  • One of AWS’s largest data center regions

Affected Services

There were way more than I expected.

Global Services:

  • Snapchat, plus games like Fortnite and Roblox
  • AI search services like Perplexity
  • Financial services like Coinbase, Robinhood
  • Even Venmo, Signal, Duolingo

Korean services too:

  • Samsung Wallet, Samsung.com
  • PUBG (Krafton)

Even airlines:

  • Delta Air Lines and United Airlines websites and apps went down
  • Some airports had to process boarding manually

In the UK too:

  • Lloyds Bank, Bank of Scotland
  • Telecom companies like Vodafone, BT Group
  • Even the HMRC (tax authority) website

DownDetector (a site that tracks outages) received 50,000-65,000 reports. Over 1,000 websites and services worldwide were affected… The scale is massive.

US Media Reports

US media outlets focused more on the technical aspects.

Root Cause

The root cause was a DNS (Domain Name System) resolution issue. Simply put, the system that converts website addresses to IP addresses broke down. Specifically, there was a DNS resolution problem with the API endpoints of a database service called DynamoDB.

But that wasn’t the end. Even after the DNS issue was fixed, new EC2 instances couldn’t be launched: EC2’s internal subsystems depend on DynamoDB, so the problems cascaded. On top of that, Network Load Balancer health checks started failing, causing issues with services like Lambda and CloudWatch.
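To make “DNS resolution problem” concrete: every SDK call to DynamoDB first has to turn the regional endpoint hostname into an IP address. Here’s a minimal Python sketch of what a client sees when that step fails. The hostname follows AWS’s public regional naming; the error handling is illustrative, not a reproduction of the outage itself.

```python
import socket

import boto3
from botocore.exceptions import EndpointConnectionError

# The regional DynamoDB API endpoint that every client resolves before each request.
DYNAMODB_ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def can_resolve(hostname: str) -> bool:
    """Return True if DNS can translate the hostname into at least one IP address."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        return False

if not can_resolve(DYNAMODB_ENDPOINT):
    print("DNS resolution failed - every API call to this endpoint will fail too.")
else:
    # A normal SDK call; during the outage this step had no IP address to connect to.
    dynamodb = boto3.client("dynamodb", region_name="us-east-1")
    try:
        print(dynamodb.list_tables(Limit=1))
    except EndpointConnectionError as err:
        print(f"Could not reach the endpoint: {err}")
```

When the hostname stops resolving, it doesn’t matter that the DynamoDB servers themselves may be perfectly healthy: clients simply cannot find them, which is why a “small” DNS problem looks like a total service outage from the outside.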

Recovery Timeline

  • 3:11 AM (ET): AWS first acknowledged “there’s a problem”
  • 6:35 AM: DNS issue resolved
  • 12:28 PM: Mostly recovered
  • 6:00 PM: Almost back to normal
  • Total: about 15 hours

Experts said “it doesn’t look like a cyberattack” and expressed concern that “the modern internet relies too heavily on a few cloud companies.” They’re right.

2. AWS Official Statement

AWS also made an official announcement. They were surprisingly transparent with the information.

Official Timeline

October 19, 23:49 ~ October 20, 02:24 (PDT)

  • Error rates and latency spiked in the US-EAST-1 region
  • Amazon.com, Amazon subsidiaries, and even the AWS Support team were affected

October 20, 00:11

  • Confirmed increased error rates across multiple services
  • Identified as a DNS resolution issue with DynamoDB API endpoints

October 20, 00:26

  • Clearly identified the cause - DNS resolution failure of regional DynamoDB service endpoints

October 20, 02:24

  • DynamoDB DNS issue fully resolved
  • But some internal systems still had problems

October 20, 12:28

  • Many customers and services significantly recovered
  • Gradually easing the throttling on new EC2 instance launches

October 20, 15:01

  • All services returned to normal
  • Some services still processing backlogged messages

Summary of Root Causes

Here’s the breakdown:

  1. Primary Cause: DNS resolution failure for DynamoDB’s regional API endpoints
  2. Secondary Issue: Even after the DNS fix, new EC2 instances couldn’t launch because EC2’s internal subsystems depend on DynamoDB
  3. Tertiary Impact: Network Load Balancer health check failures cascaded into services like Lambda and CloudWatch (a dependency-aware health-check sketch follows this list)
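That third point deserves a closer look. If a load balancer’s health check treats any downstream error as “this instance is dead,” a dependency problem like the DynamoDB DNS failure can pull perfectly working instances out of rotation and amplify the outage. Below is a minimal Python/Flask sketch of one common mitigation: separating a shallow liveness check (what the load balancer watches) from a deep dependency check (what dashboards and alerts watch). The route names and the DynamoDB probe are my own illustrative assumptions, not AWS’s actual internals.

```python
from flask import Flask, jsonify
import boto3
from botocore.exceptions import BotoCoreError, ClientError

app = Flask(__name__)
dynamodb = boto3.client("dynamodb", region_name="us-east-1")

@app.route("/healthz")
def liveness():
    # Shallow check: "is this process alive?" The load balancer should key on this,
    # so a DynamoDB/DNS problem does not get every instance pulled out of rotation.
    return jsonify(status="ok"), 200

@app.route("/readyz")
def readiness():
    # Deep check: surfaces dependency trouble for dashboards and alerts,
    # but a failure here should degrade features, not take down the whole fleet.
    try:
        dynamodb.list_tables(Limit=1)
        return jsonify(status="ready"), 200
    except (BotoCoreError, ClientError) as err:
        return jsonify(status="degraded", reason=str(err)), 503

if __name__ == "__main__":
    app.run(port=8080)
```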

3. News Coverage vs AWS Official Statement

Where They Align

Time and Location

  • Both cite around 3 AM ET, US-EAST-1 region
  • Recovery time also matches at around 6 PM

Cause

  • Both news and AWS say it was a DNS problem
  • Both mention DynamoDB-related issues

Impact

  • Hundreds to thousands of services worldwide affected
  • Various sectors: aviation, finance, gaming, telecom

Differences

Technical Details

News outlets simplified it to “DNS problem” for general audiences, but AWS provided more specifics:

  • DNS resolution issues with DynamoDB API endpoints
  • EC2 internal subsystem dependencies on DynamoDB
  • Network Load Balancer health check cascading failures

Recovery Process Complexity

According to AWS’s statement, even after resolving the DNS issue, they deliberately throttled EC2 instance launches for gradual recovery. They proceeded cautiously to avoid triggering another failure. This wasn’t covered in the news.
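From the customer side, that throttling shows up as API errors (for EC2 launches, typically RequestLimitExceeded). The sensible client behavior during a recovery like this is to back off rather than retry in a tight loop. Here’s a minimal boto3 sketch of that idea; the AMI ID and instance type are placeholders.

```python
import random
import time

import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

def launch_with_backoff(max_attempts: int = 6):
    """Try to launch an instance, backing off exponentially when AWS throttles us."""
    for attempt in range(max_attempts):
        try:
            return ec2.run_instances(
                ImageId="ami-xxxxxxxxxxxxxxxxx",  # placeholder AMI ID
                InstanceType="t3.micro",
                MinCount=1,
                MaxCount=1,
            )
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if code not in ("RequestLimitExceeded", "InsufficientInstanceCapacity"):
                raise  # not a throttling/capacity problem, so surface it
            # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
            delay = (2 ** attempt) + random.uniform(0, 1)
            print(f"Throttled ({code}), retrying in {delay:.1f}s")
            time.sleep(delay)
    raise RuntimeError("Gave up after repeated throttling")
```

(botocore’s built-in “adaptive” retry mode gets you much of this for free; the manual loop just makes the behavior visible.)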

Internal Impact

News coverage only discussed external service disruptions, but AWS’s statement revealed that Amazon.com, its subsidiaries, and even the AWS Support team were affected. CNBC reported that Amazon warehouse workers couldn’t use internal systems and had to wait in break rooms.

Assessment

Overall, AWS was pretty transparent with information. But a few things were disappointing:

  1. Why it happened: No explanation of the root cause of the DNS problem (software update error? Configuration mistake?)
  2. Compensation: No mention of compensation policy, which has been an issue before
  3. Prevention: Missing specific measures for preventing recurrence

4. Has This Happened Before?

Yes. This isn’t AWS’s first major outage.

Major Outage History

December 7, 2021 - US-EAST-1 Outage

  • The most serious major outage before this one
  • Lasted over 5 hours
  • Airline reservations, car dealerships, payment apps, and video streaming all went down
  • Root cause was congestion on AWS’s internal network (triggered by an automated scaling activity); Kinesis Data Streams was among the many services knocked out

June 13, 2023 - US-EAST-1 Outage

  • Websites offline for several hours
  • Affected The Boston Globe, NYC MTA, Associated Press
  • Issues with Lambda, EventBridge, SQS, CloudWatch

Others

  • 2020: Multiple outages (including the November Kinesis Data Streams outage)
  • November 22, 2018: Korea’s Seoul region went down (also a DNS issue!)
  • February 2017: The major S3 outage in US-EAST-1
  • September 2015: DynamoDB outage

Pattern Analysis

This October 20, 2025 Outage

  • About 2 years and 4 months since the June 2023 outage
  • About 3 years and 10 months since the December 2021 major outage

Patterns Found:

  1. US-EAST-1 region keeps being at the center of problems
  2. DNS-related issues repeat (2018 Korea, 2025 US)
  3. Major outages occur roughly every 2-3 years
  4. DynamoDB repeatedly problematic (2015, 2025)

5. How Was AWS’s Response?

What They Did Well

Quick Initial Response

  • Official acknowledgment and announcement within about 22 minutes of the problem starting (23:49 onset, 00:11 announcement)
  • Root cause pinpointed by 00:26, roughly 37 minutes after onset

Transparent Communication

  • Real-time updates via Health Dashboard
  • Disclosed technical details
  • Provided specific timeline

Careful Recovery

  • Stable recovery through EC2 throttling after DNS resolution
  • Phased approach to prevent additional failures

Overall, it looks like an open, well-handled response.

Room for Improvement

Repeated Issues

  • DNS problems also occurred in Korea in 2018
  • Same problem again after 7 years
  • DynamoDB too (2015, 2025)

Single Region Dependency

  • US-EAST-1 alone went down and the whole world stopped
  • Global services still overly dependent on single regions

Compensation Policy

  • Past cases where Korean companies didn’t receive proper compensation due to unfair contracts
  • Still unclear what they’ll do this time

Lack of Preventive Measures

  • Major outages repeat every 2-3 years
  • Same US-EAST-1 keeps having issues
  • Insufficient testing perhaps…

Other Clouds Have Similar Issues

It’s not just AWS:

  • Microsoft Azure: Teams, Outlook, Microsoft 365 went down in January 2023
  • Microsoft 365: October 2025 outage (Google used this for marketing lol)
  • Google Cloud: June 2025 extended outage hit OpenAI, Shopify
  • CrowdStrike: a faulty software update in July 2024 caused an estimated $5.4 billion in losses for Fortune 500 companies

Could be a structural issue for the entire cloud industry, or maybe this is the best we can do for now.

6. What Should We Learn?

Risks of Cloud Dependency

The Trap of Centralization

AWS alone accounts for about 30-33% of the global cloud market; add Microsoft and Google, and a handful of companies carry most of the internet. It’s the classic Single Point of Failure problem.

How Can We Improve?

Multi-Cloud Strategy

Don’t depend on a single cloud provider. Distribute critical systems across multiple regions. At minimum, build failover systems for core functions.
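As a concrete (and deliberately simplified) example of “failover for core functions,” here’s a sketch of a client-side read that falls back to a second region when the primary is unreachable. It assumes the data is already replicated there, for example via a DynamoDB Global Table; the region list, table name, and key are placeholders.

```python
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

# Primary region first, then the fallback. Assumes a DynamoDB Global Table
# (or some other replication mechanism) keeps both regions in sync.
REGIONS = ["us-east-1", "us-west-2"]

def get_item_with_failover(table_name: str, key: dict):
    """Read from the primary region; fall back to the secondary if it is unreachable."""
    last_error = None
    for region in REGIONS:
        client = boto3.client(
            "dynamodb",
            region_name=region,
            config=Config(retries={"max_attempts": 2}, connect_timeout=3, read_timeout=3),
        )
        try:
            return client.get_item(TableName=table_name, Key=key)
        except (BotoCoreError, ClientError) as err:
            last_error = err
            print(f"{region} failed ({err}); trying the next region")
    raise last_error

# Example call (placeholder table and key):
# item = get_item_with_failover("orders", {"order_id": {"S": "1234"}})
```

The same pattern applies one level up: DNS-based failover with health-checked routing records can shift whole workloads to another region, or another provider, when the primary goes dark.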

Design Differently from the Start

  • Design for Failure from the outset
  • Failure isolation mechanisms like the Circuit Breaker pattern (see the sketch after this list)
  • Implement graceful degradation: services limit functionality instead of dying completely
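Here’s a minimal sketch of the circuit breaker idea combined with graceful degradation: after a few consecutive failures, stop hammering the broken dependency for a cool-down period and serve a fallback instead. The fetch_personalized_feed function and CACHED_DEFAULT_FEED are hypothetical placeholders for “a flaky dependency” and “a degraded but usable answer.”

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: stop calling a failing dependency and degrade gracefully."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, func, fallback):
        # If the circuit is open and the cool-down has not expired, skip the call entirely.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None  # cool-down over: allow a trial call (half-open)
            self.failures = 0
        try:
            result = func()
            self.failures = 0  # success closes the circuit again
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit
            return fallback()

# Usage idea: serve cached recommendations while the personalization service is down,
# instead of letting every page request hang on a dead dependency.
# breaker = CircuitBreaker()
# page_data = breaker.call(fetch_personalized_feed, fallback=lambda: CACHED_DEFAULT_FEED)
```

Production implementations add half-open trial limits, per-dependency breakers, and metrics, but even this much keeps one broken service from dragging everything else down with it.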

From a Societal Perspective?

Regulations and Policies

  • Need discussions on designating cloud services as critical infrastructure
  • Strengthen regulations on SLA and compensation policies
  • Manage cloud dependency of public services

Diversification

  • Consider open-source alternatives or self-hosting
  • Foster regional and national cloud ecosystems
  • Prevent excessive centralization

Conclusion

The October 20, 2025 AWS outage clearly showed our significant cloud dependency. For about 15 hours, millions of people worldwide couldn’t use everyday services, and companies suffered massive losses.

AWS was reasonably transparent about the problem and responded quickly. But with major outages repeating every 2-3 years, and the same US-EAST-1 region at the center again and again, structural improvements seem necessary. It’s also concerning that similar failures keep recurring in core services like DNS and DynamoDB.

Remember this:

No cloud, including AWS, can be 100% perfect. Outages aren’t a matter of ‘if’ but ‘when’. So we all need to prepare for cloud outages.

The convenience and scalability of cloud are genuinely great and undeniable. But risks clearly exist too. Smart cloud usage starts with recognizing and preparing for these risks.

No system is perfect. What matters is how well you prepare for outages, how quickly you recover when they happen, and what you learn to prevent recurrence.

This AWS outage reminded us all of this lesson once again.

AWS can fail too, so always be prepared ;-)
