AWS Major Outage Analysis - October 20, 2025

In-depth analysis of the 15-hour AWS outage (US-EAST-1, Oct 20, 2025) that crippled global services. We compare media reports with AWS's official statement, detailing the DNS failure cascade, recurring outage patterns, and essential multi-cloud disaster recovery strategies.

Introduction

Yesterday (October 20th) around 4 PM, I was tinkering with my blog setup using Hugo, AWS S3, and CloudFront when the ACM page suddenly threw an error. I went, “Ugh! This damn internet!” and started cursing my innocent ISP, haha. Turns out it was actually an AWS outage.

After that, I saw AWS trending on Google Trends and the Virginia region lighting up red on the status map. I thought, “Well, this must be a pretty big outage.”

This was AWS’s first major outage in about two years, and it was a big one: services around the world were disrupted for roughly 15 hours.

So, let me break down how this outage happened, what news outlets reported, what AWS officially said, and what we should learn from all this.

1. News Coverage

Korean Media Reports

Korean media outlets focused mainly on the scale of the damage.

When It Happened

  • October 20th, around 4 PM Korean time (about 3 AM in the US)
  • Started in the US-EAST-1 region in Northern Virginia
  • One of AWS’s largest data center regions

Affected Services

There were way more than I expected.

Global Services:

  • Snapchat, plus games like Fortnite and Roblox
  • AI search services like Perplexity
  • Financial services like Coinbase, Robinhood
  • Even Venmo, Signal, Duolingo

Korean services too:

  • Samsung Wallet, Samsung.com
  • PUBG (Krafton)

Even airlines:

  • Delta Air Lines and United Airlines websites and apps went down
  • Some airports had to process boarding manually

In the UK too:

  • Lloyds Bank, Bank of Scotland
  • Telecom companies like Vodafone, BT Group
  • Even the HMRC (tax authority) website

DownDetector (a site that tracks outages) received 50,000-65,000 reports. Over 1,000 websites and services worldwide were affected… The scale is massive.

US Media Reports

US media outlets focused more on the technical aspects.

Root Cause

The root cause was a DNS (Domain Name System) resolution issue. Simply put, the system that converts website addresses to IP addresses broke down. Specifically, there was a DNS resolution problem with the API endpoints of a database service called DynamoDB.

But that wasn’t the end. Even after the DNS issue was fixed, new EC2 instances couldn’t be launched: EC2’s internal subsystems depend on DynamoDB, so the problems cascaded. On top of that, Network Load Balancer health checks started failing, causing issues with services like Lambda and CloudWatch.
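To make “DNS resolution problem” concrete: every SDK call to DynamoDB first has to turn the regional endpoint hostname into an IP address. Here’s a minimal Python sketch of what a client sees when that step fails. The hostname follows AWS’s public regional naming; the error handling is illustrative, not a reproduction of the outage itself.

```python
import socket

import boto3
from botocore.exceptions import EndpointConnectionError

# The regional DynamoDB API endpoint that every client resolves before each request.
DYNAMODB_ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def can_resolve(hostname: str) -> bool:
    """Return True if DNS can translate the hostname into at least one IP address."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        return False

if not can_resolve(DYNAMODB_ENDPOINT):
    print("DNS resolution failed - every API call to this endpoint will fail too.")
else:
    # A normal SDK call; during the outage this step had no IP address to connect to.
    dynamodb = boto3.client("dynamodb", region_name="us-east-1")
    try:
        print(dynamodb.list_tables(Limit=1))
    except EndpointConnectionError as err:
        print(f"Could not reach the endpoint: {err}")
```

When the hostname stops resolving, it doesn’t matter that the DynamoDB servers themselves may be perfectly healthy: clients simply cannot find them, which is why a “small” DNS problem looks like a total service outage from the outside.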

Recovery Timeline

  • 3:11 AM (ET): AWS first acknowledged “there’s a problem”
  • 6:35 AM: DNS issue resolved
  • 12:28 PM: Mostly recovered
  • 6:00 PM: Almost back to normal
  • Total: about 15 hours

Experts said “it doesn’t look like a cyberattack” and expressed concern that “the modern internet relies too heavily on a few cloud companies.” They’re right.

2. AWS Official Statement

AWS also made an official announcement. They were surprisingly transparent with the information.

Official Timeline

October 19, 23:49 ~ October 20, 02:24 (PDT)

  • Error rates and latency spiked in the US-EAST-1 region
  • Amazon.com, Amazon subsidiaries, and even the AWS Support team were affected

October 20, 00:11

  • Confirmed increased error rates across multiple services
  • Identified as a DNS resolution issue with DynamoDB API endpoints

October 20, 00:26

  • Clearly identified the cause - DNS resolution failure of regional DynamoDB service endpoints

October 20, 02:24

  • DynamoDB DNS issue fully resolved
  • But some internal systems still had problems

October 20, 12:28

  • Many customers and services significantly recovered
  • Gradually easing the throttling on new EC2 instance launches

October 20, 15:01

  • All services returned to normal
  • Some services still processing backlogged messages

Summary of Root Causes

Here’s the breakdown:

  1. Primary Cause: DNS resolution failure for DynamoDB’s regional API endpoints
  2. Secondary Issue: Even after the DNS fix, new EC2 instances couldn’t launch because EC2’s internal subsystems depend on DynamoDB
  3. Tertiary Impact: Network Load Balancer health check failures cascaded into services like Lambda and CloudWatch (a dependency-aware health-check sketch follows this list)
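That third point deserves a closer look. If a load balancer’s health check treats any downstream error as “this instance is dead,” a dependency problem like the DynamoDB DNS failure can pull perfectly working instances out of rotation and amplify the outage. Below is a minimal Python/Flask sketch of one common mitigation: separating a shallow liveness check (what the load balancer watches) from a deep dependency check (what dashboards and alerts watch). The route names and the DynamoDB probe are my own illustrative assumptions, not AWS’s actual internals.

```python
from flask import Flask, jsonify
import boto3
from botocore.exceptions import BotoCoreError, ClientError

app = Flask(__name__)
dynamodb = boto3.client("dynamodb", region_name="us-east-1")

@app.route("/healthz")
def liveness():
    # Shallow check: "is this process alive?" The load balancer should key on this,
    # so a DynamoDB/DNS problem does not get every instance pulled out of rotation.
    return jsonify(status="ok"), 200

@app.route("/readyz")
def readiness():
    # Deep check: surfaces dependency trouble for dashboards and alerts,
    # but a failure here should degrade features, not take down the whole fleet.
    try:
        dynamodb.list_tables(Limit=1)
        return jsonify(status="ready"), 200
    except (BotoCoreError, ClientError) as err:
        return jsonify(status="degraded", reason=str(err)), 503

if __name__ == "__main__":
    app.run(port=8080)
```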

3. News Coverage vs AWS Official Statement

Where They Align

Time and Location

  • Both cite around 3 AM ET, US-EAST-1 region
  • Recovery time also matches at around 6 PM

Cause

  • Both news and AWS say it was a DNS problem
  • Both mention DynamoDB-related issues

Impact

  • Hundreds to thousands of services worldwide affected
  • Various sectors: aviation, finance, gaming, telecom

Differences

Technical Details

News outlets simplified it to “DNS problem” for general audiences, but AWS provided more specifics:

  • DNS resolution issues with DynamoDB API endpoints
  • EC2 internal subsystem dependencies on DynamoDB
  • Network Load Balancer health check cascading failures

Recovery Process Complexity

According to AWS’s statement, even after resolving the DNS issue, they deliberately throttled EC2 instance launches for gradual recovery. They proceeded cautiously to avoid triggering another failure. This wasn’t covered in the news.
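From the customer side, that throttling shows up as API errors (for EC2 launches, typically RequestLimitExceeded). The sensible client behavior during a recovery like this is to back off rather than retry in a tight loop. Here’s a minimal boto3 sketch of that idea; the AMI ID and instance type are placeholders.

```python
import random
import time

import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

def launch_with_backoff(max_attempts: int = 6):
    """Try to launch an instance, backing off exponentially when AWS throttles us."""
    for attempt in range(max_attempts):
        try:
            return ec2.run_instances(
                ImageId="ami-xxxxxxxxxxxxxxxxx",  # placeholder AMI ID
                InstanceType="t3.micro",
                MinCount=1,
                MaxCount=1,
            )
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if code not in ("RequestLimitExceeded", "InsufficientInstanceCapacity"):
                raise  # not a throttling/capacity problem, so surface it
            # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
            delay = (2 ** attempt) + random.uniform(0, 1)
            print(f"Throttled ({code}), retrying in {delay:.1f}s")
            time.sleep(delay)
    raise RuntimeError("Gave up after repeated throttling")
```

(botocore’s built-in “adaptive” retry mode gets you much of this for free; the manual loop just makes the behavior visible.)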

Internal Impact

News coverage only discussed external service disruptions, but AWS’s statement revealed that Amazon.com, its subsidiaries, and even the AWS Support team were affected. CNBC reported that Amazon warehouse workers couldn’t use internal systems and had to wait in break rooms.

Assessment

Overall, AWS was pretty transparent with information. But a few things were disappointing:

  1. Why it happened: No explanation of the root cause of the DNS problem (software update error? Configuration mistake?)
  2. Compensation: No mention of compensation policy, which has been an issue before
  3. Prevention: Missing specific measures for preventing recurrence

4. Has This Happened Before?

Yes. This isn’t AWS’s first major outage.

Major Outage History

December 7, 2021 - US-EAST-1 Outage

  • The most serious major outage before this one
  • Lasted over 5 hours
  • Airline reservations, car dealerships, payment apps, and video streaming all went down
  • Root cause was congestion on AWS’s internal network (triggered by an automated scaling activity); Kinesis Data Streams was among the many services knocked out

June 13, 2023 - US-EAST-1 Outage

  • Websites offline for several hours
  • Affected The Boston Globe, NYC MTA, Associated Press
  • Issues with Lambda, EventBridge, SQS, CloudWatch

Others

  • 2020: Multiple outages (including the November Kinesis Data Streams outage)
  • November 22, 2018: Korea’s Seoul region went down (also a DNS issue!)
  • February 2017: The major S3 outage in US-EAST-1
  • September 2015: DynamoDB outage

Pattern Analysis

This October 20, 2025 Outage

  • About 2 years and 4 months since the June 2023 outage
  • About 3 years and 10 months since the December 2021 major outage

Patterns Found:

  1. US-EAST-1 region keeps being at the center of problems
  2. DNS-related issues repeat (2018 Korea, 2025 US)
  3. Major outages occur roughly every 2-3 years
  4. DynamoDB repeatedly problematic (2015, 2025)

5. How Was AWS’s Response?

What They Did Well

Quick Initial Response

  • Official acknowledgment and announcement within about 22 minutes of the problem starting (23:49 onset, 00:11 announcement)
  • Root cause pinpointed by 00:26, roughly 37 minutes after onset

Transparent Communication

  • Real-time updates via Health Dashboard
  • Disclosed technical details
  • Provided specific timeline

Careful Recovery

  • Stable recovery through EC2 throttling after DNS resolution
  • Phased approach to prevent additional failures

Overall, it looks like an open, well-handled response.

Room for Improvement

Repeated Issues

  • DNS problems also occurred in Korea in 2018
  • Same problem again after 7 years
  • DynamoDB too (2015, 2025)

Single Region Dependency

  • US-EAST-1 alone went down and the whole world stopped
  • Global services still overly dependent on single regions

Compensation Policy

  • Past cases where Korean companies didn’t receive proper compensation due to unfair contracts
  • Still unclear what they’ll do this time

Lack of Preventive Measures

  • Major outages repeat every 2-3 years
  • Same US-EAST-1 keeps having issues
  • Insufficient testing perhaps…

Other Clouds Have Similar Issues

It’s not just AWS:

  • Microsoft Azure: Teams, Outlook, Microsoft 365 went down in January 2023
  • Microsoft 365: October 2025 outage (Google used this for marketing lol)
  • Google Cloud: June 2025 extended outage hit OpenAI, Shopify
  • CrowdStrike: a faulty software update in July 2024 caused an estimated $5.4 billion in losses for Fortune 500 companies

Could be a structural issue for the entire cloud industry, or maybe this is the best we can do for now.

6. What Should We Learn?

Risks of Cloud Dependency

The Trap of Centralization

AWS alone accounts for about 30-33% of the global cloud market; add Microsoft and Google, and a handful of companies carry most of the internet. It’s the classic Single Point of Failure problem.

How Can We Improve?

Multi-Cloud Strategy

Don’t depend on a single cloud provider. Distribute critical systems across multiple regions. At minimum, build failover systems for core functions.
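As a concrete (and deliberately simplified) example of “failover for core functions,” here’s a sketch of a client-side read that falls back to a second region when the primary is unreachable. It assumes the data is already replicated there, for example via a DynamoDB Global Table; the region list, table name, and key are placeholders.

```python
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

# Primary region first, then the fallback. Assumes a DynamoDB Global Table
# (or some other replication mechanism) keeps both regions in sync.
REGIONS = ["us-east-1", "us-west-2"]

def get_item_with_failover(table_name: str, key: dict):
    """Read from the primary region; fall back to the secondary if it is unreachable."""
    last_error = None
    for region in REGIONS:
        client = boto3.client(
            "dynamodb",
            region_name=region,
            config=Config(retries={"max_attempts": 2}, connect_timeout=3, read_timeout=3),
        )
        try:
            return client.get_item(TableName=table_name, Key=key)
        except (BotoCoreError, ClientError) as err:
            last_error = err
            print(f"{region} failed ({err}); trying the next region")
    raise last_error

# Example call (placeholder table and key):
# item = get_item_with_failover("orders", {"order_id": {"S": "1234"}})
```

The same pattern applies one level up: DNS-based failover with health-checked routing records can shift whole workloads to another region, or another provider, when the primary goes dark.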

Design Differently from the Start

  • Design for Failure from the outset
  • Failure isolation mechanisms like the Circuit Breaker pattern (see the sketch after this list)
  • Implement graceful degradation: services limit functionality instead of dying completely
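Here’s a minimal sketch of the circuit breaker idea combined with graceful degradation: after a few consecutive failures, stop hammering the broken dependency for a cool-down period and serve a fallback instead. The fetch_personalized_feed function and CACHED_DEFAULT_FEED are hypothetical placeholders for “a flaky dependency” and “a degraded but usable answer.”

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: stop calling a failing dependency and degrade gracefully."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, func, fallback):
        # If the circuit is open and the cool-down has not expired, skip the call entirely.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None  # cool-down over: allow a trial call (half-open)
            self.failures = 0
        try:
            result = func()
            self.failures = 0  # success closes the circuit again
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit
            return fallback()

# Usage idea: serve cached recommendations while the personalization service is down,
# instead of letting every page request hang on a dead dependency.
# breaker = CircuitBreaker()
# page_data = breaker.call(fetch_personalized_feed, fallback=lambda: CACHED_DEFAULT_FEED)
```

Production implementations add half-open trial limits, per-dependency breakers, and metrics, but even this much keeps one broken service from dragging everything else down with it.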

From a Societal Perspective?

Regulations and Policies

  • Need discussions on designating cloud services as critical infrastructure
  • Strengthen regulations on SLA and compensation policies
  • Manage cloud dependency of public services

Diversification

  • Consider open-source alternatives or self-hosting
  • Foster regional and national cloud ecosystems
  • Prevent excessive centralization

Conclusion

The October 20, 2025 AWS outage clearly showed our significant cloud dependency. For about 15 hours, millions of people worldwide couldn’t use everyday services, and companies suffered massive losses.

AWS was reasonably transparent about the problem and responded quickly. But with major outages repeating every 2-3 years, and the same US-EAST-1 region at the center again and again, structural improvements seem necessary. It’s also concerning that similar failures keep recurring in core services like DNS and DynamoDB.

Remember this:

No cloud, including AWS, can be 100% perfect. Outages aren’t a matter of ‘if’ but ‘when’. So we all need to prepare for cloud outages.

The convenience and scalability of cloud are genuinely great and undeniable. But risks clearly exist too. Smart cloud usage starts with recognizing and preparing for these risks.

No system is perfect. What matters is how well you prepare for outages, how quickly you recover when they happen, and what you learn to prevent recurrence.

This AWS outage reminded us all of this lesson once again.

AWS can fail too, so always be prepared ;-)
