Introduction
Yesterday (October 20th) around 4 PM, I was tinkering with my blog setup using Hugo, AWS S3, and CloudFront when the ACM (AWS Certificate Manager) page suddenly threw an error. I was like, “Ugh! This damn internet!” and started cursing my innocent ISP, haha. Turns out it was actually an AWS outage.
After that, I saw AWS trending on Google Trends and the Virginia region lighting up red on the status map. I thought, “Well, this must be a pretty big outage.”
AWS experienced another major outage, about two years after the last one, and this one was quite significant: services around the world were disrupted for roughly 15 hours.
So, let me break down how this outage happened, what news outlets reported, what AWS officially said, and what we should learn from all this.
1. News Coverage
Korean Media Reports
Korean media outlets focused mainly on the scale of the damage.
When It Happened
- October 20th, around 4 PM Korean time (about 3 AM in the US)
- Started in the US-EAST-1 region in Northern Virginia
- One of AWS’s largest data center regions
Affected Services
There were way more than I expected.
Global Services:
- Games like Snapchat, Fortnite, Roblox
- AI search services like Perplexity
- Financial services like Coinbase, Robinhood
- Even Venmo, Signal, Duolingo
Korean services too:
- Samsung Wallet, Samsung.com
- PUBG (Krafton)
Even airlines:
- Delta Air Lines and United Airlines websites and apps went down
- Some airports had to process boarding manually
In the UK too:
- Lloyds Bank, Bank of Scotland
- Telecom companies like Vodafone, BT Group
- Even the HMRC (tax authority) website
Downdetector (a site that tracks outage reports) received 50,000-65,000 reports. Over 1,000 websites and services worldwide were affected… The scale is massive.
US Media Reports
US media outlets focused more on the technical aspects.
Root Cause
The root cause was a DNS (Domain Name System) resolution issue. Simply put, the system that converts website addresses to IP addresses broke down. Specifically, there was a DNS resolution problem with the API endpoints of a database service called DynamoDB.
But that wasn’t the end. Even after the DNS issue was fixed, new EC2 instances couldn’t be launched: EC2’s internal subsystems depend on DynamoDB, so the problems cascaded. On top of that, Network Load Balancer health checks started failing, causing issues for services like Lambda and CloudWatch.
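To make “DNS resolution failure” a bit more concrete: clients simply couldn’t turn the endpoint hostname into an IP address anymore. Here’s a minimal sketch of the kind of check you could run yourself during an incident. The hostname is DynamoDB’s public regional endpoint; the polling loop and output are just illustrative, not any official AWS tooling.

```python
# Minimal sketch: check whether the regional DynamoDB endpoint resolves.
import socket
import time

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def resolves(hostname: str) -> bool:
    """Return True if DNS resolution succeeds for the given hostname."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        return False

if __name__ == "__main__":
    while True:
        status = "OK" if resolves(ENDPOINT) else "DNS RESOLUTION FAILED"
        print(f"{time.strftime('%H:%M:%S')} {ENDPOINT}: {status}")
        time.sleep(30)  # poll every 30 seconds
```

When resolution fails like this, every call to that endpoint fails before a single packet reaches DynamoDB, which is why so many dependent services fell over at once.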
Recovery Timeline
- 3:11 AM (ET): AWS first acknowledged “there’s a problem”
- 6:35 AM: DNS issue resolved
- 12:28 PM: Mostly recovered
- 6:00 PM: Almost back to normal
- Total: about 15 hours
Experts said “it doesn’t look like a cyberattack” and expressed concern that “the modern internet relies too heavily on a few cloud companies.” They’re right.
2. AWS Official Statement
AWS also made an official announcement. They were surprisingly transparent with the information.
Official Timeline
October 19, 23:49 ~ October 20, 02:24 (PDT)
- Error rates and latency spiked in the US-EAST-1 region
- Amazon.com, subsidiaries, and even AWS Support team were affected
October 20, 00:11
- Confirmed increased error rates across multiple services
- Identified as a DNS resolution issue with DynamoDB API endpoints
October 20, 00:26
- Root cause clearly identified: DNS resolution failure of the regional DynamoDB service endpoints
October 20, 02:24
- DynamoDB DNS issue fully resolved
- But some internal systems still had problems
October 20, 12:28
- Many customers and services significantly recovered
- Gradually easing the throttling on new EC2 instance launches
October 20, 15:01
- All services returned to normal
- Some services still processing backlogged messages
Summary of Root Causes
Here’s the breakdown:
- Primary Cause: DynamoDB DNS resolution failure
- Secondary Issue: Even after fixing DNS, new EC2 instances couldn’t be launched because EC2’s internal subsystems depend on DynamoDB
- Tertiary Impact: Network Load Balancer health check failures caused cascading issues with Lambda, CloudWatch, etc.
3. News Coverage vs AWS Official Statement
Where They Align
Time and Location
- Both cite around 3 AM ET, US-EAST-1 region
- Recovery time also matches at around 6 PM
Cause
- Both news and AWS say it was a DNS problem
- Both mention DynamoDB-related issues
Impact
- Hundreds to thousands of services worldwide affected
- Various sectors: aviation, finance, gaming, telecom
Differences
Technical Details
News outlets simplified it to “DNS problem” for general audiences, but AWS provided more specifics:
- DNS resolution issues with DynamoDB API endpoints
- EC2 internal subsystem dependencies on DynamoDB
- Network Load Balancer health check cascading failures
Recovery Process Complexity
According to AWS’s statement, even after resolving the DNS issue, they deliberately throttled EC2 instance launches for gradual recovery. They proceeded cautiously to avoid triggering another failure. This wasn’t covered in the news.
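The throttling here was on AWS’s side, but the same dynamic shows up for clients during any recovery: API calls get throttled, and the sensible response is to retry with exponential backoff and jitter instead of hammering the endpoint. Here’s a rough sketch of that pattern; launch_instance is a hypothetical stand-in for whatever call is being throttled, not a real AWS SDK call.

```python
# Sketch of client-side retry with exponential backoff and full jitter,
# a common way to behave when an API is throttling requests during recovery.
import random
import time

def launch_instance():
    """Hypothetical stand-in for a throttled API call."""
    raise RuntimeError("RequestLimitExceeded")  # simulate throttling

def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise  # out of retries, give up
            # Exponential backoff capped at max_delay, with full jitter
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))

if __name__ == "__main__":
    try:
        call_with_backoff(launch_instance)
    except RuntimeError as err:
        print(f"gave up after retries: {err}")
```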
Internal Impact
News coverage only discussed external service disruptions, but AWS’s statement revealed that Amazon.com, subsidiaries, and even AWS Support team were affected. CNBC actually reported that Amazon warehouse workers couldn’t use internal systems and had to wait in break rooms.
Assessment
Overall, AWS was pretty transparent with information. But there are some disappointments:
- Why it happened: No explanation of the root cause of the DNS problem (software update error? Configuration mistake?)
- Compensation: No mention of compensation policy, which has been an issue before
- Prevention: Missing specific measures for preventing recurrence
4. Has This Happened Before?
Yes. This isn’t AWS’s first major outage.
Major Outage History
December 7, 2021 - US-EAST-1 Outage
- The most serious recent major outage
- Lasted over 5 hours
- Airline reservations, car dealerships, payment apps, video streaming all down
- Caused by an automated scaling activity that congested AWS’s internal network (the Amazon Kinesis Data Streams outage was a separate incident, in November 2020)
June 13, 2023 - US-EAST-1 Outage
- Websites offline for several hours
- Affected The Boston Globe, NYC MTA, Associated Press
- Issues with Lambda, EventBridge, SQS, CloudWatch
Others
- 2020: Multiple outages
- November 22, 2018: Korea Seoul region down (DNS issue!)
- 2017: S3 disaster
- 2015: DynamoDB outage
Pattern Analysis
This October 20, 2025 Outage
- About 2 years and 4 months since the June 2023 outage
- About 3 years and 10 months since the December 2021 major outage
Patterns Found:
- US-EAST-1 region keeps being at the center of problems
- DNS-related issues repeat (2018 Korea, 2025 US)
- Major outages occur roughly every 2-3 years
- DynamoDB repeatedly problematic (2015, 2025)
5. How Was AWS’s Response?
What They Did Well
Quick Initial Response
- Official acknowledgment and announcement within about 22 minutes of the problem
- Root cause clearly identified within about 40 minutes, going by their own timeline (23:49 impact start → 00:26 update)
Transparent Communication
- Real-time updates via Health Dashboard
- Disclosed technical details
- Provided specific timeline
Careful Recovery
- Stable recovery through EC2 throttling after DNS resolution
- Phased approach to prevent additional failures
Overall, it seems like an open, well-handled response.
Room for Improvement
Repeated Issues
- DNS problems also occurred in Korea in 2018
- Same problem again after 7 years
- DynamoDB too (2015, 2025)
Single Region Dependency
- US-EAST-1 alone went down and the whole world stopped
- Global services still overly dependent on single regions
Compensation Policy
- Past cases where Korean companies didn’t receive proper compensation due to unfair contracts
- Still unclear what they’ll do this time
Lack of Preventive Measures
- Major outages repeat every 2-3 years
- Same US-EAST-1 keeps having issues
- Insufficient testing perhaps…
Other Clouds Have Similar Issues
It’s not just AWS:
- Microsoft Azure: Teams, Outlook, Microsoft 365 went down in January 2023
- Microsoft 365: October 2025 outage (Google used this for marketing lol)
- Google Cloud: June 2025 extended outage hit OpenAI, Shopify
- CrowdStrike (a security vendor rather than a cloud, but a similar systemic failure): a July 2024 software update mistake caused an estimated $5.4 billion in losses for Fortune 500 companies
Could be a structural issue for the entire cloud industry, or maybe this is the best we can do for now.
6. What Should We Learn?
Risks of Cloud Dependency
The Trap of Centralization
AWS accounts for about 30-33% of the global cloud market. Add Microsoft Azure and Google Cloud, and roughly two-thirds of the market sits with just three companies. We’re relying too heavily on a few providers. A classic Single Point of Failure problem.
How Can We Improve?
Multi-Cloud Strategy
Don’t depend on a single cloud provider. Distribute critical systems across multiple regions. At minimum, build failover systems for core functions.
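To make “failover” slightly more concrete, here’s a deliberately simplified sketch: health-check a primary endpoint and route to a secondary when it’s unreachable. Both URLs are hypothetical placeholders, and in practice you’d usually do this at the DNS or load-balancer level rather than in application code.

```python
# Minimal failover sketch: prefer the primary endpoint, fall back to the
# secondary if its health check fails. Both URLs are hypothetical.
import urllib.request

PRIMARY = "https://api.primary-region.example.com/health"
SECONDARY = "https://api.secondary-region.example.com/health"

def healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def pick_endpoint() -> str:
    """Route traffic to the primary if healthy, otherwise to the secondary."""
    return PRIMARY if healthy(PRIMARY) else SECONDARY

if __name__ == "__main__":
    print("routing to:", pick_endpoint())
```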
Design Differently from the Start
- Design for Failure
- Failure isolation mechanisms like the Circuit Breaker pattern
- Graceful degradation (the service doesn’t die completely; it just limits functionality), as in the sketch after this list
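Here’s a toy sketch of those last two ideas combined: a circuit breaker that stops calling a failing dependency for a while and serves a degraded fallback instead of dying completely. All the names (fetch_recommendations, cached_recommendations) are hypothetical.

```python
# Toy circuit breaker: after `threshold` consecutive failures the breaker
# opens and calls get a degraded fallback instead of hitting the dependency.
# After `reset_after` seconds it lets one probe call through again.
import time

class CircuitBreaker:
    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker opened

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                return fallback()          # open: degrade immediately
            self.opened_at = None          # half-open: allow one probe call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.time()
            return fallback()
        self.failures = 0
        return result

def fetch_recommendations():
    raise ConnectionError("dependency unavailable")  # simulate an outage

def cached_recommendations():
    return ["fallback item"]  # degraded but still usable response

breaker = CircuitBreaker()
for _ in range(5):
    print(breaker.call(fetch_recommendations, cached_recommendations))
```

The point isn’t this exact code; it’s that the rest of your service keeps answering, just with reduced functionality, instead of hanging on a dead dependency.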
From a Societal Perspective?
Regulations and Policies
- Need discussions on designating cloud services as critical infrastructure
- Strengthen regulations on SLA and compensation policies
- Manage cloud dependency of public services
Diversification
- Consider open-source alternatives or self-hosting
- Foster regional and national cloud ecosystems
- Prevent excessive centralization
Conclusion
The October 20, 2025 AWS outage clearly showed our significant cloud dependency. For about 15 hours, millions of people worldwide couldn’t use everyday services, and companies suffered massive losses.
AWS was reasonably transparent about the problem and responded quickly. But with major outages repeating every 2-3 years, and the same US-EAST-1 region at the center again and again, structural improvements seem necessary. It’s also concerning that similar failures keep recurring in core services like DNS and DynamoDB.
Remember this:
No cloud, including AWS, can be 100% perfect. Outages aren’t a matter of ‘if’ but ‘when’. So we all need to prepare for cloud outages.
The convenience and scalability of cloud are genuinely great and undeniable. But risks clearly exist too. Smart cloud usage starts with recognizing and preparing for these risks.
No system is perfect. What matters is how well you prepare for outages, how quickly you recover when they happen, and what you learn to prevent recurrence.
This AWS outage reminded us all of this lesson once again.
AWS can fail too, so always be prepared ;-)