Episode 9: When the Cloud Goes Dark - Observability After the AWS Outage
Yesterday's AWS outage cost hundreds of billions of dollars and took down Snapchat, Coinbase, Ring, and even Amazon's own retail site. More than 15 hours of chaos exposed a critical truth: most organizations are doing observability completely wrong.
THE RECEIPTS:
- October 20, 2025, 3:11 AM ET - DNS resolution failure in US-EAST-1
- 15 hours 12 minutes to full recovery
- 50,000+ simultaneous Downdetector reports at peak
- 70+ AWS services affected
- $2M/hour median cost for enterprises (New Relic 2025 Forecast)
- Organizations with proper observability: 50% cost reduction
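To put those receipts in perspective, here is a back-of-the-envelope sketch in Python. It simply multiplies the figures quoted above; it assumes one hypothetical enterprise was affected for the full 15 hours 12 minutes at the median hourly cost, which is an illustration, not a measured loss.

```python
# Illustrative arithmetic only: combines the figures quoted in the show notes.
# Assumption: a single enterprise is down for the entire outage window.
MEDIAN_COST_PER_HOUR = 2_000_000   # USD, median enterprise figure quoted above
OUTAGE_MINUTES = 15 * 60 + 12      # 15 hours 12 minutes to full recovery
OBSERVABILITY_REDUCTION = 0.50     # 50% cost reduction quoted above


def outage_cost(minutes, cost_per_hour, reduction=0.0):
    """Cost of an outage lasting `minutes`, optionally reduced by a fraction."""
    return minutes * cost_per_hour / 60 * (1 - reduction)


baseline = outage_cost(OUTAGE_MINUTES, MEDIAN_COST_PER_HOUR)
with_observability = outage_cost(
    OUTAGE_MINUTES, MEDIAN_COST_PER_HOUR, OBSERVABILITY_REDUCTION
)

print(f"Baseline:           ${baseline:,.0f}")            # $30,400,000
print(f"With observability: ${with_observability:,.0f}")  # $15,200,000
```

Under those assumptions, the gap between the two numbers is roughly $15 million for one incident.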
WHAT FAILED:
DNS couldn't resolve DynamoDB endpoints → EC2 launch failures → Network Load Balancer health checks failed → 70+ services cascaded down. Even AWS's own monitoring systems went offline.
REAL IMPACT:
- Coinbase locked out during trading hours
- Eight Sleep smart mattresses stuck in "relax mode"
- Disabled users lost Alexa-controlled lights
- Students couldn't submit assignments (Canvas down)
- Ring doorbells blind during security incidents
- Amazon warehouse workers sent to break rooms
THE THREE PILLARS OF OBSERVABILITY:
1. Metrics (Prometheus, CloudWatch, Azure Monitor)
2. Logs (ELK stack, Splunk, centralized logging)
3. Traces (OpenTelemetry, Jaeger for distributed systems)
CRITICAL LESSON: If your observability stack lives in the same cloud region you're monitoring, it goes down when you need it most. CloudWatch was down during the AWS outage.
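One way to check for this shared fate is a simple inventory audit. The Python sketch below is a minimal illustration: the `Component` type, the service names, and the vendor and region labels are all hypothetical placeholders, and a real audit would pull this inventory from your own asset records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    provider: str
    region: str


def shared_fate(workloads, monitors):
    """Return names of monitors hosted in the same provider and region as any workload."""
    locations = {(w.provider, w.region) for w in workloads}
    return [m.name for m in monitors if (m.provider, m.region) in locations]


# Hypothetical inventory, for illustration only.
workloads = [Component("checkout-api", "aws", "us-east-1")]
monitors = [
    Component("cloudwatch-alarms", "aws", "us-east-1"),
    Component("external-uptime-probe", "other-vendor", "eu-west"),
]

at_risk = shared_fate(workloads, monitors)
print(at_risk)  # ['cloudwatch-alarms']
```

Anything this check returns is a monitor that is likely to go dark together with the systems it watches.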
5 LESSONS FROM THE OUTAGE:
1. Multi-region is the new minimum (multi-AZ didn't save anyone)
2. Observability must be independent (Datadog, New Relic, Dynatrace)
3. DR plans are useless if untested (monthly drills, not yearly)
4. Dependency mapping is critical (know what fails when X fails)
5. Control plane resilience matters (AWS support system went offline)
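Lesson 4 can be sketched as a small graph walk. The Python below is a minimal example, assuming a hand-written dependency map; the service names loosely mirror the cascade described earlier plus two made-up application entries, and they are not an actual AWS dependency graph.

```python
from collections import deque


def blast_radius(depends_on, failed):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map: for each dependency, list the services that rely on it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    affected.discard(failed)  # a cycle should not list the root cause itself
    return sorted(affected)


# Hypothetical, simplified map: service -> things it depends on.
depends_on = {
    "dynamodb": ["dns"],
    "ec2-launch": ["dynamodb"],
    "nlb-health-checks": ["ec2-launch"],
    "checkout-api": ["nlb-health-checks"],
    "static-site": [],
}

print(blast_radius(depends_on, "dns"))
# ['checkout-api', 'dynamodb', 'ec2-launch', 'nlb-health-checks']
```

Running the same function against each entry in your own map gives a rough answer to "what fails when X fails" before an outage forces the question.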
YOUR ACTION PLAN:
□ Audit observability stack independence
□ Map all cloud dependencies by region
□ Test DR plan THIS WEEK
□ Set up degradation alerts (not just "down" alerts)
□ Practice chaos engineering
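For the degradation-alert item, here is a minimal Python sketch of a three-state health check. The thresholds (1% errors, 50% errors, 500 ms p95 latency) are placeholder assumptions, not recommendations; tune them against your own service-level objectives.

```python
def classify(error_rate, p95_latency_ms, *,
             degraded_error=0.01, down_error=0.50, degraded_latency_ms=500):
    """Classify a service as 'ok', 'degraded', or 'down'.

    All thresholds are hypothetical defaults for illustration.
    """
    if error_rate >= down_error:
        return "down"
    if error_rate >= degraded_error or p95_latency_ms >= degraded_latency_ms:
        return "degraded"
    return "ok"


print(classify(0.00, 120))  # ok
print(classify(0.03, 120))  # degraded
print(classify(0.00, 900))  # degraded
print(classify(0.90, 120))  # down
```

The point of the middle state is to page someone while error rates and latency are still climbing, rather than only after the service is fully down.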
"The prudent see danger and take refuge, but the simple keep going and pay the penalty." - Proverbs 27:12
NEXT EPISODE: CI/CD Pipeline Security - SBOM, artifact signing, secrets management
SERIES ARC: This builds on our DevSecOps → Kubernetes → Multi-Cloud → Platform Engineering foundation.
FIND US:
🌐 FaithFreedomTech.com
📝 DevSecOpsWithScott.com
📝 scottwhoughton.medium.com
🐦 @FaithFT_Podcast (X)
📱 @FaithFreedomTech (everywhere else)
Available on all podcast apps - Apple, Spotify, Google, Amazon Music, and more.
#DevSecOps #CloudArchitecture #SiteReliability #AWS #Observability #MultiCloud
Connect: IG/TikTok/FB/TruthSocial: @FaithFreedomTech | X: @faithft_podcast | FaithFreedomTech.com | Email: [email protected]