Content provided by Tobias Schlottke - alphalist CTO Podcast.
Planning AND Practice: The Secret to Incident Response

Plan and PRACTICE for better incident response with insights from Tim Armandpour, CTO of PagerDuty. Learn the secrets to resilience from the team that mitigated the impact of a major outage, handling a 250% traffic surge while still delivering on their SLA.

Listen to find out:

  • 🛠️ Why planning AND practice are both critical for incident response.
  • 🚧 How to practice for incident response (e.g., Failure Fridays with Chaos Engineering).
  • 🧑‍🤝‍🧑 Ownership: Why tech AND business teams must join post-mortems.
  • ☁️ How to mitigate the impact of your cloud provider’s lower SLA.
  • ⚓ Which architectural patterns are more resilient?
  • ⚖️ WARNING: “Bend” the CAP theorem at your own risk.

Timestamps:

(00:00:00) Introduction to Alphalist Podcast
(00:01:00) Meet Tim Armandpour
(00:01:56) Tim's Early Career
(00:06:22) Handling Major Incidents at PagerDuty
(00:09:21) The Importance of Preparedness
(00:13:54) Practicing Failure Scenarios
(00:18:16) Resilient Infrastructure and Architectural Patterns
(00:22:44) Standardization and Data Management
(00:25:48) Exploring Infrastructure Resilience
(00:26:20) Achieving High Availability with Lower SLA Cloud Platforms
(00:29:38) Defining Meaningful SLIs
(00:32:15) Assessing Incident Readiness
(00:35:15) The Importance of Ownership
(00:41:30) Continuous Improvement
(00:43:53) Lessons from a Yogurt Business
(00:48:18) Final Thoughts and Takeaways
