Learning SRE, one day at a time.
A real-play show playing the multiversal adventure game Troika. Starring Tom Wardle, Paul Harrop, and Stephen Townshend.
A show about performance testing and engineering (and human beings). Hosted on Acast. See acast.com/privacy for more information.
Owen and Ridge get put in a waiting room, and respond accordingly. Starring Tom Wardle (GM), Paul Harrop (Owen Sunderland), and Stephen Townshend (Ridge Mackelbie). Troika was created by the Melsonian Arts Council: https://www.melsonia.com/ You can find out more about Troika here: https://www.troikarpg.com/ All music is by Harrison S…

The Root Cause Fallacy with Andrew Hatch (Episode 98)
32:22
This week I'm joined by SRE leader Andrew Hatch from Cisco ThousandEyes to talk about a dirty word in the resilience community... root cause. In this excellent conversation we explore... 🌌 Is the root cause of every incident the big bang? 🦖 How the value of root cause degrades as complexity increases 🫣 That if the culture is not blam…
Owen and Ridge face down impossible odds in the grand arena, and Owen comes face to face with one of his own kind. Starring Tom Wardle (GM), Paul Harrop (Owen Sunderland), and Stephen Townshend (Ridge Mackelbie). Troika was created by the Melsonian Arts Council: https://www.melsonia.com/ You can find out more about Troika here: https…

Synthetic Monitoring with David Dick (Episode 97)
33:04
This week I'm joined by David Dick from 2 Steps to (finally!) discuss synthetic monitoring. We cover... 🤖 What is synthetic monitoring? 🦾 What are the benefits and drawbacks to using it? ☢️ Non-web based synthetics (the tough stuff) 🍹 Combining RUM and synthetics 🫢 Does synthetics need an OTEL-like framework? ...and much more. You ca…
Owen and Ridge board the imposing Crimson Citadel and experience crippling bureaucracy at its finest. Starring Tom Wardle (GM), Paul Harrop (Owen Sunderland), and Stephen Townshend (Ridge Mackelbie). Troika was created by the Melsonian Arts Council: https://www.melsonia.com/ You can find out more about Troika here: https://www.troika…
Owen and Ridge are given a quest by the mysterious Professor Eska Howell, and our two adventurers make their way to rescue his assistant, Kennick. Starring Tom Wardle (GM), Paul Harrop (Owen Sunderland), and Stephen Townshend (Ridge Mackelbie). Troika was created by the Melsonian Arts Council: https://www.melsonia.com/ You can find out…
Join Tom, Paul, and Stephen as they play the science fiction RPG "Troika" and embark on a multiversal adventure across time, space, and reality. Starring Tom Wardle (GM), Paul Harrop (Owen Sunderland), and Stephen Townshend (Ridge Mackelbie). Troika was created by the Melsonian Arts Council: https://www.melsonia.com/ You can find out…

Tech Leadership with Milan Brown (Episode 96)
31:27
This week I'm joined by Cin7 Engineering Director Milan Brown to unpack the challenges of technology management and leadership. We discuss... ✖️ Theory X vs Theory Y management 🗣️ Intention based leadership and communication 🏢 Conditions in an org for people to thrive 😵💫 How do you learn to manage and lead? 🫤 Managing people when yo…

Finding Tech Work with Leon Adato (Episode 95)
36:26
This week Leon Adato and I break down the state of applying for roles in tech. We cover... 📝 What a resume or CV is and is not 🤝 Leveraging your connections rather than relying on applying cold 🪄 How most job descriptions are works of fiction 🦾 White-fonting to game AI resume assessment 🧪 Experimental ways we could recruit ...and our…

Getting a Start in SRE with Priyam Kumar (Episode 94)
31:09
This week Priyam Kumar shares his story of moving from a massive organisation to a startup and the challenges and growth that came from that. We discuss... 🪖 War stories and examples of production incidents 🩹 The "hacks" we build to keep things running (and how maybe that's just normal) 😎 Keeping it simple... YAGNI (You Ain't Gonna N…

SRE Leadership with Michelle Casey (Episode 93)
39:29
This week Michelle Casey shares her insights as a 'head of' engineering manager in the SRE context. This was one of my favourite conversations on the podcast so far. We cover topics such as... 🤷🏽 Why move into leadership? 👁️ Learning from other leaders 💎 What is unique about SRE leadership? 👑 Women in engineering leadership ...and we…

Observability Maturity with Ádám Tóth (Episode 92)
30:09
This week Adam and I get philosophical about what constitutes maturity in the field of observability. We tackle questions such as... 💸 Does your org treat observability as a cost centre or a value add? 🔥 Are you using observability reactively to solve problems? Or proactively to build better products and services? 👤 Is your observabi…
In this episode I explore the challenges of achieving unified observability when integrating with SaaS products and services. I cover: 🌊 The new wave of mega-complex SaaS ⚗️ Challenges integrating SaaS with our observability pipelines 👩🦯 How the lack of SaaS autonomy limits the effectiveness of OpenTelemetry 💰 Paying twice to ingest…

Non-Prod Reliability Engineering + 2024 Wrap (Episode 90)
18:13
This week I check in and give an update on work, life, and my attempts at bringing to life SRE practices in the world of non-production environment management. You can find the official Slight Reliability podcast website at: https://slightreliability.com/ You can find Stephen at: LinkedIn: https://www.linkedin.com/in/stephentownshend…

Slight Reliability Episode 89 - Blameless Post-mortems with Karanveer Anand
26:06
This week I'm joined by Karanveer Anand, SRE Technical Program Manager at Google to discuss blameless post-mortems. We cover: 🦅 The recent Crowdstrike outage and their public post-mortem 🚑 When do we do a blameless post-mortem? 😕 How do we do a blameless post-mortem? ✅ How do we make sure action items are followed through? 📰 The powe…

Slight Reliability Episode 88 - OpenTelemetry Revisited with Zach Michel
26:51
This week Zach Michel from https://middleware.io/ and I discuss the state of OpenTelemetry and what it means to adopt it. We cover: 🌩️ Achieving observability in a SaaS world 🥫 Context propagation - the magic sauce of OTEL 🚪 The telemetry gateway concept and leveraging the OTEL collector 🪵 The state of OpenTelemetry logging 🫂 Making …

Slight Reliability Episode 87 - Measuring the value of SRE with Artem Yakimenko
35:33
In Episode 80 Niall Murphy talked about the need for SREs to be better at articulating the value of our work. In this episode I'm joined by ex-Googler and Engineering Director (SRE) at Culture Amp Artem Yakimenko to discuss how we might achieve this. We discuss both quantifiable and qualitative approaches including leveraging the untapped…

Slight Reliability Episode 86 - Evolving SLOs with Dom Finn
25:57
In the world of SRE we constantly talk about defining SLOs, but what about evolving them over time? This week I chat with SRE Tech Lead Dom Finn about just that. We cover the relationship between reliability and user analytics, latency classes as a way to speak SLOs with business stakeholders, the role of NFRs and how the thresholds …

Slight Reliability Episode 85 - Feeling SaaSsy
11:08
This week I talk about the impact of SaaS-first technology strategies on the work of an SRE. I pose questions about observability, ownership, on-call, and how much control we have over reliability. You can find the Bleeding Tech blog on Medium: https://medium.com/@stownshend You can find Stephen at: LinkedIn: https://www.linkedin.com…

Slight Reliability Episode 84 - Clinical Troubleshooting with Dan Slimmon
27:40
This week I chat with Dan Slimmon about applying the approach doctors use to treat patient symptoms during incident response. You can find Dan's blog at https://blog.danslimmon.com/ or connect with him on LinkedIn here: https://www.linkedin.com/in/danslimmon/ You can find the official Slight Reliability podcast website at: https://sl…

Slight Reliability Episode 83 - An Unfulfilled Promise with Itiel Shwartz
30:32
This week I hear about all things Kubernetes from Komodor CTO and co-founder Itiel Shwartz. We chat about the promise that was made when Kubernetes first entered the industry, the challenge of getting developers engaged and capable of working in Kubernetes, my hate/hate relationship with Helm but its important contribution to the Kub…

Slight Reliability Episode 82 - CI/CD with Amin Astaneh
25:47
This week I sit down and have a discussion with Amin Astaneh (from Certo Modo) about CI/CD. We cover the power of the standard change as a way to navigate ITIL while still implementing DevOps practices, what to monitor to make your CI/CD observable, single piece flow, testing in production, and so much more. You can find Amin on his …

Slight Reliability Episode 81 - Incident Management in Non-Prod Environments
10:09
"Environment issues are just incidents that happened to occur in a non-production environment"... so why do we treat them so differently? In this first episode of the 2024 season I reflect on how we handle incidents in non-prod environments. (Note: Had a few issues with noise suppression in OBS Studio cutting off the start of some wo…

Slight Reliability Episode 80 - What's Been Bugging Niall Murphy
36:45
This week I speak with co-author of the original SRE book + the SRE workbook, and renowned speaker Niall Murphy. We chat about the state of SRE in the current macro-economic climate and how we're not yet doing a very good job at articulating the value of SRE to leaders, the relationship that velocity and reliability have, the value o…

Slight Reliability Episode 76 - Sampling Distributed Traces with Paige Cruz
45:27
Paige Cruz (from Chronosphere) is back. This week we discuss sampling. What is sampling? Why do it? What kinds of sampling are there? You can check out Chronosphere's cloud native observability platform here: https://chronosphere.io/ You can find Paige on: LinkedIn: https://www.linkedin.com/in/paigerduty/ X: https://twitter.com/paige…

Slight Reliability Episode 79 - Incident Story Time with Valeska Victoria
37:51
This week Valeska Victoria returns to share some of her experiences working as an SRE at eBay. We look at the cascading effect of production issues in complex integrated environments (how there's often no single root cause), developer literacy of how infrastructure works, the importance of ownership and accountability of reliability,…

Slight Reliability Episode 78 - Developer Experience with Ankit Jain
32:21
This week I chat with Ankit Jain from aviator.co about developer experience. We define developer experience and developer productivity, and how this applies to SRE. We discuss the growing expectation on developers and how this leads to frustration and burnout. We also explore how to measure developer experience and how to start worki…
A brief mid-week update on my changing circumstances and the future of the podcast. By Stephen Townshend

Slight Reliability Episode 77 - SRE to DevRel with Liz Fong-Jones
31:53
This week I had the privilege of interviewing Liz Fong-Jones from honeycomb.io about DevRel, Developer Advocacy, and how that applies to SRE. We discuss the difference between Developer Relations (DevRel) and Developer Advocacy, how Liz got into advocacy, how DevRel helps companies and the community, and some tips on how to get tract…

Slight Reliability Episode 75 - Enterprise SRE with Steve McGhee
39:00
This week I had the honour of chatting with Steve McGhee (former Google SRE, current Google Reliability Advocate, and co-author of Enterprise Roadmap to SRE). We discuss the evolution of SRE from where it began at Google and how it is being adopted by enterprises around the world now (and why this is happening). We talk about getting…

Slight Reliability Episode 74 - The Hidden Side of Vendor Lock-In
8:55
This week on Slight Reliability Stephen discusses observability vendor lock-in. What is it? What does OpenTelemetry do to help? What areas are yet to be solved? You can find the official Slight Reliability podcast website at: https://slightreliability.com/ You can find Stephen at: LinkedIn: https://www.linkedin.com/in/stephentownshen…

Slight Reliability Episode 73 - Enterprise SLOs with Brian Singer
32:18
This week we sit down and talk about SLOs with CPO and co-founder of Nobl9 Brian Singer. We talk about the importance of reviewing operational effectiveness, getting buy in from leadership, using SLOs to reduce noise, how to implement SLOs within different cultures and structures, the parallels between security and reliability... and…

Slight Reliability Episode 72 - Rapid Incident Response with Valeska Victoria
42:19
This week Stephen chats with Valeska Victoria about her time working as an SRE at eBay. Valeska shares her data driven approach to SRE, having a voice as a less experienced engineer, handling incidents under high pressure, leveraging large language models to rapidly find the information you need during an incident, and much more. You…

Slight Reliability Episode 71 - Implementing SRE with Dr. Vlad Ukis
29:25
This week Stephen chats with Dr. Vlad Ukis about his journey discovering, and then implementing SRE practices at Siemens Healthineers (which led to him writing a book). They discuss how the evolution of infrastructure necessitates a shift in how we operate, the power of selling SRE practices, the SRE infrastructure used to build SLOs…

Slight Reliability Episode 70 - Meta SRE with Amin Astaneh
42:24
Amin Astaneh (from Certo Modo) is back to discuss his experience working as a production engineer (SRE equivalent) at Meta. Stephen and Amin discuss what it's like interviewing for big tech, "you build it, you own it", different SRE engagement models, SRE at different sizes of organisation, socialising your SRE success as a way to ge…

Slight Reliability Episode 69 - Developer to SRE with Praveen Kasam
30:10
This week Stephen talks to Praveen Kasam from Diconium Digital Solutions about how he led SRE transformations. Praveen shares his experience transitioning from development to SRE and how leveraging automation and bringing application knowledge to the ops team provided quick wins. He also covers how he later applied SRE concepts to up…

Slight Reliability Episode 68 - Dashboards and Modern Observability with Eric Schabell
32:31
This week Stephen asks Eric Schabell (Director of Technical Marketing & Evangelism @ Chronosphere) about how dashboards fit into modern observability. They discuss how untamed observability can lead to unexpectedly high cloud bills, the similarities between dashboards and documentation, the "know > triage > understand" workflow, and …

Slight Reliability Episode 67 - Single Pane of Glass with Jamie Allen and Adam Kinniburgh
34:36
This week Stephen chats with Jamie Allen (Chief Technologist AWS & SRE @ EPAM Systems) and Adam Kinniburgh (VP Innovation @ SquaredUp) about the concept of a single pane of glass (SPOG) for SRE. Is it performance art or something actionable? Can alerting replace the need for dashboards? And are metrics drowning in the wake of distrib…

Slight Reliability Episode 66 - Building Digital Assistants for SRE with Kyle Forster
29:51
This week Stephen brings back Kyle Forster from RunWhen to talk about the purple elephant in the room… “AI”. What makes it GenAI, LLM, Advanced Statistics, or ML? Kyle shares his experience surrounding building AI powered search engines for SRE troubleshooting commands and how to incorporate a (paid) open source community of experts …

Slight Reliability Episode 65 - The Truth About Incidents with Courtney Nash
41:04
This week Stephen chats with the internet incident librarian herself, Courtney Nash. They explore what Courtney has learned through meta-analysis of the over ten thousand incidents in the Verica Open Incident Database (VOID). They cover why MTTR needs to go in the garbage, joint cognitive systems, the value of looking at near misses…

Slight Reliability Episode 64 - Observability During Development with Martin Thwaites
36:18
This week Stephen chats with Martin Thwaites from Honeycomb about how developers can leverage observability to understand what they're building better, solve bugs quicker, and have more time for coding. They also discuss OpenTelemetry (the protocol and semantic conventions), manual versus automatic instrumentation, and how keeping ev…

Slight Reliability Episode 63 - The Power of Summary
9:20
Observability is a necessary adaptation to make sense of software systems in the Digital Age, but how can we unlock its power for non-engineer stakeholders (such as executives, product owners, etc)? Perhaps we need a layer of abstraction sitting on top of our detailed observability to get the most out of it. You can find the official…

Slight Reliability Episode 62 - On-Call with Matt Brown
36:57
This week Stephen chats with former-Google SRE Matt Brown about being on-call. They cover how to up-lift junior engineers so they can be on-call, what a fair on-call schedule looks like, run-books, and much more. As you heard, Matt believes flexibility is key to a healthy on-call rotation. Matt is exploring ideas for improvements to …

Slight Reliability Episode 61 - SRE VS DevOps VS Platform Eng... (Yawn)
6:07
The internet is full of people who want to tell you about SRE, DevOps, and Platform Engineering and how different and similar they are... and will give you the impression that these things compete with each other. But do they? And is it a helpful question to ask in the first place? You can find the official Slight Reliability podcast…

Slight Reliability Episode 60 - From Zero to SRE with Amin Astaneh
42:46
In this episode Amin Astaneh from Certo Modo discusses his experience undertaking an SRE transformation over several years. Stephen and Amin cover a lot of ground including making ops work visible, measuring toil, the power of calculating the $ value of work, getting developers on-call, the embedded model for SRE, SLOs, culture chang…

Slight Reliability Episode 59 - Bad API Observability with Sonja Chevre
40:23
In this episode Stephen Townshend and Sonja Chevre from Tyk discuss making APIs observable, and some anti-patterns to avoid. They cover GraphQL, OpenTelemetry and semantic conventions, correlation IDs, observability pipelines, and much more. You can find Sonja on LinkedIn: https://www.linkedin.com/in/sonjachevre/ and Twitter: https:/…

Slight Reliability Episode 58 - Tackling Cloud Cost with Harinder Seera
36:54
In this episode Stephen Townshend and Harinder Seera explore how to monitor and manage the cost of cloud. They discuss FinOps as a cultural practice, anti-patterns for implementing in the cloud, keeping cost down through resources, pricing, and architecture... and much more. You can find Harinder on LinkedIn: https://www.linkedin.com…

Slight Reliability Episode 57 - A Tale of Three Conferences
16:10
In this episode Stephen shares his experiences traveling overseas to the UK and Singapore AWS Summit, SREcon APAC, and the internal SquaredUp conference "SqUpCon". You can find the official Slight Reliability podcast website at: https://slightreliability.com/ You can find Slight Reliability artwork on Instagram: https://www.instagram…
A quick update on Stephen's whereabouts and when the next episode will be released. By Stephen Townshend

Slight Reliability Episode 56 - Dashbored
14:06
In this episode Stephen discusses the role of dashboards within the context of the Digital Era. What are they *not* appropriate for? What can they help with? What kinds of things are suitable to present? If you want to get involved in the SquaredUp dashboard competition head along to: https://squaredup.com/blog/dashboard-competition/…