The Sprint Review That Became an Actual Incident Response

Jan 5, 2025

It was supposed to be a routine meeting. Sprint review, Q&A with product, a quick demo of the new alerting dashboard. Nothing complicated. We were three engineers deep into the release notes, sharing a screen and waiting for the last stakeholder to join. The invitation had promised thirty minutes and “a chance to see the new monitoring features.” I should have deployed the hotfix first.

Our team—officially named Platform Reliability—maintains the backend systems for CloudCore, a platform that delivers infrastructure monitoring and observability to enterprise clients. We own the ingestion pipeline, the alerting engine, and the configuration management layer we inherited from a previous team. Standard work for a mid-sized SRE function.

The stakeholders on the call were mostly from product and commercial: product managers, a solutions engineer, and the kind of people who use phrases like “reliability as a differentiator” and genuinely mean it. They had joined to see a demo of the new threshold alerting feature we had spent the last two weeks building. And for the first eight minutes, everything was fine.

Then the dashboards started going red.

Not our demo dashboard. The real one. The ingestion pipeline latency alert fired first, followed thirty seconds later by a processing backlog notification that should not have been possible at the load we were seeing. Someone on the call noticed before I did. “Is that… supposed to be happening?” one of the product managers asked, pointing out the production monitoring dashboard that was, unfortunately, visible in the background of the screen share.

I switched tabs. The ingestion service was degraded. Latency was climbing. Three enterprise clients had already triggered their own alerting thresholds and were presumably opening support tickets.

This is the part of the story where preparation matters.

Because we had runbooks. Because we had practiced incident response. Because our on-call rotation meant someone else was already being paged while I was on this call, I was able to say, calmly: “We have an active incident. I need five minutes to hand off to the on-call engineer and then I can walk you through what we are seeing.”

The product manager nodded. The solutions engineer, to their credit, said, “Take whatever time you need.”

I muted myself, opened a new incident channel, confirmed the on-call engineer had acknowledged the page, shared the relevant runbook link, and rejoined the call in under four minutes.

What followed was not a disaster. It was a case study in why incident response culture matters more than incident prevention.

We kept the sprint review running. I was transparent: a production issue had been triggered by a configuration change that had passed staging but exposed an edge case under real load. We did not have a full root cause yet, but we had a mitigation path and the on-call engineer was executing it. We estimated twenty minutes to resolution.

The stakeholders were not angry. They were, honestly, impressed. Not by the incident—no one is impressed by downtime—but by the response. Calm, structured, transparent. No scrambling. No blame. Just process.

Seventeen minutes later, the on-call engineer posted the all-clear in the incident channel. Service restored, edge case identified, a post-incident review scheduled for the following morning.

We resumed the demo.

There are a few things I took away from that afternoon.

First: never demo from a screen that shows production metrics. Create a dedicated demo environment or at minimum use a filtered view. This is not optional.

Second: incident response is a skill, and it degrades without practice. We ran quarterly incident drills, and that sprint review was the first time I was genuinely grateful for every hour we had spent on them.

Third: stakeholders respond to confidence and honesty. They do not expect perfection. They expect to be treated like adults. When something breaks, telling them clearly what is happening, what you are doing about it, and when you expect resolution is almost always better received than minimizing or deflecting.
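The three-part update described above (what is happening, what you are doing about it, when you expect resolution) can be sketched as a small template. This is a minimal Python illustration, not a real tool: the `StatusUpdate` class, its field names, and the wording of the rendered message are all made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatusUpdate:
    """Hypothetical three-part stakeholder update; names are illustrative only."""

    what_is_happening: str
    what_we_are_doing: str
    eta_minutes: Optional[int] = None  # None means no estimate is available yet

    def render(self) -> str:
        # Say so plainly when there is no estimate instead of guessing one.
        if self.eta_minutes is not None:
            eta = f"about {self.eta_minutes} minutes"
        else:
            eta = "not yet known; next update to follow"
        return "\n".join(
            [
                f"What is happening: {self.what_is_happening}",
                f"What we are doing: {self.what_we_are_doing}",
                f"Expected resolution: {eta}",
            ]
        )


update = StatusUpdate(
    what_is_happening="Ingestion latency is elevated in production.",
    what_we_are_doing="The on-call engineer is executing the mitigation.",
    eta_minutes=20,
)
print(update.render())
```

The point of forcing all three fields is that an update with a missing estimate still has to say that the estimate is missing, rather than leaving stakeholders to infer it.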

The sprint review eventually wrapped up. The new alerting features got their demo. The post-incident review identified a configuration validation gap that we fixed within the week.
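The specifics of that validation gap are not covered here, so the following is only a sketch of what a pre-deploy configuration check might look like. Every key name (`batch_size`, `flush_interval_ms`, `max_queue_depth`) and every rule is hypothetical, chosen purely to show the shape of the idea: reject a change before it ships rather than discover the edge case under real load.

```python
from typing import Any, Dict, List

# Hypothetical required settings; a real pipeline would define its own.
REQUIRED_POSITIVE_INTS = ("batch_size", "flush_interval_ms", "max_queue_depth")


def validate_ingestion_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the config passed."""
    errors: List[str] = []

    for key in REQUIRED_POSITIVE_INTS:
        value = config.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            errors.append(f"{key} must be a positive integer, got {value!r}")

    # Cross-field rule (an assumption for illustration): a single batch
    # should never be larger than the queue that has to hold it.
    if not errors and config["batch_size"] > config["max_queue_depth"]:
        errors.append("batch_size must not exceed max_queue_depth")

    return errors


candidate = {"batch_size": 5000, "flush_interval_ms": 200, "max_queue_depth": 1000}
for problem in validate_ingestion_config(candidate):
    print(f"config rejected: {problem}")
```

Returning a list of problems instead of raising on the first one means a reviewer sees everything wrong with a change in a single pass.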

And I deployed the hotfix before the next sprint review.

Because in infrastructure, the scariest thing is not a production incident.

It is a production incident with no runbook and no plan.