Quick answers
- What is an incident RCA?
- RCA stands for root cause analysis: a post-incident review, usually held as a meeting, to understand what happened, why, and how to prevent it.
- What do you cover in an RCA?
- Timeline, impact, root cause, and action items to prevent recurrence.
- How do I describe the root cause?
- Be specific and factual. State what failed and why, without blame.
What it is
After an outage or incident, teams run an RCA (or postmortem). The format is often: What happened? What was the impact? What was the root cause? What are we doing to prevent recurrence? The focus is on learning, not blame.
Why it matters
Clear communication in RCAs helps the whole team learn. Describe timelines, causes, and remediation in plain language so that people who were not involved in the incident can follow. Blame-free language and owning your part, without over-apologizing, build trust.
Instead of → Say
| Instead of | Say |
|---|---|
| It was my fault | I missed the rate limit in the config. I've updated the runbook. |
| The system broke | The database connection pool was exhausted under load. We've increased the pool size. |
| We fixed it | We rolled back the deploy, restored from backup, and validated the fix in staging. |
| It won't happen again | We're adding a pre-deploy check for config changes and alerting on pool usage. |
| I don't know why | The root cause appears to be the combination of [X] and [Y]. I'll dig deeper and update the doc. |
Example dialogue
Facilitator: Can you walk us through the timeline?
You: The deploy went out at 2 p.m. Metrics looked normal until 2:15, when we saw a spike in 5xx errors. We rolled back at 2:25 and restored service by 2:40.
Facilitator: What was the root cause?
You: The new feature introduced a query that wasn't indexed. Under load, the slow query caused cascading failures. I've added the index and we're adding a performance test to catch this.
Facilitator: Any other action items?
You: I'll update the runbook with the rollback steps. The process worked, but we can document it better.
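If you want to double-check the numbers before presenting a timeline like the one in this dialogue, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the times come from the dialogue above, while the date, the event names, and the helper functions (`at`, `minutes_between`, `summarize`) are assumptions made up for the example.

```python
from datetime import datetime

# Times are taken from the example dialogue; the date is a placeholder.
INCIDENT_DATE = "2024-01-01"


def at(hhmm: str) -> datetime:
    """Build a timestamp on the placeholder incident date from 'HH:MM'."""
    return datetime.strptime(f"{INCIDENT_DATE} {hhmm}", "%Y-%m-%d %H:%M")


# Hypothetical event names for the four key moments in the dialogue.
TIMELINE = {
    "deploy": at("14:00"),        # "The deploy went out at 2 p.m."
    "first_errors": at("14:15"),  # spike in 5xx errors
    "rollback": at("14:25"),      # mitigation starts
    "restored": at("14:40"),      # service restored
}


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


def summarize(timeline: dict[str, datetime]) -> dict[str, int]:
    """Durations (in minutes) that are worth stating out loud in an RCA."""
    return {
        "deploy_to_first_errors": minutes_between(
            timeline["deploy"], timeline["first_errors"]
        ),
        "first_errors_to_rollback": minutes_between(
            timeline["first_errors"], timeline["rollback"]
        ),
        "rollback_to_restored": minutes_between(
            timeline["rollback"], timeline["restored"]
        ),
        "user_impact": minutes_between(
            timeline["first_errors"], timeline["restored"]
        ),
    }


for label, minutes in summarize(TIMELINE).items():
    print(f"{label}: {minutes} min")
```

Having these durations ready lets you say "errors started 15 minutes after the deploy and users were affected for 25 minutes" instead of leaving listeners to do the math.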
Common mistakes
- Focusing on blame instead of process
- Being vague about the root cause
- Skipping action items or follow-ups
- Over-apologizing instead of owning the issue and moving on to solutions
- Using too much jargon without context
Frequently asked questions
- How do I describe my role in an incident without blaming myself?
- Stick to facts: "I made a change to [X]. That led to [Y]. I've [remediation]." Focus on what you learned and what you'll do differently.
- What if I don't know the root cause yet?
- Say so: "We're still investigating. Our current hypothesis is [X]. I'll update the doc when we have more data."
- How do I suggest process improvements?
- "To prevent this, I suggest we [action]. That would have caught [problem]."
- Should I mention others' mistakes?
- Focus on the system and process. If someone's action is relevant, describe it factually without naming the person: "The deploy included a config change that wasn't in the runbook."
- How detailed should the timeline be?
- Include the key moments: deploy, first alert, start of mitigation, and service restored. Give times to the minute, because a few minutes can change how the incident is understood.