The core idea
The postmortem should leave behind an evidence-backed fix plan, not just a story about what went wrong.
Most incident reviews fail for the same reason: they describe what happened but do not establish why this system failed in this way, or what specifically will prevent a repeat.
LLM and RAG incidents need a stricter postmortem than a typical application outage does. You are usually dealing with probabilistic behavior, multiple model or prompt versions, retrieval and context construction layers, and failure modes that only show up in one cohort or one document class.
This template is designed to keep the review mechanical: collect evidence, classify the failure, isolate the affected layer, and turn the incident into a short list of measurable fixes.
Rule of thumb
If the incident review cannot point to a trace, a version, an affected cohort, and a measurable impact, it is not finished yet.
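The rule of thumb can be turned into a small completeness check. This is a minimal sketch, not a prescribed schema: the `ReviewEvidence` class and its field names are hypothetical and simply mirror the four items named above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewEvidence:
    """Hypothetical record of the four items a finished review must point to."""
    trace_id: Optional[str] = None
    version: Optional[str] = None            # prompt, model, or index version
    affected_cohort: Optional[str] = None
    measurable_impact: Optional[str] = None


def missing_evidence(review: ReviewEvidence) -> list[str]:
    """Return the names of required fields that are empty or blank."""
    return [
        name
        for name, value in vars(review).items()
        if not (value and value.strip())
    ]


def is_finished(review: ReviewEvidence) -> bool:
    """A review counts as finished only when nothing is missing."""
    return not missing_evidence(review)


# Illustrative draft: a trace and a version are attached, the rest is not.
draft = ReviewEvidence(trace_id="trace-123", version="prompt-v7")
print(missing_evidence(draft))
```

A check like this could run when the document is submitted for sign-off, so an incomplete review is flagged before the meeting rather than during it.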
Why AI incidents need a different postmortem
In a traditional service outage, the failure path is often obvious: a dependency failed, the queue backed up, or a deploy introduced a regression. In AI systems, the failure path can span retrieval, reranking, prompt assembly, generation, tool execution, validation, and serving.
- The answer can be wrong while the request still returns HTTP 200.
- The best document can be present in retrieval but never reach the final context.
- A model change can look harmless overall but break one high-value cohort.
- Latency can explode because of reranking, retries, or tool loops instead of the model itself.
- Cost spikes can come from hidden fallback behavior, not traffic growth.
The postmortem has to prove which layer failed first, which layers amplified the issue, and which measurement would have caught it earlier.
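One way to make "which layer failed first" mechanical is to record a pass/fail check per stage and walk the stages in pipeline order. The sketch below assumes you already have such checks. The stage names follow the list above, the check results are hypothetical, and treating every later failing stage as a possible amplifier is a simplification.

```python
from typing import Optional

# Stages in the order a request passes through them (taken from the text above).
PIPELINE_ORDER = [
    "retrieval",
    "reranking",
    "prompt_assembly",
    "generation",
    "tool_execution",
    "validation",
    "serving",
]


def first_failing_layer(checks: dict[str, bool]) -> Optional[str]:
    """Return the earliest stage whose check failed, or None if none failed."""
    for stage in PIPELINE_ORDER:
        if checks.get(stage) is False:
            return stage
    return None


def amplifier_candidates(checks: dict[str, bool]) -> list[str]:
    """Return failing stages downstream of the first failure."""
    first = first_failing_layer(checks)
    if first is None:
        return []
    start = PIPELINE_ORDER.index(first) + 1
    return [s for s in PIPELINE_ORDER[start:] if checks.get(s) is False]


# Hypothetical check results for one incident trace.
checks = {
    "retrieval": False,       # stale chunks were eligible
    "reranking": False,       # latency budget exceeded
    "prompt_assembly": True,
    "generation": True,
}
print(first_failing_layer(checks), amplifier_candidates(checks))
```

Stages without a recorded check are skipped rather than assumed healthy, which keeps a missing measurement visible as a gap instead of a silent pass.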
The postmortem template
Use this as the minimum document structure. Keep it short, specific, and evidence-backed.
| Section | What belongs here | Good output |
|---|---|---|
| Summary | What happened, when, and who was affected | One paragraph, no speculation |
| Impact | User impact, cohort, severity, duration | Measurable business or user outcome |
| Timeline | Detection, mitigation, recovery, follow-up | UTC timestamps and owner names |
| Root cause | The first failing layer and why it failed | One primary cause, not a list of guesses |
| Contributing factors | What made the issue worse | Version drift, missing gates, weak alerts, bad defaults |
| Actions | Fixes, owners, deadlines, verification method | Three to five measurable follow-ups |
The best review documents also include a short evidence appendix so a future engineer can reproduce the problem without asking around in Slack.
Incident title:
Date:
Severity:
Owner:
Summary:
- What happened:
- Who was affected:
- How long it lasted:
Impact:
- User-visible impact:
- Cohorts affected:
- Business impact:
Timeline:
- Detection:
- Mitigation:
- Recovery:
- Follow-up:
Root cause:
- First failing layer:
- Why it failed:
- Why it was not caught earlier:
Contributing factors:
- Version drift:
- Retrieval or context issue:
- Tooling or serving issue:
- Monitoring gap:
Actions:
- Immediate containment:
- Prevent recurrence:
- Validation or regression gate:
- Owner:
- Due date:
- Success metric:
What evidence to attach for LLM and RAG incidents
The incident document should not depend on memory. Attach the minimum evidence needed to classify the failure.
- request ID, trace ID, and timestamp range
- user cohort, tenant, locale, intent, or workflow label
- prompt version, model version, and feature-flag state
- retrieved document IDs, chunk IDs, and reranker scores
- selected context window or final prompt snapshot
- tool-call names, arguments, retries, and outcomes
- latency by stage, not only end-to-end latency
- token counts by stage and total cost for the incident window
- validation results, refusal outcome, and fallback behavior
- before/after comparison for the same request class if available
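Per-stage latency and token counts are easier to attach if they are aggregated from trace spans rather than read off a dashboard. The following is a rough sketch that assumes each span is a dict with hypothetical `stage`, `latency_ms`, and optional `tokens` keys; real tracing systems will use their own span schema.

```python
def stage_breakdown(spans: list[dict]) -> dict[str, dict[str, float]]:
    """Sum latency and tokens per stage, so retries inside a stage are counted."""
    totals: dict[str, dict[str, float]] = {}
    for span in spans:
        entry = totals.setdefault(span["stage"], {"latency_ms": 0.0, "tokens": 0})
        entry["latency_ms"] += span["latency_ms"]
        entry["tokens"] += span.get("tokens", 0)
    return totals


def slowest_stage(spans: list[dict]) -> str:
    """Return the stage with the highest total latency."""
    totals = stage_breakdown(spans)
    return max(totals, key=lambda stage: totals[stage]["latency_ms"])


# Made-up spans for a single request, including one reranker retry.
spans = [
    {"stage": "retrieval", "latency_ms": 120.0},
    {"stage": "reranking", "latency_ms": 900.0},
    {"stage": "reranking", "latency_ms": 850.0},   # retry
    {"stage": "generation", "latency_ms": 1100.0, "tokens": 800},
]
print(slowest_stage(spans))
print(stage_breakdown(spans)["reranking"])
```

In this made-up request, generation has the single slowest span, but reranking dominates once the retry is counted. That is the kind of detail an end-to-end latency number hides.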
Good evidence beats good opinions
The fastest way to shorten the review is to make each bad answer reproducible from logs, traces, and versions. That turns debate into diagnosis.
A filled example: wrong answer, stale retrieval, and a slow reranker
Example incident: a support copilot answered with an outdated policy after a knowledge base update. The model was blamed first, but the actual failure chain was: the new documents were partially reindexed, the candidate set still included stale chunks, and reranking latency pushed the system into a fallback path for one cohort.
| Field | Filled example |
|---|---|
| Impact | 12% of policy questions in one tenant received outdated guidance for 47 minutes. |
| Detection | Support ticket and a rise in low-confidence answer flags, but no alert on the specific cohort. |
| Primary cause | Partial reindex left stale chunks eligible for retrieval after a content update. |
| Amplifier | Reranker latency triggered fallback behavior that reduced context quality for the affected path. |
| Fix | Block mixed-version chunks, add reindex completion checks, and gate fallback behavior on retrieval freshness. |
| Verification | Replay the impacted cohort against a frozen trace set and confirm answer correctness, latency, and cost hold steady. |
That is the level of specificity you want. The document should let a second engineer answer: what broke, what changed, what we verified, and what we will watch next.
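The verification step in the example can be sketched as a small replay harness. Everything here is an assumption for illustration: the trace fields, the two stand-in pipelines, and the policy text are made up, and the substring match is a deliberately crude correctness check that a real evaluation would replace with a proper grader.

```python
from typing import Callable


def replay(traces: list[dict], answer_fn: Callable[[str], str]) -> dict:
    """Re-run frozen incident traces through a pipeline and report failures."""
    failed_ids = [
        trace["request_id"]
        for trace in traces
        if trace["expected"].lower() not in answer_fn(trace["question"]).lower()
    ]
    return {
        "total": len(traces),
        "passed": len(traces) - len(failed_ids),
        "failed_ids": failed_ids,
    }


# Frozen trace set captured from the affected cohort (made-up content).
frozen_traces = [
    {"request_id": "req-1", "question": "What is the refund window?", "expected": "14 days"},
]


def pipeline_before_fix(question: str) -> str:
    # Stand-in for the pipeline that served stale chunks.
    return "Refunds are available within 30 days."


def pipeline_after_fix(question: str) -> str:
    # Stand-in for the pipeline with the freshness fix applied.
    return "Refunds are available within 14 days."


print(replay(frozen_traces, pipeline_before_fix))
print(replay(frozen_traces, pipeline_after_fix))
```

Running the same frozen set before and after the fix gives the review a concrete comparison to cite. The same harness could record latency and cost per trace to cover the other two checks in the verification row.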
How to write action items that actually close the loop
Action items should follow from the confirmed cause, not from every hypothesis, so there should be fewer action items than hypotheses. If you still have seven possible explanations, you probably have not finished the investigation.
- Keep each action item small enough to finish in one release cycle.
- Assign one owner and one due date.
- Attach a success metric or verification step.
- Prefer prevention over cleanup, but keep immediate containment explicit.
- Separate fixes that reduce recurrence from fixes that improve observability.
Examples of good action items:
- Add a reindex-complete check before new chunks can serve traffic.
- Segment alerts by tenant and workflow so this cohort cannot hide in a global average.
- Introduce a regression gate for the affected intent class in CI.
- Track fallback-path usage separately from the happy path.
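As an illustration of the first action item, a reindex-complete check and a mixed-version filter might look like the sketch below. The `Chunk` type, the `index_version` field, and both function names are hypothetical. How index versions are tracked depends on your vector store, so treat this as a shape for the control, not an implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    index_version: str   # hypothetical: the index build this chunk belongs to


def reindex_complete(expected_doc_ids: set[str], indexed_doc_ids: set[str]) -> bool:
    """True only when every expected document has been indexed in the new build."""
    return expected_doc_ids <= indexed_doc_ids


def servable_chunks(candidates: list[Chunk], active_version: str) -> list[Chunk]:
    """Drop chunks from any index build other than the active one."""
    return [c for c in candidates if c.index_version == active_version]


# Made-up state: the v2 build has indexed two of three documents.
expected = {"doc-1", "doc-2", "doc-3"}
indexed_v2 = {"doc-1", "doc-2"}

# Only promote the new build once the reindex is complete.
active_version = "v2" if reindex_complete(expected, indexed_v2) else "v1"

candidates = [Chunk("c-1", "v1"), Chunk("c-2", "v2"), Chunk("c-3", "v1")]
print(active_version, [c.chunk_id for c in servable_chunks(candidates, active_version)])
```

The design choice is that the active version only changes when the completeness check passes, and retrieval never mixes builds. That prevents stale and fresh chunks from competing in the same candidate set.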
Examples of weak action items:
- Improve monitoring.
- Look into retrieval.
- Review prompt quality.
- Make the system better.
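The difference between the two lists can be checked mechanically: a weak item usually has no owner, no date, and nothing to verify. Here is a minimal sketch with a hypothetical `ActionItem` type; the owner name, date, and metric are invented for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    success_metric: Optional[str] = None


def lint_action_item(item: ActionItem) -> list[str]:
    """Return the reasons an action item cannot close the loop yet."""
    problems = []
    if not item.owner:
        problems.append("missing owner")
    if item.due is None:
        problems.append("missing due date")
    if not item.success_metric:
        problems.append("missing success metric")
    return problems


weak = ActionItem("Improve monitoring.")
good = ActionItem(
    "Add a reindex-complete check before new chunks can serve traffic.",
    owner="retrieval-team",
    due=date(2026, 6, 1),
    success_metric="No mixed-version chunks served in the replayed incident traces",
)
print(lint_action_item(weak))
print(lint_action_item(good))
```

A lint like this does not judge whether the metric is a good one. It only guarantees that someone had to write one down.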
Common mistakes
- Blaming the model too early: many failures are retrieval, freshness, or serving problems.
- Leaving out versions: if prompt, model, and index versions are missing, root cause is hard to prove.
- Using only averages: cohort-specific failures disappear in global metrics.
- Writing too many action items: the document turns into a backlog instead of a recovery plan.
- Ignoring cost and latency: AI incidents often damage more than correctness, so record the latency and spend impact alongside the wrong answers.
- Skipping verification: if you do not replay the incident or compare before/after, recurrence risk stays high.
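The "using only averages" mistake is easy to demonstrate with arithmetic. The sketch below uses made-up traffic in which one tenant has a 12% error rate while the global rate stays near 1%. The event format, a `(cohort, is_error)` tuple, is an assumption for illustration.

```python
from collections import defaultdict


def global_error_rate(events: list[tuple[str, bool]]) -> float:
    """Error rate across all traffic, ignoring cohort."""
    return sum(is_error for _, is_error in events) / len(events)


def error_rates_by_cohort(events: list[tuple[str, bool]]) -> dict[str, float]:
    """Error rate computed separately for each cohort."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [errors, total]
    for cohort, is_error in events:
        counts[cohort][0] += int(is_error)
        counts[cohort][1] += 1
    return {cohort: errors / total for cohort, (errors, total) in counts.items()}


# Made-up traffic: tenant-a is healthy, tenant-b has 12 errors in 100 requests.
events = (
    [("tenant-a", False)] * 900
    + [("tenant-b", True)] * 12
    + [("tenant-b", False)] * 88
)
print(global_error_rate(events))
print(error_rates_by_cohort(events))
```

An alert thresholded on the global rate would likely stay silent here, while the same threshold applied per tenant would fire for `tenant-b`.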
A useful habit
Treat the postmortem as the bridge between incident response and regression prevention. The document should feed the next eval, the next alert, or the next release gate.
Next steps
If you already have the template but keep seeing the same class of incident, the issue is probably not documentation. It is missing control points: evaluation, tracing, freshness checks, release gates, or ownership.
Need the incident loop to stay closed?
We help teams turn AI incidents into repeatable fixes with baselines, traces, and regression gates. If the same failure keeps returning, the system needs operating discipline, not another one-off review.
FAQ
Questions readers usually ask next
When should we write an AI incident postmortem?
Write one whenever an incident creates user impact, a trust break, an SLO miss, a cost spike, a safety concern, or a repeated failure class. If the same issue can reappear, it deserves a postmortem.
Who should own the postmortem?
One person should own the document, but the input should be cross-functional. For LLM and RAG systems, that usually means involving the product owner, the platform or infrastructure owner, and whoever owns evaluation or retrieval quality.
What makes AI incidents different from normal incidents?
The root cause is often distributed across versions, prompts, retrieval, context construction, tool behavior, model choice, and cohort-specific behavior. You need evidence that identifies the failing layer before you can fix it safely.
How many action items is too many?
Usually, more than five action items is a sign that the document is mixing root cause analysis with a backlog dump. Keep the postmortem focused on the three to five changes that reduce recurrence the most.
Minimum standard
Include impact, timeline, root cause, contributing factors, evidence, and a short list of actions with owners and dates.
Best practice
Re-run the incident against a frozen trace set or replay path so the team can verify the fix, not just discuss it.
Turn incidents into regression gates
If the same issue keeps returning, the fastest win is a mix of baselines, observability, and release controls. See the Reliability Retainer.
Last updated
May 7, 2026





