The core idea
AI observability is the discipline of making LLM behavior diagnosable through three things: trend-level metrics, request-level traces, and failure classes tied to real fixes.
Traditional app monitoring tells you whether the system is up. AI observability needs to tell you whether the system is right, grounded, affordable, and safe. Those are not the same question.
A production LLM system can have healthy infrastructure metrics and still be failing users through wrong answers, weak citations, tool loops, silent cost growth, or regressions that only appear in one cohort. That is why AI observability has to go beyond uptime and latency.
The practical model is simple: use metrics to see trends, traces to inspect one request end-to-end, and a failure taxonomy to turn messy incidents into repeatable categories you can fix systematically.
Why normal app monitoring is not enough
Standard monitoring is built for deterministic systems. LLM systems are probabilistic pipelines. The request may succeed technically and still fail the user.
Examples:
- HTTP 200 but answer is unsupported
- latency looks normal on average, but one tool span drives P95
- retrieval succeeded, but final context mixed contradictory chunks
- token cost rose because retries and tool loops increased, not because traffic grew
- safety and policy regressions affect only a narrow cohort, so aggregate dashboards stay green
If your observability stops at infrastructure health, you are blind to the failure modes users care about most.
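To make the tail-latency example concrete, here is a minimal sketch with made-up numbers. The latencies and the share of slow requests are illustrative assumptions, not measurements from a real system. It shows how an average can look tolerable while P95 is set entirely by requests that hit one slow tool span.

```python
import statistics

# Hypothetical end-to-end latencies in milliseconds.
# 94 requests are fast; 6 wait on one slow tool span.
latencies_ms = [400] * 94 + [3000] * 6

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
# quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
p95_ms = statistics.quantiles(latencies_ms, n=100, method="inclusive")[94]

print(f"mean={mean_ms:.0f} ms  median={median_ms:.0f} ms  p95={p95_ms:.0f} ms")
```

With these inputs the mean and median stay in the hundreds of milliseconds, while P95 lands on the slow tool path. A dashboard that plots only the average would not show which span is responsible.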
The three pillars of AI observability
| Pillar | What it answers | Why it matters |
|---|---|---|
| Metrics | Are trends moving in the wrong direction? | They show quality, cost, latency, and safety drift over time. |
| Traces | Where did this specific request go wrong? | They connect retrieval, prompt build, generation, tools, and validation in one view. |
| Failure taxonomy | What kind of failure keeps happening? | It turns incidents into repeatable classes with owners, metrics, and fixes. |
Most teams have one of these. Production-grade teams have all three connected to the same request identity.
1) Metrics: what to count, compare, and alert on
Good AI metrics cover both user impact and system behavior. Start with a small scorecard that spans quality, operations, and risk.
| Category | Examples | Bad sign |
|---|---|---|
| Outcome quality | task success, grounded answer rate, citation validity, escalation rate | HTTP success stays high while user trust drops |
| Retrieval and context | selection recall, context precision, contradiction rate, freshness coverage | right docs appear in candidates but final answers stay unsupported |
| Operations | P95 latency, time to first token, retry rate, timeout rate, cache hit rate | tail latency spikes without obvious infra saturation |
| Cost | cost per successful task, token split by stage, tool-call volume | spend rises faster than usage and no one can explain why |
| Safety and policy | violation rate, refusal correctness, PII redaction coverage | aggregate quality improves while risky cohorts quietly regress |
Metrics should always be segmented by cohort. If you only look at one global score, you will miss the fact that a narrow but valuable group got worse.
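As a rough illustration of cohort segmentation, the sketch below computes two scorecard metrics per cohort: grounded answer rate and cost per successful task. The `RequestOutcome` record, its field names, and the sample cohorts are assumptions made for this example; a real pipeline would fill them from its own evaluation and billing data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RequestOutcome:
    cohort: str      # e.g. intent, tenant, locale, or workflow type
    grounded: bool   # did the answer pass the groundedness check?
    success: bool    # did the task succeed for the user?
    cost_usd: float  # total spend for the request, including retries and tools


def scorecard_by_cohort(outcomes):
    """Return grounded answer rate and cost per successful task per cohort."""
    groups = defaultdict(list)
    for outcome in outcomes:
        groups[outcome.cohort].append(outcome)

    report = {}
    for cohort, rows in groups.items():
        successes = sum(r.success for r in rows)
        total_cost = sum(r.cost_usd for r in rows)
        report[cohort] = {
            "requests": len(rows),
            "grounded_rate": sum(r.grounded for r in rows) / len(rows),
            # Divide by successes, not requests, so failed work still counts as spend.
            "cost_per_success": total_cost / successes if successes else None,
        }
    return report


# Hypothetical sample: a large healthy cohort and a small struggling one.
outcomes = [RequestOutcome("faq", True, True, 0.5)] * 3 + [
    RequestOutcome("contracts", False, False, 1.0),
    RequestOutcome("contracts", True, True, 1.0),
]
report = scorecard_by_cohort(outcomes)
```

In this sample the global grounded rate is 80%, which hides the fact that the `contracts` cohort sits at 50% and pays four times as much per successful task.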
2) Traces: how to see one request end-to-end
Metrics tell you that something drifted. Traces tell you how one bad request actually unfolded.
A useful trace for an LLM system usually includes spans like:
- request received
- query rewrite or intent classification
- retrieval candidate generation
- reranking or final context selection
- prompt assembly
- LLM generation
- tool calls and tool retries
- post-processing, validation, or citation binding
- delivery and final user outcome
The key is not just timing. Each span should also carry the minimum attributes needed for diagnosis: model version, prompt version, candidate IDs, selected chunk IDs, token counts, tool name, retry count, validation result, and final failure label if the request was bad.
Trace design rule
Every bad answer should be reproducible as a trace: input, retrieved evidence, prompt version, model output, tool steps, validation outcome, and user-visible result.
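One way to picture such a trace is the sketch below. It uses only the Python standard library, and the `Trace` and `Span` classes, the stage names, and every attribute value are hypothetical. A tracing library such as OpenTelemetry would provide its own equivalents; the point here is only the shape of the data, with one request ID shared by all spans.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    request_id: str
    attributes: dict = field(default_factory=dict)
    duration_ms: float = 0.0


@dataclass
class Trace:
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, name, **attributes):
        """Time one pipeline stage and record its diagnostic attributes."""
        current = Span(name=name, request_id=self.request_id, attributes=dict(attributes))
        start = time.perf_counter()
        try:
            yield current
        finally:
            current.duration_ms = (time.perf_counter() - start) * 1000
            self.spans.append(current)


# Illustrative request: IDs and versions are placeholders, and the real
# retrieval, generation, and validation calls would run inside each block.
trace = Trace()
with trace.span("retrieval", candidate_ids=["doc-1", "doc-7", "doc-9"]) as s:
    s.attributes["selected_chunk_ids"] = ["doc-7#2"]
with trace.span("generation", model_version="model-v1", prompt_version="p-12") as s:
    s.attributes["output_tokens"] = 182
with trace.span("validation") as s:
    s.attributes["validation_result"] = "unsupported_claim"
    s.attributes["failure_label"] = "unsupported_generation"
```

Note that the spans carry IDs and versions rather than raw text, which keeps the trace useful for reproduction without copying sensitive content into it.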
3) Failure taxonomy: how to classify what broke
A failure taxonomy gives the team a shared language for incidents. Without it, everything becomes "the model was weird."
A practical taxonomy often includes:
- retrieval miss: right source never surfaced
- ranking or selection failure: right source surfaced but did not reach final context
- context construction failure: noise, contradiction, or broken chunk boundaries degraded grounding
- unsupported generation: answer went beyond evidence
- tool failure: wrong tool, bad args, retry loop, or tool-side error
- serving bottleneck: queueing, rate limits, concurrency, cache miss, or slow downstream stage
- safety or policy violation: incorrect refusal, missed redaction, or guardrail failure
- coverage gap: the knowledge base or toolset did not contain the required information at all
The point is not taxonomy purity. The point is operational leverage. Once failures are classed consistently, you can count them, alert on them, assign owners, and make them part of release review.
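The taxonomy above can be encoded as a fixed set of labels so that failures are counted rather than described. The sketch below is one possible encoding; the enum values and the owner names in `OWNERS` are assumptions for illustration and should be replaced by whatever your teams actually use.

```python
from collections import Counter
from enum import Enum


class FailureClass(Enum):
    RETRIEVAL_MISS = "retrieval_miss"
    RANKING_OR_SELECTION = "ranking_or_selection_failure"
    CONTEXT_CONSTRUCTION = "context_construction_failure"
    UNSUPPORTED_GENERATION = "unsupported_generation"
    TOOL_FAILURE = "tool_failure"
    SERVING_BOTTLENECK = "serving_bottleneck"
    SAFETY_OR_POLICY = "safety_or_policy_violation"
    COVERAGE_GAP = "coverage_gap"


# Hypothetical owner per class, so every label routes to a team.
OWNERS = {
    FailureClass.RETRIEVAL_MISS: "search",
    FailureClass.RANKING_OR_SELECTION: "search",
    FailureClass.CONTEXT_CONSTRUCTION: "prompting",
    FailureClass.UNSUPPORTED_GENERATION: "prompting",
    FailureClass.TOOL_FAILURE: "tools",
    FailureClass.SERVING_BOTTLENECK: "platform",
    FailureClass.SAFETY_OR_POLICY: "trust-and-safety",
    FailureClass.COVERAGE_GAP: "content",
}


def summarize(labels):
    """Return (class, count, owner) tuples, most frequent class first."""
    counts = Counter(labels)
    return [(cls.value, n, OWNERS[cls]) for cls, n in counts.most_common()]


# Labels as they might come out of a week of human review (made-up data).
reviewed = (
    [FailureClass.UNSUPPORTED_GENERATION] * 3
    + [FailureClass.TOOL_FAILURE] * 2
    + [FailureClass.RETRIEVAL_MISS]
)
summary = summarize(reviewed)
```

Using an enum instead of free-text labels means a typo fails loudly at labeling time instead of silently creating a new category.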
What a minimum viable observability stack looks like
You do not need a massive platform before observability becomes useful. A practical minimum stack usually looks like this:
- request ID carried through every stage
- stage-level spans for retrieval, generation, tools, and validation
- structured logs tied to the same request ID
- a small scorecard for quality, cost, latency, and safety
- a failure taxonomy with human-review examples
- cohort labels such as intent, tenant, locale, or workflow type
If you need the exact field list and privacy-safe schema, use Audit Readiness as the implementation companion to this article.
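For the first and third items in that list, a minimal sketch using Python's standard `contextvars`, `logging`, and `json` modules might look like this. The logger name, the `stage` and `fields` keys, and the sample values are assumptions for illustration, not a prescribed schema. The example logs IDs and lengths rather than raw user text.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the current request ID so every log line can pick it up implicitly.
request_id_var = contextvars.ContextVar("request_id", default="unset")


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, always tagged with the request ID."""

    def format(self, record):
        payload = {
            "request_id": request_id_var.get(),
            "stage": getattr(record, "stage", None),
            "message": record.getMessage(),
        }
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


stream = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("llm_pipeline")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_request(question):
    """Placeholder pipeline: only the logging pattern is the point here."""
    request_id_var.set(uuid.uuid4().hex)
    logger.info("retrieval done", extra={
        "stage": "retrieval",
        "fields": {"question_chars": len(question), "selected_chunk_ids": ["doc-7#2"]},
    })
    logger.info("generation done", extra={
        "stage": "generation",
        "fields": {"prompt_version": "p-12"},
    })
    return request_id_var.get()


rid = handle_request("How do I rotate an API key?")
print(stream.getvalue())
```

Because the ID lives in a context variable, stage code does not need to pass it around explicitly, and logs from different stages can be joined on `request_id` later.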
What to alert on in production
The goal of alerting is not to wake someone up for every odd response. It is to catch meaningful drift early.
Good alert candidates:
- grounded answer rate drops below an agreed threshold
- cost per successful task spikes for a cohort
- P95 latency rises because one span regressed sharply
- tool failure or retry-loop rate climbs
- unsupported claim rate increases after a prompt or model change
- safety violation rate or refusal correctness regresses in monitored high-risk slices
Alerting works best when it is tied to the failure taxonomy. Alerting on "something feels weird" is operationally useless.
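A threshold alert of this kind can be sketched as a per-cohort comparison against a baseline window. The function name, the 5-point drop threshold, and the minimum volume of 50 requests are all assumptions chosen for the example; sensible values depend on your traffic and how noisy the metric is.

```python
def grounded_rate_alerts(baseline, current, max_drop=0.05, min_requests=50):
    """Flag cohorts whose grounded answer rate fell by more than max_drop.

    baseline and current map cohort name -> (grounded_count, request_count).
    """
    alerts = []
    for cohort, (cur_ok, cur_n) in current.items():
        base_ok, base_n = baseline.get(cohort, (0, 0))
        # Skip new or low-volume cohorts, where a rate would be mostly noise.
        if base_n == 0 or cur_n < min_requests:
            continue
        drop = base_ok / base_n - cur_ok / cur_n
        if drop > max_drop:
            alerts.append({
                "cohort": cohort,
                "metric": "grounded_answer_rate",
                "drop": round(drop, 3),
            })
    return alerts


# Made-up counts for two cohorts across a baseline and a current window.
baseline = {"faq": (900, 1000), "contracts": (180, 200)}
current = {"faq": (890, 1000), "contracts": (150, 200)}
alerts = grounded_rate_alerts(baseline, current)
```

In this sample the aggregate grounded rate falls by only about 3 points, which would not cross the threshold, while the `contracts` cohort falls by 15 points and is flagged on its own.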
Common observability mistakes in LLM systems
- Only tracking provider latency: hides retrieval, reranking, tool, and validation bottlenecks.
- Only tracking aggregate quality: hides cohort regressions and failure classes.
- No traceable request identity: turns incident review into storytelling instead of diagnosis.
- Logging too much raw sensitive data: creates compliance risk and blocks adoption.
- No stable taxonomy: every bad answer becomes a one-off anecdote.
- No link between offline evals and production metrics: release gates and live behavior drift apart.
Good observability is not maximal logging. It is decision-useful evidence with clear boundaries.
How observability changes incident response
With weak observability, incident response sounds like this: "Maybe retrieval broke. Maybe the model got worse. Maybe the tool timed out." Teams argue, patch blindly, and often fix the wrong layer.
With observability in place, incident response becomes much more mechanical:
- identify the affected cohorts and KPI movement
- pull example traces for the failure class
- locate the dominant failing stage or span
- classify the failure into the taxonomy
- ship the smallest fix that addresses that class
- validate before and after on the same cohort
That is the real value of AI observability. It reduces guesswork, shortens root-cause analysis, and makes improvements provable.
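The step of locating the dominant failing stage can be sketched as a comparison of per-stage P95 latency between a baseline window and the incident window. The stage names and durations below are invented for illustration, and the nearest-rank percentile is one of several reasonable definitions.

```python
def p95(values):
    """Nearest-rank 95th percentile, using integer math to avoid float rounding."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def stage_p95_deltas(before, after):
    """Return the change in P95 duration per stage.

    before and after are lists of traces; each trace maps stage name -> duration in ms.
    """
    return {
        stage: p95([t[stage] for t in after]) - p95([t[stage] for t in before])
        for stage in before[0]
    }


# Hypothetical traces for one cohort: after a release, some requests hit a tool retry loop.
normal = {"retrieval": 120, "generation": 900, "tool": 250, "validation": 40}
slow_tool = {**normal, "tool": 4000}
before = [normal] * 20
after = [normal] * 18 + [slow_tool] * 2

deltas = stage_p95_deltas(before, after)
worst_stage = max(deltas, key=deltas.get)
print(worst_stage, deltas[worst_stage])
```

Running the same comparison on the same cohort after the fix gives the before-and-after validation from the last step of the list.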
Need observability before another incident?
We help teams define the scorecard, add request-level traces, and build a failure taxonomy that makes production AI diagnosable. If the system already feels unpredictable, observability is usually the first infrastructure upgrade worth paying for.
FAQ
Questions readers usually ask next
What should we measure first for AI observability?
Start with a small but complete set of signals: task success or groundedness, cost per successful task, P95 latency, fallback or escalation rate, retrieved context IDs, prompt and model version, tool-call outcomes, and request-level trace IDs. That gives you a baseline for quality, operations, and diagnosis.
Why are traces important if we already have dashboards?
Dashboards show aggregate symptoms. Traces show how one bad request moved through retrieval, prompt construction, model generation, tools, and validation. Without traces, teams argue about bottlenecks and failure layers instead of proving them.
Do we need full OpenTelemetry before we can start?
No. You need consistent request IDs, a few stage-level spans, and structured logs tied to the same request. Full OpenTelemetry (OTel) adoption helps, but the real goal is pipeline-level evidence, not complete observability tooling.
What is a failure taxonomy for LLM systems?
A failure taxonomy is a stable set of categories that classify what went wrong: retrieval miss, ranking failure, context construction issue, unsupported generation, tool failure, serving bottleneck, safety violation, and so on. It turns vague complaints into measurable recurring patterns.
Most useful first move
Tie every request to a stable ID, a few stage-level spans, and a failure label. That is usually enough to move from guessing to diagnosis.
Most common blind spot
Dashboards that show latency and cost but do not connect bad answers to retrieved evidence, tool steps, or guardrail outcomes.
Last updated
March 9, 2026





