The core idea
P95 latency in RAG pipelines is often bound by retrieval and reranking. Caching and batching can cut it without sacrificing answer quality.
A production RAG pipeline had high P95 latency and frequent timeouts that degraded the user experience. Users abandoned sessions and support tickets spiked. We traced the problem to rerank/embedding bottlenecks, cold starts, and queueing, then implemented caching, streaming, batching, and timeout budgets. P95 dropped 40–70% with no quality loss.
Anonymized but real
Names and identifying details are removed. The process and outcomes are preserved.
Executive summary
The client had a RAG pipeline in production. P95 latency was high; timeouts were frequent. User experience suffered. We diagnosed rerank/embedding bottlenecks, cold starts, concurrency/queueing, and retries. Fixes: retrieval + rerank caching, streaming, batching, connection pooling, timeout budgets, fallback paths. P50/P95/P99 improved; timeout rate dropped; throughput (req/s) increased; quality held steady.
This is the production pattern behind high P95 latency in LLM systems. Use Latency & Serving to map the full serving model before tuning isolated spans.
Baseline (before)
Before optimization:
- P95 latency: High—users experienced slow responses
- Timeout rate: Significant—requests often timed out
- Throughput: Limited—system couldn't handle peak load
- Error budget/SLO: Consumed—SLO violations frequent
Diagnosis
We traced the pipeline and found:
- Rerank/embedding bottleneck: Reranker and embedding calls dominated latency
- Cold starts: First requests after idle were very slow
- Concurrency/queueing: Requests queued behind slow operations
- Retries: Retries amplified load and latency under failure
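As an illustration of the trace-driven step, the sketch below groups span durations by stage and reports a per-stage percentile, which is one simple way to see which stage dominates the tail. The `(stage, duration_ms)` span format and the function names are assumptions made for this example, not the client's tracing schema.

```python
from collections import defaultdict
from statistics import quantiles


def stage_percentiles(spans, pct=95):
    """Return the pct-th percentile of duration per stage.

    spans: iterable of (stage_name, duration_ms) pairs (hypothetical format).
    """
    by_stage = defaultdict(list)
    for stage, duration_ms in spans:
        by_stage[stage].append(duration_ms)

    result = {}
    for stage, durations in by_stage.items():
        if len(durations) < 2:
            # quantiles() needs at least two data points
            result[stage] = durations[0]
        else:
            # n=100 yields 99 cut points; index pct-1 is the pct-th percentile
            cuts = quantiles(durations, n=100, method="inclusive")
            result[stage] = cuts[pct - 1]
    return result


def slowest_stage(spans, pct=95):
    """Name of the stage with the highest pct-th percentile duration."""
    per_stage = stage_percentiles(spans, pct)
    return max(per_stage, key=per_stage.get)
```

Comparing per-stage P95 rather than per-stage averages matters here because a stage can look cheap on average and still own most of the tail.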
The fix
We applied the following latency and serving optimizations:
- Caching: Retrieval cache + rerank cache for repeated/similar queries
- Streaming: Stream LLM output to reduce time to first token (TTFT) as seen by the user
- Batching: Batch embedding and rerank calls where possible
- Connection pooling: Reuse connections to embedding/LLM services
- Timeout budgets: Per-stage timeouts to fail fast and avoid cascading delays
- Fallback paths: Skip rerank or use cached results when under pressure
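The timeout-budget and fallback items can be sketched together. This is a minimal example under assumptions, not the client's implementation: `retrieve` and `rerank` are hypothetical async callables, and the 2-second total budget and 30% rerank share are illustrative values only.

```python
import asyncio
import time


async def with_budget(coro, budget_s, fallback):
    """Await coro for at most budget_s seconds; return fallback on timeout."""
    try:
        return await asyncio.wait_for(coro, timeout=budget_s)
    except asyncio.TimeoutError:
        return fallback


async def answer(query, retrieve, rerank, total_budget_s=2.0, rerank_share=0.3):
    """Retrieve, then rerank within a per-stage budget (illustrative values)."""
    deadline = time.monotonic() + total_budget_s

    # Retrieval has no fallback in this sketch: a timeout propagates to the caller.
    docs = await asyncio.wait_for(retrieve(query), timeout=total_budget_s)

    # Rerank gets at most its share of the total, and never more than what remains.
    remaining = max(0.0, deadline - time.monotonic())
    rerank_budget = min(remaining, total_budget_s * rerank_share)

    # Fallback path: keep the retrieval order if rerank misses its budget.
    return await with_budget(rerank(query, docs), rerank_budget, fallback=docs)
```

The design choice is that a slow reranker costs at most its own budget: the request still returns documents in retrieval order instead of timing out as a whole.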
Metrics
Before/After (validated)
| Metric | Before | After | Change |
|---|---|---|---|
| P50 latency | High | Lower | ↓ |
| P95 latency | High | 40–70% lower | ↓ |
| P99 latency | Very high | Improved | ↓ |
| Timeout rate | Significant | Dropped | ↓ |
| Throughput (req/s) | Limited | Increased | ↑ |
| Quality | Baseline | Held steady | — |
Why this worked
Caching reduced redundant work. Streaming improved perceived latency. Batching and connection pooling reduced per-call overhead. Timeout budgets prevented cascading delays. Fallback paths kept the system responsive under load. Quality held steady because the normal path still runs full retrieval and reranking: the changes removed redundant and wasted work around those steps, and the fallbacks apply only under pressure.
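To make the caching point concrete, here is a minimal rerank-cache sketch. It is an assumption-laden illustration: the class and function names are hypothetical, the key covers only repeated queries after whitespace and case normalization (not semantically similar ones), and it assumes the reranker is deterministic for a given query and document set.

```python
import hashlib
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_s seconds."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # drop stale entries on read
            return None
        return value

    def put(self, key, value):
        self._store[key] = (self.clock() + self.ttl_s, value)


def rerank_key(query, doc_ids):
    """Key on the normalized query plus the candidate set, ignoring candidate order."""
    normalized = " ".join(query.lower().split())
    payload = normalized + "|" + ",".join(sorted(doc_ids))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_rerank(cache, rerank_fn, query, doc_ids):
    """Return a cached ranking if present; otherwise call rerank_fn and store it."""
    key = rerank_key(query, doc_ids)
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = rerank_fn(query, doc_ids)
    cache.put(key, result)
    return result
```

Including the candidate document IDs in the key is what keeps a cache hit equivalent to a fresh rerank: if retrieval returns a different candidate set, the cache misses and the reranker runs again.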
Next steps
If your RAG or LLM pipeline has high P95 latency or frequent timeouts, an AI system audit can baseline the serving path, and LLM serving optimization can implement the fix plan. If you need to confirm the symptom first, start with the latency pain page.
Latency hurting UX?
We diagnose RAG/LLM pipeline bottlenecks and implement caching, streaming, batching, and timeout budgets—with before/after validation. High P95 latency covers the buyer-visible symptom.
Lead magnet
Latency Budget Worksheet — A practical worksheet to allocate latency budgets across retrieval, embedding, rerank, and LLM stages. Request it.
What made this hard
Cutting latency without degrading quality. Caching and fallback paths had to be designed carefully so that faster responses did not come at the cost of worse answers.
What made this work
Trace-driven diagnosis, per-stage timeout budgets, and before/after latency distributions.
Need LLM latency optimization?
If your RAG/LLM pipeline has high P95 or timeouts, our AI audit diagnoses the serving path and leads into fixes with before/after validation once the team is aligned on the latency symptom.
Last updated
February 1, 2026
