Case Study · 3 min read

Cutting P95 Latency by 40–70% in a RAG Pipeline (No Quality Drop)

High P95 latency and frequent timeouts in a RAG pipeline were killing UX. We traced the problem to rerank/embedding bottlenecks, cold starts, and queueing. The fixes were retrieval and rerank caching, streaming, batching, connection pooling, timeout budgets, and fallback paths. P50/P95/P99 latency and throughput improved without quality loss.

The core idea

P95 latency in RAG is often retrieval/rerank-bound, so caching and batching can move the needle without sacrificing quality.

A production RAG pipeline had high P95 latency and frequent timeouts that killed user experience. Users abandoned sessions and support tickets spiked. We diagnosed rerank/embedding bottlenecks, cold starts, and queueing, then implemented caching, streaming, batching, and timeout budgets. P95 dropped 40–70% with no quality loss.

Anonymized but real

Names and identifying details are removed. The process and outcomes are preserved.

Executive summary

The client had a RAG pipeline in production. P95 latency was high; timeouts were frequent. User experience suffered. We diagnosed rerank/embedding bottlenecks, cold starts, concurrency/queueing, and retries. Fixes: retrieval + rerank caching, streaming, batching, connection pooling, timeout budgets, fallback paths. P50/P95/P99 improved; timeout rate dropped; throughput (req/s) increased; quality held steady.

This is the production pattern behind high P95 latency in LLM systems. Use Latency & Serving to map the full serving model before tuning isolated spans.

Baseline (before)

Before optimization:

  • P95 latency: High—users experienced slow responses
  • Timeout rate: Significant—requests often timed out
  • Throughput: Limited—system couldn't handle peak load
  • Error budget/SLO: Consumed—SLO violations frequent

Diagnosis

We traced the pipeline and found:

  • Rerank/embedding bottleneck: Reranker and embedding calls dominated latency
  • Cold starts: First requests after idle were very slow
  • Concurrency/queueing: Requests queued behind slow operations
  • Retries: Retries amplified load and latency under failure
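A trace-driven diagnosis like the one above can start with per-stage timing. The following is a minimal sketch, not the client's tracing setup: `StageTimer`, the stage names, and the `time.sleep` stand-ins are all hypothetical. It records wall-clock time per stage and reports which stage takes the largest share.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock durations per pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.samples = defaultdict(list)  # stage name -> durations in seconds

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised, so failures stay visible.
            self.samples[name].append(time.perf_counter() - start)

    def share_of_total(self):
        """Return (stage, fraction of total traced time), largest first."""
        totals = {name: sum(durs) for name, durs in self.samples.items()}
        grand_total = sum(totals.values()) or 1.0
        return sorted(
            ((name, total / grand_total) for name, total in totals.items()),
            key=lambda item: item[1],
            reverse=True,
        )


timer = StageTimer()
with timer.stage("embed"):
    time.sleep(0.01)    # stand-in for an embedding call
with timer.stage("rerank"):
    time.sleep(0.03)    # stand-in for a reranker call
with timer.stage("generate"):
    time.sleep(0.005)   # stand-in for the LLM call

for name, share in timer.share_of_total():
    print(f"{name}: {share:.0%} of traced time")
```

In a real service the same idea is usually delivered by a tracing library; the point of the sketch is only that the stage with the largest share is the first candidate for caching, batching, or a tighter timeout.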

The fix

We implemented the following latency and serving optimizations:

  • Caching: Retrieval cache + rerank cache for repeated/similar queries
  • Streaming: Stream LLM output to reduce time to first token (TTFT)
  • Batching: Batch embedding and rerank calls where possible
  • Connection pooling: Reuse connections to embedding/LLM services
  • Timeout budgets: Per-stage timeouts to fail fast and avoid cascading delays
  • Fallback paths: Skip rerank or use cached results when under pressure
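The timeout-budget and fallback items above can be sketched together. This is an illustration under stated assumptions, not the client's code: `retrieve`, `rerank`, `answer_context`, and the budget values are hypothetical, and `asyncio.sleep` stands in for real service calls. The rerank stage gets the smaller of its own budget and whatever remains of the request budget; on timeout the request falls back to retrieval order instead of waiting.

```python
import asyncio
import time


async def retrieve(query):
    await asyncio.sleep(0.01)            # stand-in for a vector search
    return ["doc-a", "doc-b", "doc-c"]


async def rerank(query, docs, delay):
    await asyncio.sleep(delay)           # stand-in for a reranker call
    return list(reversed(docs))          # pretend the reranker reorders results


async def answer_context(query, total_budget_s=0.2, rerank_budget_s=0.05,
                         rerank_delay=0.01):
    """Return (docs, path), where path is 'reranked' or 'fallback'."""
    deadline = time.monotonic() + total_budget_s
    docs = await asyncio.wait_for(retrieve(query), timeout=total_budget_s)

    # Rerank gets the smaller of its own budget and the time left overall.
    stage_timeout = min(rerank_budget_s, deadline - time.monotonic())
    if stage_timeout <= 0:
        return docs, "fallback"
    try:
        ranked = await asyncio.wait_for(
            rerank(query, docs, rerank_delay), timeout=stage_timeout
        )
        return ranked, "reranked"
    except asyncio.TimeoutError:
        # Fallback path: fail fast and serve retrieval order.
        return docs, "fallback"


fast = asyncio.run(answer_context("q", rerank_delay=0.01))
slow = asyncio.run(answer_context("q", rerank_delay=0.5))
print(fast)
print(slow)
```

Serving retrieval order when rerank times out trades some ranking quality for responsiveness, so it is worth counting how often the fallback path fires and checking quality on those requests separately.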

Metrics

Before/After (validated)

| Metric | Before | After / change |
| P50 latency | High | Lower |
| P95 latency | High | 40–70% lower |
| P99 latency | Very high | Improved |
| Timeout rate | Significant | Dropped |
| Throughput (req/s) | Limited | Increased |
| Quality | Baseline | Held steady |
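For readers who want to reproduce this kind of before/after comparison on their own traces, here is a minimal sketch of the arithmetic. It assumes the nearest-rank percentile definition (one of several in common use); the function names and the sample latencies are synthetic illustrations, not the case study's measurements.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, timeout_ms):
    """Latency percentiles plus the share of requests at or over the timeout."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
        "timeout_rate": sum(1 for x in latencies_ms if x >= timeout_ms) / len(latencies_ms),
    }


def pct_change(before, after):
    """Relative change in percent; negative means the metric went down."""
    return (after - before) / before * 100


# Synthetic samples for illustration only.
before_ms = list(range(100, 2100, 100))   # 100, 200, ..., 2000
after_ms = [x / 2 for x in before_ms]     # pretend every request got twice as fast

before = summarize(before_ms, timeout_ms=1500)
after = summarize(after_ms, timeout_ms=1500)
print(before, after, pct_change(before["p95"], after["p95"]))
```

Comparing full distributions collected under similar load, rather than single averages, is what makes a tail-latency claim checkable.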

Why this worked

Caching reduced redundant work. Streaming improved perceived latency. Batching and connection pooling reduced per-call overhead. Timeout budgets prevented cascading delays. Fallback paths kept the system responsive under load. Quality held because we did not cut retrieval or rerank quality; we optimized the serving path around them.
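To make "caching reduced redundant work" and "batching reduced overhead" concrete, here is a minimal sketch that combines the two for embedding calls. Everything in it is an assumption for illustration: `TTLCache`, `normalize`, `embed_with_cache`, and `fake_embed_batch` are hypothetical names, the cache is an in-process dictionary, and "similar" queries are matched only by case and whitespace normalization.

```python
import time


class TTLCache:
    """Tiny in-process cache with per-entry expiry (hypothetical helper)."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._entries[key]   # expired: drop it and report a miss
            return None
        return value

    def put(self, key, value):
        self._entries[key] = (self.clock() + self.ttl_s, value)


def normalize(query):
    """Lowercase and collapse whitespace so trivially different queries share a key."""
    return " ".join(query.lower().split())


def embed_with_cache(queries, cache, embed_batch):
    """Serve cached embeddings and send all misses in one batched call."""
    keys = [normalize(q) for q in queries]
    results, misses = {}, []
    for key in dict.fromkeys(keys):          # unique keys, original order
        hit = cache.get(key)
        if hit is None:
            misses.append(key)
        else:
            results[key] = hit
    if misses:
        for key, vector in zip(misses, embed_batch(misses)):
            cache.put(key, vector)
            results[key] = vector
    return [results[key] for key in keys]


calls = []

def fake_embed_batch(texts):
    """Stand-in for a real embedding service; records each batch it receives."""
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]


cache = TTLCache(ttl_s=300)
first = embed_with_cache(["What is RAG?", "what is  rag?", "pricing"], cache, fake_embed_batch)
second = embed_with_cache(["What is RAG?"], cache, fake_embed_batch)
print(len(calls), "batched call(s) for 4 queries")
```

The TTL is the knob that protects quality: a short TTL limits how long a stale result can be served after the underlying index changes, at the cost of a lower hit rate.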


Next steps

If your RAG or LLM pipeline has high P95 or timeouts, an AI system audit can baseline the serving path, and LLM serving optimization can implement the fix plan. If you need to align the symptom first, start with the latency pain page.

Latency hurting UX?

We diagnose RAG/LLM pipeline bottlenecks and implement caching, streaming, batching, and timeout budgets—with before/after validation. High P95 latency covers the buyer-visible symptom.

Latency Budget Worksheet — A practical worksheet to allocate latency budgets across retrieval, embedding, rerank, and LLM stages. Request it.
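As a rough idea of what allocating a latency budget across stages can look like, here is a small sketch. It is not the worksheet itself: `allocate_budget`, the weights, the 2000 ms target, and the 200 ms reserve are all hypothetical numbers chosen for illustration.

```python
def allocate_budget(total_ms, weights, reserve_ms=0):
    """Split an end-to-end latency target across stages in proportion to weights.

    reserve_ms is held back for network overhead and queueing.
    """
    usable = total_ms - reserve_ms
    if usable <= 0:
        raise ValueError("reserve leaves no budget for the stages")
    weight_sum = sum(weights.values())
    return {stage: usable * weight / weight_sum for stage, weight in weights.items()}


# Hypothetical target and weights; replace with values from your own traces.
budget = allocate_budget(
    total_ms=2000,
    weights={"retrieval": 1, "embedding": 1, "rerank": 2, "llm": 5},
    reserve_ms=200,
)
for stage, ms in budget.items():
    print(f"{stage}: {ms:.0f} ms")
```

Per-stage percentiles do not add up exactly to the end-to-end percentile, so an allocation like this is best treated as a planning guide and a source of per-stage timeout values rather than as a guarantee.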

What made this hard

Optimizing latency without degrading quality—caching and fallbacks had to be carefully designed.

What made this work

Trace-driven diagnosis, per-stage timeout budgets, and before/after latency distributions.

Need LLM latency optimization?

If your RAG/LLM pipeline has high P95 latency or timeouts, start by aligning the team on the latency symptom. Our AI audit then diagnoses the serving path and leads into fixes with before/after validation.

Last updated

February 1, 2026