The core idea
Caching is a correctness problem before it’s a performance trick. Cache stable artifacts (embeddings, candidates, rerank scores) first. If you cache responses, do it only for low-risk public FAQs—and always validate permissions and freshness before serving.
A correctness-first caching playbook for LLM/RAG: reduce cost without silently shipping wrong answers. Safe keys, invalidation beyond TTL, and validation gates you can ship.
Caching in production is a cost lever—but the classic failure mode is saving tokens while serving stale, wrong, or permission-violating answers. Caches drift when docs update, prompts change, or tenant boundaries are loose. This playbook shows how to design caching that reduces spend without breaking correctness: concrete layers, safe keys, invalidation strategies, and validation gates.
Context
Part of Cost Optimization: Cost Optimization Playbooks. Related: Reduce OpenAI Bill Without Hurting Quality, Case Study: Inference Cost Reduction, Token Spend Decomposition, Hybrid Search + Reranking, AI Optimization Services.
1) Where cost actually comes from (and where caching helps)
A typical RAG or agent flow: normalize request → retrieval → reranking → context construction → generation (biggest token cost) → tools → post-processing.
Caching can reduce cost at four places:
- Prompt tokens (long system prompts, tool schemas, repeated context)
- Retrieval/rerank compute (vector search + rerank aren't free)
- Tool calls (often repeated for the same query)
- Full responses (FAQ-ish repeats)
But each cache has different correctness risks. The core principle is: cache stable artifacts that are easy to validate before caching final answers.
2) Savings ballpark: what to expect by layer
Reference ranges for planning. Actual savings depend on query repeat rate, doc churn, and hit rates.
| Layer | Typical savings (when hit) | Hit-rate reality | Correctness risk |
|---|---|---|---|
| Embedding (doc + query) | 70–90% of embedding cost | High (queries repeat, docs stable) | Low |
| Retrieval (query → chunk IDs) | 50–80% of retrieval compute | Medium–high (FAQ-ish queries) | Medium (stale docs) |
| Rerank cache | 60–90% of rerank cost | Medium | Low |
| Prompt prefix | 10–50% of prompt tokens | High (same system/tools) | Low (if versioned) |
| Response cache | 90%+ of full call cost | Low–medium (exact repeats only) | High |
Most systems get 60–80% of cache ROI from embeddings + retrieval + rerank—before touching response cache.
3) The 3 cache layers (and what each one should store)
Layer A — Prompt cache (token cost)
Goal: avoid paying repeatedly for identical prompt prefix tokens or identical assembled prompts.
Cache: canonicalized system prompt + tool schemas + stable instructions (the "prefix").
Risk: prompt-version drift, hidden personalization differences, tool schema changes.
Layer B — Retrieval cache (recall/cost stability)
Goal: avoid rerunning vector/BM25 search + reranking when the same meaningfully identical query repeats.
Cache: normalized query + filters → candidate chunk IDs (+ ranks/scores), rerank scores, embeddings.
Risk: stale docs, permission drift, query normalization collisions.
Layer C — Response cache (max token savings, max risk)
Goal: reuse the final generated answer (text + citations + structured output).
Best for: public FAQs, low-risk non-personalized answers, template-style responses.
Risk: wrong user/tenant permissions, staleness, context mismatch, hallucination amplification.
Rule of thumb
Response caching should be the last caching layer you add. Most systems get 60–80% of cache ROI from embeddings + retrieval + rerank.
4) Cache keys: correctness starts with key design
A cache is only as safe as its key. The key must represent everything that could change the correct output. Minimum "safe key" dimensions for most systems:
- tenant_id (always)
- user_scope (role/permissions tier)
- locale/region (policies differ)
- model_id + version + decoding params
- prompt_version (hash(system + templates + tool schemas))
- retrieval_version (index snapshot + chunking version)
- policy_version (safety/compliance ruleset)
- tooling_version (tool schemas/allowed tools)
- normalized_query + filters (intent/entities/filter hash)
- date anchor for time-relative questions ("today")
If any of these change, you should miss the cache or revalidate aggressively before serving. Canonicalization matters—but don't over-normalize IDs/codes/versions.
import crypto from "crypto";
type CacheLayer = "prompt_prefix" | "retrieval" | "rerank" | "response" | "embedding";
type CacheKeyInput = {
layer: CacheLayer;
// scope + risk boundaries
tenantId: string;
userScope: string; // role/permissions tier (NOT raw user id unless required)
locale?: string;
region?: string;
// model + decoding
modelId: string;
modelVersion?: string;
temperature?: number;
topP?: number;
maxTokens?: number;
// version fences (force invalidation on deploy/config changes)
promptVersion: string; // hash(system + templates + tool schemas)
toolingVersion: string; // tool schemas + allowed tools
retrievalVersion: string; // index snapshot + chunking version
policyVersion: string; // safety/compliance ruleset
// request identity
normalizedQuery?: string; // for retrieval/rerank/response caches
intent?: string; // if you have intent routing
filterHash?: string; // metadata filters (product, plan, version, collection)
conversationStateHash?: string; // ONLY if caching across multi-turn context
// time anchoring (prevents “today” collisions)
dateAnchor?: string; // e.g. "2026-02-27" in user timezone
};
function stableJson(x: unknown) {
return JSON.stringify(x, Object.keys(x as any).sort());
}
function sha256(s: string) {
return crypto.createHash("sha256").update(s, "utf8").digest("hex");
}
export function buildCacheKey(input: CacheKeyInput) {
// Minimum safe dimensions: if any of these change, correctness might change.
const keyMaterial = {
layer: input.layer,
tenantId: input.tenantId,
userScope: input.userScope,
locale: input.locale ?? "default",
region: input.region ?? "default",
modelId: input.modelId,
modelVersion: input.modelVersion ?? "unknown",
temperature: input.temperature ?? 0,
topP: input.topP ?? 1,
maxTokens: input.maxTokens ?? "auto",
promptVersion: input.promptVersion,
toolingVersion: input.toolingVersion,
retrievalVersion: input.retrievalVersion,
policyVersion: input.policyVersion,
normalizedQuery: input.normalizedQuery ?? "",
intent: input.intent ?? "",
filterHash: input.filterHash ?? "",
conversationStateHash: input.conversationStateHash ?? "",
dateAnchor: input.dateAnchor ?? "",
};
// Keep the raw key readable for debugging, but use the hash as the actual cache key.
const raw = stableJson(keyMaterial);
const digest = sha256(raw).slice(0, 32); // short key is fine; collision risk is extremely low
const prefix = `cache:${input.layer}`;
return { key: `${prefix}:${digest}`, raw };
}
/**
* Notes:
* - DO NOT over-normalize. IDs/codes/versions are semantic.
* - If your system supports time-relative questions (“today”), include a date anchor in the key.
* - For response cache: include everything you’d need to prove the answer is safe to reuse.
*/5) Invalidation: you need more than TTL
TTL-only caching is how you ship stale answers. Use a multi-pronged strategy:
- TTL as a baseline
- Version bump invalidation on deploy (prompt/retrieval/tooling versions)
- Event-driven invalidation when docs update (especially policies/pricing/security)
- Partition invalidation as a practical compromise (invalidate "policies" partition)
# Invalidation strategies: TTL + version bumps + doc update events.
# Event-driven invalidation requires reverse mappings: doc_id -> [cache_keys_that_used_it].
from collections import defaultdict
from datetime import datetime, timedelta
# Example stores (backed by Redis / DB in real life)
CACHE = {} # cache_key -> {"value": ..., "written_at": datetime, "meta": {...}}
DOC_TO_CACHEKEYS = defaultdict(set) # doc_id -> set(cache_key)
def cache_write(cache_key, value, meta):
CACHE[cache_key] = {"value": value, "written_at": datetime.utcnow(), "meta": meta}
# If this cache entry cites docs, store reverse mappings for event invalidation.
cited_doc_ids = meta.get("cited_doc_ids", [])
for doc_id in cited_doc_ids:
DOC_TO_CACHEKEYS[doc_id].add(cache_key)
def invalidate_by_doc_update(doc_id):
# Called when a document is updated (policy/pricing/security docs = high priority).
keys = list(DOC_TO_CACHEKEYS.get(doc_id, []))
for k in keys:
CACHE.pop(k, None)
DOC_TO_CACHEKEYS.pop(doc_id, None)
return len(keys)
def invalidate_partition(partition_name):
# Practical compromise when reverse mappings are hard:
# partition caches by collection tag (policies/pricing/api-docs) and nuke that namespace.
keys = [k for k, v in CACHE.items() if v["meta"].get("partition") == partition_name]
for k in keys:
CACHE.pop(k, None)
return len(keys)
def ttl_sweep(now=None):
now = now or datetime.utcnow()
deleted = 0
for k in list(CACHE.keys()):
ttl_sec = CACHE[k]["meta"].get("ttl_sec", 0)
if ttl_sec and (CACHE[k]["written_at"] + timedelta(seconds=ttl_sec)) < now:
CACHE.pop(k, None)
deleted += 1
return deleted
# Version bump invalidation is simplest and most reliable:
# - include prompt_version / retrieval_version / tooling_version in the cache key
# - deploying changes automatically creates a new cache namespace (old entries become unreachable).6) Validation gates: how to cache without breaking correctness
Caching needs a validation contract—fast checks that prevent incorrect cached serves:
- Permission & scope check (non-negotiable) — verify access to every cited chunk/doc
- Freshness check — bypass if cited docs updated after cache write time (strict for high-risk)
- Schema/format check — for cached JSON, validate parse + required keys/types
- Groundedness clause check — citations exist and clauses still match (fingerprint/substr/fuzzy)
- Stale-while-revalidate — only for low-risk public FAQs (optional)
type CachedRetrieval = {
candidateChunkIds: string[];
citedDocIds?: string[];
writtenAtMs: number;
// optional for correctness checks
citedClauseFingerprints?: Array<{ docId: string; chunkId: string; fingerprint: string }>;
};
type ValidationInput = {
tenantId: string;
userScope: string;
nowMs: number;
// doc metadata / permission services
canAccessChunk: (chunkId: string, tenantId: string, userScope: string) => Promise<boolean>;
docLastUpdatedMs: (docId: string) => Promise<number>;
chunkContainsFingerprint: (chunkId: string, fingerprint: string) => Promise<boolean>;
// policy knobs
freshnessMode: "strict" | "best_effort";
highRiskCollections: Set<string>;
citedDocCollection?: (docId: string) => Promise<string | null>;
};
export async function validateCachedItem(
cached: CachedRetrieval,
v: ValidationInput
): Promise<{ ok: true } | { ok: false; reason: string }> {
// Gate 1 — permission & scope check (non-negotiable)
for (const chunkId of cached.candidateChunkIds) {
const ok = await v.canAccessChunk(chunkId, v.tenantId, v.userScope);
if (!ok) return { ok: false, reason: `permission_denied_for_chunk:${chunkId}` };
}
// Gate 2 — freshness check (doc-age aware)
if (cached.citedDocIds?.length) {
for (const docId of cached.citedDocIds) {
const updatedAt = await v.docLastUpdatedMs(docId);
if (updatedAt > cached.writtenAtMs) {
// High-risk collections: always bypass on doc changes.
const coll = (await v.citedDocCollection?.(docId)) ?? null;
const highRisk = coll ? v.highRiskCollections.has(coll) : false;
if (v.freshnessMode === "strict" || highRisk) {
return { ok: false, reason: `stale_doc:${docId}` };
}
}
}
}
// Gate 4 — groundedness / contradiction check (cheap version)
// Verify the cited clause still exists (fingerprint can be substring hash / fuzzy token signature).
if (cached.citedClauseFingerprints?.length) {
for (const c of cached.citedClauseFingerprints) {
const ok = await v.chunkContainsFingerprint(c.chunkId, c.fingerprint);
if (!ok) return { ok: false, reason: `clause_mismatch:${c.docId}:${c.chunkId}` };
}
}
return { ok: true };
}
/**
* Add Gate 3 for structured outputs (JSON/schema validation) at response-serve time.
* Add Gate 5 (stale-while-revalidate) only for low-risk public endpoints.
*/7) Provider prompt caching (OpenAI, Anthropic, etc.)
OpenAI, Anthropic, and others offer prompt caching (prefix caching): identical prompt prefixes are cached server-side and billed at a lower rate. Use it when your system prompt + tool schemas + stable context are large and repeated across calls.
When to rely on provider caching vs your own: Provider caching is automatic and requires no key design—but you don't control TTL or invalidation. For multi-tenant or policy-sensitive systems, your own prompt prefix cache (with tenant/version in the key) gives control. Use provider caching for single-tenant or low-risk workloads; add your own cache when you need strict versioning and tenant isolation.
8) Playbook per cache layer (what to do, what to avoid)
A) Prompt cache (cost saver, low risk if versioned)
Do: cache stable prefix; key includes prompt_version + tooling_version; separate tenants/policies.
Avoid: caching full prompts that include private user context unless keys are strict and scope-limited.
B) Retrieval cache (best ROI for RAG)
Do: cache normalized query + filters → chunk IDs + doc versions; cache embeddings; cache rerank scores.
Avoid: over-normalizing queries; ignoring filters; caching without doc/chunk versions.
C) Response cache (highest savings, highest risk)
Do: use only for public FAQs and low-risk content; include all versions; validate citations + freshness.
Avoid: caching personalized or account-state answers; caching "current status" answers without date anchors.
Safer pattern
Cache analysis artifacts (intent, candidate set, rerank scores) and regenerate final text cheaply, instead of caching the final answer for anything non-trivial.
9) Cache poisoning and security (don't skip this)
Caching introduces new attack surfaces: prompt injection that gets cached and amplified, and cross-tenant leakage from loose keys.
- Never cache responses that contain policy-triggering content without validation
- Cache only after safety checks pass
- Always include tenant_id (and strict user_scope where needed)
- Consider encryption at rest if cache infra is shared
10) Metrics that prove caching helped (without harming quality)
Track caching like a production feature: savings, hit rates, rejection/override rates, freshness violations, and offline eval regressions.
__CODE_BLOCK_METRICS__11) Recommended rollout order (lowest risk → highest payoff)
- Embedding cache (doc + query embeddings)
- Retrieval candidates cache (query→chunk IDs)
- Rerank cache (query/chunk → score)
- Prompt prefix caching (stable scaffolding)
- Response caching only for strict low-risk public FAQs
At each step: add offline eval gates, permission checks, and doc-version freshness checks.
12) Shipping checklist (copy into your PR)
Keying & versioning
- Keys include tenant/scope/model/prompt/retrieval/policy versions
- Query canonicalization avoids collisions (don't over-normalize IDs)
- Date anchors included for time-relative queries ("today")
Invalidation
- TTL set per layer
- Version bump invalidation on deploy
- Doc update invalidation for policy/pricing (event-driven or partition)
Validation
- Permission check before serving cached retrieval/response
- Doc freshness check for cited docs
- JSON/schema validation for structured outputs
- Citation existence + clause fingerprint check for RAG answers
Observability
- Hit rate, savings, and override rate tracked per layer
- Cache-related incidents are loggable and searchable
- Canary rollout + rollback plan
13) Closing: correctness-first caching
If you remember one rule, make it this: cache artifacts that are stable and easy to validate (embeddings, candidates, rerank scores) before you cache final answers. That's how you get meaningful cost reduction without turning caching into a reliability bug factory.
See our case study on inference cost reduction for a real-world example: caching, routing, and retrieval policy changes drove 25–60% cost reduction with before/after benchmarks.
Want TTLs + keys tailored to your stack?
Tell us your cache layer (Redis/Postgres), retrieval stack (OpenSearch/Elastic/pgvector/vector DB), and doc update frequency. We'll propose safe defaults (TTLs, key schema, invalidation strategy, validation gates) and ship it via{" "} AI Optimization{" "} or validate end-to-end in an{" "} AI System Audit.
# Correctness-first cache defaults (adjust by doc churn + risk)
prompt_prefix_cache:
enabled: true
ttl: "24h"
notes:
- "Key must include prompt_version + tooling_version + model_id + decoding params"
- "Strong tenant/policy separation"
embedding_cache:
enabled: true
ttl: "7d"
notes:
- "Cache doc embeddings on ingestion"
- "Cache query embeddings if queries repeat"
retrieval_cache:
enabled: true
ttl: "30m"
freshness:
mode: "strict" # strict for policies/pricing/security
invalidate_on_doc_update_collections:
- "policies"
- "pricing"
- "security"
notes:
- "Store chunk_ids + doc versions + filter hash"
- "Always permission-check before reuse"
rerank_cache:
enabled: true
ttl: "2h"
key:
- "query_hash"
- "chunk_id"
- "reranker_version"
response_cache:
enabled: false # enable last
ttl: "10m"
scope: "public_faq_only"
validation:
- "permission_check"
- "doc_freshness_check"
- "citation_exists"
- "clause_fingerprint_check"
notes:
- "Do not cache personalized/account-state answers"
- "Include date anchors for time-relative questions ('today')"
gates:
- "must_not_increase_cached_answer_override_rate > 2x baseline"
- "must_not_reduce offline_recall@10 on golden set"
- "must_not_increase p95_retrieval_latency > 15% without approval"OpenAI, Anthropic, and others offer prompt caching (prefix caching): identical prompt prefixes are cached server-side and billed at a lower rate. Use it when your system prompt + tool schemas + stable context are large and repeated across calls.
When to rely on provider caching vs your own: Provider caching is automatic and requires no key design—but you don't control TTL or invalidation. For multi-tenant or policy-sensitive systems, your own prompt prefix cache (with tenant/version in the key) gives control. Use provider caching for single-tenant or low-risk workloads; add your own cache when you need strict versioning and tenant isolation.
8) Playbook per cache layer (what to do, what to avoid)
A) Prompt cache (cost saver, low risk if versioned)
Do: cache stable prefix; key includes prompt_version + tooling_version; separate tenants/policies.
Avoid: caching full prompts that include private user context unless keys are strict and scope-limited.
B) Retrieval cache (best ROI for RAG)
Do: cache normalized query + filters → chunk IDs + doc versions; cache embeddings; cache rerank scores.
Avoid: over-normalizing queries; ignoring filters; caching without doc/chunk versions.
C) Response cache (highest savings, highest risk)
Do: use only for public FAQs and low-risk content; include all versions; validate citations + freshness.
Avoid: caching personalized or account-state answers; caching "current status" answers without date anchors.
Safer pattern
Cache analysis artifacts (intent, candidate set, rerank scores) and regenerate final text cheaply, instead of caching the final answer for anything non-trivial.
9) Cache poisoning and security (don't skip this)
Caching introduces new attack surfaces: prompt injection that gets cached and amplified, and cross-tenant leakage from loose keys.
- Never cache responses that contain policy-triggering content without validation
- Cache only after safety checks pass
- Always include tenant_id (and strict user_scope where needed)
- Consider encryption at rest if cache infra is shared
10) Metrics that prove caching helped (without harming quality)
Track caching like a production feature: savings, hit rates, rejection/override rates, freshness violations, and offline eval regressions.
-- Cache metrics that actually prove ROI (cost + correctness)
-- Hit rate by layer
-- cache_hit_total{layer="retrieval"} / cache_req_total{layer="retrieval"}
-- Token savings (prompt vs completion)
-- prompt_tokens_saved_total, completion_tokens_saved_total
-- Correctness proxy: cache override (validation rejected cache)
-- cache_override_total{layer="response"} / cache_hit_total{layer="response"}
-- Freshness violations (should be ~0 for high-risk collections)
-- cached_response_cites_updated_doc_total / cached_response_total
-- Cost per Successful Task (CPS) should improve, not just cost/call
-- CPS = total_tokens_spent / successful_tasks
-- Tail latency by stage (secondary)
-- p95_retrieval_ms, p95_rerank_ms, p95_generation_ms11) Recommended rollout order (lowest risk → highest payoff)
- Embedding cache (doc + query embeddings)
- Retrieval candidates cache (query→chunk IDs)
- Rerank cache (query/chunk → score)
- Prompt prefix caching (stable scaffolding)
- Response caching only for strict low-risk public FAQs
At each step: add offline eval gates, permission checks, and doc-version freshness checks.
12) Shipping checklist (copy into your PR)
Keying & versioning
- Keys include tenant/scope/model/prompt/retrieval/policy versions
- Query canonicalization avoids collisions (don't over-normalize IDs)
- Date anchors included for time-relative queries ("today")
Invalidation
- TTL set per layer
- Version bump invalidation on deploy
- Doc update invalidation for policy/pricing (event-driven or partition)
Validation
- Permission check before serving cached retrieval/response
- Doc freshness check for cited docs
- JSON/schema validation for structured outputs
- Citation existence + clause fingerprint check for RAG answers
Observability
- Hit rate, savings, and override rate tracked per layer
- Cache-related incidents are loggable and searchable
- Canary rollout + rollback plan
13) Closing: correctness-first caching
If you remember one rule, make it this: cache artifacts that are stable and easy to validate (embeddings, candidates, rerank scores) before you cache final answers. That's how you get meaningful cost reduction without turning caching into a reliability bug factory.
See our case study on inference cost reduction for a real-world example: caching, routing, and retrieval policy changes drove 25–60% cost reduction with before/after benchmarks.
Want TTLs + keys tailored to your stack?
Tell us your cache layer (Redis/Postgres), retrieval stack (OpenSearch/Elastic/pgvector/vector DB), and doc update frequency. We'll propose safe defaults (TTLs, key schema, invalidation strategy, validation gates) and ship it via{" "} AI Optimization{" "} or validate end-to-end in an{" "} AI System Audit.
FAQ
Questions readers usually ask next
Which cache layer should I ship first?
Start with embeddings (doc + query), then retrieval candidates (query→chunk IDs), then rerank score cache, then prompt prefix caching. Add response caching last and only for strict low-risk public FAQs, with short TTL and validation gates.
What's the most common correctness failure with caching?
Serving cached responses across the wrong tenant/scope/policy or after doc updates. The fix is safe key design (tenant + scope + versions), plus permission/freshness checks and event-driven invalidation for high-risk collections like pricing and policies.
Do I need event-driven invalidation or is TTL enough?
TTL is a baseline, not a safety mechanism. If you have freshness-critical docs (policy/pricing/security), you need version bump invalidation on deploy and event-driven invalidation (or at least partition invalidation) on doc updates.
Redis vs Postgres for LLM/RAG caching?
Redis is faster and better for high-throughput, short-TTL caches (retrieval, rerank). Postgres works for audit trails, longer retention, or when you need transactional consistency with other data. Use Redis for hot path; Postgres when you need durability or complex queries over cache metadata.
When is it safe to enable response cache?
Only for low-risk, non-personalized content: public FAQs, static help text, template-style answers. Never for account-state, personalized recommendations, or anything that depends on 'current' data (pricing, policies, inventory) unless you have strict validation gates and short TTL.
Want defaults tailored to your stack?
Share your caching layer and doc update frequency. We’ll propose safe TTLs + key dimensions + invalidation strategy and ship it via AI Optimization or validate end-to-end in an AI System Audit.
Last updated
March 7, 2026





