Cost Optimization · 7 min read

Caching for Cost & Correctness: Prompt / Retrieval / Response Cache (With Validation Gates)

Correctness-first caching for LLM/RAG: reduce cost without shipping wrong answers. Cache layers, savings ballpark, safe keys, invalidation beyond TTL, provider caching, and validation gates you can ship.

Tags: caching, cost-spike, retrieval, wrong-answers, observability, playbook

The core idea

Caching is a correctness problem before it’s a performance trick. Cache stable artifacts (embeddings, candidates, rerank scores) first. If you cache responses, do it only for low-risk public FAQs—and always validate permissions and freshness before serving.

Caching in production is a cost lever—but the classic failure mode is saving tokens while serving stale, wrong, or permission-violating answers. Caches drift when docs update, prompts change, or tenant boundaries are loose. This playbook shows how to design caching that reduces spend without breaking correctness: concrete layers, safe keys, invalidation strategies, and validation gates.

1) Where cost actually comes from (and where caching helps)

A typical RAG or agent flow: normalize request → retrieval → reranking → context construction → generation (biggest token cost) → tools → post-processing.

Caching can reduce cost at four places:

  • Prompt tokens (long system prompts, tool schemas, repeated context)
  • Retrieval/rerank compute (vector search + rerank aren't free)
  • Tool calls (often repeated for the same query)
  • Full responses (FAQ-ish repeats)

But each cache has different correctness risks. The core principle is: cache stable artifacts that are easy to validate before caching final answers.

2) Savings ballpark: what to expect by layer

Reference ranges for planning. Actual savings depend on query repeat rate, doc churn, and hit rates.

| Layer | Typical savings (when hit) | Hit-rate reality | Correctness risk |
|---|---|---|---|
| Embedding (doc + query) | 70–90% of embedding cost | High (queries repeat, docs stable) | Low |
| Retrieval (query → chunk IDs) | 50–80% of retrieval compute | Medium–high (FAQ-ish queries) | Medium (stale docs) |
| Rerank cache | 60–90% of rerank cost | Medium | Low |
| Prompt prefix | 10–50% of prompt tokens | High (same system/tools) | Low (if versioned) |
| Response cache | 90%+ of full call cost | Low–medium (exact repeats only) | High |

Most systems get 60–80% of cache ROI from embeddings + retrieval + rerank—before touching response cache.
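
To turn the ranges above into a single planning number, weight each layer by its share of total spend, its expected hit rate, and its savings when hit. A minimal sketch: `blended_savings` is a hypothetical helper, and the shares and hit rates below are made-up inputs for illustration, not measurements.

```python
def blended_savings(layers):
    """Estimate the fraction of total spend saved across cache layers.

    Each layer is (share_of_total_spend, hit_rate, savings_when_hit),
    all expressed as fractions between 0 and 1.
    """
    return sum(share * hit_rate * saved for share, hit_rate, saved in layers)


# Hypothetical inputs: replace with numbers measured on your own traffic.
example_layers = [
    (0.10, 0.60, 0.80),  # embeddings: 10% of spend, 60% hit rate, 80% saved per hit
    (0.20, 0.40, 0.60),  # retrieval + rerank
    (0.70, 0.05, 0.90),  # generation, served from the response cache on exact repeats
]

estimate = blended_savings(example_layers)
print(f"estimated savings: {estimate:.1%}")
```

Note how a low hit rate keeps the response cache's contribution small in this example even though it saves the most per hit.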

3) The 3 cache layers (and what each one should store)

Layer A — Prompt cache (token cost)

Goal: avoid paying repeatedly for identical prompt prefix tokens or identical assembled prompts.

Cache: canonicalized system prompt + tool schemas + stable instructions (the "prefix").

Risk: prompt-version drift, hidden personalization differences, tool schema changes.

Layer B — Retrieval cache (recall/cost stability)

Goal: avoid rerunning vector/BM25 search + reranking when the same (or a semantically equivalent) query repeats.

Cache: normalized query + filters → candidate chunk IDs (+ ranks/scores), rerank scores, embeddings.

Risk: stale docs, permission drift, query normalization collisions.

Layer C — Response cache (max token savings, max risk)

Goal: reuse the final generated answer (text + citations + structured output).

Best for: public FAQs, low-risk non-personalized answers, template-style responses.

Risk: wrong user/tenant permissions, staleness, context mismatch, hallucination amplification.

Rule of thumb

Response caching should be the last caching layer you add. As noted in section 2, most of the cache ROI (roughly 60–80%) typically comes from embeddings + retrieval + rerank.

4) Cache keys: correctness starts with key design

A cache is only as safe as its key. The key must represent everything that could change the correct output. Minimum "safe key" dimensions for most systems:

  • tenant_id (always)
  • user_scope (role/permissions tier)
  • locale/region (policies differ)
  • model_id + version + decoding params
  • prompt_version (hash(system + templates + tool schemas))
  • retrieval_version (index snapshot + chunking version)
  • policy_version (safety/compliance ruleset)
  • tooling_version (tool schemas/allowed tools)
  • normalized_query + filters (intent/entities/filter hash)
  • date anchor for time-relative questions ("today")

If any of these change, you should miss the cache or revalidate aggressively before serving. Canonicalization matters—but don't over-normalize IDs/codes/versions.
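
The "don't over-normalize" warning is easier to see in code. A minimal sketch, assuming a hypothetical `normalize_query` helper and a deliberately simple heuristic: any token containing a digit is treated as an ID, code, or version and keeps its case. Tune the heuristic to your own identifiers.

```python
import re
import unicodedata

# Heuristic (an assumption): tokens containing a digit are identifiers,
# codes, or versions, so their case is preserved.
_HAS_DIGIT = re.compile(r"\d")


def normalize_query(query: str) -> str:
    """Canonicalize a query for cache keys without merging distinct IDs."""
    text = unicodedata.normalize("NFC", query)
    tokens = []
    for token in text.split():           # split() also collapses whitespace runs
        token = token.rstrip("?!.,")     # drop trailing sentence punctuation only
        if not _HAS_DIGIT.search(token):
            token = token.lower()        # plain words are case-insensitive
        if token:
            tokens.append(token)
    return " ".join(tokens)


print(normalize_query("  How do I  reset my Password? "))
print(normalize_query("Upgrade to v2.10?"))
```

Erring on the side of preserving tokens costs a few cache misses; merging two different IDs into one key serves the wrong answer.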

Safe cache key builder (includes tenant/scope/versions)
import * as crypto from "crypto"; // namespace import compiles with or without esModuleInterop

type CacheLayer = "prompt_prefix" | "retrieval" | "rerank" | "response" | "embedding";

type CacheKeyInput = {
  layer: CacheLayer;

  // scope + risk boundaries
  tenantId: string;
  userScope: string; // role/permissions tier (NOT raw user id unless required)
  locale?: string;
  region?: string;

  // model + decoding
  modelId: string;
  modelVersion?: string;
  temperature?: number;
  topP?: number;
  maxTokens?: number;

  // version fences (force invalidation on deploy/config changes)
  promptVersion: string;     // hash(system + templates + tool schemas)
  toolingVersion: string;    // tool schemas + allowed tools
  retrievalVersion: string;  // index snapshot + chunking version
  policyVersion: string;     // safety/compliance ruleset

  // request identity
  normalizedQuery?: string;  // for retrieval/rerank/response caches
  intent?: string;           // if you have intent routing
  filterHash?: string;       // metadata filters (product, plan, version, collection)
  conversationStateHash?: string; // ONLY if caching across multi-turn context

  // time anchoring (prevents “today” collisions)
  dateAnchor?: string; // e.g. "2026-02-27" in user timezone
};

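// Deterministic JSON for FLAT objects only: the array replacer also acts as a key
// whitelist at every nesting level, so nested keys not listed in it would be dropped.
function stableJson(x: Record<string, unknown>) {
  return JSON.stringify(x, Object.keys(x).sort());
}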

function sha256(s: string) {
  return crypto.createHash("sha256").update(s, "utf8").digest("hex");
}

export function buildCacheKey(input: CacheKeyInput) {
  // Minimum safe dimensions: if any of these change, correctness might change.
  const keyMaterial = {
    layer: input.layer,

    tenantId: input.tenantId,
    userScope: input.userScope,
    locale: input.locale ?? "default",
    region: input.region ?? "default",

    modelId: input.modelId,
    modelVersion: input.modelVersion ?? "unknown",
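    // Sentinel for unset params, so "unset" never shares a key with an explicit value.
    temperature: input.temperature ?? "default",
    topP: input.topP ?? "default",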
    maxTokens: input.maxTokens ?? "auto",

    promptVersion: input.promptVersion,
    toolingVersion: input.toolingVersion,
    retrievalVersion: input.retrievalVersion,
    policyVersion: input.policyVersion,

    normalizedQuery: input.normalizedQuery ?? "",
    intent: input.intent ?? "",
    filterHash: input.filterHash ?? "",
    conversationStateHash: input.conversationStateHash ?? "",
    dateAnchor: input.dateAnchor ?? "",
  };

  // Keep the raw key readable for debugging, but use the hash as the actual cache key.
  const raw = stableJson(keyMaterial);
  const digest = sha256(raw).slice(0, 32); // short key is fine; collision risk is extremely low
  const prefix = `cache:${input.layer}`;
  return { key: `${prefix}:${digest}`, raw };
}

/**
 * Notes:
 * - DO NOT over-normalize. IDs/codes/versions are semantic.
 * - If your system supports time-relative questions (“today”), include a date anchor in the key.
 * - For response cache: include everything you’d need to prove the answer is safe to reuse.
 */
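
The date-anchor note above matters because "today" depends on where the user is. A minimal sketch of deriving the anchor, assuming a hypothetical `date_anchor` helper that receives a timezone-aware UTC timestamp and the user's `tzinfo`. The fixed offsets below are only for illustration; real zones with DST would come from something like `zoneinfo.ZoneInfo`.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def date_anchor(now_utc: datetime, user_tz: tzinfo) -> str:
    """Return the user's local calendar date (ISO format) for use in a cache key."""
    if now_utc.tzinfo is None:
        raise ValueError("now_utc must be timezone-aware")
    return now_utc.astimezone(user_tz).date().isoformat()


# The same instant falls on different calendar days for different users.
instant = datetime(2026, 2, 27, 23, 30, tzinfo=timezone.utc)
utc_minus_8 = timezone(timedelta(hours=-8))  # illustrative fixed offset
utc_plus_9 = timezone(timedelta(hours=9))    # illustrative fixed offset

print(date_anchor(instant, utc_minus_8))  # 2026-02-27
print(date_anchor(instant, utc_plus_9))   # 2026-02-28
```

Two users asking "what changed today?" at the same instant get different keys, so neither is served the other's day.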

5) Invalidation: you need more than TTL

TTL-only caching is how you ship stale answers. Use a multi-pronged strategy:

  • TTL as a baseline
  • Version bump invalidation on deploy (prompt/retrieval/tooling versions)
  • Event-driven invalidation when docs update (especially policies/pricing/security)
  • Partition invalidation as a practical compromise (invalidate "policies" partition)

Invalidation beyond TTL (doc events + partitions + version bumps)
# Invalidation strategies: TTL + version bumps + doc update events.
# Event-driven invalidation requires reverse mappings: doc_id -> [cache_keys_that_used_it].

from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Example stores (backed by Redis / DB in real life)
CACHE = {}                           # cache_key -> {"value": ..., "written_at": datetime, "meta": {...}}
DOC_TO_CACHEKEYS = defaultdict(set)  # doc_id -> set(cache_key)

def _utcnow():
    # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12.
    return datetime.now(timezone.utc)

def _drop(cache_key):
    # Remove an entry plus its reverse mappings, so DOC_TO_CACHEKEYS never points at dead keys.
    entry = CACHE.pop(cache_key, None)
    if entry is None:
        return False
    for doc_id in entry["meta"].get("cited_doc_ids", []):
        DOC_TO_CACHEKEYS[doc_id].discard(cache_key)
        if not DOC_TO_CACHEKEYS[doc_id]:
            DOC_TO_CACHEKEYS.pop(doc_id, None)
    return True

def cache_write(cache_key, value, meta):
    _drop(cache_key)  # on overwrite, clear mappings left by the previous entry
    CACHE[cache_key] = {"value": value, "written_at": _utcnow(), "meta": meta}

    # If this cache entry cites docs, store reverse mappings for event invalidation.
    for doc_id in meta.get("cited_doc_ids", []):
        DOC_TO_CACHEKEYS[doc_id].add(cache_key)

def invalidate_by_doc_update(doc_id):
    # Called when a document is updated (policy/pricing/security docs = high priority).
    keys = list(DOC_TO_CACHEKEYS.get(doc_id, ()))
    deleted = sum(_drop(k) for k in keys)
    DOC_TO_CACHEKEYS.pop(doc_id, None)
    return deleted

def invalidate_partition(partition_name):
    # Practical compromise when reverse mappings are hard:
    # partition caches by collection tag (policies/pricing/api-docs) and nuke that namespace.
    keys = [k for k, v in CACHE.items() if v["meta"].get("partition") == partition_name]
    return sum(_drop(k) for k in keys)

def ttl_sweep(now=None):
    # `now` must be timezone-aware, like the stored written_at values.
    now = now or _utcnow()
    deleted = 0
    for k in list(CACHE.keys()):
        ttl_sec = CACHE[k]["meta"].get("ttl_sec", 0)
        if ttl_sec and (CACHE[k]["written_at"] + timedelta(seconds=ttl_sec)) < now:
            deleted += _drop(k)
    return deleted

# Version bump invalidation is simplest and most reliable:
# - include prompt_version / retrieval_version / tooling_version in the cache key
# - deploying changes automatically creates a new cache namespace (old entries become unreachable).
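
Version-bump invalidation only works if the version string changes whenever the underlying assets do. One way to get that is to derive it from the content itself. A minimal sketch, assuming a hypothetical `version_hash` helper and made-up prompt assets; the 12-character truncation is an arbitrary choice for readability.

```python
import hashlib
import json


def version_hash(*parts) -> str:
    """Short, deterministic fingerprint of JSON-serializable config parts."""
    canonical = json.dumps(parts, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical prompt assets; in practice, load these from your repo or config store.
system_prompt = "You are a support assistant. Cite sources."
templates = {"answer": "Context:\n{context}\n\nQuestion: {question}"}
tool_schemas = [{"name": "lookup_order", "parameters": {"order_id": "string"}}]

prompt_version = version_hash(system_prompt, templates, tool_schemas)
edited_version = version_hash(system_prompt + " Be concise.", templates, tool_schemas)

print(prompt_version, edited_version)  # two different fingerprints
```

Because the fingerprint feeds the cache key, any edit to the system prompt, templates, or tool schemas lands in a fresh namespace without anyone remembering to bump a number.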

6) Validation gates: how to cache without breaking correctness

Caching needs a validation contract—fast checks that prevent incorrect cached serves:

  1. Permission & scope check (non-negotiable) — verify access to every cited chunk/doc
  2. Freshness check — bypass if cited docs updated after cache write time (strict for high-risk)
  3. Schema/format check — for cached JSON, validate parse + required keys/types
  4. Groundedness clause check — citations exist and clauses still match (fingerprint/substr/fuzzy)
  5. Stale-while-revalidate — only for low-risk public FAQs (optional)

Validation gates (permission + freshness + clause checks)
type CachedRetrieval = {
  candidateChunkIds: string[];
  citedDocIds?: string[];
  writtenAtMs: number;
  // optional for correctness checks
  citedClauseFingerprints?: Array<{ docId: string; chunkId: string; fingerprint: string }>;
};

type ValidationInput = {
  tenantId: string;
  userScope: string;
  nowMs: number;

  // doc metadata / permission services
  canAccessChunk: (chunkId: string, tenantId: string, userScope: string) => Promise<boolean>;
  docLastUpdatedMs: (docId: string) => Promise<number>;
  chunkContainsFingerprint: (chunkId: string, fingerprint: string) => Promise<boolean>;

  // policy knobs
  freshnessMode: "strict" | "best_effort";
  highRiskCollections: Set<string>;
  citedDocCollection?: (docId: string) => Promise<string | null>;
};

export async function validateCachedItem(
  cached: CachedRetrieval,
  v: ValidationInput
): Promise<{ ok: true } | { ok: false; reason: string }> {
  // Gate 1 — permission & scope check (non-negotiable)
  for (const chunkId of cached.candidateChunkIds) {
    const ok = await v.canAccessChunk(chunkId, v.tenantId, v.userScope);
    if (!ok) return { ok: false, reason: `permission_denied_for_chunk:${chunkId}` };
  }

  // Gate 2 — freshness check (doc-age aware)
  if (cached.citedDocIds?.length) {
    for (const docId of cached.citedDocIds) {
      const updatedAt = await v.docLastUpdatedMs(docId);
      if (updatedAt > cached.writtenAtMs) {
        // High-risk collections: always bypass on doc changes.
        const coll = (await v.citedDocCollection?.(docId)) ?? null;
        const highRisk = coll ? v.highRiskCollections.has(coll) : false;

        if (v.freshnessMode === "strict" || highRisk) {
          return { ok: false, reason: `stale_doc:${docId}` };
        }
      }
    }
  }

  // Gate 4 — groundedness / contradiction check (cheap version)
  // Verify the cited clause still exists (fingerprint can be substring hash / fuzzy token signature).
  if (cached.citedClauseFingerprints?.length) {
    for (const c of cached.citedClauseFingerprints) {
      const ok = await v.chunkContainsFingerprint(c.chunkId, c.fingerprint);
      if (!ok) return { ok: false, reason: `clause_mismatch:${c.docId}:${c.chunkId}` };
    }
  }

  return { ok: true };
}

/**
 * Add Gate 3 for structured outputs (JSON/schema validation) at response-serve time.
 * Add Gate 5 (stale-while-revalidate) only for low-risk public endpoints.
 */
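
Gate 3 is left as a note above, so here is one possible shape for it. A minimal sketch in Python, assuming a hypothetical `validate_cached_json` helper and a made-up schema (`REQUIRED_FIELDS`); a real system would more likely validate against its own JSON Schema or typed model.

```python
import json

# Hypothetical schema for a cached answer: required key -> accepted type(s).
REQUIRED_FIELDS = {"answer": str, "citations": list, "confidence": (int, float)}


def validate_cached_json(raw, required=REQUIRED_FIELDS):
    """Return (True, parsed) if the cached payload is usable, else (False, reason)."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError) as exc:  # JSONDecodeError subclasses ValueError
        return False, f"invalid_json:{exc.__class__.__name__}"
    if not isinstance(parsed, dict):
        return False, "not_an_object"
    for key, expected in required.items():
        if key not in parsed:
            return False, f"missing_key:{key}"
        if not isinstance(parsed[key], expected):
            return False, f"wrong_type:{key}"
    return True, parsed


print(validate_cached_json('{"answer": "x", "citations": ["doc1#c2"], "confidence": 0.9}'))
print(validate_cached_json('{"answer": "x"}'))
```

On failure, treat the entry as a miss and regenerate; the reason string is what you would count in an override metric.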

Recommended defaults (TTLs + gates): start here
# Correctness-first cache defaults (adjust by doc churn + risk)

prompt_prefix_cache:
  enabled: true
  ttl: "24h"
  notes:
    - "Key must include prompt_version + tooling_version + model_id + decoding params"
    - "Strong tenant/policy separation"

embedding_cache:
  enabled: true
  ttl: "7d"
  notes:
    - "Cache doc embeddings on ingestion"
    - "Cache query embeddings if queries repeat"

retrieval_cache:
  enabled: true
  ttl: "30m"
  freshness:
    mode: "strict" # strict for policies/pricing/security
  invalidate_on_doc_update_collections:
    - "policies"
    - "pricing"
    - "security"
  notes:
    - "Store chunk_ids + doc versions + filter hash"
    - "Always permission-check before reuse"

rerank_cache:
  enabled: true
  ttl: "2h"
  key:
    - "query_hash"
    - "chunk_id"
    - "reranker_version"

response_cache:
  enabled: false # enable last
  ttl: "10m"
  scope: "public_faq_only"
  validation:
    - "permission_check"
    - "doc_freshness_check"
    - "citation_exists"
    - "clause_fingerprint_check"
  notes:
    - "Do not cache personalized/account-state answers"
    - "Include date anchors for time-relative questions ('today')"

gates:
  - "must_not_increase_cached_answer_override_rate > 2x baseline"
  - "must_not_reduce offline_recall@10 on golden set"
  - "must_not_increase p95_retrieval_latency > 15% without approval"
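
The TTLs in these defaults are human-readable strings, while the invalidation sketch in section 5 reads a numeric `ttl_sec`. A minimal converter, assuming a hypothetical `ttl_to_seconds` helper and only the single-unit forms used above (`s`, `m`, `h`, `d`):

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
_TTL_PATTERN = re.compile(r"(\d+)([smhd])")


def ttl_to_seconds(ttl: str) -> int:
    """Convert a TTL string such as "30m" or "7d" to seconds."""
    match = _TTL_PATTERN.fullmatch(ttl.strip())
    if match is None:
        raise ValueError(f"unsupported TTL format: {ttl!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


for ttl in ("24h", "7d", "30m", "2h", "10m"):
    print(ttl, ttl_to_seconds(ttl))
```

Rejecting unknown formats loudly is deliberate: a TTL that silently parses to 0 would disable expiry in the sweep shown earlier.
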

7) Provider prompt caching (OpenAI, Anthropic, etc.)

OpenAI, Anthropic, and others offer prompt caching (prefix caching): identical prompt prefixes are cached server-side and billed at a lower rate. Use it when your system prompt + tool schemas + stable context are large and repeated across calls.

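When to rely on provider caching vs your own: provider caching needs little or no key design (at the time of writing, OpenAI applies it automatically to sufficiently long prompts, while Anthropic has you mark cacheable prefix blocks with cache_control), but you get limited control over TTL and no event-driven invalidation. It reuses the processed prefix rather than a stored answer, so it lowers cost and latency without changing what the model returns. For multi-tenant or policy-sensitive systems, your own prompt prefix cache (with tenant/version in the key) gives control. Use provider caching for single-tenant or low-risk workloads; add your own cache when you need strict versioning and tenant isolation.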
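
Prefix caching only pays off when the prefix is byte-identical across calls, which is mostly a matter of ordering and deterministic serialization. A minimal sketch with no provider API calls, assuming hypothetical `assemble_prompt` and `shared_prefix_chars` helpers; minimum cacheable lengths and how caching is enabled differ by provider, so check their documentation.

```python
import json
import os.path


def assemble_prompt(system: str, tool_schemas: list, stable_context: str, user_turn: str) -> str:
    """Put stable parts first and volatile parts last so prefixes stay identical."""
    stable_prefix = "\n\n".join([
        system,
        json.dumps(tool_schemas, sort_keys=True),  # deterministic serialization
        stable_context,
    ])
    return stable_prefix + "\n\n" + user_turn


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the common leading substring of two prompts."""
    return len(os.path.commonprefix([a, b]))


# Hypothetical inputs for illustration.
system = "You are a support assistant."
tools = [{"name": "lookup_order"}]
context = "Refund policy: 30 days."

p1 = assemble_prompt(system, tools, context, "Where is my order?")
p2 = assemble_prompt(system, tools, context, "How do refunds work?")
print(shared_prefix_chars(p1, p2), "of", len(p1), "characters are shared")
```

Anything volatile (timestamps, user names, request IDs) placed before the stable blocks shortens the shared prefix and defeats the cache.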

8) Playbook per cache layer (what to do, what to avoid)

A) Prompt cache (cost saver, low risk if versioned)

Do: cache stable prefix; key includes prompt_version + tooling_version; separate tenants/policies.

Avoid: caching full prompts that include private user context unless keys are strict and scope-limited.

B) Retrieval cache (best ROI for RAG)

Do: cache normalized query + filters → chunk IDs + doc versions; cache embeddings; cache rerank scores.

Avoid: over-normalizing queries; ignoring filters; caching without doc/chunk versions.

C) Response cache (highest savings, highest risk)

Do: use only for public FAQs and low-risk content; include all versions; validate citations + freshness.

Avoid: caching personalized or account-state answers; caching "current status" answers without date anchors.

Safer pattern

Cache analysis artifacts (intent, candidate set, rerank scores) and regenerate final text cheaply, instead of caching the final answer for anything non-trivial.
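
As a sketch of that pattern: reuse the cached candidate set when it passes validation, and always regenerate the final text. Every name here is hypothetical; `retrieve`, `generate`, and `is_valid` are stand-ins for your own services, and a real key would come from a safe key builder rather than the raw query.

```python
def answer_with_artifact_cache(query, cache, retrieve, generate, is_valid):
    """Reuse cached retrieval candidates when valid; always regenerate the answer."""
    candidates = cache.get(query)
    if candidates is None or not is_valid(candidates):
        candidates = retrieve(query)    # cache miss or failed validation gate
        cache[query] = candidates
    return generate(query, candidates)  # final text is never served from cache


# Stub services that count calls, for illustration only.
calls = {"retrieve": 0, "generate": 0}

def fake_retrieve(query):
    calls["retrieve"] += 1
    return ["chunk-1", "chunk-2"]

def fake_generate(query, candidates):
    calls["generate"] += 1
    return f"{query} -> {', '.join(candidates)}"

cache = {}
first = answer_with_artifact_cache("reset password", cache, fake_retrieve, fake_generate, lambda c: True)
second = answer_with_artifact_cache("reset password", cache, fake_retrieve, fake_generate, lambda c: True)
print(calls)
```

The second call skips retrieval but still pays for generation, which is the trade this pattern makes: smaller savings, but no stale answer text.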

9) Cache poisoning and security (don't skip this)

Caching introduces new attack surfaces: prompt injection that gets cached and amplified, and cross-tenant leakage from loose keys.

  • Never cache responses that contain policy-triggering content without validation
  • Cache only after safety checks pass
  • Always include tenant_id (and strict user_scope where needed)
  • Consider encryption at rest if cache infra is shared
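
Two of these rules can be enforced at the cache boundary itself. A minimal sketch, assuming hypothetical `guarded_write` / `guarded_read` wrappers around a dict-like store; tagging each entry with its tenant is defense in depth for the case where a key is built incorrectly.

```python
def guarded_write(cache, key, value, tenant_id, safety_passed):
    """Store an entry only after safety checks pass, tagged with its tenant."""
    if not tenant_id:
        raise ValueError("tenant_id is required for every cache entry")
    if not safety_passed:
        return False  # never cache content that failed or skipped safety checks
    cache[key] = {"tenant_id": tenant_id, "value": value}
    return True


def guarded_read(cache, key, tenant_id):
    """Return the value only if the entry belongs to the requesting tenant."""
    entry = cache.get(key)
    if entry is None or entry["tenant_id"] != tenant_id:
        return None  # treat a tenant mismatch as a miss (and log it in practice)
    return entry["value"]


store = {}
guarded_write(store, "k1", "public FAQ answer", tenant_id="acme", safety_passed=True)
print(guarded_read(store, "k1", "acme"))    # the stored value
print(guarded_read(store, "k1", "globex"))  # None: wrong tenant
```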

10) Metrics that prove caching helped (without harming quality)

Track caching like a production feature: savings, hit rates, rejection/override rates, freshness violations, and offline eval regressions.

Metrics that prove ROI (and catch correctness regressions)
-- Cache metrics that actually prove ROI (cost + correctness)

-- Hit rate by layer
-- cache_hit_total{layer="retrieval"} / cache_req_total{layer="retrieval"}

-- Token savings (prompt vs completion)
-- prompt_tokens_saved_total, completion_tokens_saved_total

-- Correctness proxy: cache override (validation rejected cache)
-- cache_override_total{layer="response"} / cache_hit_total{layer="response"}

-- Freshness violations (should be ~0 for high-risk collections)
-- cached_response_cites_updated_doc_total / cached_response_total

-- Cost per Successful Task (CPS) should improve, not just cost/call
-- CPS = total_tokens_spent / successful_tasks

-- Tail latency by stage (secondary)
-- p95_retrieval_ms, p95_rerank_ms, p95_generation_ms
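
The ratios above are simple to compute once the counters exist. A minimal sketch, assuming a hypothetical `cache_report` helper fed with per-layer counters named as in the comments above; the numbers are invented for illustration.

```python
def cache_report(counters: dict) -> dict:
    """Derive hit rate, override rate, and cost per successful task from raw counters."""
    def ratio(num, den):
        return num / den if den else 0.0  # avoid division by zero on cold start

    return {
        "hit_rate": ratio(counters["cache_hit_total"], counters["cache_req_total"]),
        "override_rate": ratio(counters["cache_override_total"], counters["cache_hit_total"]),
        "cps_tokens": ratio(counters["total_tokens_spent"], counters["successful_tasks"]),
    }


# Invented counters for one layer over one reporting window.
report = cache_report({
    "cache_req_total": 1000,
    "cache_hit_total": 400,
    "cache_override_total": 20,
    "total_tokens_spent": 1_200_000,
    "successful_tasks": 800,
})
print(report)
```

Watching override rate next to hit rate is the point: a rising hit rate with a rising override rate means the cache is serving more entries that validation has to reject.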

11) Recommended rollout order (lowest risk → highest payoff)

  1. Embedding cache (doc + query embeddings)
  2. Retrieval candidates cache (query→chunk IDs)
  3. Rerank cache (query/chunk → score)
  4. Prompt prefix caching (stable scaffolding)
  5. Response caching only for strict low-risk public FAQs

At each step: add offline eval gates, permission checks, and doc-version freshness checks.

12) Shipping checklist (copy into your PR)

Keying & versioning

  • Keys include tenant/scope/model/prompt/retrieval/policy versions
  • Query canonicalization avoids collisions (don't over-normalize IDs)
  • Date anchors included for time-relative queries ("today")

Invalidation

  • TTL set per layer
  • Version bump invalidation on deploy
  • Doc update invalidation for policy/pricing (event-driven or partition)

Validation

  • Permission check before serving cached retrieval/response
  • Doc freshness check for cited docs
  • JSON/schema validation for structured outputs
  • Citation existence + clause fingerprint check for RAG answers

Observability

  • Hit rate, savings, and override rate tracked per layer
  • Cache-related incidents are loggable and searchable
  • Canary rollout + rollback plan

13) Closing: correctness-first caching

If you remember one rule, make it this: cache artifacts that are stable and easy to validate (embeddings, candidates, rerank scores) before you cache final answers. That's how you get meaningful cost reduction without turning caching into a reliability bug factory.

See our case study on inference cost reduction for a real-world example: caching, routing, and retrieval policy changes drove 25–60% cost reduction with before/after benchmarks.

Want TTLs + keys tailored to your stack?

Tell us your cache layer (Redis/Postgres), retrieval stack (OpenSearch/Elastic/pgvector/vector DB), and doc update frequency. We'll propose safe defaults (TTLs, key schema, invalidation strategy, validation gates) and ship it via AI Optimization or validate end-to-end in an AI System Audit.

FAQ

Questions readers usually ask next

Which cache layer should I ship first?

Start with embeddings (doc + query), then retrieval candidates (query→chunk IDs), then rerank score cache, then prompt prefix caching. Add response caching last and only for strict low-risk public FAQs, with short TTL and validation gates.

What's the most common correctness failure with caching?

Serving cached responses across the wrong tenant/scope/policy or after doc updates. The fix is safe key design (tenant + scope + versions), plus permission/freshness checks and event-driven invalidation for high-risk collections like pricing and policies.

Do I need event-driven invalidation or is TTL enough?

TTL is a baseline, not a safety mechanism. If you have freshness-critical docs (policy/pricing/security), you need version bump invalidation on deploy and event-driven invalidation (or at least partition invalidation) on doc updates.

Redis vs Postgres for LLM/RAG caching?

Redis is faster and better for high-throughput, short-TTL caches (retrieval, rerank). Postgres works for audit trails, longer retention, or when you need transactional consistency with other data. Use Redis for hot path; Postgres when you need durability or complex queries over cache metadata.

When is it safe to enable response cache?

Only for low-risk, non-personalized content: public FAQs, static help text, template-style answers. Never for account-state, personalized recommendations, or anything that depends on 'current' data (pricing, policies, inventory) unless you have strict validation gates and short TTL.

Last updated

March 7, 2026
