The core idea
Treat an LLM vendor migration like a production release. Prove parity by cohort, shadow test under real traffic patterns, and keep rollback active until the new provider is stable.
Switching LLM vendors is not a model swap. It is a production migration across prompts, evals, routing, latency, cost, safety, and rollback.
Teams usually start an LLM vendor migration for a reasonable reason: lower cost, better latency, stronger reasoning, better enterprise terms, regional availability, provider redundancy, or less lock-in. The mistake is treating the migration as a config update.
A new provider changes more than output style. It can change tokenization, instruction following, tool-call behavior, JSON adherence, refusal patterns, streaming behavior, timeout rate, rate limits, and cost shape. If you use RAG, agents, regulated workflows, or customer-facing automation, those differences can become production regressions quickly.
Why LLM vendor migrations break production
Production LLM behavior is the product of a full system: model, prompt, retrieval, context assembly, tools, validators, safety policies, serving limits, retries, and monitoring. A provider swap touches many of those layers at once.
The common failure modes are predictable:
- Prompt drift: instructions that worked with one provider become weaker, too strict, or overinterpreted by another.
- Structured output drift: JSON, XML, citations, function arguments, or schema fields fail at a different rate.
- Tool-call drift: the model chooses tools differently, calls too early, skips required tools, or generates subtly wrong arguments.
- RAG grounding drift: the new model uses retrieved evidence differently, overweights weak context, or refuses when answerable context exists.
- Safety drift: refusals, policy boundaries, and sensitive-topic behavior change in ways support and compliance did not approve.
- Latency drift: time-to-first-token, streaming cadence, retries, and p95 latency move even when average latency looks acceptable.
- Cost drift: lower unit price does not help if the new provider needs more tokens, more retries, or more fallback calls.
The migration rule: prove parity before traffic moves
The migration should not ask "is the new model better?" in the abstract. It should ask: Does the new provider meet or beat the current provider on the workflows we actually serve, under the constraints we actually operate?
That means parity gates before rollout:
- quality parity by use case, intent, language, tenant, document type, and risk level
- cost per successful task, not only cost per request
- p50, p95, and p99 latency, plus timeout and retry rates
- tool-call validity and workflow completion rate
- safety, compliance, and refusal correctness
- monitoring coverage and rollback path
Migration principle
A cheaper or stronger model is not production-ready until it clears the same release gates as any other user-visible change.
Phase 1: Define the migration target
Start by writing down why the migration exists. If the reason is vague, the rollout criteria will be vague too.
Choose the primary target:
- reduce cost per successful task by a specific percentage
- reduce p95 latency or timeout rate for specific workflows
- improve answer quality, groundedness, or tool completion rate
- add provider redundancy for availability or procurement risk
- meet data residency, compliance, security, or enterprise contract requirements
Then define non-negotiables. For example: no drop in groundedness for legal answers, no increase in PHI/PII exposure risk, no more than 5 percent p95 latency regression, no degradation in tool-call success, no unsupported region for regulated traffic.
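Non-negotiables are easier to enforce when they are written as checks rather than prose. The sketch below is one possible way to encode three of the example thresholds above. The `Metrics` fields, the `gate_failures` function, and all sample numbers are assumptions made for illustration, not part of any provider SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    groundedness: float       # fraction of answers judged grounded, 0..1
    tool_call_success: float  # fraction of tool calls that validated and completed, 0..1
    p95_latency_ms: float


def gate_failures(baseline, candidate, max_p95_regression=0.05):
    """Return the list of non-negotiables the candidate provider violates."""
    failures = []
    if candidate.groundedness < baseline.groundedness:
        failures.append("groundedness dropped")
    if candidate.tool_call_success < baseline.tool_call_success:
        failures.append("tool-call success dropped")
    # Allow at most a 5 percent p95 regression by default.
    if candidate.p95_latency_ms > baseline.p95_latency_ms * (1 + max_p95_regression):
        failures.append("p95 latency regressed more than allowed")
    return failures


# Illustrative numbers only.
baseline = Metrics(groundedness=0.92, tool_call_success=0.97, p95_latency_ms=1800.0)
candidate = Metrics(groundedness=0.93, tool_call_success=0.95, p95_latency_ms=1850.0)
print(gate_failures(baseline, candidate))
```

In this made-up case the candidate is better grounded and within the latency budget, but it still fails the gate because tool-call success dropped.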
Phase 2: Build the baseline before touching the provider
You cannot prove a migration worked if you do not know how the current system performs. Build the baseline from production-shaped examples, not handpicked demos.
Minimum baseline:
- current model and provider version
- prompt, policy, retriever, reranker, tool, and guardrail versions
- task success rate by cohort
- groundedness and citation validity for RAG workflows
- schema validity and tool-call success for agents
- cost per request and cost per successful task
- p50, p95, and p99 latency, plus timeout rate, retry rate, and fallback rate
- top failure categories with concrete examples
The baseline should include success cases and failure cases. A migration that only tests happy paths will miss the cases that actually drive support tickets.
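A baseline like this can be computed from request logs with very little code. The following is a minimal sketch that assumes each log record carries a `cohort`, `latency_ms`, `success`, and `timed_out` field; those field names, and the sample records, are hypothetical. It uses a simple nearest-rank percentile, which is one of several reasonable definitions.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def baseline_by_cohort(records):
    """Group log records by cohort and summarize success, timeout, and latency."""
    groups = defaultdict(list)
    for record in records:
        groups[record["cohort"]].append(record)

    report = {}
    for cohort, rows in groups.items():
        latencies = [row["latency_ms"] for row in rows]
        report[cohort] = {
            "n": len(rows),
            "success_rate": sum(row["success"] for row in rows) / len(rows),
            "timeout_rate": sum(row["timed_out"] for row in rows) / len(rows),
            "p50_ms": percentile(latencies, 50),
            "p95_ms": percentile(latencies, 95),
        }
    return report


# Hypothetical log records, including a failure case.
records = [
    {"cohort": "support_simple", "latency_ms": 400, "success": True, "timed_out": False},
    {"cohort": "support_simple", "latency_ms": 600, "success": True, "timed_out": False},
    {"cohort": "legal_rag", "latency_ms": 2100, "success": False, "timed_out": True},
    {"cohort": "legal_rag", "latency_ms": 900, "success": True, "timed_out": False},
]
report = baseline_by_cohort(records)
print(report["legal_rag"])
```

Real baselines need far more records per cohort than this toy sample, especially for p95 and p99 to mean anything.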
Phase 3: Map provider compatibility risks
Before running evals, map how the new provider differs from the current provider. This makes debugging much faster when parity fails.
| Risk area | What to check | Failure signal |
|---|---|---|
| Prompt semantics | system messages, hierarchy, examples, refusal instructions | same prompt produces different policy or formatting behavior |
| Structured output | JSON mode, schema constraints, citation format, streaming chunks | parser errors, missing fields, invalid citations |
| Tool calling | tool choice, argument generation, validation, retries | wrong tool, wrong arguments, looped calls |
| Context limits | tokenization, max context, truncation behavior, prompt caching | lost instructions, higher cost, answer quality drop on long docs |
| Serving limits | rate limits, concurrency, timeout defaults, regional availability | p95 spikes, retries, queueing, incident during peak traffic |
| Governance | data retention, logging, training use, audit trails, access control | security approval blocked after engineering work is complete |
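
For the structured-output row, a small validator run over saved outputs from both providers makes the failure signal countable. This is a sketch under stated assumptions: the two-field schema in `REQUIRED_FIELDS` and the sample outputs are invented for illustration, and a real system would likely use its existing schema validator instead.

```python
import json

# Hypothetical output contract: an answer string and a list of citations.
REQUIRED_FIELDS = {"answer": str, "citations": list}


def validate_output(raw):
    """Return None when the output is valid, otherwise a short failure label."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "parse_error"
    if not isinstance(data, dict):
        return "not_an_object"
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            return f"missing:{field}"
        if not isinstance(data[field], expected_type):
            return f"wrong_type:{field}"
    return None


def failure_counts(outputs):
    """Count failures by label so two providers can be compared on the same inputs."""
    counts = {}
    for raw in outputs:
        label = validate_output(raw)
        if label is not None:
            counts[label] = counts.get(label, 0) + 1
    return counts


# Invented sample outputs covering common drift patterns.
samples = [
    '{"answer": "Yes", "citations": ["doc-1"]}',   # valid
    '{"answer": "Yes"}',                           # field dropped
    'Sure! {"answer": "Yes", "citations": []}',    # chatty preamble breaks the parser
    '{"answer": "Yes", "citations": "doc-1"}',     # string where a list is expected
]
print(failure_counts(samples))
```

Running the same function over each provider's outputs for the same prompts gives a per-label failure rate instead of a general impression that "JSON seems fine."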
Phase 4: Run offline evals by cohort
Offline evals are where you find migration regressions before users do. The important part is cohort coverage.
Split the eval set into meaningful slices:
- simple questions vs multi-step workflows
- RAG answers with clear evidence vs ambiguous evidence
- high-value customers or regulated tenants
- languages and locales
- short context vs long context
- tool-required vs no-tool-required requests
- known failure modes from support tickets or incident reviews
Score both providers side by side. If the new provider wins overall but loses on one critical cohort, the migration is not ready. You may still use the new provider for low-risk routes, but do not claim full replacement parity.
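The "wins overall, loses on a critical cohort" case is easy to demonstrate. The sketch below assumes each provider already has one score per cohort on a 0 to 1 scale. The cohort names, scores, tolerance, and the `pass` / `review` / `block` labels are all illustrative assumptions.

```python
def cohort_verdicts(old_scores, new_scores, critical, tolerance=0.01):
    """Compare providers cohort by cohort instead of on the aggregate."""
    verdicts = {}
    for cohort, old in old_scores.items():
        new = new_scores[cohort]
        if new >= old - tolerance:
            verdicts[cohort] = "pass"
        elif cohort in critical:
            verdicts[cohort] = "block"   # regression on a critical cohort
        else:
            verdicts[cohort] = "review"  # regression on a non-critical cohort
    return verdicts


def ready_for_full_replacement(verdicts):
    return all(verdict == "pass" for verdict in verdicts.values())


# Illustrative scores: the new provider has the higher unweighted mean.
old_scores = {"simple_support": 0.90, "multilingual": 0.85, "long_context_legal": 0.88}
new_scores = {"simple_support": 0.97, "multilingual": 0.87, "long_context_legal": 0.80}

verdicts = cohort_verdicts(old_scores, new_scores, critical={"long_context_legal"})
print(verdicts)
print(ready_for_full_replacement(verdicts))
```

Here the new provider's average is higher, yet the legal cohort blocks full replacement. Cohorts marked `pass` are the candidates for low-risk routing.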
Phase 5: Shadow test under production-shaped load
Offline evals do not show every serving issue. Shadow testing lets the new provider process real traffic patterns without controlling the user-visible answer.
During shadow testing, compare:
- answer score against the current provider
- tool-call decision and argument differences
- retrieved-context usage and citation choice
- latency distribution, especially p95 and p99
- timeout, retry, and rate-limit behavior
- cost per successful shadowed task
- safety refusals and policy boundary differences
Keep shadow logs privacy-safe. You need enough trace detail to diagnose differences, but not more sensitive data than your retention and access policies allow.
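A shadow path can be sketched as follows: the user always receives the current provider's answer, while the new provider runs in the background and only its comparison record is kept. Everything here is an assumption for illustration: `old_provider` and `new_provider` are stand-in functions, not real SDK calls, exact string equality is a crude placeholder for a real answer scorer, and storing a SHA-256 hash instead of the raw prompt is just one possible way to limit what the log holds. Whether hashing satisfies your retention policy is for your security team to decide.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

_shadow_pool = ThreadPoolExecutor(max_workers=4)
shadow_log = []  # stand-in for a real log sink


def _run_shadow(request_id, prompt, primary_answer, shadow_call):
    """Call the shadow provider and record a comparison; never raise to the caller."""
    started = time.perf_counter()
    try:
        shadow_answer, error = shadow_call(prompt), None
    except Exception as exc:  # shadow failures are data, not incidents
        shadow_answer, error = None, type(exc).__name__
    shadow_log.append({
        "request_id": request_id,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "answers_match": shadow_answer == primary_answer,
        "shadow_error": error,
        "shadow_latency_ms": (time.perf_counter() - started) * 1000,
    })


def handle_request(request_id, prompt, primary_call, shadow_call):
    """Serve from the primary provider; run the shadow provider off the request path."""
    answer = primary_call(prompt)
    future = _shadow_pool.submit(_run_shadow, request_id, prompt, answer, shadow_call)
    return answer, future


# Stand-in providers: the shadow one simulates a timeout.
def old_provider(prompt):
    return "42"


def new_provider(prompt):
    raise TimeoutError("simulated shadow timeout")


answer, pending = handle_request("req-1", "What is 6 x 7?", old_provider, new_provider)
pending.result()  # wait only so this demo can inspect the log
print(answer, shadow_log[0]["shadow_error"])
```

The design point is that a shadow timeout or exception shows up as a log entry and never changes what the user sees.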
Phase 6: Roll out with routing, fallbacks, and rollback
The safest migration is rarely "all traffic moves on Friday." Use route-level control and expand exposure in stages, in this order:
- Internal traffic: employees, test tenants, and non-customer-facing workflows.
- Low-risk canary: simple intents, no regulated content, no write actions.
- Cohort expansion: add traffic by use case only after metrics hold.
- High-risk workflows: RAG, agents, compliance-sensitive responses, and write actions last.
- Default route switch: only after rollback and fallback have been exercised.
Keep the old provider available until you have enough production evidence that the new provider is stable. Removing rollback too early turns a migration into a lock-in event.
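Route-level control can be as simple as a per-workflow percentage plus a rollback switch. The sketch below is one way to do it under stated assumptions: the `ROUTES` table, its workflow names, and its percentages are hypothetical, and hashing the workflow and user ID is used so the same user stays on the same provider between requests.

```python
import hashlib

# Hypothetical routing table: percentage of each workflow sent to the new provider.
ROUTES = {
    "internal_tools": {"new_provider_pct": 100, "rollback": False},
    "simple_faq": {"new_provider_pct": 10, "rollback": False},
    "legal_rag": {"new_provider_pct": 0, "rollback": False},
}


def bucket(key):
    """Map a key to a stable bucket in 0..99."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def choose_provider(workflow, user_id, routes=ROUTES):
    """Return "new" or "old" for this request; unknown workflows stay on the old provider."""
    route = routes.get(workflow)
    if route is None or route["rollback"]:
        return "old"
    if bucket(f"{workflow}:{user_id}") < route["new_provider_pct"]:
        return "new"
    return "old"


print(choose_provider("internal_tools", "user-1"))
print(choose_provider("legal_rag", "user-1"))
```

Setting `rollback` to `True` for one workflow sends that workflow back to the old provider without touching the others, which is the behavior to rehearse before the default route changes.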
The checklist
Planning
- Migration goal is written and measurable.
- Non-negotiable quality, latency, cost, and safety thresholds are defined.
- Security, legal, procurement, and data governance constraints are reviewed before implementation.
- Owner is assigned for prompts, evals, serving, observability, security, and rollout.
Baseline
- Current provider performance is measured by cohort.
- Golden dataset includes real production logs, support escalations, and known failures.
- Cost per successful task is known for the current provider.
- Latency, timeout, retry, and fallback baselines are captured.
Compatibility
- Prompt behavior differences are tested.
- Structured output and parser compatibility are tested.
- Tool schemas, validators, and error handling are tested.
- Context window, truncation, streaming, and tokenization differences are reviewed.
- Rate limits, regional availability, and concurrency limits are documented.
Evaluation
- Offline evals compare old and new provider side by side.
- Results are broken down by cohort, not only aggregate score.
- Human review covers ambiguous, high-risk, and compliance-sensitive cases.
- Regression gates are added before rollout.
Rollout
- Shadow test runs before user-visible traffic moves.
- Canary starts with low-risk cohorts.
- Routing policy can send specific workflows back to the old provider.
- Fallback behavior is tested under failure, timeout, and rate-limit conditions.
- Rollback is documented, owned, and rehearsed.
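
Fallback behavior is only trustworthy if it has been exercised with injected failures. The following is a minimal sketch: `ProviderTimeout`, `ProviderRateLimited`, and `call_with_fallback` are hypothetical names, and the fake providers stand in for real clients whose SDKs raise their own exception types.

```python
class ProviderTimeout(Exception):
    """Hypothetical error raised when a provider call times out."""


class ProviderRateLimited(Exception):
    """Hypothetical error raised when a provider rejects a call for rate limiting."""


def call_with_fallback(prompt, primary, fallback, metrics):
    """Try the primary provider; on timeout or rate limit, count it and use the fallback."""
    try:
        result = primary(prompt)
        metrics["primary_ok"] = metrics.get("primary_ok", 0) + 1
        return result
    except (ProviderTimeout, ProviderRateLimited) as exc:
        key = f"fallback_{type(exc).__name__}"
        metrics[key] = metrics.get(key, 0) + 1
        return fallback(prompt)


# Fake providers used to inject failures in a test.
def timing_out_provider(prompt):
    raise ProviderTimeout()


def rate_limited_provider(prompt):
    raise ProviderRateLimited()


def healthy_provider(prompt):
    return f"answer to: {prompt}"


metrics = {}
call_with_fallback("q1", timing_out_provider, healthy_provider, metrics)
call_with_fallback("q2", rate_limited_provider, healthy_provider, metrics)
call_with_fallback("q3", healthy_provider, healthy_provider, metrics)
print(metrics)
```

Counting fallbacks by cause keeps the fallback rate visible per provider, so a quiet rise in rate-limit fallbacks does not pass as normal traffic.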
Monitoring
- Dashboards show quality, cost, latency, timeout, retry, and fallback metrics by provider.
- Support tickets and user feedback are labeled by provider version.
- Alerts trigger on cohort-level regressions, not just global averages.
- Post-migration review compares expected gains with actual production results.
Common migration traps
Trap 1: Optimizing for unit price instead of task economics
A provider can be cheaper per token and still more expensive per successful task if it requires longer prompts, more retries, or more human escalations.
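A short worked example makes the arithmetic concrete. All numbers below are invented for illustration, and the model is deliberately simple: it assumes each retry costs one extra call at the same token count and ignores human escalation cost.

```python
def cost_per_successful_task(requests, tokens_per_request, price_per_1k_tokens,
                             retry_rate, success_rate):
    """Total spend, including retries, divided by the number of tasks that succeeded."""
    total_calls = requests * (1 + retry_rate)
    total_cost = total_calls * tokens_per_request / 1000 * price_per_1k_tokens
    return total_cost / (requests * success_rate)


# Illustrative inputs: the new provider is 20 percent cheaper per token,
# but needs longer prompts, retries more often, and succeeds less often.
old_cost = cost_per_successful_task(
    requests=1000, tokens_per_request=2000, price_per_1k_tokens=0.010,
    retry_rate=0.02, success_rate=0.95,
)
new_cost = cost_per_successful_task(
    requests=1000, tokens_per_request=2600, price_per_1k_tokens=0.008,
    retry_rate=0.15, success_rate=0.85,
)
print(f"old: ${old_cost:.4f} per successful task")
print(f"new: ${new_cost:.4f} per successful task")
```

With these made-up inputs, the provider with the lower token price ends up costing more per successful task.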
Trap 2: Testing only aggregate quality
Aggregate scores hide the failures that matter. Look at cohorts. The new provider may perform well on simple support questions and poorly on long-context legal or finance queries.
Trap 3: Forgetting tool and schema behavior
Many migrations pass free-text answer tests and fail in production because function arguments, validators, and parsers behave differently.
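One way to catch this early is to validate every tool call the candidate provider emits against the tool contract, separately from any answer-quality score. The sketch below assumes a hypothetical `lookup_order` tool and a simple dictionary spec. Real systems would typically validate against their existing JSON Schema or typed models.

```python
# Hypothetical tool contract: argument names mapped to expected Python types.
TOOLS = {
    "lookup_order": {
        "required": {"order_id": str},
        "optional": {"include_items": bool},
    },
}


def validate_tool_call(call):
    """Return a list of problems with a model-emitted tool call; an empty list means valid."""
    spec = TOOLS.get(call.get("name"))
    if spec is None:
        return [f"unknown tool: {call.get('name')}"]

    errors = []
    args = call.get("arguments", {})
    allowed = {**spec["required"], **spec["optional"]}
    for name in spec["required"]:
        if name not in args:
            errors.append(f"missing argument: {name}")
    for name, value in args.items():
        if name not in allowed:
            errors.append(f"unexpected argument: {name}")
        elif not isinstance(value, allowed[name]):
            errors.append(f"wrong type for {name}")
    return errors


good_call = {"name": "lookup_order", "arguments": {"order_id": "A-1"}}
drifted_call = {"name": "lookup_order", "arguments": {"order_id": 17, "verbose": True}}
print(validate_tool_call(good_call))
print(validate_tool_call(drifted_call))
```

The second call would read as plausible in a transcript, but it fails on a numeric `order_id` and an argument the tool never defined.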
Trap 4: Moving retrieval and model at the same time
If you change provider, embedding model, chunking, reranking, and prompt at once, you will not know what caused the regression. Change one variable at a time where possible, and re-run the same evals after each change.
Trap 5: Removing the old provider too soon
Keep rollback available until the new path has enough production evidence across normal load, peak load, edge cases, and incident conditions.
What good looks like
A strong LLM vendor migration ends with a decision memo, not a vibe. The memo should state:
- why the migration was attempted
- which cohorts were tested
- where the new provider beat, matched, or lost to the current provider
- what prompt, routing, schema, or serving changes were required
- what traffic is allowed to move now
- what traffic should stay on the old provider
- what monitoring and rollback controls are active
That is the standard: not "we switched models," but "we changed providers with measured parity, controlled exposure, and a rollback path that still works."
FAQ
Questions readers usually ask next
What should we test before switching LLM vendors?
Test task success, groundedness, citation quality, tool-call correctness, schema adherence, refusal behavior, latency, timeout rate, retry behavior, cost per successful task, and safety outcomes. Break results down by cohort so one high-volume segment does not hide failures in a regulated or high-value use case.
Can we migrate by changing the model name in our API wrapper?
You can, but that is the highest-risk way to migrate. Different providers can behave differently on instruction following, JSON validity, tool calling, context length, streaming, safety refusals, tokenization, latency, and rate limits. Treat the change like a production release with eval gates and rollback.
How much traffic should go to the new model first?
Start with shadow traffic or internal traffic, then move to a small canary for low-risk cohorts. Increase exposure only after the new provider meets quality, cost, latency, and safety thresholds for the cohorts receiving traffic.
What is the most common LLM migration failure?
The most common failure is aggregate parity without cohort parity. The new model looks fine overall but fails on specific intents, languages, document types, tool calls, or high-risk workflows that were underrepresented in the test set.
Migration risk
Provider swaps often fail because teams test answer quality but miss schema adherence, tool calls, latency distribution, and cohort-specific regressions.
Best control
Keep provider routing explicit: old provider, new provider, fallback, and rollback paths should all be observable during the migration.
Last updated
May 8, 2026




