LLM Vendor Migration Checklist: Switching Models Without Breaking Production

The core idea

Treat an LLM vendor migration like a production release. Prove parity by cohort, shadow test under real traffic patterns, and keep rollback active until the new provider is stable.

Switching LLM vendors is not a model swap. It is a production migration across prompts, evals, routing, latency, cost, safety, and rollback.

Teams usually start an LLM vendor migration for a reasonable reason: lower cost, better latency, stronger reasoning, better enterprise terms, regional availability, provider redundancy, or less lock-in. The mistake is treating the migration as a config update.

A new provider changes more than output style. It can change tokenization, instruction following, tool-call behavior, JSON adherence, refusal patterns, streaming behavior, timeout rate, rate limits, and cost shape. If you use RAG, agents, regulated workflows, or customer-facing automation, those differences can become production regressions quickly.

Why LLM vendor migrations break production

Production LLM behavior is the product of a full system: model, prompt, retrieval, context assembly, tools, validators, safety policies, serving limits, retries, and monitoring. A provider swap touches many of those layers at once.

The common failure modes are predictable:

Prompt drift: instructions that worked with one provider become weaker, too strict, or overinterpreted by another.
Structured output drift: JSON, XML, citations, function arguments, or schema fields fail at a different rate.
Tool-call drift: the model chooses tools differently, calls too early, skips required tools, or generates subtly wrong arguments.
RAG grounding drift: the new model uses retrieved evidence differently, overweights weak context, or refuses when answerable context exists.
Safety drift: refusals, policy boundaries, and sensitive-topic behavior change in ways support and compliance did not approve.
Latency drift: time-to-first-token, streaming cadence, retries, and p95 latency move even when average latency looks acceptable.
Cost drift: lower unit price does not help if the new provider needs more tokens, more retries, or more fallback calls.

The migration rule: prove parity before traffic moves

The migration should not ask "is the new model better?" in the abstract. It should ask: Does the new provider meet or beat the current provider on the workflows we actually serve, under the constraints we actually operate?

That means parity gates before rollout:

quality parity by use case, intent, language, tenant, document type, and risk level
cost per successful task, not only cost per request
p50, p95, p99, timeout, and retry rates
tool-call validity and workflow completion rate
safety, compliance, and refusal correctness
monitoring coverage and rollback path

Migration principle

A cheaper or stronger model is not production-ready until it clears the same release gates as any other user-visible change.

Phase 1: Define the migration target

Start by writing down why the migration exists. If the reason is vague, the rollout criteria will be vague too.

Choose the primary target:

reduce cost per successful task by a specific percentage
reduce p95 latency or timeout rate for specific workflows
improve answer quality, groundedness, or tool completion rate
add provider redundancy for availability or procurement risk
meet data residency, compliance, security, or enterprise contract requirements

Then define non-negotiables. For example: no drop in groundedness for legal answers, no increase in PHI/PII exposure risk, no more than a 5 percent p95 latency regression, no degradation in tool-call success, and no regulated traffic routed to a region the provider does not support.
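Non-negotiables are easier to enforce when they are written as checks rather than prose. The sketch below is one possible shape, not a prescribed implementation: the `CohortMetrics` fields, the `gate_failures` function, and the `MAX_P95_REGRESSION` constant are hypothetical names, and the 5 percent limit simply mirrors the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CohortMetrics:
    groundedness: float       # fraction of answers judged grounded, 0..1
    tool_call_success: float  # fraction of valid, completed tool calls, 0..1
    p95_latency_ms: float


# Hypothetical limit; 0.05 mirrors the "5 percent p95 regression" example.
MAX_P95_REGRESSION = 0.05


def gate_failures(current: CohortMetrics, candidate: CohortMetrics) -> list[str]:
    """Return the non-negotiables the candidate violates (empty list = pass)."""
    failures = []
    if candidate.groundedness < current.groundedness:
        failures.append("groundedness dropped")
    if candidate.tool_call_success < current.tool_call_success:
        failures.append("tool-call success degraded")
    if candidate.p95_latency_ms > current.p95_latency_ms * (1 + MAX_P95_REGRESSION):
        failures.append("p95 latency regressed more than 5 percent")
    return failures
```

Running the same function once per cohort keeps the criteria identical for every slice of traffic, and the returned list gives the rollout owner a concrete reason when a gate blocks.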

Phase 2: Build the baseline before touching the provider

You cannot prove a migration worked if you do not know how the current system performs. Build the baseline from production-shaped examples, not handpicked demos.

Minimum baseline:

current model and provider version
prompt, policy, retriever, reranker, tool, and guardrail versions
task success rate by cohort
groundedness and citation validity for RAG workflows
schema validity and tool-call success for agents
cost per request and cost per successful task
p50, p95, p99, timeout rate, retry rate, and fallback rate
top failure categories with concrete examples

The baseline should include success cases and failure cases. A migration that only tests happy paths will miss the cases that actually drive support tickets.
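As a rough illustration of the latency part of that baseline, the sketch below computes percentiles and rates from request logs. It assumes each log record is a dict with `latency_ms`, `timed_out`, and `retries` fields; those field names are hypothetical, and the nearest-rank percentile is just one common definition.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_baseline(requests):
    """Summarize latency, timeout rate, and retry rate for a list of request records."""
    latencies = [r["latency_ms"] for r in requests]
    n = len(requests)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "timeout_rate": sum(r["timed_out"] for r in requests) / n,
        # Share of requests that needed at least one retry.
        "retry_rate": sum(r["retries"] > 0 for r in requests) / n,
    }
```

Computing the same summary per cohort, rather than once over all traffic, makes it directly comparable with the candidate provider later.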

Phase 3: Map provider compatibility risks

Before running evals, map how the new provider differs from the current provider. This makes debugging much faster when parity fails.

Risk area	What to check	Failure signal
Prompt semantics	system messages, hierarchy, examples, refusal instructions	same prompt produces different policy or formatting behavior
Structured output	JSON mode, schema constraints, citation format, streaming chunks	parser errors, missing fields, invalid citations
Tool calling	tool choice, argument generation, validation, retries	wrong tool, wrong arguments, looped calls
Context limits	tokenization, max context, truncation behavior, prompt caching	lost instructions, higher cost, answer quality drop on long docs
Serving limits	rate limits, concurrency, timeout defaults, regional availability	p95 spikes, retries, queueing, incident during peak traffic
Governance	data retention, logging, training use, audit trails, access control	security approval blocked after engineering work is complete
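
The failure signals in the structured-output row can be counted directly from raw responses. The following is a minimal sketch under stated assumptions: responses are expected to be JSON objects, and the `REQUIRED_FIELDS` set (`answer`, `citations`) is a hypothetical schema standing in for whatever your parser requires.

```python
import json
from collections import Counter

REQUIRED_FIELDS = {"answer", "citations"}  # hypothetical response schema


def classify_output(raw: str) -> str:
    """Label one raw model response with the first failure signal it shows."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return "parser_error"
    if not isinstance(parsed, dict):
        return "not_an_object"
    if REQUIRED_FIELDS - set(parsed):
        return "missing_fields"
    return "ok"


def signal_rates(raw_outputs: list[str]) -> dict[str, float]:
    """Fraction of responses per label, for side-by-side provider comparison."""
    counts = Counter(classify_output(raw) for raw in raw_outputs)
    return {label: n / len(raw_outputs) for label, n in counts.items()}
```

Running `signal_rates` on the same prompts for both providers turns "parser errors, missing fields" into rates you can compare before any eval scoring starts.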

Phase 4: Run offline evals by cohort

Offline evals are where you find migration regressions before users do. The important part is cohort coverage.

Split the eval set into meaningful slices:

simple questions vs multi-step workflows
RAG answers with clear evidence vs ambiguous evidence
high-value customers or regulated tenants
languages and locales
short context vs long context
tool-required vs no-tool-required requests
known failure modes from support tickets or incident reviews

Score both providers side by side. If the new provider wins overall but loses on one critical cohort, the migration is not ready. You may still use the new provider for low-risk routes, but do not claim full replacement parity.
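
A small sketch of that side-by-side comparison follows. It assumes eval results arrive as `(cohort, passed)` pairs from whatever grader you use; the function names and the `tolerance` parameter are illustrative, not part of any particular eval framework.

```python
from collections import defaultdict


def success_by_cohort(results):
    """results: iterable of (cohort, passed) pairs -> {cohort: success rate}."""
    totals = defaultdict(lambda: [0, 0])  # cohort -> [passed, total]
    for cohort, passed in results:
        totals[cohort][0] += int(passed)
        totals[cohort][1] += 1
    return {cohort: p / n for cohort, (p, n) in totals.items()}


def regressed_cohorts(current, candidate, tolerance=0.0):
    """Cohorts where the candidate scores below the current provider by more than tolerance."""
    cur = success_by_cohort(current)
    new = success_by_cohort(candidate)
    # A cohort missing from the candidate run counts as a regression.
    return sorted(c for c in cur if new.get(c, 0.0) < cur[c] - tolerance)
```

With 100 simple questions and 10 long-context legal questions, a candidate can improve the aggregate score while still appearing in `regressed_cohorts` for the legal slice, which is exactly the case the aggregate hides.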

Phase 5: Shadow test under production-shaped load

Offline evals do not show every serving issue. Shadow testing lets the new provider process real traffic patterns without controlling the user-visible answer.

During shadow testing, compare:

answer score against the current provider
tool-call decision and argument differences
retrieved-context usage and citation choice
latency distribution, especially p95 and p99
timeout, retry, and rate-limit behavior
cost per successful shadowed task
safety refusals and policy boundary differences

Keep shadow logs privacy-safe. You need enough trace detail to diagnose differences, but not more sensitive data than your retention and access policies allow.
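
One way to wire this up is sketched below. It assumes providers are plain callables that take a prompt and return text, and that `log` is any callable that accepts a dict; all names here are hypothetical. The shadow call runs on a background thread so it cannot delay or break the user-visible answer, and the log record stores a hash of the prompt rather than the prompt itself. Whether hashing is sufficient depends on your own retention and access policies.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

_shadow_pool = ThreadPoolExecutor(max_workers=4)


def handle_request(prompt, primary, shadow, log):
    """Serve from the primary provider; run the shadow provider off the request path."""
    t0 = time.perf_counter()
    answer = primary(prompt)  # user-visible path; errors propagate as before
    primary_ms = (time.perf_counter() - t0) * 1000

    def run_shadow():
        record = {
            "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
            "primary_ms": primary_ms,
            "shadow_error": None,
        }
        t1 = time.perf_counter()
        try:
            record["outputs_match"] = shadow(prompt) == answer
        except Exception as exc:  # shadow failures are data, never user-facing
            record["shadow_error"] = type(exc).__name__
        record["shadow_ms"] = (time.perf_counter() - t1) * 1000
        log(record)

    # The returned future is only for tests and draining; callers can ignore it.
    return answer, _shadow_pool.submit(run_shadow)
```

Exact string equality is a crude comparison; in practice the `outputs_match` field would likely be replaced by the same scoring used in the offline evals.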

Phase 6: Roll out with routing, fallbacks, and rollback

The safest migration is rarely "all traffic moves on Friday." Use route-level control.

Internal traffic: employees, test tenants, and non-customer-facing workflows.
Low-risk canary: simple intents, no regulated content, no write actions.
Cohort expansion: add traffic by use case only after metrics hold.
High-risk workflows: RAG, agents, compliance-sensitive responses, and write actions last.
Default route switch: only after rollback and fallback have been exercised.

Keep the old provider available until you have enough production evidence that the new provider is stable. Removing rollback too early turns a migration into a lock-in event.
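
Route-level control can be as simple as a per-workflow traffic share plus a kill switch. The sketch below is a minimal illustration, assuming providers are callables keyed by `"old"` and `"new"`; `RoutePolicy`, `call_with_fallback`, and the workflow names are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class RoutePolicy:
    # workflow name -> fraction of traffic (0..1) sent to the new provider
    new_provider_share: dict = field(default_factory=dict)
    rollback: bool = False  # kill switch: send everything to the old provider

    def choose(self, workflow: str, request_id: str) -> str:
        if self.rollback:
            return "old"
        share = self.new_provider_share.get(workflow, 0.0)  # unknown workflows stay put
        digest = hashlib.sha256(f"{workflow}:{request_id}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") / 2**32  # stable value in [0, 1)
        return "new" if bucket < share else "old"


def call_with_fallback(policy, workflow, request_id, prompt, providers):
    """Return (provider_used, answer); fall back to the old provider if the new one fails."""
    choice = policy.choose(workflow, request_id)
    try:
        return choice, providers[choice](prompt)
    except Exception:
        if choice == "new":
            return "old", providers["old"](prompt)
        raise
```

Hashing the request ID keeps a given request on the same side of the split across retries, and returning the provider that actually answered makes fallback usage visible in logs and dashboards.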

The checklist

Planning

Migration goal is written and measurable.
Non-negotiable quality, latency, cost, and safety thresholds are defined.
Security, legal, procurement, and data governance constraints are reviewed before implementation.
Owner is assigned for prompts, evals, serving, observability, security, and rollout.

Baseline

Current provider performance is measured by cohort.
Golden dataset includes real production logs, support escalations, and known failures.
Cost per successful task is known for the current provider.
Latency, timeout, retry, and fallback baselines are captured.

Compatibility

Prompt behavior differences are tested.
Structured output and parser compatibility are tested.
Tool schemas, validators, and error handling are tested.
Context window, truncation, streaming, and tokenization differences are reviewed.
Rate limits, regional availability, and concurrency limits are documented.

Evaluation

Offline evals compare old and new provider side by side.
Results are broken down by cohort, not only aggregate score.
Human review covers ambiguous, high-risk, and compliance-sensitive cases.
Regression gates are added before rollout.

Rollout

Shadow test runs before user-visible traffic moves.
Canary starts with low-risk cohorts.
Routing policy can send specific workflows back to the old provider.
Fallback behavior is tested under failure, timeout, and rate-limit conditions.
Rollback is documented, owned, and rehearsed.

Monitoring

Dashboards show quality, cost, latency, timeout, retry, and fallback metrics by provider.
Support tickets and user feedback are labeled by provider version.
Alerts trigger on cohort-level regressions, not just global averages.
Post-migration review compares expected gains with actual production results.

Common migration traps

Trap 1: Optimizing for unit price instead of task economics

A provider can be cheaper per token and still more expensive per successful task if it requires longer prompts, more retries, or more human escalations.
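
The arithmetic is short enough to sketch. The function below is an illustrative model, not a billing formula: it assumes a single blended token price and ignores human escalation cost, and every number in the example is made up to show the effect.

```python
def cost_per_successful_task(price_per_1k_tokens, tokens_per_attempt,
                             attempts_per_task, success_rate):
    """Expected spend divided by the fraction of tasks that actually succeed."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_task = price_per_1k_tokens * tokens_per_attempt / 1000 * attempts_per_task
    return cost_per_task / success_rate


# Illustrative numbers only: the candidate is 40 percent cheaper per token,
# but needs longer prompts, more retries, and succeeds less often.
current = cost_per_successful_task(0.010, 2000, 1.05, 0.95)
candidate = cost_per_successful_task(0.006, 3200, 1.30, 0.85)
```

With these inputs the current provider comes out near $0.022 per successful task and the candidate near $0.029, so the cheaper unit price is the more expensive choice.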

Trap 2: Testing only aggregate quality

Aggregate scores hide the failures that matter. Look at cohorts. The new provider may perform well on simple support questions and poorly on long-context legal or finance queries.

Trap 3: Forgetting tool and schema behavior

Many migrations pass free-text answer tests and fail in production because function arguments, validators, and parsers behave differently.
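
A cheap guard is to validate every model-generated tool call against the tool's expected arguments during evals and shadow runs. The sketch below assumes tool arguments arrive as a JSON string; the `TOOLS` registry and its tool names are hypothetical, and a real system would more likely validate against its existing schemas.

```python
import json

# Hypothetical tool registry: tool name -> {argument name: expected Python type}
TOOLS = {
    "lookup_order": {"order_id": str},
    "issue_refund": {"order_id": str, "amount_cents": int},
}


def validate_tool_call(name: str, raw_arguments: str) -> list[str]:
    """Return the problems found in a model-generated tool call (empty list = valid)."""
    if name not in TOOLS:
        return [f"unknown tool: {name}"]
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return ["arguments are not valid JSON"]
    if not isinstance(args, dict):
        return ["arguments are not a JSON object"]
    spec = TOOLS[name]
    problems = [f"missing argument: {k}" for k in spec if k not in args]
    problems += [f"unexpected argument: {k}" for k in args if k not in spec]
    problems += [
        f"wrong type for {k}: expected {t.__name__}"
        for k, t in spec.items()
        # bool is a subclass of int in Python, so reject it explicitly.
        if k in args and (not isinstance(args[k], t) or isinstance(args[k], bool))
    ]
    return problems
```

A call such as `issue_refund` with `"amount_cents": "12.50"` reads fine in a free-text review but fails here, which is the kind of subtle argument drift that answer-quality tests miss.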

Trap 4: Moving retrieval and model at the same time

If you change provider, embedding model, chunking, reranking, and prompt at once, you will not know what caused the regression. Change fewer variables where possible.

Trap 5: Removing the old provider too soon

Keep rollback available until the new path has enough production evidence across normal load, peak load, edge cases, and incident conditions.

What good looks like

A strong LLM vendor migration ends with a decision memo, not a vibe. The memo should state:

why the migration was attempted
which cohorts were tested
where the new provider beat, matched, or lost to the current provider
what prompt, routing, schema, or serving changes were required
what traffic is allowed to move now
what traffic should stay on the old provider
what monitoring and rollback controls are active

That is the standard: not "we switched models," but "we changed providers with measured parity, controlled exposure, and a rollback path that still works."

FAQ

Questions readers usually ask next

What should we test before switching LLM vendors?

Test task success, groundedness, citation quality, tool-call correctness, schema adherence, refusal behavior, latency, timeout rate, retry behavior, cost per successful task, and safety outcomes. Break results down by cohort so one high-volume segment does not hide failures in a regulated or high-value use case.

Can we migrate by changing the model name in our API wrapper?

That is the highest-risk version of a migration. Different providers can behave differently on instruction following, JSON validity, tool calling, context length, streaming, safety refusals, tokenization, latency, and rate limits. Treat the change like a production release with eval gates and rollback.

How much traffic should go to the new model first?

Start with shadow traffic or internal traffic, then move to a small canary for low-risk cohorts. Increase exposure only after the new provider meets quality, cost, latency, and safety thresholds for the cohorts receiving traffic.

What is the most common LLM migration failure?

The most common failure is aggregate parity without cohort parity. The new model looks fine overall, but fails on specific intents, languages, document types, tool calls, or high-risk workflows that were underrepresented in the test set.

Migration risk

Provider swaps often fail because teams test answer quality but miss schema adherence, tool calls, latency distribution, and cohort-specific regressions.

Best control

Keep provider routing explicit: old provider, new provider, fallback, and rollback paths should all be observable during the migration.

Last updated

May 8, 2026
