The core idea
Peak season pain is rarely "more traffic." It's usually constraints that were already present — revealed by growth.
This is an anonymized case study from a long-running e-commerce business. On normal days the site was "mostly fine" — but peak sales were a nightmare: lag, timeouts, and revenue loss. The team kept upgrading servers, but the pain returned every season.
Anonymized but real
Names, exact volumes, and identifying details are removed. The process and metrics are preserved to show what actually changed — and how we validated it.
Executive summary
The client operated an e-commerce platform for more than 8 years. Over time, feature growth and data accumulation made the system increasingly fragile. During seasonal promotions, the website slowed dramatically and checkout often timed out.
We conducted a baseline audit, identified the real constraints (legacy runtime + query inefficiency + infrastructure quality), and delivered a modernization plan: migrate from PHP to FastAPI, migrate from MySQL to PostgreSQL with full historical retention, and rebuild deployment + observability for reliable operations.
The result: peak season traffic became stable and predictable. We validated improvements with before/after distributions (P50/P95/P99), segmentation, and constraint signals.
The situation
On normal days, the platform would occasionally "stutter" — slow page loads, delayed interactions, and inconsistent checkout speed. But during peak events (flash sales, seasonal promotions), the site often became nearly unusable.
- Severe lag and stalls
- Checkout timeouts and 5xx spikes
- Support tickets and customer frustration
- Operations teams forced into war-room mode
The business impact was significant: lost revenue during the highest demand windows, rising infrastructure costs, degraded SEO due to slow speed, and growing internal stress. The team had attempted scaling (more servers, more CPU/RAM), but the core behavior didn't change.
Baseline (before)
Before changing anything, we established a comparable baseline using production traffic. We focused on the flows that mattered most: product browse, cart, and checkout.
Baseline snapshot (peak sale conditions)
| Metric | Before | Impact |
|---|---|---|
| P95 page load | 8–15s | Users abandon during peaks |
| P99 page load | 20–45s | Trust breaks (feels "stuck") |
| 5xx error rate | 3–9% | Support load + revenue loss |
| Checkout timeouts | Frequent | Direct lost orders |
| DB pool wait | Maxed continuously | Waiting-bound collapse |
Note: Values are ranges across segments (regions, device cohorts, cache state) and reflect peak conditions — not best-case testing.
What we found (root causes)
The key insight: this wasn't a "just add servers" problem. The system was constrained by execution and data behavior. We identified three constraint groups.
1) Legacy runtime constraints
PHP was running on a very old version with limited concurrency characteristics under heavy load. Too much synchronous work existed on the critical request path.
2) Query inefficiency vs accumulated data
Over years, data volume grew significantly while query patterns stayed optimized for earlier scale. We found expensive joins, N+1 patterns, and queries that scanned large tables during peaks. Under concurrency, contention and pool saturation amplified tail latency.
3) Infrastructure quality mismatch
Servers had high CPU/RAM specs but inconsistent performance: unstable I/O and network jitter under sustained load. Combined with weak deployment discipline and limited observability, reliability suffered during spikes.
Big specs ≠ stable performance. When the constraint is waiting/contending, more CPU doesn't fix the behavior — it just buys time.
The plan
The business had hard constraints: keep all historical data, avoid long downtime, and survive peak sale spikes. We delivered a plan in two tracks:
- Modernize execution: migrate PHP → FastAPI to improve throughput, concurrency, and maintainability.
- Modernize data: migrate MySQL → PostgreSQL with full history retention and improved query performance.
- Make it operable: professional deployment pipeline, dashboards, flow-based alerts, and reliable infrastructure.
Migration approach (keep 100% historical data)
The migration was planned to preserve all history while minimizing risk:
- Initial snapshot conversion
- Schema transformation pipeline
- Incremental sync in batches
- Short, controlled cutover window
After cutover, we validated correctness (orders, customers, inventory) and performance improvements using comparable baseline windows.
Results (after)
We validated results with before/after distributions and constraint-aligned signals. Improvements held across segmentation (regions, device cohorts, cache state).
Peak sale performance (validated)
| Metric | Before | After | Change |
|---|---|---|---|
| P95 page load | 8–15s | 1.2–2.1s | ~6–8× faster |
| P99 page load | 20–45s | 2.8–4.5s | ~8–10× faster |
| 5xx error rate | 3–9% | <0.3% | massive reduction |
| Checkout timeouts | Frequent | Near zero | eliminated |
| DB pool wait | Maxed | Healthy stable range | normalized |
Improvements were verified under real peak traffic conditions — not just in staging or synthetic benchmarks.
Business impact
Once peak behavior became predictable, the organization saw immediate benefits:
- Sales events stopped being war-room operations
- Conversion rate improved due to smoother user experience
- Order volume increased significantly during peaks compared to prior seasons
- SEO improved due to faster page speed and stronger Core Web Vitals
- Operational stress dropped — the team could ship changes with confidence
Why this worked
The key was not a single optimization. It was eliminating the constraints that only appear under growth pressure: execution inefficiency, query behavior at scale, and infrastructure quality.
- We measured distributions (P50/P95/P99), not averages
- We segmented results so improvements weren't hiding regressions
- We validated constraints (pool wait, queueing, saturation), not just latency
- We made the system operable with better deploys and observability
What we delivered
- Baseline dashboards + flow-based alerts
- Trace-driven bottleneck report and execution plan
- Migration pipeline and cutover plan with full history retention
- New architecture + professional deployment workflow
- Before/after evidence package for leadership and engineering
- Post-migration performance roadmap and guardrails
Next steps
If peak traffic is painful — and performance work feels like guessing — the fastest way forward is a baseline audit. It turns "it feels slower" into measurable constraints and a plan you can execute confidently.
Want a bottleneck map for your system?
We run a 7-day baseline + constraint audit: production distributions, bottleneck isolation, and a prioritized plan with before/after validation.
What made this hard
The client wanted to keep 100% of historical data — no "reset," no shortcuts — while eliminating peak season instability.
What made this work
We validated improvements under real peak traffic with distributions (P50/P95/P99) and constraint signals — not opinions.
Want results you can prove?
If peak traffic hurts and performance work feels like guessing, start with a 7-day baseline audit. We map constraints and validate improvements with before/after evidence. See more about our AI audit.
Last updated
December 31, 2025
