Rate Limiting — The Day We Throttled Our Own App
This blog tells the story of a SaaS company that introduced rate limiting to stop bot abuse on its public APIs, only to accidentally throttle its own internal microservices. What began as a simple protection mechanism using a sliding-window algorithm soon spiraled into a self-inflicted denial-of-service when internal service calls were routed through the same rate-limited gateway, triggering cascading retries and system-wide failures. The narrative highlights how defensive systems like rate limiting must be context-aware and tested against internal traffic, not just external threats, and emphasizes that poorly tuned safeguards can end up harming the platform they’re meant to protect.
Yash Sharma
Traffic was booming. Dashboards looked healthy. Growth graphs pointed aggressively upward. Inside the SaaS company, everyone celebrated what looked like scaling success, until that success started inviting unwanted attention. Bots began hammering open APIs, spam sign-ups flooded databases, and fake traffic muddied product analytics.
Engineering assembled a war-room and emerged with a clear mandate: deploy rate limiting across all public APIs.
Act I — The Perfect Shield
Implementation was swift and elegant: a sliding-window rate limiter attached to each endpoint. Every request was counted per IP and user ID. Any client exceeding the limit would be quietly throttled for a few seconds.
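For illustration, a minimal sketch of that kind of sliding-window limiter, keyed per client, might look like the following. The class name, limits, and in-memory store are assumptions for the example; the production version sat at the gateway.

```python
# A minimal in-memory sliding-window limiter, keyed per client (IP + user ID).
# Names and limits are illustrative, not the company's actual configuration.
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    def __init__(self, max_requests: int = 100, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client key -> recent request timestamps

    def allow(self, client_key: str) -> bool:
        """Return True if the request fits in the window, False if it should be throttled."""
        now = time.monotonic()
        hits = self._hits[client_key]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: throttle (e.g. respond with HTTP 429)
        hits.append(now)
        return True

# Usage: key requests by "ip:user_id", the same granularity described above.
limiter = SlidingWindowLimiter(max_requests=100, window_seconds=60)
if not limiter.allow("203.0.113.7:user-42"):
    print("RATE_LIMIT_EXCEEDED")
```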
“Real users don’t hit our endpoints that hard. Everything beyond that must be abuse.”
Initially, it looked like victory:
- Spam traffic dropped.
- Fake accounts vanished.
- Support and SRE sleep schedules returned to normal.
Rate limiting faded into the background, quiet, reliable, forgotten.
Act II — The Meltdown Nobody Saw Coming
Weeks passed. Suddenly, billing requests started timing out. Minutes later, authentication failed. Notifications went dark. The dashboard turned into a graveyard of 503 errors.
When engineers jumped in to debug… they too were getting blocked.
Logs revealed thousands of entries marked:
RATE_LIMIT_EXCEEDED
Panic escalated. Was this a new attack? A DDoS? A cloud outage?
Traffic patterns, however, were perfectly normal.
Then came the horrifying truth:
the app was rate-limiting itself.
Act III — Self-Inflicted Chaos
During a refactor, internal microservices had started calling each other through the public API gateway, the exact same gateway protected by rate limiting.
Internal processes like:
- session validation,
- notification triggers,
- metadata lookups,
- even monitoring bots
…began hitting API endpoints at machine speed, blowing past limits designed for humans. Requests were throttled. Those throttled requests retried automatically. Retries triggered more throttles. Throttles triggered more retries.
The system spiralled into a full-blown, self-created denial-of-service event.
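To make that feedback loop concrete, here is a sketch of the kind of naive retry wrapper that turns throttling into amplification, next to the backoff-with-jitter pattern that is the usual antidote. The helper names are hypothetical; the incident write-up does not describe the actual client code.

```python
# send_request() is a hypothetical stand-in for an internal HTTP call
# that returns a status code.
import random
import time

def call_with_naive_retries(send_request, max_attempts: int = 5):
    """Retries immediately on HTTP 429. Under gateway-wide throttling, every
    retry adds load, which triggers more 429s, which triggers more retries."""
    for _ in range(max_attempts):
        status = send_request()
        if status != 429:
            return status
        # No backoff, no jitter: the pattern that feeds a retry storm.
    return 429

def call_with_backoff(send_request, max_attempts: int = 5, base_delay: float = 0.5):
    """Exponential backoff with jitter: throttled callers back away instead of
    piling on. Shown for contrast; not stated as part of this incident's fix."""
    for attempt in range(max_attempts):
        status = send_request()
        if status != 429:
            return status
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    return 429
```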
Meanwhile, the real attackers simply adapted, using rotating IPs and distributed scripts to stay under the threshold.
By the time the truth was uncovered, half the platform lay unresponsive behind the very walls built to protect it.
Act IV — The Aftermath and the Rebuild
In a twelve-hour incident marathon, teams:
- Whitelisted internal traffic.
- Introduced context-aware bypass tokens.
- Moved inter-service calls off the public gateway.
- Added distributed tracing to detect retry loops.
- Introduced adaptive, dynamic rate limits rather than hardcoded numbers.
What emerged wasn’t just a rate limiter; it was a traffic intelligence system, aware of who was calling, why, and from where.
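As a rough illustration of that kind of context-aware decision, the sketch below combines an internal bypass-token check with an adaptive limit. Every name, token, and threshold here is an assumption for the example, not the company's actual design.

```python
# Sketch of a context-aware limit decision: trusted internal callers bypass the
# public limiter, and the limit tightens when the platform is already under strain.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CallContext:
    client_key: str                        # e.g. "ip:user_id"
    internal_token: Optional[str] = None   # bypass token minted for internal services
    recent_error_rate: float = 0.0         # fed back from monitoring for adaptive limits

# Hypothetical allowlist of internal service tokens.
INTERNAL_TOKENS = {"svc-billing", "svc-notifications"}

def effective_limit(ctx: CallContext, base_limit: int = 100) -> Optional[int]:
    """Return the per-minute limit for this caller, or None for 'no limit'."""
    if ctx.internal_token in INTERNAL_TOKENS:
        return None  # trusted internal traffic never competes with public clients
    if ctx.recent_error_rate > 0.05:
        return base_limit // 2  # adaptive tightening: shed more load when struggling
    return base_limit
```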
The Lesson the Wall Taught Us
Build defenses for attackers but test them against yourself first.
The quickest way to take down your system is to point your protection inwards.
Today, when any new safeguard is proposed in the company, someone always asks:
“Are we sure this won’t throttle ourselves again?”
Rate limiting stayed. But it stopped being a blunt instrument.
It became a precision-tuned safety net: aimed at outsiders, but in harmony with the system it protects.