The Day the Monolith Broke — and Kafka Saved the City
Yash Sharma
It was a crisp Monday morning at TechCity Corp. Inside the glass-walled war room, the backend team stared at dashboards glowing red. The monolithic application — the one that had faithfully handled all orders, payments, and notifications for years — had just… frozen.
Orders were stuck. Payments weren’t going through. Notifications were delayed. Every department was shouting:
Marketing: “Our campaign just launched, but no one’s getting order confirmations!”
Customer Support: “We have 1,000 complaints in the last hour!”
DevOps: “CPU is fine, but… the database is choking.”
The Problem

The system was built so that every service talked directly to every other service — payment service called inventory, inventory called shipping, shipping called notifications. It was tight coupling at its worst. If one part slowed down, everything slowed down.
The Hero Enters — Apache Kafka

Amid the chaos, Priya, a senior software engineer, walked in. She had been quietly lobbying for event-driven architecture for months.
She opened her laptop, projected a diagram, and said:
“We need a central nervous system for our data. Something that lets services talk without waiting for each other to finish. We need Kafka.”
What is Kafka? (In Priya’s Words)

Priya explained:
Apache Kafka is like a high-speed, fault-tolerant postal service for data.
Instead of services calling each other directly, they publish events (“Order Placed”) to Kafka topics.
Other services can subscribe and react whenever they want.
It stores these events so even if a service is down, it can catch up later.
She broke it down:
Producer → sends the message (like the order service saying, “Order 123 placed”).
Topic → the mailbox where the message is kept.
Consumer → any service that wants that message (like shipping or notifications).
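Priya’s three terms can be sketched with a toy in-memory model. To be clear, everything here (`MiniKafka`, `publish`, `poll`) is invented for illustration only — it is not Kafka’s real client API, just the shape of the idea: an append-only log per topic, with each consumer tracking its own position.

```python
from collections import defaultdict

class MiniKafka:
    """A toy, in-memory stand-in for a Kafka broker: each topic is an
    append-only list of events, and each consumer tracks its own offset."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> list of events
        self.offsets = defaultdict(int)   # (consumer, topic) -> next index to read

    def publish(self, topic, event):
        """Producer side: append an event to the topic's log."""
        self.topics[topic].append(event)

    def poll(self, consumer, topic):
        """Consumer side: return every event this consumer hasn't seen yet."""
        log = self.topics[topic]
        start = self.offsets[(consumer, topic)]
        new_events = log[start:]
        self.offsets[(consumer, topic)] = len(log)   # commit the new position
        return new_events

broker = MiniKafka()

# The order service produces an event; it never calls shipping or notifications directly.
broker.publish("orders", {"event": "Order Placed", "order_id": 123})

# Shipping and notifications each consume independently, at their own pace.
print(broker.poll("shipping", "orders"))       # [{'event': 'Order Placed', 'order_id': 123}]
print(broker.poll("notifications", "orders"))  # [{'event': 'Order Placed', 'order_id': 123}]
```

Note that both consumers receive the same event: the topic is a shared log, not a queue that hands each message to only one reader.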
Why Industry Uses Kafka

Priya didn’t stop there:
Decoupling services — Systems don’t need to know each other’s details; they just know the topic name.
Handling massive data streams — Millions of messages per second, in real time.
Resilience — If a consumer goes down, messages wait for it in Kafka.
Scalability — Just add more consumers or partitions to handle more load.
Event replay — You can “rewind” events to rebuild system state or debug issues.
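Two of those points — resilience and event replay — fall out of one design decision: Kafka stores events in an append-only log, and consumers just remember a position (offset) in it. A minimal sketch of that idea, using plain Python lists rather than any real Kafka client:

```python
# Toy append-only log illustrating resilience and replay.
log = []       # the topic: an append-only list of events
offset = 0     # the consumer's committed position in the log

# Producers keep publishing while the consumer is down.
log.append("Order 1 placed")
log.append("Order 2 placed")
log.append("Order 3 placed")

# The consumer comes back up and catches up from its last committed offset:
# nothing was lost while it was offline.
caught_up = log[offset:]
offset = len(log)
print(caught_up)   # ['Order 1 placed', 'Order 2 placed', 'Order 3 placed']

# Event replay: rewind the offset to reprocess history, e.g. to rebuild
# system state or debug an issue.
offset = 0
replayed = log[offset:]
print(replayed)    # the same three events again
```

Because consuming never deletes anything from the log, "a consumer going down" and "rewinding events" are the same cheap operation: moving a pointer.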
The Resolution

The team agreed. They didn’t replace the whole system overnight, but they started with one pain point — order notifications. Orders now published an “Order Placed” event to Kafka. Notifications subscribed to it and sent emails instantly, without waiting for payments or shipping.
Within weeks, they expanded Kafka to handle payments, inventory updates, and analytics pipelines. The old monolith started feeling lighter, and customers noticed faster responses.
Epilogue

Months later, TechCity Corp’s system was event-driven, resilient, and scalable. Priya’s diagram of Kafka’s topics and consumers now hung proudly in the war room — a constant reminder of the day the monolith broke, and Kafka saved the city.