The Silent Migration: How Salesforce Moved 760+ Kafka Nodes Without a Single Drop
This blog recounts Salesforce’s massive engineering feat of migrating 760+ Kafka nodes handling 1 million+ messages per second, all with zero downtime and no data loss. Told in a story-like war-room style, it highlights the challenges of moving from CentOS to RHEL and consolidating onto Salesforce’s Ajna Kafka platform. The narrative walks through how the team orchestrated the migration with mixed-mode clusters, strict validations, checksum-based integrity checks, and live dashboards. In the end, it shows how a seemingly impossible migration was achieved smoothly, proving that large-scale infrastructure upgrades are less about brute force and more about meticulous planning, safety nets, and engineering discipline.
Yash Sharma
The Call to Arms
The email came late on a Friday night. CentOS 7 was reaching its end of life, and the old Marketing Cloud Kafka clusters had to be migrated. Not just a patch here and there, but hundreds of brokers: 760+ nodes carrying more than a million messages per second and 15 TB of data daily.
The mandate?
Migrate everything. Standardise on RHEL 9 and Salesforce’s central Ajna Kafka stack.
And do it without a single moment of downtime.
There was no pause button. Campaigns had to keep sending, journeys had to keep flowing, and customers had to remain oblivious to the chaos beneath their feet.
The War Room Awakens
The war room didn’t look like a battlefield, but it felt like one. Monitors flickered with graphs, Slack channels overflowed, and a dozen engineers clutched their coffee like armor.
The challenge was clear: replace the beating heart of Marketing Cloud while it kept pumping at full throttle.
Every node mattered. Every cluster had its own quirks. Every misstep could mean lag, dropped messages, or, worse, millions of disrupted customer journeys.
The Phantom of Mixed Mode
The first battle was compatibility.
The old Kafka flavour didn’t perfectly align with Ajna Kafka. Authentication flows differed. Control-plane semantics didn’t match. Running both worlds side by side was like trying to make two dialects of the same language talk fluently without mistranslation.
So the team rehearsed. They simulated failures, injected network partitions, and forced version mismatches. Over and over again, until the mixed world felt less like a risk and more like a carefully tamed beast.
The Orchestrator’s Dance
Then came the true hero: the orchestrator.
Not just a script runner, but a conductor.
Every migration step was wrapped in choreography:
- Preflight checks: Is the broker healthy? Are replicas balanced?
- In-flight monitoring: Is throughput steady? Is the ISR (in-sync replica set) intact?
- Post-step validation: Did checksums align? Did consumers stay caught up?
The orchestrator refused to move unless the system itself gave permission. It was slow, deliberate, and maddeningly cautious — but it meant no surprises.
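For the curious, here is a minimal sketch of that gate-at-every-step loop in Python. The check functions are hypothetical stand-ins for the real probes (broker health, ISR state, checksum comparison, consumer lag); Salesforce's actual orchestrator is internal and far richer than this.

```python
# Sketch of the "ask permission at every gate" pattern described above.
# All checks are placeholders, not Salesforce's real orchestrator code.
import time


def preflight_ok(broker_id: int) -> bool:
    """Placeholder: broker healthy, replicas balanced, nothing under-replicated."""
    return True


def inflight_ok(broker_id: int) -> bool:
    """Placeholder: throughput steady and ISR intact while the step settles."""
    return True


def postflight_ok(broker_id: int) -> bool:
    """Placeholder: checksums align and consumers stayed caught up."""
    return True


def migrate_broker(broker_id: int) -> None:
    """Placeholder for the actual move: drain, reimage, rejoin the cluster."""
    print(f"migrating broker {broker_id}...")


def run_step(broker_id: int, settle_checks: int = 3) -> bool:
    # Gate 1: do not even start unless the cluster says it is safe.
    if not preflight_ok(broker_id):
        return False

    migrate_broker(broker_id)

    # Gate 2: watch the system while the step settles instead of rushing on.
    for _ in range(settle_checks):
        if not inflight_ok(broker_id):
            return False          # something drifted mid-step; stop here
        time.sleep(30)

    # Gate 3: prove the step actually worked before touching the next node.
    return postflight_ok(broker_id)


if __name__ == "__main__":
    for broker in [101, 102, 103]:   # hypothetical broker IDs
        if not run_step(broker):
            print(f"halting rollout at broker {broker}")
            break                    # a failed gate stops everything; a human decides next
```

The point of the structure is the refusal built into it: a single failed check halts the entire rollout rather than letting the next node proceed on hope.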
Guardians of Data Integrity
Data was sacred. Losing a byte was unthinkable.
Before a node migrated, fingerprints of its partitions were recorded. After migration, the same fingerprints were checked again. Synthetic probes flowed through the system like scouts, verifying every message made it through alive.
It wasn’t just engineering, it was ritual. A ceremony of proving that data was not just moved, but honoured.
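In code terms, the idea is roughly this: read a partition up to a known offset, hash what you see, and demand the same digest after the move. The sketch below assumes the confluent_kafka client and hypothetical topic and offset values; it illustrates the technique, not the actual validation pipeline.

```python
# Fingerprint a partition's contents up to a fixed offset so the digest can
# be compared before and after the broker hosting it is migrated.
import hashlib

from confluent_kafka import Consumer, TopicPartition


def partition_fingerprint(bootstrap: str, topic: str, partition: int, end_offset: int) -> str:
    consumer = Consumer({
        "bootstrap.servers": bootstrap,
        "group.id": "fingerprint-probe",   # hypothetical probe group
        "enable.auto.commit": False,
    })
    # Read the partition from the very beginning, outside any consumer group flow.
    consumer.assign([TopicPartition(topic, partition, 0)])

    digest = hashlib.sha256()
    read = 0
    while read < end_offset:
        msg = consumer.poll(timeout=5.0)
        if msg is None:
            break                          # nothing more within the timeout
        if msg.error():
            continue
        digest.update(msg.key() or b"")
        digest.update(msg.value() or b"")
        read += 1

    consumer.close()
    return digest.hexdigest()


# Usage (hypothetical names): record the digest before migrating the broker
# that leads this partition, record it again afterwards, and halt on mismatch.
# before = partition_fingerprint("broker-1:9092", "journey-events", 0, end_offset=1_000_000)
# after  = partition_fingerprint("broker-1:9092", "journey-events", 0, end_offset=1_000_000)
# assert before == after, "fingerprint mismatch -- halt the migration"
```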
When Failure Became Boring
The system was built to expect disaster.
Replication factors ensured no node was a single point of failure. Rack-awareness scattered replicas across physical boundaries. Leadership balancing made sure no unlucky broker carried the burden of the world.
So when failures came, when a node misbehaved or a rack blinked offline, nothing happened. The system yawned, shifted its weight, and carried on. The design made chaos boring. And boring was the secret weapon.
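As a rough illustration of those knobs, here is how a topic with that kind of safety margin might be declared, assuming the confluent_kafka admin client and hypothetical names. Rack awareness itself lives on the broker side (the broker.rack setting in server.properties); once it is set, Kafka spreads each partition's replicas across racks on its own.

```python
# Declare a topic whose durability settings make any single broker expendable.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "broker-1:9092"})   # hypothetical broker

topic = NewTopic(
    "journey-events",            # hypothetical topic name
    num_partitions=64,
    replication_factor=3,        # no broker holds the only copy of anything
    config={
        "min.insync.replicas": "2",                 # a write needs 2 live replicas to succeed
        "unclean.leader.election.enable": "false",  # never promote a stale replica to leader
    },
)

for name, future in admin.create_topics([topic]).items():
    try:
        future.result()          # raises if the brokers rejected the request
        print(f"created {name}")
    except Exception as exc:
        print(f"failed to create {name}: {exc}")
```

With settings like these, losing one broker or one rack leaves every partition with a live, fully caught-up replica, which is exactly why the failures felt boring.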
Eyes on the Night
Every night, the team became watchkeepers.
Dashboards lit the room: consumer lag, broker health, rack distribution. Alerts pinged in whispers: mismatched versions, jitter in throughput, odd rebalancing dances.
Once, a rogue config threatened to shuffle partitions behind their backs. The graphs twitched, the alerts blinked, and the team caught it before anyone outside the room noticed.
The world kept streaming, oblivious.
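The lag panel on those dashboards boils down to a simple calculation: log-end offset minus committed offset, summed across partitions. A minimal sketch, assuming the confluent_kafka client and hypothetical group and topic names:

```python
# Compute total consumer lag for one group on one topic.
from confluent_kafka import Consumer, TopicPartition


def total_lag(bootstrap: str, group_id: str, topic: str, partitions: list[int]) -> int:
    consumer = Consumer({
        "bootstrap.servers": bootstrap,
        "group.id": group_id,            # the group being watched
        "enable.auto.commit": False,
    })
    tps = [TopicPartition(topic, p) for p in partitions]
    committed = consumer.committed(tps, timeout=10)   # the group's committed offsets

    lag = 0
    for tp in committed:
        _low, high = consumer.get_watermark_offsets(tp, timeout=10)
        if tp.offset >= 0:               # a real committed offset exists
            lag += max(high - tp.offset, 0)

    consumer.close()
    return lag


# Usage (hypothetical names): page someone if the sending pipeline falls behind.
# if total_lag("broker-1:9092", "journey-sender", "journey-events", list(range(64))) > 100_000:
#     page_the_on_call()
```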
The Silent Ending
Weeks later, the migration ended not with fireworks but with silence.
No applause. No status-page banners. Just the same metrics as yesterday: steady throughput, flat consumer lag, normal latencies.
Behind the curtain, though, everything had changed. The clusters were now running on RHEL 9, unified under Ajna Kafka. What was once an aging patchwork had become a standard, scalable backbone.
And the customers? They never noticed a thing.
The Lesson of Invisible Victories
The true measure of success was invisibility.
No customer cares when the orchestra keeps playing in tune. No marketer thanks the engineer because their campaign never paused. But that’s the point.
Zero-downtime migrations aren’t about the drama of change; they’re about making change so seamless it feels like nothing happened at all.
And sometimes, nothing is the loudest victory you can achieve.