Distributed Systems Are Broken (And We Are The Fix)
There is a dirty secret in modern software engineering. We have built systems of extraordinary complexity and then staffed them with humans whose job is to watch dashboards and wait for things to break. This is not an engineering problem. It is a design problem.
THE ON-CALL TAX

Every engineering team running production infrastructure pays what I call the on-call tax. It shows up in multiple places: the 3AM pages that destroy sleep cycles, the Monday morning post-mortems about incidents that happened over the weekend, the senior engineer who leaves because they are tired of being the human failsafe for a distributed system.

The on-call tax is not just a quality-of-life issue. It is a compounding productivity drain. An engineer who was paged at 3AM is not performing at full capacity the next day. And an engineering culture that normalizes being on-call normalizes a certain level of fear and fragility.
WHY ALERTING TOOLS MADE IT WORSE

PagerDuty, Opsgenie, and their equivalents were built on a reasonable premise: make sure the right human knows about the problem as fast as possible. That was the right solution for 2012 infrastructure.

In 2026, the majority of production incidents are not novel. They are the same categories of failure repeating across different services: connection pool exhaustion, memory leaks, failed deployments, latency spikes from upstream dependencies. These failures have known causes and known fixes. Alerting a human to fix a problem that a script could fix is not a solution. It is a delay.
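The claim that known failures have scriptable fixes can be made concrete as a remediation playbook: a lookup from failure category to automated action. This is a minimal illustrative sketch, not HealBot's actual code; every function name, category string, and message here is an assumption.

```python
# Hypothetical remediation playbook: known failure categories mapped to
# scripted fixes. A real system would call orchestration APIs here; these
# stubs just describe the action taken.

from typing import Callable, Dict


def restart_pool(service: str) -> str:
    return f"recycled connection pool for {service}"


def restart_process(service: str) -> str:
    return f"restarted {service} to reclaim leaked memory"


def rollback_deploy(service: str) -> str:
    return f"rolled back last deployment of {service}"


def shed_load(service: str) -> str:
    return f"enabled circuit breaker on upstream calls from {service}"


# The four failure categories named in the text, each with a known fix.
PLAYBOOK: Dict[str, Callable[[str], str]] = {
    "connection_pool_exhaustion": restart_pool,
    "memory_leak": restart_process,
    "failed_deployment": rollback_deploy,
    "upstream_latency_spike": shed_load,
}


def remediate(category: str, service: str) -> str:
    """Run the known fix for a known failure; escalate anything novel."""
    action = PLAYBOOK.get(category)
    if action is None:
        return f"escalate: no playbook for {category}"
    return action(service)
```

The design point is the fallback: anything without a playbook entry still reaches a human, so automation only absorbs the repetitive cases.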
THE AUTONOMY GAP

There is a gap between what monitoring tools do and what actually needs to happen when something breaks. Monitoring tools fill the detect-and-notify part. Everything after that, from diagnosis to decision to action, is left to humans. HealBot was built to fill that gap.
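The detect, diagnose, decide, act sequence described above can be sketched roughly as follows. The `Incident` fields, the confidence threshold, and the symptom-to-cause mapping are all hypothetical, chosen only to show the shape of the loop, not how HealBot actually implements it.

```python
# Hypothetical sketch of the diagnose -> decide half of the loop that
# monitoring tools leave to humans. Detection (the alert) is assumed to
# have already produced an Incident.

from dataclasses import dataclass


@dataclass
class Incident:
    service: str
    symptom: str       # e.g. "5xx_rate_high", as emitted by detection
    confidence: float  # diagnostic confidence, 0.0 to 1.0


def diagnose(incident: Incident) -> str:
    # A real diagnoser would inspect metrics, logs, and recent deploys;
    # this stand-in only recognizes two symptoms.
    known = {
        "5xx_rate_high": "failed_deployment",
        "oom_kills": "memory_leak",
    }
    return known.get(incident.symptom, "unknown")


def decide(cause: str, confidence: float) -> str:
    # Act autonomously only when the cause is known and confidence is
    # high; anything else falls back to the traditional model and pages
    # a human. The 0.8 threshold is an arbitrary illustration.
    if cause == "unknown" or confidence < 0.8:
        return "page_human"
    return "auto_remediate"


def handle(incident: Incident) -> str:
    return decide(diagnose(incident), incident.confidence)
```

The escape hatch is the point: the system acts on what it recognizes and pages a human for what it does not, rather than paging a human for everything.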
THE PHILOSOPHICAL SHIFT

Building HealBot required a philosophical shift in how we think about incident response. The traditional model assumes humans are necessary for every incident. The autonomous model assumes humans are necessary only for incidents that require judgment that cannot be encoded: security decisions, business logic anomalies, novel architectural failures. Everything else should be handled by the system itself.

This is not a radical idea. Aircraft have autopilot. Manufacturing has automated quality control. Finance has algorithmic risk management. Infrastructure is simply catching up.
About the Author
Rithwik is the founder of CartNerve. He spends his days thinking about how to make distributed systems less fragile and his nights wondering why we still wake up humans to restart servers in 2026.