CartNerve — Autonomous SRE Platform

We surveyed 50 engineers across startups and scale-ups about their on-call experience. The results were not surprising. They were just honest in a way that rarely gets said out loud.

WHAT WE FOUND

The average engineer is on-call one week per month. On teams of fewer than 8 people, the rotation comes around more often. The median time from page to resolution was 43 minutes, which includes the time to wake up, understand the alert, diagnose the issue, and execute a fix.

72% of incidents fell into one of five failure categories: connection pool exhaustion, memory pressure, failed deployments, latency spikes from upstream services, and high error rates from traffic spikes. These are not novel failures. They are the same problems repeating.

When asked what percentage of their on-call incidents could have been handled automatically, the median answer was 80%. The most common word used to describe on-call: exhausting.

THE HIDDEN COST

The financial cost of on-call is visible: on-call stipends, burnout-driven attrition, and engineering hours spent on incident response instead of product development. The hidden cost is harder to measure. It shows up in the engineer who stops taking on ambitious projects because they cannot afford to be deep in flow work during an on-call week.

WHAT ENGINEERS ACTUALLY WANT

Engineers want to be on-call for things that actually require their judgment: security incidents, novel architectural failures, business logic anomalies. They do not want to be on-call for connection pool exhaustion at 3AM. That is what HealBot is for.

The Cost of 3AM Alerts: An SRE Survey

About the Author