There’s a lot missing from the Google Cloud outage incident report
Google’s incident report points to a null pointer bug, but many possible contributing factors like executive pressure remain unmentioned.
Google Cloud Platform had a 3-hour outage in the past week.
Google’s Incident Report has a clear explanation of how the incident happened.
On May 29, 2025, a new feature was added to Service Control for additional quota policy checks. This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code. As a safety precaution, this code change came with a red-button to turn off that particular policy serving path. The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash. Feature flags are used to gradually enable the feature region by region per project, starting with internal projects, to enable us to catch issues. If this had been flag protected, the issue would have been caught in staging.
On June 12, 2025 at ~10:45am PDT, a policy change was inserted into the regional Spanner tables that Service Control uses for policies. Given the global nature of quota management, this metadata was replicated globally within seconds. This policy data contained unintended blank fields. Service Control, then regionally exercised quota checks on policies in each regional datastore. This pulled in blank fields for this respective policy change and exercised the code path that hit the null pointer causing the binaries to go into a crash loop. This occurred globally given each regional deployment.
[…]
Within some of our larger regions, such as us-central-1, as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table), overloading the infrastructure. Service Control did not have the appropriate randomized exponential backoff implemented to avoid this.
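To make the failure mode concrete, here's a minimal sketch of it in Go. Every name here is invented, and Service Control is obviously not written like this; the point is just how little code separates "crash loop in every region" from "reject the bad row and move on":

```go
package main

import (
	"errors"
	"fmt"
)

// All names below are invented for illustration; this is not Google's code.

// QuotaPolicy stands in for a policy row that Service Control reads out of
// its regional Spanner table. Limits may be nil if the row was written with
// blank fields, which is the bad data described in the report.
type QuotaPolicy struct {
	Project string
	Limits  *PolicyLimits
}

type PolicyLimits struct {
	RequestsPerMinute int
}

// newQuotaChecksEnabled stands in for a feature flag. If the new code path
// were enabled per project and per region, bad policy data would surface in
// staging or an internal project instead of in every region at once.
var newQuotaChecksEnabled = false

// checkQuotaUnsafe mirrors the failure mode in the report: it assumes Limits
// is always populated, so a policy with blank fields dereferences a nil
// pointer and takes the whole binary down.
func checkQuotaUnsafe(p *QuotaPolicy) bool {
	return p.Limits.RequestsPerMinute > 0 // panics when Limits is nil
}

// checkQuotaGuarded is the "appropriate error handling" version: a malformed
// policy is rejected with an error instead of crashing the process.
func checkQuotaGuarded(p *QuotaPolicy) (bool, error) {
	if !newQuotaChecksEnabled {
		return true, nil // new path not rolled out here; keep old behavior
	}
	if p == nil || p.Limits == nil {
		return false, errors.New("quota policy has blank limits")
	}
	return p.Limits.RequestsPerMinute > 0, nil
}

func main() {
	newQuotaChecksEnabled = true // pretend the flag is on for this project

	bad := &QuotaPolicy{Project: "example-project"} // Limits left blank

	if ok, err := checkQuotaGuarded(bad); err != nil {
		fmt.Println("rejected malformed policy:", err)
	} else {
		fmt.Println("policy ok:", ok)
	}
	// checkQuotaUnsafe(bad) // would panic: nil pointer dereference
}
```

The `newQuotaChecksEnabled` stand-in is there because the report leans on it: the claim is that if the new path had been flag-protected and rolled out per project and per region, the blank fields would have been caught in staging instead of crashing every region at once.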
There has been a lot of discussion online outlining all the ways the Google engineers must be the worst engineers to have ever walked the earth. I mean, there's even someone to blame, right? The error handling is bad! There's no feature flag, and the post practically promises that a feature flag can't be misused and would have fixed a problem like this! They didn't implement randomized exponential backoff! Where are the tests?1
This is a classic blunder. In reality, you simply don't have enough context to make that determination just by reading the incident report. I've talked before about how it's important to step back and imagine the circumstances surrounding major outages.
What do I mean? Let’s ask ourselves some questions:
Despite huge investment, Google Cloud is far behind Microsoft Azure and Amazon Web Services in market share. How much pressure do you think executives feel to turn that around, compared to the pressure they feel to prioritize code-quality concerns?
The incident report identifies this change as part of development of a new feature. Are we 100% sure that this was business as usual, or is it possible that this team, their group, or their product area have been under increased deadline pressure from executives?
The developers chose to enable this feature via a configuration method that replicates within seconds, instead of going through a full region-by-region rollout behind feature flags. Were there reasons they might have preferred the faster approach?
A few days before the incident, Google solicited voluntary buyouts in other organizations within the company. The last time it did this, it preceded layoffs. Can you imagine individuals or teams feeling extra pressure from this, even in a different org? Can you imagine that pressure increasing when they hear people describe the "tough job market" in tech? Do they have reason to believe that their performance ratings would be impacted if they didn't launch these features?
The post talks about feature flags as a silver bullet that would have avoided this outage. When I was on Google Docs from 2010 to 2015, we shipped several bugs with bad feature-flag eligibility logic. Have feature flags become idiot-proof since then?
The service did not have proper error handling in this code path. How rare is it for code paths in Service Control to lack proper error handling? Is it normalized to expect that errors will be noticed and reverted within minutes, with minimal splash damage?
Was it reasonable for the reviewer and the code author to have any context on whether there was top-level error handling in this part of the service?
How many techniques are as important as “randomized exponential backoff” in keeping the entire system working? How common is it for these techniques to have uneven implementation coverage across the entire suite of services? As new techniques are identified as critical, are old systems upgraded to use them?
How easy or hard was it to tell that the service did not implement randomized exponential backoff? Was it controlled by some BCL spaghetti that nobody knew how to read? (If the technique is unfamiliar, there's a sketch of it just below.)
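A quick illustration of that technique (again invented code with made-up constants, not anything from Service Control): randomized exponential backoff just means doubling the wait after each failure and then sleeping a random fraction of it, so that callers don't retry in lockstep.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithJitter retries op with randomized exponential backoff: the wait
// doubles after each failure, and a uniformly random jitter spreads retries
// out so a fleet of crashing tasks doesn't hit its backing store in lockstep.
// The constants here are arbitrary; a real service would tune them.
func retryWithJitter(op func() error, maxAttempts int) error {
	base := 100 * time.Millisecond
	maxBackoff := 30 * time.Second

	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		// Exponential growth, capped so the wait never exceeds maxBackoff.
		backoff := base << attempt
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
		// "Full jitter": sleep a uniform random duration in [0, backoff).
		time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}

func main() {
	// Simulate a dependency that recovers after a few failures.
	calls := 0
	err := retryWithJitter(func() error {
		calls++
		if calls < 4 {
			return errors.New("backend overloaded")
		}
		return nil
	}, 6)
	fmt.Println("calls:", calls, "err:", err)
}
```

The jitter is the part that matters for the herd effect described in the report: plain exponential backoff still synchronizes a fleet of tasks that all crashed at the same moment, while randomizing the sleep spreads their retries out over time.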
Sure, maybe it’s the case that the engineers who made the change are out of line. Maybe the executives were chill, there was no deadline pressure, they recklessly rejected feature flags even though they knew the risk, they are definitely not worried about layoffs and the job market more broadly, feature flags are impossible to misuse, the reviewer and code author went to extraordinary lengths to add a code path that did not have error handling, and specific people have been negligent, failing to implement a single specific technique despite everyone knowing they weren’t doing it.
Personally, I find it unlikely that every single one of these circumstances is true. I find it likely that there was deadline pressure2. I find it likely that they had a lot of competing priorities, and that it was difficult to prioritize system modernization/hardening versus new product work. I find it likely that engineering decisions are starting to change in parts of the company, ever so slightly, even if they won’t admit it, as the spectre of layoffs looms over their heads.
Are all of these things true? Are any of these things true? I have no idea. I lack the complete story, because Google would never put any of these factors in the incident report, even if they were all relevant to why the outage happened. If an external agency — some kind of FAA for software outages — investigated the incident and identified all of the factors, it would absolutely include the surrounding circumstances: things like executive pressure and job-security concerns.
But again, I know that I lack the complete story. And accordingly I’m going to give everyone the benefit of the doubt until we learn more, if we ever do.
1. Many will be unsurprised that one (of many) possible citations is the URL https://news.ycombinator.com/item?id=44275870.
2. Especially given what I hear from Googlers I know.