OneWeb has 48-hour leap year outage
Yet again, software engineers participate in the age-old tradition of writing buggy date code.
It’s natural to want your computer software to calculate the difference between dates and times. When does this coupon expire? What time is it in Bangkok if it’s 6:00am in New York City? If my claim will be processed in 4 to 6 business weeks, then what is the earliest and latest it will be processed?
But if you’ve worked in the software engineering field long enough, you know that date and time code is notoriously difficult to get right. Computer Science programs drill into their students’ heads that they should refrain from rolling their own date calculations in favor of heavily vetted open-source libraries.
What kinds of errors are possible? Some of them are collected in “Falsehoods Programmers Believe About Time”. But here are some real-world examples:
When I was at Google (and maybe even now), its servers’ time zones were set to the time zone of the company’s headquarters in Mountain View, California. Depending on who you asked, they called it “Google Standard Time” or “Google Server Time.” But server software usually runs in the UTC time zone. After all, businesses are global these days, so why not just use a universal time zone? So sometimes either a person or a program would assume UTC and some calculation would be 8 hours wrong, give or take Daylight Saving Time. This happened over and over.
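To make the size of that mistake concrete, here is a minimal Python sketch. The timestamp and the `PST` constant are made up for illustration, and the fixed -8 hour offset deliberately ignores Daylight Saving Time; real code would more likely use a time zone database such as the standard `zoneinfo` module.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-8 offset standing in for Pacific Standard Time.
PST = timezone(timedelta(hours=-8), "PST")

# A "naive" timestamp read from a log, with no time zone attached (made-up value).
logged = datetime(2024, 1, 15, 9, 30)

# The same wall-clock reading, interpreted two different ways.
as_pacific = logged.replace(tzinfo=PST)
as_utc = logged.replace(tzinfo=timezone.utc)

# How far apart the two interpretations are on the real timeline.
error = as_pacific - as_utc
print(error)
```

Nothing in the naive value says which interpretation is right, which is why the wrong guess can go unnoticed until a calculation comes out hours off.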
Back in 2012, a leap second took down vast swaths of the Internet because Linux, one of the most heavily scrutinized codebases on the planet, did not handle it correctly. So now, instead of inserting the leap second all at once, companies often “smear” it across the entire day. On its surface, a failure like this seems preposterous. “There have been a bunch of leap seconds already, and we already handle Daylight Saving Time. Surely a leap second is just another leap?” But here we are.
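The idea behind smearing can be sketched in a few lines of Python. This is only an illustration: the function name, the 24-hour window, and the straight-line shape are assumptions made here, and real deployments may pick a different window or curve.

```python
def smeared_offset(seconds_into_window: float, window_seconds: float = 86_400.0) -> float:
    """Return how much of the one extra leap second has been absorbed so far.

    Assumes a linear smear: the clock is slowed very slightly over the whole
    window, so the accumulated offset grows from 0.0 to 1.0 seconds.
    """
    # Clamp so that times before or after the window give 0.0 or 1.0.
    clamped = min(max(seconds_into_window, 0.0), window_seconds)
    return clamped / window_seconds


# Halfway through the window, half of the leap second has been applied.
print(smeared_offset(43_200.0))
```

The appeal is that no second is ever repeated or inserted, so software never sees a clock reading it did not expect.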
I’m certainly guilty of this. When I was on the Google Docs team, I wrote some code that assumed that two high-performance timer readings in a web browser would always be different. It worked fine until we encountered one user whose high-performance timer always reported the current time as 0 [1], at which point it started crashing. You might say “how could you expect that?” But 2 years before that, I proved to a colleague that High Performance Timers could sometimes “travel back in time” [2] and they weren’t always guaranteed to advance. So at the very least I should have guarded against this case!
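A guard against this case might look something like the following. It is a sketch in Python rather than browser JavaScript, `events_per_second` is a hypothetical helper and not the actual Google Docs code, and the two readings are passed in as plain numbers so the example does not depend on any particular clock.

```python
import time
from typing import Optional


def events_per_second(event_count: int, start: float, end: float) -> Optional[float]:
    """Compute a rate from two timer readings without trusting them blindly.

    Returns None when the readings are identical or out of order, instead of
    dividing by zero or reporting a negative rate.
    """
    elapsed = end - start
    if elapsed <= 0:
        # The timer did not advance (or went backwards): no meaningful rate.
        return None
    return event_count / elapsed


# Typical use with a real clock; the result depends on the machine.
start = time.perf_counter()
end = time.perf_counter()
print(events_per_second(3, start, end))
```

Returning `None` pushes the decision about a stuck timer to the caller rather than letting the arithmetic crash.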
The point is that excellent engineers — and I — screw up handling time. I want us to keep this in mind as we read this week’s news story.
According to The Register, satellite broadband provider Eutelsat OneWeb had a complete outage that lasted 2 days.
The satellite broadband service fell over on December 31, 2024, for 48 hours. According to Eutelsat, "the root cause was identified as a software issue within the ground segment." Issues began just after 0000 UTC, and it took until January 1 to get 80 percent of the network operational. By the morning of January 2, everything was working again.
A spokesperson told The Register: "We can confirm that the issue was caused by a leap year problem, related to day 366 in 2024, which impacted the manual calculation for the GPS-to-UTC offset."
According to aviationweek.com, the issue was caused by their ground segment, which is maintained by a vendor.
Eutelsat’s OneWeb constellation went down for 48 hr. after its ground segment, maintained by Hughes Network Systems, was not programed so that 2024 was a 366-day-long leap year.
[…]
“The root cause was identified as a software issue within the ground segment,” Eutelsat said on Jan. 2. “Eutelsat was fully mobilized and worked with the vendor to restore full service, while maintaining a constant dialogue with affected customers. The constellation is operating nominally once again.”
The fact that this affects a GPS-to-UTC conversion brings up so many mysteries. GPS time was set to match UTC in 1980, and the two have drifted apart only because UTC uses leap seconds to account for changes in Earth’s rate of rotation, but GPS time does not. GPS time is currently 18 seconds ahead of UTC. This seems like a straightforward coding task: the conversion function would just shift the underlying time representation by those 18 seconds.
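In its most naive form, that conversion is a single subtraction. The sketch below assumes the offset is 18 seconds, which is only true for recent dates, and it borrows Python’s `datetime` to hold a GPS reading purely for illustration; the names are invented here.

```python
from datetime import datetime, timedelta

# Assumption: GPS time is 18 seconds ahead of UTC (true at the time of writing).
GPS_MINUS_UTC = timedelta(seconds=18)


def gps_to_utc_naive(gps_reading: datetime) -> datetime:
    """Shift a GPS-timescale reading back by the leap-second offset."""
    return gps_reading - GPS_MINUS_UTC


# A GPS reading of 00:00:18 corresponds to midnight UTC.
print(gps_to_utc_naive(datetime(2025, 1, 1, 0, 0, 18)))
```

A hard-coded constant like this silently goes stale the next time a leap second is added, and it is wrong for timestamps from before the most recent one.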
Well, it’s a little more complicated than that. GPS time and UTC are represented differently in computers. GPS time is represented as a week number plus the number of seconds into that week, counted from the GPS epoch of Jan 6, 1980, while UTC is represented as… well, there’s no universal way to represent UTC. One common way is to represent it as the number of seconds/milliseconds/microseconds since the Unix time epoch of Jan 1, 1970. But it doesn’t have to be done this way! You could also just have a structure that has the year, month, day, hours, minutes, seconds, and fractional seconds. Or any number of representations!
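Here is a minimal sketch of moving between those representations, leaning on Python’s standard `datetime` for the calendar math. The GPS epoch is midnight UTC on January 6, 1980. The function name and the default of 18 leap seconds are assumptions for illustration, and `week` is taken to be the full week count rather than a rolled-over broadcast value.

```python
from datetime import datetime, timedelta, timezone

GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)
LEAP_SECONDS = 18  # assumption: GPS-minus-UTC offset for recent dates


def gps_week_seconds_to_utc(week: int, seconds_of_week: float,
                            leap_seconds: int = LEAP_SECONDS) -> datetime:
    """Convert a GPS (week, seconds-of-week) pair to a UTC datetime."""
    # timedelta handles the week and second arithmetic; datetime handles
    # month lengths and leap years, so no hand-rolled calendar is needed.
    return GPS_EPOCH + timedelta(weeks=week, seconds=seconds_of_week - leap_seconds)


utc = gps_week_seconds_to_utc(2347, 172_818)
print(utc.isoformat())            # structured calendar form
print(utc.timestamp())            # seconds since the Unix epoch
print(utc.timetuple().tm_yday)    # day of the year
```

The example input lands on midnight UTC at the start of December 31, 2024, which is day 366 of the year.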
So the Hughes Network Systems coders were handling a slightly different problem: “given the week number and seconds since 1980, and the number of leap seconds that have occurred, produce the current time in UTC.” And it looks like the Hughes/OneWeb vendor partnership began at the end of 2020, so this is the first leap year encountered by this partnership.
Given all of that, it’s relatively easy to guess at a story of what happened. Some enterprising programmer wrote a conversion function mapping GPS time’s week-and-seconds format into some structure that included the year. They glossed over all of the leap seconds and leap years by just finding the current offset when they wrote it, and based all of their calculations on that offset. As soon as the 366th day of 2024 rolled around, the ground system happily converted the GPS time into January 1, 2025, and passed that along to another system. That other system looked at it and said something like “I don’t know what 2025 is” or “this request happened in the future, I’m ignoring it.” And then the whole system filled up with machines saying “please” and “no thank you” until humans came along and patched them to stop the outage.
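To be clear, nobody outside the vendor knows what the real code looked like. Purely as an illustration of how the guessed-at failure could arise, here is a hypothetical Python sketch with two invented functions: one assumes every year has 365 days, and the other asks the standard `calendar` module.

```python
import calendar
from typing import Tuple


def naive_year_and_day(base_year: int, days_elapsed: int) -> Tuple[int, int]:
    """Hypothetical buggy version: pretends every year is 365 days long."""
    year = base_year
    while days_elapsed >= 365:
        days_elapsed -= 365
        year += 1
    return year, days_elapsed + 1  # day-of-year is 1-based


def leap_aware_year_and_day(base_year: int, days_elapsed: int) -> Tuple[int, int]:
    """Same calculation, but with the correct length for each year."""
    year = base_year
    while True:
        year_length = 366 if calendar.isleap(year) else 365
        if days_elapsed < year_length:
            return year, days_elapsed + 1
        days_elapsed -= year_length
        year += 1


# 365 full days after the start of 2024 is December 31, 2024.
print(naive_year_and_day(2024, 365))       # rolls over a day early
print(leap_aware_year_and_day(2024, 365))  # stays in 2024
```

The two versions agree on every day except the last one of a leap year, which would explain why a bug like this could sit unnoticed for years.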
Oh well. It happens to the best of us.
[1] We brainstormed for a while about this. The simple and obvious explanation was that window.performance’s high-performance timer was simply overridden by a function that always returned zero, but we wondered if there were other possible scenarios like Chrome running inside some Virtual Machine or headless environment that violated the timer’s assumptions.
[2] I don’t know if this is still true, but around 2009 High Performance Timers in Wintel machines relied on data they got from processor cores, and the timers weren’t always perfectly synchronized for reasons I didn’t fully understand. So you could take 2 readings on 2 cores, one after the other, and the second reading would be earlier than the first.