12 Comments
Ahmed Hassanein

The lesson is: hair pulling IS an integral part of this career.

One must expect to be super productive 80% of the time, then drop to 20% over some insane, unexplainable, embarrassing, energy-wasting, and blood-pressure-raising bug.

Not giving up during this time means you have what it takes 😁

Guillaume Duhan

Love it man. Rings a bell as a former Google Eng. Best!

Jason Harrison

There probably isn't a lesson from your side of the problem other than that if you had waited long enough to work on it, it would have disappeared with the next release of V8. I have no idea how long that would be.

It is disappointing that the V8 team at Google was shipping without even a small set of unit tests for Math.abs(). Inputs [-1, 0, 1] would have caught the problem.
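A sanity check along those lines might look like the sketch below. It is only an illustration: `checkAbs`, its case table, and the `iterations` parameter are hypothetical names, not anything from V8's test suite. It repeats each input many times on the assumption that a bug in optimized code may only show up once a function is hot; the thread does not confirm that this would reproduce the issue on a given engine build.

```javascript
// Hypothetical helper (not from V8's test suite): runs absFn over the
// inputs [-1, 0, 1] and reports the first mismatch seen for each input.
function checkAbs(absFn, iterations = 100000) {
  const cases = [
    [-1, 1],
    [0, 0],
    [1, 1],
  ];
  const failures = [];
  for (const [input, expected] of cases) {
    // Repeat the call so the function has a chance to become "hot".
    for (let i = 0; i < iterations; i++) {
      const actual = absFn(input);
      // Object.is also distinguishes +0 from -0.
      if (!Object.is(actual, expected)) {
        failures.push({ input, expected, actual, iteration: i });
        break;
      }
    }
  }
  return failures;
}

const failures = checkAbs(Math.abs);
console.log(failures.length === 0 ? "Math.abs OK" : failures);
```

Recording the iteration at which a mismatch first appears would also hint at whether the failure depends on the function having been called many times.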

Again, it's scary how much of our software is built using unverified and untested code. Attempting to binary search for the source of the problem across the dependencies might have helped, but you had already established that the problem started with a specific Chrome Release. This meant you could have looked at what had changed in the dependencies of that specific release, but my experience is that it is much more likely that the problem is in my code than in another team's code.

Not every single time, but starting by suspecting someone else rarely helps to find the problem.

Sitong

I really wonder how many people ran into that Math.abs() issue in that version of V8 and actually discovered it was Math.abs(). Definitely would have needed to resort to human compiling.

throwaway

Always nice hearing how the dev side of the house handles troubleshooting in the trenches.

I don't think many people outside roles that deal directly with troubleshooting and debugging, and without years of experience, would really get the context and the horror of issues like this being non-deterministic.

It's a problem domain that is almost uncharacterizable, unpredictable, and unknowable in detail when you are working backwards, and often indistinguishable from random chaos without complete knowledge of the whole system, which Ops often didn't have.

There is an orders-of-magnitude difference in cost between deterministic and non-deterministic issues that just can't be conveyed well. That said, you did a pretty good job of conveying it in your post.

Aditi Medhane

That was an excellent explanation. I love the way you think and write.

Tom Jenkinson

Thanks for writing this. I used to work on a TV app for a UK broadcaster, and we were debugging a weird issue on one device where, once the user was seeking, it would sometimes start moving in the wrong direction. We narrowed it down to what is probably the same bug!

Whilst the user is seeking, the UI is constantly updated, and the code was in a hot path, so it makes sense.

We ended up replacing the Math.abs function with a custom one on the affected device to fix it.
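A replacement along those lines could be sketched as follows. This is not the code from the TV app: `safeAbs` and `installAbsWorkaround` are hypothetical names, and how the affected device gets detected is left to the caller as an assumption.

```javascript
// Hypothetical stand-in for Math.abs that avoids calling the built-in.
function safeAbs(x) {
  const n = +x; // same numeric coercion the built-in applies to its argument
  if (n < 0) {
    return -n;
  }
  if (n === 0) {
    return 0; // normalizes -0 to +0, as Math.abs does
  }
  return n; // positive numbers, Infinity, and NaN pass through unchanged
}

// Hypothetical installer: only patch the global on devices known to be affected.
function installAbsWorkaround(isAffectedDevice) {
  if (isAffectedDevice) {
    Math.abs = safeAbs;
  }
}
```

Gating the patch behind a device check keeps every other platform on the engine's own implementation, which is usually faster and better tested.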

Jacob Voytko

Amazing, I've never heard about someone else hitting it. If this happened sometime in the range 2011-2014, then it was very likely the same bug.

Tom Jenkinson

Yeh, I think that probably lines up. TVs tended to be a bit behind on updates. Do you have a link to the commit/issue on V8?

Jacob Voytko

Unfortunately no; I tried finding it while writing the publication, but I couldn't find the specific issue. I don't know enough about the V8 project to know what that opcode refactoring project was called or how they might have triaged.

Daniel Orner

40 seconds was slow, I guess? :D I had to debug a memory leak in a kiosk application in the very early 2000s. It only ran on IE6. It took *over an hour* for us to even be able to notice the memory going up. And this was way before debugging tools were built into browsers. We had to use third-party hacker-ish DOM inspectors to get anywhere.

It took us *weeks* to solve this one. And to this day we never figured out why it happened. The fix was to take the button which our robot was constantly pressing, and wrap it in a link. The memory leak went away. We have no idea why.

That's pretty much my only "when I was your age" geezer story.

ewatch

I've never read such an exciting text about a bug. There was always the question in my head: "What will happen next?"

Nice! Thank you!
