War story: I fixed this bug after 3 months with a shower thought
I walk through a bug from a computer vision prototype in 2009 that required hardware fixes, software fixes, BIOS fixes, Windows Registry fixes, and a tour through distributed queuing.
Back in 2009-2010, I worked on a DARPA contract. The contract was to create an augmented-reality video game training simulator. Soldiers would enter a fake Iraqi or Afghan village, strap on VR goggles, and start a mission with their team. A training scenario, synchronized across every member of the team, would play through the goggles. They could walk through the world normally, and the characters in the scenario were shown on the goggles.
For the curious: the hardware setup was 4 cameras on a helmet, an IMU, VR goggles, and a plastic gun. The gun was placed into the scene using computer vision. Tying all of this together was a top-of-the-line gaming laptop in a backpack.
Here is baby me, wearing the finished research prototype at Camp Pendleton:
As usual, my company had overpromised our capabilities to win the contract, and delivering on those promises was the engineers’ problem. It was an 18-month-long death march.
We did have some advantages. We had research prototypes where camera pairs could be tracked through 3D space. We had a new GPU-powered stereo vision algorithm that performed well on real-world occlusions like branches and fingers. We had some code that used a “landmark database” to correct any dead-reckoning drift that had accumulated in the system. These were all required by the system.
The advantages got us like 5% of the way. We still needed working hardware. We needed all of the systems code to interop with that hardware. We still needed to identify VR goggles that could work with the system. We still needed to manage a fleet of these machines remotely during the scenario. And my boss demanded that we repurpose a pipes-and-filters library from an old project at the company, so we had to throw away all of our normal executables and rewrite everything within this framework.
About a third of the way into the project, we finally had all of the hardware and software ready for a first test run of the independent backpack system. This was just a small part of the project — it didn’t include the guns, the scenarios, etc — but it was proof of life. The hardware worked okay on the bench. The software worked okay on the bench. Let’s see what happens when we run it together! We turned everything on, strapped a researcher into the system, created a scene where a guy was standing still in front of him, and then started the scenario.
It actually worked! The researcher was slowly shuffling around[1] and looking at the virtual bad guy. Then he stopped. “Something’s wrong. The FPS dropped. It’s barely working. This isn’t even showing 1 frame per second.”
We passed around the helmet and looked. The video was really choppy.
We did the most reasonable debugging step: we turned everything off and on again. The same thing happened. It worked fine for a few seconds, and then it got unusably choppy.
I opened the backpack flap. Heat whooshed up into my face. I took the laptop out. It was almost too hot to handle. We checked the CPU temperature, and it was through the roof. Well, that’s bad news! We needed this to operate in deserts, and it wasn’t even working on a crisp fall day in New Jersey.
My boss wanted to see what happened if we prevented the laptop from doing any thermal throttling. We found a BIOS setting related to throttling. I crawled through MSDN and developer forums looking for any documentation or posts about this and found a Windows Registry setting we could try.
We started the system again. “I think it worked,” the researcher with the helmet said. And then the laptop powered down. At first it wouldn’t turn on, but eventually it sputtered back to life. The laptop got so hot that it tripped a temperature sensor and the machine powered off rather than suffer damage.
Someone said, “Just cut it open, this backpack idea is junk. Just cut as much of the fabric off as you can. Tie the laptop to the straps.”
So some Guy With A Knife materialized and cut the backpack down to the straps, and we zip-tied the laptop to them. Then we started the system again.
The researcher got up and shuffled around. It lasted longer than it had before. We started doing experiments, but after a couple of minutes the video got really choppy again.
I wanted to run an experiment. “If this is heat, shouldn’t it recover when it cools off? Can we just leave it on the desk for a few minutes and see if it gets better?”
We restarted the scenario, waited until it started stuttering, put it on the desk, and checked back 15 minutes later. The framerate never recovered and the computer’s temperature was fine.
One of the systems guys wondered if there was separate power-based throttling. Since the machine wasn’t plugged in, he hypothesized that either Windows or the BIOS was slowly scaling the clock speed below the performance we needed. The deadlines were pretty tight: it took just under 100 milliseconds to run the stereo algorithm, and we were taking 10 pictures per second. We did some more research and discovered (1) yes it was probably doing power throttling, and (2) we couldn’t disable it on this specific laptop. So we did the next best thing: we had the researcher carry around a huge portable battery with the laptop plugged in. The system lasted longer! But after about 10 minutes, it started stuttering again.
The hardware guys started spec’ing their own portable battery that could fit onto the backpack to prevent the power throttling. Meanwhile, my boss pulled me aside and told me he thought there was a system problem and that we needed to do much more intensive bench testing. He wanted me to run everything on the bench all day and try to identify the problem. In some ways, he was saying, “Please argue with everyone for an entire day.” There was only one working unit, and everyone needed it.
And this created a huge buzz around the office! My boss asked me about the system problem whenever he saw me. People kept asking me how the system problem was going. My vice president popped his head in and asked about the system problem. I was quite argumentative 15 years ago, and even I couldn’t argue with this. There probably was something happening in this new pipes-and-filters code I had written!
On Monday, I put our only working prototype on the bench and turned it on. I started swatting away everyone who tried to take the system for whatever they were doing. After about 10 minutes, it suddenly started exhibiting the problem. I wondered: is it always about 10 minutes? I restarted it and waited a half hour. No stuttering. I went to lunch and came back. It was stuttering again.
Okay, how did our pipes-and-filters code work? It was controlled by an XML file[2] that specified a series of processing nodes in a graph. Each node named a DLL and class that should be loaded into the application, and let the user specify three things:
The node’s configuration: name, any node-specific configuration, input queue depth, output queue depth.
All of the data in the node’s output.
A list of other nodes that are the inputs to this node.
And then the framework would load all of the DLLs, connect the graph together, etc. This was a framework that had been written for a previous project. My boss had been eyeing it for some time, and this was finally his chance to split it into a standalone library. My boss also had some axes to grind. The elders who ran the previous project had strict rules: all pipelines should be linear (do NOT run things in parallel!), filters should only pass along exactly what needs to be output, and filters should be very granular. My boss had the exact opposite philosophy: the pipeline should be maximally parallelizable, and every filter should pass along every packet it got, “because you’ll always need to correlate something with something else eventually, and there’s just so much churn constantly editing the configs as you develop.”
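I no longer have the framework code, so here is a minimal sketch of what a node looked like conceptually. Every name and signature below is my own invention for illustration; the only part that matters is the last line of process(), where a node forwards its entire input bundle downstream in addition to its own output.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical reconstruction of a pipes-and-filters node. The names and
// types are invented; only the "pass everything along" behavior is the point.
struct Packet {
    std::string stream;                 // e.g. "imu", "camera0", "stereo"
    double timestamp = 0.0;             // seconds since the scenario started
    std::vector<std::uint8_t> payload;  // raw sensor or derived data
};

class Node {
public:
    virtual ~Node() = default;

    // Called by the framework for each bundle of packets pulled off the
    // node's bounded input queue.
    std::vector<Packet> process(const std::vector<Packet>& inputs) {
        std::vector<Packet> outputs = compute(inputs);

        // My boss's philosophy: forward every packet you received, not just
        // your own result, so downstream nodes can correlate anything with
        // anything without touching the config. This is the line that
        // eventually sank us.
        outputs.insert(outputs.end(), inputs.begin(), inputs.end());
        return outputs;
    }

protected:
    // Node-specific work (visual odometry, stereo, etc.).
    virtual std::vector<Packet> compute(const std::vector<Packet>& inputs) = 0;
};
```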
How often did data flow through the system? The cameras took pictures at 10 FPS, since the laptop could only run the stereo vision GPU algorithm in about 100 milliseconds, give or take. The IMUs generated position data at 200 packets per second. There was some fanout to a few calculator modules (visual odometry and a few other things that I don’t remember), the images went to the GPU stereo module and a few other modules, and then all of the data flowed into an octopus-merge sink that merged all of the position and stereo data with the camera images and sent it to the video game engine for display. I’m a little sorry for the vagueness, but this was quite some time ago. The point is, we did as much fanout as possible and then merged it all back together for the final publication to the video game server.
At this point in the investigation, I had no idea where to start, so I started looking at each of the individual components.
Each node had fixed-length queues. Were these somehow filling up? I doubled their lengths. Within a half hour the system was stuttering. I tried tuning other configuration settings. No dice.
I pored over all of the code I had written. I had written a merge filter that was a heavy abuse of the C++ STL’s map type, and I was quite proud of it. Was it somehow the problem? I added some logging to it and ran the system until the problem happened again. It didn’t seem to be the problem.
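For the curious, here is a from-memory sketch of the shape of that merge filter; it is not the original code, and the names are invented. The idea was to buffer every incoming packet in a map ordered by timestamp and, whenever a packet from the slowest stream (stereo) arrived, sweep everything up to that timestamp into one merged bundle for the game engine.

```cpp
#include <map>
#include <vector>

// From-memory sketch of a timestamp-keyed merge filter. Invented names;
// the real filter handled many more streams, but the shape was the same.
struct Packet {
    double timestamp = 0.0;  // seconds since the scenario started
};

class OctopusMerge {
public:
    void add(const Packet& p, bool is_stereo) {
        buffer_.emplace(p.timestamp, p);
        if (is_stereo) {
            emit_up_to(p.timestamp);  // stereo sets the output frame rate
        }
    }

private:
    void emit_up_to(double t) {
        // Gallop through everything at or before the stereo timestamp,
        // merge it into one bundle, and send it to the game engine.
        std::vector<Packet> bundle;
        auto end = buffer_.upper_bound(t);
        for (auto it = buffer_.begin(); it != end; ++it) {
            bundle.push_back(it->second);
        }
        buffer_.erase(buffer_.begin(), end);
        send_to_game_engine(bundle);
    }

    void send_to_game_engine(const std::vector<Packet>&) { /* network I/O */ }

    std::multimap<double, Packet> buffer_;  // timestamp -> packet
};
```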
At this point, I went over everything I had done with my boss. We brainstormed one or two more things it could be, but over the next few days I tested each of them, and the system would still seize up after 15-30 minutes.
And at that point, my boss decided that we should stop investigating it. It’s just a research prototype, we’re not shipping this to customers. We just needed to make it easy to restart a scenario remotely when it acted up. So for a few months, we just lived with this.
And then one day three months later, I was in the middle of a meeting for an unrelated project. I was speaking a sentence that had nothing to do with the project. Suddenly a moment of clarity overcame me.
The stuttering isn’t caused by a single component. The stuttering is caused by EVERYTHING. The sensor sample rate. The fanout. The fan-in. The GPU algorithm. Passing every packet along.
So here’s what was happening. We had about 240 sensor samples per second: 200 IMU samples, plus four cameras producing 10 images per second each. Do you remember the rules that our elders had for using this system? Never parallelize, and only pass along data that you will use? Well, it turned out those didn’t just exist to annoy my boss.
So let’s say all of these sensor messages fan out to 4 modules. Each of those 4 modules gets 240 sensor messages per second. If they then pass all of these messages to a merge, that merge has to process almost 1000 messages per second, as does every downstream module.
The downstream modules all had input queues. When a queue in this system filled up, it stopped accepting new messages and simply dropped them. We thought that by setting the queue lengths in the thousands, we would always be guaranteed to catch up.
However, we had an extremely tight budget on our stereo vision. It took a little less than 100ms to run, and needed to process 10 frames per second. Did it really take a little less than 100ms to run? No, in practice that was the floor. It wasn’t uncommon for it to take 120 or 130 milliseconds. The laptop was doing a lot of work. RAM was filled to the gills, the CPUs were running close to max, and the GPUs were also fully saturated. Just because of system scheduling, sometimes it took even longer to finally get around to reading the GPU response.
The problem was that the octopus merge effectively waited for the stereo packet to synchronize the outputs of the various streams, so it ran at the same frame rate as the stereo module. Whenever it finished processing a frame, it would gallop through a backlog of IMU packets until it reached one carrying camera frames it hadn’t yet seen. But every delay made its input queue a little longer, and longer, and longer. Eventually the queue filled up completely, and new camera frames were mostly dropped on the floor. Suddenly most of the images never got merged or sent downstream, because they were never added to the queue in the first place. And the final few steps of the system (merge and send to the video game server) were very cheap, so nobody noticed that they were doing appreciably less work.
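Here is a back-of-the-envelope simulation of that feedback loop. The numbers are assumptions pulled from the story (10 camera frames per second, stereo averaging roughly 120 ms per frame, a queue about a thousand entries deep), not measurements from the real system, but they show why “about 10 minutes” kept coming up: the backlog grows by a frame or two per second until the queue is full, and only then do frames start getting dropped.

```cpp
#include <cstddef>
#include <cstdio>
#include <deque>

// Back-of-the-envelope simulation of the merge's input backlog. All numbers
// are assumptions taken from the story, not measurements of the real system.
int main() {
    const double frame_interval = 0.1;         // cameras produce a frame every 100 ms
    const double stereo_time    = 0.12;        // stereo often took ~120 ms per frame
    const std::size_t queue_capacity = 1000;   // "queue length in the thousands"

    std::deque<double> queue;      // timestamps of queued camera frames
    double consumer_free_at = 0.0; // when stereo finishes its current frame
    int dropped = 0;

    for (int frame = 0; frame < 12000; ++frame) {  // 20 minutes of frames
        const double now = frame * frame_interval;

        // The consumer drains whatever it had time to process by now.
        while (!queue.empty() && consumer_free_at <= now) {
            queue.pop_front();
            consumer_free_at += stereo_time;
        }
        if (consumer_free_at < now) consumer_free_at = now;  // no banking idle time

        // The producer enqueues the new frame, or drops it if the queue is full.
        if (queue.size() < queue_capacity) {
            queue.push_back(now);
        } else {
            ++dropped;
        }

        if (frame % 600 == 0) {  // print once per simulated minute
            std::printf("t=%5.0fs queue=%4zu dropped=%d\n",
                        now, queue.size(), dropped);
        }
    }
}
```

Run it and the queue depth climbs by roughly 100 entries per minute; around the 10-minute mark it hits capacity and the dropped-frame counter starts climbing instead, which is exactly the “fine for a while, then suddenly choppy” behavior we kept seeing.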
So I removed the “pass through unnecessary packets” behavior and put a system on a bench. Multiple hours later, it was still running smoothly. It even seemed to be a little more responsive than before! I finally caught up with my boss and delivered the news.
He went through all seven stages of grief. He was shocked that I even suggested this. Denial came quick. Anger came quicker. But we worked through the stages until he reached the inevitable conclusion: he couldn’t just pass all information through to each filter. We actually needed to spend a few minutes modifying the config whenever we needed data correlated downstream.
What are the takeaways? Well, first, having good monitoring would have made this a non-issue. However, that was a bit challenging in 2009; it was years before common monitoring tools like Prometheus and Grafana existed. Second, this may have been the first “emergent behavior” bug that I debugged, where no individual component of the system was malfunctioning but the whole system was misbehaving. When I was 24, I was still fixated on the idea that some individual component must be misbehaving. In a way, I was the component that was misbehaving, because I was one of the technical owners of the application graph.
Do you have a war story that you’d like to share? If so, email a brief pitch to jakevoytko@gmail.com
[1] The cameras on the helmet were spaced really far apart to make stereo vision work, like 2x a normal interocular distance. When you viewed this through the goggles, it basically made you unable to move, because your body stopped trusting how far away anything was. We never overcame this, and it was probably the primary takeaway for the government.
[2] What a depressing thought.