The Butler Did It


There's that scene in Five Easy Pieces (still fresh in my mind), where Jack Nicholson sits down behind the wheel of his car, ready to drive off. There's a motionless three-count, and then he explodes in this spastic burst of pure frustrated rage, somewhere between hitting the wheel of his car and whamming his head into it full force. It's a stunning outburst, like a human safety value exploding. That's kind of how I felt today at work.

The bug was astounding. I recognize that, on the all-time scale of monster bugs, one that takes only a day to track down doesn't even make the top ten thousand. There are bugs out there that would only repro after two straight weeks of uptime on a ten-computer cluster, bugs that manifest only in the precisely-timed interaction of twenty different subcomponents, memory leaks that cross several hundred function barriers, all sorts of truly nasty bugs. That said, I did literally nothing else today, for about nine or ten working hours, beyond chase after this rapscallion. It was amazing.

At first, I thought it was something wrong with my changes to the project. But then it repro'ed on my vanilla second enlistment. In fact, it reproed on a lab build from two weeks ago, one that passed all sorts of stringent build lab tests. Further probing indicated that it only happened for debug builds, and only on my machine. If I picked up Matt's build, it broke; but if Steve picked up mine, it ran just fine. I tried swapping in and out every component I could change, I reinstalled everything I could reinstall, I scanned my hard drive for bad sectors, flushed every cache I knew about, erased my temp directory. Still nothing. It was like my computer was cursed.

I was halfway ready to reformat my hard drive, waste a day reinstalling everything, and write it off to some kind of crazy bad luck, but I was stopped by the same remorseless logic that made the whole experience so frustrating. Clearly, there had to be some kind of state on my machine that wasn't getting cleared properly and was interfering in some strange way with every run of the program. But unless the product was doing something radically different than it was designed to, I'd systematically replaced or regenerated every component it interacted with.

I spent the next few hours methodically trying to produce a minimal repro case. I boiled a file several pages long into three lines, none very long, which meant that my version was failing at a truly basic level -- even though I could walk it through some other extremely long sequences of commands, and had been for days, without trouble. It was just this one, seemingly innocuous case, where it failed. I walked through the code over and over again -- extremely unpleasant, given that a large part of the failing code was seriously asynchronous and another part involved a highly-optimized and therefore highly-incomprehensible parser -- until I finally noticed that I was specifying one object, but two were being created.

That was only really the beginning of the end. It took me another hour or two, with some major help from my next-door-neighbor, to turn that clue into an actual explanation of what was going wrong. It was one of those bugs that crosses so many national borders that it's hard to pin blame on any one person. One guy wrote a function that was potentially unsafe, on account of the idiosyncratic, but probably justifiable, behavior or another component. One other guy managed to call that function under the precise circumstances that brought out that danger. And for some unknown reason, my computer was lighting the fuse that everyone else was ignoring. We had a weapon, we had a suspect, but no motive. Why me?

I think Steve and I noticed it at the same instant. It was the damn debug output code. I'd been debugging this one segment of code the day before, under different circumstances, and had turned on the logging for a particular object. The different circumstances had meant that I hadn't stumbled across the particular screw case until I returned to my normal course of action today -- without turning off the logging. In other words, under this particular scenario, running with the debug output on caused the product to crash, but if you ran without the debug output (either by running retail, or, like any sane person, i.e. someone other than me, not setting this particular flag), you could go forever without seeing anything wrong.

Of course, the debug tool was precisely the missing piece of state I'd been looking for. The piece of missing state I'd been looking for by doing fresh builds and reinstalls, then firing up the program and the debug tool, which so ever-so-thoughtfully remembers your settings across runs. A real time-saver during debugging, no? It was a real Purloined Letter moment. Gnaaaaarrrrrrrgh!