Tim's Weblog
Tim Strehle’s links and thoughts on Web apps, software development and Digital Asset Management, since 2002.

The story of my favorite bug

In the fifteenth year of my software developer career, I encountered a remarkable bug that would “entertain” me for weeks.

“Garbage in, garbage out”?

It started off rather innocuously: At a recently installed customer site, I noticed that a few XML files sent from the editorial system weren’t imported correctly into our DAM system. The files were imported, but parts of the data looked bad.

My first assumption was that the editorial system had generated faulty files – “garbage in, garbage out.” But they seemed no different from successfully imported files, and when I reimported the files that originally failed, the problem was gone.

I had no idea where to look and decided to live with the problem for a while. Each day, I checked for failed imports and manually triggered a reimport. But sooner or later I would have to find and fix the cause.

Reproduce it

I tried to reproduce the problem on a development server and on the customer’s test server, but was unable to. I could reproduce it only when doing test imports of large batches of files on the production server.

I found out that the bad data after import was due to an XPath expression in an XSL transformation not returning what I was expecting – some of the time. I tweeted:

“Weird #XSLT bug keeps haunting me: Each day, one or two of some 100 transformations w/ #XPath a/* return a/b, a/b/c instead of just a/b.”
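To make the symptom concrete: a/* should select only the direct child elements of a, never deeper descendants. The following is a minimal Python sketch of that expected behaviour, not the original XSLT. The sample document is made up for illustration, and Python's ElementTree implements only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: a contains b, and b contains c.
root = ET.fromstring("<root><a><b><c/></b></a></root>")

# a/* selects only the direct children of a.
children = [e.tag for e in root.findall("a/*")]

# a//* selects all descendants of a, which is what the buggy runs
# appeared to return for a/*.
descendants = [e.tag for e in root.findall("a//*")]

print(children)     # ['b']
print(descendants)  # ['b', 'c']
```

In the failing runs, the result of a/* looked like the second list instead of the first.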

It took me many hours to boil the pretty complex XML and XSLT files down to a tiny test case that still let me reproduce the bug. I ran that test case on as many Linux servers as I could get my hands on, learning that only the SUSE Linux Enterprise Server 10 (SLES 10) operating system was affected. (The test server had openSUSE instead of SLES installed, which is why I didn’t run into the problem there.)

Still, I had no idea where this was coming from. My next tweet, slightly desperate, linked to the test:

“I’m running the same #XSLT transform 1,000 times, and get a wrong result 29 times: http://www.strehle.de/tim/check-for-xslt-bug.phps … (SLES 10 only) #PHP #XML #heisenbug”
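The idea behind such a check is simple: run the same operation many times and count how often each distinct result appears. Here is a rough Python sketch of that idea. It is not the linked PHP script, and the flaky transform is simulated with a seeded random number generator, because the real bug depended on the broken system library. The names count_results and make_flaky_transform and the failure rate are assumptions for illustration.

```python
from collections import Counter
import random

def count_results(transform, runs=1000):
    """Call transform() `runs` times and count each distinct result."""
    return Counter(transform() for _ in range(runs))

def make_flaky_transform(failure_rate=0.029, seed=1):
    """Simulated stand-in for the XSL transformation (hypothetical).

    Usually returns the correct result, occasionally the wrong one.
    """
    rng = random.Random(seed)

    def transform():
        return "a/b, a/b/c" if rng.random() < failure_rate else "a/b"

    return transform

counts = count_results(make_flaky_transform())
for result, n in counts.most_common():
    print(f"{n:5d} x {result}")
```

A deterministic transformation produces exactly one line of output here. More than one line means the same input gave different results.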

Root cause

By now I was pretty sure that this weird behaviour was not my fault. Finally, I discovered a report on the Web that linked strange libxml2 symptoms to a bug in glibc, the low-level C library. That broken glibc version was shipped with the SLES 10 release we had on the production servers.

The fix

It was great to know the root cause, but updating the operating system on the production servers was not going to happen for several months.

Many parts of the DAM workflow relied on complex XSL transformations. Rewriting XSLT logic in PHP would be a ton of work. But how could I keep using XSLT when I couldn’t trust the XSL processor? It worked most of the time, but I never knew when it would fail.

Suddenly, I had a revelation – the previous sentence already contained the solution: “It worked most of the time”! I simply had to run all XSL transformations in a loop and use the result that was returned most often.

This approach fixed the problem once and for all, and the workaround was active for years. Two months after noticing the first symptoms, my final tweet said:

“Weird hack: The heisenbug happens only sometimes, so I run each XSL transform 20 times and use the most common result…”
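The majority-vote workaround can be sketched in a few lines. This is a Python illustration under stated assumptions, not the production code, which ran in PHP. The function names are hypothetical, and the stand-in transform fails on every seventh call so that the example behaves the same on every run.

```python
from collections import Counter

def most_common_result(transform, runs=20):
    """Run transform() `runs` times and return the most frequent result.

    Assumes the results are hashable (for example strings) and that the
    correct result clearly outnumbers the wrong ones.
    """
    results = Counter(transform() for _ in range(runs))
    return results.most_common(1)[0][0]

def make_mostly_right_transform(every=7):
    """Hypothetical stand-in that returns a wrong result every Nth call."""
    calls = 0

    def transform():
        nonlocal calls
        calls += 1
        return "a/b, a/b/c" if calls % every == 0 else "a/b"

    return transform

winner = most_common_result(make_mostly_right_transform())
print(winner)  # a/b
```

The trade-off is cost. Every transformation runs 20 times, and the vote only helps while wrong results stay a small minority, as they did here (29 out of 1,000 in the test).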