How would you diagnose the following bug?
- A number of checkboxes representing user interests (Football? Music?); user can select/deselect their interests.
- Software has worked well for years in production.
- Suddenly intermittent reports start coming in that sometimes some checkboxes get unchecked by themselves.
- You test live, you test on the test server, all is good.
But the reports keep on trickling in, leading you to suspect that it isn’t just user foolishness (e.g. not understanding how to use a checkbox, which I wouldn’t normally put past users..)
This happened at Uboot around 2005.
I’ve forgotten how it came to pass that we fixed the bug. But the bug was this. Perhaps it sounds obvious once explained, but it was anything other than obvious at the time, especially given the fact it was completely unreproducable.
The software was coded in Perl5, and I hadn’t coded this particular screen myself. But, I had no reason to suspect that anything was wrong with the code, as, as I say, it had worked well live for years.
In Perl, “everything is a string”, apart from that, in fact, that’s not true at all. “everything appears to be a string; but might not be” describes the situation better. If you have a string “foo” then it’s stored as a string. If you have a string like “45″ then it’s stored as an integer internally and converted to/from a string as needed. If you have a string like “45.6″ then it’s stored as a double internally and converted to/from a string as needed. (Or it might be more complex than that, I’m not sure, perlnumber)
Supposedly this makes everything “easier” if you treat everything as a string. But I have no anger towards the junior developer who coded this screen who believed, as things appeared to be strings, things actually were stings. I mean, why wouldn’t you think that?
The checkboxes were stored in the software as a huge bit field. Perhaps this wasn’t the best representation, as that doesn’t scale (if you want to store 100 interests, you’re going to have to change the approach). But, that was the representation that had been chosen and, as I said, this had been online and worked well for years.
At some point, someone had decided to augment our 64-bit servers with some 32-bit servers. So you can imagine the rest. 2 out of 30 servers were 32-bit, 2 out of 30 clicks went to those servers, our dev server was the older 64-bit server, all our software had been developed on the old 64-bit servers. So 2 out of 30 clicks, all interests apart from 32 of then got lost.
Lesson learned (or not, as the lesson would have to be learnt by the programming language community): If things appear to be strings, make sure they actually are strings. Or, make sure it’s obvious that they’re not strings.
Had the software been written in Java, this couldn’t have happened as, independent of machine word size (32-bit or 64-bit) each Java numeric data type has a defined width, and is guaranteed to behave identically on any JVM implementation.