You can’t work out what the type of data is by looking at the data itself

By Adrian Smith · 10 Apr 2008 · 500 words · 2 mins to read

I was reading this article about how certain data gets mangled when you import it into Excel: data that looks like a date gets silently converted into one. It reminded me of a problem I had when transferring data over an XML protocol from Perl: the SOAP library was inspecting the hex data I was transferring, and a small percentage of hex numbers look like "123e123", which the library took for a floating-point number.
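The ambiguity is easy to reproduce. Here is a minimal sketch in Python (the original code was Perl, so this is only an illustration): the same seven characters are a perfectly valid hex number and a perfectly valid float in scientific notation, and nothing in the string itself tells you which was meant.

```python
# A value that is valid hexadecimal but also parses as a float.
value = "123e123"

# Interpreted as hex, it is an ordinary integer...
print(int(value, 16))   # 19128611

# ...but a library that guesses types from the text sees
# scientific notation instead.
print(float(value))     # 1.23e+125
```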

I think both problems are actually the same problem. It can be traced back to the necessity to make exactly one of the following two decisions when creating data-processing systems:

  1. Either you try to work out the datatype of a piece of data by looking at, say, its string representation: this data is "abcd" so it's a string, this data is "123" so it's a number.
  2. Or you explicitly store and state the data type, external to the data itself: store not only that the data is "123", but also that it's a number.

Option 1 seems attractive: it's simpler, since you only need to store one piece of data. It also feels more normalized, as one piece of data is generally better than two (what if they're inconsistent, e.g. the data is "abcjzh" but the type is "number"?).

The trouble with option 1, though, is that it doesn't work.

But in fact the problems with option 1 are worse than that. Option 1 seems to work, yet doesn't actually work in all cases (and you want your software to work in all cases). That's more dangerous than if it simply and clearly didn't work.

The authors of the SOAP library in my example presumably believed their software worked. And I believed my software, built on top of the SOAP library, worked. It worked in my unit tests and when I tested it by clicking through the front-end. Only 0.6% of users had a code with a hex string that looked like an exponent, so it's understandable that I just didn't hit it when testing. But with, say, 2M users in the database, some of my users will hit it. And that means the software I released didn't work (working meaning working 100%, for everyone).
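That "small percentage" is not far-fetched. As a rough illustration (the code length and pattern here are my assumptions, not the article's actual data), you can count exactly how many fixed-length hex strings happen to look like a float in scientific notation:

```python
import itertools
import re

HEX_DIGITS = "0123456789abcdef"
# Matches things like "12e34": digits, one 'e', digits.
looks_like_float = re.compile(r"^\d+e\d+$")

n = 5  # brute-force all 16^5 five-character hex codes
total = 16 ** n
hits = sum(
    1
    for chars in itertools.product(HEX_DIGITS, repeat=n)
    if looks_like_float.match("".join(chars))
)
print(hits, total)  # about 2.9% of all five-character hex codes
```

So a few percent of random hex codes collide with the float syntax — frequent enough to bite real users, rare enough to slip through casual testing. That combination is what makes guessed types so dangerous.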

But I like my software to work. The way I achieve that is to avoid errors that are difficult to detect. Making errors is human; if they are easy to catch, one can spot them and then correct them.


