Archive for the ‘Software Design’ Category

If you’re going to change databases, do it in one go

Tuesday, June 25th, 2013

At Uboot, there was the suggestion that we should change database technology. This was because we would save money by moving away from Oracle to e.g. MySQL.

It was further decided, as we had 200k+ lines of code and 100+ tables, that if a migration were to occur, it shouldn’t happen “all at once”; instead, some parts of the system would be migrated while other parts were not yet migrated.

But the strategy of migrating databases piece by piece has the following problems:

  1. Data consistency. Initially, all important data are within the source database. It is possible to guarantee certain consistency constraints (e.g. all photos are owned by existing users). However, if e.g. photos are migrated to MySQL, while users remain in Oracle, no such consistency can be guaranteed. Inconsistency has the consequence that software must accept not only valid data, but also invalid data. It is impossible to test the software for all combinations of invalid data, and with the rule that “software which is not tested does not work”, one can deduce that the software will then not work. The set of inconsistencies will increase over time, so the set of users who experience bugs will also increase over time. Afterwards, going from an inconsistent state to a consistent state is nearly impossible.
  2. Reporting over multiple system areas. With one database, one can ask a question such as “how many photos belong to users logged in in the last month”. A “SQL join” is used. However, if the tables reside in different databases, no such query can be asked. One would have to export the data from one database to another in order to ask the query. Or export all data into a centralized data warehouse. Or write software to do what a single line of “join” would do. All this effort is saved by using a single database.
  3. Point-in-time recovery. With one system, if the database crashes or an error is made (e.g. “drop table account” is executed by an employee), a “point-in-time recovery” can be performed. Data is restored to its consistent state as it was at a certain point in time. If one uses more than one database, a point-in-time recovery will be more difficult. For example, if the databases are out-of-sync by 1 second, then some photos will exist for which there are no users, or vice versa, and data consistency problems exist as described in point 1.
  4. Spending money on migration; no saving in license fees. If a migration is done, time/money is spent. Not only in doing the migration, but also in fixing performance problems after the migration is done. If money can be saved on the Oracle license, perhaps this time is worth it; however, if one is using multiple databases then one is still paying for Oracle.

The solution is either to change databases in one go, or simply don’t change databases. You want to have your data in one place, at all times.

Map keys: Only allow strings, don’t allow “any object”

Monday, March 18th, 2013

Associative arrays. Maps. Hashes. Dictionaries. They are an important cornerstone of data modelling in software. Every language has them.

Question: What types should the keys be?

The answer depends on what language you’re using. I have spent a lot of time programming Perl and Java, and they answer this question differently. Perl’s hashes force you to have string keys. Java’s Map allows you to have any objects as keys, and you can override (amongst others) the equals method to allow the map to know if two keys are the same or not. C++ STL’s “map” is the same as Java, but in addition allows you to supply code, when you create the map, to determine if two key objects are the same or not.

The Java and C++ STL way definitely seems to be more general. But after years of programming in both worlds, I’ve come to a conclusion: The way Perl does it, allowing only strings as keys, is better.

It’s obvious whether two strings are equal or not. It’s not obvious whether two arbitrary objects are equal or not.

Take a User object. This represents a user in the system. It has an id, a name, a picture and so on. You want to store information against users, so you create a Map with Users as keys and the information as values. The code Map<User, MyInfo> reads easily enough. Adding entries and fetching entries from this map reads easily enough.

But what does it actually do? Under what circumstances are two User objects equal? When their IDs are equal? When all fields are equal? When certain key fields such as names are equal? What if two users haven’t been inserted into the database yet and thus don’t have IDs yet?

This is a whole can of worms. I have encountered subtle bugs due to object equality not being what I expect it to be in a certain situation (even though it looked reasonable enough when the class was defined). These bugs have been difficult to trace and fix, and aren’t always the sort of bug which even manifests quickly or consistently.

Adding information to the map with map.put(user.getId(), myInfo) is also easy to read. You know exactly what it does. You don’t have to program those annoying equals and hashCode methods, so your business logic becomes less diluted by implementation artefacts. Those methods also can’t have bugs if you haven’t programmed them.

A disadvantage: If you’re using a language supporting static typing and generics, then extra readability and compiler-verification can be gained by stating the type of the key of a relationship. Just using “String” doesn’t allow this (is it a user id? the name of the user?). But I still think that the simplicity aspect described above is more important.

Perhaps one could argue that numbers should be allowed as keys too. But I think, for simplicity, there should only be one type of key. A number can be converted to a string easily enough (as long as one makes sure to remove any leading zeros, which would make two identical numbers appear to be different strings). More complex data, that one might want as a key, can be more easily converted to a string than to a number.

Regarding sets and finding unique objects, the situation is similar. Perl has no facility to do this. Java lets you add objects to a Set; the elements’ equals method is used to determine whether objects are the same or not. C++ STL’s “set” is the same as Java, but also allows you to supply a function to determine if two elements are the same or not.

Again, if you want to find which (unique) users certain information pertains to, in Java you’d create a Set<User> and put the users in there. Again, what makes one User the same as another? Again, subtle bugs can be introduced when your understanding of what makes a user unique doesn’t coincide with that of the user class’s author.

This is the way Perl does it: there are only maps, no sets. You place objects in the map as values, and the unique aspect is the key of the map. At the end you just take all the values of the map, and ignore the keys. map.put(user.getId(), user) is easy to write, and it’s clear what’s going on.
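In Java, the Perl-style approach might be sketched like this; the `User` record and the `uniqueById` helper are hypothetical, purely for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class StringKeyedMap {
    record User(String id, String name) { }

    // Deduplicate users the Perl way: key by an unambiguous string,
    // and ignore object equality entirely.
    static Map<String, User> uniqueById(Iterable<User> users) {
        Map<String, User> unique = new HashMap<>();
        for (User u : users) {
            unique.put(u.id(), u);   // the string id alone decides uniqueness
        }
        return unique;
    }
}
```

There is no doubt about when two keys collide here: the strings are equal or they are not, regardless of how anyone defined `equals` on `User`.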

Futures rock

Friday, February 22nd, 2013

I like things that happen in parallel but don’t require mental effort thinking about synchronization, race conditions and so on; i.e. where one can write code that basically looks like sequential code but actually runs in parallel. Futures, backed by threads, which are trivial to implement in Java, are great at this.

My employee Martin’s written some code to fetch data (friends) from various connected accounts, which currently can include:

  • Facebook
  • XING
  • LinkedIn
  • Google Contacts
  • and potentially others in the future

These lookups could take time:

  • Facebook’s systems might take e.g. one second to deliver the information,
  • and there is potentially latency between our system and these various external systems.

To allow a faster user experience, he suggested doing all these lookups across all networks simultaneously in threads, and only return when all threads are finished. Good suggestion.

I further suggested not writing a routine wrapping all the social network lookup code, which starts all the threads and then waits for them to finish. Instead, each network starts a thread of its own and immediately returns a future which, only when some client actually accesses the data, blocks until the thread is finished.

Why? There is also a huge graph full of data displayed over all users (maybe up to 1k users), and “friends” are highlighted in this graph. Displaying the graph involves a massive SELECT to the database. Using futures means the SELECT can also do its work in a thread, and return immediately with a future. This means, without any social network knowing about one another or the SELECT, these things can run in parallel.

So far so good, but it gets better. He suggested using a Wicket AjaxLazyLoadPanel, which means the page gets displayed, and immediately upon loading, the page does an AJAX request to fetch the data within the panel. It’s a nice Wicket feature, and you program it the same as any other panel. The idea is that even if all the above operations take a long time, the page can be loaded quickly (albeit with a “loading” sign for the data).

So, what we’re going to do is start all those futures off when the original page is loaded, and the implementation to display the contents of the panel just uses the results of the futures (blocking if they’re not finished). This means all of the following are done in parallel:

  • Hitting various external systems for friends: FB, LinkedIn, ….
  • Hitting our database to get information for the graph
  • Latency of the user’s internet connection to deliver the original page, and request the contents of the panel.

And all of these are things that use different resources (FB’s servers, our DB server, client’s internet connection) so they will really happen in parallel (in contrast e.g. to performing two disk-bound database operations in parallel.)
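A minimal sketch of the pattern using a plain Java ExecutorService; the network names and the fake `fetchFriends` call are stand-ins, not the real code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;

public class ParallelFetch {
    // Stands in for a slow call to an external system (FB, XING, ...)
    static List<String> fetchFriends(String network) {
        try { Thread.sleep(50); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return List.of(network + ":friendA", network + ":friendB");
    }

    // Start every lookup immediately and return futures; callers block
    // only when they actually access a result, so all lookups overlap.
    // The big graph SELECT can be submitted the same way.
    static Map<String, Future<List<String>>> startLookups(ExecutorService pool, List<String> networks) {
        Map<String, Future<List<String>>> futures = new HashMap<>();
        for (String n : networks) {
            futures.put(n, pool.submit(() -> fetchFriends(n)));
        }
        return futures;
    }
}
```

The code that consumes the futures reads like sequential code: call `get()` on each future in turn, and the total wait is the slowest lookup rather than the sum of all of them.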

This algorithm rocks :-) Futures rock.

P.S. The algorithm admittedly only works on one server. But our web framework Wicket stores a lot of state in the server’s memory so it’s already a necessity to use sticky sessions (i.e. each click for each session goes to the same server). Fail-over (one server dies, session gets replicated and picked up by another server) won’t work for these futures. That’s an unlikely event, one that will be solved with one refresh of the browser, and is a price I’m willing to pay.

P.P.S. Caching the information would be another approach to speeding up the algorithm, and it’s something we might need to look into in addition. But caching has disadvantages. Does FB’s T&C allow us to cache their friends? Even if so, what about stale data? For the big database query, how do we invalidate that cache when a relevant row in the database gets changed? Caching will never parallelize the work with the latency of the client’s internet connection. And what about the first request, where no information is cached yet?

SMS trailers at Uboot

Tuesday, January 8th, 2013

We developed a cool algorithm to add informational trailer texts to SMS at Uboot. It was online from mid 2000 to mid 2011 (when Uboot stopped doing SMS). I developed this algorithm together with my colleague Mike Weinzettl. So that it doesn’t get lost forever, here is a description of it :-)

The general situation was this:

  • The users may send SMS from Uboot to phones
  • SMS have fixed length (e.g. single part SMS contains 160 characters)
  • The user may well type fewer than the maximum number of characters (e.g. 100 characters, resulting in a 1-part SMS with a maximum length of 160 characters)
  • We can use this space to promote Uboot (e.g. add “Sent by” text)
  • We realized we needed a trailer system to define what text should be added to which SMS under which circumstances

The requirements for this trailer system were:


  • Texts should be stored in a separate file/files or DB (i.e. not in the middle of the source code)
  • Different SMSs have a different number of free characters at the end (from zero characters upwards). It would be nice to put longer texts on those SMS which have more characters free.
  • Users speak different languages (English, German, ..)
  • It would be nice to put variables in the text (e.g. “Sent by USERNAME”)
  • The text depends on various other factors: Is the recipient the telephone number of a registered user?

We came up with the following solution.

  • We would have a set of candidate texts, combined with the conditions under which they may be used (e.g. recipient is known user, which language)
  • Conditions could either be a single value (e.g. “en” language), a list of acceptable values, or an indication that any value is acceptable (no condition)
  • Inspired by crontab, we used a file, one line of the file per rule.
  • Each line had a set of columns for the conditions (language,..) with values such as “de” (single value), or “en,de” (multiple values), or * (any value is acceptable)
  • The last column of the file was the text
  • The text may contain variables such as ${USER}

For example:

language  recipient-registered    text
en        yes                     Sent by ${USER} on Uboot
en        no                      Sent by ${USER} on
en        yes                     Sent by Uboot
en        no                      Sent by 
*         *                       Uboot
*         no            
de        yes                     Verschickt von ${USER} auf Uboot

The algorithm would then do the following:

  • Search through the file, finding all rules that matched the condition, take the texts
  • Expand variables in the texts
  • See how many characters are available in the user’s SMS
  • Throw away any texts (with variables expanded) which are longer than the available space
  • Take the longest text of those remaining
  • If more than one text has the same (longest) length, choose one at random
  • If no text matches, simply use no trailer (e.g. if only 1 character is free, it’s unlikely a useful trailer will be defined with that length)
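The selection step above might be sketched in Java like this; the rules are simplified to the two conditions from the example file, and the `Rule` and `pick` names are mine:

```java
import java.util.*;

public class TrailerPicker {
    record Rule(String language, String recipientRegistered, String text) {
        // "*" matches anything; "en,de" is a list of acceptable values
        boolean matches(String lang, String registered) {
            return (language.equals("*") || Arrays.asList(language.split(",")).contains(lang))
                && (recipientRegistered.equals("*") || recipientRegistered.equals(registered));
        }
    }

    static Optional<String> pick(List<Rule> rules, String lang, String registered,
                                 String username, int freeChars, Random random) {
        List<String> candidates = new ArrayList<>();
        for (Rule r : rules) {
            if (!r.matches(lang, registered)) continue;
            String expanded = r.text().replace("${USER}", username); // expand variables
            if (expanded.length() <= freeChars) candidates.add(expanded); // must fit
        }
        // keep only the longest texts, then choose one of them at random
        int longest = candidates.stream().mapToInt(String::length).max().orElse(-1);
        List<String> best = candidates.stream().filter(s -> s.length() == longest).toList();
        return best.isEmpty() ? Optional.empty() : Optional.of(best.get(random.nextInt(best.size())));
    }
}
```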

The advantage of this algorithm is that, given two users who send an SMS with the same amount of space free, different trailers may be the longest, depending on how long their usernames are.

I’m proud of this algorithm :-)

Store users’ birth dates, not ages, in the database

Wednesday, January 2nd, 2013

If you store your users’ ages in the database, in one year’s time their ages will be wrong. If you store your users’ birth dates in the database, and calculate their age from that, this age will always be correct.

Some sites wish to ask the user their age and display it. It would seem simplest to just store this number in an integer field alongside the user in the database. But you can be sure that, in one year, this value will be wrong.

Instead you should ask the user their birth date. Store that in the database. Always calculate their age by seeing how many years have passed between their stored birth date and the current date. As they get older, their age will always be displayed correctly.

In case you really wish to ask the user only their age, one trick we used at Uboot was to calculate an approximate birth date. Assuming they’re halfway between birthdays, and they say they are n years old, assume their birth date was n+½ years ago (or 12n+6 months ago). Calculate the age to display as described above. On the day they enter their age, the display will be correct. One year later it’ll be correct as well. In between it’ll be, well, an approximation. But better than displaying their entered age forever—showing them never ageing.
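Both calculations are a couple of lines with java.time; this is a sketch, and the method names are mine:

```java
import java.time.LocalDate;
import java.time.Period;

public class Age {
    // Always derive the displayed age from the stored birth date.
    static int age(LocalDate birthDate, LocalDate today) {
        return Period.between(birthDate, today).getYears();
    }

    // If the user only told us their age, assume they're halfway
    // between birthdays: birth date = 12n + 6 months ago.
    static LocalDate approximateBirthDate(int statedAge, LocalDate today) {
        return today.minusMonths(12L * statedAge + 6);
    }
}
```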

Offer-Ready Publisher

Monday, December 24th, 2012

At Offer-Ready, we allow companies to find their optimal mobile phone tariff. We’ve developed a bunch of proprietary algorithms to search over millions of potential combinations of base tariffs and optional packs in under a tenth of a second.

We need to store which tariffs are available, and what they cost, in a format that’s easy for us to maintain. However, the calculation process needs it in a format which makes the calculation as quick as possible. Transformation from the first form to the second involves quite a few different steps, quite a lot of pre-calculation, and potentially quite a few hours of computer work.

Recently I created the “publisher” front-end software, which ties all these transformation steps together and makes them accessible via a GUI. The requirements were roughly this:

  • The configurations are stored in a version control system. That way, if an error occurs, we can easily roll back to a version that works.
  • Publishing takes some time, perhaps a few hours. Therefore a “synchronous” solution, where the user clicks and the HTTP response delivers the “success” or “failure” status, is not acceptable; HTTP requests should not last a few hours.
  • During these hours, the calculation system should not be offline. Until a new publish is successful, the “last successful publish” is the previous publish; this configuration should be online until, in an “instant”, the old successful publish is replaced with the new successful publish.

A further advantage of storing the configuration in a version control system is that after a calculation has occurred, we know what version of the configuration it executed against. If a customer rings up and says “this calculation chose the wrong tariff!” we know which configuration that calculation was executed against. We can check the VCS logs: perhaps there was an error in the configuration which has already been corrected?

Here’s a screen shot of the publish screen:

The software design was cool. Basically it’s this:

  • Every time someone clicks publish, a process gets started to do the publish in the background.
  • This process creates a new directory where it does its work (so all other directories, including previous successful publishes, are not affected).
  • This process writes a logfile
  • The front-end simply reads this logfile, and displays it on the screen. (This is a simplification; various other files are written by the process so its status can be determined.)
  • There is a database table describing which directory contains the last successful publish. Once a publish is successful, the process simply updates this table, to set its directory name as the successful publish.
  • Before each calculation request, this database table is queried, and if the successful publish has changed since the last request, it “just-in-time” reloads the configuration with the newly published version.
  • There are multiple web servers serving the customers and performing these calculations. This architecture means they query the central database, the central publisher doesn’t have to contact them. The provisioning process for a new web server doesn’t involve changing the central publisher at all.
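The just-in-time reload check might be sketched like this; the `ConfigReloader` class and its callbacks are hypothetical stand-ins for the real database lookup and configuration load:

```java
import java.util.function.Function;
import java.util.function.Supplier;

public class ConfigReloader {
    record Config(String version) { }

    private String loadedVersion = null;
    private Config config = null;

    // queryCurrentVersion stands in for querying the "last successful
    // publish" table; load stands in for reading that publish's directory.
    Config get(Supplier<String> queryCurrentVersion, Function<String, Config> load) {
        String v = queryCurrentVersion.get();
        if (!v.equals(loadedVersion)) {   // a new publish has gone live
            config = load.apply(v);
            loadedVersion = v;
        }
        return config;
    }
}
```

Each web server runs this check per calculation request, so a new publish propagates without the publisher ever contacting the web servers.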

This is the next generation of the algorithm described here:

United Youth Symbol Page

Monday, December 17th, 2012

This was a cool project I implemented in Summer 2011. Now that my customer has taken it online, I can talk about it. Here it is: United Youth Symbol Page.

The requirements were thus:

  • The task was to have a logo in the browser.
  • The user should be able to alter the colors in various parts of the logo.
  • The user should also be able to add gradients as opposed to just flat colors.
  • The user should be able to download the logo with the colors they’ve chosen, for printing on T-Shirts etc.
  • Ideally – in my opinion – as the logo might be printed out in a large format, we should offer a scalable vector download, not just a bitmap which will get jaggies when scaled up.
  • Ideally – this should all happen interactively in the browser
  • Ideally – I wouldn’t have to completely re-write my program each time the graphic designer altered the logo.

This was what I did:

  • The graphic designer gave me the logo as a vector PDF. I imported this into Inkscape and exported it as SVG.
  • SVG can be displayed in the browser by modern browsers.
  • The program is written in Javascript and runs directly in the browser.
  • SVG is XML. I identified the nodes which controlled the colours, and added extra attributes, like adrian:node="color1". (Inkscape preserves these attributes when you open/save the SVG XML file.)
  • When the user changes a colour, the Javascript program uses XPath to find the parts of the logo needing a change, identified by the adrian:node attributes, and changes them. (This means the program doesn’t need to understand the structure of the SVG XML document, and the structure can be changed without altering the program.)
  • To download the logo as PDF, a form submit is made to the server, with the chosen colours. The same logo file exists on the server as SVG. The same XPaths are used to change the SVG, and then it’s converted to PDF using Java. The PDF file is sent as the response to the HTTP request.
  • Not all browsers support SVG, notably Internet Explorer at the time of programming. Using a test to see if SVG is supported, I fall back to using an <img> for the symbol in that case. Every time the colour is changed, a server call is made, and the library used on the server can produce a PNG as easily as a PDF.
  • At the time of writing, the Java server and the website were on different domains. But this is no problem, as both form submits (for the PDF download) and <img> tags (for interactive editing on IE) pre-date the same-origin policy. (If only some tags are restricted by the same-origin policy, it makes the policy rather useless, but that’s another rant for another time..)
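The server-side recolouring step could be sketched in Java like this; the marker attribute is simplified to a plain `data-node` attribute (the real one was namespaced), and all names here are illustrative:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.*;
import org.w3c.dom.*;
import org.xml.sax.InputSource;

public class SvgRecolour {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // Find every element tagged with the marker attribute and change its
    // fill colour. The code never needs to know the SVG's structure.
    static void recolour(Document doc, String marker, String colour) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            "//*[@data-node='" + marker + "']", doc, XPathConstants.NODESET);
        for (int i = 0; i < nodes.getLength(); i++) {
            ((Element) nodes.item(i)).setAttribute("fill", colour);
        }
    }
}
```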

Strong Passwords: Education not enforcement

Sunday, December 16th, 2012

A website I’m working on allows the user to choose passwords, and can also suggest passwords for the user. Since I’m fully in control of this program and its password generation and checking strategies, I started to think about what would make the best strategies.

Here are some constraints which I’ve seen on various systems:

  • Passwords must have at least one lower-case, one upper-case, one digit character
  • Some passwords must also include special characters such as “!”, “$” etc.
  • Passwords must be longer than a certain length
  • Passwords must be not longer than a certain length
  • Passwords are not displayed on the screen, but replaced by **s
  • Case is significant (If you create your password with “A” and enter it as “a”, it’s wrong)

But here are some observations:

  • If you read or write down a password, you’re going to get the digit 0 and the letter O confused, etc.
  • If you have the Caps Lock key pressed accidentally, you’re going to type the password in capitals by mistake. You won’t even notice because there are only **s on the screen. (Some systems e.g. Windows get around this by warning you, but websites can’t give such warnings)
  • If you are typing on a foreign keyboard, then “!” and “$” are almost certainly somewhere else, or might not even be available. You are going to type them wrong. You won’t even notice because there are only **s on the screen
  • Obviously, the longer the password and the greater the set of allowable characters, the greater the space to be searched in case of a brute-force attack. I have no doubt that’s why case sensitivity is considered a good thing. And obviously minimum lengths.
  • If you enforce too many rules, and the user finds it difficult to find a suitable password, they’re going to write it down and stick a post-it to their monitor, or add it to “passwords.doc” on their desktop. Such a situation decreases, not increases, password security.
  • Humans will strive to get around rules. If they want to use weak passwords, they will. For example I knew someone on a system which didn’t allow the last 5 passwords to be used again. So he simply changed his passwords to “a”, “b”, “c” etc. until it had forgotten the last 5 passwords so he could use the one he wanted.

So here are my recommendations:

  • Allow the user the option to see their password: check a checkbox to see “abc” not “***”, like on mobile phones. Usability is increased for the (normal) case that nobody is standing behind the user looking at the screen.
  • Consider upper-case and lower-case to be the same. (Normalize passwords to lower-case when hashing them and comparing them)
  • Consider the letter O and the digit 0 etc to be the same. (Replace all letter Os with digit 0s, etc. before hashing them and comparing them)
  • No maximum length
  • No forbidden characters. If they want to use “!” then they should be able to do so, as it might be a character in their favorite password, and we don’t want the post-it note situation. But don’t force them, for the same reason.
  • Education is more important than rules (as rules will be got around). Point out what will happen once their account gets compromised. Point out how longer passwords are better, pet’s names aren’t good, etc.
  • Generating a password? Generate a long one (16-20 chars), upper-case only (easier to read in case you write it down), no special characters (problems with keyboard layouts)
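The normalization recommendations above might be sketched like this, applied before hashing on both signup and login; exactly which confusable characters to merge is a design choice, and this class is hypothetical:

```java
import java.util.Locale;

public class PasswordNormalizer {
    static String normalize(String password) {
        return password
            .toLowerCase(Locale.ROOT) // case-insensitive: Caps Lock is harmless
            .replace('0', 'o')        // digit zero vs letter O
            .replace('1', 'l');       // digit one vs letter l
    }
}
```

Because the same normalization runs on both sides, two entries that differ only in case or in a confusable character hash to the same value.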

I’m not sure about minimum length. If there are no rules about mandatory characters, and the system is case insensitive, then without a minimum length users could simply type in “a”. But a minimum length might prevent a user from using a password they already know. And given users won’t accept rules, if one enforces a minimum length then users will just use “aaaaaaaaaa” or “12341234…”. I think, again, education is the key here.

This is quite different to how most websites and most of our industry works. What do you think? Are there resources on this topic I’ve missed?

Clearly another option is not to maintain passwords at all but e.g. use an SSO system like OAuth with a provider who does maintain passwords, e.g. Facebook. But then one is giving a certain amount of power to those providers, and perhaps one doesn’t want to do that.

Yet another alternative, given there’ll be some “reset password via email” link, is just to say, ownership of the email address is sufficient to indicate authorization. I mean that certainly is the case if you provide such a link; so why not only provide such a link? Every time the user logs on, allow them to type in their email address and click a button, a one-time link gets delivered to their inbox.

Off-topic: For password hashing, use bcrypt, which can be made to deliberately take a long time, to thwart brute-force attacks against your hashed passwords in case of a database leak.

How many days until the end of the month?

Wednesday, July 27th, 2011

It’s a common enough problem: you want to find out if a date is in a particular month, or you want to know how many days are left in the current month, and so on.

I’ve seen the following awkward solution plenty of times at different companies: To find out if a day is within a month, you see if the date in question is greater than the start of the month and less than the end of the month; this seems logical enough. To determine the number of days left, you subtract the date in question from the last date in the month; this also seems reasonable enough.

The trouble is it’s not particularly easy to find out the last date in a given month. I’ve seen at least the following solutions:

  • Using database functions to find the last day of the month
  • Having an array of 12 elements, maintaining the number of days in each month of the year
  • Simply using the date “YYYY-MM-31”.

Using database functions is the best, for sure, as they’ll do the calculations right, but it’s still a certain amount of work to connect to the database just for this purpose. Obviously there are libraries such as Java’s Calendar which will do the same work, but again, it’s a certain amount of effort. The array isn’t good, as you have to introduce further logic to handle leap years. The last approach is obviously ridiculous, as there are plenty of months for which such a date is invalid. (MySQL 4 allows such dates and MySQL 5 doesn’t, which, as part of a database migration project, is what brought this point to my attention.)

However, there is a much simpler solution: compare the date with the first day of the next month. A date is in a given month if it is on or after the first of that month and strictly before the first of the next month; the number of days left in a month is the number of days between the date and the first of the next month. Finding the next month is easy (there are always 12 months in the year).
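With java.time the trick is a one-liner; in this sketch (method names are mine) both operations are derived from the first of the next month, with no month-length table and no leap-year logic:

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class MonthBoundary {
    static LocalDate firstOfNextMonth(LocalDate date) {
        return date.withDayOfMonth(1).plusMonths(1);
    }

    // date is in the month iff first-of-month <= date < first-of-next-month
    static boolean inMonth(LocalDate date, int year, int month) {
        LocalDate first = LocalDate.of(year, month, 1);
        return !date.isBefore(first) && date.isBefore(first.plusMonths(1));
    }

    // days from date (inclusive) to the end of its month
    static long daysLeftInMonth(LocalDate date) {
        return ChronoUnit.DAYS.between(date, firstOfNextMonth(date));
    }
}
```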

Atomic operations over filesystem and database

Sunday, November 21st, 2010

I had the situation recently where I needed a “button”; when the user clicks the button, then:

  • An “svn up” should occur, to update some XML config files + data files (CSV)
  • If the config files or CSV files have errors, the operation should be aborted and an error printed
  • If correct, the CSV data should be imported into the database ready for processing
  • If correct, the config files should be made available for use

The config files are a bunch of XML files; each file has a complex structure, and they themselves are in a complex directory structure. One option would be to put them into a database as well but I felt that a simpler solution was just to leave them on the disk where they could be parsed. Directories, files and XML files more naturally fit the hierarchical nature of the configuration in question. I didn’t want to have multiple representations of this configuration (one being directories/XML; the other being a database).

This is all OK, but how to make all of that (config files update + database work) atomic? i.e.

  • If anything goes wrong, the entire operation fails and rolls back with an error message
  • If it succeeds then the entire operation succeeds
  • If the server crashes, then an update was either successful or rolled back (but nothing in between)

I came up with a nice solution:

  • A database transaction is started
  • The “svn up” process programmatically connects to the Subversion server and does a brand new checkout into a directory with a random name (i.e. actually “svn co” not “svn up”).
  • The XML files are parsed and checked; if there is an error, the operation is aborted
  • The CSV files are loaded into the database within the transaction, if there is an error, the operation is aborted (& rolled back)
  • The new random directory name is written into a table “current directory name”, also in the transaction
  • The transaction is committed
  • Each time access is needed to the XML files, a lookup is first done in the table to retrieve the directory name, the files are then read from this directory.

This has the consequence that:

  • Either the “commit” occurs and the new XML files are “live” (table points to new directory name) and the CSV data has been imported
  • Or a “rollback” (or crash and auto-rollback) occurs in which case the CSV import is rolled back, and the table points to the old version of the XML files which are still on the disk
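The switch-over can be sketched as follows; a temp directory and an in-memory pointer stand in for Subversion and the database table here, so this only illustrates the ordering of the steps, not the real transaction:

```java
import java.nio.file.*;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicReference;

public class AtomicPublish {
    // Stands in for the "current directory name" table row.
    static final AtomicReference<Path> current = new AtomicReference<>();

    static void publish(Path workArea, String configContent) throws Exception {
        // 1. Fresh checkout into a randomly named directory
        Path dir = Files.createDirectory(workArea.resolve("co-" + UUID.randomUUID()));
        Files.writeString(dir.resolve("config.xml"), configContent);
        // 2. Validate; abort (leaving the old pointer intact) on error
        if (!configContent.startsWith("<")) throw new IllegalArgumentException("not XML");
        // 3. Only now flip the pointer -- readers never see a half-written dir
        current.set(dir);
    }

    static String readCurrentConfig() throws Exception {
        return Files.readString(current.get().resolve("config.xml"));
    }
}
```

A failed publish leaves behind only an orphaned directory (which the clean-up script removes later); readers keep resolving the pointer to the last good version throughout.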

I’m pleased with this solution :-)

One should really also write a clean-up script, deleting directories which (a) aren’t in the database and (b) were created a certain amount of time ago e.g. a few days. That way old directories, or directories checked out which had errors or failed, are removed, but directories which are in the process of being checked out are not removed.

P.S. Allowing the user to submit new versions of the data by checking it into Subversion is a nice idea:

  • It’s a bunch of structured data (bunch of XML and CSV files in a particular directory format)
  • This is easy to visualize with Windows Explorer and TortoiseSVN
  • This would not be easy or nice to do with multiple <input type=file> file upload buttons
  • The “svn co” command in the software is easy; there are various libraries for programmatic access to SVN. I used svnkit: I found it well documented and seemingly bug-free; it just worked first time.
  • Obligatory link about why I didn’t use Git ;-) (the person involved can use TortoiseSVN but wouldn’t have been able to handle git)

Update: User-interface for this algorithm described here: