Archive for the ‘Software Design’ Category

The Internet: It’ll get slower before it gets faster

Thursday, January 31st, 2008

For the 3 weeks I was in the UK recently I used a UMTS modem (i.e. like a 3G phone) to surf the web and do all my work. My friend Robin, who also works in IT, does all his surfing at home through a cable from his phone to his computer: i.e. also UMTS.

At least in the UK, this is extremely popular. It also makes a lot of sense in Asia: they have excellent high-speed mobile phone networks there, and as for all one’s preconceptions about Asians having the latest handsets, I can confirm first-hand that they’re all true.

As we all know and have been experiencing since about 2000, more and more phones are going to get more powerful and have larger screens. Full browsers will (and do) run on them. They will also be UMTS devices.

And for those people who don’t surf via UMTS, nearly everyone I know surfs at home using WLAN. A lot of offices use WLAN too. And obviously all the surfing at airports, coffee houses, hotels, conferences etc. all goes on via WLAN.

UMTS and WLAN have high bandwidth, but they have extremely high latency compared with a cable connection. That means that although the bytes flow fast once they’ve started, it takes a long time for the first byte to arrive.

I am quite proud of the fact that when I designed the “Uboot Joe” software (Windows software which ran on the user’s PC, sat on the notification area by the clock, and communicated with Uboot) I took this into account. Every action you do with the Joe is at most one client-server round trip. For example to view all the thumbnails in a folder, there is a single request from Joe to the server like “get all data in folder_id” and the return structure is a) information about the folder, b) information on all the photos within the folder, and c) all the binary JPEG data for the thumbnails of all those images. You can try using the Uboot Joe on a UMTS link, and it works faster than any website.
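To make the "one round trip per action" idea concrete, here is a minimal Java sketch of what such a combined reply could look like. The class and field names (FolderReply, ThumbnailEntry) are assumptions for illustration only; the actual Joe protocol is not shown in this post.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical thumbnail entry: photo information plus the JPEG bytes themselves. */
final class ThumbnailEntry {
    final String photoName;
    final byte[] jpegBytes;

    ThumbnailEntry(String photoName, byte[] jpegBytes) {
        this.photoName = photoName;
        this.jpegBytes = jpegBytes;
    }
}

/** Hypothetical reply to a single "get all data in folder_id" request. */
final class FolderReply {
    final String folderName;               // a) information about the folder
    final List<ThumbnailEntry> thumbnails; // b) photo information and c) binary thumbnail data

    FolderReply(String folderName, List<ThumbnailEntry> thumbnails) {
        this.folderName = folderName;
        this.thumbnails = new ArrayList<>(thumbnails);
    }

    /** Total thumbnail payload carried by this one reply, in bytes. */
    int totalThumbnailBytes() {
        int total = 0;
        for (ThumbnailEntry entry : thumbnails) {
            total += entry.jpegBytes.length;
        }
        return total;
    }
}
```

The point of the shape is that the client never needs a second request to render the folder: everything it will display arrives in the same reply.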

Contrast this design with HTML. The first response from the server contains <img src=xx> tags and only once that has been received can the browser make the further requests necessary to retrieve the images. If the first bytes of every response take a long time to arrive, then the user experiences that “long time” twice before they see the data they requested; first to get the HTML page then second to get the images.

In fact it’s worse. If a page has 50 embedded images, it doesn’t open up 50 concurrent connections to the server (for good reason). Instead it opens e.g. 4 connections. Which means that e.g. image number 5 has to wait for the “long time” of fetching image 1 to complete. (Some sites try to get around this by having lots of servers with different names e.g. img341.domain.com and distributing the images over these servers.)

And it’s even worse than that. Even if the application only does one round-trip to the server, the underlying protocols might do more round-trips, for example firstly to contact the DNS server to get the IP address for the domain name used in the URL; and then secondly to request the data from the server.

In addition to this being a problem with UMTS and WLAN, one also has to take into account that the Internet is global. When I’m in Macau accessing European servers I get a round trip of about 300ms. So if one adds three “long times” to an otherwise extremely fast request—easily done—one has added a whole second on to the time the user has to wait. And Jakob Nielsen says that after 1 second in total, users start to lose focus on what they’re doing.
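The arithmetic can be sketched in a few lines of Java. This is a deliberately crude model, and its assumptions are mine: images are fetched in lock-step batches, one batch per set of parallel connections, and transfer time and DNS lookups are ignored. The class and method names are made up for the example.

```java
final class LatencyEstimate {
    /** Serial waits for a page: one for the HTML, then one per batch of images. */
    static int serialRoundTrips(int imageCount, int parallelConnections) {
        // Ceiling division: 50 images over 4 connections need 13 batches.
        int imageBatches = (imageCount + parallelConnections - 1) / parallelConnections;
        return 1 + imageBatches;
    }

    /** Time spent purely waiting for first bytes, ignoring transfer time. */
    static long waitMillis(int serialRoundTrips, long roundTripMillis) {
        return serialRoundTrips * roundTripMillis;
    }
}
```

Under this model, three serial round trips at 300ms cost 900ms of pure waiting, and a page with 50 images over 4 connections needs 14 serial waits, or 4200ms at the same round-trip time.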

So to design applications in this age, one needs to be aware of the number of serial server round-trips (i.e. the number of times you need to ask the server for something, and only once it’s been delivered can you ask the server for something else).

For example:

  1. An HTML page which contains an external CSS file, and this CSS file contains URLs to images.
  2. Pages with many images. The browser only requests a few files from the same server at once, so again the response to image number 1 must be finished before the request for image number 3 can begin.
  3. Javascript software which does multiple serial calls to the server, e.g. “get session token for username/password” then “get info to display on page for session token”.
  4. A form which submits data to a piece of software. The software does something but instead of returning a result page, returns a redirection command to a “real” result page. Often done to allow one to hit “refresh” safely on the result page, or make the URL of the result page look nicer.

GWT is excellent in this regard. It has the ability to download lots of those small icon-size images in one request (it makes one big image on the server and chops them up again on the client) and it makes you explicitly aware of the number of server round trips by forcing you to define interfaces for client-server interactions – as opposed to some automated scheme where you write code and the framework decides when to insert client-server round-trips. (Wicket makes client-server round trips easy with AjaxLink; my fear is it might be too easy, and one might do them too often, and lose the overview of how many are happening).

Pre-caching is a good idea too. E.g. if you are a photo viewer application, with a photo shown full screen with a “next” button, it makes sense to load the image on the “next” page even before the user’s clicked on it. That download won’t interfere with the rest of the activity the user is doing, as the bandwidth is not the bottleneck, just the time between starting the download and the bytes starting to arrive at the client. (Although one can’t download too much without the user noticing, as some people pay per MB!)
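A minimal sketch of such pre-caching in Java, assuming a hypothetical loader function that fetches a photo's bytes by index (standing in for the real network call). It keeps only the current and the next photo, which also limits how much is downloaded behind the user's back.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class PhotoPrefetcher {
    private final Function<Integer, byte[]> loader; // hypothetical: fetches photo bytes by index
    private final Map<Integer, CompletableFuture<byte[]>> inFlight = new ConcurrentHashMap<>();

    PhotoPrefetcher(Function<Integer, byte[]> loader) {
        this.loader = loader;
    }

    /** Returns the requested photo and starts fetching the next one in the background. */
    byte[] show(int index) {
        byte[] current = fetch(index).join(); // already downloaded (or downloading) if prefetched
        fetch(index + 1);                     // warm the cache for the "next" button
        inFlight.remove(index - 1);           // drop the photo the user has moved past
        return current;
    }

    /** Starts a download for the index unless one is already cached or in flight. */
    private CompletableFuture<byte[]> fetch(int index) {
        return inFlight.computeIfAbsent(index,
                i -> CompletableFuture.supplyAsync(() -> loader.apply(i)));
    }
}
```

The sketch does not know where the album ends, so a real loader would have to cope with being asked for an index past the last photo.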

But the most important point, I think, is: these days, one must test one’s web applications on a high-latency connection. Historically I have tended to develop either locally (everything installed on my laptop) or in an office with a network cable, high-speed Internet and a link to the data center where the test server sits, with the office in the same country as the server. Maybe this sounds strange, but I think one should develop web applications while using a UMTS card.

Web software front-end test cases

Wednesday, January 16th, 2008

Recently I was working on a website which was developed in PHP without a web framework. A lot of things were programmed manually which would normally be taken care of by a web framework (for example things like: in case of an error, the HTML fields on the form are re-populated on the response page).

So I came up with an extensive set of test cases, and made sure I used them all on every field on every page.

These are the tests, in no particular order.

  1. On a form with radio buttons and text fields next to the radio buttons, clicking on the radio button should position the text cursor in the text field (as typing there is the next thing you’re always going to want to do).
  2. Clicking on the text field by a radio button should select the radio button.
  3. Checkboxes and radio buttons which have text beside them must have the text in a <label> so that clicking the text selects the checkbox or radio button.
  4. On a <form>, pressing return must do the same as clicking “OK”.
  5. Pressing the TAB key on the keyboard should progress from one field to the next in a reasonable order.
  6. If you go to a page with a form, is the text cursor already in the first field? Or do you have to reach for the mouse and click on the first field in order to use the form on the page?
  7. Every action where something is done (e.g. delete an FTP account) should have a confirmation text on the result page like “ftp account XYZ deleted”. I think it’s important to include the name of the object being acted upon in the message.
  8. All forms should use “post” and “get” appropriately. I mean various people have various strong views about when to use one and when to use the other. But for me the difference is the browser’s “are you sure you want to repost?” message when you click refresh. Do you want it? If so use “post”, otherwise use “get”. Also bookmarkability.
  9. Is there a reasonably small amount of HTML generated? E.g. in uboot to display the address book page requires about 1MB of HTML to be downloaded to the browser (not including any external CSS, graphics etc.) That’s too much.
  10. The back button should work everywhere. (E.g. Uboot multimedia gallery: click on a picture, click back, you’re at page 1 of the gallery rather than the page you were on.)
  11. Strings too long? Then “…” should be displayed, to avoid breaking the layout.
  12. If breadcrumbs are used, (i.e. main page > hosting > ftp accounts > ftp account ‘x’) then they must work. Ideally they should contain data such as “ftp account ‘x’”.
  13. While loading a page on a slow connection, is the text readable before the background image loads? E.g. white text on a black background image may not be readable before the image loads. There should be an alternative flat-colour background of a similar colour. (Thanks Helge for pointing this out a few years ago!)
  14. While loading the page, does the page move around a lot due to width=x height=x attributes missing on an <img> tag? It’s annoying when you start to read the text on the page and then it moves a few seconds later just because some logo has finally loaded.
  15. Every time there is an error in the form and the page is reloaded with the error: is the original data still in the form?
  16. In all places where the user may enter free text: Do weird non-Latin1 characters work correctly, both entering them and displaying them? Do ” ‘ work correctly? Are < > & displayed correctly? Is all of this written to the database correctly? (Sometimes the browser sends &#123; style code to the server. If this isn’t escaped on the display code then the character may appear to have been processed correctly. But you don’t want &#123; strings in your database data.)
  17. Type in long values into text fields. Values should be neither truncated nor should an internal error be produced (default behaviour if you’re using MySQL or Oracle respectively)
  18. If there are any id=nn type parameters in the URL, then adding or subtracting 1 from the URL should not allow you to view other people’s content.
  19. Is the <title> tag set usefully? This is necessary when you use the down-arrow by the “back” button on the browser, to determine how far back you want to go. A whole lot of options such as “MyWebsite”, “MyWebsite”, “MyWebsite” is not very helpful.
  20. Test in a high-latency environment. Such as over a UMTS connection. If there are a lot of redirects, or images referenced from CSS referenced from HTML, or Javascript making AJAX calls, processing the result then making more calls, then the page will be slow, but work fine over a LAN connection.
  21. Test in an unreliable environment, e.g. where packets get lost. Google Spreadsheets, when you type in a value, lets you continue editing the page while it’s sending the value to the server. But if there’s an error or timeout with the sending you see an error, and the cell is reverted to its previous value. You simply have to type in the same data all over again. Instead of that it should remember your data, and display a warning “can’t connect to server right now; retrying…”
  22. If you type in URLs in capitals or mixed case: do they still work? (Thanks again Helge!)
  23. AJAX progress indicators: Is it the case that you’ve written an AJAX site, and when the user clicks an action, absolutely no visual feedback is given to the user that he’s clicked? You need to have some kind of feedback, e.g. the “loading…” of gmail.
  24. Are the buttons large enough to be easily clicked on? E.g. confirmation page with “Yes”, “No” options displayed as links in a tiny font. They’re hard to click!
  25. Do the colours work even if you view your laptop at a weird angle? A site I was using recently used a white background (what a surprise..) and to highlight the tool you had selected used a light-grey background. Worked fine on their monitors I’m sure, but at the airport with your laptop on your lap, they’re difficult to see.
  26. Does the site work, or at least fail gracefully on old browsers? An error message immediately is preferable to allowing the user to type in lots of text then lose it when the user presses “OK”, due to some browser incompatibility issue (e.g. MediaWiki on Safari 3 beta for Windows)
  27. Does the site look good on both LCD monitors and conventional monitors? Some colour combinations (e.g. dark green on a light green background) are perfectly readable and look nice on LCDs, but are completely unreadable on conventional monitors.
  28. What about small screens? My parents’ computer uses 800×600 and I use my Laptop in 1064×600 normally. In Google Reader I can hardly see any feeds at that size. The whole screen is taken up with toolbars, menus, etc.
  29. Is the session timeout compatible with the company’s policy? E.g. do you really need the user to log in again after 10 minutes, i.e. when he just had to nip off for a meeting?
  30. If the session expires, what happens to the user’s data? Composing a long email and clicking “send” only to receive the response page “please log in again!”, and losing the email, is wrong.
  31. If there is a possibility to log in on every screen (e.g. “logged out” at the top-right of the screen, or like on uboot), then logging in should take you back to the screen where you were. Because that’s what the user would want.
  32. If you are logged out and go to a particular screen e.g. via URL or bookmark sent while a user was logged in, do you get useful information? A redirect to the homepage is also OK, but “general error” is not.
  33. If a browser window is open, and a user is logged in, and from another browser (or directly in the database) that user’s password is changed, the user deleted, or disabled, is the first browser immediately logged out?
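Two of the checks above (test 11, truncating long strings with “…”, and test 16, displaying < > & and quotes correctly) come down to small display helpers. The site in question was written in PHP; the following is only a Java sketch of the same idea, and the class and method names are made up for illustration.

```java
final class DisplayText {
    /** Shortens text to at most maxChars characters, ending in an ellipsis when cut (test 11). */
    static String ellipsize(String text, int maxChars) {
        if (maxChars < 1) {
            throw new IllegalArgumentException("maxChars must be at least 1");
        }
        if (text.length() <= maxChars) {
            return text;
        }
        // Counts UTF-16 code units, so a character outside the BMP could be split here.
        return text.substring(0, maxChars - 1) + "\u2026";
    }

    /** Escapes the characters that are special in HTML text and attribute values (test 16). */
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

Escaping belongs in the display code only; the database should hold the raw text, which is why stored strings full of character references are a warning sign.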

Unit testing and configuration files

Sunday, September 9th, 2007

I used to think of a function as something which would convert some input value into some output value (potentially with some side-effects). And thus unit testing a function would involve passing particular inputs into the function and checking that the results were as expected (potentially setting up some database rows or something to test that the side-effects were executed properly).

But sometimes a function relies on a particular piece of global configuration. That’s an input to the function too. For example the tax rate.

public int calculateVat(int cents) {
    double vat = config.getDouble("vatRate");
    return (int) Math.round(cents * vat);
}

Initially I would just test the function with the current settings of the config file.

// VAT is currently 20% in Austria
assertEquals(20, obj.calculateVat(100));

However that’s obviously not a great solution as that will break when the config file changes. And after all, configuration files are there to extract the things that likely will change from the otherwise often very long but hopefully reasonably static domain logic.

So the solution I use now is to extend such configuration accessing classes with methods such as “setValueForTesting”. The “forTesting” part of the name indicates clearly its purpose is for test programs only.

config.setDoubleForTesting("vatRate", 0.2);
assertEquals(20, obj.calculateVat(100));

That code feels much better. There are actually two advantages:

  1. Obviously the test code will not break if the config file changes.
  2. But also there is more locality. Everything you need to understand about that test is there in the test program’s source file, in two easy-to-read lines.
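For illustration, a configuration class with such a method might look roughly like this. It is a sketch under assumptions: the Config and VatCalculator class names, the Properties-backed storage and the clearTestOverrides method are mine; only getDouble, setDoubleForTesting and calculateVat come from the snippets above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

final class Config {
    private final Properties fileValues;                              // values loaded from the config file
    private final Map<String, Double> testOverrides = new HashMap<>(); // values injected by tests

    Config(Properties fileValues) {
        this.fileValues = fileValues;
    }

    /** Returns the test override if one is set, otherwise the value from the file. */
    double getDouble(String key) {
        Double override = testOverrides.get(key);
        if (override != null) {
            return override;
        }
        String raw = fileValues.getProperty(key);
        if (raw == null) {
            throw new IllegalArgumentException("Missing config key: " + key);
        }
        return Double.parseDouble(raw);
    }

    /** For test programs only: overrides a value without touching the file. */
    void setDoubleForTesting(String key, double value) {
        testOverrides.put(key, value);
    }

    /** For test programs only: removes all overrides again. */
    void clearTestOverrides() {
        testOverrides.clear();
    }
}

final class VatCalculator {
    private final Config config;

    VatCalculator(Config config) {
        this.config = config;
    }

    int calculateVat(int cents) {
        return (int) Math.round(cents * config.getDouble("vatRate"));
    }
}
```

Keeping the overrides in a separate map means the file values are never modified, so a test that forgets to clean up cannot corrupt them.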

But this approach feels somewhat unorthodox. How do other people do it?

3-dimensional photo organization

Monday, September 3rd, 2007

I have just viewed some photos on Facebook. They were of a friend's trip to Malaysia.

  1. Facebook has a limit of 60 photos per album; meaning you have to split photos up into albums with names like "Malaysia 1", "Malaysia 2" etc if you want to upload more than 60 photos in total.
  2. Each album, as is current practice in web design, is divided into pages with "page next" buttons to get to the next page.
  3. Each page of each album, as was introduced with windowing systems, has a scroll bar (vertical only, unless one makes the window really small)

OK now fundamentally a set of photos from a holiday is one-dimensional. I can think of many ways to lay out photos but I'm sure these three dimensions would not be the dimensions I would choose.

The scroll bar is quite a good device. It was well thought through. It was specifically developed to solve the problem of "you have more data than can fit on the screen". You can move slowly up or down using the arrows at the end which are deliberately easy to understand even for novices unfamiliar with windowing systems. You can see how far down the available data you are. You can drag the bar with your hand/mouse to move either fast or slow in a natural motion.

I have heard that some web novices find "next page" easier to use than the scroll bar. But this wouldn't be the case if there were no "next page" links. And knowing how to use scroll bars is non-optional, if you want to use any system other than photo browsing websites. For example when using the compose interface of an email website, there is no "next page" button once you've typed text equal in length to the size of the window the user interface designers assume you are using.

Scroll bars are so much better than "next page" links, and even if they weren't, displaying 1-dimensional data using 1 data navigation tool is better than displaying 1-dimensional data using 3 different navigation tools.

Email Boxes need to be stored in DB, but also call IMAP, APIs, etc.

Wednesday, June 6th, 2007

I find myself often modelling the situation that there are rows in the database (e.g. “email boxes” for a user), and these rows represent things that exist elsewhere as well (e.g. IMAP accounts to back up these email boxes). There can be multiple ways of accessing these external resources, e.g. to delete an email box one deletes files on some server, while to find out how much space is used there is an HTTP-based protocol. And in the case of creation and deletion (and changing of password) these operations should not be done synchronously from the web interface, but are queued. This is not a contrived example, I am programming exactly this right now. All of the above are givens.

To not just stuff all the various API clients and other functionality into one huge class, there needs to be different objects representing:

  • The “email box” row in the database (and a persistence mechanism)
  • A “filesystem” object to represent operations on the filesystem such as “delete email box”. This object knows the directory layout used. This object can be shared between other objects which need to perform filesystem operations, such as a filestore accessible via FTP accounts (in this case). It’s convenient to program all these filesystem operations in one object.
  • A client for the HTTP-based protocol, to find out the box’s used size. In this case the protocol can do other functions, such as finding the space used in the FTP filestore. Again, it’s convenient to put all these operations in one class: one can create private methods to connect to the server, or for common API requirements such as response parsing which will be the same for all the commands, etc.
  • Persistable Queue objects, and QueueProcessor objects representing the programs or tasks to change the password, create/delete the boxes, etc.
  • Some Facade object to simplify access to all the above?

Once one has come up with these objects in the system, there are a number of possibilities for how to combine them. E.g.

  • When one asks the HTTP protocol client object to find out the space used for a box, should one pass the parameter (of which box) as a Box object, or the name of the box and password as a String?
  • Should an application program (e.g. web interface) instantiate and use the HTTP protocol client object directly, to find the space used? Or should it call a method on the Box object, which calls the HTTP protocol object? Should both possibilities be available?

On the one hand, to simplify all objects, it would make sense to have the application program talk to the HTTP protocol object, and not to have this code in the Box object at all. And to always pass a Box object, as this encourages strict typing.

However, I have found time and time again that the following solution works best:

  • Not have multiple ways of performing the same action.
  • Have a main “Box” object, which acts as a Facade. This represents a particular box. (i.e. not a BoxService stateless facade object, which each time takes a BoxId as a parameter to every function.)
  • Optionally have other objects to delegate to, concerning persistence of the box and its attributes to the database (although I prefer not)
  • A Box object knows the life cycle of a Box, and knows when to write things to queues etc. This will also need to be exposed in its interface (e.g. addCreationRequestToQueue) and explained in the class Javadoc. If this lifecycle changes (e.g. queue introduced for a certain operation) the interface will change and clients will have to be updated. But that’s OK, as probably there will be a requirement in the front-end to display “performing…” as long as the operation is in the queue. So lots will have to change if you change the life cycle.
  • This object also knows how to perform the operations which are normally queued, e.g. “delete”, in terms of simply calling the “filesystem” object. It may also need to update some internal flags to note that the filesystem no longer exists. These methods are normally only called from QueueProcessor objects, but are also handy to call from JUnit test scripts (e.g. in case of “create”), to put the system in some state that is necessary for further tests. The QueueProcessor does not do much, apart from just call the methods on the Box to perform the operation.
  • Applications call Box for all its requests and never call Filesystem. That way if the implementation changes (no longer direct “rm” but now over the HTTP API) the application does not need to change (note that such changes are ones which do not affect the life cycle of the Box, or introduce extra states such as “in queue but not done yet”). But more importantly I just think it’s a lot more readable to say “Box b = getBox(); b.getUsedSizeBytes(); b.deleteFromFilesystem()”.
  • The individual objects such as the “filesystem” object take Strings not Boxes as parameters. This makes those classes marginally simpler. More importantly one doesn’t feel right when there’s a two-way dependency, i.e. Box needs Filesystem (to call it to implement “delete” methods) and Filesystem needs Box (in its method signatures). And the only place that the Filesystem is going to be called is from Box instance methods, and the Box has all the information such as username, password, and any other information, within its instance variables.
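The points above can be sketched in Java as follows. Only getUsedSizeBytes and deleteFromFilesystem appear in the text; the interface names, addDeletionRequestToQueue (modelled on the addCreationRequestToQueue mentioned above) and the use of a plain Queue of box names are assumptions for the example, and persistence is left out entirely.

```java
import java.util.Queue;

/** Hypothetical filesystem helper: takes Strings and knows nothing about Box. */
interface Filesystem {
    void deleteEmailBox(String boxName);
}

/** Hypothetical client for the HTTP-based protocol. */
interface UsageClient {
    long usedSizeBytes(String boxName, String password);
}

/** Facade for one particular box; applications talk only to this class. */
final class Box {
    private final String name;
    private final String password;
    private final Filesystem filesystem;
    private final UsageClient usageClient;
    private final Queue<String> deletionQueue; // stands in for a persistable queue
    private boolean existsOnFilesystem = true;

    Box(String name, String password, Filesystem filesystem,
        UsageClient usageClient, Queue<String> deletionQueue) {
        this.name = name;
        this.password = password;
        this.filesystem = filesystem;
        this.usageClient = usageClient;
        this.deletionQueue = deletionQueue;
    }

    /** Delegates to the HTTP client, passing plain Strings from its own fields. */
    long getUsedSizeBytes() {
        return usageClient.usedSizeBytes(name, password);
    }

    /** Called from the web interface: records the request, does no work yet. */
    void addDeletionRequestToQueue() {
        deletionQueue.add(name);
    }

    /** Called from a QueueProcessor (or a test) to perform the actual deletion. */
    void deleteFromFilesystem() {
        filesystem.deleteEmailBox(name);
        existsOnFilesystem = false;
    }

    boolean existsOnFilesystem() {
        return existsOnFilesystem;
    }
}
```

Note that the dependency runs one way only: Box knows about Filesystem and UsageClient, and neither of them mentions Box in its signatures.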

When is a software project done?

Tuesday, May 29th, 2007

A software project is defined, for the purposes of this blog entry, as a set of people working to produce a new software system, or to modify an existing software system.

The result (exit condition) of a software project is a set of artifacts and other assertions:

  1. Document (or wiki etc) describing what the software should do, i.e. requirements. This will include subtle details, about what the system does, that will not immediately be obvious by looking at the front-end, or reading software design documentation. This should be a complete description, which is useful for the future, not just a “delta” from the last version.
  2. Software architecture documentation, in words. Simply looking at 1,000 Javadocs will not enable a new team member to understand the system. Documentation should also include which other options were evaluated and not chosen, and why, to avoid future teams considering the same things.
  3. (Obviously) the source code for the software. Including the front-end, back-end, any HTML, etc.
  4. Unit test scripts for all back-end classes needing them.
  5. Front-end tests. Either a document (simple statements such as “Click on Submit without enough money on account. See error message”), or configuration of a front-end testing program.
  6. Performance tests done and the software to perform them, if appropriate.
  7. Configuration (or creation) of a monitoring system to monitor the system once it’s live, if it’s a service (e.g. web site).
  8. Administration system for customer care, if it’s a service.
  9. Management reporting. Especially just after a system goes live, management are always very curious about key statistics, such as number of users, number of items sold etc. That needs to be analyzed in advance and the system in place when the system goes live.
  10. Class diagrams
  11. Javadoc to describe the purpose of individual classes and methods (where this is not obvious from the names). For scripting languages: parameter and return types (as this cannot be deduced from the source code).
  12. If this is not the first version of a system, migration concept including scripts to install software, migrate schema, filesystems containing user data, and anything else.
  13. System uses appropriately international character set such as UTF-8. (This is not particularly modern, the WinNT team decided to do this in 1988.) Java does this out of the box, but it’s more than just the programming language. This includes any database, any data stored in flat-files, any APIs (within the system or to/from external systems), and so on.
  14. All of the above under version control
  15. Not only the software installed on a live system, but also the existence of test and staging systems. If one uses the live systems for testing, then, once one’s gone live, one has no way to fix bugs in a testing environment. And bugs will happen, and they need to be fixed fast, so one had better have thought of this in advance.
  16. Bug tracking, or wiki system, or some way that the team is trained and rehearsed in using, to track and assign errors as they occur.
  17. Understood and tested data backup and recovery process. (What happens if the live DB crashes? Better have thought about recovery before that happens.)
  18. The team must sleep e.g. 2 days before a release. After a release (bug fixing) is the most stressful time of a project and where the team must be at its most alert (as fixing is time-critical). It’s important to sleep beforehand, and not e.g. work 7 days a week then in the evening finally release, then go to bed. (You can be certain that 1 hour after you’ve gone to sleep the site will be offline due to some problems, and you weren’t there to fix them.)

gettext is so broken

Friday, May 18th, 2007

Working on a PHP project recently, there was the requirement for text localization. The standard way to do this in PHP is to use the standard way to do this in C, which is gettext.

I’ve worked with various translation systems, including one I built myself for uboot, involving a hierarchy of languages going from most specific to most “international”, and with each string having a hierarchical id such as “myprogram.errors.disk-full”.

Java Properties files are simple but also work well (simplicity being a positive thing in this case). The lines are key-value pairs, and using a convention such as “myprogram.errors.disk-full” the key is almost as good as if it actually were a key hierarchy. The file is in Latin1 but Unicode characters can be used via an escape syntax, and there are many editors where one can just type Unicode text and which take care of the escaping.

So I was looking forward to using gettext. This format was created by GNU, the creators of GCC (a highly respected program). gettext is itself well respected and authors of systems such as PHP have chosen it as their localization system.

But alas, it is broken in so many ways.

(1) The file format. Whereas Java’s file format is to have lines such as “key=value”, gettext’s “.po” format (where did that extension come from?) has two lines for every string, like

msgid "key"
msgstr "value"

As one inevitably places a blank line between one key-value pair and the next, the file is immediately 3 times as long as a Java properties file storing the same information. And what if you want to have double-quotes within your string?

(2) Compilation (for performance reasons). I work with scripting languages, where there is no compiler. This can be a good or a bad thing; but independent of that, it is a fact. However the editable “.po” files of gettext have to be converted into binary “.mo” files before they work. Thus I have to introduce a compilation step into my otherwise compilation-free edit-and-that’s-it test environment.

In fact I don’t understand this compilation requirement at all. According to the gettext manual, gettext was developed in 1994. Surely computers were fast enough back then to parse the gettext format, store the whole lot in a hash?
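To show how little work that parse-into-a-hash step is, here is a rough Java sketch. It is not gettext's own parser and the class name is made up; it assumes the simple case only: one-line msgid/msgstr pairs in double quotes, with backslash escapes for quotes and newlines. Multi-line strings, plural forms and contexts are ignored.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SimplePoParser {
    /** Reads one-line msgid/msgstr pairs into a map; everything else is skipped. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> translations = new HashMap<>();
        String pendingId = null;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.startsWith("msgid ")) {
                pendingId = unquote(trimmed.substring("msgid ".length()));
            } else if (trimmed.startsWith("msgstr ") && pendingId != null) {
                translations.put(pendingId, unquote(trimmed.substring("msgstr ".length())));
                pendingId = null;
            }
            // Comments ("#..."), blank lines and unsupported lines fall through untouched.
        }
        return translations;
    }

    /** Strips the surrounding double quotes and resolves backslash escapes. */
    private static String unquote(String quoted) {
        String s = quoted.trim();
        if (s.length() < 2 || s.charAt(0) != '"' || s.charAt(s.length() - 1) != '"') {
            throw new IllegalArgumentException("Expected a double-quoted string: " + quoted);
        }
        String body = s.substring(1, s.length() - 1);
        StringBuilder out = new StringBuilder(body.length());
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '\\' && i + 1 < body.length()) {
                char next = body.charAt(++i);
                out.append(next == 'n' ? '\n' : next); // \n becomes a newline, \" and \\ lose the backslash
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Calling parse once at start-up and keeping the resulting map in memory gives a lookup that needs no separate compilation step.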

And what I further don’t understand is how/if GNU programs were localized before then. I suppose they just weren’t.

(3) What about Unicode? I have no idea how to introduce Unicode characters into the editable “.po” files of gettext. The manual doesn’t help me. Supporting only 8-bit characters, and assuming/hoping that the encoding of the “.po” file is the same as the encoding that the user is using in viewing the output of your program, is simply a terrible solution. Microsoft designed Windows NT to use Unicode internally in 1988. Java has used only Unicode since its inception in 1991.

Unbelievably there is a reason given for not using Unicode.

However, we don’t recommend this approach for all POT files in all packages, because this would force translators to use PO files in UTF-8 encoding, which is – in the current state of software (as of 2003) – a major hassle for translators using GNU Emacs or XEmacs with po-mode.

(4) Using natural language keys. The “best practices” usage of gettext has English texts as the keys. This is supported by the utility tool “xgettext” which extracts strings automatically from your source.

This sounds nice, but I don’t like having English-text (or, in our case, German text) as the keys for translation files. If the text is e.g. “Click here for more info” and then the new style guideline for the site becomes “More Information”, then you end up having

// mypage.php
echo gettext("Click here for more info"); // prints "More Information"

# mypage.po
msgid "Click here for more info"
msgstr "More Information"

I dunno, that’s just confusing for me. I’d much rather have a text-neutral key such as “more-info”.

Update: This article also shows why you can’t use English-language text as translation keys.

(5) Referencing usages from the translation file. The “xgettext” utility writes lines such as the following into the “.po” file

#: mypage.php:47
msgid "Click here for more info"
msgstr "Click here for more info"

I don’t in any way like having the source file name and line number in the translation files. In principle it looks like it helps you to find the usage of a particular string, but in fact:

  1. It is not hard to find all the usages of the key “myprog.error.disk-full”. That string is hardly going to appear in a non-translation context by accident. A recursive search will tell you where its usages are.
  2. What if I change “mypage.php” (which is pretty likely), for example by inserting some lines before line 47? Then the information is not only irrelevant, but also wrong.

It is a principle of mine that not only databases should be normalized, but software source too. Every piece of information should be in exactly one place, and that place is where it’s technically needed (in this case the PHP file, as otherwise the string wouldn’t get displayed), because that’s the only place where it’ll get updated.

(6) Parameters. We all need strings such as “The file ‘$FILE’ has been successfully deleted”. It seems that the standard way to do this in gettext is to use sprintf-type placeholders (e.g. “%s”). However as soon as you have more than one of those, and you translate the string into French, you’ll find you need the parameters the other way around. Oops. That didn’t work. So gettext is only suitable a) for Western European languages (due to character set constraints) and b) only for the subset of those languages which have grammars where placeholders will be needed in the same order.

The first thing I did was write a wrapper around gettext to accept $0, $1 style parameters, so one could swap their order on a per-translated-string basis. (Although $FILE named parameters might have been better; but that would have made the calling code longer.)
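The original wrapper was written in PHP and isn’t shown here. The following is a rough sketch of the same idea in Java, under my own assumptions: the class name “PositionalMessage”, the example strings and the rule that unknown placeholders are left untouched are all invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: replaces $0, $1, ... with the argument at that position.
final class PositionalMessage {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$(\\d{1,3})");

    static String format(String template, String... args) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            int index = Integer.parseInt(matcher.group(1));
            // Leave the placeholder as it is if no argument was supplied for it.
            String value = index < args.length ? args[index] : matcher.group();
            // quoteReplacement stops "$" or "\" inside a value being interpreted.
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The same two arguments, with the order chosen by each translated string.
        System.out.println(format("Moved $0 to $1", "a.txt", "backup"));
        System.out.println(format("$1 now contains $0", "a.txt", "backup"));
    }
}
```

The calling code always passes the arguments in the same order. Each translated string decides for itself where they appear.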

So nice one, they managed to invent, for the purposes of translation, a system which has a file format more difficult to use than a simple key-value pair, yet offering no advantages. It can’t handle Unicode. Good work.

GUI Programming: Always perform network requests asynchronously

Monday, April 23rd, 2007

Why does one feel so much more in control when using Firefox than when using Internet Explorer?

When you select a slow link in Internet Explorer, the whole program hangs for about 1-2 seconds. Firefox doesn’t. Although 1-2 seconds is hardly a large percentage of one’s life, it makes a big difference to the experience one has when using Firefox.

Recently I wrote something similar to an IM client (written in Java). It sits in the Windows tray. You can log in, and open a window where you can do various things. The data is stored on a website (communication is over XML-RPC).

MSN Messenger has one tray-icon for the user being logged out, a different one for the user being logged in, and amazingly (I thought), a third one for during the time the program spends communicating with the servers to log the user on.

In my system, log on is just one single XML-RPC call, with all the necessary data returned in the response. This was a design goal, to never have more than one client-server request to represent a particular user action.

The back-end to this XML-RPC call is a simple Perl script which uses a few objects to represent things such as Users. These objects are simple enough: they just make a few SELECTs against our super-fast database. So I thought, as any request to our back-end takes say max 0.2 seconds, I needn’t make that asynchronous to the UI of the Java program. And I certainly don’t need a separate icon to display during that time!

If I’d ever stated that decision out loud, I would have heard myself saying it, and realized what nonsense it is.

  • While it may only take 0.2 seconds on the server, there’s latency to consider, i.e. the time for the packets to flow from the client to the server and back again.
  • One can’t take into account how slow the user’s network connection might be.
  • There may well be more than one request from client to server, multiplying the latency. Just because there is one XML-RPC request doesn’t mean there are no other requests going on underneath, for example DNS lookup of the hostname to connect to.
  • If there is a queue of HTTP requests in Apache, waiting for the FCGI to answer the XML-RPC request, then the time the HTTP request has to wait in the queue will also be added to the duration of the call perceived by the user.
  • What if there is a server-problem, and all requests take 2 seconds? A design not tolerant of things going wrong is a bad design.
  • Even 0.2 seconds is noticeable in a front-end.
  • Programming asynchronously in Java is not difficult. So it need not be avoided.

So now, every time I log on using that program, the dialog to log in opens, one clicks connect, and … wait … until the success or failure response is shown. And in that time, the program is just dead. It doesn’t even redraw its windows. It may only be 0.5 seconds, but you notice it.

Lesson (it may sound obvious, but it’s still worth stating): in a GUI program (Windows, Mac OS X, etc.), any user interaction that goes over a network must be performed asynchronously (i.e. in a separate thread or process).
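As a rough sketch of what that can look like in Java: the slow call runs on a background thread and hands its result (or its failure) to a callback, so the calling thread returns immediately. The class name “BackgroundCall” and the simulated half-second login are assumptions for illustration. This is not the code of the program described above, and in a Swing program the callbacks would additionally have to be passed to SwingUtilities.invokeLater before touching any widgets.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical helper: runs a slow call off the calling thread.
final class BackgroundCall {
    // Daemon threads, so a hanging request never prevents the program from exiting.
    private static final ExecutorService NETWORK = Executors.newCachedThreadPool(task -> {
        Thread thread = new Thread(task, "network");
        thread.setDaemon(true);
        return thread;
    });

    static <T> CompletableFuture<Void> run(Supplier<T> slowCall,
                                           Consumer<? super T> onSuccess,
                                           Consumer<? super Throwable> onFailure) {
        return CompletableFuture.supplyAsync(slowCall, NETWORK)
            .<Void>handle((result, error) -> {
                if (error != null) {
                    onFailure.accept(error); // may be wrapped in a CompletionException
                } else {
                    onSuccess.accept(result);
                }
                return null;
            });
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        CompletableFuture<Void> pending = run(
            () -> { sleepQuietly(500); return "logged in"; }, // stands in for the XML-RPC call
            result -> System.out.println("Result: " + result),
            error -> System.out.println("Failed: " + error));
        System.out.println("The calling thread is free to redraw windows meanwhile");
        pending.join(); // only so this demo does not exit before the call finishes
    }
}
```

The failure callback matters as much as the success one: a server problem then shows up as an error message rather than as a frozen window.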

Releasing working code

Tuesday, April 17th, 2007

I spend a lot of my time getting annoyed by errors in other people’s software (e.g. Windows). Errors which, when you see them, you wonder how on earth they could have been overlooked. But recently I released a piece of software which contained a major bug (it was only a small mistake, but the consequences were big).

So I set about thinking, what sequences of actions lead, in my experience, to software which works? A lot of these are obvious, yet I’ve found myself often enough not following them, due to time pressure. And the result is stuff which doesn’t work.

(1) Unit test scripts: Make them easy to run. For one product I work on a lot, there is a whole bunch of test scripts, testing all sorts of classes. In fact there are over 21k lines of unit tests! This is a good thing. But sometimes the person running them has to compare the value printed by the program with the expected value (i.e. has to know the expected value, not easy 2 years after the program was written). And not all classes are tested at all. But there are still a good few which do good tests and print “ok” if the result is correct. This is good, but it’s so much work to run them all. The solution is to chain them all together, as is easy to do with JUnit, and create one simple command or click to test them all. If it’s simple and convenient and creates value, people will do it.

Also, having a framework into which to put tests – for example, having a convention that a class called “X” has a test class called “XTest”, and that methods like “operationY” on “X” have a method “testOperationY” in “XTest” – encourages people not to be scared to write tests. (But forcing people to write tests, e.g. one test per method, is a waste of time. Not every method needs a test.)
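To illustrate the naming convention and the “one command” idea without depending on any library, here is a small stand-in runner built on reflection. It is not JUnit and does not use JUnit’s API. The classes “Counter”, “CounterTest” and “RunAllTests” are invented for this sketch.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

// Hypothetical class under test.
final class Counter {
    private int value;
    void increment() { value++; }
    int value() { return value; }
}

// Convention: class "Counter" is tested by "CounterTest",
// and method "increment" by "testIncrement".
final class CounterTest {
    void testIncrement() {
        Counter counter = new Counter();
        counter.increment();
        if (counter.value() != 1) {
            throw new AssertionError("expected 1 but was " + counter.value());
        }
    }
}

final class RunAllTests {
    /** Runs every no-argument method whose name starts with "test"; returns the failure count. */
    static int run(Class<?>... testClasses) {
        int failures = 0;
        for (Class<?> testClass : testClasses) {
            for (Method method : testClass.getDeclaredMethods()) {
                if (!method.getName().startsWith("test") || method.getParameterCount() != 0) {
                    continue;
                }
                String name = testClass.getSimpleName() + "." + method.getName();
                try {
                    Constructor<?> constructor = testClass.getDeclaredConstructor();
                    constructor.setAccessible(true);
                    method.setAccessible(true);
                    method.invoke(constructor.newInstance()); // fresh instance per test
                    System.out.println("ok      " + name);
                } catch (InvocationTargetException e) {
                    failures++; // the test itself threw
                    System.out.println("FAILED  " + name + ": " + e.getCause());
                } catch (ReflectiveOperationException e) {
                    failures++; // the test could not even be started
                    System.out.println("ERROR   " + name + ": " + e);
                }
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        int failures = run(CounterTest.class);
        System.out.println(failures + " failure(s)");
    }
}
```

Nobody has to know an expected value by heart: each test compares for itself, and the runner only reports “ok” or the failure.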

(2) Know what the important features are. Most websites really have many, many features. It’s impossible to test them all without restricting oneself to 6 month release cycles and 1 month test phases. But there are usually a bunch of features which would be show-stoppers if they didn’t work. Can a new user register? Can they upload a photo? (For a photo website.) Can they send an SMS? (For an SMS website.) Write these show-stopping features down. Before the release, go through and test them on the pre-production server. After the release, test them again on the live system. Writing them down helps one not to forget the ones one can’t be bothered to test.

It doesn’t matter if this list is long. Maybe there really are a ton of features which simply cannot not work. Then you’d better have tested them all.

(3) For unimportant operations, ignore failure. Recently I wrote a program which writes a ZIP file. As a small extra feature, if a file in the ZIP hasn’t changed since the last time the program ran, the timestamp of the file in the new ZIP file should be the same as in the old one. This isn’t a very important feature, but it’s there. Once, when it ran, there was a file I/O problem reading the old file, and the program aborted. But this isn’t an important enough feature to abort execution: the program should have continued, and just given all files in the new ZIP a new timestamp.

Consider this when writing all code: if this goes wrong, does it matter? If not, when something goes wrong (any Throwable), log the exception and continue. You’ll kick yourself if failure of one part doesn’t matter, yet it brings down the whole program.
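A minimal sketch of that “log and continue” pattern follows. The helper name “BestEffort” and the simulated I/O failure are assumptions for illustration, not the code of the ZIP program.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper: run an optional step, fall back to a default if it fails.
final class BestEffort {
    private static final Logger LOG = Logger.getLogger(BestEffort.class.getName());

    static <T> T orElse(String what, Callable<T> optionalStep, T fallback) {
        try {
            return optionalStep.call();
        } catch (Throwable failure) {
            // The step is not important enough to abort for: log it and carry on.
            LOG.log(Level.WARNING, "Optional step failed, continuing: " + what, failure);
            return fallback;
        }
    }

    // Stand-in for reading the timestamp out of the previous ZIP file.
    private static long readOldTimestamp() throws IOException {
        throw new IOException("simulated I/O problem reading the old file");
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long timestamp = orElse("reuse timestamp from old ZIP", BestEffort::readOldTimestamp, now);
        System.out.println("Writing entry with timestamp " + timestamp);
    }
}
```

Wrapping only the optional step keeps the decision explicit: everything outside the wrapper still aborts on failure, as it should.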

(4) Restart everything. You’ve only changed one small piece of code, why incur the cost of restarting all Apaches and all robots? Well, software’s strange, and any change, however localized, can break any functionality. Any programmer knows this to be true. If you don’t restart everything, how will you know everything still works? How will you test it?

(5) Look at the log files after releasing. Even if, after a restart of the live servers, everything seems fine, what are the users seeing? They’re testing different paths than you. If you log uncaught Exceptions, take a look at the log file before the release, and again after the restart after the release, and see if there are more errors. For example, SQL errors which weren’t there beforehand. This could alert you to a problem you’ve overlooked.

(6) Static checks are good. Programming languages such as Smalltalk and LISP popularized the notion that it’s cool to do everything, such as method lookup, at runtime. “It gives you more flexibility.” While this is certainly true, there are a lot of errors which you’ll then only find at runtime. (The same is true of SQL strings in program code: You will only know if you’ve misspelled a column name in the SQL when you run the particular piece of code.) This does not help to minimize your errors. I appreciate that taking code online which hasn’t even been run once is hardly a good idea, but I’ve seen it happen often enough.

Java and Hibernate are a good combination in this respect. If the Java program compiles then you know you’ve got all your variable names, function names, type-casts and Exception checking right. If the Hibernate program starts then you know the classes map to existing tables correctly. (But HQL queries, represented as strings within your program, are bad again, as you could make a spelling mistake, and it will only cause an exception when the particular code is executed.)

If one has to have SQL strings in the program, and thus an error in it will only be detected once the code path is executed, maybe a prepare of the statement can be placed in a static constructor of the class? That way at least when the class is loaded (at the start of the program’s execution, most likely) one will find out about the problem.

(7) Be aggressive about cleaning old code. The more code there is, the more complex a system is to understand. If one has a new chat system, why is code which communicates with the old chat system still there? What if that code relies on classes which you’re about to change? What if it communicates with an old chat server and the results aren’t displayed anywhere any more, and then that old chat server goes away? The motto “clean code that works” does not involve having 100k lines of old junk around, which no one understands, no one wants to take the time to learn (as it’s no longer relevant), and will break randomly.

(8) Compile all of the program. It’s obvious: before one releases a Java program, one does a “clean all” then a compile, just to check that one hasn’t changed a class and forgotten to recompile a client of it, which would result in a runtime error, e.g. a NoSuchMethodError. Why doesn’t one do the same in scripting languages? Admittedly scripting language compilers don’t check as much, but they still check some things (e.g. syntactic correctness). One “unit test” should be to go through every program file – every library, every CGI, every PHP page – and do a compile check on it.

(9) Release emails. It’s easier to delete an email which one’s not interested in, than to find out information from an email one didn’t receive and doesn’t know was ever sent. If a service like a website suddenly breaks, it’s important to fix it as soon as possible, and that probably means contacting the person who caused it to break. Before (not after) a release, write an email to all concerned – operations engineers, software developers, support agents, managers – and let them know that a change is going live.

(10) Be contactable. There’s nothing worse, for creating a perception of negativity, than when someone’s made an error, and you can’t contact them. Make sure mobiles are on loud. If you’re not reading email for some reason, make sure you’ve told everyone in advance, and set up an auto-responder. Want to be contacted less? Make fewer errors.

(11) Monitoring. For each robot and front-end program, one needs to define which conditions are acceptable and which are not. E.g. which logs must be written by the correctly-running program, and which logs must not be written. Monitor them. This takes quite a lot of effort: a) the monitoring software, b) defining what are acceptable and unacceptable conditions, c) tuning the software to actually produce logs which are usefully monitorable. But it’s necessary. If it’s not done, there will be errors written to the logs and nobody will see them.
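One way to make “acceptable” and “unacceptable” concrete is to write the conditions down as patterns and check the logs against them. The sketch below does that for a list of log lines. The class name “LogCheck” and the example log lines are invented, and a real setup would read the actual log files and raise an alert rather than print.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical check: which required log entries are missing, which forbidden ones appear?
final class LogCheck {
    static List<String> violations(List<String> logLines,
                                   List<Pattern> mustAppear,
                                   List<Pattern> mustNotAppear) {
        List<String> problems = new ArrayList<>();
        for (Pattern required : mustAppear) {
            boolean seen = false;
            for (String line : logLines) {
                if (required.matcher(line).find()) {
                    seen = true;
                    break;
                }
            }
            if (!seen) {
                problems.add("missing: " + required.pattern());
            }
        }
        for (String line : logLines) {
            for (Pattern forbidden : mustNotAppear) {
                if (forbidden.matcher(line).find()) {
                    problems.add("forbidden: " + line);
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "10:00 robot started",
            "10:05 SQL error: unknown column");
        List<String> problems = violations(
            log,
            Arrays.asList(Pattern.compile("robot started"), Pattern.compile("run finished")),
            Arrays.asList(Pattern.compile("SQL error")));
        for (String problem : problems) {
            System.out.println(problem);
        }
    }
}
```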

We don't need these users – let's move them to an "archive" table!

Thursday, March 29th, 2007

For one of the customers I currently work for, when we first designed the platform in Q1/2000, there was the "account" table, where we stored our users. There were always various pressures to move "inactive" users to a separate "archive" table. I was always against this decision.

In Q4/2005, during a period of my absence, it was decided to implement this decision. A bunch of users were to be deleted, but "not quite", in case we needed their data again. Their data was to be moved from the "account" table to an "account_archive" table.

This was really the worst decision ever made. I said that before, and now I see the consequences. I want anyone who considers such an operation good, to understand the consequences. So I list them here.

  • More and more, bosses and business people require we do operations on "all" users, which includes the "account_archive" table. This generally involves a "union" of both tables.
  • Now I have to create a real-time data interface to a slave system. This also includes archived users. That means I have a "who has changed" table (input queue for the process exporting changed users to the slave system). This table references account_ids, but I can't create an FK from this table to "account", as sometimes an "account_id" references "account" and sometimes "account_archive".
  • There are classes which model a User, and this uses the "account" table as the underlying table. This enables me to build logic functions on the User class, and this has been done. However, at the time the class was built, there was only "account", so I can't use this class to model users who are stored in the "account_archive" table. (I'm not going to extend the User object to include the "account_archive" table, that will make this critical code too complex)
  • Now I have to allow users to "unsubscribe" from a newsletter, and "archive" people can receive newsletters, if they elected to receive them while they were active. Again, I can't use the User objects to do that. So I have to just program in plain SQL in an fcgi (or create a second class MaybeArchivedUser to model a user which could be in either table, and then duplicate some instance methods – that's what I chose to do).
  • It was suggested "maybe we archived the wrong users". But it's nearly impossible to re-create them, as the schema is different and some information has not been kept in "account_archive". Their nicknames, which are unique amongst active users (but not amongst archive users), might have been reused in the meantime.

It would have been necessary to decide on one of the two courses of action:

  • We actually will never need these users again: we delete them
  • We might or do need them in the future. In which case we set a special "status" in the "account" table. They can't log in. But we can build User objects. We can re-enable them if necessary. We can even let them log in to some mini-platform where they can do a few things such as delete themselves or request their reactivation.
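The second option can be sketched in a few lines. This is an in-memory illustration only: the names “AccountStatus”, “Account” and “Accounts” are invented here and are not the real User class or schema.

```java
import java.util.ArrayList;
import java.util.List;

enum AccountStatus { ACTIVE, ARCHIVED }

// Hypothetical model: one kind of account, with a status instead of a second table.
final class Account {
    final long id;
    final String nickname;
    AccountStatus status = AccountStatus.ACTIVE;

    Account(long id, String nickname) {
        this.id = id;
        this.nickname = nickname;
    }

    boolean canLogIn() {
        return status == AccountStatus.ACTIVE;
    }
}

final class Accounts {
    private final List<Account> rows = new ArrayList<>();

    void add(Account account) {
        rows.add(account);
    }

    /** "All users" is simply every row; no union of two tables is needed. */
    List<Account> all() {
        return new ArrayList<>(rows);
    }

    List<Account> active() {
        List<Account> result = new ArrayList<>();
        for (Account account : rows) {
            if (account.canLogIn()) {
                result.add(account);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Accounts accounts = new Accounts();
        Account alice = new Account(1, "alice");
        Account bob = new Account(2, "bob");
        accounts.add(alice);
        accounts.add(bob);

        bob.status = AccountStatus.ARCHIVED; // "archiving" is a status change
        System.out.println(accounts.active().size() + " active of " + accounts.all().size());

        bob.status = AccountStatus.ACTIVE;   // and so is re-enabling
        System.out.println(accounts.active().size() + " active of " + accounts.all().size());
    }
}
```

Because an archived user is still an ordinary account, the same class, the same foreign keys and the same unique-nickname rule keep applying to it.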

It has been said that removing users from account "increases performance". However:

  • It's more probable that two accounts will be read from the same disk block, but only after a defragmentation has occurred (did any defragmentation run? I don't think so)
  • If there are half the number of users, that's one less binary-index level. If there are 2M users that's 21 index branches. If 1M users that's 20 index branches. Hardly a big saving.
  • Backup (and recovery) no doubt became quicker, but this hardly makes up for the other disadvantages
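The index figures above can be checked with a few lines of arithmetic. The sketch assumes, as the text does, a perfectly balanced binary index, where the number of levels is the base-2 logarithm of the row count, rounded up. The class name “IndexDepth” is made up for the example.

```java
// Hypothetical helper: levels of a perfectly balanced binary tree holding n keys.
final class IndexDepth {
    static int binaryLevels(long n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be positive");
        }
        // ceil(log2(n)) computed with integer arithmetic: the bit length of n - 1.
        return 64 - Long.numberOfLeadingZeros(n - 1);
    }

    public static void main(String[] args) {
        System.out.println(binaryLevels(2_000_000)); // 21
        System.out.println(binaryLevels(1_000_000)); // 20
    }
}
```

Halving the table saves exactly one comparison per lookup.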

Splitting a table up into two tables, for "performance", or whatever reason, is never a good thing to do. Add a status flag.