Archive for May, 2007

When is a software project done?

Tuesday, May 29th, 2007

A software project is defined, for the purposes of this blog entry, as a set of people working to produce a new software system, or to modify an existing software system.

The result (exit condition) of a software project is a set of artifacts and other assertions:

  1. Document (or wiki etc) describing what the software should do, i.e. requirements. This will include subtle details, about what the system does, that will not immediately be obvious by looking at the front-end, or reading software design documentation. This should be a complete description, which is useful for the future, not just a “delta” from the last version.
  2. Software architecture documentation, in words. Simply looking at 1,000 Javadocs will not enable a new team member to understand the system. Documentation should also include which other options were evaluated and not chosen, and why, to avoid future teams considering the same things.
  3. (Obviously) the source code for the software, including the front-end, back-end, any HTML, etc.
  4. Unit test scripts for all back-end classes needing them.
  5. Front-end tests. Either a document (simple statements such as “Click on Submit without enough money on account. See error message”), or configuration of a front-end testing program.
  6. Performance tests done and the software to perform them, if appropriate.
  7. Configuration (or creation) of a monitoring system to monitor the system once it’s live, if it’s a service (e.g. web site).
  8. Administration system for customer care, if it’s a service.
  9. Management reporting. Especially just after a system goes live, management are always very curious about key statistics, such as number of users, number of items sold etc. That needs to be analyzed in advance and the system in place when the system goes live.
  10. Class diagrams
  11. Javadoc to describe the purpose of individual classes and methods (where this is not obvious from the names). For scripting languages: parameter and return types (as this cannot be deduced from the source code).
  12. If this is not the first version of a system, migration concept including scripts to install software, migrate schema, filesystems containing user data, and anything else.
  13. The system uses an appropriately international character set such as UTF-8. (This is not particularly modern; the WinNT team decided to do this in 1988.) Java does this out of the box, but it’s more than just the programming language. This includes any database, any data stored in flat-files, any APIs (within the system or to/from external systems), and so on.
  14. All of the above under version control
  15. Not only the software installed on a live system, but also the existence of test and staging systems. If one uses the live systems for testing, then, once one’s gone live, one has no way to fix bugs in a testing environment. And bugs will happen, and they need to be fixed fast, so one had better have thought of this in advance.
  16. Bug tracking, or wiki system, or some way that the team is trained and rehearsed in using, to track and assign errors as they occur.
  17. Understood and tested data backup and recovery process. (What happens if the live DB crashes? Better have thought about recovery before that happens.)
  18. The team must sleep e.g. 2 days before a release. After a release (bug fixing) is the most stressful time of a project and where the team must be at its most alert (as fixing is time-critical). It’s important to sleep beforehand, and not e.g. work 7 days a week then in the evening finally release, then go to bed. (You can be certain that 1 hour after you’ve gone to sleep the site will be offline due to some problems, and you weren’t there to fix them.)

Memorable URLs

Thursday, May 24th, 2007

One thing I have to say I really like about uboot is that I can always remember the URLs to the various places in my nickpage. I can just type them into an IM conversation and don’t even have to click on them to make sure I got them right. (I don’t have to go to the nickpage and copy-paste the URL.)

I appreciate uboot isn’t the only website to have URLs which one can remember, but it’s something that Uboot’s really got right.

gettext is so broken

Friday, May 18th, 2007

On a PHP project I worked on recently, there was a requirement for text localization. The standard way to do this in PHP is to use the standard way to do this in C, which is gettext.

I’ve worked with various translation systems, including one I built myself for uboot, involving a hierarchy of languages going from most specific to most “international”, and with each string having a hierarchical id such as “myprogram.errors.disk-full”.
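A minimal sketch of that kind of lookup, written in Python for brevity. It is not the uboot code: the `CATALOGS` table, the locale names and the `translate` function are all invented for illustration. Each language level is tried from the most specific to the most general, and the first catalog containing the key wins.

```python
# Hypothetical catalogs: one dict of hierarchical string ids per language level.
CATALOGS = {
    "de_AT": {"myprogram.errors.disk-full": "Die Platte ist leider voll."},
    "de": {
        "myprogram.errors.disk-full": "Die Festplatte ist voll.",
        "myprogram.buttons.ok": "OK",
    },
    "international": {
        "myprogram.errors.disk-full": "Disk full.",
        "myprogram.buttons.ok": "OK",
        "myprogram.buttons.cancel": "Cancel",
    },
}


def fallback_chain(locale):
    """Return the language levels to try, most specific first."""
    chain = [locale]
    if "_" in locale:
        chain.append(locale.split("_", 1)[0])  # "de_AT" -> "de"
    chain.append("international")
    return chain


def translate(key, locale):
    """Look the key up in each level of the chain and return the first hit."""
    for candidate in fallback_chain(locale):
        catalog = CATALOGS.get(candidate, {})
        if key in catalog:
            return catalog[key]
    # No translation anywhere: return the key so the gap is easy to spot.
    return key
```

With this layout a regional catalog only needs to contain the strings that actually differ from the more general language.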

Java Properties files are simple but also work well (simplicity being a positive thing in this case). The lines are key-value pairs, and using a convention such as “myprogram.errors.disk-full” the key is almost as good as if it actually were a key hierarchy. The file is in Latin1 but Unicode characters can be used via an escape syntax, and there are many editors where one can just type Unicode text and which take care of the escaping.

So I was looking forward to using gettext. This format was created by GNU, the creators of GCC (a highly respected program). gettext is itself well respected and authors of systems such as PHP have chosen it as their localization system.

But alas, it is broken in so many ways.

(1) The file format. Whereas Java’s file format is to have lines such as “key=value”, gettext’s “.po” format (where did that extension come from?) has two lines for every string, like

msgid "key"
msgstr "value"

As one inevitably places a blank line between one key-value pair and the next, the file is immediately 3 times as long as a Java properties file storing the same information. And what if you want to have double-quotes within your string?

(2) Compilation (for performance reasons). I work with scripting languages, where there is no compiler. This can be a good or a bad thing; but independent of that, it is a fact. However the editable “.po” files of gettext have to be converted into binary “.mo” files before they work. Thus I have to introduce a compilation step into my otherwise compilation-free edit-and-that’s-it test environment.

In fact I don’t understand this compilation requirement at all. According to the gettext manual, gettext was developed in 1994. Surely computers were fast enough back then to parse the gettext format, store the whole lot in a hash?
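To give an idea of how little work that parsing is, here is a rough sketch in Python. It is a toy, not a replacement for the real tools: the `parse_po` function is my own invention and it only understands single-line `msgid`/`msgstr` pairs, ignoring multi-line strings, plural forms and escape sequences.

```python
import re

# Matches lines of the form:  msgid "text"   or   msgstr "text"
ENTRY_LINE = re.compile(r'^(msgid|msgstr)\s+"(.*)"\s*$')


def parse_po(text):
    """Parse single-line msgid/msgstr pairs into a dict (a hash)."""
    translations = {}
    current_id = None
    for line in text.splitlines():
        match = ENTRY_LINE.match(line.strip())
        if not match:
            continue  # blank lines, comments, anything unsupported
        keyword, value = match.groups()
        if keyword == "msgid":
            current_id = value
        elif current_id is not None:
            translations[current_id] = value
            current_id = None
    return translations


SAMPLE = '''
#: mypage.php:47
msgid "Click here for more info"
msgstr "More Information"

msgid "key"
msgstr "value"
'''

catalog = parse_po(SAMPLE)
print(catalog["Click here for more info"])  # prints: More Information
```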

And what I further don’t understand is how/if GNU programs were localized before then. I suppose they just weren’t.

(3) What about Unicode? I have no idea how to introduce Unicode characters into the editable “.po” files of gettext. The manual doesn’t help me. Supporting only 8-bit characters, and assuming/hoping that the encoding of the “.po” file is the same as the encoding that the user is using in viewing the output of your program, is simply a terrible solution. Microsoft designed Windows NT to use Unicode internally in 1988. Java has used only Unicode since its inception in 1991.

Unbelievably there is a reason given for not using Unicode.

However, we don’t recommend this approach for all POT files in all packages, because this would force translators to use PO files in UTF-8 encoding, which is - in the current state of software (as of 2003) - a major hassle for translators using GNU Emacs or XEmacs with po-mode.

(4) Using natural language keys. The “best practices” usage of gettext has English texts as the keys. This is supported by the utility tool “xgettext” which extracts strings automatically from your source.

This sounds nice, but I don’t like having English-text (or, in our case, German text) as the keys for translation files. If the text is e.g. “Click here for more info” and then the new style guideline for the site becomes “More Information”, then you end up having

// mypage.php
echo gettext("Click here for more info"); // prints "More Information"

# mypage.po
msgid "Click here for more info"
msgstr "More Information"

I dunno, that’s just confusing for me. I’d much rather have a text-neutral key such as “more-info”.

Update: This article also shows why you can’t use English-language text as translation keys.

(5) Referencing usages from the translation file. The “xgettext” utility writes lines such as the following into the “.po” file

#: mypage.php:47
msgid "Click here for more info"
msgstr "Click here for more info"

I don’t in any way like having the source file name and line number in the translation files. In principle it looks like it helps you to find the usage of a particular string, but in fact:

  1. It is not hard to find all the usages of the key “myprog.error.disk-full”. That string is hardly going to appear in a non-translation context by accident. A recursive search will tell you where its usages are.
  2. What if I change “mypage.php”? (which is pretty likely). For example inserting some lines before line 47. Then the information is not only irrelevant, but in addition wrong.
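The recursive search from point 1 needs nothing special. A rough sketch in Python, where the `find_usages` name and the default `.php` extension are just assumptions for the example (a plain recursive grep does the same job):

```python
import os


def find_usages(root, key, extensions=(".php",)):
    """Return (path, line_number) for every line under root that contains key."""
    usages = []
    for directory, _subdirs, filenames in os.walk(root):
        for filename in sorted(filenames):
            if not filename.endswith(extensions):
                continue  # skip files that are not source code
            path = os.path.join(directory, filename)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for line_number, line in enumerate(handle, start=1):
                    if key in line:
                        usages.append((path, line_number))
    return usages
```

Because the line numbers are computed at the moment of searching, they cannot go stale the way a number written into the translation file does.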

It is a principle of mine that not only should databases be normalized, but software source also. Every piece of information should be in exactly one place. And that place is where it’s technically needed (in this case, in the PHP file, as otherwise the string wouldn’t get displayed), as that’s the only place where it’ll get updated.

(6) Parameters. We all need strings such as “The file ‘$FILE’ has been successfully deleted”. It seems that the standard way to do this in gettext is to use sprintf-type placeholders (e.g. “%s”). However as soon as you have more than one of those, and you translate the string into French, you’ll find you need the parameters the other way around. Oops. That didn’t work. So gettext is only suitable a) for Western European languages (due to character set constraints) and b) only for the subset of those languages which have grammars where placeholders will be needed in the same order.

The first thing I did was write a wrapper around gettext to accept $0, $1 style parameters, so one could swap their order on a per-translated-string basis. (Although $FILE named parameters might have been better; but that would have made the calling code longer.)
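The idea behind such a wrapper, sketched in Python rather than PHP. This is not the wrapper I actually wrote: the `fill` function and the two sample strings are made up for the example, and in real use the template would be whatever string the translation lookup returned.

```python
import re

# A placeholder is a dollar sign followed by a position: $0, $1, $2, ...
PLACEHOLDER = re.compile(r"\$(\d+)")


def fill(template, *args):
    """Replace $0, $1, ... in template with the matching positional argument."""
    def substitute(match):
        index = int(match.group(1))
        if index >= len(args):
            return match.group(0)  # leave unknown placeholders untouched
        return str(args[index])
    return PLACEHOLDER.sub(substitute, template)


# Hypothetical strings; the translation uses the arguments in the other order.
english = "Copied $0 to $1"
reordered = "Nach $1 kopiert: $0"

print(fill(english, "a.txt", "backup"))    # prints: Copied a.txt to backup
print(fill(reordered, "a.txt", "backup"))  # prints: Nach backup kopiert: a.txt
```

The calling code passes the arguments in one fixed order, and each translated string decides for itself where they go.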

So nice one, they managed to invent, for the purposes of translation, a system which has a file format more difficult to use than a simple key-value pair, yet offering no advantages. It can’t handle Unicode. Good work.

Making progress with introduction of unit tests to Uboot

Monday, May 14th, 2007

The old uboot code had, amazingly enough, 21k lines of unit tests. But they were not useful unit tests, as one had to run each program individually, and they each had a bunch of (different) prerequisites, such as account_id 3 existing and having an empty inbox, and so on. And with the older tests, their output would be a bunch of print statements (e.g. insert message; print count of messages), and one would have to compare the printed output with the expected results (which weren’t documented anywhere).

I am converting them to PerlUnit (which is a clone of JUnit) so that we can automatically and easily run as many tests as possible before each release. This is an incredibly productive task, as I don’t even need to write new unit tests (and think about testing strategy); I’m just converting the existing lines into a format that makes them convenient to run!
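The shape of the conversion, shown with Python’s unittest module (another member of the JUnit family) rather than PerlUnit. The `Inbox` class is a made-up stand-in for the real code; the point is only that the prerequisites move into a set-up method and the expected results move into assertions.

```python
import unittest


class Inbox:
    """Hypothetical stand-in for the real message store."""

    def __init__(self):
        self.messages = []

    def insert(self, text):
        self.messages.append(text)

    def count(self):
        return len(self.messages)


class InboxTest(unittest.TestCase):
    def setUp(self):
        # The prerequisite (an empty inbox) is created for every test,
        # instead of being assumed to exist already.
        self.inbox = Inbox()

    def test_new_inbox_is_empty(self):
        self.assertEqual(self.inbox.count(), 0)

    def test_insert_increases_count(self):
        self.inbox.insert("hello")
        # The expected result is stated here, not compared by eye.
        self.assertEqual(self.inbox.count(), 1)


def run_all():
    """Run every test in one go and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(InboxTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```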

So far 3.6k lines in 86 test functions in 33 test classes :)

$ ./test.pl
...................................................
...................................
Time: 48 wallclock secs ( 8.65 usr  0.56 sys +  0.02 cusr  0.25 csys =  9.48 CPU)

OK (86 tests)

Transferring some hex. Sometimes gets replaced by string "INF". Why?

Thursday, May 10th, 2007

This was never going to work out. Data transfer interface. Our side in Perl and their side in PHP. Both scripting languages (bad) and not even the same scripting language (incompatible badness).

Over the data transfer interface, we are transferring users. Including a code to enable them to unsubscribe from an email newsletter. The first 7 characters of the code identify the users (digits) and the rest of the code is a hex string containing some security information.

All works great. But some users can’t use the code? It turns out on the destination system they have “INF” in the field instead of the code.

It turns out that some of these users have e.g. 1234567 to identify the user, and e.g. 123e1234567 as their hex code. That makes the security code “1234567123e1234567”. And that “looks like” a floating point number to Perl. But quite a big one. Almost as big as Infinity in fact, so might as well call it that.

I hardly think the flexibility we “won” through every data instance having its own type based on what its data “looks like” compensates for the anger of a segment of our users not being able to unsubscribe from their newsletter, or the extra expense to the company of the time to debug this problem (which was then an urgent problem, as it was only discovered after the system went live, as it only affected 0.6% of our users).

P.S. my solution was to put a space in front of the code, which is taken off by the receiving system, so the data always “looks like” a string. But I wouldn’t like to guarantee that what “looks like” a string won’t change with the next version of the Perl SOAP client libraries we are using.
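The effect and the workaround can be imitated in a few lines of Python. To be clear, this is not the code of the Perl SOAP libraries, which I have not reproduced here: `guess_and_serialize` and its pattern are assumptions that merely mimic a serializer guessing a type from what the data “looks like”.

```python
import math
import re

# Anything matching this pattern "looks like" a number to our toy serializer.
NUMBER_PATTERN = re.compile(r"[+-]?\d+(\.\d+)?([eE][+-]?\d+)?")


def guess_and_serialize(value):
    """Serialize a string, guessing its type from its appearance."""
    if NUMBER_PATTERN.fullmatch(value):
        number = float(value)  # digits, "e", digits: parsed as an exponent
        if math.isinf(number):
            return "INF"       # far too large for a double
        return repr(number)
    return value               # everything else is passed through as text


code = "1234567" + "123e1234567"       # user id followed by the hex part
print(guess_and_serialize(code))        # prints: INF
print(guess_and_serialize(" " + code))  # leading space: passed through as text
```

The leading space works here only because this particular pattern rejects it, which is exactly the kind of detail that can change from one library version to the next.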

Class names repeating information stated in the package name

Sunday, May 6th, 2007

Classes in modern programming languages can be arranged in hierarchies, e.g. a Perl class might be called “Uboot::Message::Mail” or a Java class “com.uboot.message.Mail”.

In some programming languages (e.g. Perl) one always refers to the class by its full name (such as “Uboot::Message::Mail”) and never by its leaf name (e.g. “Mail”). For example:

use Uboot::Message::Mail;
my $mail = Uboot::Message::Mail->new();
print "it's a mail" if ($mail->isa("Uboot::Message::Mail"));

In other languages (e.g. Java) one almost always refers to classes via their leaf-name, such as:

import com.uboot.message.Mail;
class MyClass {
   public static void main(String[] args) {
      Mail mail = new Mail();
      if (mail instanceof Mail) System.out.println("it's a mail");
   }
}

For those languages such as Perl, which require using the class’ full path at all times, it’s not necessary to repeat information in the leaf name that has been specified already in the path. For example, a class to model an entry in a Uboot address book might be in a directory called “Uboot/ABook” in which case the entry class can be called “Uboot::ABook::Entry”.

But in Java, you don’t want to have a class called “Entry” because, as soon as the “import” statement scrolls out of sight, you’ll not know if your instance, helpfully statically typed to be an “Entry”, is an address book entry, a guestbook entry, a blog entry, or any other conceivable type of entry. In that case the class needs to be called something like “com.uboot.abook.ABookEntry”.

Class names like “Uboot::ABook::ABookEntry” or “Uboot::Monitoring::MonitoringResult” are (only in languages such as Perl) needlessly redundant and long.

perl / switch statement: Cool Limitation

Wednesday, May 2nd, 2007

Look at the documentation for the Perl switch statement. Look down the bottom at the “limitations” section. Look at the last limitation.

vi

Tuesday, May 1st, 2007

Here I am, programming using “vi” and, as usual, it’s annoying me. Why am I using it?

It’s just occurred to me, I remember from my childhood, my father would come home from work and complain about “vi”.

I wonder if my children will use “vi”?