Don’t use gettext

By Adrian Smith, 18 May 2007. 1100 words, 5 mins to read.

Working on a PHP project recently, I needed text localization. The standard way to do this in PHP is to use the standard way to do this in C, which is gettext.

I've worked with various translation systems, including one I built myself for uboot, involving a hierarchy of languages going from most specific to most "international", and with each string having a hierarchical id such as "myprogram.errors.disk-full".

Java Properties files are simple but also work well (simplicity being a positive thing in this case). The lines are key-value pairs, and with a convention such as "myprogram.errors.disk-full" the key is almost as good as if it actually were a key hierarchy. The file is in Latin-1, but Unicode characters can be used via the \uXXXX escape syntax, and there are many editors where one can just type Unicode text and which take care of the escaping.
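To show how little such a format needs, here is a minimal sketch in PHP of a loader for "key=value" lines with \uXXXX escapes. The function names parse_properties and utf8_from_codepoint are made up for illustration, and the sketch assumes a simplified subset of the real format: no line continuations, no ":" separator, and only code points that fit in four hex digits.

```php
<?php
// Hypothetical helper: encode one code point (U+0000..U+FFFF) as UTF-8.
function utf8_from_codepoint(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xE0 | ($cp >> 12))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: parse properties-style text into a PHP array.
function parse_properties(string $text): array
{
    $strings = [];
    foreach (preg_split('/\r\n|\r|\n/', $text) as $line) {
        $line = ltrim($line);
        // Skip blank lines and comment lines ("#" or "!").
        if ($line === '' || $line[0] === '#' || $line[0] === '!') {
            continue;
        }
        $pos = strpos($line, '=');
        if ($pos === false) {
            continue; // not a key=value line; ignored in this sketch
        }
        $key = rtrim(substr($line, 0, $pos));
        $value = ltrim(substr($line, $pos + 1));
        // Decode \uXXXX escapes into UTF-8 bytes.
        $value = preg_replace_callback(
            '/\\\\u([0-9a-fA-F]{4})/',
            function (array $m): string {
                return utf8_from_codepoint(hexdec($m[1]));
            },
            $value
        );
        $strings[$key] = $value;
    }
    return $strings;
}
```

The result is an ordinary array, so a lookup is just $strings['myprogram.errors.disk-full'].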

So I was looking forward to using gettext. This format was created by GNU, the creators of GCC (a highly respected program). gettext is itself well respected and authors of systems such as PHP have chosen it as their localization system.

But alas, it is broken in so many ways.

The file format is verbose

Whereas Java's file format is to have lines such as "key=value", gettext's ".po" format (where did that extension come from?) has two lines for every string, like:

msgid "key"
msgstr "value"

As one inevitably places a blank line between one key-value pair and the next, the file is immediately 3 times as long as a Java properties file storing the same information. And what if you want to have double-quotes within your string?

Unnecessary compilation

I work with scripting languages, where there is no compiler. This can be a good or a bad thing; but independent of that, it is a fact. However the editable ".po" files of gettext have to be converted into binary ".mo" files before they work. Thus I have to introduce a compilation step into my otherwise compilation-free edit-and-that's-it test environment.

In fact I don't understand this compilation requirement at all. According to the gettext manual, gettext was developed in 1994. Surely computers were fast enough back then to parse the gettext format and store the whole lot in a hash?
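As a rough illustration of the "parse it and store it in a hash" idea, here is a sketch in PHP. The function parse_po is hypothetical and is not part of gettext or PHP. It assumes a simplified ".po" subset in which every msgid and msgstr fits on one line; multi-line strings, plural forms and contexts are left out.

```php
<?php
// Hypothetical helper: read simplified ".po" text into a PHP array (hash).
function parse_po(string $text): array
{
    $strings = [];
    $currentId = null;
    foreach (preg_split('/\r\n|\r|\n/', $text) as $line) {
        // Match msgid "..." or msgstr "...", allowing \" inside the quotes.
        $pattern = '/^(msgid|msgstr)\s+"((?:[^"\\\\]|\\\\.)*)"\s*$/';
        if (!preg_match($pattern, $line, $m)) {
            continue; // comments such as "#: file:line", blank lines, etc.
        }
        // Undo the C-style escapes (\" \\ \n \t) used inside the quotes.
        $value = stripcslashes($m[2]);
        if ($m[1] === 'msgid') {
            $currentId = $value;
        } elseif ($currentId !== null) {
            $strings[$currentId] = $value;
            $currentId = null;
        }
    }
    return $strings;
}

// Usage: the text would normally come from file_get_contents() on a ".po" file.
$po = <<<'PO'
#: mypage.php:47
msgid "Click here for more info"
msgstr "More Information"
PO;
$strings = parse_po($po);
echo $strings['Click here for more info'], "\n";
```

Whether this would have been fast enough in 1994 is a separate question; the sketch only shows that the parsing itself is not complicated.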

And what I further don't understand is how/if GNU programs were localized before then. I suppose they just weren't.

Lack of support for Unicode

I have no idea how to introduce Unicode characters into the editable ".po" files of gettext. The manual doesn't help me. Supporting only 8-bit characters, and assuming or hoping that the encoding of the ".po" file is the same as the encoding the user is using to view the output of your program, is simply a terrible solution. Microsoft designed Windows NT to use Unicode internally in 1988. Java has used only Unicode since its inception in 1991.

Unbelievably, there is a reason given for not using Unicode:

However, we don't recommend this approach for all POT files in all packages, because this would force translators to use PO files in UTF-8 encoding, which is – in the current state of software (as of 2003) – a major hassle for translators using GNU Emacs or XEmacs with po-mode.

Natural language keys

The "best practices" usage of gettext has English texts as the keys. This is supported by the utility tool "xgettext", which extracts strings automatically from your source.

This sounds nice, but I don't like having English text (or, in our case, German text) as the keys for translation files. If the text is e.g. "Click here for more info" and then the new style guideline for the site becomes "More Information", then you end up having

// mypage.php
echo gettext("Click here for more info"); // prints "More Information"

# mypage.po
msgid "Click here for more info"
msgstr "More Information"

I dunno, that's just confusing for me. I'd much rather have a text-neutral key such as "more-info".

This article also shows why you can't use English-language text as translation keys.

The code should reference the strings, not the other way around

The "xgettext" utility writes lines such as the following into the ".po" file:

#: mypage.php:47
msgid "Click here for more info"
msgstr ""

I don't in any way like having the source file name and line number in the translation files. In principle it looks like it helps you to find the usage of a particular string, but in fact:

  1. It is not hard to find all the usages of the key "myprogram.errors.disk-full". That string is hardly going to appear in a non-translation context by accident. A recursive search will tell you where its usages are.
  2. What if I change "mypage.php" (which is pretty likely), for example by inserting some lines before line 47? Then the information is not only irrelevant, but in addition wrong.

It is a principle of mine that not only should databases be normalized, but software source also. Every piece of information should be in exactly one place. And that place is where it's technically needed (in this case, in the PHP file, as otherwise the string wouldn't get displayed), because that's the only place where it'll get updated.

Parameters

We all need strings such as "The file '$FILE' has been successfully deleted". It seems that the standard way to do this in gettext is to use sprintf-type placeholders (e.g. "%s"). However, as soon as you have more than one of those, and you translate the string into French, you'll find you need the parameters the other way around. Oops. That didn't work. So gettext is suitable only a) for Western European languages (due to character set constraints) and b) for the subset of those languages whose grammars need the placeholders in the same order.

The first thing I did was write a wrapper around gettext to accept $0, $1 style parameters, so one could swap their order on a per-translated-string basis. (Although $FILE named parameters might have been better; but that would have made the calling code longer.)
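A wrapper of that kind might look roughly like the sketch below. This is not the original code; the function name format_message is made up here, and it takes an already-translated string (for example the return value of gettext()) so that it runs without the gettext extension.

```php
<?php
// Hypothetical wrapper: substitute $0, $1, ... in an already-translated string.
// In real use the template would come from the translation lookup.
function format_message(string $template, array $args): string
{
    return preg_replace_callback(
        '/\$(\d+)/',
        function (array $m) use ($args): string {
            $index = (int) $m[1];
            // Leave unknown placeholders untouched so mistakes stay visible.
            return array_key_exists($index, $args)
                ? (string) $args[$index]
                : $m[0];
        },
        $template
    );
}

// The same arguments, with the order decided by each translated string.
$args = ['a.txt', 'backup'];
echo format_message('Copied $0 to $1', $args), "\n";
echo format_message('Nach $1 kopiert: $0', $args), "\n";
```

Because each placeholder carries its own index, a translator can reorder them freely without any change to the calling code.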

So, they managed to invent, for the purposes of translation, a system which has a file format more difficult to use than a simple key-value pair, yet offering no advantages. It can't handle Unicode. Good work.

This article was written by Adrian Smith on 18 May 2007
