gettext is so broken

On a recent PHP project I needed text localization. The standard way to do this in PHP is the standard way to do it in C, which is gettext.

I’ve worked with various translation systems, including one I built myself for uboot, involving a hierarchy of languages going from most specific to most “international”, and with each string having a hierarchical id such as “myprogram.errors.disk-full”.
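To make the "most specific to most international" idea concrete, here is a minimal sketch of such a lookup. It is in Python rather than PHP to keep it self-contained, and it is not the original implementation: the catalog contents, the function names `fallback_chain` and `translate`, and the choice of "en" as the final fallback are all assumptions made for illustration.

```python
# Hypothetical catalogs: language tag -> {hierarchical id -> text}.
CATALOGS = {
    "de-AT": {"myprogram.errors.disk-full": "Der Datenträger ist voll"},
    "de": {
        "myprogram.errors.disk-full": "Die Festplatte ist voll",
        "myprogram.errors.no-access": "Zugriff verweigert",
    },
    "en": {
        "myprogram.errors.disk-full": "Disk full",
        "myprogram.errors.no-access": "Access denied",
        "myprogram.title": "My Program",
    },
}


def fallback_chain(language, default="en"):
    """Return the languages to try, most specific first (de-AT, de, en)."""
    parts = language.split("-")
    chain = ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]
    if default not in chain:
        chain.append(default)  # assumed "most international" fallback
    return chain


def translate(key, language, catalogs=CATALOGS):
    """Walk the chain and return the first hit, or the key if none is found."""
    for lang in fallback_chain(language):
        text = catalogs.get(lang, {}).get(key)
        if text is not None:
            return text
    return key


print(translate("myprogram.errors.no-access", "de-AT"))
```

Returning the key itself when nothing matches means a missing translation shows up on screen as an obviously untranslated id instead of an empty string.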

Java Properties files are simple but also work well (simplicity being a positive thing in this case). The lines are key-value pairs, and using a convention such as “myprogram.errors.disk-full” the key is almost as good as if it actually were a key hierarchy. The file is in Latin1 but Unicode characters can be used via an escape syntax, and there are many editors where one can just type Unicode text and which take care of the escaping.

So I was looking forward to using gettext. This format was created by GNU, the creators of GCC (a highly respected program). gettext is itself well respected and authors of systems such as PHP have chosen it as their localization system.

But alas, it is broken in so many ways.

(1) The file format. Whereas Java’s file format is to have lines such as “key=value”, gettext’s “.po” format (where did that extension come from?) has two lines for every string, like

msgid "key"
msgstr "value"

As one inevitably places a blank line between one key-value pair and the next, the file is immediately 3 times as long as a Java properties file storing the same information. And what if you want to have double-quotes within your string?
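For what it's worth, the answer to the double-quote question is C-style backslash escaping. The following is a small sketch (Python rather than PHP, to keep it self-contained) of writing one entry by hand; the helper names `po_quote` and `po_entry` are made up for this example and are not part of any gettext tool.

```python
def po_quote(text):
    """Quote a string for a .po file using C-style backslash escapes."""
    escaped = (
        text.replace("\\", "\\\\")  # backslashes first, so later escapes survive
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\t", "\\t")
    )
    return '"' + escaped + '"'


def po_entry(msgid, msgstr):
    """Format one msgid/msgstr pair the way it appears in a .po file."""
    return "msgid {}\nmsgstr {}\n".format(po_quote(msgid), po_quote(msgstr))


print(po_entry('Click "here" for more info', 'Klicken Sie "hier"'))
```

So it can be done, but it is one more thing a translator editing the file by hand has to know.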

(2) Compilation (for performance reasons). I work with scripting languages, where there is no compiler. This can be a good or a bad thing; but independent of that, it is a fact. However the editable “.po” files of gettext have to be converted into binary “.mo” files before they work. Thus I have to introduce a compilation step into my otherwise compilation-free edit-and-that’s-it test environment.

In fact I don’t understand this compilation requirement at all. According to the gettext manual, gettext was developed in 1994. Surely computers were fast enough back then to parse the gettext format and store the whole lot in a hash?
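To show how little is involved, here is a rough sketch of reading simple entries straight into a hash, written in Python rather than PHP so it stands alone. It is an illustration under assumptions, not a replacement for the real tools: it only understands plain msgid/msgstr pairs with continuation lines, and it skips comments, the header entry, plural forms and contexts. The names `parse_po` and `SAMPLE` are invented for the example.

```python
import re

_ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def _unquote(quoted):
    """Strip the surrounding quotes and resolve backslash escapes."""
    body = quoted.strip()[1:-1]
    return re.sub(r"\\(.)", lambda m: _ESCAPES.get(m.group(1), m.group(1)), body)


def parse_po(text):
    """Parse simple msgid/msgstr pairs from .po text into a dict."""
    entries = []   # each item: {"msgid": ..., "msgstr": ...}
    field = None   # which field a bare "..." continuation line extends
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no translation data
        if line.startswith("msgid "):
            entries.append({"msgid": _unquote(line[6:]), "msgstr": ""})
            field = "msgid"
        elif line.startswith("msgstr ") and entries:
            entries[-1]["msgstr"] = _unquote(line[7:])
            field = "msgstr"
        elif line.startswith('"') and entries and field:
            entries[-1][field] += _unquote(line)
        else:
            field = None  # keyword this sketch does not handle (plurals, context)
    # The entry with an empty msgid is the file header, not a translation.
    return {e["msgid"]: e["msgstr"] for e in entries if e["msgid"]}


SAMPLE = r'''
#: mypage.php:47
msgid "Click here for more info"
msgstr "More Information"

msgid "Say \"hi\""
msgstr ""
"Sag "
"\"hallo\""
'''

catalog = parse_po(SAMPLE)
print(catalog["Click here for more info"])  # prints: More Information
```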

And what I further don’t understand is how/if GNU programs were localized before then. I suppose they just weren’t.

(3) What about Unicode? I have no idea how to introduce Unicode characters into the editable “.po” files of gettext. The manual doesn’t help me. Supporting only 8-bit characters, and assuming/hoping that the encoding of the “.po” file is the same as the encoding that the user is using in viewing the output of your program, is simply a terrible solution. Microsoft designed Windows NT to use Unicode internally in 1988. Java has used only Unicode since its inception in 1991.

Unbelievably there is a reason given for not using Unicode.

“However, we don’t recommend this approach for all POT files in all packages, because this would force translators to use PO files in UTF-8 encoding, which is – in the current state of software (as of 2003) – a major hassle for translators using GNU Emacs or XEmacs with po-mode.”

(4) Using natural language keys. The “best practices” usage of gettext has English texts as the keys. This is supported by the utility “xgettext”, which extracts strings automatically from your source.

This sounds nice, but I don’t like having English-text (or, in our case, German text) as the keys for translation files. If the text is e.g. “Click here for more info” and then the new style guideline for the site becomes “More Information”, then you end up having

// mypage.php
echo gettext("Click here for more info"); // prints "More Information"

# mypage.po
msgid "Click here for more info"
msgstr "More Information"

I dunno, that’s just confusing for me. I’d much rather have a text-neutral key such as “more-info”.

Update: This article also shows why you can’t use English-language text as translation keys.

(5) Referencing usages from the translation file. The “xgettext” utility writes lines such as the following into the “.po” file

#: mypage.php:47
msgid "Click here for more info"
msgstr "Click here for more info"

I don’t in any way like having the source file name and line number in the translation files. In principle it looks like it helps you to find the usage of a particular string, but in fact:

  1. It is not hard to find all the usages of the key “myprogram.errors.disk-full”. That string is hardly going to appear in a non-translation context by accident. A recursive search will tell you where its usages are.
  2. What if I change “mypage.php”? (which is pretty likely). For example inserting some lines before line 47. Then the information is not only irrelevant, but in addition wrong.

It is a principle of mine that not only databases should be normalized, but software source too. Every piece of information should be in exactly one place, and that place is where it’s technically needed (in this case the PHP file, as otherwise the string wouldn’t get displayed), because that is the only place where it will get updated.

(6) Parameters. We all need strings such as “The file ‘$FILE’ has been successfully deleted”. It seems that the standard way to do this in gettext is to use sprintf-type placeholders (e.g. “%s”). However, as soon as you have more than one of those and you translate the string into French, you’ll find you need the parameters the other way around. Oops. That didn’t work. So gettext is only suitable a) for Western European languages (due to character set constraints) and b) among those, only for languages whose grammar needs the placeholders in the same order.

The first thing I did was write a wrapper around gettext to accept $0, $1 style parameters, so one could swap their order on a per-translated-string basis. (Although $FILE named parameters might have been better; but that would have made the calling code longer.)
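The wrapper idea looks roughly like this. This is a sketch in Python rather than the PHP I actually used, and the names `fill` and `translate`, as well as the plain dict standing in for the gettext lookup, are assumptions for the sake of a self-contained example.

```python
import re


def fill(template, *args):
    """Replace $0, $1, ... in template with the matching positional args."""
    def substitute(match):
        index = int(match.group(1))
        # Leave unknown placeholders untouched rather than raising.
        return str(args[index]) if index < len(args) else match.group(0)
    return re.sub(r"\$(\d+)", substitute, template)


def translate(key, catalog, *args):
    """Hypothetical wrapper: look the key up, then fill in the parameters."""
    return fill(catalog.get(key, key), *args)


# Stand-in catalogs; a real wrapper would call gettext for the lookup.
english = {"file-copied": "Copied $0 to $1"}
german = {"file-copied": "Nach $1 wurde $0 kopiert"}

print(translate("file-copied", english, "a.txt", "b.txt"))
print(translate("file-copied", german, "a.txt", "b.txt"))
```

Because the placeholders carry their own index, each translated string decides the order and the calling code never changes.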

So nice one, they managed to invent, for the purposes of translation, a system which has a file format more difficult to use than a simple key-value pair, yet offering no advantages. It can’t handle Unicode. Good work.

4 Responses to “gettext is so broken”

  1. Pikku-Orava Says:

    WoW you’re so full of shit.

    The files are in UTF-8 encoding always, and they support parameter reordering via %1$s syntax.

    And wonderful plural support has no analogs.

  2. Dean Says:

    1. What does it matter how long the file is? Also you can use a po-editor to handle any quote marks escaping – what’s easier when translating by hand into e.g. Greek – escaping ” or escaping every character you write?

    3. The problem with unicode is to do with bugs in a po file editor eight years ago – your translators probably aren’t going to use emacs. It’s only mentioned as it’s the GNU preferred editor.

    5. One of the gettext suite will remove any inaccurate source file comments

  3. Peter Says:

    _all_ of your points are invalid o_O you might do some better research next time

  4. David Zentgraf Says:

    Indeed you have not understood the gettext system. You’re bashing something that you do not understand. None of the points you make are actual issues, there’s an answer for everything. One of the more important ones is that gettext with its file format and toolchain scales to large projects with distributed developers/translators/maintainers. The .po file format offers many options for embedding extra information, which is necessary for localizing large projects (domains, categories, context, flags, comments). As a file format it goes way beyond simple key=value pairs. If you did not find a need for those extras yet, you have not done very complex localizations.

    Learn all the tools, including xgettext and msgmerge. The latter is vital in the whole workflow of working with gettext. Developers develop the code, adding comments, context, domains, categories to the source. All this is extracted into .po files automatically, you do not write those by hand. You distribute them to the translators. You keep changing the code. The translations come back, you merge them back in, you update the translation files from the source, you distribute them again. Rinse, repeat. Gettext fits into this workflow.
