Archive for the ‘Broken’ Category

gettext is so broken

Friday, May 18th, 2007

On a recent PHP project there was a requirement for text localization. The standard way to do this in PHP is to use the standard way to do this in C, which is gettext.

I’ve worked with various translation systems, including one I built myself for uboot, involving a hierarchy of languages going from most specific to most “international”, and with each string having a hierarchical id such as “myprogram.errors.disk-full”.

Java Properties files are simple but also work well (simplicity being a positive thing in this case). The lines are key-value pairs, and using a convention such as “myprogram.errors.disk-full” the key is almost as good as if it actually were a key hierarchy. The file is in Latin1 but Unicode characters can be used via an escape syntax, and there are many editors where one can just type Unicode text and which take care of the escaping.

So I was looking forward to using gettext. This format was created by GNU, the creators of GCC (a highly respected program). gettext is itself well respected and authors of systems such as PHP have chosen it as their localization system.

But alas, it is broken in so many ways.

(1) The file format. Whereas Java’s file format is to have lines such as “key=value”, gettext’s “.po” format (where did that extension come from?) has two lines for every string, like

msgid "key"
msgstr "value"

As one inevitably places a blank line between one key-value pair and the next, the file is immediately 3 times as long as a Java properties file storing the same information. And what if you want to have double-quotes within your string?

(2) Compilation (for performance reasons). I work with scripting languages, where there is no compiler. This can be a good or a bad thing; but independent of that, it is a fact. However the editable “.po” files of gettext have to be converted into binary “.mo” files before they work. Thus I have to introduce a compilation step into my otherwise compilation-free edit-and-that’s-it test environment.

In fact I don’t understand this compilation requirement at all. According to the gettext manual, gettext was developed in 1994. Surely computers were fast enough back then to parse the gettext format and store the whole lot in a hash?

And what I further don’t understand is how/if GNU programs were localized before then. I suppose they just weren’t.

(3) What about Unicode? I have no idea how to introduce Unicode characters into the editable “.po” files of gettext. The manual doesn’t help me. Supporting only 8-bit characters, and assuming/hoping that the encoding of the “.po” file is the same as the encoding that the user is using to view the output of your program, is simply a terrible solution. Microsoft designed Windows NT to use Unicode internally in 1988. Java has used only Unicode since its inception in 1991.

Unbelievably there is a reason given for not using Unicode.

However, we don’t recommend this approach for all POT files in all packages, because this would force translators to use PO files in UTF-8 encoding, which is - in the current state of software (as of 2003) - a major hassle for translators using GNU Emacs or XEmacs with po-mode.

(4) Using natural language keys. The “best practices” usage of gettext has English texts as the keys. This is supported by the utility tool “xgettext”, which extracts strings automatically from your source.

This sounds nice, but I don’t like having English-text (or, in our case, German text) as the keys for translation files. If the text is e.g. “Click here for more info” and then the new style guideline for the site becomes “More Information”, then you end up having

// mypage.php
echo gettext("Click here for more info"); // prints "More Information"

# mypage.po
msgid "Click here for more info"
msgstr "More Information"

I dunno, that’s just confusing for me. I’d much rather have a text-neutral key such as “more-info”.

Update: This article also shows why you can’t use English-language text as translation keys.

(5) Referencing usages from the translation file. The “xgettext” utility writes lines such as the following into the “.po” file

#: mypage.php:47
msgid "Click here for more info"
msgstr "Click here for more info"

I don’t in any way like having the source file name and line number in the translation files. In principle it looks like it helps you to find the usage of a particular string, but in fact:

  1. It is not hard to find all the usages of the key “myprog.error.disk-full”. That string is hardly going to appear in a non-translation context by accident. A recursive search will tell you where its usages are.
  2. What if I change “mypage.php” (which is pretty likely), for example by inserting some lines before line 47? Then the information is not only irrelevant, but also wrong.

It is a principle of mine that not only should databases be normalized, but software source also. Every piece of information should be in exactly one place. And that place is where it’s technically needed (in this case, in the PHP file, as otherwise the string wouldn’t get displayed), as that’s the only place where it’ll get updated.

(6) Parameters. We all need strings such as “The file ‘$FILE’ has been successfully deleted”. It seems that the standard way to do this in gettext is to use sprintf-type placeholders (e.g. “%s”). However as soon as you have more than one of those, and you translate the string into French, you’ll find you need the parameters the other way around. Oops. That didn’t work. So gettext is only suitable a) for Western European languages (due to character set constraints) and b) only for the subset of those languages which have grammars where placeholders will be needed in the same order.

The first thing I did was write a wrapper around gettext to accept $0, $1 style parameters, so one could swap their order on a per-translated-string basis. (Although $FILE named parameters might have been better; but that would have made the calling code longer.)
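The idea behind such a wrapper can be sketched roughly as follows. This is only an illustration in Python, not the PHP wrapper described above: the function name format_message and the example strings are hypothetical, and in real use the template would be the translated string returned by gettext.

```python
import re

def format_message(template, *args):
    """Replace $0, $1, ... in template with the matching positional argument.

    Hypothetical helper: placeholders without a matching argument are left as-is.
    """
    def substitute(match):
        index = int(match.group(1))
        if index < len(args):
            return str(args[index])
        return match.group(0)  # no such argument: keep the placeholder text
    return re.sub(r"\$(\d+)", substitute, template)

# Two templates for the same message; the second needs the arguments
# in the opposite order, which the translator controls via the indices.
source_order = "Copied $0 to $1"
swapped_order = "To $1, $0 was copied"

print(format_message(source_order, "a.txt", "backup"))
print(format_message(swapped_order, "a.txt", "backup"))
```

Because each placeholder carries its own index, the calling code passes the arguments in one fixed order and every translated string is free to use them in whatever order its grammar needs.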

So nice one, they managed to invent, for the purposes of translation, a system which has a file format more difficult to use than a simple key-value pair, yet offering no advantages. It can’t handle Unicode. Good work.

Transferring some hex. Sometimes gets replaced by string "INF". Why?

Thursday, May 10th, 2007

This was never going to work out. Data transfer interface. Our side in Perl and their side in PHP. Both scripting languages (bad) and not even the same scripting language (incompatible badness).

Over the data transfer interface, we are transferring users. Including a code to enable them to unsubscribe from an email newsletter. The first 7 characters of the code identify the users (digits) and the rest of the code is a hex string containing some security information.

All works great. But some users can’t use the code? It turns out on the destination system they have “INF” in the field instead of the code.

It turns out that some of these users have e.g. 1234567 to identify the user, and e.g. 123e1234567 as their hex code. That makes the security code “1234567123e1234567”. And that “looks like” a floating point number to Perl. But quite a big one. Almost as big as Infinity in fact, so might as well call it that.
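The failure mode is easy to reproduce in any language that will parse such a string as a float. Here is a minimal sketch in Python rather than Perl; guess_type is a hypothetical stand-in for "type by what the data looks like" and is not the actual logic of the Perl SOAP libraries, which I have not inspected.

```python
import math

def guess_type(value):
    """Hypothetical naive coercion: anything that parses as a number becomes one."""
    try:
        return float(value)
    except ValueError:
        return value  # does not look like a number, keep it as a string

user_id = "1234567"
hex_part = "123e1234567"      # valid hex, but also "123 times 10^1234567"
code = user_id + hex_part     # "1234567123e1234567"

guessed = guess_type(code)
print(guessed, math.isinf(guessed))

# A code whose hex part contains other letters is left alone.
print(guess_type("1234567abc1234567"))
```

Only codes whose hex part happens to consist of digits with a single "e" in the middle are affected, which is consistent with the problem hitting just a small fraction of users.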

I hardly think the flexibility we “won” through every data instance having its own type based on what its data “looks like” compensates for the anger of a segment of our users not being able to unsubscribe from their newsletter, or for the extra expense to the company of the time to debug this problem (which was by then an urgent problem, as it was only discovered after the system went live, as it only affected 0.6% of our users).

P.S. my solution was to put a space in front of the code, which is taken off by the receiving system, so the data always “looks like” a string. But I wouldn’t like to guarantee that what “looks like” a string won’t change with the next version of the Perl SOAP client libraries we are using.

Mozilla Thunderbird sucks

Thursday, April 12th, 2007

Really, Thunderbird is a terrible mail client. I'd been using Outlook for about 5 years when I first tried it, so I thought maybe the reason I didn't like it was simply because it was different, in which case I should continue to use it to get used to it. One year on I still hate it and recently it just ate half my mail. So I'm going back to Outlook.

While downloading a large message using POP over a slow connection recently, the download bar (slowly progressing from 0% to about 50% at the time of the crash) simply went away (without error). Clicking the "Get mail" button again did nothing (without error). Restarting the program showed the "Inbox" to be blank for a very long time, but it seemed to be doing something, and after about 1-2 minutes the list of messages appeared. But only the mails received between the time I started using Thunderbird and about mid 2006-10 were there. Mails from mid 2006-10 to now (mid 2007-04) are just gone. So that'll be the mailbox corrupted then. Imagine you relied on Thunderbird as the only storage place for all your mail. Well, thankfully I don't. And thankfully I won't even be using Thunderbird as one of the storage places for my mail in the future.

Here are the reasons I didn't like Thunderbird from the beginning.

  • When you click "reply", the cursor inviting you to type a response to the quoted mail is at the bottom of the mail, not the top. It turns out there is a preferences option where you can change that, but it took me about 6 months to find it.
  • The HTML mail composer sucks. You have the cursor blinking away somewhere, press a key expecting the character to be inserted where the cursor is, but no. The cursor suddenly moves somewhere different (e.g. a line down) and inserts the character there.
  • If you send a rich text message, it asks you "do you want to send this mail as plain text (recommended), html, or both?". Text is rarely so long that the bandwidth required for a multipart/alternative would be a problem. And multipart/alternative is there so you, as the sender, don't have to know what formats the recipient can read. So this dialog box is just broken. Also: why is plain text recommended, do we want to be stuck in the 70s forever? Let's all go to the disco and send (recommended) plain text emails using Firefox.
  • In Outlook, if you click "send" and you are offline, the message is stored locally temporarily. As soon as a connection is available, it is sent. With Thunderbird, however, the situation is more complex. At the time of sending, you have to select "send" (which yields an error if you are offline), or "send later" (which is available when you are online, even though you'd never want it). When you go online you have to select "send emails now", as opposed to that happening automatically. However, I thought I could make this all go away when I found the option "if you go online, Thunderbird can send offline emails immediately". I clicked that but it didn't work. It turned out "go online" referred to the Thunderbird menu options "go online". If, every time I connected to the internet, I had to go through each application and use its menu option "go online", well, that would be a bad situation. Probably why other applications don't work like that.
  • Search results are unsorted. Search happens in the background (good) and adds mails to the search results window as it continues and finds them. If you click on a column heading in the results, e.g. "date", to sort the (initially unsorted) search results, then during search (as more emails are found) they are simply added to the bottom of the search results. So you have to click the column heading again, to do a sort including the newly found emails.
  • The UI to do search is terrible. If you open the drop-down with the keyboard, allowing you to select "sender", "recipient" (i.e. which field must match in the search), use the cursor keys to select the field you want, then press tab to move to the text field (to type the value which the field must match, which works in other applications), the drop-down list of fields closes, but the field you had selected is forgotten.
  • Full-text search takes ages. No indexing. Why?
  • If you are composing an email, and want to send it to someone whose address you've forgotten, you can go to another window, find a mail from them, right click their address and say "add to address book". Go back to your compose window and try to use the address book: it doesn't contain the new entry. You have to close the compose window, open a new one, copy/paste the entire body and all other recipients over, then the new window knows about the current address book.
  • Emails you send using the HTML editor are in Times (not Helvetica/Arial as in Outlook), which makes one's emails look terrible, and also marks one out as a person using "strange" non-Outlook technology, to all one's recipients.

SMTP is not new. POP is not new. Win32 is not new. Surely in the time between the creation of those technologies and now, one must have been able to do better than this.

Mouse reboot

Wednesday, March 28th, 2007

I have been using a trusty wireless mouse for about 3 months now. (I didn’t want a wireless mouse, but here in Macau, I didn’t know what was going on, so I walked into an expensive hardware store—the only hardware store I knew—and they only had wireless mice. Well I thought, it may be twice the price but even twice the price isn’t expensive, and I need a mouse…)

It suddenly stopped working while I was using it.

  • The light under the mouse was on, so the mouse thought it was working.
  • The touch pad built into the laptop worked, so Windows was still working and accepting pointer-movement instructions.
  • I took out the USB device, which communicates with the mouse, and put it back in. “Detecting new hardware” etc. But it didn’t start working again.
  • I plugged the USB device into a new port. Even more “Detecting new hardware” etc. But it still didn’t start working again.

Then I took the batteries out of the mouse (which has no on/off switch), then put them back in again. Then it started working again.

My mouse had crashed, and needed a reboot.

Windows path length limit

Wednesday, March 21st, 2007

It really seems that Windows does indeed have a path limit.

While checking some files into a subversion repository:

  • The repository was D:\Adrian\my-respository
  • Within the repository I had quite a deep directory structure, to access this particular project
  • Within this particular project, the IDE I was using had a few levels of directories, to include "work/src" and so on
  • The class path of the Java classes was quite deep, "com/company/project" etc
  • Subversion itself puts a few levels of dirs in ".svn/text-base" and so on

While checking in all this stuff, I suddenly got the error "path invalid". And opening the created .svn directory in Windows Explorer, right-clicking, and choosing "new directory" immediately brought up the error:

Cannot create 'New Folder': path invalid

So it seems paths have a limit in Windows. The existing working path was 220 characters long, with 20 directories including the working directory and the hard disk's root directory.

This is all very annoying, as I can't really do anything about any of the above reasons why the path is so long.

What does this error mean?

Tuesday, March 20th, 2007

While moving a folder "old-cvs-data", with many subdirectories, to the Recycle Bin under Windows XP…

Maybe each file that is stored in the recycle bin has an "original path" attribute, with a max length of 256 chars, which stores the original path like "Dir1\Dir2\Dir3\file.txt". Maybe if files are nested too deeply that attribute cannot hold the value. But that's just a guess.

Maybe it really is time to get a Mac.

Database error messages

Wednesday, February 21st, 2007

Database error messages in general are very bad. Why?

Oracle Version 8 had lots of messages such as "Invalid column name" where they meant:

  1. Column name not found in the table in question. The word "invalid" is the wrong word as it implies illegal characters or something like that.
  2. Which column? Which table? The parser surely knows this at the time it generates the error message. But it helpfully chooses not to inform the user.

Thankfully Oracle 10 has improved its error messages a lot. They include the statement in question and the point in the statement producing the problem. And error messages contain which foreign key constraint has been violated, and so on.

But I have the following problem with MySQL. I try to create a table with InnoDB with a foreign key constraint and it says:

ERROR 1005 (HY000):
Can't create table './myschema/mytable.frm' (errno: 150)

What it means is: the statement has an error in it. But what is the error? In this case, there was a foreign key constraint and the column didn't exist in the referenced table. But why couldn't it tell me this?

How MySQL reduces error messages in your program

Thursday, February 1st, 2007

Ah MySQL (at least MyISAM) so isn't a real database!

Firstly, when doing an insert, I did some arithmetic. The numeric column was of a certain width. If the result of the arithmetic was larger than the maximum allowed value, my number was just getting turned into that maximum allowed value, without warning or error. A large number suddenly becoming some other large number may sound good in the philosophy of "errors are bad - we want to minimize errors!" but literally it's never what you want. Oracle gives an error if a number is too big to be stored in a column. Which is what you always want.

Secondly, due to the above arithmetic overflow, my insert statement was failing (as multiple values that should have been distinct, but were beyond the maximum, were then identical, equal to the maximum). I kept on doing it and it kept on failing. Then I looked at the table, and each time I'd done such an unsuccessful insert (a single statement to insert maybe 10k rows) some rows (but not all - due to the error) were getting inserted. Having half a statement succeed is never what you want! Oracle sets an invisible checkpoint before each statement and, if the statement fails, rolls the database back to that checkpoint. That's always what you want!
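For comparison, here is what statement-level atomicity looks like in practice. This is a minimal sketch using Python's built-in sqlite3 module with an in-memory database, chosen only because it runs anywhere; it is neither MySQL nor Oracle, and the table name capped is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE capped (n INTEGER PRIMARY KEY)")

# One multi-row INSERT where the third value collides with the first,
# much like several overflowing values all collapsing to the same maximum.
error = None
try:
    conn.execute("INSERT INTO capped (n) VALUES (1), (2), (1)")
except sqlite3.IntegrityError as exc:
    error = str(exc)

# The failed statement leaves nothing behind: rows 1 and 2 are not kept.
rows_after_failure = conn.execute("SELECT COUNT(*) FROM capped").fetchone()[0]
print(error)
print(rows_after_failure)
```

The statement either inserts all of its rows or none of them, so a retry after fixing the data starts from a clean table rather than from a half-loaded one.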

The SUM(col) of zero rows is

Wednesday, November 29th, 2006

This just annoys me so much. The sum of an empty set of integers is zero, not undefined.

However neither Oracle nor MySQL understands this. I can only assume this variation from common sense and mathematics is considered the “best practices” definition of the SQL SUM function.

mysql> desc email_box;
| box_size_bytes | int(10)     |

mysql> select count(*) from email_box;
|        0 |

mysql> select sum(box_size_bytes) from email_box;
| NULL |
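The usual workaround is to wrap the aggregate in COALESCE, which is standard SQL. A minimal sketch, using Python's built-in sqlite3 module with an in-memory table as a stand-in for the MySQL table above (SQLite also returns NULL here); the table is recreated empty just for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE email_box (box_size_bytes INTEGER)")

# SUM over zero rows comes back as NULL (None in Python).
raw_sum = conn.execute(
    "SELECT SUM(box_size_bytes) FROM email_box"
).fetchone()[0]

# COALESCE substitutes 0 when the aggregate is NULL.
safe_sum = conn.execute(
    "SELECT COALESCE(SUM(box_size_bytes), 0) FROM email_box"
).fetchone()[0]

# Python's own sum of an empty list agrees with the mathematics.
python_sum = sum([])

print(raw_sum, safe_sum, python_sum)
```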

bugs

Monday, April 3rd, 2006

well really a lot of things were far from optimal about the software. lots of bugs but a lot of things which were integration troubles, i.e. one bit of software worked 95% and another software worked 95% and together they worked 0%. today and yesterday sat with smo and went through a whole bunch of software from a whole bunch of people and just hacked away until it worked. now there are last 5 galleries etc on the start page which is quite cool.

new galleries is not as cool as it could be, as they are "checked", i.e. when the customer care agents go home then there is no new checked content. but the newest blogs are interesting as they are not checked.

surprisingly people tend to use the new blogs much as the old galleries, i.e. just lots and lots of photos. surely the gallery is the more appropriate forum for such content. but hey, if the users like it, that's all i care about.