Unicode characters in file names

It's amazing that the following all works:

That means, for my uses, I can absolutely use Unicode characters in file names. That's a cool situation.

(In this particular situation, the user of my program chooses a "report" from a drop-down of possible report types. Each report has a directory on the disk, with some files in a standard layout inside the directory. There is no additional data, and the file names do not need to be localized. So rather than creating an extra config file, which could get out of date with the directories on the disk, it is much more convenient and normalized to simply scan the directory from the program. The reports are created on Windows, my program runs on Linux, and the communication between the two is the Subversion VCS.)
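As a rough illustration of the directory-scanning idea, here is a minimal sketch in Python (not the language of the original program). The function name list_report_types and the rule of skipping dot-directories such as Subversion's ".svn" are assumptions made for this example, not details from the real program.

```python
import os


def list_report_types(reports_root):
    """Return the sorted names of the report directories under reports_root.

    Hypothetical helper: each sub-directory is treated as one report type.
    Directories starting with "." (for example ".svn") are skipped, which
    is an assumption of this sketch.
    """
    with os.scandir(reports_root) as entries:
        return sorted(
            entry.name
            for entry in entries
            if entry.is_dir() and not entry.name.startswith(".")
        )
```

Because the list is built from the directories themselves, there is no second copy of the report names that could drift out of step with what is on the disk.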

There was one slight problem, which I didn't notice at first: Perl wasn't reading the Unicode file names correctly on Linux. I didn't notice because, as is often the case with character-set problems, there were two errors which cancelled one another out and made it look as though everything had worked. Perl read the file name thinking that each two-byte UTF-8 character was actually two characters, and by default it wrote its output as Latin-1 even though the terminal was UTF-8. The two "characters" were therefore output as the original two bytes, which the terminal interpreted as the single original character in the file name. I find the only way to debug and test such things is to output the length in characters as a number, as then such cancelling-out errors cannot occur.
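The cancelling-out effect can be reproduced in a few lines. The following is a sketch in Python rather than Perl, and the file name "é.txt" is made up for the example. It decodes the UTF-8 bytes as Latin-1 (the first error), encodes them back to Latin-1 for output (the second error), and shows that only the length check reveals the problem.

```python
# A file name as stored on disk: UTF-8 bytes. "\u00e9" is "é" (two bytes in UTF-8).
raw_name = "\u00e9.txt".encode("utf-8")

# Error 1: treat every byte as one character (equivalent to a Latin-1 decode).
misread = raw_name.decode("latin-1")

# Error 2: write those "characters" out as Latin-1 bytes to a UTF-8 terminal.
written = misread.encode("latin-1")

# The terminal receives the original bytes, so the name is displayed correctly.
looks_right = written.decode("utf-8")

# What a correct read would have produced.
correct = raw_name.decode("utf-8")

print(looks_right == correct)       # True: the output looks fine
print(len(misread), len(correct))   # 6 5: the length exposes the misread
```

The displayed text is identical in both cases, but the misread string is one character longer than the correctly decoded one, which is exactly what printing the length makes visible.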

[1] If you're experiencing problems on the Unix command line while using a UTF-8 terminal, try the command "export LANG=en_US.UTF-8".

This article is © Adrian Smith.
It was originally published on 26 Jan 2009