Unicode precomposition and decomposition

As a result of recent Mac troubles, I moved my iTunes library to a Linux file server and set up iTunes on my old TiBook to access the library over an AFP share using netatalk.

This worked unexpectedly well, until I noticed something very odd: I could no longer access any file whose name contained an accented character such as “é”. These files showed up in directory listings but were not readable. The filesystem complained that the file just did not exist. After a whole evening lost trying to find fault with everything from Mac OS X to netatalk, I found myself in unfamiliar Unicode territory:

It turns out there are two ways to represent certain accented characters such as “é” in Unicode: either as a single code point (U+00E9, “latin small letter e with acute”) or as a regular ASCII character “e” followed by a combining diacritical mark (U+0065, “latin small letter e”, then U+0301, “combining acute accent”). The first form is known as “precomposed” and is the standard for filenames on Linux, while the second, “decomposed” form is standard on Mac OS X.
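To make the two forms concrete, here is a small Python sketch using the standard `unicodedata` module. The `describe` helper is my own name for illustration, and NFC/NFD stand in here for “precomposed” and “decomposed”; the exact decomposition the Mac filesystem applies may differ from plain NFD in some edge cases.

```python
import unicodedata

precomposed = "\u00e9"   # one code point: é
decomposed = "e\u0301"   # two code points: e + combining acute accent

def describe(text):
    """Hypothetical helper: list each code point with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

print(describe(precomposed))
print(describe(decomposed))

# The two strings look identical on screen but are different sequences,
# so they also produce different UTF-8 bytes (and different filenames).
print(precomposed == decomposed)
print(precomposed.encode("utf-8"), decomposed.encode("utf-8"))

# Normalising converts one form into the other.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

Since Linux compares filenames byte for byte, the two spellings of “é” name two different files as far as the filesystem is concerned.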

The Mac approach is unusual but has the advantage of making accent-insensitive search easier. A plain string search for “cafe” will also match “café”, because the final “é” is really two code points: an “e” followed by a combining accent. Likewise, “cafeteria” can match “caféteria” if one simply strips out the diacritical marks. Doing this with precomposed strings is harder, since each accented character first has to be decomposed or mapped back to its base letter. (Thanks to @deepakg for identifying this.)
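As a rough illustration of that search behaviour, the sketch below decomposes a string and drops the combining marks before comparing. The function names `strip_marks` and `accent_insensitive_contains` are made up for this example; it is one simple way to do it, not how Mac OS X itself implements search.

```python
import unicodedata

def strip_marks(text):
    """Decompose text (NFD) and drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def accent_insensitive_contains(haystack, needle):
    """True if needle occurs in haystack once accents are ignored."""
    return strip_marks(needle) in strip_marks(haystack)

# A naive substring search already works on the decomposed form...
print("cafe" in "cafe\u0301")   # decomposed "café"
# ...but not on the precomposed form.
print("cafe" in "caf\u00e9")    # precomposed "café"

# Stripping the marks makes both forms searchable.
print(accent_insensitive_contains("caf\u00e9teria", "cafeteria"))
```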

Mac OS X enforces the decomposed form for filenames. Linux expects precomposed UTF-8 but does not enforce it. The netatalk AFP server recognises this difference and transparently translates filenames between what it calls UTF8 and UTF8-MAC. This is where I ran into trouble. I had transferred my files using rsync and ended up with decomposed filenames on Linux. These showed up fine over AFP, but when Mac OS X attempted to access them, netatalk translated the names to their precomposed form and could no longer find the files on disk. The solution? Rename all files on the Linux side:

convmv -r -f utf8 -t utf8 --nfc --notest ./*
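Before renaming anything in bulk, it can be reassuring to see which names are actually affected. The following Python sketch only lists entries whose names are not in precomposed (NFC) form and changes nothing on disk. The function name `find_non_nfc` is hypothetical, and the sketch assumes the filenames are valid UTF-8.

```python
import os
import unicodedata

def find_non_nfc(root):
    """Yield (directory, current_name, nfc_name) for every file or
    directory under root whose name is not already in NFC form."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            nfc_name = unicodedata.normalize("NFC", name)
            if nfc_name != name:
                yield dirpath, name, nfc_name
```

Looping over `find_non_nfc("/path/to/library")` (a placeholder path) and printing each tuple gives a read-only preview of the names a bulk rename would touch.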

And in future, when rsyncing files from Mac OS X to Linux, ask rsync to translate the filenames with this additional option (swap the two character sets when copying from Linux to OS X):

rsync --iconv=UTF8-MAC,UTF8