Wednesday, October 14, 2009
Unicode precomposition and decomposition
As a result of recent Mac troubles, I moved my iTunes library to a Linux file server and set up iTunes on my old TiBook to access the library over an AFP share using netatalk.
This worked unexpectedly well, until I noticed something very odd: I could no longer access any file whose name contained an accented character such as “é”. These files showed up in directory listings but were not readable. The filesystem complained that the file just did not exist. After a whole evening lost trying to find fault with everything from Mac OS X to netatalk, I found myself in unfamiliar Unicode territory:
It turns out there are two ways to represent certain accented characters such as “é” in Unicode: either as a single code point (U+00E9, “latin small letter e with acute”) or as a regular ASCII character “e” followed by a combining diacritical mark (U+0065, “latin small letter e”, then U+0301, “combining acute accent”). The first form is known as “precomposed” and is the standard for filenames on Linux, while the second, “decomposed” form is standard on Mac OS X.
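The difference is easy to see from Python's standard `unicodedata` module. This is only a small sketch to illustrate the two forms; the variable names are mine, and NFC and NFD are the Unicode names for the composed and decomposed normalization forms.

```python
import unicodedata

precomposed = "\u00e9"   # é as one code point (U+00E9)
decomposed = "e\u0301"   # e (U+0065) followed by combining acute accent (U+0301)

# Both render as é, but they are different strings with different UTF-8 bytes.
print(precomposed == decomposed)
print(precomposed.encode("utf-8"), decomposed.encode("utf-8"))

# Normalizing converts one form into the other.
to_nfc = unicodedata.normalize("NFC", decomposed)    # compose
to_nfd = unicodedata.normalize("NFD", precomposed)   # decompose
print(to_nfc == precomposed, to_nfd == decomposed)
```

A filesystem that compares names byte for byte, as Linux filesystems typically do, treats these as two unrelated filenames.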
The Mac approach is unusual but has the advantage of making accent-insensitive search easier. A plain substring search for “cafe” will also match “café”, because the decomposed “é” is really two code points and the first is an ordinary “e”; “cafeteria” can match “caféteria” if one simply strips out the combining marks. Doing this with precomposed strings is much harder. (Thanks to @deepakg for identifying this.)
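As a rough illustration of that stripping step, here is a minimal Python sketch. The helper names `strip_accents` and `accent_insensitive_contains` are made up for this example; it decomposes the text first, so it works on either form, and it drops every character that Unicode classifies as a combining mark.

```python
import unicodedata

def strip_accents(text):
    """Return text with combining marks removed (hypothetical helper)."""
    # Decompose first so that precomposed input is handled too.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for ordinary, non-combining characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def accent_insensitive_contains(haystack, needle):
    """True if needle occurs in haystack once accents are ignored."""
    return strip_accents(needle) in strip_accents(haystack)

print(strip_accents("caf\u00e9"))
print(accent_insensitive_contains("caf\u00e9teria", "cafeteria"))
```

With filenames that are already decomposed, the normalization call is a no-op and only the filtering remains, which is the convenience described above.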
Mac OS X enforces the decomposed form for filenames, but Linux doesn’t. On Linux, precomposed UTF-8 is expected but not enforced. The netatalk AFP server recognises this difference and transparently translates filenames between what it calls UTF8 and UTF8-MAC. This is where I ran into trouble. I had transferred my files using rsync and ended up with decomposed filenames on Linux. These showed up fine over AFP, but when Mac OS X attempted to access them, netatalk did the transparent translation to precomposed names and could no longer find the files. The solution? Rename all files on the Linux side:
convmv -r -f utf-8 --nfd -t utf8 --nfc ./* --notest
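And in future, when rsyncing files from Mac OS X to Linux, ask rsync to translate the filenames with this additional option (it expects the local character set first and the remote one second, so swap the two when running rsync from the Linux side):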
rsync --iconv=UTF8-MAC,UTF8
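For anyone curious what the rename step amounts to, or without convmv at hand, the idea can be sketched in Python. This is not how convmv is implemented, just an assumption-laden outline: the function names are hypothetical, it is meant to run on the Linux side, and it defaults to a dry run that only reports what it would do.

```python
import os
import unicodedata

def nfc_renames(root):
    """Yield (old_path, new_path) for names under root that are not in NFC."""
    # Walk bottom-up so a directory's contents come before the directory itself.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            composed = unicodedata.normalize("NFC", name)
            if composed != name:
                yield (os.path.join(dirpath, name),
                       os.path.join(dirpath, composed))

def normalize_tree(root, dry_run=True):
    """Rename decomposed names to precomposed ones; report only if dry_run."""
    renames = list(nfc_renames(root))
    for old, new in renames:
        if dry_run:
            print(f"would rename {old!r} -> {new!r}")
        elif os.path.exists(new):
            # Never overwrite a file that already has the precomposed name.
            print(f"skipping {old!r}: {new!r} already exists")
        else:
            os.rename(old, new)
    return renames
```

Calling `normalize_tree("/path/to/library")` lists the pending renames, and passing `dry_run=False` performs them; the path here is a placeholder.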
jean.o.matic — Jun 11, 2010 9:57:03 AM — # ↩
Thanks, very helpful. Doing about the same: daily backups from a Mac to a Netgear NAS using rsync, and hoping to use an AFP share on the same volume to restore specific files. By the way, the same issues occur over a CIFS/SMB share, since translation between UTF-8 and UTF-8 (Mac) is needed there too. Unfortunately there is no workaround for now, since I would have to hack the NAS to install ssh and set the default encoding. rsync reports this error: rsync: --iconv=UTF8-MAC,UTF8: unknown option
Kiran Jonnalagadda — Jun 21, 2010 9:34:54 PM — # ↩
Turns out the --iconv option is not available in the Mac OS X version of rsync. It may be available in custom versions (via Fink, MacPorts or Homebrew), but I don’t have a Mac around to test with.