Hell is paved with charsets: pphaneuf

Mar 13, 2006 07:33

Oh, my goodness. I fought in the FidoNet charset wars (I was part of the NETDEV echo, way back then), and Unicode was supposed to be my saviour, or something.

Behold: I am trying to keep a two-way rsync of my music library between my Mac OS X laptop and my Linux workstation. Besides the obvious duplication induced by such genius as the interaction between the case-preserving but case-insensitive filesystem of Mac OS X and the case-sensitive filesystem of Linux (yeah, "U2" and "u2" are two totally different bands, didn't you know?), charsets come to bite my arse once more, as if I hadn't done my share already.
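For the case side of the problem, here is a minimal sketch of how one might list the names that are distinct on Linux but would land on the same entry on the Mac. The function name `case_collisions` is made up for illustration, and `str.casefold()` is only an approximation of the folding rules the Mac filesystem really applies.

```python
from collections import defaultdict


def case_collisions(names):
    """Group names that differ only by case (hypothetical helper).

    Assumption: str.casefold() is close enough to the case-insensitive
    matching of the Mac filesystem for a first pass.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    # Keep only the groups where more than one spelling exists.
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    print(case_collisions(["U2", "u2", "Moby"]))
```

Feeding it the output of `os.listdir()` for one directory on the Linux side would show which entries need merging before the next sync.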

Some bands, albums or songs with accented characters in their names were in ISO-8859-1 in some places and in UTF-8 in others. At this point, I was all happy about Linux distributions finally having switched over to UTF-8, and Mac OS X being UTF-8 as well, thinking those were old leftovers (they were) and that I just needed to rename them to UTF-8 in order to regain my sanity.
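The renaming step could look something like the sketch below. It is a heuristic, not a guarantee: it assumes that any name that does not decode as UTF-8 is an ISO-8859-1 leftover, which is a guess (a few ISO-8859-1 byte sequences happen to be valid UTF-8 too). The helper name `to_utf8_name` is hypothetical.

```python
def to_utf8_name(raw: bytes) -> bytes:
    """Return a UTF-8 version of a raw filename (hypothetical helper).

    Assumption: names that are not valid UTF-8 are ISO-8859-1.
    """
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8 (plain ASCII included)
    except UnicodeDecodeError:
        # Every byte is a valid ISO-8859-1 character, so this cannot fail.
        return raw.decode("iso-8859-1").encode("utf-8")


if __name__ == "__main__":
    old = b"Bj\xf6rk"            # "Björk" in ISO-8859-1
    print(to_utf8_name(old))     # the same name re-encoded as UTF-8
```

On Linux, `os.listdir()` called with a bytes path returns the raw bytes of each name, which is what a function like this would need as input before any `os.rename()`.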

No. Of course not. How dumb was I?

Latin accented characters can be represented in two ways in UTF-8: as a single precomposed code point (the same code point as in ISO-8859-1), or as the plain letter followed by a combining accent, some sort of "dead character". This means that strcmp will mark two identical-looking strings as being different whenever they don't use the same form.
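To make the two forms concrete, here is a small sketch using Python's standard `unicodedata` module. The helper `same_name` is made up for illustration, and picking NFC (the composed form) as the common ground is an arbitrary choice; normalizing both sides to NFD would work just as well.

```python
import unicodedata

precomposed = "\u00e9"   # é as one code point, same number as in ISO-8859-1
decomposed = "e\u0301"   # plain e followed by COMBINING ACUTE ACCENT


def same_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    # The two strings look identical on screen but are different bytes.
    print(precomposed.encode("utf-8"), decomposed.encode("utf-8"))
    # Naive comparison, the moral equivalent of strcmp.
    print(precomposed == decomposed)
    # Comparison after normalization.
    print(same_name(precomposed, decomposed))
```

The naive comparison says the strings differ, while the normalized one says they match, which is the whole problem in three lines.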

Now, what would be your guess on Mac OS X and Linux using the same method to represent Latin accented characters? Or, say, the chances of either of them using something more sophisticated than strcmp to compare strings (not that I blame them, this sounds like a ridiculously complicated problem, of the kind we were trying to get rid of by kissing the charsets goodbye)?

*sobs*

P.S.: Thank Bob for Firewire and the bandwidth of a hard disk in an enclosure.