Spellchecker, Unicode, and XCompose

Jun 13, 2008 08:24

When spellchecking my previous post, I noticed that the LJ Spellchecker does not grok Unicode. I never noticed this before because it wasn’t until sometime in the last year that I learned how to type things like curly apostrophes and proper dashes and the like at my keyboard.


For context, I’m currently running xorg server under Debian Linux. This should work on any unixy system running xwindows, though. It might even work under the older xfree86 server. It’ll probably use different file paths in other distributions or in non-Linux unices. I have no idea if Macs can do anything of the sort.

The breakthrough came when I learned how xwindows thinks about such things. Unsurprisingly, it’s based in history. In particular, some older and less‐mainstream keyboards have a Compose key. With supporting software, a user can press the Compose key, then tap two or three other keys in sequence, and the system generates a special character that’s simply not on the keyboard. That way users who need to type special characters like à or ç or ¿ or « don’t need to get special keyboards to type them. They just configure appropriate Compose key sequences to generate the characters they need.

XWindows lets users hijack a more traditional key on their keyboard to use as a Compose key. Except xwindows sometimes calls it a Multi_key to keep users on their toes. There’s a way to define a system‐wide one in your xorg.conf, but that wasn’t the method I went with, so I’m not going to say anything more about it. Instead, I used the equally‐poorly‐documented setxkbmap.

In my .xinitrc, I included the following line:
setxkbmap -option compose:rwin
This tells xkb, the xwindows keyboard mapper, to hijack my right Windows‐logo key as a Compose key. On my Windows‐key‐less laptop I use -option compose:ralt to hijack my right Alt key.

Now, when I press my shiny new Compose key, xwindows listens for a couple more characters and decides how to compose them into a nifty new symbol. In my case I’ve written a file in my home directory called .XCompose. It looks like this:

include "/usr/share/X11/locale/en_US.UTF-8/Compose"
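<Multi_key> [...] : "‐" U2010 # HYPHEN
<Multi_key> [...] : "‒" U2012 # FIGURE DASH
<Multi_key> [...] : "―" U2015 # HORIZONTAL BAR
<Multi_key> [...] : "‽" U203D # INTERROBANG

<Multi_key> [...] : "…" U2026 # HORIZONTAL ELLIPSIS
<Multi_key> [...] : "̀" U0300 # COMBINING GRAVE ACCENT
<Multi_key> [...] : "́" U0301 # COMBINING ACUTE ACCENT
<Multi_key> [...] : "̂" U0302 # COMBINING CIRCUMFLEX ACCENT
<Multi_key> [...] : "̃" U0303 # COMBINING TILDE

<Multi_key> [...] : "̈" U0308 # COMBINING DIAERESIS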

The include grabs all the data from the system‐wide defaults. The other lines define key combinations. All of them start with <Multi_key> because that’s the name of the Compose key. Sometimes, in a poorly‐documented sort of way. The key names are probably defined somewhere, but I just grabbed them by example from the system‐wide default file I included above. The thing in quotes is the symbol I want it to generate. The next field is the Unicode code point for that character. After the # is the official Unicode name of the character, drawn from the same place I got the code points.
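If looking up code points and names by hand gets tedious, the right-hand side of each line can be generated. Here is a minimal Python sketch, assuming the field layout described above. The function name compose_line and the key sequence in the example are made up for illustration; check the real key names against the system‐wide Compose file before using them.

```python
import unicodedata


def compose_line(keys, char):
    """Build one .XCompose entry: key names, quoted result, code point, name.

    `keys` are X key names typed after the Compose key. The names used in the
    example below are assumptions, not taken from the original file.
    """
    sequence = " ".join(f"<{key}>" for key in ["Multi_key", *keys])
    code_point = f"U{ord(char):04X}"   # e.g. U203D
    name = unicodedata.name(char)      # official Unicode character name
    return f'{sequence} : "{char}" {code_point} # {name}'


# Hypothetical sequence, chosen only to show the output format.
print(compose_line(["exclam", "question"], "\u203d"))
```

Python’s unicodedata module uses the same official character names, so the comment field comes out matching the style of the hand-written entries.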

And with that in place, it Just Works. My curly quotes are defined in the global default file. I just type my Windows key, a ', and a >, and I get ’. Windows ' a gives me á. Windows . . . gets me …. It’s a couple extra keystrokes, but I like the look. And really, isn’t the interrobang worth the trouble‽
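For the curious, the lookup can be pictured as a table keyed on the keys typed after Compose. The Python below is a toy model under that assumption, not how xwindows actually implements it. The table holds only the two sequences mentioned above, and the names COMPOSE_TABLE and feed are invented for the sketch.

```python
# Toy table: key names typed after Compose, mapped to the resulting character.
COMPOSE_TABLE = {
    ("apostrophe", "greater"): "\u2019",  # RIGHT SINGLE QUOTATION MARK
    ("apostrophe", "a"): "\u00e1",        # LATIN SMALL LETTER A WITH ACUTE
}


def feed(events):
    """Turn a list of key names into text, resolving Compose sequences.

    Keys typed outside a Compose sequence are assumed to be single
    characters and are passed through unchanged.
    """
    out = []
    pending = None  # None when not composing, else the keys typed so far
    for key in events:
        if pending is None:
            if key == "Multi_key":
                pending = []          # Compose pressed: start collecting keys
            else:
                out.append(key)
            continue
        pending.append(key)
        seq = tuple(pending)
        if seq in COMPOSE_TABLE:
            out.append(COMPOSE_TABLE[seq])
            pending = None            # sequence complete
        elif not any(known[:len(seq)] == seq for known in COMPOSE_TABLE):
            pending = None            # no sequence starts this way: drop it
    return "".join(out)


print(feed(["i", "t", "Multi_key", "apostrophe", "greater", "s"]))
```

A sequence that matches nothing is simply discarded in this sketch. What a real X client does with an unknown sequence may differ.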

Now, with the poor Unicode support in the Spellchecker, I wonder how well it handles storing and sending those characters on plain ol’ webpages….
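As a rough illustration of what “storing and sending” involves, this Python sketch shows two common ways such a character can travel in a webpage: as raw UTF-8 bytes, or as an ASCII-only numeric character reference. It says nothing about what LJ itself does, and the helper name describe is invented here.

```python
def describe(char):
    """Return the code point label, UTF-8 bytes, and HTML numeric reference."""
    utf8 = char.encode("utf-8")
    # xmlcharrefreplace turns non-ASCII characters into &#NNNN; references.
    ref = char.encode("ascii", "xmlcharrefreplace").decode("ascii")
    return f"U+{ord(char):04X}", utf8, ref


for char in ("\u2019", "\u2026"):
    print(describe(char))
```

Either form should display the same way in a browser that is told the right encoding. The numeric reference also survives systems that only handle plain ASCII.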

(LJ Spellchecker Genius of the Day: asciicircum -> Eskimo)

tech, geek, xkb, unicode, spellchecker genius, unix
