Unicode and UTF-8

Sep 04, 2008 17:22

As I was wandering through the UTF-8 specification for encoding Unicode as bytes, I came upon the Wikipedia entry for UTF-8.

As I previously understood it, a Unicode character can technically take on any value in 0...2**32-1, that is, any 32-bit unsigned integer, though only the code points from 0 through 1,114,111 (0x10FFFF) are even considered valid, and anything after the first 64k code points is beyond the Basic Multilingual Plane and is rare.

But looking at the bit sequences used for multi-byte characters, the standard 1-4 byte forms give a little over 2 million character options (the 4-byte form carries 21 bits, so 2**21 = 2,097,152); plenty for all of Unicode, with 3 bytes covering everything in the Basic Multilingual Plane.
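To make the counting concrete, here is a small sketch of how many payload bits each standard sequence length carries. The function name `payload_bits` is my own, not something from the spec; the last two lines just cross-check against Python's built-in encoder.

```python
def payload_bits(n_bytes: int) -> int:
    """Payload bits carried by an n-byte UTF-8 sequence (bit-layout arithmetic only)."""
    if n_bytes == 1:
        return 7                            # 0xxxxxxx
    lead_bits = 7 - n_bytes                 # lead byte: n one bits, a zero, then payload
    return lead_bits + 6 * (n_bytes - 1)    # each continuation byte is 10xxxxxx

for n in range(1, 5):
    bits = payload_bits(n)
    print(f"{n} byte(s): {bits} bits, {2**bits:,} values")

# Cross-check with the built-in encoder: the last of the first 64k code points
# still fits in 3 bytes, and the next code point needs 4.
print(len(chr(0xFFFF).encode("utf-8")), len(chr(0x10000).encode("utf-8")))
```

The 3-byte form comes out at 16 bits (65,536 values) and the 4-byte form at 21 bits (2,097,152 values), comfortably above the highest valid code point, 0x10FFFF.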

But there are extended UTF-8 encoding forms of 5 and 6 bytes that stretch the range to 2**31 characters. I understand the intent (represent as many of the 2**32 potential characters in UTF-8 as possible), but it seems a little overboard to attempt to represent over 2 billion characters, over 99.9% of which aren't part of the spec and over 99.99% of which will likely never be used (even if we come into contact with intelligent alien life forms).
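For what it's worth, the "over 99.9%" figure is easy to check. This is just the arithmetic, assuming the 6-byte form carries 31 payload bits and the valid range is 0 through 0x10FFFF; the variable names are mine.

```python
VALID_CODE_POINTS = 0x10FFFF + 1    # 1,114,112 values, 0 through 0x10FFFF
EXTENDED_SPACE = 2**31              # what a 6-byte sequence can address (31 bits)

# Share of the extended space that falls outside the valid Unicode range.
unused_fraction = 1 - VALID_CODE_POINTS / EXTENDED_SPACE
print(f"{unused_fraction:.4%}")     # prints 99.9481%
```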

About the only use that I can see for the extended forms is the case where you want a Golomb-like coding for integers that uses whole bytes rather than bits (which is arguably what UTF-8 already is for the "integer characters" in the Unicode spec).
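Here is a rough sketch of that idea: treating the 1-6 byte layout as a variable-length integer code for anything below 2**31. The names `encode_int` and `decode_int` are made up for illustration, and the decoder does no validation (it trusts the lead byte and ignores overlong forms), so take it as a toy rather than a real codec.

```python
def encode_int(value: int) -> bytes:
    """Encode 0 <= value < 2**31 using the 1-6 byte UTF-8-style bit layout."""
    if not 0 <= value < 2**31:
        raise ValueError("value out of range")
    if value < 0x80:
        return bytes([value])               # single byte: 0xxxxxxx
    # Pick the shortest length whose payload can hold the value.
    for n in range(2, 7):
        if value < 1 << ((7 - n) + 6 * (n - 1)):
            break
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (value & 0x3F))  # continuation byte: 10xxxxxx
        value >>= 6
    lead_prefix = (0xFF << (8 - n)) & 0xFF  # n one bits followed by a zero
    return bytes([lead_prefix | value] + tail[::-1])


def decode_int(data: bytes) -> int:
    """Inverse of encode_int; assumes data is one well-formed sequence."""
    lead = data[0]
    if lead < 0x80:
        return lead
    n = 0
    while lead & (0x80 >> n):               # count the leading one bits
        n += 1
    value = lead & (0x7F >> n)              # payload bits of the lead byte
    for b in data[1:n]:
        value = (value << 6) | (b & 0x3F)
    return value


print(encode_int(0x20AC).hex())             # same bytes as "€".encode("utf-8")
print(len(encode_int(2**31 - 1)))           # the largest value needs all 6 bytes
```

For values in the valid Unicode range (surrogates aside) this produces the same bytes as a normal UTF-8 encoder; past 0x10FFFF it simply keeps going into the 5 and 6 byte forms.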

Remind me to tell you all some day about a hilarious spec I'm thinking about writing called utf-12. It would support all 1,114,112 standard Unicode code points (plus a few more), and would result in 1/4 less overhead for all non-Latin European alphabets, matching the performance in the rest of the Basic Multilingual Plane, and 1/4 less overhead for any character beyond the Basic Multilingual Plane and the rest of the spec (ignoring the 4 bits of trailing padding at the end of half the strings). Yes, the shortcomings are why it's funny.

Ok, enough out of me for today.

software
