kindle 3 and chinese text

Nov 09, 2010 22:50

A few weeks ago I started a beginner-level Chinese (Mandarin) evening class at the local community college. We're about four weeks in and I'm really enjoying it; I will definitely do more.

I thought it would be an interesting project to load a Chinese-English dictionary onto my Kindle for reference. I've already played with kindlegen, which takes a ( Read more... )


lithiana November 9 2010, 10:48:35 UTC
My understanding is that Unicode is not unambiguous for CJK characters -- it has a single codepoint that refers to a character written slightly differently in each language, and the application is supposed to detect which character to display based on the user's or document's indicated locale. However, I don't see why that would result in a character not displaying at all...


pne November 9 2010, 12:31:18 UTC
"My understanding is that Unicode is not unambiguous for CJK characters -- it has a single codepoint that refers to a character written slightly different in each language, and the application is supposed to detect which character to display based on the user or document's indicated locale."

That's true, too, due to "Han unification".

Most shapes are identical or at least very similar, but in some cases there are noticeable differences. (A rough comparison in Latin-script terms, where character shapes also vary, is handwritten vs. printed lower-case "a" or "g".)
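To make the "single codepoint" point concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The character U+76F4 and the two toy "documents" with a `lang` field are my own illustrative assumptions, not anything from the Kindle or from the thread; the point is only that the text data is identical and any language hint has to travel separately.

```python
import unicodedata


def describe(ch):
    """Return the code point and formal Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Hypothetical documents: the same character tagged with different languages.
# The language tag lives outside the text; the character data is identical.
zh_doc = {"lang": "zh", "text": "\u76f4"}
ja_doc = {"lang": "ja", "text": "\u76f4"}

print(describe(zh_doc["text"]))
# The strings compare equal, and so do their UTF-8 encodings.
print(zh_doc["text"] == ja_doc["text"])
print(zh_doc["text"].encode("utf-8") == ja_doc["text"].encode("utf-8"))
```

A renderer that wants the regionally expected shape therefore has to look at something like the `lang` value, because nothing in the character itself says which style was intended.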

"However, I don't see why that would result in a character not displaying at all..."

Because the font it uses doesn't cover every single Unicode character but only the ones it thinks will be needed?

If that were the reason, though, then it would mean that Amazon-provided eBooks (where the characters show up) must use a different font than self-made ones or plain-text UTF-8 files (where certain characters don't).
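The font-coverage guess can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_font` is a made-up set of covered code points standing in for a real font's character map, and `missing_glyphs` is a hypothetical helper, not anything the Kindle exposes.

```python
def missing_glyphs(text, covered):
    """Return the characters in `text` that have no glyph in `covered`.

    `covered` is a set of integer code points. Whitespace is ignored and
    each missing character is reported once, in order of first appearance.
    """
    missing = []
    for ch in text:
        if ch.isspace():
            continue
        if ord(ch) not in covered and ch not in missing:
            missing.append(ch)
    return missing


# Hypothetical coverage: printable ASCII plus two Chinese characters.
toy_font = set(range(0x20, 0x7F)) | {ord("你"), ord("好")}

print(missing_glyphs("ni hao 你好吗", toy_font))
```

Any character this reports is one a renderer would have to draw as an empty box, unless it can fall back to another font.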


ghewgill November 9 2010, 18:18:15 UTC
I have a vague understanding of Han unification. The 28 MB CJK Unified Ideographs PDF at unicode.org shows six columns with slightly different renderings for each code point (and many places where a code point might be shown for just one or a few languages).

Perhaps the Kindle prefers to show characters in a consistent style based on the language setting, rather than mixing up different styles because not all characters are available in each style. Since it's a reading device, I can appreciate that.


goulo November 10 2010, 09:50:07 UTC
Surely it's better to see some of the characters in a different style/font/whatever than to see an empty box...? (Like sometimes I visit a poorly designed webpage that has text in one font, except all the Esperanto characters are in some other font. Ugly, but far better than not even showing the Esperanto characters.)

I don't know much about this particular issue. It sounds like a disaster, if there are Unicode characters that don't have a single meaning but depend on having barely documented settings set correctly in one's device. Was there really not a better way for Unicode to be designed? I thought the whole point of it was to avoid this kind of alternate character set problem...?


mskala November 10 2010, 17:22:49 UTC
Han unification has always been controversial, but there are some reasons for it. The claim of its proponents is that having two different styles of the same character is a much different thing from having two actually different characters. Look, for instance, at the lowercase letter "a" in English. In some typefaces that's a basically circular shape with a line added along the right-hand side (single-decker); a vertical line down the middle passes through the letter twice. In other typefaces, the loop is pushed to the bottom and the line along the right-hand side also curves up and over, forming a second, open, loop (double-decker). A vertical line through that kind of "a" passes through the letter three times. Single-decker is more common in handwriting and double-decker is more common in print. This is a bigger difference than other differences (like that between C and G) that are considered to represent "different letters" by users of English; nonetheless, the two versions of "a" are considered to be "the same letter" by ( ... )


goulo November 10 2010, 17:58:58 UTC
Thanks for the explanation! That is helpful, but also leaves me confused. I always thought (perhaps quite erroneously) that at its core, Unicode was not so much about mere appearance as about the logical function/purpose of the characters. The zillions of different possible appearances of the letter "a" (depending on which typeface one uses) are all still the same character, surely. I cannot imagine the disaster if every distinct "a" from every typeface (new ones being created daily) should be considered a distinct character ( ... )


mskala November 10 2010, 18:34:28 UTC
"Unicode was not so much about mere appearance as about the logical function/purpose of the characters." Yes, and that's why they did Han unification; to give the same numbers to characters they decided were "the same" in some important way even if they looked different in different languages. The consequence is that if you use a font designed for one language to write another (within CJK), it ends up looking wrong even if it has glyphs for all the necessary code points; then there's the separate issue of whether it DOES have glyphs for all the necessary code points ( ... )
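The two issues (wrong style vs. no glyph at all) can be separated in a small Python sketch of per-character font fallback. Everything here is hypothetical: the font names `ToySansSC` and `ToySansJP`, their tiny coverage sets, and the `lang` codes are invented for illustration, and nothing in the thread says the Kindle works this way.

```python
# Hypothetical fonts: (language the font was designed for, name, covered code points).
FONTS = [
    ("zh", "ToySansSC", {ord("直"), ord("你")}),
    ("ja", "ToySansJP", {ord("直"), ord("の")}),
]


def fallback_order(lang, fonts):
    """Put fonts designed for `lang` first, then the rest in their original order."""
    preferred = [(name, covered) for l, name, covered in fonts if l == lang]
    others = [(name, covered) for l, name, covered in fonts if l != lang]
    return preferred + others


def pick_font(ch, ordered_fonts):
    """Return the name of the first font that covers `ch`, or None (an empty box)."""
    for name, covered in ordered_fonts:
        if ord(ch) in covered:
            return name
    return None


ja_order = fallback_order("ja", FONTS)
print(pick_font("直", ja_order))  # covered by both fonts; the Japanese one wins
print(pick_font("你", ja_order))  # only the Chinese font has it: visible, but mixed style
print(pick_font("한", ja_order))  # no font has it: nothing to draw
```

A device that refuses to mix styles would stop after the first font in the list, which trades the "ugly but readable" case for an empty box.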


