[LINK] Aditya Mukerjee on the problems of Unicode on a multilingual Internet

Aug 04, 2015 13:34

At Model View Culture, Aditya Mukerjee has a post with the arresting title of "I Can Text You A Pile of Poo, But I Can’t Write My Name". Mukerjee makes the convincing case that the underrepresentation of non-Western languages in Unicode, especially South and East Asian ones, is a serious problem for the Internet. That undeciphered scripts like Linear A are fully included is, well, odd.

My family’s native language, which I grew up speaking, is far from a niche language. Bengali is the seventh most common native language in the world, sitting ahead of the eighth (Russian) by a wide margin, with as many native speakers as French, German, and Italian combined.

And yet, on the Internet, Bengali is very much a second-class citizen - as are Arabic (#5), Hindi (#4), and Mandarin (#1) - any language which is not written with the Latin alphabet.

The very first version of the Unicode standard did include Bengali. However, it left out a number of important characters. Until 2005, Unicode did not have one of the characters in the Bengali word for “suddenly”. Instead, people who wanted to write this everyday word had to combine three separate, unrelated characters. For English-speaking teenagers, combining characters in unexpected ways, like writing ‘w’ as ‘\/\/’, used to be a way of asserting technical literacy through “l33tspeak” - a shibboleth for nerds that derives its name from the word “elite”. But Bengalis were forced to make similar orthographic contortions just to write a simple email: ত + ্ + ‍ = ‍ৎ (the third character is the invisible “zero width joiner”).

Even today, I am forced to do this when writing my own name. My name is not only a common Indian name, but one of the top 1,000 names in the United States as well. But the final letter has still not been given its own Unicode character, so I have to use a substitute.

A few other characters that were more common historically, though still used today, were also missing for the first decade of Bengali’s existence in Unicode. It’s tempting to argue that historical characters have no place in a character set intended for computers. On the contrary, this makes their inclusion even more vital: rendering historical texts accurately is key to ensuring their survival in the transition to the age of digital media.
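
For the curious, here is a minimal Python sketch of the three-character spelling Mukerjee describes, set beside the single code point that Unicode added in 2005. I am assuming the character he means is U+09CE (khanda ta); the variable and function names are mine, not his.

```python
import unicodedata

# Older spelling quoted above: TA + VIRAMA + ZERO WIDTH JOINER.
legacy = "\u09a4\u09cd\u200d"
# Dedicated code point (assumed to be the one the post refers to).
khanda_ta = "\u09ce"


def describe(text):
    """Return (code point, official Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


legacy_names = describe(legacy)
modern_names = describe(khanda_ta)

# The two spellings are not canonically equivalent, so normalization alone
# does not turn the old sequence into the new character.
same_after_nfc = unicodedata.normalize("NFC", legacy) == khanda_ta

for code, name in legacy_names + modern_names:
    print(code, name)
print("equal after NFC:", same_after_nfc)
```

Since normalization does not bridge the two spellings, anything that searches or sorts Bengali text would presumably have to map one to the other explicitly.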

china, internet, south asia, bangladesh, globalization, english language, india, korea, chinese language, popular culture, east asia, links, japan
