14 March 2008

Unicode: its time has come.

I've started to use Unicode a lot more in this and other places. In fact you will need to use Unicode to read this post. What is it, and why use it? Unicode is a standard for the encoding of letters and other written characters. In the old days the Americans (bless 'em) created a standard way of encoding English which is known by its acronym: ASCII. Each letter was assigned a number and this made encoding text much easier for computers. ASCII used seven bits, which gave it a limit of 128 characters, and its one-byte (8 bit) extensions raised that to 256. This just about does it for English, but of course many other languages are in use in the world, and some of them don't stick to the plain Roman characters familiar to English language speakers. Unicode solves this problem by giving every character its own number from a much larger range. It started out with two bytes, giving 256 x 256 = 65536 possibilities, and has since been extended to over a million. Even the two-byte version made encoding non-Roman scripts a possibility, and it also allows us to use a full range of diacritic marks - which is where it gets interesting for me.
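
To make "each letter was assigned a number" concrete, here is a small sketch in Python (my choice for illustration; nothing in this post depends on it). It shows the number Unicode assigns to a few letters, and how many bytes each one takes in UTF-8, which is one common way of storing Unicode text. The function name `describe` is made up for the example.

```python
def describe(ch: str) -> tuple[int, int]:
    """Return (Unicode number, size in bytes when stored as UTF-8) for one character."""
    return ord(ch), len(ch.encode("utf-8"))


# Escapes are used so the example does not depend on how this page is encoded.
samples = {
    "plain t": "t",
    "a with macron": "\u0101",        # ā
    "t with dot below": "\u1e6d",     # ṭ
    "Devanagari tta": "\u091f",       # ट
}

for label, ch in samples.items():
    number, size = describe(ch)
    print(f"{label}: {ch} is number {number} and takes {size} byte(s) in UTF-8")
```

The plain letter fits in a single byte, just as it did in ASCII, while the letters with diacritics need two or three.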

I falteringly read Pāli, and I use a lot of Pāli and Sanskrit terms in my writing. Diacritics do matter in the writing of Indic languages. For instance the retroflex unvoiced stop ( ṭ ) is different from the dental unvoiced stop ( t ). Compare them in Devanāgarī for instance: ट and त are not at all alike, and are clearly distinguished in pronunciation. However in the early days of popular writing about Buddhism, publishers, who did not have readily available fonts to cope with the diacritics, nor proof readers who knew what they meant, just decided to do without them. Unfortunately this became the fashion. Scholars used them of course and this became a bit of a dividing line - serious Buddhist writing uses diacritics, but popular Buddhism does not. There is no good reason to continue this, but it's become a habit.

Until quite recently the internet reinforced this bad habit. HTML simply could not cope with anything other than ASCII (and its one-byte descendants). Several ASCII-based encoding systems were invented. Let's say I want to write paṭicca-samuppāda. Two of the common methods of doing it in text look like this:
Velthuis: pa.ticca-samuppaada
ITRANS: paTicca-samuppAda
Neither is very easy to read compared to properly printed text. Real problems emerge for the nasals ṅ, ñ, and ṇ, and the sibilants ṣ and ś. One way around the problem was to create a special font that had to be installed before pages could be read. This works OK, but these home-made fonts use parts of the ASCII scheme that are seldom used in English, and they do it idiosyncratically so that they are not interchangeable. If I type the word in the Vipassana Research Institute font that comes with their CD of the Pāli Canon and then change fonts, I get this:
VriRoman Pali: paµicca-samupp±da
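
Both kinds of workaround are mechanical enough to convert into real Unicode. The following is a rough Python sketch, not a complete converter: the Velthuis table lists only the handful of sequences needed for Pāli, and the font table contains only the two substitutions that can be read off the VriRoman example above (the rest of that font's layout is not given here, so I have not guessed at it). All the names are made up for the example.

```python
# Velthuis-style ASCII sequences and the Unicode letters they stand for.
# Only a subset useful for Pali is listed; this is not the full scheme.
VELTHUIS = {
    "aa": "\u0101",  # ā
    "ii": "\u012b",  # ī
    "uu": "\u016b",  # ū
    ".t": "\u1e6d",  # ṭ
    ".d": "\u1e0d",  # ḍ
    ".n": "\u1e47",  # ṇ
    "~n": "\u00f1",  # ñ
    '"n': "\u1e45",  # ṅ
    ".m": "\u1e43",  # ṃ
}

# The two substitutions visible in the VriRoman example: µ -> ṭ and ± -> ā.
# Assumption: the symbols are the Latin-1 micro sign and plus-minus sign.
LEGACY_FONT = str.maketrans({"\u00b5": "\u1e6d", "\u00b1": "\u0101"})


def velthuis_to_unicode(text: str) -> str:
    """Replace each Velthuis sequence in the table with its Unicode letter."""
    for ascii_seq, letter in VELTHUIS.items():
        text = text.replace(ascii_seq, letter)
    return text


def legacy_font_to_unicode(text: str) -> str:
    """Swap the hijacked symbols for the letters the font drew in their place."""
    return text.translate(LEGACY_FONT)


print(velthuis_to_unicode("pa.ticca-samuppaada"))
print(legacy_font_to_unicode("pa\u00b5icca-samupp\u00b1da"))
```

Both lines print paṭicca-samuppāda. Plain search-and-replace like this is naive: it would also mangle a full stop that happens to be followed by a t, so it only suits text that is known to be Velthuis throughout.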
Unicode solves this problem, and it is getting easier to use. On my visiblemantra.org website I used to hand code all of the extra characters. So for example ṭ = &#7789; and ā = &#257; (the number is the character's Unicode number, and the trailing semicolon is part of the code). This is time consuming, taxes the memory, and makes the source code difficult to read, but it results in a full set of Indic letters. And what's more they will display correctly in any Unicode font that includes those letters.
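
Hand coding like this can be automated. Here is a minimal Python sketch of the idea, assuming the decimal form of the codes with a closing semicolon as HTML expects. The function name `to_numeric_references` is my own invention; `html.unescape` is from Python's standard library and is used only to check the round trip.

```python
import html


def to_numeric_references(text: str) -> str:
    """Replace every non-ASCII character with a decimal code such as &#257;."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


word = "pa\u1e6dicca-samupp\u0101da"  # paṭicca-samuppāda
encoded = to_numeric_references(word)

print(encoded)                         # pa&#7789;icca-samupp&#257;da
print(html.unescape(encoded) == word)  # True: decoding gives the word back
```

The result is pure ASCII, which is why it survives in any page, and also why the source becomes so hard to read.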

Unicode has not completely superseded the old style ASCII fonts. Since the sequence that contains the numbers and the upper and lower case Roman letters is the same in both, for most people there is no incentive to change. We have our favourite fonts and we don't want to change. And actually there are still not many Unicode fonts to choose from. Windows and Mac both ship with a couple of Unicode fonts (for Windows, Arial Unicode MS and Lucida Sans Unicode) but not a version of Times Roman. Some fonts only implement a subset of the Unicode character set - so Times New Roman does have some extra characters, but not all the ones we need for Sanskrit.

When I set up visiblemantra.org I made the decision to use diacritics throughout the site. I believe that it is important to accurately represent the mantras. So you can't really read the site without setting a Unicode font in your browser options. I'm an early adopter and this will mean that some of the 200 or so visitors each day cannot read some of the text, but I hope I am making it more sensible for everyone to start using Unicode. It's a bit like DVDs or any of those new technologies. Some people hold out for as long as they can, but there comes a time when it just makes more sense to go with the new. I believe the time has come. I have used the occasional diacritic on this site before, but fudged it at times by leaving them off. From now on I plan to use diacritics all the time - which is to say that I intend to spell Sanskrit and Pāli words as they should be (taking into account my appalling spelling of course).

Two things have made the difference for me. Firstly I managed to get hold of a copy of the Windows Unicode font Times Ext Roman which has all the diacritics I need, and looks good both on screen and printed. Secondly I discovered how easy it is to make a keyboard map so I can type them in whatever application I am using. I'm making both the font and the keyboard map available on visiblemantra.org, and I'd encourage everyone who reads this to go ahead and make the jump. I also have both a rough, and a detailed, guide to how to pronounce the letters of Sanskrit on visiblemantra.org.

Here's the Sanskrit alphabet in all its glory:

a ā i ī u ū e ai o au aṃ aḥ ṛ ṝ ḷ ḹ ka kha ga gha ṅa ca cha ja jha ña ṭa ṭha ḍa ḍha ṇa ta tha da dha na pa pha ba bha ma ya ra la va śa ṣa sa ha kṣa

See what you've been missing? (Hint: if not, set your browser to use a Unicode font!)
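
If you want to know exactly which letters a font has to include before it can show that alphabet, a few lines of Python will list them. This is only a sketch: the name `letters_beyond_ascii` is made up, and the text is normalised first (NFC) so that a letter typed as a base letter plus separate accent marks is counted as one character.

```python
import unicodedata

ALPHABET = (
    "a ā i ī u ū e ai o au aṃ aḥ ṛ ṝ ḷ ḹ ka kha ga gha ṅa ca cha ja jha ña "
    "ṭa ṭha ḍa ḍha ṇa ta tha da dha na pa pha ba bha ma ya ra la va śa ṣa sa ha kṣa"
)


def letters_beyond_ascii(text: str) -> list[str]:
    """Return the distinct non-ASCII characters in text, in code point order."""
    composed = unicodedata.normalize("NFC", text)
    return sorted({ch for ch in composed if ord(ch) > 127})


for ch in letters_beyond_ascii(ALPHABET):
    # Show the Unicode number in the usual U+XXXX form, then the official name.
    print(f"U+{ord(ch):04X}  {ch}  {unicodedata.name(ch)}")
```

A font that is missing any letter on the resulting list will show a box or a question mark in its place.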

A selection of fabulous resources which rely on Unicode can be found at the following locations:
