Developing Web Pages in Bangla (Bengali)

I wrote this page to summarize what I learned in a couple of days of crawling around the web, trying to figure out the best way to represent Bangla on a webpage. Most webpages produced in Bangla are composed of static images rather than text. "Live" content often consists of recently generated (or, in exceptional cases, dynamically generated) PDF files. This is particularly true of newspapers and other print-media sites (see, for example, these periodicals). While it is great that some content has found its way to the web, these solutions are clearly stop-gap measures. Network bandwidth, server storage capacity, page display speed and the ability to update or dynamically reformat content all require that Bangla content be served in a character-oriented manner, as is common in most other languages. Once this happens, there should be an explosion of Bangla content commensurate with Bangla's popularity (depending on the source, it is considered the fourth to seventh most spoken language in the world). This review was written in November 2004, and hopefully parts of it will become dated quickly as the situation improves.

I have included Bangla characters in this article; if your browser does not render them correctly, skip to the "web-browsers" page.

The problem: Standards and Software

There are two main issues in producing web-content in Bangla: international standards and the adoption of these standards by companies that make computer software. Standards have become more cosmopolitan over the years. In the 1990s, the Indian Script Code for Information Interchange (ISCII) was developed as an extension of the venerable US standard, ASCII (ISO 646). More recently, unicode has matured and taken center stage. The unicode standard aims to encode every character used in any human language, past or present, and then some. Unicode is defined in parallel with ISO 10646, which adds to its legitimacy and likelihood of broad international acceptance. A lot of thinking from various camps has gone into its development. In many ways, it is backward compatible with previous character sets (like ASCII), easing the pain of switching from one standard to another. Reassuringly, while unicode may continue to expand to encompass new languages and symbols, once a character is accepted into the unicode canon, it is there forever -- it will not unpredictably vaporize when the standard is revised.

Before unicode, it was still possible to view web pages in Bangla, but it was more difficult. It was necessary to download and install the same font used to generate the page. The browser mapped the letters in the web page to its font set and displayed the result. A page generated in one font was not guaranteed to display properly in another font. Typically, the web page provided a link to allow download of the appropriate font, although in the case of proprietary fonts this sometimes violated copyright. In some cases, the font was sent with the page, although the success of this practice varied from browser to browser, or even between versions of the same browser. It also made the download larger. With this method, it was difficult to obtain text input from the viewer through standard forms.

Unicode has two principal advantages from the viewpoint of the browser. First, once a document is written in unicode, it can be displayed on any browser that supports the standard, and in any unicode-compliant font. Second, since unicode encompasses all languages, the same font can be used to display other languages (assuming that the creator of the font included entries for each language; Code2000 is a good example of a polyglot font). In fact, multiple languages can be mixed on one page. For multilingual websites, this is a huge benefit.

When a browser requests a document from the web, a server begins its reply with some preliminary information about the type of communication taking place and the type of encoding. Many encoding standards have sprung up, most of which correspond to a relatively narrow spectrum of languages employing similar characters (for example, the ISO-8859-1 standard for the Latin1 character set or KOI-8 for Cyrillic). For unicode, the option will be UTF-8 (or related variants like UTF-16; the number gives the size in bits of the basic storage unit, and a single character may occupy one or more of these units). Next, the web page may specify the recommended font (or font family) to use. One page may even mix different fonts, permitting a wide range of stylistic options. However, if the document fails to specify a font, or if the requested font is not installed on the viewer's computer, a unicode-compliant browser should be able to render some intelligible representation of the page using a default font. In recent versions of Windows, for instance, the default font for rendering Bangla is Vrinda.
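To make the difference between the encodings concrete, here is a minimal Python sketch (Python is my choice for illustration; nothing in this article depends on it). It encodes a single Bangla letter both ways and shows the kind of header line a server might send; the exact header a given server emits is an assumption.

```python
# How many bytes one Bangla letter occupies under two unicode encodings.
letter = "\u0986"  # BENGALI LETTER AA

utf8_bytes = letter.encode("utf-8")       # sequence of 8-bit units
utf16_bytes = letter.encode("utf-16-be")  # 16-bit units, big-endian, no byte-order mark

print(utf8_bytes.hex(" "), "->", len(utf8_bytes), "bytes in UTF-8")
print(utf16_bytes.hex(" "), "->", len(utf16_bytes), "bytes in UTF-16")

# Plain ASCII text is byte-for-byte identical in UTF-8.
ascii_is_unchanged = "A".encode("utf-8") == "A".encode("ascii")

# A header of the kind a server might send to announce the encoding
# (illustrative; the exact header depends on the server configuration).
content_type = "Content-Type: text/html; charset=utf-8"
```

The same letter takes three bytes in UTF-8 and two in UTF-16, while latin text stays at one byte per letter in UTF-8, which is one reason UTF-8 is a comfortable choice for pages that mix HTML markup with Bangla text.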

Bangla-specific issues

Given the advantages of unicode and the wide support base already in place, it seems likely that unicode will become the encoding method of choice for Bangla. However, getting a Bangla document into unicode is not an entirely straightforward matter. Because of the structure of Bangla, encoding is more than a simple one-to-one mapping of written characters to individual unicode "code points". Before discussing how Bangla gets converted to unicode, it is helpful to have a little more background about the language itself.

Bangla is the eastern-most member of the Indo-Aryan branch of the Indo-European language group. It is closely related to a number of Brahmic languages, and its Sanskrit roots are apparent even today. The most popular related language is Hindi, and the Devanagari writing system presents the same difficulties as Bangla: vowels may be written before (or on both sides of) the consonants they modify, consonants may fuse with other consonants to produce novel conjuncts, and there are a host of signs that modify other nearby letters. The traditional order of Devanagari letters is reflected in their arrangement in unicode, and this character set serves as a template for related Indic languages. For the most part, rules that were developed for the ordering, combining and modifying of Devanagari characters are generalized in the unicode standards to each of these languages. The character sets of two languages, Assamese and Manipuri, differ from Bangla by only a few letters (ঌ, ৰ, ৱ).

Bangla is not a static language. It has gone through periods of evolution and reform, consistently moving towards a more streamlined and regular structure. In the last fifty years, several letters have effectively been discarded from the language. For historical reasons, the language has preserved a number of silent letters and phonetically redundant letters (for example, three letters, শ, ষ, and স, all representing the /sh/ sound). Even if some of these letters eventually retire from the language, they will still be found in unicode in perpetuity.

How Bangla is represented in Unicode

Modern Bangla consists of 39 consonants, 11 "free standing" vowels, and 11 corresponding vowel sounds pronounced after consonants. However, if you pick up a Bangla newspaper, you will be confronted with many more symbols which are compounds of two or three letters (conjuncts or ligatures). Established forms exist for around 250 conjuncts. In both Hindi and Bangla, there is the concept of an inherent (default) vowel sound which is assigned to consonants that are not explicitly modified by a vowel -- this is an /ah/ sound in Hindi, and an /aw/ sound in Bangla. Because of this, it is not equivalent to simply write blended letters sequentially, as that would imply an inherent vowel sound after each one. Just as the inherent vowel is not written in Bangla, it is not represented in unicode.

The diversity of conjuncts is an imposing hurdle for font-makers. Some conjuncts bear no resemblance to their constituent consonants, so there is no programmatic way to draw conjuncts based on the shape of their components. Consequently, any program which renders Bangla needs a font containing pre-drawn shapes for the conjuncts. However, unicode does not have to deal with rendering -- that is the job of the client program, such as a browser. Unicode does not attempt to give each conjunct a unique code. Instead, it represents the basic vowels and consonants and strings them together (along with a handful of special control characters) to represent Bangla words. Unicode's Bangla repertoire consists of almost 90 Bangla-related "code points". It is the job of the client application to put forth its best effort at rendering the letters with the fonts at hand, according to a set of rules that are also part of the unicode standard.
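If you want to see the repertoire for yourself, the following Python sketch lists every assigned code point in the Bengali block (0980h through 09FFh) using the standard unicodedata module. The count it prints depends on the version of the unicode tables bundled with your Python, so it may be somewhat higher than the figure quoted above.

```python
import unicodedata

# Collect every assigned code point in the Bengali block, U+0980..U+09FF.
# unicodedata.name() returns the default (None here) for unassigned code points.
bengali_block = [
    chr(cp)
    for cp in range(0x0980, 0x0A00)
    if unicodedata.name(chr(cp), None) is not None
]

for ch in bengali_block:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

print(len(bengali_block), "assigned code points in this unicode version")
```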

In unicode, the basic letters are arranged in a table. Each "code point" has a corresponding number. The numbers are usually written in hexadecimal notation, like 0986h (the equivalent of 2438 in decimal notation). That value, for example, corresponds to the independent vowel "আ". As another example, the value 0041h (65 decimal) corresponds to the roman capital letter "A" (just like in ASCII). Every "letter" has a unique code number.
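The correspondence between letters and numbers is easy to check. Here is a small Python sketch using only built-in functions and the standard unicodedata module:

```python
import unicodedata

aa = "\u0986"  # the independent vowel discussed above

code_point = ord(aa)          # letter -> number
print(hex(code_point))        # hexadecimal form
print(code_point)             # decimal form
print(unicodedata.name(aa))   # the official unicode name of the letter

capital_a = chr(0x0041)       # number -> letter
print(capital_a)
```

ord() and chr() convert in opposite directions, so chr(ord(x)) always gives back the letter you started with.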

A web page is essentially a text document that is sent from a server to a client (e.g., a browser) by a specific protocol. There are two ways to put Bangla characters into a web page: specifying the numerical code, which I will call the "Character Entity Reference" (CER; the HTML specification's own name for the numeric form is "numeric character reference"), or simply typing the character itself in the document. The former approach is useful if you are working on a platform that only supports the latin character set, or if you want the page's HTML source to be readable in latin characters. However, if you have a lot of Bangla text, entering it all as CERs would threaten your sanity. The second approach, using "inline" unicode characters, is more economical -- just type the document directly using Bangla characters where appropriate. Unfortunately, the "inline" unicode characters would appear as gibberish if the underlying HTML code were viewed in an application that is not unicode-aware.

A Bangla Webpage

Looking at some examples is the best way to make this concrete. Let's say that I want to make a webpage that displays the letter "আ", and I am using an editor that lets me type directly in Bangla and save the output in unicode format:

<html>
  <body>
      আ
  </body>
</html>

If there is a box or other non-Bangla character on the third line, your browser is not set up to view unicode (if so, see the web-browsers page about configuring a browser to display Bangla properly).
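If your editor cannot save in unicode format, a few lines of script can do the same job. The Python sketch below writes a page containing the letter to disk as UTF-8. The file name is hypothetical, and the meta line announcing the encoding is my addition to the bare example; it is a common precaution, not something the example strictly requires.

```python
import tempfile
from pathlib import Path

# The \u0986 escape stands for the Bangla letter, so this script itself
# can be saved in plain ASCII.
page = """<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    \u0986
  </body>
</html>
"""

# Hypothetical output location, chosen only for this illustration.
out_path = Path(tempfile.gettempdir()) / "bangla_demo.html"
out_path.write_text(page, encoding="utf-8")

# Reading the raw bytes back shows what a server would actually send.
raw = out_path.read_bytes()
print(raw)
```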

All you really need to write a web page is a text editor and an understanding of HTML. I have reviewed several Bangla editing software packages on another page.

There are many commercial products which facilitate the job of making web pages, for example: FrontPage, Netscape Composer, and Dreamweaver. If your operating system handles unicode Bangla keyboard input and the program can save in unicode format, you are set. If these programs do not allow direct entry of Bangla characters, but can save in unicode format, you can try one of the following tricks. The first option is to write the page with some dummy content and then re-open the page in a unicode-compliant text editor, replacing your dummy text with Bangla. Alternatively, you might be able to cut-and-paste text from your editor directly into the web-design program.

A more ecumenical approach would be to use the letter's Character Entity Reference (CER), although it is much more complicated and makes the document larger. The CER's number is written after an ampersand (&) and pound sign (#). The number is terminated by a semicolon. If the number is hexadecimal, it is preceded by an "x". Without the "x", the number is presumed to be in decimal notation. The code for an "আ" is 0986h, so:

<html>
  <body>
        &#x0986;
  </body>
</html>
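Typing these references by hand gets old quickly, so it is worth letting a script do it. The following Python sketch converts every non-latin character in a string to its hexadecimal reference; the function name to_references is my own invention. It leaves the special HTML characters (&, < and >) alone, so treat it as a starting point rather than a complete escaping routine.

```python
import html


def to_references(text: str) -> str:
    """Replace each non-ASCII character with a hexadecimal numeric reference."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):04X};"
        for ch in text
    )


encoded = to_references("\u0986")
print(encoded)

# html.unescape() from the standard library performs the reverse conversion,
# and accepts both the hexadecimal and the decimal form.
decoded_hex = html.unescape(encoded)
decoded_dec = html.unescape("&#2438;")
```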

If you go much beyond putting individual letters on a webpage, you need to be aware of more than just the CER codes. For example, what about conjuncts and all the issues about placement of vowels and diacritical marks? As I said above, only a limited number of core characters are part of unicode; these can, however, be strung together to make any Bangla character. It is the job of the browser to interpret the string that you have written in your HTML document. The rules for interpretation are explicitly listed in the unicode specification; look under "South Asian Languages". Read the section on Devanagari interpretation rules and then the notes about Bangla (Bengali). If you are using the WYSIWYG method, you do not need to consider these issues because the editor produces the proper string of unicode characters. If you are doing it "by hand", however, you will need to follow the rules exactly.

There are three important characters that have special functions under these rules: the hasant (also known by its Sanskrit designation, "virama"), the zero-width joiner (ZWJ) and the zero-width non-joiner (ZWNJ). The latter two characters are not part of Bangla, but are components of unicode.
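For reference, here are the code points of the three special characters, looked up by their official unicode names in a short Python sketch:

```python
import unicodedata

SPECIALS = {
    "hasant": 0x09CD,  # lives in the Bengali block
    "ZWJ": 0x200D,     # general-purpose control character
    "ZWNJ": 0x200C,    # general-purpose control character
}

names = {label: unicodedata.name(chr(cp)) for label, cp in SPECIALS.items()}

for label, cp in SPECIALS.items():
    print(f"{label:7} U+{cp:04X} {names[label]}")
```

Note that the two joiners sit well outside the Bengali block, which is why they are easy to overlook when reading the Bengali code table.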

Unicode Examples

The formal statement of the rules is a bit complex, but a few examples can convey most of what you need to know to form Bangla words from unicode.


Q. How do you write the letter "o", parts of which are written to both the left and the right of the phonetically preceding letter?
A. Write the word phonetically, putting the "o" after the consonant. Unicode will take care of writing the left portion of the letter in the proper position.
Example: বোমা (bomb)
Spelling: b-o-m-a
Unicode: 09AC 09CB 09AE 09BE

Q. How do you write an "independent vowel" like an initial "e"?
A. Just use the right unicode character! Unicode distinguishes between "e" at the beginning of a word, and the equivalent character in other positions.
Example: এবং (and)
Spelling: e-b-[inherent vowel]-ng
Unicode: 098F 09AC 0982

Q. How do you form a conjunct (merge consonants)?
A. Write the consonants in their phonetic order, placing a virama (hasant) between the partners.
Example: মুক্তি (liberation, freedom)
Spelling: m-u-k-virama-t-i
Unicode: 09AE 09C1 0995 09CD 09A4 09BF

Q. What if I want to make a three-letter conjunct?
A. Same deal, just put a virama between each character involved in the conjunct.
Example: স্ত্রী (wife)
Spelling: s-virama-t-virama-r-ee
Unicode: 09B8 09CD 09A4 09CD 09B0 09C0

Q. How about Ra-phalla (the form of র when it comes last in a conjunct)?
A. It is handled just like any other conjunct.
Example: প্রিয় (favorite)
Spelling: p-virama-r-i-y-[inherent vowel]
Unicode: 09AA 09CD 09B0 09BF 09DF

Q. What if র comes first in the conjunct, yielding "reph" (which is drawn above the succeeding consonant)?
A. Again, same rule as other conjuncts: write the word in phonetic order and place a virama between the র and the next letter.
Example: ধর্ম (religion)
Spelling: dh-[inherent vowel]-r-virama-m-[inherent vowel]
Unicode: 09A7 09B0 09CD 09AE

Q. How about the character ক্ষ -- anything funky about it?
A. You just have to remember that even though it is pronounced as a fusion between "ক" and "খ" in Bangla, historically it was derived by fusing a /k/ sound (क) and an /s/ sound (ष). The Bangla equivalents are ক and ষ, so the letter is composed of these two elements just like a normal conjunct.
Example: পরীক্ষা (exam)
Spelling: p-[inherent vowel]-r-ee-k-virama-s-a
Unicode: 09AA 09B0 09C0 0995 09CD 09B7 09BE

Q. How do you form Ja-phalla (or ya-phalla)?
A. Ja-phalla is a special case of the letter য. Put a virama in front of the য (remember -- it is not য়, although you might be tempted to think it might be).
Example: বন্যা (flood)
Spelling: b-[inherent vowel]-n-virama-y-a
Unicode: 09AC 09A8 09CD 09AF 09BE

Q. How do you place chandra bindu (nasalization mark) above a character?
A. Like other modification symbols, it comes after the letter (in this case, vowel) it modifies.
Example: বাঁ (left)
Spelling: b-a-bindu
Unicode: 09AC 09BE 0981

Q. So, when are the strange ZWJ and ZWNJ characters used?
A. They come into play when you are trying to express a "half letter". For example, khanda ta (ত্‍) is the /t/ sound of ত, but specifically lacks an inherent vowel sound and does not enter into conjuncts. It is primarily used to spell words borrowed from other languages.
Example: উত্‍সব (festival)
Spelling: u-t-virama-ZWJ-sh-[inherent vowel]-b
Unicode: 0989 09A4 09CD 200D 09B8 09AC

Q. What about ZWNJ?
A. ZWNJ comes into play if you want to "deaden" a regular consonant, that is, make it like khanda ta: no inherent vowel sound, no combining into conjuncts. This is usually expressed in writing by placing an explicit hasant (virama) below the character. In unicode, a ZWNJ placed after the virama keeps the two consonants from fusing, so the hasant stays visible. Hasants are not very common; you are most likely to encounter one in a dictionary or in the spelling of a borrowed foreign word.
Example: ক্‌ষ ("k-s")
Spelling: k-virama-ZWNJ-s
Unicode: 0995 09CD 200C 09B7
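If you want to experiment with these sequences, the Python sketch below rebuilds several of the example words from their hexadecimal code points and prints the official name of each character. The helper name from_hex and the choice of examples are mine. How the words look on screen still depends on your terminal and fonts; the script only demonstrates the order in which the code points are stored.

```python
import unicodedata

# A few of the sequences listed above, keyed by their English gloss.
EXAMPLES = {
    "bomb": "09AC 09CB 09AE 09BE",
    "freedom": "09AE 09C1 0995 09CD 09A4 09BF",
    "wife": "09B8 09CD 09A4 09CD 09B0 09C0",
    "religion": "09A7 09B0 09CD 09AE",
    "exam": "09AA 09B0 09C0 0995 09CD 09B7 09BE",
    "festival": "0989 09A4 09CD 200D 09B8 09AC",
}


def from_hex(sequence: str) -> str:
    """Turn a space-separated list of hexadecimal code points into a string."""
    return "".join(chr(int(code, 16)) for code in sequence.split())


words = {gloss: from_hex(seq) for gloss, seq in EXAMPLES.items()}

for gloss, word in words.items():
    print(gloss, word)
    for ch in word:
        print("   ", f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Printing the names makes the storage order obvious: in the first word, for instance, the vowel sign follows its consonant in the string even though part of it is drawn to the left.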

In addition to these letters there are a number of obsolete characters, modifying signs, and punctuation marks used in Bangla. Also, Bangla employs its own characters for numerals 0 through 9, as well as some special number symbols (infrequently) employed for monetary values. Refer to the Bengali Code Table for details.

Finally, if you are still hungry for more examples of how to implement Bangla on a web-page, take a look at the HTML source of this document using the "view source" option in your browser. The whole page is written with latin characters, for easy viewing.

International Domain Names

There is one other standard that should be mentioned, although it is often an afterthought. What do you do if you are in China and want to type in a URL in Chinese? Right now, almost everyone uses latin characters for URLs. There is a reason for this: by established convention, URLs are written in 7-bit ASCII code.

However, one of the purposes of URLs is to make documents easier to find by making the names more "human readable". The standard which permits this is RFC 3492, known as "punycode". If you had a computer that operated "natively" in unicode -- i.e., you could type Bangla on the keyboard and the expected characters would appear on screen -- you should be able to type a URL in Bangla and have it find a Bangla site. Invisibly to the user, the URL would be converted to punycode, resulting in some bizarre-looking URL name in latin characters which would be sent over the web. Pleasantly, the user would never have to look at the URL in its latin-character form. Some such sites might want to have an easily remembered latin-character URL as well, but it is a trivial matter to point both URLs to the same website.
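Python happens to ship with a punycode codec, so the conversion is easy to try. The sketch below encodes a single Bangla label and adds the "xn--" prefix that marks an encoded label in a domain name. The domain itself is hypothetical, and a real implementation would also normalize the text first, a step this sketch skips.

```python
# A Bangla label, written with escapes so the script stays in ASCII.
label = "\u09ac\u09be\u0982\u09b2\u09be"

# Encode the label with the punycode algorithm of RFC 3492.
encoded = label.encode("punycode").decode("ascii")

# "xn--" marks a punycode-encoded label; ".example" is a reserved,
# hypothetical top-level domain used here only for illustration.
hostname = "xn--" + encoded + ".example"
print(hostname)

# The conversion is reversible, so no information is lost.
round_trip = encoded.encode("ascii").decode("punycode")
```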

Matters of Style

You can make your page look however you want through creative use of HTML and style sheets, a discussion of which is beyond the scope of this document. In terms of readability, two Bangla-specific style elements should be mentioned: choice of font and character size. Bangla characters are more complicated than latin characters, and may be difficult to read on a computer screen. Your choice of font may be constrained by what fonts are installed on the viewer's computer, but it cannot hurt to recommend a font, or a series of alternate fonts, through a cascading style sheet. Heavier (bolder) fonts tend to be more legible. You may wish to use a larger font size for Bangla characters, particularly if latin and Bangla fonts are mixed on one page. Where possible, avoid "hard-coding" the size of Bangla fonts. A 12-point font may look great on your monitor, but be illegible on a monitor with higher resolution.

If a page is primarily in Bangla, say so. Since HTML 4.0, documents can declare their primary language -- this may provide a clue to the browser and help it render the document correctly. A page that is mostly in English, like this one, would open with the tag:

<html lang="en">

The "en" refers to English, using a two-letter abbreviation which is defined in ISO 639. Depending on what version of HTML or XML you are using, there are several ways of indicating the language. If this document were in Bangla, I would have replaced "en" with "bn". Do not confuse this with the "EN" at the end of the document type declaration (<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">): that "EN" describes the language of the DTD itself, and it stays "EN" whatever the language of the page.

On the Server End...

The web browser is only one component of the software involved in communicating a Bangla document from a repository to a viewer. The document resides on a computer somewhere, and that computer must have an operating system like Linux, Windows or SunOS. The software that listens and replies to requests for web documents is the webserver, for example, Apache or IIS. If dynamically generated documents are to be prepared on the server-end of the connection, additional components come into play. Many implementations employ an interpreted language like Perl, Python or PHP in combination with a database like MySQL, Oracle, DB2 or Postgres. Unicode-compliance is a consideration with each of these software components.

The server's operating system and web server have minimal impact on delivery of unicode documents. As far as the server is concerned, it could easily spew forth any run of ones and zeroes without worrying about their meaning. The advantage of a unicode-compliant operating system with support for Bangla is that you can see your file names in Bangla when working from the server. With the proper localization options, the menus, dialog boxes and help screens should also appear in Bangla. So far, none of the major operating systems have achieved this level of localization, although Linux is approaching this point.

Generation of dynamic content always involves more work on the server end, and it is here that some conscious decisions must be made about unicode-compliance. Some databases (for example, MySQL version 4.1) are now unicode-compliant. This means that you can store data directly in unicode format, and include unicode characters in queries. A special consideration for databases is collation. The numerical order of the code points in unicode is not meant to define the sort order of a language. There is a traditional order of letters in Bangla which is used in dictionaries, but when you think about the rules for this ordering, they are relatively complex. These rules can be defined algorithmically, or by example. They are important for sorting lists and comparing values.
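Here is a small Python sketch of why sorting and comparison need care. It uses the letter য় (09DFh), which unicode also allows to be written as য (09AFh) followed by a dot-below sign (09BCh). Treating য় as a variant of য for sorting purposes is an assumption I make purely for illustration; it is not a statement of the dictionary rules.

```python
import unicodedata

YA, RA, YYA = "\u09af", "\u09b0", "\u09df"

# A naive sort compares raw code point numbers, so U+09DF lands
# after U+09B0 simply because its number is larger.
by_code_point = sorted([RA, YYA, YA])


def decomposed(text: str) -> str:
    """Sort key: the fully decomposed (NFD) form of the text."""
    return unicodedata.normalize("NFD", text)


# With the decomposed form as the key, U+09DF is compared as
# U+09AF followed by U+09BC, so it sorts right after U+09AF.
by_decomposed = sorted([RA, YYA, YA], key=decomposed)

# Comparison has the same pitfall: two spellings of the same letter are
# unequal as raw strings, but equal once both are normalized.
two_part = "\u09af\u09bc"
raw_equal = YYA == two_part
normalized_equal = (
    unicodedata.normalize("NFC", YYA) == unicodedata.normalize("NFC", two_part)
)
```

A database that is aware of Bangla collation does this kind of work for you; otherwise the normalization and sort keys have to be supplied by your own code.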

Some sort of software must fall in a middle layer, between the database and the web server. If the database is not designed to work with unicode, it can still store the information, but this "glue" layer will have to do the job of massaging the data into the appropriate format for storage in, and retrieval from, the database. In the best situation, the database, glue software, server and operating system are all unicode-compliant and aware of the Bangla collation rules.
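As a rough sketch of what the "glue" might do, the Python code below treats the database as a dumb container for bytes: text is encoded to UTF-8 on the way in and decoded on the way out. I use the sqlite3 module from the standard library only because it needs no setup; the table and function names are hypothetical, and SQLite itself can in fact store unicode text directly.

```python
import sqlite3


def store(conn: sqlite3.Connection, text: str) -> int:
    """Encode the text as UTF-8 bytes and store it in a binary column."""
    cursor = conn.execute(
        "INSERT INTO notes (body) VALUES (?)", (text.encode("utf-8"),)
    )
    return cursor.lastrowid


def fetch(conn: sqlite3.Connection, row_id: int) -> str:
    """Read the bytes back and decode them into text again."""
    (raw,) = conn.execute(
        "SELECT body FROM notes WHERE id = ?", (row_id,)
    ).fetchone()
    return raw.decode("utf-8")


# An in-memory database with a hypothetical one-column table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body BLOB)")

original = "\u09ac\u09be\u0982\u09b2\u09be"
row_id = store(conn, original)
restored = fetch(conn, row_id)
```

The catch is that a database which only sees bytes cannot sort or search the text sensibly, so any collation has to happen in the glue layer as well.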
