
You may know that Anandabazar Patrika (ABP), the largest Bengali newspaper from India, cannot be read in any of the modern browsers such as Mozilla Firefox, Apple Safari, Google Chrome, Opera, and others (the only exception is Microsoft IE). This is because, even in 2010, Anandabazar has not adopted Unicode, the international standard for digital Bangla.

I am launching a petition site with a live Unicode proxy which demonstrates that Anandabazar Patrika could be read not only in all modern browsers but also on mobile phones, if they cared to adopt the international standard.

If you support this petition, then make your voice heard.

Thanks,

If there is any scientific invention that has literally touched the lives of more than 4.6 billion people around the world, it is simply the mobile phone. In India alone, there are now 600 million mobile phone users, and the number is growing by more than 15 million per month. Furthermore, the number of Indians accessing the internet from mobile phones is growing much faster than the number using desktops or laptops. With the upcoming launch of 3G services by private operators, this number could grow even faster.

The mobile phone is also the most economical personal device for accessing the internet in India. One can buy an internet-enabled handset for almost the same price as a high-end modem. As the mobile phone penetrates the vast rural areas of the country, it also brings the internet to the masses. Consequently, the need for local-language websites is now greater than ever.

On the technology front, viewing Indian-language webpages has become much easier thanks to the Opera Mini browser. With cloud-based rendering of complex scripts, Opera enables its users to view Unicode-compliant webpages even on low-end phones that don’t have Indic rendering capability. The list of devices supported by Opera Mini is huge, and it includes Apple’s iconic iPhone, Google Android phones, and many more devices made by Nokia and other major manufacturers.

To view Indic content in Opera Mini, you need to follow these two steps (see this or this for detailed instructions):

  1. Type config: in the address bar and then press enter
  2. Set Use bitmap fonts for complex scripts to Yes and then Save

Also, turning on Mobile view in the settings helps webpages load faster. Here is a screenshot from my Nexus One showing the Unicode-compliant Rabindra-rachanabali website in Bengali.

Unicode Bangla in Nexus One

Apart from Opera Mini, some phones also have native Indic rendering capability, either partial or full. Here are screenshots of a little app that I wrote for accessing Anubadok Online. This app, with an embedded Lohit-bengali font, should work on Android phones running version 1.6 or higher.

Anubadok Online on Android

You can download the app from here and the source code from here. Out of the box, the matras are not in the correct order (on the left), whereas a little bit of manipulation can display them correctly (on the right). Native Indic rendering in Android may improve soon, as the Skia graphics library used in Android now includes the Harfbuzz rendering engine.
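
The post does not say what the manipulation was, so the following is only a guess at the general idea, not the app's actual code. Unicode stores a Bengali vowel sign after its consonant even when the sign is drawn to the left of it. A renderer without Indic shaping draws code points in storage order, so one workaround is to move the left-side matras in front of their consonant cluster before display. All function names here are made up for illustration, and the sketch ignores reph and other shaping rules.

```python
# Sketch only: rearrange Bengali text from Unicode (logical) order into the
# left-to-right visual order needed by a renderer that does no Indic shaping.

PRE_BASE = {"\u09BF", "\u09C7", "\u09C8"}  # i-kar, e-kar, oi-kar
TWO_PART = {
    "\u09CB": ("\u09C7", "\u09BE"),  # o-kar  = e-kar + aa-kar
    "\u09CC": ("\u09C7", "\u09D7"),  # ou-kar = e-kar + au length mark
}
HASANTA = "\u09CD"


def is_consonant(ch):
    """True for Bengali consonant letters (ka..ha plus rra, rha, yya)."""
    return "\u0995" <= ch <= "\u09B9" or ch in "\u09DC\u09DD\u09DF"


def cluster_start(out):
    """Index in `out` where the trailing consonant cluster begins."""
    i = len(out)
    while i > 0 and is_consonant(out[i - 1]):
        i -= 1
        if i >= 2 and out[i - 1] == HASANTA and is_consonant(out[i - 2]):
            i -= 1  # step over the hasanta and keep walking left
        else:
            break
    return i


def to_visual_order(text):
    """Move left-side matras before the consonant cluster they follow."""
    out = []
    for ch in text:
        if ch in TWO_PART:
            left, right = TWO_PART[ch]
        elif ch in PRE_BASE:
            left, right = ch, ""
        else:
            out.append(ch)
            continue
        out.insert(cluster_start(out), left)
        if right:
            out.append(right)
    return "".join(out)


if __name__ == "__main__":
    # ka + i-kar, stored logically, printed in visual order
    print([f"U+{ord(c):04X}" for c in to_visual_order("\u0995\u09BF")])
```

A renderer that already does proper shaping would reorder the signs itself, so a step like this only makes sense where shaping is known to be missing.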

Overall, the rise of the mobile internet in India may be a boon for the digital representation of Indian languages. As the need for local-language content grows, it will encourage wider adoption of Unicode in India. It may also force many content providers to abandon the non-standard encodings that they continue to use even now.

On New Year’s Day, Anandabazar Patrika, the largest Bengali newspaper from West Bengal, began one of its editorials with the sentence “সাম্প্রতিক পশ্চিমবঙ্গের জনপ্রিয়তম শব্দবন্ধ ‘পরিবর্তন চাই’” (“The most popular words in recent Bengal: ‘We want change’”). The same editorial ends with the proposition “নূতন বৎসরের মূলমন্ত্র হউক ‘পরিবর্তন চাই’” (“Let the mantra for the new year be ‘we want change’”). Leaving aside the politics, there is a serious need for change in technology adoption in West Bengal, and that is to help its beloved language, Bengali, survive in its digital avatar.

By their own words, Anandabazar Patrika (ABP) may sound like a champion of change, but in practice they are no different. Since they are a leader in the Bengali publishing industry, one might expect them to be at the forefront of improving the digital standard for Bengali. Unfortunately, their actions say just the opposite. They continue to use a non-standard Bitstream font technology on their website instead of the international standard, Unicode. One of their “supported browsers” is Netscape Communicator, official support for which ended in 2008. They also recommend the Firefox plugin Padma. As the author of the ABP support in Padma, I find this rather strange: they are asking users to convert their content to Unicode (by using Padma) rather than serving it directly in Unicode.

It may be mentioned that, like many other non-Latin languages, the digital representation of Bengali text suffered from the lack of an encoding standard in its early phase. However, with the advent of Unicode, the universal encoding standard, this is no longer an issue. The Unicode standard has been widely adopted across operating systems, and all recent versions of Windows, Mac, and Linux support Unicode natively. According to statistics from the internet giant Google, Unicode has been the most frequently used encoding on the internet since 2008.
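
To make the idea of a universal encoding a little more concrete, here is a minimal sketch in Python (chosen only for illustration) that lists the code points of the word "Bangla" written in Bengali script and encodes it as UTF-8. The helper name `describe` is made up for this example.

```python
import unicodedata

# The Bengali word "Bangla", written with escapes so the code points are visible.
word = "\u09AC\u09BE\u0982\u09B2\u09BE"


def describe(text):
    """Return (code point, official Unicode name) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for code, name in describe(word):
    print(code, name)

# Every Bengali code point lies in U+0980..U+09FF and takes 3 bytes in UTF-8.
utf8 = word.encode("utf-8")
print(len(word), "code points,", len(utf8), "bytes in UTF-8")
```

Because every system agrees on these code points, the same bytes display as the same Bengali text on any platform that has a Bengali font, which is exactly what a font-specific encoding cannot offer.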

There has also been a significant increase in Unicode adoption for Bengali in the recent past. Let me mention some examples.

Bangladesh:

It may be noted that Bengali is the national language of Bangladesh, which suffered from the same problem. However, there has been a dramatic increase in the adoption of Unicode there lately. Prothom Alo, the largest newspaper in Bangladesh by circulation, has now switched to Unicode; until recently it used its own proprietary encoding. Other prominent newspapers that have switched from proprietary encodings to Unicode are Amar Desh, Sangbad, Daily Sangram, Manab Zamin, and Samakal.

West Bengal:

The West Bengal government has now adopted Unicode 5.0 as the standard encoding for Bengali. Its official website, Banglar Mukh, has finally switched to Unicode. Furthermore, with its funding, the entire literary work of Nobel laureate Rabindranath Tagore has been released in Unicode. Tagore’s works are now in the public domain due to the expiration of copyright. The credit for these encouraging developments must go to the Society for Natural Language Technology Research and to MAT-3 Impex, the company behind some of these implementations.

Coming back to the technology front, there is now a new kid in the great browser arena: Google Chrome. While this snappy browser supports Unicode natively, it currently uses a buggy font for Bengali by default, which causes some Bengali text to appear garbled. Most of these issues can be solved by simply changing the default font. To do so, click on

Wrench-->Options-->Under the Hood-->Change fonts and language settings

and then choose the font of your choice for Bengali. For example, on Ubuntu you can choose Freesans or Freeserif. These fonts have nice glyphs for Bengali.

Padma is a Firefox plugin that enables users to read various Indic websites by converting their non-standard text to Unicode text. Padma has supported Anandabazar Patrika (ABP), the largest Bengali newspaper, for more than two years now. However, the ABP (and also Bartaman Patrika) support in Padma has a major matra-rendering bug, which is explained in detail here.

A couple of months ago, I made an effort toward fixing the above issue. You can get an improved version of Padma from here. This version resolves most (if not all) of the matra rendering issues for both Anandabazar Patrika (ABP) and Bartaman Patrika.

Update (Jul 9, 2010): To solve this problem at its root, please visit the Anandabazar Unicode petition site, where you can read Anandabazar using any browser such as Firefox, Chrome, Opera, Safari, or IE, as well as mobile phones.
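
The details of the bug are in the linked write-up and are not reproduced here. The sketch below only illustrates the general kind of step such a converter has to get right. It is written in Python for brevity and is not Padma's code. Legacy font encodings of this kind usually store a left-side matra before its consonant, while Unicode expects it after, so a converter has to both map glyph codes and reorder them. The glyph table is entirely invented for this example, and the sketch ignores conjuncts.

```python
# Hypothetical glyph table: these legacy codes are invented for illustration
# and are NOT the real Anandabazar or Bartaman font mapping.
LEGACY_TO_UNICODE = {
    "k": "\u0995",  # ka
    "b": "\u09AC",  # ba
    "l": "\u09B2",  # la
    "A": "\u09BE",  # aa-kar (drawn after the consonant)
    "i": "\u09BF",  # i-kar  (drawn before the consonant)
    "e": "\u09C7",  # e-kar  (drawn before the consonant)
}
PRE_BASE = {"\u09BF", "\u09C7", "\u09C8"}


def is_consonant(ch):
    """True for Bengali consonant letters (ka..ha plus rra, rha, yya)."""
    return "\u0995" <= ch <= "\u09B9" or ch in "\u09DC\u09DD\u09DF"


def legacy_to_unicode(legacy_text):
    """Map legacy glyph codes to Unicode and put matras in logical order."""
    out = []
    pending = None  # left-side matra still waiting for its consonant
    for code in legacy_text:
        ch = LEGACY_TO_UNICODE.get(code, code)  # unknown codes pass through
        if ch in PRE_BASE:
            if pending is not None:
                out.append(pending)
            pending = ch
        elif pending is not None:
            if is_consonant(ch):
                out.extend([ch, pending])  # consonant first, then its matra
            else:
                out.extend([pending, ch])  # nothing to attach to, keep as-is
            pending = None
        else:
            out.append(ch)
    if pending is not None:
        out.append(pending)
    return "".join(out)


if __name__ == "__main__":
    # legacy "ik" (i-kar glyph, then ka) becomes ka followed by i-kar
    print([f"U+{ord(c):04X}" for c in legacy_to_unicode("ik")])
```

A real converter also has to carry the matra across a whole conjunct rather than a single consonant, which is where bugs of this sort tend to appear.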

This Friday I experienced another extreme event in my life. No, it wasn’t a personal event but rather an extreme weather condition. This morning I saw the temperature dip to 34 degrees Celsius below zero. Yeah, that’s right, it was -34 degrees Celsius, and to make it worse the wind chill factor was -42 degrees Celsius. The wind chill factor is roughly a measure of the temperature that you actually feel because of the wind. Incidentally, I am teaching the “Math-3503: Differential Equation for Engineers” course at UNB this semester, and I had a class in the morning at 8:30am. So, as you can imagine, I had the opportunity to feel this extreme temperature head on :-|. I took this screenshot from my Thinkpad before I went out.

Extreme Weather


While waiting at the bus stop, I could literally feel the water vapour freezing in my nose as I breathed. Later, in my office, I was contrasting this with my days in Chennai. I stayed in Chennai for six years, where the summer temperature regularly hit +44 degrees Celsius. In other words, my experience of extreme temperatures now ranges from +44C to -34C. Whoops!!
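
As a rough illustration of the wind chill figure quoted above, here is a small sketch using the wind chill index formula adopted by Environment Canada in 2001. The post does not say which formula the weather display used or what the wind speed was, so both the choice of formula and the 7 km/h wind in the example are assumptions.

```python
def wind_chill_celsius(air_temp_c, wind_kmh):
    """Wind chill index for an air temperature in Celsius and a wind speed
    in km/h measured at 10 m height.

    Intended for cold conditions (about 10 C or below) with a wind of
    roughly 5 km/h or more.
    """
    v = wind_kmh ** 0.16
    return 13.12 + 0.6215 * air_temp_c - 11.37 * v + 0.3965 * air_temp_c * v


if __name__ == "__main__":
    # -34 C with an assumed light wind of 7 km/h
    print(round(wind_chill_celsius(-34.0, 7.0)))
```

With these assumed inputs the index comes out close to the -42 reported in the post, and a stronger wind pushes it lower still.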

For the last few days I have been experimenting with several JavaScript-based virtual keyboards, mainly for use in Ankur’s English to Bengali dictionary project. This project aims at a comprehensive English to Bengali dictionary, freely available to everyone. As of now the project ranks highly in Google search for the keywords “English to Bengali dictionary”. The project relies on user contributions to enhance its database. Thus we needed a browser-based solution that helps users contribute new dictionary entries (in Unicode Bengali) using a standard English keyboard and without installing any keyboard layout. It also helps avoid contributions transliterated into English where they should rather be typed in Unicode Bengali.

We have been using bnwebtools for the last year for this purpose. However, this tool has recently moved in a non-GPL direction, so we needed a replacement. I was, nevertheless, looking not just for a replacement but for a next-generation solution :-).

After playing with a few of them, I decided to explore the JavaScript VirtualKeyboard by Ilya Lebedev. To my surprise, it already had support for many Indic keyboard layouts. Unfortunately, it didn’t have any Bengali layouts. It appeared that, to include a new layout, the layout needs to be described in a *.klc file built using the Microsoft Keyboard Layout Creator tool. Given that I had no intention of booting into Windows, I wrote a Perl script to create the *.klc file in the desired format.

To begin with, I have converted three Bengali layouts. For the Inscript layout, I chose the Baishakhi Inscript used in Baishakhi Linux, which is being promoted by the Govt of West Bengal (nevertheless, see the recent postings by Sayamindu, Sankarshan, and Runa on some controversies surrounding it). Then I converted Ankur’s Probhat layout, which I have been using since the beginning. I also converted another popular Bengali layout, Unijoy.

To see this virtual keyboard in action, visit the Ankur E2B dictionary project or its virtual keyboard demo page. If you have used this virtual keyboard and have any comments or suggestions, please feel free to post them here.
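
At its core, a keyboard layout of this kind is a table from keys on an English keyboard to Bengali characters. The following minimal Python sketch shows that idea only. The key assignments are invented for illustration and are not the actual Probhat, Unijoy, or Inscript layouts, and it has nothing to do with the *.klc format or the VirtualKeyboard API.

```python
# Illustrative key map only: NOT the real Probhat, Unijoy or Inscript layout.
KEY_MAP = {
    "k": "\u0995",  # ka
    "b": "\u09AC",  # ba
    "l": "\u09B2",  # la
    "a": "\u09BE",  # aa-kar
    "i": "\u09BF",  # i-kar
    "M": "\u0982",  # anusvara, on a shifted key
}


def type_keys(keys, key_map=KEY_MAP):
    """Translate a sequence of key presses into Unicode Bengali text.

    Keys that are not in the layout are passed through unchanged.
    """
    return "".join(key_map.get(key, key) for key in keys)


if __name__ == "__main__":
    # b, a, M, l, a gives the word "Bangla" in Bengali script
    print(type_keys("baMla"))
```

Since Unicode stores vowel signs after the consonant, the user types them in spoken order and the renderer takes care of where each sign is drawn.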

After a gap of almost two years, I am happy to announce the second official release (version 0.2.0) of Anubadok, a free (as in freedom) machine translation system for English to Bengali. Anubadok is written in Perl and uses the Penn Treebank annotation system for natural language processing. To run Anubadok 0.2.0, you need to have the part-of-speech tagger GPoSTTL installed on your system. The Anubadok system can be accessed online through Anubadok Online, an interface run by Ankur.

The first official release (ver. 0.1) of Anubadok was an experimental release which mainly served as a proof of concept for an open-source English to Bengali machine translation system.

With the release of version 0.2.0, I am glad to upgrade its official tag from “an experimental software” to “a software under development” with clear and specific implementation targets. However, given the nature of the project, there are no specific time frames for future releases. Further, given that machine translation is considered an open research topic in computational linguistics, you should expect to see some surprises ;) even in well-implemented situations, especially if you are comparing the results of machine translation with human translations.

In English, there are four types of sentences by purpose: Declarative, Imperative, Interrogative and Exclamatory. Each of these further falls into one of four basic sentence structures: Simple, Compound, Complex and Compound-Complex.

The table below gives the approximate implementation status for each sentence type in the current release and, conversely, the targets for future implementations.

Status Table (Version: Anubadok-0.2.0)

                    Declar.  Imper.  Interro.  Exclam.
Simple              W        W       W         M
Compound            M        M       M         M
Complex             N        N       N         N
Compound-Complex    N        N       N         N

W: Well implemented
M: Moderately implemented
N: Not/Not-well implemented

Anubadok does not yet have any code to handle Complex or Compound-Complex sentences, not even moderately. This is where the next push for development is needed.

A few other salient features of this release:

  • The execution method of the Anubadok system has been rewritten. Anubadok itself is now implemented as a Perl module. This means one can access Anubadok directly from a Perl program by including the Anubadok libraries (Perl modules), or from any other program by using an appropriate Perl module wrapper.
  • The notion of “testsuites” has been introduced for Anubadok. For a given English sentence, a testsuite compares the machine-translated sentence with the expected Bengali sentence. This is quite a useful tool while adding new features or experimenting, as it ensures that already implemented algorithms are not affected.
  • The Anubadok system can now handle several kinds of input documents, including plain text files, any XML documents, and HTML files with in-line JavaScript and CSS. Further, as before, it is capable of translating Portable Object (PO) files directly.
  • The Anubadok packaging has been completely reorganized so that it has the basic structure of a standard Perl package. Consequently, Anubadok can be installed by following the standard Perl module installation method.
  • Anubadok-0.2.0 comes with an updated dictionary having 15K+ entries in its database. This is almost double the number of entries it had in the 0.1 release. Credit for this goes to all the contributors to the Ankur English to Bengali dictionary project. Anubadok’s dictionary is now updated regularly using database dumps of the Ankur E2B dictionary.
  • Anubadok has now moved to its new website hosted by SourceForge.

    http://anubadok.sourceforge.net

    The latest source code of Anubadok can be downloaded from the “trunk” branch of its SVN repository.

  • Anubadok Online, the online interface to the Anubadok system, has been upgraded substantially. It runs directly on the SVN version of the Anubadok engine. New entries contributed by users through this interface are submitted automatically to the Ankur E2B dictionary project.
  • A brief document is now available for download as a PDF file from its website. It describes the internal workings and the algorithm used by the Anubadok system by considering a specific example sentence.
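
The testsuite idea from the list above can be sketched in a few lines. This is a Python illustration of the general technique (compare each machine translation with an expected sentence and report mismatches), not Anubadok's actual Perl testsuite. The function names, the case format, and the toy translator are all made up for this example.

```python
def run_testsuite(translate, cases):
    """Run `translate` on each English input and collect the mismatches.

    `cases` is a list of (english, expected_bengali) pairs. The result is a
    list of (english, expected, actual) tuples for the cases that failed.
    """
    failures = []
    for english, expected in cases:
        actual = translate(english)
        if actual != expected:
            failures.append((english, expected, actual))
    return failures


def toy_translate(sentence):
    """Stand-in for a real translator: a two-word lookup table."""
    table = {"water": "\u099C\u09B2", "book": "\u09AC\u0987"}
    return table.get(sentence.lower(), sentence)


CASES = [
    ("water", "\u099C\u09B2"),
    ("book", "\u09AC\u0987"),
    ("tree", "\u0997\u09BE\u099B"),  # not in the toy table, so this one fails
]

if __name__ == "__main__":
    for english, expected, actual in run_testsuite(toy_translate, CASES):
        print(f"FAIL {english!r}: expected {expected!r}, got {actual!r}")
```

Running such a suite after every change gives a quick signal when a new rule breaks a sentence that used to translate correctly.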