
Archive for July, 2008

After a gap of almost two years, I am happy to announce the second official release (version 0.2.0) of Anubadok, a free (as in freedom) machine translation system for English to Bengali. Anubadok is written in Perl and uses the Penn Treebank annotation system for natural language processing. To run Anubadok 0.2.0, you need to have the part-of-speech tagger GPoSTTL installed on your system. Anubadok can also be accessed online through Anubadok Online, an interface run by Ankur.

The first official release (ver. 0.1) of Anubadok was an experimental release that mainly served as a proof of concept for an open-source English to Bengali machine translation system.

With the release of version 0.2.0, I am glad to upgrade its official tag from “experimental software” to “software under development” with clear and specific implementation targets. However, given the nature of the project, there are no specific time frames for future releases. Further, given that machine translation is considered an open research topic in computational linguistics, you should expect to see some surprises 😉 even in well-implemented situations, especially if you are comparing the results of machine translation with human translations.

In English, there are four types of sentences: Declarative, Imperative, Interrogative and Exclamatory. Each of these can in turn take one of four basic sentence structures: Simple, Compound, Complex and Compound-Complex.

The table below gives the approximate implementation status for each sentence type in the current release; conversely, it shows the targets for future implementations.

Status Table (Version: Anubadok-0.2.0)

                    Declar.   Imper.   Interro.   Exclam.
Simple              W         W        W          M
Compound            M         M        M          M
Complex             N         N        N          N
Compound-Complex    N         N        N          N

W: Well implemented
M: Moderately implemented
N: Not/Not-well implemented

Anubadok does not yet have any code to handle Complex or Compound-Complex sentences, not even moderately. This is where the next push for development is needed.

A few other salient features of this release:

  • The execution method of the Anubadok system has been rewritten. Anubadok itself is now implemented as a Perl module. This means one can access Anubadok directly from a Perl program by including the Anubadok libraries (Perl modules), or from any other program by using an appropriate Perl module wrapper.
  • The notion of “testsuites” has been introduced for Anubadok. For a given English sentence, a testsuite compares the machine-translated sentence with the expected Bengali sentence. This is quite a useful tool when adding new features or experimenting, as it ensures that already implemented algorithms are not affected.
  • The Anubadok system can now handle several kinds of input documents, including plain text files, arbitrary XML documents, and HTML files with inline JavaScript and CSS. Further, as before, it can translate Portable Object (PO) files directly.
  • Anubadok packaging has been completely reorganized so that it has the basic structure of a standard Perl package. Consequently, Anubadok can be installed using the standard Perl module installation method.
  • Anubadok-0.2.0 comes with an updated dictionary with 15K+ entries in its database. This is almost double the number of entries in the 0.1 release. Credit for this goes to all the contributors to the Ankur English to Bengali dictionary project. Anubadok’s dictionary is now updated regularly using database dumps of the Ankur E2B dictionary.
  • Anubadok has now moved to its new website, hosted by SourceForge.

    http://anubadok.sourceforge.net

    The latest source code of Anubadok can be downloaded from the “trunk” branch of its SVN repository.

  • Anubadok Online, the online interface to the Anubadok system, has been upgraded substantially. It runs directly on the SVN version of the Anubadok engine. New entries contributed by users through this interface are submitted automatically to the Ankur E2B dictionary project.
  • A brief document is now available for download as a PDF file from the website. It describes the internal workings and the algorithm used by the Anubadok system by working through a specific example sentence.
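
The testsuite idea from the list above can be illustrated with a small sketch. This is not Anubadok's actual testsuite (which is part of its Perl package); it is a hypothetical Python stand-in that only shows the principle of comparing each machine translation with an expected Bengali sentence. The names `run_testsuite` and `toy_translate`, and the two sample sentence pairs, are assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def run_testsuite(
    translate: Callable[[str], str],
    cases: List[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (source, expected, actual) for every case whose output differs."""
    failures = []
    for source, expected in cases:
        actual = translate(source)
        # Ignore leading/trailing whitespace when comparing.
        if actual.strip() != expected.strip():
            failures.append((source, expected, actual))
    return failures


def toy_translate(sentence: str) -> str:
    """Hypothetical stand-in for the real translator: a fixed lookup table."""
    table = {"I eat rice.": "আমি ভাত খাই।"}
    return table.get(sentence, "")


# Each case pairs an English sentence with the expected Bengali output.
cases = [
    ("I eat rice.", "আমি ভাত খাই।"),
    ("He eats rice.", "সে ভাত খায়।"),
]

failures = run_testsuite(toy_translate, cases)
for source, expected, actual in failures:
    print(f"FAIL: {source!r}: expected {expected!r}, got {actual!r}")
print(f"{len(cases) - len(failures)} of {len(cases)} passed")
```

Running such a check after every change gives an early warning when a new rule breaks a sentence that used to translate correctly.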


[Update: Please see the bottom of this post for a link to an improved version of Padma.]

Anandabazar Patrika (ABP) and Bartaman Patrika (BP) are two of the big four well-known Bengali newspapers published from West Bengal, India. In the Internet era, their online versions are not just a matter of convenience but the only route of access for many of us. Unfortunately, their online versions continue to live in the past, using a non-standard, ancient dynamic font technology instead of upgrading to standard Unicode.

The worst part is that to view their websites you need to have Internet Explorer installed on your machine. So if you are a Linux or Mac (or any other non-Windows) user, you are left on your own.

Fortunately, there is now a simple way for Firefox users on Linux and Mac to read these websites, using a Mozilla extension named Padma by Nagarjuna Venna and his team. To get Padma working, you need (a) a Unicode Bengali font (Linux users may already have one; Mac users can get one from Ekushey), (b) Firefox (version 3 is recommended on Linux and required on Mac), and (c) Padma itself.

Padma can transform a given non-standard encoding to standard Unicode on the fly. Of course, for Padma to work, it must know the font encoding of the particular website.
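
To make the idea concrete, here is a minimal sketch of table-driven conversion from a font-specific encoding to Unicode. It is not Padma's code (Padma is a browser extension) and the glyph table is invented for illustration; it is not the real ABP or BP mapping. The names `FONT_MAPS`, `example-font` and `to_unicode` are assumptions.

```python
# Hypothetical glyph table: maps the characters a legacy font uses
# to the Unicode characters they are meant to display.
# This is NOT the real ABP or BP mapping.
FONT_MAPS = {
    "example-font": {
        "k": "\u0995",  # ক (ka)
        "l": "\u09B2",  # ল (la)
        "a": "\u09BE",  # া (vowel sign aa)
    },
}


def to_unicode(text: str, font_name: str) -> str:
    """Convert text written for a known legacy font into Unicode.

    Text in an unknown font is returned unchanged, which mirrors the point
    above: without the font encoding, no conversion is possible.
    """
    table = FONT_MAPS.get(font_name)
    if table is None:
        return text
    # Characters missing from the table are passed through as-is.
    return "".join(table.get(ch, ch) for ch in text)


converted = to_unicode("kal", "example-font")
print(converted)
```

A real mapping is far larger and also has to deal with ligatures and character ordering, so this only shows the lookup step.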

As it turns out, I wrote support for ABP in Padma more than a year ago. My job was made simple by an earlier CGI program by Tanmoy Bhattacharya, who had already decoded the font mapping for ABP. A couple of months ago, I also added support for Bartaman Patrika in Padma. So, courtesy of Tanmoy’s font-map decoding, the latest version of Padma (0.4.13) supports both ABP and BP.

There is a known issue of incorrect rendering of Bengali Matras in certain situations. See, for example, Runa-Sankarshan’s photostream here. Many of these were due to a simple bug that has been fixed in the latest version (0.4.13). However, fixing the remaining cases requires significant changes in Padma. ABP and BP both use three different fonts simultaneously. Most ligatures come from the 2nd and 3rd fonts, whereas Matras come from the 1st font. Padma transforms each font separately and doesn’t merge the elements of these different fonts into a single element. This leads to incorrect rendering, which is hard to solve without changing the core of Padma.
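
The following sketch shows one plausible way such a failure can arise. It rests on assumptions that are not taken from Padma's source: that the legacy fonts store a pre-base Matra (such as ি) in visual order, before its consonant, while Unicode stores it after the consonant, and that the Matra and the consonant sit in separate font runs. The font names, glyph tables and function names are all hypothetical, and the reordering rule is deliberately simplistic (it ignores conjuncts).

```python
# Vowel signs drawn to the left of the consonant: ি, ে, ৈ
PRE_BASE_SIGNS = {"\u09BF", "\u09C7", "\u09C8"}

# Hypothetical per-font glyph tables (not the real ABP/BP fonts).
FONT_MAPS = {
    "font1": {"i": "\u09BF"},                  # Matras
    "font2": {"k": "\u0995", "n": "\u09A8"},   # consonants ক, ন
}


def map_run(font: str, text: str) -> str:
    """Map one run of legacy text to Unicode characters, keeping its order."""
    table = FONT_MAPS[font]
    return "".join(table.get(ch, ch) for ch in text)


def reorder(text: str) -> str:
    """Move each pre-base sign after the character that follows it."""
    out = []
    i = 0
    while i < len(text):
        if text[i] in PRE_BASE_SIGNS and i + 1 < len(text):
            out.append(text[i + 1])
            out.append(text[i])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def convert_separately(runs):
    """Convert and reorder each font run on its own."""
    return "".join(reorder(map_run(font, text)) for font, text in runs)


def convert_merged(runs):
    """Map every run, join them, and only then reorder across run boundaries."""
    return reorder("".join(map_run(font, text) for font, text in runs))


# Visual order on the page: Matra (font1), then consonants (font2).
runs = [("font1", "i"), ("font2", "kn")]
separate = convert_separately(runs)  # Matra stays stranded before the consonant
merged = convert_merged(runs)        # Matra ends up after the consonant
print([hex(ord(c)) for c in separate])
print([hex(ord(c)) for c in merged])
```

In this toy setting, converting each run in isolation leaves the Matra ahead of its consonant because the reordering step never sees both characters together, whereas merging the runs first lets it produce the logical order that Unicode expects.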

The bigger issue, however, is the need for Padma itself. I tend to agree with the concerns expressed by Sankarshan in a discussion thread here. The real question, then, is how long these websites are going to keep themselves confined to their own non-standard encoding.

This led me to wonder: don’t their technical staff realize what they are missing by not upgrading to Unicode? Firstly, by upgrading to Unicode they could readily expand their current user base. Secondly, the use of Unicode would make their content searchable in search engines like Google. This could lead to additional search-engine-generated revenue for them. The number of Bengali internet users is going to increase in the near future, and a significant portion of the new users will come from the interior regions. Undoubtedly, many of these users will be more comfortable searching with Bengali keywords. Thirdly, by continuing to use a non-standard encoding, they are piling up non-standard content in their archives, which will take a big effort to bring into standard form. So, in my humble opinion, it would be a prudent decision for them to upgrade their websites to Unicode sooner rather than later.

Nevertheless, there is now a positive sign: Star Ananda, a sister group of Anandabazar Patrika, has started using Unicode (though their declared “charset” doesn’t say so) for their Bengali website. I hope this marks the beginning of change.

Update (May 9, 2009): Please see this post for an update on the above-mentioned incorrect rendering issue.
