
AU-KBC RESEARCH CENTRE

Tamil-Hindi Anusaaraka System
 
Image of Tamil->Hindi MT system demo

Tamil-Hindi was chosen for MAT because both are free word-order languages, unlike English, which is a positional language. Ultimately, our aim is to build a Human Aided Machine Translation System for English-Tamil. An MT system basically has three major components, viz. a morphological analyser, a mapping unit, and a generator.

The sections below outline these three components as used in the Tamil-Hindi MT system.

Morphological Analyser
The Morphological Analyser (MA) splits a word into its constituent morphemes. A morpheme is the smallest unit of a word that conveys meaning. Taken together, the morphemes describe the word grammatically, so the complete grammatical information of a word can be obtained from them. A source language sentence is first processed by the MA, which splits the sentence into words and then splits each word into morphemes. This process yields the root word, which is given as input to the mapping block along with the other morphemes. The other morphemes include the tense marker, the GNP (gender, number, person) marker, the vibakthi (case marker), etc.

A dictionary is used to split a word into morphemes. The first field of this dictionary contains the Tamil root words and their inflections. The inflections include the GNP marker, the TAM marker, and the vibakthi. A given word is compared with the words and morphemes in the first field of the dictionary, and matching is done from right to left. The inflections are split off one by one until the root form is reached. Each root word, along with its inflections, is given as input to the mapping unit.

For example, in the word "patikkiReen" (which means "studying"), 'een' (the GNP marker) is chopped off first, followed by 'kkiR' (the tense marker). The remaining morpheme 'pati' is the root word. This example is very simple; there are words that also need sandhi processing during morphological analysis.
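The right-to-left matching described above can be sketched roughly as follows. This is only a minimal illustration, not the system's actual implementation: the names `ROOTS`, `SUFFIXES`, and `analyse` are hypothetical, the toy lexicon holds only the entries needed for the "patikkiReen" example, and sandhi processing is not handled.

```python
# Toy lexicon (hypothetical): only the entries needed for the example.
ROOTS = {"pati"}
SUFFIXES = {
    "een": "GNP marker",
    "kkiR": "tense marker",
}


def analyse(word):
    """Strip known suffixes from the right until a known root remains.

    Returns (root, [(morpheme, label), ...]) with the morphemes in surface
    order, or None if the word cannot be analysed with this toy lexicon.
    """
    if word in ROOTS:
        return word, []
    # Try longer suffixes first so that a short suffix does not hide a
    # longer one that ends in the same letters.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            result = analyse(word[:-len(suffix)])
            if result is not None:
                root, morphemes = result
                return root, morphemes + [(suffix, SUFFIXES[suffix])]
    return None


print(analyse("patikkiReen"))
```

For "patikkiReen" the sketch first removes 'een', then 'kkiR', and stops when 'pati' is found among the roots, mirroring the order given in the example above.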
 

Mapping Unit
The root word and its inflections are mapped to their equivalent target language terms in this block. The structure of the dictionary is worth explaining at this point. The dictionary has seven fields that aid the mapping process. As said earlier, the first field contains the Tamil root words and inflections. The second field contains the paradigm type followed by the paradigm number, which are useful in the generation of words. The subsequent fields contain the category of the word, the equivalent Hindi meaning(s), and gender information. The last field contains information about the dictionary itself and is kept for maintenance work. The gender information is especially important for Hindi because every Hindi noun has one of two genders, and this information is very helpful for semantic analysis. The corresponding Hindi equivalents of the words are taken and given as input to the generator part of the MT system. All equivalent Hindi words for a Tamil word are given in the dictionary, separated by "/". However, only the first meaning, which is the most relevant one, is given to the generator.
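As a rough sketch of the lookup step, the fragment below models only the dictionary fields named above and picks the first of the "/"-separated Hindi equivalents. The class `DictEntry`, the function `first_meaning`, and the sample values (including the paradigm code and the Hindi transliterations) are assumptions made for illustration; the actual file layout and contents of the project's dictionary are not described here.

```python
from dataclasses import dataclass


@dataclass
class DictEntry:
    # Only the fields the text names; the real on-disk layout is unknown.
    tamil: str     # Tamil root word or inflection (first field)
    paradigm: str  # paradigm type followed by paradigm number
    category: str  # category of the word
    hindi: str     # equivalent Hindi meaning(s), separated by "/"
    gender: str    # gender information for the Hindi word


def first_meaning(entry):
    """Return the first Hindi equivalent listed for an entry."""
    return entry.hindi.split("/")[0].strip()


# Illustrative values only, not taken from the project's dictionary.
entry = DictEntry("pati", "v1", "v", "paDZa/paDZAI kara", "")
print(first_meaning(entry))
```

Taking the first alternative keeps the mapping deterministic, at the cost of ignoring context that might favour one of the later meanings.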
 

Generator
The generator performs the reverse of the analyser's process. Given a root word and its inflections, it generates the equivalent Hindi word. While generating, it takes into account information such as gender and tense, and the equivalent word is generated accordingly. The input to the generator must be in a fixed order. The order followed here is: Hindi root, Category, Gender, Number, Person and finally TAM (Tense-Aspect-Modality), if any. The Hindi generator used here is from IIIT, Hyderabad, and is also used for other Anusaaraka products. It is used here as a black box.
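The input order stated above can be captured in a small record type. This is a sketch under assumptions: `GeneratorInput` and `to_fields` are hypothetical names, the sample values and their encodings are made up for illustration, and the actual input format expected by the IIIT generator is not specified here, so no call to it is shown.

```python
from typing import NamedTuple, Optional


class GeneratorInput(NamedTuple):
    # Field order follows the text: root, category, gender, number,
    # person, and finally TAM (which may be absent).
    root: str
    category: str
    gender: str
    number: str
    person: str
    tam: Optional[str] = None


def to_fields(item):
    """Flatten the record into an ordered list, omitting TAM if absent."""
    fields = [item.root, item.category, item.gender, item.number, item.person]
    if item.tam is not None:
        fields.append(item.tam)
    return fields


# Illustrative values only; the encodings are assumptions.
verb = GeneratorInput("paDZa", "v", "m", "sg", "1", "wA_hE")
print(to_fields(verb))
```

Keeping the order in one place makes it harder to hand the black-box generator fields in the wrong sequence.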
 

Future scope
The current system mainly does a word-by-word translation and does not perform semantic analysis. For example, the generated Hindi text may not agree in gender or other grammatical information. This is mainly due to the differences between Tamil and Hindi, which belong to different language families (Tamil is a Dravidian language and Hindi is a member of the Indo-Aryan family). Though these disagreements can be resolved, that work is not being taken up here. Future work on this project depends on user interest; those who would like to see this product improved further can contact us for more details.

Advisors:

Dr. Vineet Chaitanya, LTRC, IIIT, Hyderabad <vc@iiit.net>
Dr. Subbiah Pillai, IITS, Chennai
Dr. Renganathan, IITS, Chennai
Mrs. Amba Kulkarni, LTRC, IIIT, Hyderabad <amba@iiit.net>
Mrs. Meenakshi, Dakshin Bharat Hindi Prachar Sabha, Chennai
Dr. Ranjini Parthasarathy, CSE Dept, Anna University (Main Campus) <ranjani-p@eth.net>
 

S Baskaran, B Kumara Shanmugam, S Ramesh Kumar, S Viswanathan