AU-KBC RESEARCH CENTRE
Tamil-Hindi Anusaaraka System

Tamil-Hindi was chosen for MAT because both are free word-order languages, unlike English, which is a positional language. Ultimately our aim is to build a Human Aided Machine Translation System for English-Tamil. An MT system basically has three major components, viz. the Morphological Analyser, the Mapping Unit and the Generator.
Morphological Analyser
The Morphological Analyser (MA) splits a word into its constituent
morphemes. A morpheme is the smallest unit of a word that conveys meaning.
Taken together, the morphemes describe the word grammatically, so the
complete grammatical information of a word is obtained from them. A source
language sentence is first processed by the MA, which splits the sentence
into words and then splits each word into morphemes. This process yields
the root word, which is given as input to the mapping block along with the
other morphemes. The other morphemes include the tense marker, the GNP
(gender-number-person) marker, the vibakthi (case marker), etc. A dictionary
is used to split a word into morphemes. Typically, the first field of this
dictionary contains the Tamil root words and their inflections. The
inflections include the GNP marker, the TAM marker and the vibakthi.
A given word is compared with the words/morphemes in the first field of
the dictionary. Matching is done from right to left. The inflections are
thus split off one at a time until we arrive at the root form. Each root
word, along with its inflections, is given as input to the mapping unit.
For example, in the word "patikkiReen" (means studying), 'een' (GNP marker)
is chopped off first, followed by 'kkiR' (tense marker). The remaining
morpheme 'pati' is the root word. Though this example is very simple, there
are words which need sandhi processing during morphological analysis.
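The right-to-left matching described above can be sketched roughly as
follows. This is only a minimal illustration, not the system's actual code:
the names SUFFIXES, ROOTS and analyse are hypothetical, the tables hold
nothing but the morphemes of the "patikkiReen" example, and no sandhi
processing is attempted.

```python
# Hypothetical inflection table: morpheme -> label.
# It stands in for the first field of the dictionary and holds only the
# morphemes from the "patikkiReen" example.
SUFFIXES = {
    "een": "GNP marker",
    "kkiR": "tense marker",
}

# Hypothetical set of known root words.
ROOTS = {"pati"}


def analyse(word):
    """Strip known inflections from the right end until a root remains.

    Returns (root, [(morpheme, label), ...]) with the morphemes listed in
    the order they were chopped off, or None if no analysis is found.
    """
    stripped = []
    rest = word
    while rest not in ROOTS:
        for suffix, label in SUFFIXES.items():
            # Match at the right edge, but never consume the whole word.
            if rest.endswith(suffix) and len(rest) > len(suffix):
                stripped.append((suffix, label))
                rest = rest[: -len(suffix)]
                break
        else:
            # No inflection matched and the remainder is not a known root.
            return None
    return rest, stripped


result = analyse("patikkiReen")
```

Here the first matching inflection is always taken, which is a
simplification; a real analyser would have to choose between competing
matches and handle sandhi changes at morpheme boundaries.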
Mapping Unit
In this block the root word and its inflections are mapped to their
equivalent target language terms. The structure of the dictionary is worth
explaining at this point. The dictionary has seven fields that aid the
mapping process. As said earlier, the first field contains the Tamil root
words and inflections. The second field contains the paradigm type followed
by the paradigm number, which are used in the generation of words. The
subsequent fields contain the category of the word, the equivalent Hindi
meaning(s) and the gender information. The last field contains information
about the dictionary, which is kept for maintenance work. The gender
information is especially important for Hindi because every Hindi noun
takes one of two genders, and this information is very helpful for semantic
analysis. The Hindi equivalents of the words are then given as input to the
generator part of the MT system. All equivalent Hindi words for a Tamil
word are listed in the dictionary, separated by a "/". However, only the
first meaning, which is the more relevant one, is given to the generator.
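As a rough sketch of the lookup described above, a dictionary entry can be
modelled as a record and the first Hindi meaning picked from the
"/"-separated list. The class DictEntry, its attribute names, the function
first_meaning and the sample values (including the Hindi words and the
paradigm code) are all assumptions made for illustration. They are not
taken from the real dictionary, whose on-disk layout is not described here.

```python
from dataclasses import dataclass


@dataclass
class DictEntry:
    """Hypothetical in-memory view of one dictionary entry."""
    tamil: str        # Tamil root word or inflection (first field)
    paradigm: str     # paradigm type followed by paradigm number
    category: str     # category of the word
    hindi: str        # equivalent Hindi meaning(s), separated by "/"
    gender: str       # gender information for the Hindi word
    maintenance: str  # information kept for maintenance work


def first_meaning(entry):
    """Return the first of the "/"-separated Hindi meanings."""
    return entry.hindi.split("/")[0].strip()


# Illustrative entry only; every value here is made up for the example.
sample = DictEntry(
    tamil="pati",
    paradigm="v1",
    category="verb",
    hindi="paDha/siikha",
    gender="-",
    maintenance="",
)

chosen = first_meaning(sample)
```

Only the first alternative is passed on, matching the rule that the first
meaning is the one given to the generator.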
Generator
The generator performs the reverse of the analyser's process. Given a root
word and its inflections, it generates the equivalent Hindi word. While
generating, it takes into account all the available information, such as
gender and tense, and the equivalent word is generated accordingly. For the
generator to produce the word, its input must be in a fixed order. The order
followed here is: Hindi root, Category, Gender, Number, Person and finally
TAM (Tense-Aspect-Modality), if any. The Hindi generator used here is from
IIIT, Hyderabad, and is also used for other anusaaraka products. It is used
here as a black box.
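The input ordering above might be assembled along the following lines. The
generator itself is a black box, so this sketch stops at building the
ordered list of fields. The names GeneratorInput and to_fields and the
sample values are hypothetical, and the actual format the generator expects
(separators, codes for gender or number) is not specified here.

```python
from typing import NamedTuple, Optional


class GeneratorInput(NamedTuple):
    """Fields in the order described: root, category, gender,
    number, person and, if any, TAM."""
    root: str
    category: str
    gender: str
    number: str
    person: str
    tam: Optional[str] = None  # Tense-Aspect-Modality, if any


def to_fields(item):
    """Return the fields as an ordered list, omitting TAM when absent."""
    fields = [item.root, item.category, item.gender, item.number, item.person]
    if item.tam is not None:
        fields.append(item.tam)
    return fields


# Illustrative values only; the real codes are not documented here.
with_tam = to_fields(GeneratorInput("paDha", "verb", "m", "sg", "1", "present"))
without_tam = to_fields(GeneratorInput("kitaaba", "noun", "f", "sg", "3"))
```

Keeping the order in one place makes it harder to pass the fields to the
generator in the wrong sequence.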
Future scope
The current system mainly does a word-by-word translation. It does not
take care of semantic analysis. For example, the generated Hindi text may
not agree in gender or other grammatical information. This is mainly due to
the differences between Tamil and Hindi, which belong to different language
families (Tamil is a Dravidian language and Hindi a member of the Indo-Aryan
family). Though these disagreements can be resolved, this work is not being
taken up here. Future work in this project depends on user interest; those
who would like to have this product improved further can contact us for
more details.
Advisors:
Dr. Vineet Chaitanya, LTRC, IIIT, Hyderabad <vc@iiit.net>
Dr. Subbiah Pillai, IITS, Chennai
Dr. Renganathan, IITS, Chennai
Mrs. Amba Kulkarni, LTRC, IIIT, Hyderabad <amba@iiit.net>
Mrs. Meenakshi, Dakshin Bharat Hindi Prachar Sabha, Chennai
Dr. Ranjini Parthasarathy, CSE Dept, Anna University (Main Campus) <ranjani-p@eth.net>
S Baskaran, B Kumara Shanmugam, S Ramesh Kumar, S Viswanathan