AU-KBC RESEARCH CENTRE
Research topics in Natural Language Processing
The group is currently working on the following topics:
-
An automated system for Tamil morphological analysis has been implemented
with an API. The morphological analyzer
and generator are integrated in a single finite-state automaton implementation.
The current accuracy of the morphological analyzer on the 3-million-word CIIL
Tamil corpus is >95%. An online
demo of this is available.
-
A computationally amenable phrase structure
grammar (PSG) for Tamil is being developed. It is modelled on
existing PSGs for English, the idea being that transfer rules from one
grammar to the other will be used to remap parse trees from the source to the target
language. Together with feature-based unification and the morphological
analyzer, this forms the core of the design of the English-Tamil
bidirectional MT system. An alpha version of this system is expected by
the summer of 2003. The prototype
version of the English-Tamil MT system is available online.
-
A Tamil search engine has been developed,
and an online demo
is available. A new version, currently under development, is intended to offer faster search,
improved ranking and a number of other new features.
-
A named entity extraction module
has been developed to identify chemical names and protein names in medical
abstracts. This module forms part of an information extraction system.
-
Lexical resources are also being
developed to initiate statistical parsing in Tamil. This is being done
in coordination with other members of the
TransLexGram
project. Some 5000 English words (a total of ~8000 when counted in their
various senses and categories) with associated sentences have been manually translated
into Tamil. An online
searchable display of the TransLexGram is available.
-
POS tagging is being carried out as part of the
AnnCorra project. Semi-automated tools with a GUI have been developed
to tag the sentences. Display tools have also been developed to view the parse tree representation of the
sentences.
-
We are collaborating with Dr. S Rajendran of Thanjavur Tamil University
on a Tamil WordNet. This is being
developed along the same lines as the English WordNet developed by Princeton
University. There are plans to interconnect this with the Hindi WordNet
and the English WordNet.
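
The transfer-rule idea mentioned under the phrase structure grammar topic can be illustrated with a small sketch. It is not the group's implementation: the tree encoding, the rule table and the function names are all hypothetical, and the only linguistic facts assumed are that English places the verb before its object and uses prepositions, while Tamil places the verb after its object and uses postpositions. The sketch only reorders the English parse tree into Tamil constituent order; lexical translation and morphological generation are left out.

```python
# Toy illustration only: the tree encoding, rule table and function names
# are hypothetical and do not describe the actual English-Tamil MT system.

# A phrase is (label, child, child, ...); a leaf is (tag, word).
def is_leaf(node):
    return len(node) == 2 and isinstance(node[1], str)

# Transfer rules keyed by (parent label, child labels in the source tree).
# The value is the order in which the children appear in the target tree.
TRANSFER_RULES = {
    ("VP", ("V", "NP")): (1, 0),  # verb-object becomes object-verb
    ("PP", ("P", "NP")): (1, 0),  # preposition becomes postposition
}

def transfer(node):
    """Recursively remap a source parse tree into target constituent order."""
    if is_leaf(node):
        return node
    label = node[0]
    children = [transfer(child) for child in node[1:]]
    key = (label, tuple(child[0] for child in children))
    # Constituents without a matching rule keep their source order.
    order = TRANSFER_RULES.get(key, range(len(children)))
    return (label, *[children[i] for i in order])

def leaves(node):
    """Read the words off a tree from left to right."""
    if is_leaf(node):
        return [node[1]]
    return [word for child in node[1:] for word in leaves(child)]

english = (
    "S",
    ("NP", ("N", "Raman")),
    ("VP", ("V", "read"), ("NP", ("Det", "the"), ("N", "book"))),
)

reordered = transfer(english)
print(" ".join(leaves(reordered)))  # Raman the book read
```

In a fuller system each rule would presumably also carry feature constraints for unification, and the reordered leaves would be handed to a lexicon and the morphological generator; those steps are omitted here.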