

POS Tagging


Part-of-speech tagging is the process of assigning a part-of-speech like noun, verb, pronoun, preposition, adverb, adjective or other lexical class marker to each word in a sentence.

The input to a tagging algorithm is the sequence of words of a natural language sentence together with a specified tagset (a finite list of part-of-speech tags). The output is the single best POS tag for each word.

Because Tamil is a morphologically rich language, the Morph analyser (a tool that splits a given word into its constituent morphemes and identifies their corresponding grammatical categories) can itself identify the part of speech in most cases. But the Morph analyser fails to resolve some lexical ambiguities, and for these we need a POS Tagger. For example, consider the sentence

matraasil naanku varusamaaka irukkireen
(I have been in Madras for four years)

In this sentence the word 'varusamaaka' splits as varusam (noun) + aaka (adverbial suffix). In Tamil the suffix 'aaka' usually attaches to a noun and forms an adverb. But in this case 'aaka' corresponds to the preposition 'for' in English.

Typically, stochastic models that use information about neighboring words are used to assign the appropriate tags. For example, consider the sentence

naan pati erineen
(I climbed the stairs)

Here the lexical item 'pati' could be a noun meaning 'staircase' or the imperative form of the verb 'read'. In this sentence it is a noun, because a noun is more likely than an imperative verb to precede a verb.

Tags play an important role in natural language applications such as speech recognition, natural language parsing, information retrieval and information extraction.

The objective of this project is to identify the ambiguities in Tamil lexical items, which cannot be resolved by a Morph analyser, and to develop a tag set appropriate for Tamil and Indian languages. Finally, we will build an efficient and accurate POS Tagger.

Previous attempts:

Taggers can be characterized as rule-based or stochastic. Rule-based taggers use hand-written rules to resolve tag ambiguity. Stochastic taggers are either HMM-based, choosing the tag sequence that maximizes the product of word likelihood and tag sequence probability, or cue-based, using decision trees or maximum entropy models to combine probabilistic features.
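The HMM criterion described above can be sketched with the Viterbi algorithm on the 'naan pati erineen' example. This is a minimal illustration, not the project's tagger: the three-tag tagset, the probability tables and the function name viterbi are all assumptions made for the example, and the numbers are hand-picked, not estimated from any Tamil corpus.

```python
import math

# Toy tagset and hand-picked probabilities (illustrative assumptions only).
TAGS = ["PRON", "NOUN", "VERB"]

# Tag sequence probability P(tag | previous tag); "<s>" marks the sentence start.
TRANSITION = {
    "<s>":  {"PRON": 0.6, "NOUN": 0.3, "VERB": 0.1},
    "PRON": {"PRON": 0.1, "NOUN": 0.6, "VERB": 0.3},
    "NOUN": {"PRON": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"PRON": 0.3, "NOUN": 0.4, "VERB": 0.3},
}

# Word likelihood P(word | tag); 'pati' is ambiguous between NOUN and VERB.
EMISSION = {
    "PRON": {"naan": 0.9},
    "NOUN": {"pati": 0.4},
    "VERB": {"pati": 0.5, "erineen": 0.6},
}

SMALL = 1e-12  # stand-in probability for word/tag pairs missing from the table


def viterbi(words):
    """Return the tag sequence maximizing the product of
    P(word | tag) * P(tag | previous tag) over the sentence."""
    if not words:
        return []
    # best[i][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{}]
    for tag in TAGS:
        score = math.log(TRANSITION["<s>"][tag])
        score += math.log(EMISSION[tag].get(words[0], SMALL))
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in TAGS:
            emit = math.log(EMISSION[tag].get(words[i], SMALL))
            prev, score = max(
                ((p, best[i - 1][p][0] + math.log(TRANSITION[p][tag]) + emit)
                 for p in TAGS),
                key=lambda pair: pair[1],
            )
            best[i][tag] = (score, prev)
    # Follow the back-pointers from the best final tag.
    last = max(TAGS, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return list(reversed(path))


print(viterbi(["naan", "pati", "erineen"]))
```

With these toy numbers the word likelihood alone slightly favours VERB for 'pati' (0.5 against 0.4), but the tag sequence probabilities tip the decision to NOUN, since a noun is the likelier tag between a pronoun and a verb.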

A great deal of work has been done on POS tagging for English. The earliest algorithms for automatically assigning parts of speech were rule-based. The ENGTWOL tagger (Voutilainen, 1995) is a rule-based tagger built on a two-stage architecture. Probabilities were first used in tagging by Stolz et al. (1965), and various stochastic taggers were built in the 1980s (Church 1988). There is also Transformation-Based Tagging, an instance of Transformation-Based Learning, a machine learning approach. But all of this work has been done for English and a few European languages.

Not much work has been done on POS tagging for Tamil. A likely reason is that Tamil is rich in morphology and most of the information needed for POS tagging is available as inflections. As a result, most of the work is being done on Tamil morphological analysers. There exists a Tamil morphological tagger (Vasu Reganathan) with limited coverage.

Present work:

First, the limitations of word-level analysis (Morph) will be studied. Second, the input requirements of various NLP applications will be studied. These studies will identify the information that the applications require but that a morphological analyser cannot deliver. Strategies will then be developed by which a tagger can extract or resolve that additional information.

A POS tagger is needed to identify the tag for words that the morphological analyser cannot analyse. If the Morph gives multiple (ambiguous) tags for a word, the tagger can be used to resolve the ambiguity.
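The division of labour described above might look roughly like the following sketch. Everything here is hypothetical: MORPH_LEXICON merely stands in for a real morphological analyser, and the context rule prefer_noun_before_verb is a toy placeholder for whatever tagger is eventually chosen.

```python
# Hypothetical stand-in for a morphological analyser: word -> candidate tags.
MORPH_LEXICON = {
    "naan": {"PRON"},
    "pati": {"NOUN", "VERB"},   # ambiguous: 'staircase' or imperative 'read'
    "erineen": {"VERB"},
}


def morph_candidates(word):
    """Return the set of tags the (mock) analyser proposes; empty if unanalysed."""
    return MORPH_LEXICON.get(word, set())


def tag_sentence(words, disambiguate):
    """Accept the analyser's tag when it is unique; otherwise ask the tagger."""
    tags = []
    for i, word in enumerate(words):
        candidates = morph_candidates(word)
        if len(candidates) == 1:
            tags.append(next(iter(candidates)))
        else:
            # Either no analysis at all or several competing analyses.
            tags.append(disambiguate(words, i, candidates))
    return tags


def prefer_noun_before_verb(words, i, candidates):
    """Toy context rule: choose NOUN when the next word is unambiguously a verb."""
    following = morph_candidates(words[i + 1]) if i + 1 < len(words) else set()
    if "NOUN" in candidates and following == {"VERB"}:
        return "NOUN"
    # Fallback: first candidate alphabetically, or UNK for unanalysed words.
    return sorted(candidates)[0] if candidates else "UNK"


print(tag_sentence(["naan", "pati", "erineen"], prefer_noun_before_verb))
```

Passing the disambiguation step in as a function keeps the sketch neutral about the tagging technique: a stochastic or transformation-based tagger could be substituted without changing tag_sentence.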

The idea is to try different combinations of tagging techniques to identify the best tagging scheme for inflectional and free word order languages like Tamil. Transformation-based tagging is a hybrid tagging scheme that uses both rule-based and stochastic techniques. Like rule-based taggers, Transformation-Based Learning (TBL) is based on rules that specify what tags should be assigned to what words. But like stochastic taggers, TBL is a machine learning technique in which rules are automatically induced from the data. This approach will be tried first, and other techniques will be explored in due course.
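One step of transformation-based learning can be sketched as follows, under simplifying assumptions: the single rule template ("change tag A to tag B when the next tag is C"), the names Rule, apply_rule and best_rule, and the three-word training example are all invented for illustration. A real TBL tagger uses many templates and repeats this selection step until no rule reduces the error count.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    from_tag: str
    to_tag: str
    next_tag: str  # the rule fires when the following word carries this tag


def apply_rule(tags, rule):
    """Return a new tag list with the rule applied left to right."""
    out = list(tags)
    for i in range(len(out) - 1):
        if out[i] == rule.from_tag and out[i + 1] == rule.next_tag:
            out[i] = rule.to_tag
    return out


def errors(tags, gold):
    """Count positions where the current tags disagree with the gold tags."""
    return sum(t != g for t, g in zip(tags, gold))


def best_rule(current, gold, candidates):
    """Pick the candidate rule that removes the most errors; None if none helps."""
    if not candidates:
        return None
    base = errors(current, gold)
    scored = [(base - errors(apply_rule(current, r), gold), r) for r in candidates]
    gain, rule = max(scored, key=lambda pair: pair[0])
    return rule if gain > 0 else None


# 'naan pati erineen': assume the initial tagger labelled 'pati' as VERB.
current = ["PRON", "VERB", "VERB"]
gold = ["PRON", "NOUN", "VERB"]
candidates = [Rule("VERB", "NOUN", "VERB"), Rule("PRON", "NOUN", "VERB")]

learned = best_rule(current, gold, candidates)
print(learned)
print(apply_rule(current, learned))
```

The selected rule reads like a hand-written one ("retag VERB as NOUN before a VERB"), yet it was chosen by counting errors against annotated data, which is the sense in which TBL combines the two families of taggers.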



B. Kumara Shanmugam