AU-KBC RESEARCH CENTRE
Word Sense Disambiguation
Every language has ambiguous words, that is, words with more than one sense. For example, the English word ‘bank’ has at least two senses:
bank1 - as a financial organization
bank2 - as the border of a water body
The task of Word Sense Disambiguation (WSD) is to determine which sense of an ambiguous word is invoked in a particular use of the word. A standard approach to WSD is to consider the context of the word’s use, in particular the words that occur within some predefined neighboring window.
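The notion of a predefined neighboring context can be sketched as a fixed-size window around the ambiguous word. This is only an illustration: the function name `context_window`, the window size of 3 and the example sentence are assumptions, not part of the system described here.

```python
def context_window(tokens, index, size=3):
    """Return up to `size` tokens on each side of tokens[index]."""
    left = tokens[max(0, index - size):index]
    right = tokens[index + 1:index + 1 + size]
    return left + right


# Toy example: the words around 'bank' hint at the financial sense.
sentence = "she deposited the cheque at the bank before noon".split()
target = sentence.index("bank")
print(context_window(sentence, target, size=3))
# ['cheque', 'at', 'the', 'before', 'noon']
```

A wider window captures more distant clue words at the cost of more noise.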
WSD is important for applications such as Machine Translation and Information Retrieval. For example, in machine translation from English to Tamil, words like ‘bank’ have to be disambiguated so that they are translated correctly, i.e. so that the correct Tamil word is chosen. Likewise, an Information Retrieval system answering a query about ‘financial banks’ should return only documents that use ‘bank’ in the first sense, and a Question-Answering system must interpret an ambiguous word in the query correctly to be able to answer the question. In short, whenever a system’s actions depend on the meaning of the text being processed, disambiguation is beneficial or even necessary.
Earlier attempts at WSD focused on supervised learning, assuming that resources such as WordNet (Xiaobin Li et al. 1995), sense-coded dictionaries (Lesk 1986; Dagan et al. 1991), thesauri (Yarowsky 1992) or hand-labeled training sets (Hearst 1991; Niwa and Nitta 1994; Bruce and Wiebe 1994) are available a priori. Developing these resources requires a great deal of human effort and typically takes years. As they are not available for Tamil, such supervised techniques cannot be applied immediately. A few attempts have been made at unsupervised WSD, such as (Yarowsky 1995), which seeks minimal human involvement: a few seed words that occur with each sense of the ambiguous word are provided to bootstrap the algorithm. The algorithm then classifies each occurrence of the ambiguous word in a corpus (training phase) into clusters such that all occurrences within a cluster share the same sense. Additional co-occurring words are collected in this process and are then used to disambiguate unseen text from the held-out corpus (testing phase).
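The seed-word bootstrapping idea can be illustrated with a heavily simplified sketch. It is not the algorithm of (Yarowsky 1995): the simple overlap count used for scoring, the function name `bootstrap`, the `min_count` threshold and the toy contexts are all assumptions made for illustration.

```python
from collections import Counter


def bootstrap(contexts, seeds, rounds=3, min_count=2):
    """Label contexts from seed words, then grow the indicator sets.

    contexts: list of sets of context words, one per occurrence.
    seeds:    dict mapping each sense to a set of seed words.
    """
    indicators = {sense: set(words) for sense, words in seeds.items()}
    labels = [None] * len(contexts)
    for _ in range(rounds):
        # Label step: assign a sense only when one sense clearly wins.
        for i, ctx in enumerate(contexts):
            scores = {s: len(ctx & words) for s, words in indicators.items()}
            ranked = sorted(scores.values(), reverse=True)
            unique_winner = len(ranked) == 1 or ranked[0] > ranked[1]
            if ranked[0] > 0 and unique_winner:
                labels[i] = max(scores, key=scores.get)
            else:
                labels[i] = None
        # Grow step: collect words seen often with one sense and never
        # with any other sense, and add them as new indicators.
        counts = {s: Counter() for s in indicators}
        for ctx, label in zip(contexts, labels):
            if label is not None:
                counts[label].update(ctx)
        for s in indicators:
            for word, c in counts[s].items():
                elsewhere = sum(counts[o][word] for o in indicators if o != s)
                if c >= min_count and elsewhere == 0:
                    indicators[s].add(word)
    return labels, indicators


# Toy contexts of 'bank'; one seed word per sense.
contexts = [
    {"money", "loan"}, {"loan", "interest"}, {"money", "interest", "loan"},
    {"river", "water"}, {"water", "boat"}, {"river", "boat", "water"},
]
seeds = {"bank1": {"money"}, "bank2": {"river"}}
labels, indicators = bootstrap(contexts, seeds)
print(labels)
```

In this toy run the contexts without any seed word are labeled only in the second round, after 'loan' and 'water' have been picked up as additional indicators.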
The idea here is to reduce the human effort needed for sense-tagging even further than in (Yarowsky 1995). The approach is similar to, and an extension of, Context-group discrimination (Schutze 1998). All occurrences of an ambiguous word are classified into clusters in such a way that all occurrences within a cluster share the same sense. Co-occurring words are then collected for each cluster, and these words are used to assign a sense to each cluster manually.
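A minimal sketch of this cluster-then-label idea follows. It assumes bag-of-words contexts, Jaccard similarity and greedy average-link merging, none of which is confirmed by the description above, and the names `cluster_contexts` and `top_words` are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Overlap between two sets of context words, between 0 and 1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_contexts(contexts, k):
    """Merge the most similar pair of clusters until k clusters remain."""
    clusters = [[i] for i in range(len(contexts))]

    def similarity(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        total = sum(jaccard(contexts[i], contexts[j]) for i, j in pairs)
        return total / len(pairs)

    while len(clusters) > k:
        candidates = [(a, b) for a in range(len(clusters))
                      for b in range(a + 1, len(clusters))]
        a, b = max(candidates,
                   key=lambda ab: similarity(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def top_words(contexts, cluster, n=3):
    """Most frequent co-occurring words in a cluster, shown to the annotator."""
    counts = Counter(w for i in cluster for w in contexts[i])
    return [w for w, _ in counts.most_common(n)]


# Toy contexts of one ambiguous word, one set per occurrence.
contexts = [{"money", "loan"}, {"loan", "interest"},
            {"river", "water"}, {"water", "boat"}]
clusters = cluster_contexts(contexts, k=2)
for cluster in clusters:
    print(cluster, top_words(contexts, cluster, n=1))
```

A human then inspects only the few top words per cluster to name its sense, instead of tagging every occurrence.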
It is also planned to probe the usefulness of word inflections for WSD in richly inflected languages like Tamil. The hypothesis is that "each sense of an ambiguous word will predominantly co-occur with words in a particular inflected form". Preliminary investigations indicate that the hypothesis holds for some senses of an ambiguous word, if not for all. It is therefore proposed to use this information together with the co-occurrence information described earlier.
Context based approach
The context in which an ambiguous word appears provides valuable clues for sense disambiguation. Here, 'context' means the nearby words in the sentence containing the ambiguous word. The notion of 'nearby' is not strict, since highly informative words may also appear some distance away from the ambiguous word. In most cases, however, informative words do occur near the ambiguous word and can be used reliably. The aim is to identify, for each sense of the ambiguous word, the context words that uniquely indicate that sense in the given context.
Case relation based approach
A new approach to WSD is being tried here, which uses the case markers of both the context words and the ambiguous word. The hypothesis is that "each sense of an ambiguous word with multiple senses is related to the nearby words (in a specific window) in a particular fashion". The hypothesis is supported by the fact that each sense of the ambiguous word occurs as an argument of a particular group of verbs, and each group in turn takes arguments with different relations (expressed by the case markers). Thus the case markers of the context words, and that of the ambiguous word itself, act as indicators for identifying the correct sense of the ambiguous word.
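One way to picture the case-marker evidence is as a per-sense frequency profile. The sketch below is an assumption-laden illustration: the case labels (LOC, DAT, ACC, NOM, INS) are placeholder tags rather than real analyser output, the sense-labeled occurrences are invented, and `prominent_cases` is a hypothetical name.

```python
from collections import Counter, defaultdict


def prominent_cases(occurrences, top=1):
    """Find the most frequent case markers for each sense.

    occurrences: iterable of (sense, target_case, context_cases) tuples,
    where target_case is the case marker on the ambiguous word and
    context_cases lists the case markers on the words in its window.
    """
    target = defaultdict(Counter)
    context = defaultdict(Counter)
    for sense, target_case, context_cases in occurrences:
        target[sense][target_case] += 1
        context[sense].update(context_cases)
    return {
        sense: {
            "target": [c for c, _ in target[sense].most_common(top)],
            "context": [c for c, _ in context[sense].most_common(top)],
        }
        for sense in target
    }


# Invented, sense-labeled toy occurrences with placeholder case tags.
occurrences = [
    ("sense1", "LOC", ["ACC"]),
    ("sense1", "LOC", ["ACC", "NOM"]),
    ("sense2", "DAT", ["NOM"]),
    ("sense2", "DAT", ["NOM", "INS"]),
]
profile = prominent_cases(occurrences)
print(profile["sense1"])  # {'target': ['LOC'], 'context': ['ACC']}
```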
Integrating context-based and case-based approaches
In the first phase (training phase) the analyser uses the CIIL corpus to collect the collocations and the prominent cases of each sense of an ambiguous word. The results obtained from this phase are represented in a disambiguation dictionary. This dictionary typically contains the following information for each sense of the ambiguous word:
i) Collocational words
ii) Prominent case marker(s) of the collocational words
iii) Prominent case marker(s) of the ambiguous word
In the next phase (testing phase) the analyser uses this information to disambiguate raw text. Here, the analyser looks for matching context words or case information in order to choose the right sense of the target word. A weighting function is used to smooth out the differences while combining information from these three sources.
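The testing phase can be sketched as a lookup in the disambiguation dictionary followed by a weighted score. The actual weighting function is not specified here, so the linear combination, the weights, the dictionary layout and the names `score_sense` and `disambiguate` below are all assumptions, and the entries use toy English words with placeholder case tags.

```python
def score_sense(entry, context_words, context_cases, target_case,
                weights=(0.5, 0.25, 0.25)):
    """Combine the three evidence sources for one sense (assumed linear)."""
    w_colloc, w_context_case, w_target_case = weights
    colloc_hits = len(set(context_words) & entry["collocations"])
    context_case_hits = sum(1 for c in context_cases
                            if c in entry["context_cases"])
    target_case_hit = 1 if target_case in entry["target_cases"] else 0
    return (w_colloc * colloc_hits
            + w_context_case * context_case_hits
            + w_target_case * target_case_hit)


def disambiguate(dictionary, context_words, context_cases, target_case):
    """Return the sense whose dictionary entry scores highest."""
    return max(dictionary,
               key=lambda sense: score_sense(dictionary[sense], context_words,
                                             context_cases, target_case))


# Toy disambiguation dictionary: one entry per sense.
dictionary = {
    "bank1": {"collocations": {"money", "loan"},
              "context_cases": {"ACC"}, "target_cases": {"LOC"}},
    "bank2": {"collocations": {"river", "water"},
              "context_cases": {"GEN"}, "target_cases": {"LOC"}},
}
print(disambiguate(dictionary, ["river", "boat"], ["GEN"], "LOC"))  # bank2
```

Because both toy senses share the same target case, the decision in this example rests on the collocation and context-case evidence.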
References (reduced list):
1. Bruce Rebecca and Janyce Wiebe. 1994. Word-sense disambiguation using decomposable models. In ACL. pp. 139-145. Las Cruces.
2. Dagan et al. 1991. Two languages are more informative than one. In ACL. pp. 130-137. Berkeley.
3. Hearst Marti A. 1991. Noun homograph disambiguation using local context in large text corpora. In NewOED and Text Research. pp. 1-22. Oxford.
4. Lesk M. 1986. Automatic sense disambiguation: How to tell a pine cone from an ice cream cone. In SIGDOC. New York.
5. Niwa Yoshiki and Yoshihiko Nitta. 1994. Co-occurrence vectors from corpora vs. distance vectors from dictionaries. In COLING. pp. 304-309.
6. Schutze H. 1998. Automatic Word Sense Discrimination. Computational Linguistics. 24(1). pp. 97-124.
7. Xiaobin Li, Stan Szpakowicz and Stan Matwin. 1995. A WordNet-based algorithm for word sense disambiguation. In IJCAI.
8. Yarowsky D. 1992. Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. In COLING. pp. 454-460. France.
9. Yarowsky D. 1995. Unsupervised Word Sense Disambiguation rivaling Supervised methods. In ACL. Massachusetts.