AU-KBC RESEARCH CENTRE
Noun Phrase Chunking
Introduction:
Text chunking divides a sentence into non-overlapping phrases. Noun
phrase (NP) chunking extracts the noun phrases from a sentence.
While NP chunking is much simpler than full parsing, building an accurate
and efficient NP chunker is still a challenging task.
NP chunking is important because many applications rely on it.
Applications:
NP chunking can be used as a pre-processing step before parsing a text.
Because natural language is highly ambiguous, exact parsing of a text
can become very complex. In such cases chunking can partially resolve
the ambiguities before the parser runs.
Noun phrases can be used in Information Retrieval systems. Here chunking
allows data to be retrieved from documents on the basis of chunks rather
than individual words. Nouns and noun phrases in particular are especially
useful for retrieval and extraction purposes.
Most recent work on machine translation uses texts in two languages
(parallel corpora) to derive useful transfer patterns. Noun phrases also
have applications in aligning text in parallel corpora. Sentences in the
parallel corpora can be aligned by using chunk information, relating the
chunks in the source language to those in the target language. This can
be done much more easily than word alignment between the texts of the
two languages.
Further, chunked noun phrases can also be used in other applications
where in-depth parsing of the data is not necessary.
Approach to the project:
The project will take a rule-based approach. A corpus is first divided
into two or more sets, one of which is used as the training data. The
training set is manually chunked for noun phrases, and from this a set of
rules is derived that can be applied to separate the noun phrases in a
sentence. These rules serve as the basis for chunking: the chunker program
applies them to chunk the test data, which tests the coverage of the
rules. Precision and recall are calculated on the test set, and the
results are analyzed to check whether more rules are needed to improve
the coverage of the system. If so, additional rules are added and the same
process is repeated to check for an increase in precision and recall.
The system can then be tested in various other applications.
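
The cycle described above (apply rules, then measure precision and recall
against manually chunked data) can be illustrated with a minimal sketch.
It is not the project's actual rule set or program. It assumes the input
is already part-of-speech tagged with Penn Treebank style tags, and it uses
one hypothetical rule: an optional determiner, any number of adjectives,
then one or more nouns. The names chunk_noun_phrases and precision_recall
are invented for this illustration. A predicted chunk counts as correct
only if its boundaries match a manually marked chunk exactly.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)
Span = Tuple[int, int]   # token indices [start, end) of one noun phrase

# Assumed tag sets (Penn Treebank style); a real rule set would be larger.
DETERMINERS = {"DT"}
ADJECTIVES = {"JJ"}
NOUNS = {"NN", "NNS", "NNP"}


def chunk_noun_phrases(tokens: List[Token]) -> List[Span]:
    """Apply one illustrative rule: DT? JJ* NN+ (not the project's rules)."""
    spans: List[Span] = []
    i, n = 0, len(tokens)
    while i < n:
        j = i
        if tokens[j][1] in DETERMINERS:   # optional determiner
            j += 1
        while j < n and tokens[j][1] in ADJECTIVES:   # any adjectives
            j += 1
        noun_start = j
        while j < n and tokens[j][1] in NOUNS:        # one or more nouns
            j += 1
        if j > noun_start:   # at least one noun was found: the rule matched
            spans.append((i, j))
            i = j            # continue after the chunk, so chunks never overlap
        else:
            i += 1           # no noun phrase starts here; try the next token
    return spans


def precision_recall(predicted: List[Span], gold: List[Span]) -> Tuple[float, float]:
    """Precision = correct / predicted chunks; recall = correct / gold chunks."""
    pred_set, gold_set = set(predicted), set(gold)
    correct = len(pred_set & gold_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall


if __name__ == "__main__":
    sentence = [("He", "PRP"), ("saw", "VBD"), ("the", "DT"),
                ("old", "JJ"), ("house", "NN")]
    gold = [(0, 1), (2, 5)]   # manual chunks: "He" and "the old house"
    predicted = chunk_noun_phrases(sentence)
    print(predicted, precision_recall(predicted, gold))
```

In this example the single rule finds "the old house" but misses the
pronoun "He", so recall is lower than precision. That is the kind of gap
that would prompt adding a further rule and repeating the test.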
S. Thiyagarajan