

Noun Phrase Chunking


Text chunking divides a sentence into non-overlapping phrases. Noun phrase (NP) chunking extracts the noun phrases from a sentence. While NP chunking is much simpler than full parsing, building an accurate and efficient NP chunker is still a challenging task.
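To make the task concrete, here is a minimal sketch of NP chunking over a POS-tagged sentence. The tag set and the single grouping rule are deliberate simplifications for illustration, not the rules developed in this project:

```python
# Minimal illustrative NP chunker: groups maximal runs of
# determiner/adjective/noun tags (DT, JJ, NN, NNS) into noun phrases.
# The tag set and the single rule are simplifications for illustration.

NP_TAGS = {"DT", "JJ", "NN", "NNS"}

def chunk_nps(tagged):
    """Return noun-phrase chunks from a list of (word, POS-tag) pairs."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in NP_TAGS:
            current.append(word)
        elif current:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

sentence = [("The", "DT"), ("quick", "JJ"), ("fox", "NN"),
            ("jumped", "VBD"), ("over", "IN"),
            ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
print(chunk_nps(sentence))  # → ['The quick fox', 'the lazy dog']
```

A real chunker would need many more rules to handle constructions such as conjunctions and prepositional attachment, which is exactly why the rule set is refined iteratively against test data.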

NP chunking is important because it is used in many applications.


Noun phrase chunking can serve as a pre-processing step before parsing. Because natural language is highly ambiguous, exact parsing can become very complex; chunking the text first can partially resolve these ambiguities.

Noun phrases can be used in Information Retrieval systems, where chunking allows data to be retrieved from documents based on chunks rather than individual words. Nouns and noun phrases are particularly useful for retrieval and extraction purposes.

Most recent work on machine translation uses texts in two languages (parallel corpora) to derive useful transfer patterns. Noun phrases also have applications in aligning text in parallel corpora: the sentences can be aligned using chunk information, by relating the chunks in the source language to those in the target language. This is considerably easier than aligning the two texts word by word.

Chunked noun phrases can also be used in other applications where an in-depth parse of the data is not necessary.

Approach to the project:

The project will take a rule-based approach. A corpus is first divided into two or more sets, one of which serves as the training data. The training set is chunked for noun phrases by hand, and from it rules for separating the noun phrases in a sentence are derived. These rules form the basis for chunking: the chunker program applies them to the test data, and the coverage of the rules is measured on that set. Precision and recall are calculated, and the results are analyzed to decide whether more rules are needed to improve the system's coverage. If so, additional rules are added and the process above is repeated to check for an increase in precision and recall. The system can then be tested on various other applications.
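The evaluation step above can be sketched as follows. This compares the chunker's output against the hand-chunked gold standard by exact (start, end) span match; the sample spans are made up for illustration:

```python
# Sketch of chunk evaluation: precision and recall of predicted noun-phrase
# chunks against a manually chunked gold standard. Chunks are compared as
# exact (start, end) token spans; the sample spans below are invented.

def precision_recall(predicted, gold):
    """Compute precision and recall over sets of chunk spans."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

gold_spans = {(0, 2), (4, 6), (8, 9)}       # spans from the hand-chunked data
predicted_spans = {(0, 2), (4, 6), (7, 9)}  # spans produced by the chunker
p, r = precision_recall(predicted_spans, gold_spans)
print(f"precision={p:.2f} recall={r:.2f}")  # → precision=0.67 recall=0.67
```

Precision falls when the chunker proposes spurious chunks; recall falls when it misses chunks present in the gold data. Analyzing which gold chunks were missed suggests which rules to add in the next iteration.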


S. Thiyagarajan