Introduction


Named Entity Recognition (NER) refers to the automatic identification of named entities in a text document. Given a text document, named entities such as person names, organization names, location names, and product names are identified and tagged. Identifying named entities is important in several higher-level language technology systems, such as information extraction systems, machine translation systems, and cross-lingual information access systems.

Over the past decade, Indian language content on media such as websites, blogs, email, and chat has increased significantly, with growth driven by people from non-metros and small cities. This huge volume of data needs to be processed automatically; in particular, companies want to ascertain public opinion on their products and services. This calls for natural language processing systems that identify entities and the associations or relations between them. Hence an automatic named entity recognizer is required.


This is the 2nd edition of this track in FIRE. The 1st edition of this track was conducted successfully last year at FIRE 2013, with 9 submissions from 5 participating teams.

In this 2nd edition, we focus on embedded (nested) named entity recognition. It is a known fact that some named entities contain other named entities inside them. In the field of named entity recognition, the task of identifying embedded named entities has largely been ignored. The advantage of embedded named entity recognition is that it helps in identifying entity relationships and benefits higher-level NLP applications, especially the development of information extraction systems. There are very few works on this for English, and only some for Indian languages.

One of the biggest challenges in embedded named entity recognition is the availability of benchmark data with embedded tagging, and for Indian languages no such data exists. In this 2nd edition we have made efforts to provide benchmark data for Indian languages with embedded tagging. The data provided here has 3 levels of embedding.
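The idea of multi-level embedded tagging can be illustrated with a small, purely hypothetical example, assuming BIO-style tags with one tag column per embedding level (the actual tagset and column layout of the released data may differ). In the phrase "University of Madras", the location "Madras" is nested inside an organization name:

```python
# Illustrative only: BIO tags with one column per embedding level.
# The real track data's tagset and layout are assumptions here.
tokens = ["University", "of", "Madras"]
level1 = ["B-ORG", "I-ORG", "I-ORG"]   # outermost entity
level2 = ["O",     "O",     "B-LOC"]   # entity embedded inside it
level3 = ["O",     "O",     "O"]       # no third-level entity here

for row in zip(tokens, level1, level2, level3):
    print("\t".join(row))
```

Each further level of embedding simply adds another tag column, so three levels of nesting fit naturally into a column-format corpus.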

The objectives of this evaluation exercise are:

  • Creation of benchmark data for the evaluation of Named Entity Recognition for Indian languages.
  • Focus on embedded Named Entity Recognition.
  • Encourage researchers to develop Named Entity Recognition (NER) systems for Indian languages.
  • Provide an opportunity for researchers to compare different machine learning techniques for NER.

Challenges in Indian Language NER

  • Indian languages belong to several language families, the major ones being the Indo-Aryan (a branch of Indo-European) and the Dravidian languages.
  • The challenges in NER arise due to several factors. Some of the main factors are listed below:
    1. Morphological richness - identifying the root word is difficult and requires the use of morphological analysers.
    2. No capitalization feature - in English, capitalization is one of the main features for NER, whereas Indian languages have no such feature.
    3. Ambiguity - ambiguity between common and proper nouns. E.g., the common word "Roja", meaning rose (the flower), is also a person's name.
    4. Spell variations - in web data, different people spell the same entity differently. For example, the Tamil person name "Roja" is spelt both "rosa" and "roja".



Test Corpus


Test Corpus can be downloaded from the below links:
English - Click Here
Hindi - Click Here
Tamil - Click Here
Malayalam - Click Here


The corpus is password-protected; registered participants will be provided with passcodes through email.

Training Corpus


To obtain the corpus, teams need to register.

Training Corpus can be downloaded from the below links:
English - Click Here
Hindi - Click Here
Tamil - Click Here
Malayalam - Click Here


The corpus is password-protected; participants will be provided with passcodes after registering.

Registration


Please register by sending an email to sobha@au-kbc.org with the following details:
"Team Leader Name", "Team Affiliation", "Team Contact Person Name", "Email ID", and "Languages for which participating"


Submission Format


The training data is in column format, where the last three columns are the NE tags for each level of embedding. The test data will be provided in the same format as the training data, except that the NE tag columns will be absent. Participants must submit their test runs in the same format as the training data.
Note: There should be no changes/alterations to the number of rows provided in the test data.
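Before submitting, a run can be sanity-checked against the row-count requirement. The sketch below assumes a CoNLL-style layout (one token per line, whitespace-separated columns, blank lines between sentences) and three appended tag columns; the exact column layout is an assumption, not the official specification:

```python
# Sanity check for a submission run, assuming a CoNLL-style column
# format with three NE tag columns appended to each token line.
# (Column layout is an assumption; consult the released data.)

def read_rows(lines):
    """Return the non-empty (token) lines of a column-format file."""
    return [line.rstrip("\n") for line in lines if line.strip()]

def check_run(test_lines, run_lines, n_tag_columns=3):
    """Verify the run keeps the test row count and adds the tag columns."""
    test_rows = read_rows(test_lines)
    run_rows = read_rows(run_lines)
    if len(test_rows) != len(run_rows):
        return False, "row count changed: %d vs %d" % (len(test_rows), len(run_rows))
    for i, (t, r) in enumerate(zip(test_rows, run_rows)):
        if len(r.split()) != len(t.split()) + n_tag_columns:
            return False, "line %d: expected %d extra tag columns" % (i + 1, n_tag_columns)
    return True, "ok"
```

Running such a check locally avoids a submission being rejected for an accidentally dropped or duplicated row.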

Evaluation Criteria


For the evaluation, we will consider all the embedded tags, not just the outermost-level tags. The evaluation metrics will be along the lines of precision and recall. More detailed evaluation metrics will be announced soon.
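The key point, that every embedding level is scored rather than only the outer one, can be sketched as a token-level precision/recall computation. This assumes gold and predicted tags are given per token as tuples of (level-1, level-2, level-3) tags with "O" for non-entity tokens; the official metric may well be span-based instead, so treat this only as an illustration of the idea:

```python
# Illustrative token-level precision/recall over all embedding levels.
# Assumes parallel per-token tag tuples for gold and predicted output;
# the track's official metric may differ (e.g. span-level matching).

def precision_recall(gold, pred):
    """gold, pred: parallel lists of per-level tag tuples, "O" = no entity."""
    tp = fp = fn = 0
    for g_tags, p_tags in zip(gold, pred):
        for g, p in zip(g_tags, p_tags):
            if p != "O" and p == g:
                tp += 1                     # correctly tagged entity token
            elif p != "O":
                fp += 1                     # spurious or wrong tag
                if g != "O":
                    fn += 1                 # the gold tag was also missed
            elif g != "O":
                fn += 1                     # entity token tagged as "O"
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Because inner-level columns are scored the same way as the outer one, a system that ignores embedded entities loses recall on every nested mention.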

Task Coordinators - Organizing Committee


Computational Linguistics Research Group (CLRG),
AU-KBC Research Centre



Pattabhi RK Rao, AU-KBC Research Centre, Chennai, India.
Malarkodi CS, AU-KBC Research Centre, Chennai, India.
Vijay Sundar Ram, AU-KBC Research Centre, Chennai, India.
Sobha Lalitha Devi, (Chair) , AU-KBC Research Centre, Chennai, India.