Indian languages have their own scripts. However, user-generated content such as tweets, blogs, and personal websites is often written in Roman script, or sometimes in a mix of Roman and indigenous scripts. The Roman script represents the native-language words phonetically. The use of a non-native script can be attributed to the bilingualism and multilingualism prevalent in the country. People are also more accustomed to Roman script when using electronic devices such as smartphones, desktops, and laptops. With growing smartphone usage and easy internet access, users are adopting English words as part of their own language, not only in speech but also in writing. These forms are becoming a new norm of communication on social media, and such content has grown significantly. This type of text poses a new challenge for text analytics, since such data must be processed automatically for various applications.

In this shared task initiative we present the task of identifying entities in code-mixed text. For the present initiative we have chosen Hindi-English and Tamil-English content from tweets and a few microblogs. Entities are real-world elements or objects such as person names, organization names, product names, and location names, and are often referred to as named entities. Entity extraction refers to the automatic identification of named entities in a text document. Identifying named entities is important for several higher-level language technology systems, such as information extraction, machine translation, and cross-lingual information access systems.

The objectives of this evaluation exercise are:

  • Create benchmark data for Code Mix Entity Extraction in Indian languages found in user-generated content.
  • Encourage researchers to develop Named Entity Recognition (NER) systems for Code Mix content.
  • Provide an opportunity for researchers to compare different machine learning techniques.

Task Description

The task is to identify entities such as person names, organization names, movie names, and location names in a given tweet. The tweets are written in Roman script and are code-mixed, with an Indian language mixed with English. For this initiative we have chosen two Indian languages, Hindi and Tamil. The data will be in column format. In the training phase we will provide two files: one containing the tweet IDs and another containing the annotations. The annotation file has six columns: Tweet ID, User ID, NE Type, NE String, NE character start index, and length. A sample annotation file is provided below. Participants are to submit a similar annotation file for the test data.
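To make the six-column layout concrete, here is a minimal Python sketch of how an annotation file might be read. Several details are assumptions not stated above: that the columns are tab-separated, that the start index is zero-based, and that the NE type labels look like PERSON or LOCATION. The sample rows and the tweet text are invented for illustration and are not taken from the released data.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Annotation:
    """One row of the annotation file (six columns)."""
    tweet_id: str
    user_id: str
    ne_type: str
    ne_string: str
    start: int   # character start index (assumed zero-based)
    length: int  # number of characters in the entity

    @property
    def end(self) -> int:
        return self.start + self.length


def read_annotations(handle):
    """Parse annotation rows; the tab delimiter is an assumption."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    annotations = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != 6:
            raise ValueError(f"expected 6 columns, got {len(row)}: {row!r}")
        tweet_id, user_id, ne_type, ne_string, start, length = row
        annotations.append(
            Annotation(tweet_id, user_id, ne_type, ne_string, int(start), int(length))
        )
    return annotations


def span_matches(tweet_text, annotation):
    """Check that the offsets select the annotated string in the tweet."""
    return tweet_text[annotation.start:annotation.end] == annotation.ne_string


# Invented example data (hypothetical IDs and labels).
SAMPLE = "1001\t42\tPERSON\tSachin\t0\t6\n1001\t42\tLOCATION\tMumbai\t20\t6\n"
TWEET = "Sachin ne kal match Mumbai mein khela"

annotations = read_annotations(io.StringIO(SAMPLE))
all_consistent = all(span_matches(TWEET, a) for a in annotations)
```

Checking each span against the tweet text in this way is a cheap way to catch off-by-one errors in the offsets before submitting a run.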

For Sample Data Click Here

Training Corpus

Released on Sep 21st 2016 !!
The data will be emailed to participants after they register and fill in the copyright form online. The link to the online copyright form will be sent by email after registration.


Registration is now open !!!
Please register by sending email to with details
"Team Leader Name", "Team Affiliation", "Team Contact Person name" and "Email ID", "Languages for which participating".

Submission Format

The training data will be in column format. The test data will be provided in the same format as the training data, except that the NE tags will be absent. Participants have to submit their test runs in the same format as the training data.
Note: There should be no changes/alterations in the number of rows provided in the test data.
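
As a quick sanity check before submission, the row-count requirement could be verified with a sketch like the following. The function names are hypothetical, and it assumes that every line, including any blank line, counts as a row.

```python
def count_rows(lines):
    """Count rows in any iterable of lines, such as an open file."""
    return sum(1 for _ in lines)


def check_row_count(test_lines, run_lines):
    """Raise ValueError if the run does not have as many rows as the test data."""
    expected = count_rows(test_lines)
    actual = count_rows(run_lines)
    if expected != actual:
        raise ValueError(f"row count mismatch: test has {expected}, run has {actual}")
    return expected
```

Both arguments can be open file handles, for example `check_row_count(open("test.txt"), open("run.txt"))`, where the file names are placeholders.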

Evaluation Criteria

Will be announced soon !!!

Task Coordinators - Organizing Committee

Computational Linguistics Research Group (CLRG),
AU-KBC Research Centre

Pattabhi RK Rao, AU-KBC Research Centre, Chennai, India.
Sobha Lalitha Devi, AU-KBC Research Centre, Chennai, India.