AU-KBC RESEARCH CENTRE

(+91) 044 2223 2711

MIT Anna University
Chromepet, Chennai-44 India.

AUKBC Tamil Part-of-Speech (POS) Corpus
(AUKBC-TamilPOSCorpus2016v1)
Released on 24th May 2016 at WILDRE 3, co-located with the 10th edition of LREC, Portorož (Slovenia). This is the largest POS-tagged corpus available for Indian languages, with 500K tokens annotated using the BIS POS tagset.


AUKBC Tamil Part-of-Speech (POS) Engine
(AUKBC-TamilPOSEngine2016v1)
This Tamil POS tagging engine automatically tags a given Tamil text (in UTF-8) with Part-of-Speech tags that follow the BIS POS tagset. The engine is released under the GNU GPL version 3.0 license.

Tamil WordNet

  • Tamil WordNet captures the network of lexical relations between lexical items in Tamil.
  • Lexical items are related to one another through hyponymy-hypernymy and meronymy-holonymy relationships, as well as other relationships such as opposites and synonyms.

  • Please click here to download Tamil WordNet
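
As a rough illustration of the relation types listed above, the sketch below stores lexical items as nodes and relations as labelled, bidirectional links. The class name, the relation labels and the sample entries are assumptions made for this example; they do not reflect the actual file format or schema of Tamil WordNet.

```python
from collections import defaultdict

# Hypothetical relation labels, each paired with its inverse.
# These names are illustrative only, not the Tamil WordNet schema.
INVERSE = {
    "hypernym": "hyponym",
    "hyponym": "hypernym",
    "holonym": "meronym",
    "meronym": "holonym",
    "synonym": "synonym",
    "antonym": "antonym",
}


class LexicalNetwork:
    """A toy network of lexical items linked by labelled relations."""

    def __init__(self):
        # item -> relation label -> set of related items
        self._edges = defaultdict(lambda: defaultdict(set))

    def add_relation(self, source, relation, target):
        """Record that `target` is a `relation` of `source`, plus the inverse link."""
        if relation not in INVERSE:
            raise ValueError(f"unknown relation: {relation}")
        self._edges[source][relation].add(target)
        self._edges[target][INVERSE[relation]].add(source)

    def related(self, item, relation):
        """Return the items linked to `item` by `relation`, sorted."""
        return sorted(self._edges.get(item, {}).get(relation, set()))


# Sample entries chosen for illustration: மரம் (tree), தாவரம் (plant), கிளை (branch).
network = LexicalNetwork()
network.add_relation("மரம்", "hypernym", "தாவரம்")  # a plant is a hypernym of tree
network.add_relation("மரம்", "meronym", "கிளை")     # a branch is a part of a tree
```

Storing the inverse link on every insertion means that a query such as `network.related("தாவரம்", "hyponym")` can be answered without scanning the whole network.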

Event Annotated Data in Indian Language:

  • Event-annotated Indian-language text from various sources such as newswires, blogs, Facebook and Twitter is made available to the research community. The data is available in three Indian languages (Hindi, Malayalam and Tamil) along with English. It is made available through the FIRE shared task tracks http://78.46.86.133/EventXtractionIL-FIRE2018/ and http://78.46.86.133/EventXtractionIL-FIRE2017/.
  • This data would help researchers build applications for areas such as disaster management and crime tracking.

Verb Phrase Translation Data in Indian Languages:

  • Verb Phrase Translation Data provides parallel sentences for English to Tamil and Hindi to Tamil with Verb Phrase (VP) indexing. In developing a machine translation system, translating the verb phrase from the source language to the target language is a challenging task. This annotated data will help researchers develop machine translation systems for Indian languages. The data is made available through http://78.46.86.133/VPT-IL-FIRE2018/

Code Mix Named Entity Annotated Data in Indian Language (Twitter Code Mix data):

  • Named Entity annotated data for code-mixed Indian language text (Twitter data) is made available to the research community. The data covers Tamil-English and Hindi-English code-mixed text.
  • This data is made available through the FIRE shared task track http://www.au-kbc.org/nlp/CMEE-FIRE2016/

Named Entity Annotated Data - Social Media Text corpus (Twitter Data):

  • Named Entity annotated data in Indian language social media text is made available to the research community. The data is available in Hindi, Malayalam, Tamil and English, and is made available through the FIRE shared task track http://au-kbc.org/nlp/ESM-FIRE2015/
  • This data can be obtained by writing to us.

Named Entity Annotated Data - Newswire Text corpus:

  • Named Entity annotated data in four Indian languages (Bengali, Hindi, Malayalam and Tamil) is made available to the research community. The data is made available through the FIRE (Forum for Information Retrieval Evaluation) shared task tracks of 2013 and 2014. The corpus can be downloaded from the link below.
  • http://au-kbc.org/nlp/NER-FIRE2013/index.html
  • Access to the data can be obtained by writing to us, as described on the track website.