
AU-KBC RESEARCH CENTRE

AnnCorra

An Introduction


OUTLINE

1. BACKGROUND

2. GUIDELINES

2.1. AN ILLUSTRATION

2.1.1. WHAT TO MARK

2.1.2. HOW TO MARK

3. DEFAULT CONVENTIONS

4. GRAMMATICAL MODEL FOLLOWED

5. TAGSETS

5.1. TAGSET-1

5.2. TAGSET-2

6. DOs AND DON'Ts

6.1. DOs # CORRECTIONS ??

6.2. DON'Ts

7. SAMPLE INPUT

8. SAMPLE ENTRY

1. BACKGROUND

AnnCorra is a project for developing Lexical Resources for Indian

Languages (LERIL). The decision to take it up was made at

the "Workshop on Lexical Resources for Natural Language Processing",

5-8 Jan 2001, held at IIIT Hyderabad.

The name AnnCorra, short for "Annotated Corpora",

denotes an electronic lexical resource of annotated corpora.

The purpose behind this effort is to fill the lacuna in such

resources for Indian languages. It will be an important resource

for the development of Indian language parsers, machine learning of

grammars, lakshancharts (discrimination nets for sense disambiguation)

and a host of other tools.

This is a project of LERIL (Lexical Resources for Indian

Languages), an open-source, collaborative initiative of several groups

(and individuals) to create shareable resources for Indian languages.

Another project, TransLexGram, is already underway.

The AnnCorra effort is being started on the basis of the electronic

corpora freely available for various Indian languages. One such resource

is the English-Hindi Electronic Dictionary developed through a voluntary

collaborative effort coordinated by the Language Technologies Research

Centre, Indian Institute of Information Technology, Hyderabad. Another

resource is an electronic corpus of Hindi developed by the Ministry of

Information Technology, Government of India.

Like TransLexGram, the present task is also a collaborative effort

among individuals and institutions. The resources so developed will be

available as a "free" resource under the GPL (General Public License).

The effort is overseen by a steering committee coordinated

by the natural language technology team at NCST, Mumbai. If you wish to

join the effort, send an email to < leril ncst. ernet. in>.

2. GUIDELINES

The effort requires you to do the following:

1. Analyse the sentences.

2. Mark the tags expressing the analysis (a tagset is provided).

3. If a sentence is generally ambiguous, but has a single

meaning in a given context, then only that meaning

should be marked.

The task can be better understood by looking at some examples.

2.1. AN ILLUSTRATION

Here is a sentence from Hindi:

0:: rAma ne mohana ko nIlI kiwAba xI

Tree-1 represents the verb-argument relationships among

the constituents of the sentence:

                      xI
          -------------------------
          |           |           |
        k1|         k4|         k2|
          |           |           |
       rAma_ne    mohana_ko    kiwAba
                                  |
                                  | adj
                                  |
                                nIlI

                   Tree-1
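As a rough illustration, the relationships drawn in Tree-1 can also be held as a list of (head, label, dependent) triples. The sketch below is in Python; the names TREE_1, children and root are introduced here for illustration only and are not part of the AnnCorra guidelines.

```python
from collections import defaultdict

# Tree-1 as (head, label, dependent) triples, read off the diagram.
TREE_1 = [
    ("xI", "k1", "rAma_ne"),
    ("xI", "k4", "mohana_ko"),
    ("xI", "k2", "kiwAba"),
    ("kiwAba", "adj", "nIlI"),
]


def children(edges):
    """Group the (label, dependent) pairs under each head."""
    by_head = defaultdict(list)
    for head, label, dependent in edges:
        by_head[head].append((label, dependent))
    return dict(by_head)


def root(edges):
    """Return the single word that is a head but never a dependent."""
    heads = {head for head, _, _ in edges}
    dependents = {dependent for _, _, dependent in edges}
    (only_root,) = heads - dependents
    return only_root
```

Here root(TREE_1) gives the verb 'xI', and children(TREE_1)["xI"] lists its three arguments with their labels.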

2.1.1. WHAT TO MARK

Since the objective of tagging is to mark explicitly the relationships

between the components of a sentence, verbs and their

arguments have to be marked. If needed, some relationships between

nouns and other grammatical categories such as adjectives are also

to be marked.

2.1.2. HOW TO MARK

The information represented in the tree above can also be represented

in a linear fashion. The tags showing the branch can be marked after

the constituent they refer to.

- First, bracket the elements forming a certain relationship

within square brackets ([]):

rAma_ne mohana_ko [nIlI kiwAba] xI

- Next, add the appropriate tag markers:

rAma_ne/ mohana_ko/ [nIlI kiwAba]/ xI::

NOTE - Symbol '/' denotes an ARC tag and '::' denotes a NODE

tag (explained in greater detail under TAGSETS).

- Then type in the required tag name:

rAma_ne/ k1 mohana_ko/ k4 [nIlI kiwAba]/ k2 xI:: v

The following tags (most of which are based on the Paninian grammatical model)

have been used above. (A more comprehensive list of tags is given under

TAGSETS.)

k1 : karwA

k2 : karma

k4 : sampraxAna

v : kriyA
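A minimal sketch, in Python, of how a fully tagged flat entry of this kind might be read back by a program. It assumes that the marker may be separated from the tag name by a space (as printed above), and it handles only un-nested brackets and single-word tag names. Unit and parse_flat are hypothetical names, not part of any AnnCorra tool.

```python
import re
from typing import NamedTuple, Optional


class Unit(NamedTuple):
    words: tuple            # word(s) of the constituent, brackets removed
    kind: Optional[str]     # "arc" for '/', "node" for '::', None if untagged
    tag: Optional[str]      # tag name such as "k1" or "v"


# A bracketed group or a bare word, optionally followed by '/' or '::'
# and a tag name (whitespace between marker and tag is tolerated).
_UNIT = re.compile(
    r"(?P<text>\[[^\[\]]*\]|[^\s\[\]/:]+)"
    r"(?:(?P<mark>::|/)\s*(?P<tag>[^\s\[\]/:]+))?"
)


def parse_flat(entry):
    """Split a flat tagged entry into Unit records, left to right."""
    units = []
    for match in _UNIT.finditer(entry):
        words = tuple(match.group("text").strip("[]").split())
        kind = {"/": "arc", "::": "node"}.get(match.group("mark"))
        units.append(Unit(words, kind, match.group("tag")))
    return units
```

For the entry above, parse_flat returns four units: 'rAma_ne' with arc tag k1, 'mohana_ko' with k4, the group ('nIlI', 'kiwAba') with k2, and 'xI' with node tag v.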

The idea here is to mark only the specific grammatical information.

Whatever follows from the DEFAULT CONVENTIONS is left unmarked. For example, the adjective

'nIlI' of 'kiwAba' has been left unmarked in the above example.

The following DEFAULT CONVENTIONS save unnecessary typing effort.

3. DEFAULT CONVENTIONS

1) Within brackets, the rightmost element is the Head.

Example - in the constituent [nIlI kiwAba] the noun 'kiwAba'

would be the Head.

2) In case the noun is followed by a postposition (vibhakti etc.),

the postposition should be included within the brackets, and the unit noun_vibhakti

remains the head.

Example - [nIlI kiwAba_meM]

Here the noun 'kiwAba' is followed by the postposition 'meM';

the head, in this case, is 'kiwAba_meM' and not 'meM' alone.

3) If the number of elements within brackets is more than one,

then by default all of them are to be taken as modifiers of the

head.

Example - [merI nIlI kiwAba]: both 'merI' and 'nIlI' modify

the Head 'kiwAba'.

4) In case the number of elements within brackets is more than

two (Head plus two) and one or more of them do not modify the head,

then this should be marked.

Example - [halkI nIlI kiwAba]

Here, 'halkI' can qualify either 'nIlI' or 'kiwAba'. In case it

modifies 'kiwAba', say, in terms of light weight, then it should

be left unmarked. But if it modifies 'nIlI', in terms of light shade,

then it SHOULD be marked. Mark this by adding '>' on the right of 'halkI':

[halkI> nIlI kiwAba]

Symbol '>' indicates that the element immediately on its right is

modified.

5) Karakas attach to the nearest verb on the right (inflected kriya

or kridanta).

[rAma_ne/ k1 KAnA/ k2 KAkara:: Kr pAnI/ k2 piyA:: v]< s>

There are two 'k2's in the above example and two verbs (Kr and

v). By default, therefore, the first, 'KAnA/ k2', will attach itself to

'KAkara:: Kr' (the nearest verb) and 'pAnI/ k2' to 'piyA:: v'.

6) Karta karaka (k1) has a special default rule. If there is only one

'k1' and more than one verb, then the default

is that 'k1' should attach to the main verb. Example -

[rAma_ne/ k1 KAnA/ k2 KAkara:: Kr pAnI/ k2 piyA:: v]< s>

Semantically, the element 'rAma_ne/ k1' is the agent of

both 'KAkara' and 'piyA', but since agreement and its vibhakti are

controlled by the second verb,

it will be considered as attaching itself to the main verb 'piyA'.

Example - [rAma KAnA KAkara pAnI pIwA hE]< s>

[sIwA KAnA KAkara pAnI pIwI hE]< s>

7) A kridanta attaches to the immediately succeeding noun or verb (depending

on the type of kridanta).

For example, in Hindi the 'kar-kridanta' attaches itself to another

verb, e.g., [rAma_ne/ k1 KAnA/ k2 KAkara:: Kr: i pAnI/ k2 piyA:: v: i]< s>.

': i' in the above example indicates that 'KAkara' attaches itself

to the other ': i' element, 'v: i'.

But since this is a default, it need not be marked. So the entry would

be [rAma_ne/ k1 KAnA/ k2 KAkara:: Kr pAnI/ k2 piyA:: v]< s>

However, the participle form 'wA_huA' in Hindi can modify either

a noun or a verb. For example, take the Hindi sentence -

[mEMne/ k1 xOdZawe_hue:: Kr GodZe_ko/ k2 xeKA:: v]< s>

This is an ambiguous sentence having two senses:

a) [mEMne [xOdZawe hue]: i GodZe: i ko xeKA]< s> ; as in,

[mEMne xOdZawA_huA GodZA xeKA]< s>

b) [mEMne [xOdZawe hue]: i GodZe ko xeKA: i]< s> ; as in,

[mEMne xOdZawe_hue GodZA xeKA]< s>

As earlier, the symbol ': i' in the above sentences (a and b) indicates the

element to which the participle form 'xOdZawe_hue' attaches itself.

In a) the meaning is 'I saw a horse while the horse was running'

and in b) the sense is 'While I was running I saw the horse'.

Therefore, by default, a) will not be marked but b) WILL BE MARKED.

Please NOTE that in such sentences (which are ambiguous in isolation),

the user should judge the correct meaning in the given context and mark

appropriately.
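The defaults above can be pictured as a small procedure. The Python sketch below is only an illustration under stated assumptions: it covers conventions 1 to 6, ignores explicit ': i' co-indexing, treats any arc tag beginning with 'k' as a karaka, and treats the node tag 'v' as the main verb. The function names modifiers and attach_karakas are hypothetical.

```python
def modifiers(group):
    """Conventions 1-4: map each non-head word in a bracketed group to
    the word it modifies. The rightmost word is the head; a word marked
    with '>' modifies the word immediately to its right instead."""
    words = group.strip("[]").split()
    head = words[-1]
    links = {}
    for i, word in enumerate(words[:-1]):
        if word.endswith(">"):
            links[word[:-1]] = words[i + 1].rstrip(">")
        else:
            links[word] = head
    return links


def attach_karakas(units):
    """Conventions 5-6. `units` is a list of (word, marker, tag) triples
    with marker '/' (arc) or '::' (node). Returns (word, tag, verb)."""
    verbs = [(i, w, t) for i, (w, m, t) in enumerate(units)
             if m == "::" and t in ("v", "Kr")]
    main_verbs = [w for _, w, t in verbs if t == "v"]
    k1_count = sum(1 for _, m, t in units if m == "/" and t == "k1")
    result = []
    for i, (word, marker, tag) in enumerate(units):
        if marker != "/" or not tag.startswith("k"):
            continue
        if tag == "k1" and k1_count == 1 and len(verbs) > 1 and main_verbs:
            target = main_verbs[-1]          # convention 6: main verb
        else:
            # convention 5: nearest verb to the right, if any
            target = next((vw for vi, vw, _ in verbs if vi > i), None)
        result.append((word, tag, target))
    return result
```

With the example sentence of conventions 5 and 6, 'rAma_ne' attaches to 'piyA', 'KAnA' to 'KAkara', and 'pAnI' to 'piyA'. Cases such as the ambiguous participle in convention 7 still need the annotator's judgement and an explicit mark.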

4. GRAMMATICAL MODEL FOLLOWED

The Paninian grammatical model has been chosen here for sentence

analysis, and hence for the tag names as well. Preference for this model

is based on the following factors -

1) Being based on the analysis of an Indian language, it

can deal better with the types of construction Indian

languages have. It is therefore more appropriate for Indian

language analysis.

2) The model not only offers a mechanism for SYNTACTIC analysis

but also incorporates SEMANTIC information (nowadays

called dependency analysis), thus making the relationships

more transparent.

5. TAGSETS

The tagsets used here have been divided into two categories -

1) TAGSET-1 - Tags which express relationships are marked by

a preceding '/'.

For example, kArakas are grammatical relationships,

thus they are marked '/ k1', '/ k2', '/ k3' etc.

2) TAGSET-2 - Tags expressing the type of node are marked by

a preceding '::'.

Verbs etc. are nodes, so they will be marked ':: v' etc.

NOTE : a) Items marked '***' in the lists below are OPEN FOR DISCUSSION.

b) More tags can be added as and when the need arises.

5.1. TAGSET-1 (Expressing relationship labels) Marked '/'

s : Sentence

Example - [rAma ne KIra KAyI]< s>

k1 : karwA

Example - [rAma_ne/ k1 KIra KAyI]< s>

k2 : karma

Example - [rAma_ne KIra/ k2 KAyI]< s>

k3 : karaNa

Example - [rAma_ne cammaca_se/ k3 KIra KAyI]< s>

k4 : sampraxAna

Example - [rAma_ne mohana_ko/ k4 KIra xI]< s>

k5 : apAxAna

Example - [rAma_ne katorI_se/ k5 cammaca_se KIra KAyI]< s>

h : hewu

Example - [mohana [vyavasAyika lakRya_se]/ h kAma karawA hE]< s>

t : wAxarWya

Example - [mohana paDZane_ke_liye/ t skUla jAwA hE]< s>

k7.1 : kAlAXikaraNa

Examples - [kala/ k7.1 pAnI barasA]< s>

[[usa jZamAne_meM]/ k7.1 mazhagAI kama WI]< s>

[bacapana_meM/ k7.1 vaha bahuwa SEwAna WA]< s>

[pahale/ k7.1 rAma AyA]< s>

k7.2 : xeSAXikaraNa

Examples - [mejZa_para/ k7.2 kiwAba hE]< s>

[havA_meM/ k7.2 TaMdaka hE]< s>

k7.3 : viRayAXikaraNa # or some other relation ??? ***

Examples - [bahuwa se yuvA [isa svawanwrawA saMgrAma_meM]/ k7.3

hissA liyA]< s>

[unhoMne apane SiRya ko ASrama kI sevAoM se

[mukwa karane_meM]/ k7.3 saMkoca nahIM kiyA]< s>

k1 ud : karwA-uxxeSya ***

Example - [XaniyA/ k1 ud iwanI vyavahArakuSala na WI]< s>

k1 vid : karwA-viXeya ***

Example - [XaniyA [iwanI vyavahArakuSala]/ k1 vid na WI]< s>

k2 ud : karma-uxxeSya ***

Example - [rAma mohana_ko/ k2 ud buxXimAna mAnawA hE]< s>

k2 vid : karma-viXeya ***

Example - [rAma mohana_ko buxXimAna/ k2 vid mAnawA hE]< s>

Vjt : jyoM-wyoM/jaba-waba samAnakAlikawva sambanXa

Example - [jyoM-jyoM puswaka kI kImawa baDZa_rahI_hE/ Vjt: i

wyoM-wyoM pATaka kI kraya Sakwi Gata_rahI_hEê]/ Vjt: i]< s>

up : upapaxa

Examples - [rAma/ k1: i mohana_ke_sAWa/ up: i gayA]< s>

[rAma ne kiwAba_ke_sAWa/ up: i pena/ k2: i KarIxA]< s>

[pedZa_ke_Upara/ up pakRI udZa rahA hE]< s>

[rAma_ke_prawi/ up mohana ko SraxXA hE]< s>

sdr : sAxqSya

Example - [puwra piwA jEsA/ sdr hE]< s>

6 : RaRTI

Examples - [sammAna_kA/6 BAva]

[puswaka_kI/6 kImawa]

[pATaka_kI/6 kraya Sakwi]

k1 udj : uxxeSya containing a 'jo' (relative) clause

Example - [[[jisane kAma kiyA hE] vaha]/ k1 udj rAma hE]< s>

k1 vidj : viXeya containing a 'jo' (relative) clause

Example - [[jisane kAma kiyA hE] vaha rAma/ k1 vidj hE]< s>

jovo : Relative clause modifiers

Example - [[jisane kAma kiyA hE]/ jovo vaha rAma hE]< s>

adj : viSeRaNa

Example - [kiwAba mEMne xeKI nIlI_sI/ adj]< s>

Krv : kriyAviSeRaNa

Examples - barAbara

wejZI_se

halke_se

?? : UNABLE TO DECIDE ***

Example - [PalawaH/?? vaha asaPala ho gayA]< s>

5.2. TAGSET-2 (for nodes) Marked '::'

v : kriyA

Kr : kqxanwa

Examples - [mIrA ke Awe_hI:: Kr mohana calA gayA]< s>

[KAnA KAkara:: Kr rAma ne pAnI piyA]< s>

vH : hE

Example - rAma aXyApaka hE

nr : nirXAraNa (in superlatives) ***

Example - [sabase:: nr [mahawvapUrNa praSna]]

vibh : viBakwa

Examples - [rAma_se_jyAxA/ vibh mohana buxXimAna hE]< s>

[mohana rAma_se_kama/ vibh bAwa karawA hE]< s>

qs : praSnavAcaka

Examples - [kOna/ qs AyA hE?]< s>

[rAma kyA/ qs KA rahA hE]< s>

inj : interjections

Examples - are!

bApa re!

yo : yojaka

Example - rAma Ora/ yo SyAma

yok2 : vAkyakarma yojaka ***

Example - [rAma ne kahA ki/ yok2 vaha nahIM A pAegA]< s>

i : co-indexed
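For quick reference while annotating, the two tagsets can be kept in a small lookup table. The Python sketch below lists only a subset of the tags given above (the items still open for discussion are left out), and the names TAGSET_1, TAGSET_2 and lookup are illustrative assumptions rather than part of any AnnCorra software. The lookup goes by tag name alone, since a few of the examples above write a TAGSET-2 tag after '/'.

```python
# Subset of TAGSET-1 (relationship labels, marked '/').
TAGSET_1 = {
    "s": "Sentence",
    "k1": "karwA",
    "k2": "karma",
    "k3": "karaNa",
    "k4": "sampraxAna",
    "k5": "apAxAna",
    "h": "hewu",
    "t": "wAxarWya",
    "k7.1": "kAlAXikaraNa",
    "k7.2": "xeSAXikaraNa",
    "up": "upapaxa",
    "sdr": "sAxqSya",
    "6": "RaRTI",
    "adj": "viSeRaNa",
    "Krv": "kriyAviSeRaNa",
}

# Subset of TAGSET-2 (node types, marked '::').
TAGSET_2 = {
    "v": "kriyA",
    "Kr": "kqxanwa",
    "vH": "hE",
    "vibh": "viBakwa",
    "qs": "praSnavAcaka",
    "inj": "interjections",
    "yo": "yojaka",
}


def lookup(tag):
    """Return (tagset name, gloss) for a known tag, or None otherwise."""
    if tag in TAGSET_1:
        return ("TAGSET-1", TAGSET_1[tag])
    if tag in TAGSET_2:
        return ("TAGSET-2", TAGSET_2[tag])
    return None
```

A tag that returns None is either one of the omitted items or a typing error, and is worth a second look before the entry is submitted.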

6. DOs AND DON'Ts

6.1. DOs # CORRECTIONS

a) In case auxiliary verbs, inflectional suffixes, vibhakti

etc. are written with spaces in between (as in Hindi),

replace the spaces with underscores. Example - 'jA rahA hE' should

be conjoined by underscores; thus the final entry would

be 'jA_rahA_hE'. Similarly, in Hindi, the vibhakti should

be conjoined by an underscore with the preceding noun, e.g.,

'rAma ne' should be marked 'rAma_ne'.

b) In case of Hindi, some people use the convention of

attaching vibhakti to nouns. Example 'ladZakene'. For the

sake of uniformity, please insert a '_' between the noun

and its vibhakti. Therefore, 'ladZakene' should be changed to

'ladZake_ne'.

But please NOTE that vibhakti after a pronoun SHOULD NOT be

changed.

For example - 'usane' remains 'usane'. DO NOT make it

'usa_ne'.

c) Correct errors relating to missing spaces between words.

For example - 'sahIsaMKyA' should be corrected as 'sahI saMKyA'.

d) Emphatic markers such as 'hI', 'wo', 'BI' etc. in Hindi

should be included within the brackets of the preceding

head and should be attached with an underscore.

For example -

[badZe ladZakoM ne hI kiwAba KarIxI] should be marked as

[[badZe ladZakoM_ne_hI]/ k1 kiwAba/ k2 KarIxI:: v]< s>

e) In case you are not sure about the tag that a particular

constituent should take, mark it '/??'.

Example - PalawaH/?? vaha asaPala ho gayA
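Part of DOs a) and d) is mechanical and could be pre-applied by a small script before manual tagging. The Python sketch below joins a vibhakti or emphatic marker to the preceding word with an underscore. The two word lists are assumptions made for this illustration and are far from complete, auxiliary verbs are not handled, and words such as 'wo' or 'kI' can have other functions, so the output still needs to be checked by the annotator. The name join_markers is hypothetical.

```python
# Illustrative, incomplete lists (assumptions, not from the guidelines).
VIBHAKTI = {"ne", "ko", "se", "meM", "para", "kA", "kI", "ke"}
EMPHATIC = {"hI", "BI", "wo"}
MARKERS = VIBHAKTI | EMPHATIC


def join_markers(sentence):
    """Attach each vibhakti or emphatic marker to the word before it."""
    out = []
    for word in sentence.split():
        if out and word in MARKERS:
            out[-1] += "_" + word      # e.g. 'rAma' + 'ne' -> 'rAma_ne'
        else:
            out.append(word)
    return " ".join(out)
```

Applied to 'badZe ladZakoM ne hI kiwAba KarIxI', this yields 'badZe ladZakoM_ne_hI kiwAba KarIxI', which matches the bracketed form shown in d).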

6.2. DON'Ts

Do not correct anything in the sentence that you consider

incorrect, except as mentioned in 6.1. Write any corrections or

suggestions that you think should be incorporated

in the COMMENT field provided after each sentence. (The only

corrections permitted in the sentence itself are those listed

in 6.1.)

7. SAMPLE INPUT

Sentences as extracted from the corpus and given to you.

---------------------------------------------------------------

SENT:: prakASaka vyavasAya meM sabase mahawvapUrNa praSna puswakoM kI bikrI

hEMê

COMMENT::

XXXXXXXXXXXXXXXXXXXXXX

SENT:: paTana ruci ( rIdiMga hEbita) kA wo aBAva nahIM WA kinwu

SreRTa puswakoM kA prakASana svalpa mAwrA meM howA WAê

COMMENT::

XXXXXXXXXXXXXXXXXXXXXXXX

SENT:: muxriwa puswakoMkI sahIsaMKyA kI jAnakArI wo sWApiwa prakASaka BI

nahIM xeweê

COMMENT::

XXXXXXXXXXXXXXXXXXXXXXXXXXX

SENT:: Palawa: Ese PasalIprakASakoM se leKaka apanI rAyaltI se vaMciwa raha jAwe

hEMê

COMMENT::

XXXXXXXXXXXXXXXXXXXXXXXXXX

--------------------------------------------------
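The input layout above (a SENT:: line that may wrap, a COMMENT:: line, and a row of X characters closing each record) is regular enough to be read by a short script. The Python sketch below is one possible reader; the function name parse_records and the dictionary keys are assumptions made for illustration.

```python
def parse_records(text):
    """Split corpus text into records with 'sent' and 'comment' fields.
    A row consisting only of 'X' ends a record; rows of '-' and blank
    lines are skipped; other lines continue the current field."""
    records = []
    current = None      # record being filled, or None between records
    field = None        # "sent" or "comment"
    for raw in text.splitlines():
        line = raw.strip()
        if not line or set(line) == {"-"}:
            continue
        if set(line) == {"X"}:
            if current is not None:
                records.append(current)
            current, field = None, None
        elif line.startswith("SENT::"):
            current = {"sent": line[len("SENT::"):].strip(), "comment": ""}
            field = "sent"
        elif line.startswith("COMMENT::") and current is not None:
            current["comment"] = line[len("COMMENT::"):].strip()
            field = "comment"
        elif current is not None and field is not None:
            current[field] = (current[field] + " " + line).strip()
    if current is not None:
        records.append(current)     # last record without a closing row
    return records
```

Each returned record keeps the sentence on one line, so the wrapped continuation of a long SENT:: entry is joined back to its first line.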

8. SAMPLE ENTRY

( you will mark tags as shown here)

-------------------------------------------------

SENT:: [[prakASaka vyavasAya_meM]/ k7.3 [sabase:: nr mahawvapUrNa praSna]/ k1 ud

[puswakoM_kI/6 bikrI]/ k1 vid hEM:: vHê]< s>

COMMENT:: Verb 'hEM' is wrongly spelled. It should be 'hE'.

XXXXXXXXXXXXXXXXXXXXXX

SENT:: [[[paTana ruci (rIdiMga- hEbita)_kA_wo] aBAva]/ k1 [nahIM WA]/ VH

kinwu/ yo [[SreRTa puswakoM_kA]/6 prakASana]/ k1

[svalpa mAwrA_meM]/ k7.3 [howA WAê]:: v]< s>

COMMENT::

XXXXXXXXXXXXXXXXXXXXXXXX

SENT:: [[[muxriwa puswakoM_kI]/6: i [sahI saMKyA kI]/6: i jAnakArI]/ k2 wo

[sWApiwa prakASaka_BI]/ k1 [nahIM xeweê]:: v]< s>

COMMENT::

XXXXXXXXXXXXXXXXXXXXXXXXXXX

SENT:: ***Palawa:/ h [Ese PasalI prakASakoM_se]/ h leKaka/ k1 [apanI rAyaltI_se]/ k5

vaMciwa_raha_jAwe_hEMê:: v

COMMENT::

*** OPEN FOR DISCUSSION

XXXXXXXXXXXXXXXXXXXXXXXXXX
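Since tagged entries nest square brackets several levels deep, a quick balance check before submitting an entry can catch a common typing slip. The Python sketch below is an illustration only; the function name bracket_problems is hypothetical, and the check says nothing about whether the tags themselves are right.

```python
def bracket_problems(entry):
    """Return a list of messages about unbalanced square brackets."""
    problems = []
    depth = 0                       # number of currently open '['
    for pos, ch in enumerate(entry):
        if ch == "[":
            depth += 1
        elif ch == "]":
            if depth == 0:
                problems.append(f"unmatched ']' at position {pos}")
            else:
                depth -= 1
    if depth:
        problems.append(f"{depth} unclosed '['")
    return problems
```

An empty list means the brackets pair up; any message points to an entry that should be re-read against the original sentence.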