OUTLINE
1. BACKGROUND
2. GUIDELINES
2.1. AN ILLUSTRATION
2.1.1. WHAT TO MARK
2.1.2. HOW TO MARK
3. DEFAULT CONVENTIONS
4. GRAMMATICAL MODEL FOLLOWED
5. TAGSETS
5.1. TAGSET-1
5.2. TAGSET-2
6. DOs AND DON'Ts
6.1. DOs (CORRECTIONS)
6.2. DON'Ts
7. SAMPLE INPUT
8. SAMPLE ENTRY
1. BACKGROUND
AnnCorra is a project to develop Lexical Resources for Indian
Languages (LERIL). It was taken up at the
"Workshop on Lexical Resources for Natural Language Processing",
5-8 Jan 2001, held at IIIT Hyderabad.
The name AnnCorra, short for "Annotated Corpora",
denotes an electronic lexical resource of annotated corpora.
The purpose of this effort is to fill the gap in such
resources for Indian languages. It will be an important resource
for the development of Indian language parsers, machine learning of
grammars, lakshancharts (discrimination nets for sense disambiguation)
and a host of other tools.
This is a project of LERIL (Lexical Resources for Indian
Languages), an open-source, collaborative initiative of several groups
(and individuals) to create shareable resources for Indian languages.
Another project, TransLexGram, is already underway.
The AnnCorra effort builds on electronic resources that are freely
available for various Indian languages. One such resource
is the English-Hindi Electronic Dictionary, developed through a voluntary
collaborative effort coordinated by the Language Technologies Research
Centre, Indian Institute of Information Technology, Hyderabad. Another
resource is an electronic corpus of Hindi developed by the Ministry of
Information Technology, Government of India.
Like TransLexGram, the present task is a collaborative effort
among individuals and institutions. The resources so developed will be
available as a "free" resource under the GPL (General Public License).
The effort is overseen by a steering committee, which is coordinated
by the natural language technology team at NCST, Mumbai. If you wish to
join the effort, send an email to <leril@ncst.ernet.in>.
2. GUIDELINES
The effort requires you to do the following:
1. Analyse the sentences.
2. Mark the tags expressing the analysis (a tagset is provided).
3. If a sentence is ambiguous in general but has a single
meaning in the given context, mark only that meaning.
The task can be better understood by looking at some examples.
2.1. AN ILLUSTRATION
Here is a sentence from Hindi:
0:: rAma ne mohana ko nIlI kiwAba xI
Tree-1 represents the verb-argument relationships among
the constituents of this sentence -
                    xI
        -------------------------
        |           |           |
      k1|         k4|         k2|
        |           |           |
     rAma_ne    mohana_ko    kiwAba
                                |
                                | adj
                                |
                               nIlI
                  Tree-1
2.1.1. WHAT TO MARK
Since the objective of tagging is to mark explicitly the relationships
between the various components of a sentence, verbs and their
arguments have to be marked. If needed, some relationships between
nouns and other grammatical categories, such as adjectives, are also
to be marked.
2.1.2. HOW TO MARK
The information represented in the tree above can also be represented
in a linear fashion. The tag for each branch is written after
the constituent it refers to.
- First, enclose the elements forming a certain relationship
in square brackets ([]):
rAma_ne mohana_ko [nIlI kiwAba] xI
- Next, add the appropriate tag markers:
rAma_ne/ mohana_ko/ [nIlI kiwAba]/ xI::
NOTE - The symbol '/' denotes an ARC tag and '::' denotes a NODE
tag (explained in greater detail under TAGSETS).
- Then type in the required tag name:
rAma_ne/ k1 mohana_ko/ k4 [nIlI kiwAba]/ k2 xI:: v
The following tags (most of which are based on the Paninian grammatical
model) have been used above (a more comprehensive list of tags is given
under TAGSETS):
k1 : karwA
k2 : karma
k4 : sampraxAna
v : kriyA
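The linear notation can be read mechanically. The following is a minimal sketch in Python, not part of the AnnCorra guidelines or tools, that splits a fully tagged flat sentence into (constituent, marker, tag) triples. The name parse_flat and the regular expression are illustrative assumptions. Whitespace after a marker is tolerated, as in the examples above; nested brackets and any further markings are not handled.

```python
import re

# Assumed token shape: a constituent (one bracketed group or one word),
# optionally followed directly by '/' (ARC tag) or '::' (NODE tag)
# and a tag name. Tag names may contain letters, digits, '.' and '?'.
TOKEN = re.compile(r"(\[[^\[\]]*\]|[^\s\[\]/:]+)(?:(/|::)\s*([\w.?]+))?")

def parse_flat(annotated):
    """Return (constituent, marker, tag) triples for a flat tagged sentence.

    Untagged constituents come back with marker and tag set to None.
    """
    return [m.groups() for m in TOKEN.finditer(annotated)]

for triple in parse_flat("rAma_ne/ k1 mohana_ko/ k4 [nIlI kiwAba]/ k2 xI:: v"):
    print(triple)
```

Keeping the marker in the triple preserves the distinction between ARC tags and NODE tags, which later checks can rely on.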
The idea here is to mark only the specific grammatical information.
Relationships covered by the DEFAULT CONVENTIONS are left unmarked. For
example, the adjective 'nIlI' of 'kiwAba' has been left unmarked in the
above example.
The following DEFAULT CONVENTIONS save unnecessary typing effort.
3. DEFAULT CONVENTIONS
1) Within brackets, the rightmost element is the Head.
Example - in the constituent [nIlI kiwAba], the noun 'kiwAba'
is the Head.
2) If the noun is followed by a postposition (vibhakti etc.),
the postposition should be included in the brackets, and the unit
noun_vibhakti remains the Head.
Example - [nIlI kiwAba_meM]
The noun 'kiwAba' is followed by the postposition 'meM';
the Head in this case is 'kiwAba_meM' and not 'meM' alone.
3) If there is more than one element within the brackets,
then by default all the others are taken as modifiers of the
Head.
Example - [merI nIlI kiwAba]: both 'merI' and 'nIlI' modify
the Head 'kiwAba'.
4) If there are more than two elements within the brackets
(Head plus two) and one or more of them do not modify the Head,
then this should be marked.
Example - [halkI nIlI kiwAba]
Here, 'halkI' can qualify either 'nIlI' or 'kiwAba'. If it
modifies 'kiwAba' (light in weight), it should
be left unmarked. But if it modifies 'nIlI' (light in shade),
it SHOULD be marked. Mark this by adding '>' to the right of 'halkI':
[halkI> nIlI kiwAba]
The symbol '>' indicates that the element immediately to its right is
modified.
5) Karakas attach to the nearest verb on the right (inflected kriya
or kridanta).
[rAma_ne/ k1 KAnA/ k2 KAkara:: Kr pAnI/ k2 piyA:: v]< s>
There are two 'k2's in the above example and two verbs (Kr and
v). By default, therefore, the first 'KAnA/ k2' attaches to
'KAkara:: Kr' (the nearest verb) and 'pAnI/ k2' to 'piyA:: v'.
6) Karta karaka (k1) has a special default rule. If there is only one
'k1' and more than one verb, then by default
'k1' attaches to the main verb. Example -
[rAma_ne/ k1 KAnA/ k2 KAkara:: Kr pAnI/ k2 piyA:: v]< s>
Semantically, the element 'rAma_ne/ k1' is the agent of
both 'KAkara' and 'piyA'. But since agreement and its vibhakti are
controlled by the second verb,
it is considered to attach to the main verb 'piyA'.
Examples - [rAma KAnA KAkara pAnI pIwA hE]< s>
[sIwA KAnA KAkara pAnI pIwI hE]< s>
7) A kridanta attaches to the immediately succeeding noun or verb
(depending on the type of kridanta).
For example, in Hindi the 'kar-kridanta' attaches to another
verb, e.g. [rAma_ne/ k1 KAnA/ k2 KAkara:: Kr: i pAnI/ k2 piyA:: v: i]< s>
Here ': i' indicates that 'KAkara' attaches to the other
element marked ': i', namely 'piyA:: v: i'.
But since this is a default, it need not be marked. So the entry would
be [rAma_ne/ k1 KAnA/ k2 KAkara:: Kr pAnI/ k2 piyA:: v]< s>
However, the participle form 'wA_huA' in Hindi can modify either
a noun or a verb. For example, take the Hindi sentence -
[mEMne/ k1 xOdZawe_hue:: Kr GodZe_ko/ k2 xeKA:: v]< s>
This sentence is ambiguous between two senses:
a) [mEMne [xOdZawe hue]: i GodZe: i ko xeKA]< s> ; as in
[mEMne xOdZawA_huA GodZA xeKA]< s>
b) [mEMne [xOdZawe hue]: i GodZe ko xeKA: i]< s> ; as in
[mEMne xOdZawe_hue GodZA xeKA]< s>
As before, the symbol ': i' in sentences a) and b) indicates the
element to which the participle form 'xOdZawe_hue' attaches.
In a) the meaning is 'I saw a horse while the horse was running',
and in b) the sense is 'While I was running I saw the horse'.
By default, therefore, a) will not be marked but b) WILL BE MARKED.
Please NOTE that in such sentences (which are ambiguous in isolation),
the user should judge the correct meaning in the given context and mark
appropriately.
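Default conventions 1, 3, 4, 5 and 6 can be read as simple procedures. The sketch below is one possible reading in Python, offered under stated assumptions rather than as an official implementation. The function names are hypothetical. The input to attach_karakas is assumed to be a list of (word, marker, tag) triples in sentence order; only ARC tags whose names begin with 'k' are treated as karakas, and the node tagged 'v' is taken to be the main verb. Convention 7 and co-indexing with ': i' are not covered.

```python
def head_and_modifiers(chunk):
    """Split a flat bracketed chunk such as '[halkI> nIlI kiwAba]'.

    Returns (head, {modifier: element it modifies}).
    """
    words = chunk.strip("[]").split()
    head = words[-1]                      # convention 1: rightmost element is the Head
    modifies = {}
    for i, word in enumerate(words[:-1]):
        if word.endswith(">"):            # convention 4: '>' points to the next element
            modifies[word[:-1]] = words[i + 1].rstrip(">")
        else:                             # convention 3: by default, modify the Head
            modifies[word] = head
    return head, modifies

def attach_karakas(items):
    """Map each (karaka word, tag) to the verb it attaches to by default."""
    verbs = [(i, w, t) for i, (w, m, t) in enumerate(items)
             if m == "::" and t in ("v", "Kr")]
    main_verbs = [w for _, w, t in verbs if t == "v"]
    k1_count = sum(1 for _, m, t in items if m == "/" and t == "k1")
    attachment = {}
    for i, (word, marker, tag) in enumerate(items):
        if marker != "/" or not tag.startswith("k"):
            continue                      # not a karaka
        if tag == "k1" and k1_count == 1 and main_verbs:
            attachment[(word, tag)] = main_verbs[-1]       # convention 6
        else:
            to_the_right = [vw for vi, vw, _ in verbs if vi > i]
            if to_the_right:
                attachment[(word, tag)] = to_the_right[0]  # convention 5
    return attachment

print(head_and_modifiers("[halkI> nIlI kiwAba]"))
print(attach_karakas([
    ("rAma_ne", "/", "k1"), ("KAnA", "/", "k2"), ("KAkara", "::", "Kr"),
    ("pAnI", "/", "k2"), ("piyA", "::", "v"),
]))
```

Under these assumptions, the example sentence attaches 'rAma_ne' and 'pAnI' to 'piyA' and 'KAnA' to 'KAkara', which matches the defaults described above.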
4. GRAMMATICAL MODEL FOLLOWED
The Paninian grammatical model has been chosen here for sentence
analysis, and hence for the tag names as well. The preference for this
model is based on the following factors -
1) Being based on the analysis of an Indian language, it
can deal better with the types of constructions Indian
languages have. It is therefore more appropriate for Indian
language analysis.
2) The model not only offers a mechanism for SYNTACTIC analysis
but also incorporates SEMANTIC information (nowadays
called dependency analysis), thus making the relationships
more transparent.
5. TAGSETS
The tagsets used here are divided into two categories -
1) TAGSET-1 - Tags which express relationships are marked by
a preceding '/'.
For example, kArakas are grammatical relationships;
thus they are marked '/ k1', '/ k2', '/ k3' etc.
2) TAGSET-2 - Tags expressing the type of a node are marked by
a preceding '::'.
Verbs etc. are nodes, so they are marked ':: v' etc.
NOTE : a) Items marked '***' in the lists below are OPEN FOR DISCUSSION.
b) More tags can be added as and when the need arises.
5.1. TAGSET-1 (Expressing relationship labels) Marked '/'
s : Sentence
Example - [rAma ne KIra KAyI]< s>
k1 : karwA
Example - [rAma_ne/ k1 KIra KAyI]< s>
k2 : karma
Example - [rAma_ne KIra/ k2 KAyI]< s>
k3 : karaNa
Example - [rAma_ne cammaca_se/ k3 KIra KAyI]< s>
k4 : sampraxAna
Example - [rAma_ne mohana_ko/ k4 KIra xI]< s>
k5 : apAxAna
Example - [rAma_ne katorI_se/ k5 cammaca_se KIra KAyI]< s>
h : hewu
Example - [mohana [vyavasAyika lakRya_se]/ h kAma karawA hE]< s>
t : wAxarWya
Example - [mohana paDZane_ke_liye/ t skUla jAwA hE]< s>
k7.1 : kAlAXikaraNa
Examples - [kala/ k7.1 pAnI barasA]< s>
[[usa jZamAne_meM]/ k7.1 mazhagAI kama WI]< s>
[bacapana_meM/ k7.1 vaha bahuwa SEwAna WA]< s>
[pahale/ k7.1 rAma AyA]< s>
k7.2 : xeSAXikaraNa
Examples - [mejZa_para/ k7.2 kiwAba hE]< s>
[havA_meM/ k7.2 TaMdaka hE]< s>
k7.3 : viRayAXikaraNa (or other?) ***
Examples - [bahuwa se yuvA [isa svawanwrawA saMgrAma_meM]/ k7.3
hissA liyA]< s>
[unhoMne apane SiRya ko ASrama kI sevAoM se
[mukwa karane_meM]/ k7.3 saMkoca nahIM kiyA i.]< s>
k1 ud : karwA-uxxeSya ***
Example - [XaniyA/ k1 ud iwanI vyavahArakuSala na WI]< s>
k1 vid : karwA-viXeya ***
Example - [XaniyA [iwanI vyavahArakuSala]/ k1 vid na WI]< s>
k2 ud : karma-uxxeSya ***
Example - [rAma mohana_ko/ k2 ud buxXimAna mAnawA hE]< s>
k2 vid : karma-viXeya ***
Example - [rAma mohana_ko buxXimAna/ k2 vid mAnawA hE]< s>
Vjt : jyoM-wyoM/jaba-waba samAnakAlikawva sambanXa
Example - [jyoM-jyoM puswaka kI kImawa baDZa_rahI_hE/ Vjt: i
wyoM-wyoM pATaka kI kraya Sakwi Gata_rahI_hE/ Vjt: i]< s>
up : upapaxa
Examples - [rAma/ k1: i mohana_ke_sAWa/ up: i gayA]< s>
[rAma ne kiwAba_ke_sAWa/ up: i pena/ k2: i KarIxA]< s>
[pedZa_ke_Upara/ up pakRI udZa rahA hE]< s>
[rAma_ke_prawi/ up mohana ko SraxXA hE]< s>
sdr : sAxqSya
Example - [puwra piwA jEsA/ sdr hE]< s>
6 : RaRTI
Examples - [sammAna_kA/6 BAva]
[puswaka_kI/6 kImawa]
[pATaka_kI/6 kraya Sakwi]
k1 udj : uxxeSya containing a 'jo' (relative) clause
Example - [[[jisane kAma kiyA hE] vaha]/ k1 udj rAma hE]< s>
k1 vidj : viXeya containing a 'jo' (relative) clause
Example - [[jisane kAma kiyA hE] vaha rAma/ k1 vidj hE]< s>
jovo : Relative clause modifiers
Example - [[jisane kAma kiyA hE]/ jovo vaha rAma hE]< s>
adj : viSeRaNa
Example - [kiwAba mEMne xeKI nIlI_sI/ adj]< s>
Krv : kriyAviSeRaNa
Examples - barAbara
wejZI_se
halke_se
?? : UNABLE TO DECIDE ***
Example - [PalawaH/?? vaha asaPala ho gayA]< s>
5.2. TAGSET-2 (for nodes) Marked '::'
v : kriyA
Kr : kqxanwa
Examples - [mIrA ke Awe_hI:: Kr mohana calA gayA]< s>
[KAnA KAkara:: Kr rAma ne pAnI piyA]< s>
vH : hE
Example - rAma aXyApaka hE
nr : nirXAraNa (in superlatives) ***
Example - [sabase:: nr [mahawvapUrNa praSna]]
vibh : viBakwa
Examples - [rAma_se_jyAxA/ vibh mohana buxXimAna hE]< s>
[mohana rAma_se_kama/ vibh bAwa karawA hE]< s>
qs : praSnavAcaka
Examples - [kOna/ qs AyA hE?]< s>
[rAma kyA/ qs KA rahA hE]< s>
inj : interjections
Examples - are!
bApa re!
yo : yojaka
Example - rAma Ora/ yo SyAma
yok2 : vAkyakarma yojaka ***
Example - [rAma ne kahA ki/ yok2 vaha nahIM A pAegA]< s>
i : co-indexed
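For checking entries, the two tagsets can be kept in lookup tables keyed by tag name. The sketch below, in Python, is an illustration only: the tables hold a partial selection of the tags listed above, and the names ARC_TAGS, NODE_TAGS and gloss are assumptions, not part of the guidelines.

```python
# Partial tables; extend them as tags are added or settled.
ARC_TAGS = {   # TAGSET-1, written after '/'
    "k1": "karwA", "k2": "karma", "k3": "karaNa", "k4": "sampraxAna",
    "k5": "apAxAna", "h": "hewu", "t": "wAxarWya",
    "k7.1": "kAlAXikaraNa", "k7.2": "xeSAXikaraNa",
    "up": "upapaxa", "6": "RaRTI", "adj": "viSeRaNa",
}
NODE_TAGS = {  # TAGSET-2, written after '::'
    "v": "kriyA", "Kr": "kqxanwa", "vH": "hE",
}

def gloss(marker, tag):
    """Return the name of a tag, or None if it is not known for that marker."""
    table = {"/": ARC_TAGS, "::": NODE_TAGS}.get(marker, {})
    return table.get(tag)

print(gloss("/", "k4"))    # a relationship tag
print(gloss("::", "v"))    # a node tag
print(gloss("/", "v"))     # 'v' is not a relationship tag, so None
```

A None result can be used to flag a tag typed with the wrong marker, or a tag that is not yet in the tables.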
6. DOs AND DON'Ts
6.1. DOs (CORRECTIONS)
a) If auxiliary verbs, inflectional suffixes, vibhakti
etc. are written with spaces in between (as in Hindi),
replace the spaces with underscores. Example - 'jA rahA hE' should
be joined by underscores, so the final entry would
be 'jA_rahA_hE'. Similarly, in Hindi, the vibhakti should
be joined by an underscore to the preceding noun, e.g.
'rAma ne' should be marked 'rAma_ne'.
b) In Hindi, some people follow the convention of
attaching the vibhakti directly to the noun, e.g. 'ladZakene'. For the
sake of uniformity, please insert a '_' between the noun
and its vibhakti. Therefore, 'ladZakene' should be changed to
'ladZake_ne'.
But please NOTE that a vibhakti after a pronoun SHOULD NOT be
changed.
For example - 'usane' remains 'usane'. DO NOT make it
'usa_ne'.
c) Correct errors relating to missing spaces between words.
For example - 'sahIsaMKyA' should be corrected to 'sahI saMKyA'.
d) Emphatic markers such as 'hI', 'wo', 'BI' etc. in Hindi
should be included within the brackets of the preceding
head and attached with an underscore.
For example -
[badZe ladZakoM ne hI kiwAba KarIxI] should be marked as
[[badZe ladZakoM_ne_hI]/ k1 kiwAba/ k2 KarIxI:: v]< s>
e) If you are not sure which tag a particular
constituent should take, mark it '/??'.
Example - PalawaH/?? vaha asaPala ho gayA
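Corrections a) and d) can be suggested mechanically before a human reviews them. The following Python sketch is a rough first pass under loud assumptions: the word lists VIBHAKTI and EMPHATIC are small, hypothetical and incomplete, and the function join_markers cannot tell when a listed word is being used in another role, so its output always needs checking by the annotator. It never splits a word, so pronoun forms such as 'usane' are left alone.

```python
# Hypothetical, incomplete word lists (WX notation), for illustration only.
VIBHAKTI = {"ne", "ko", "se", "meM", "para", "kA", "kI", "ke"}
EMPHATIC = {"hI", "wo", "BI"}
JOINABLE = VIBHAKTI | EMPHATIC

def join_markers(sentence):
    """Attach each listed vibhakti or emphatic marker to the preceding word."""
    out = []
    for word in sentence.split():
        if out and word in JOINABLE:
            out[-1] += "_" + word     # e.g. 'rAma' + 'ne' becomes 'rAma_ne'
        else:
            out.append(word)
    return " ".join(out)

print(join_markers("rAma ne mohana ko nIlI kiwAba xI"))
print(join_markers("badZe ladZakoM ne hI kiwAba KarIxI"))
```

Auxiliary verb sequences such as 'jA rahA hE' are deliberately left out of this sketch, since joining them requires knowing which words form one verb group.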
6.2. DON'Ts
Do not change anything in the sentence that you consider
incorrect, except for the cases mentioned in 6.1. Write any corrections or
suggestions that you think should be incorporated
in the COMMENT field provided after each sentence. (The only
corrections permitted in the sentence itself are those listed
in 6.1.)
7. SAMPLE INPUT
The sentences below are shown as extracted from the corpus and given to you.
---------------------------------------------------------------
SENT:: prakASaka vyavasAya meM sabase mahawvapUrNa praSna puswakoM kI bikrI
hEM
COMMENT::
XXXXXXXXXXXXXXXXXXXXXX
SENT:: paTana ruci ( rIdiMga hEbita) kA wo aBAva nahIM WA kinwu
SreRTa puswakoM kA prakASana svalpa mAwrA meM howA WA
COMMENT::
XXXXXXXXXXXXXXXXXXXXXXXX
SENT:: muxriwa puswakoMkI sahIsaMKyA kI jAnakArI wo sWApiwa prakASaka BI
nahIM xewe
COMMENT::
XXXXXXXXXXXXXXXXXXXXXXXXXXX
SENT:: Palawa: Ese PasalIprakASakoM se leKaka apanI rAyaltI se vaMciwa raha jAwe
hEM
COMMENT::
XXXXXXXXXXXXXXXXXXXXXXXXXX
--------------------------------------------------
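The input layout shown above (a SENT:: line that may wrap, a COMMENT:: line, and a separator line of X's) can be read into records with a few lines of Python. This is a sketch that assumes the layout is exactly as in the sample; the function name read_records and the dictionary keys are illustrative choices, not part of any AnnCorra tool.

```python
def read_records(text):
    """Return a list of {'sent': ..., 'comment': ...} dictionaries."""
    records, current, field = [], None, None
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and separator lines (all 'X' or all '-') end a field.
        if not line or set(line) <= {"X"} or set(line) <= {"-"}:
            field = None
            continue
        if line.startswith("SENT::"):
            current = {"sent": line[len("SENT::"):].strip(), "comment": ""}
            records.append(current)
            field = "sent"
        elif line.startswith("COMMENT::") and current is not None:
            current["comment"] = line[len("COMMENT::"):].strip()
            field = "comment"
        elif current is not None and field:
            # A wrapped line continues the field that is currently open.
            current[field] = (current[field] + " " + line).strip()
    return records

sample = """SENT:: muxriwa puswakoMkI sahIsaMKyA kI jAnakArI wo sWApiwa prakASaka BI
nahIM xewe
COMMENT::
XXXXXXXXXXXXXXXXXXXXXXXXXXX
"""
for record in read_records(sample):
    print(record)
```

Wrapped lines are rejoined with a single space, so each sentence is returned as one string regardless of where the corpus file broke it.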
8. SAMPLE ENTRY
(You will mark tags as shown here.)
-------------------------------------------------
SENT:: [[prakASaka vyavasAya_meM]/ k7.3 [sabase:: nr mahawvapUrNa praSna]/ k1 ud
[puswakoM_kI/6 bikrI]/ k1 vid hEM:: vH]< s>
COMMENT:: Verb 'hEM' is wrongly spelled. It should be 'hE'.
XXXXXXXXXXXXXXXXXXXXXX
SENT:: [[[paTana ruci (rIdiMga- hEbita)_kA_wo] aBAva]/ k1 [nahIM WA]/ VH
kinwu/ yo [[SreRTa puswakoM_kA]/6 prakASana]/ k1
[svalpa mAwrA_meM]/ k7.3 [howA WA]:: v]< s>
COMMENT::
XXXXXXXXXXXXXXXXXXXXXXXX
SENT:: [[[muxriwa puswakoM_kI]/6: i [sahI saMKyA_kI]/6: i jAnakArI]/ k2 wo
[sWApiwa prakASaka_BI]/ k1 [nahIM xewe]:: v]< s>
COMMENT::
XXXXXXXXXXXXXXXXXXXXXXXXXXX
SENT:: ***Palawa:/ h [Ese PasalI prakASakoM_se]/ h leKaka/ k1 [apanI rAyaltI_se]/ k5
vaMciwa_raha_jAwe_hEM:: v
COMMENT::
*** OPEN FOR DISCUSSION
XXXXXXXXXXXXXXXXXXXXXXXXXX
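Since entries like those above nest brackets several levels deep, a simple balance check can catch typing slips before an entry is submitted. The Python sketch below is an assumption-laden helper, not part of the guidelines: the name check_entry is hypothetical, and it checks only that square brackets are balanced, not that the tags are valid.

```python
def check_entry(entry):
    """Return a list of bracket problems found in one annotated sentence."""
    problems = []
    depth = 0
    for position, char in enumerate(entry):
        if char == "[":
            depth += 1
        elif char == "]":
            depth -= 1
            if depth < 0:
                problems.append(f"unmatched ']' at position {position}")
                depth = 0         # recover and keep scanning
    if depth > 0:
        problems.append(f"{depth} unclosed '['")
    return problems

print(check_entry("[[badZe ladZakoM_ne_hI]/ k1 kiwAba/ k2 KarIxI:: v]< s>"))
print(check_entry("[rAma_ne/ k1 [nIlI kiwAba/ k2 xI:: v]< s>"))
```

An empty list means the brackets are balanced; any message points to a place where a bracket was dropped or typed twice.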