IIT Bombay
IIT Bombay
Coling 2008: Companion volume – Posters and Demonstrations, pages 27–30
Manchester, August 2008
1.2 Organization of the paper

Section 2 discusses CPs in Hindi and the ways to distinguish them from other, similar-looking constructions; its final subsection (2.3) discusses the automatic extraction of CPs from corpora. Section 3 concludes the paper.

2 V+V Complex Predicates in Hindi

We have identified five different types of V+V sequences in Hindi. These are:

1. V1stem+V2: maar Daalnaa (kill-put) ‘kill’.
2. V1inf-e+lagnaa: rone lagnaa (cry-feel) ‘start crying’.
3. V1inf+paRnaa: bolnaa paRaa (say-lie) ‘had to say’.
4. V1inf-e+V2: likhne ko/ke lie kahaa ‘asked to write’.
5. V1–kar+V2: lekar gayaa ‘took and went’.

2.1 Identification of CPs

Following Butt (1993) and Paul (2004), we use the following diagnostic tests to identify CPs in Hindi:

1. Scope of adverbs
2. Scope of negation
3. Nominalization
4. Passivization
5. Causativization
6. Movement
(see Appendix A for an example of these tests)

The tests above have been exhaustively applied to varied data. The results of these tests show that some V+V sequences function as single semantic units and others do not. They also show that the V1stem+V2, V1inf-e+lagnaa and V1inf+paRnaa sequences show similar properties, and that the V1inf-e+V2 and V1–kar+V2 sequences behave similarly. We call these Group 1 and Group 2 respectively.

Group 1 sequences are true CPs in Hindi. These V+V sequences are simple predicates (monoclausal) with one subject. Group 2 constructions are not CPs. They show clausal embedding and each verb behaves as if it were an independent syntactic entity. In the next section we summarize the semantic properties of CPs (Group 1).

2.2 Semantic Properties of V2 in Group 1

After identifying the CPs from among different V+V sequences, the next step was to determine how they are formed. To accomplish this we examined the semantic properties of the second verbs (V2) in Group 1:

(1) V1inf+paRnaa:
Examples include karnaa paRaa ‘do-lie (had to do)’, bolnaa paRaa ‘say-lie (had to say)’ etc. The second verb is always paRnaa ‘to lie (lay)’. It appears in its stem form and bears all the inflections. As V2, paRnaa has the meaning of compulsion/force. paRnaa ‘lie’ as a V2 can be combined with any V1 irrespective of the latter’s semantic properties. Since there are no syntactic or semantic restrictions on the selection of V1, this construction should be treated in the syntax as a combination of a V1 and a modal auxiliary.

(2) V1inf-e+lagnaa:
Examples include karne lagaa ‘do-feel (start to do)’, bolne lagaa ‘say-feel (start to say)’ etc. The V2 in this sequence is always lagnaa ‘feel’ in the bare form and carries all the inflections. The core meaning of lagnaa ‘feel’ is lost when it is combined with a V1. As a V2 it always has the meaning of the beginning or happening of an event. lagnaa ‘feel’ as a V2 can be combined with any V1 irrespective of the latter’s semantic properties. Thus, this is also an instance of a modal auxiliary and should be derived in the syntax.

(3) V1stem+V2:
In the formation of V1stem+V2, the V2 may be any one of ten verbs, as shown in Figure 1.

1. Daalnaa ‘put’
2. lenaa ‘take’
3. denaa ‘give’
4. uThnaa ‘wake’
5. jaanaa ‘go’
6. paRnaa ‘lie’
7. baiThnaa ‘sit’
8. maarnaa ‘kill’
9. dhamaknaa ‘throb’
10. girnaa ‘fall’

Figure 1: The 10 vector verbs

All these V2s also occur as main verbs. As V2, the core meaning of these verbs is lost (bleached), but they acquire some new semantic properties which are otherwise not seen (Abbi, 1991, 1992; Gopalkrishnan and Abbi, 1992). The semantic properties of V2s include finality, definiteness, negative value, manner of the action, attitude of the speaker etc.

The combination of V1 and V2 is subject to the semantic compatibility between the two verbs.
The argument structure of the CP is determined by V1, as is the case-marking on the internal arguments, but the case-marking on the external argument (subject) is determined by both verbs.

From this analysis we conclude that V+V CPs are formed both lexically and syntactically in Hindi. Detailed investigation shows that the V2 in the V1inf-e+lagnaa and V1inf+paRnaa constructions is a type of modal auxiliary whose semantic features are predictable and unvarying. We propose to deal with these verbs in the syntax and call them syntactic compound verbs (SCpdVs). The V2 choice in V1stem+V2 is not predictable, and these CPs function as a single complex of syntactic and semantic features. We call these verbs lexical compound verbs (LCpdVs) and propose to include them in the lexical knowledge base. In the next section we provide a heuristic for automatic extraction of LCpdVs for storage in the lexicon.

2.3 The Extraction Process

By scanning the corpus, V1stem+V2 sequences were found using the heuristic H* specified in Figure 2.

    (Heuristic H*)
    If a verb V1 is in the stem form and is followed by a verb V2
    from a pre-stored list of verbs that can form the second
    component of the CP (section 2.2, Figure 1), i.e., the ‘vector’,
    then this verb along with the V2 is taken to be an instance of
    an LCpdV.

Figure 2: Main heuristic for identifying LCpdVs

Ten native speakers of Hindi were consulted. They were asked to construct sentences with the extracted sequences. If they were able to do so, that sequence was registered as a true LCpdV.

The precision of the heuristic is calculated as the ratio of the number of actual LCpdVs confirmed through manual validation to the total number of candidate LCpdVs identified by the heuristic.

The results of these calculations are shown in Table 1, with a precision rate of 70% for the BBC corpus and 79% for the CIIL one.

Corpus  Total detections  POS ambiguities  Passive forms  LCpdVs (manually detected)  Precision
BBC     40                8                4              28                          0.7 (28/40)
CIIL    174               32               7              135                         0.79 (135/174)

Table 1: Precision of LCpdV extraction

The loss in precision was caused by (i) part-of-speech ambiguity, (ii) passivisation and (iii) idiomatic usages. For lack of space, we do not discuss these here.

When measures were taken to remedy these errors, we reached an accuracy of close to 98% (see Table 2).

                                                  BBC              CIIL
Confirmed LCpdVs (A)                              423              953
Not LCpdVs (B)                                    13               12
Different POS (C)                                 65               179
Possible LCpdVs but contexts insufficient (D)     44               36
Minimum Precision (A/(A+B+D))                     0.88 (423/480)   0.95 (953/1001)
Maximum Precision ((A+D)/(A+B+D))                 0.97 (467/480)   0.99 (989/1001)
Total V1stem+V2 constructions in the corpus       10,145           36,115

Table 2: Final results of LCpdV extraction

A partial list of LCpdVs extracted from a test run on the CIIL corpus is presented in Table 3.

baandh denaa ‘tie’     Kar lenaa ‘do’     Bhar denaa ‘fill’          le jaanaa ‘take’                  Banaa denaa ‘make’
jaan lenaa ‘know’      kaaT denaa ‘cut’   Kar denaa ‘do’             Badal jaanaa ‘change’             Bhuul jaanaa ‘forget’
jalaa denaa ‘burn’     Gir jaanaa ‘fall’  Samajh lenaa ‘understand’  Samjhaa denaa ‘make understand’   Khod lenaa ‘dig’
lauTaa denaa ‘return’  Rah jaanaa ‘stay’  Le lenaa ‘take’            De denaa ‘give’                   ghusaa denaa ‘enter’

Table 3: Examples of LCpdV extraction

3 Conclusions and Future Work

In this paper, we have presented a study of Hindi compound verbs, proposed diagnostic tests for their detection and given automatic methods for their extraction from a corpus. Native speakers
verify that the accuracy of our method is close to 98% on representative corpora.

Future work will consist of inserting the extracted LCpdVs into lexical resources such as the Hindi wordnet² at the right places with the right links.

² Developed by the wordnet team at IIT Bombay, www.cfilt.iitb.ac.in/webhwn

References

Abbi, Anvita. 1991. Semantics of explicator compound verbs. In South Asian Languages, Language Sciences, 13:2, 161-180.

Abbi, Anvita. 1992. The explicator compound verb: some definitional issues and criteria for identification. Indian Linguistics, 53, 27-46.

Alsina, Alex. 1996. Complex Predicates: Structure and Theory. CSLI Publications, Stanford, CA.

Arora, H. 1979. Aspects of Compound Verbs in Hindi. M.Litt. dissertation, Delhi University.

Burton-Page, J. 1957. Compound and conjunct verbs in Hindi. BSOAS 19, 469-78.

Butt, M. 1993. Conscious choice and some light verbs in Urdu. In M. K. Verma ed. (1993) Complex Predicates in South Asian Languages. Manohar Publishers and Distributors, New Delhi.

Butt, M. 1995. The Structure of Complex Predicates in Urdu. Doctoral Dissertation, Stanford University.

Van de Cruys, Tim and B. V. Moirón. 2007. Semantics-based multiword expression extraction. ACL-2007 Workshop on Multiword Expressions.

Gopalkrishnan, D. and Abbi, A. 1992. The explicator compound verb: some definitional issues and criteria for identification. Indian Linguistics, 53, 27-46.

Miller, G., R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. 1990. Five Papers on WordNet. CSL Report 43, Cognitive Science Laboratory, Princeton University, Princeton. http://www.cogsci.princeton.edu/~wn

Narayan, D., D. Chakrabarty, P. Pande, and P. Bhattacharyya. 2002. An experience in building the Indo WordNet - a WordNet for Hindi. International Conference on Global WordNet (GWC 02), Mysore, India, January.

Pasca, Marius. 2005. Finding instance names and alternative glosses on the web: WordNet reloaded. Proceedings of CICLing, Mexico City.

Snow, Rion, Dan Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. Proceedings of COLING/ACL, Sydney.

Appendix A. Example of a diagnostic test for LCpdVs: scope of adverbs

Verb Type        Example                                       Comment                                  CP?
V1stem+V2        us-ne jaldii jaldii khaa li-aa                Scope over the whole sequence            Yes
                 ‘(S)he ate quickly.’
V1inf-e+lagnaa   vah jaldii se khaan-e lag-aa                  Scope over the whole sequence            Yes
                 ‘He started eating immediately.’
V1inf+paRnaa     mujhe yah kaam jaldii karnaa paR-aa           Scope over the whole sequence            Yes
                 ‘I had to do the work quickly.’
V1inf-e+V2       us-ne mujhe khat jaldii se likhn-e kah-aa     Either over V1 or V2, depending on the   No
                 ‘He asked me to write the letter quickly.’    syntactic position of the adverb
V1–kar+V2        vah jaldii se nahaa-kar aa-yeg-aa             Either over V1 or V2, depending on the   No
                 ‘He will take a bath quickly and come.’       syntactic position of the adverb
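
The heuristic H* of Figure 2 can be illustrated with a minimal Python sketch. It is not the implementation used in this study: it assumes the corpus has already been tokenised and morphologically analysed, and the Token record, its fields and the "V" tag are hypothetical names introduced only for this illustration. The verb list is the one given in Figure 1.

```python
from typing import NamedTuple

# The ten vector verbs (V2) of Figure 1, in citation (infinitive) form.
VECTOR_VERBS = frozenset({
    "Daalnaa", "lenaa", "denaa", "uThnaa", "jaanaa",
    "paRnaa", "baiThnaa", "maarnaa", "dhamaknaa", "girnaa",
})


class Token(NamedTuple):
    """Hypothetical pre-analysed corpus token (an assumption of this sketch)."""
    surface: str   # romanised word form as it occurs in the corpus
    lemma: str     # citation form, e.g. "lenaa" for "li-aa"
    pos: str       # assumed tag set: "V" marks a verb
    is_stem: bool  # True if a verb occurs as a bare stem


def extract_lcpdv_candidates(tokens):
    """Return (V1 stem, V2 lemma) pairs that satisfy heuristic H*."""
    candidates = []
    # Inspect every pair of adjacent tokens in the sentence.
    for v1, v2 in zip(tokens, tokens[1:]):
        v1_is_verb_stem = v1.pos == "V" and v1.is_stem
        v2_is_vector = v2.pos == "V" and v2.lemma in VECTOR_VERBS
        if v1_is_verb_stem and v2_is_vector:
            candidates.append((v1.surface, v2.lemma))
    return candidates


# First example of Appendix A: us-ne jaldii jaldii khaa li-aa
sentence = [
    Token("us-ne", "us-ne", "PRON", False),
    Token("jaldii", "jaldii", "ADV", False),
    Token("jaldii", "jaldii", "ADV", False),
    Token("khaa", "khaanaa", "V", True),
    Token("li-aa", "lenaa", "V", False),
]
print(extract_lcpdv_candidates(sentence))  # [('khaa', 'lenaa')]
```

Because the sketch only checks adjacency, stem form and list membership, it would also return the false positives discussed in section 2.3 (part-of-speech ambiguity, passives, idioms); candidates would still need manual validation.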
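
The precision bounds reported in Table 2 can be recomputed from the counts A, B and D. The following Python sketch assumes that the minimum treats the undecided candidates (D) as errors and the maximum treats them as correct, which is what the reported fractions 467/480 and 989/1001 correspond to; the function and variable names are illustrative only.

```python
def precision_bounds(confirmed, rejected, undecided):
    """Return (minimum, maximum) precision from the Table 2 counts A, B, D."""
    total = confirmed + rejected + undecided
    minimum = confirmed / total                # undecided counted as errors
    maximum = (confirmed + undecided) / total  # undecided counted as correct
    return minimum, maximum


# Counts (A, B, D) taken from Table 2.
TABLE_2 = {"BBC": (423, 13, 44), "CIIL": (953, 12, 36)}

for corpus, counts in TABLE_2.items():
    low, high = precision_bounds(*counts)
    print(f"{corpus}: min {low:.2f}, max {high:.2f}")
# BBC: min 0.88, max 0.97
# CIIL: min 0.95, max 0.99
```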