
Hindi Compound Verbs and their Automatic Extraction

Debasri Chakrabarti
Humanities and Social Sciences Department, IIT Bombay
[email protected]

Hemang Mandalia
Computer Science and Engineering Department, IIT Bombay
[email protected]

Ritwik Priya
Computer Science and Engineering Department, IIT Bombay
[email protected]

Vaijayanthi Sarma
Humanities and Social Sciences Department, IIT Bombay
[email protected]

Pushpak Bhattacharyya
Computer Science and Engineering Department, IIT Bombay
[email protected]

Abstract

We analyse Hindi complex predicates and propose linguistic tests for their detection. This analysis enables us to identify a category of V+V complex predicates called lexical compound verbs (LCpdVs) which need to be stored in the dictionary. Based on the linguistic analysis, a simple automatic method has been devised for extracting LCpdVs from corpora. We achieve an accuracy of around 98% in this task. The LCpdVs thus extracted may be used to automatically augment lexical resources like wordnets, an otherwise time-consuming and labour-intensive process.

1 Introduction

Complex predicates (CPs) abound in South Asian languages (Butt, 1995; Hook, 1974), primarily as either noun+verb combinations (conjunct verbs) or verb+verb (V+V) combinations (compound verbs). This paper discusses the latter.

Of the many V+V sequences in Hindi, only a subset constitutes true CPs. Thus, we first need diagnostic tests to differentiate between CP and non-CP V+V sequences. Of the CPs thus isolated, we need to distinguish between those CPs that are formed in the syntax (derivationally) and those that are formed in the lexicon (LCpdVs) in order to include only the latter in lexical knowledge bases. Further, automatic extraction of LCpdVs from electronic corpora and their inclusion in lexical knowledge bases is a desirable goal for languages like Hindi, which liberally use CPs.

This paper discusses Hindi Verb+Verb (V+V) CPs and their automatic extraction from a corpus.

1.1 Related work

Alsina (1996) discusses the general theory of complex predicates. Early work on conjunct and compound verbs in Hindi appears in Burton-Page (1957) and Arora (1979). Our work on diagnostic tests for CPs, as reported here, has been inspired by Butt (1993, 1995, for Urdu) and Paul (2004, for Bengali). The analysis of lexical derivation of LCpdVs derives from the work on compound verbs by Abbi (1991, 1992) and Gopalkrishnan and Abbi (1992).

This work is motivated primarily by the need to automatically augment lexical networks such as the Princeton Wordnet (Miller et al., 1990) and the Hindi Wordnet (Narayan et al., 2002). Pasca (2005) and Snow et al. (2006) report work on such augmentations by processing web documents.

To the best of our knowledge, ours is the first attempt at automatic extraction of LCpdVs from Hindi corpora.

© 2008. Licensed under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported license (http://creativecommons.org/licenses/by-nc-sa/3.0/). Some rights reserved.

Coling 2008: Companion volume – Posters and Demonstrations, pages 27–30
Manchester, August 2008
1.2 Organization of the paper

Section 2 discusses CPs in Hindi and the ways to distinguish them from other, similar-looking constructions. Section 2.3 discusses the automatic extraction of CPs from corpora. Section 3 concludes the paper.

2 V+V Complex Predicates in Hindi

We have identified five different types of V+V sequences in Hindi. These are:

1. V1stem+V2: maar Daalnaa (kill-put) ‘kill’.
2. V1inf-e+lagnaa: rone lagnaa (cry-feel) ‘start crying’.
3. V1inf+paRnaa: bolnaa paRaa (say-lie) ‘had to say’.
4. V1inf-e+V2: likhne ko/ke lie kahaa ‘asked to write’.
5. V1–kar+V2: lekar gayaa ‘took and went’.

2.1 Identification of CPs

Following Butt (1993) and Paul (2004), we use the following diagnostic tests to identify CPs in Hindi:

1. Scope of adverbs
2. Scope of negation
3. Nominalization
4. Passivization
5. Causativization
6. Movement
(see Appendix A for an example of these tests)

The tests above have been exhaustively applied to varied data. The results of these tests show that some V+V sequences function as single semantic units and others do not. They also show that the V1stem+V2, V1inf-e+lagnaa and V1inf+paRnaa sequences show similar properties, and that the V1inf-e+V2 and the V1–kar+V2 sequences behave similarly. We call these Group 1 and Group 2 respectively.

Group 1 sequences are true CPs in Hindi. The V+V sequences are simple predicates (monoclausal) with one subject. Group 2 constructions are not CPs. They show clausal embedding and each verb behaves as if it were an independent syntactic entity. In the next section we summarize the semantic properties of CPs (Group 1).

2.2 Semantic Properties of V2 in Group 1

After identifying the CPs from among different V+V sequences, the next step was to determine how they are formed. To accomplish this we examined the semantic properties of the second verbs (V2) in Group 1:

(1) V1inf+paRnaa:
Examples include karnaa paRaa ‘do-lie (had to do)’, bolnaa paRaa ‘say-lie (had to say)’ etc. The second verb is always paRnaa ‘to lie (lay)’. It appears in its stem form and bears all the inflections. As V2, paRnaa has the meaning of compulsion/force. paRnaa ‘lie’ as a V2 can be combined with any V1 irrespective of the latter’s semantic properties. Since there are no syntactic or semantic restrictions on the selection of V1, this construction should be treated in the syntax as a combination of a V1 and a modal auxiliary.

(2) V1inf-e+lagnaa:
Examples include karne lagaa ‘do-feel (start to do)’, bolne lagaa ‘say-feel (start to say)’ etc. The V2 in this sequence is always lagnaa ‘feel’ in the bare form and carries all the inflections. The core meaning of lagnaa ‘feel’ is lost when it is combined with a V1. As a V2 it always has the meaning of the beginning or happening of an event. lagnaa ‘feel’ as a V2 can be combined with any V1 irrespective of the latter’s semantic properties. Thus, this is also an instance of a modal auxiliary and should be derived in the syntax.

(3) V1stem+V2:
In the formation of V1stem+V2, the V2 may be any one of ten verbs, as shown in Figure 1.

1. Daalnaa ‘put’
2. lenaa ‘take’
3. denaa ‘give’
4. uThnaa ‘wake’
5. jaanaa ‘go’
6. paRnaa ‘lie’
7. baiThnaa ‘sit’
8. maarnaa ‘kill’
9. dhamaknaa ‘throb’
10. girnaa ‘fall’

Figure 1: The 10 vector verbs

All these V2s also occur as main verbs. As V2, the core meaning of these verbs is lost (bleached), but they acquire some new semantic properties which are otherwise not seen (Abbi, 1991, 1992; Gopalkrishnan and Abbi, 1992). The semantic properties of V2s include finality, definiteness, negative value, manner of the action, attitude of the speaker etc.

The combination of V1 and V2 is subject to the semantic compatibility between the two verbs.
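The three Group 1 patterns described above can be read as a small decision procedure over the form of V1 and the lemma of V2. The following Python sketch is only an illustration of that reading: the function name `classify_v2`, the form labels ("stem", "inf", "inf-e", "kar") and the returned category strings are assumptions introduced for this example and are not part of the paper. Only the verb lists come from the text and Figure 1.

```python
# Vector verbs listed in Figure 1 (romanised as in the paper).
VECTOR_VERBS = frozenset({
    "Daalnaa", "lenaa", "denaa", "uThnaa", "jaanaa",
    "paRnaa", "baiThnaa", "maarnaa", "dhamaknaa", "girnaa",
})

# V2s that behave like modal auxiliaries, keyed by the V1 form they require.
# The meaning strings paraphrase the text above.
MODAL_V2 = {
    ("inf", "paRnaa"): "compulsion",
    ("inf-e", "lagnaa"): "beginning of an event",
}


def classify_v2(v1_form: str, v2_lemma: str) -> str:
    """Classify a V1+V2 pair by the form of V1 and the lemma of V2.

    The labels returned here are hypothetical names used only in this sketch.
    """
    if (v1_form, v2_lemma) in MODAL_V2:
        return "modal auxiliary (derived in the syntax)"
    if v1_form == "stem" and v2_lemma in VECTOR_VERBS:
        return "vector verb (V1stem+V2)"
    return "other"


if __name__ == "__main__":
    print(classify_v2("inf", "paRnaa"))    # as in karnaa paRaa
    print(classify_v2("inf-e", "lagnaa"))  # as in karne lagaa
    print(classify_v2("stem", "Daalnaa"))  # as in maar Daalnaa
    print(classify_v2("stem", "paRnaa"))   # paRnaa is also a vector verb
```

Note that paRnaa falls into different classes depending on the form of V1, which is why the sketch keys the modal cases on the pair (V1 form, V2 lemma) rather than on the V2 lemma alone.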

The argument structure of the CP is determined by V1, as is the case-marking on the internal arguments, but the case-marking on the external argument (subject) is determined by both verbs.

From this analysis we conclude that V+V CPs are formed both lexically and syntactically in Hindi. Detailed investigation shows us that the V2 in the V1inf-e+lagnaa and the V1inf+paRnaa constructions is a type of modal auxiliary and its semantic features are predictable and unvarying. We propose to deal with these verbs in the syntax and call these verbs syntactic compound verbs (SCpdVs). The V2 choice in the V1stem+V2 is not predictable and the CPs function as a single complex of syntactic and semantic features. We call these verbs lexical compound verbs (LCpdVs) and we propose to include them in the lexical knowledge base. In the next section we provide a heuristic for automatic extraction of LCpdVs for storage in the lexicon.

2.3 The Extraction Process

By scanning the corpus, V1stem+V2 sequences were found given the heuristic H* specified in Figure 2.

(Heuristic H*)
If a verb V1 is in the stem form and is followed by a verb V2 from a pre-stored list of verbs that can form the second component of the CP (section 2.2, Figure 1), i.e., the ‘vector’, then this verb along with the V2 is taken to be an instance of an LCpdV.

Figure 2: Main heuristic for identifying LCpdVs

Ten native speakers of Hindi were consulted. They were asked to construct sentences with the extracted sequences. If they were able to do so, that sequence was registered as a true LCpdV.

The precision of the heuristic is calculated as the ratio of the actual LCpdVs arrived at through manual validation to the total number of anticipated LCpdVs identified by the heuristic.

The results of these calculations are shown in Table 1, with a precision rate of 70% for the BBC corpus and 79% for the CIIL one.

Corpus | Total detections | POS ambiguities | Passive forms | LCpdVs (manually detected) | Precision
BBC | 40 | 8 | 4 | 28 | 0.7 (28/40)
CIIL | 174 | 32 | 7 | 135 | 0.79 (135/174)

Table 1: Precision of LCpdV extraction

The loss in precision was caused by (i) part-of-speech ambiguity, (ii) passivisation and (iii) idiomatic usages. For lack of space, we do not discuss these here.

When measures were taken to remedy these errors, we reached an accuracy of close to 98% (see Table 2).

Measure | BBC | CIIL
Confirmed LCpdVs (A) | 423 | 953
Not LCpdVs (B) | 13 | 12
Different POS (C) | 65 | 179
Possible LCpdVs but contexts insufficient (D) | 44 | 36
Minimum Precision (A/(A+B+D)) | 0.88 (423/480) | 0.95 (953/1001)
Maximum Precision ((A+D)/(A+B+D)) | 0.97 (467/480) | 0.99 (989/1001)
Total V1stem+V2 constructions in the corpus | 10,145 | 36,115

Table 2: Final results of LCpdV extraction

A partial list of LCpdVs extracted from a test run on the CIIL corpus is presented in Table 3.

baandh denaa ‘tie’ | Kar lenaa ‘do’ | Bhar denaa ‘fill’ | le jaanaa ‘take’ | Banaa denaa ‘make’
jaan lenaa ‘know’ | kaaT denaa ‘cut’ | Kar denaa ‘do’ | Badal jaanaa ‘change’ | Bhuul jaanaa ‘forget’
jalaa denaa ‘burn’ | Gir jaanaa ‘fall’ | Samajh lenaa ‘understand’ | Samjhaa denaa ‘make understand’ | Khod lenaa ‘dig’
lauTaa denaa ‘return’ | Rah jaanaa ‘stay’ | Le lenaa ‘take’ | De denaa ‘give’ | ghusaa denaa ‘enter’

Table 3: Examples of LCpdV extraction

3 Conclusions and Future Work

In this paper, we have presented a study of Hindi compound verbs, proposed diagnostic tests for their detection and given automatic methods for their extraction from a corpus. Native speakers

verify that the accuracy of our method is close to 98% on representative corpora.

Future work will consist in inserting the extracted LCpdVs into lexical resources such as the Hindi wordnet² at the right places with the right links.

² Developed by the wordnet team at IIT Bombay, www.cfilt.iitb.ac.in/webhwn

References

Abbi, Anvita. 1991. Semantics of explicator compound verbs. In South Asian Languages, Language Sciences, 13:2, 161-180.

Abbi, Anvita. 1992. The explicator compound verb: some definitional issues and criteria for identification. Indian Linguistics, 53, 27-46.

Alsina, Alex. 1996. Complex Predicates: Structure and Theory. CSLI Publications, Stanford, CA.

Arora, H. 1979. Aspects of Compound Verbs in Hindi. M.Litt. dissertation, Delhi University.

Burton-Page, J. 1957. Compound and conjunct verbs in Hindi. BSOAS 19, 469-78.

Butt, M. 1993. Conscious choice and some light verbs in Urdu. In M. K. Verma (ed.), Complex Predicates in South Asian Languages. Manohar Publishers and Distributors, New Delhi.

Butt, M. 1995. The Structure of Complex Predicates in Urdu. Doctoral dissertation, Stanford University.

Van de Cruys, Tim and B. V. Moiron. 2007. Semantics-based multiword expression extraction. ACL-2007 Workshop on Multiword Expressions.

Gopalkrishnan, D. and Abbi, A. 1992. The explicator compound verb: some definitional issues and criteria for identification. Indian Linguistics, 53, 27-46.

Miller, G., R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. 1990. Five Papers on WordNet. CSL Report 43, Cognitive Science Laboratory, Princeton University, Princeton. http://www.cogsci.princeton.edu/~wn

Narayan, D., D. Chakrabarty, P. Pande, and P. Bhattacharyya. 2002. An experience in building the Indo WordNet - a WordNet for Hindi. International Conference on Global WordNet (GWC 02), Mysore, India, January.

Pasca, Marius. 2005. Finding instance names and alternative glosses on the web: WordNet reloaded. Proceedings of CICLing, Mexico City.

Snow, Rion, Dan Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. Proceedings of COLING/ACL, Sydney.

Appendix A. Example of a diagnostic test for LCpdVs: scope of adverbs

Verb Type | Example | Comment | CP?
V1stem+V2 | us-ne jaldii jaldii khaa li-aa ‘(S)he ate quickly.’ | Scope over the whole sequence | Yes
V1inf-e+lagnaa | vah jaldii se khaan-e lag-aa ‘He started eating immediately.’ | Scope over the whole sequence | Yes
V1inf+paRnaa | mujhe yah kaam jaldii karnaa paR-aa ‘I had to do the work quickly.’ | Scope over the whole sequence | Yes
V1inf-e+V2 | us-ne mujhe khat jaldii se likhn-e kah-aa ‘He asked me to write the letter quickly.’ | Scope over either V1 or V2, depending on the syntactic position of the adverb | No
V1–kar+V2 | vah jaldii se nahaa-kar aa-yeg-aa ‘He will take a bath quickly and come.’ | Scope over either V1 or V2, depending on the syntactic position of the adverb | No
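As an illustration of heuristic H* (Figure 2), the following Python sketch scans a sentence for a verb in stem form immediately followed by one of the ten vector verbs. It is a minimal sketch under stated assumptions, not the authors' implementation: the paper does not describe its corpus format, so the `Token` fields, the tag values ("VERB", "stem", "kar", and so on) and the function name `extract_candidates` are all hypothetical. The example sentences are the romanised ones from Appendix A.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One analysed word; this layout is an assumption of the sketch."""
    surface: str  # word as it appears in the text
    lemma: str    # citation (infinitive) form for verbs
    pos: str      # coarse part-of-speech tag, e.g. "VERB"
    form: str     # morphological form, e.g. "stem", "inf", "kar", "perf"


# Vector verbs listed in Figure 1 (romanised as in the paper).
VECTOR_VERBS = frozenset({
    "Daalnaa", "lenaa", "denaa", "uThnaa", "jaanaa",
    "paRnaa", "baiThnaa", "maarnaa", "dhamaknaa", "girnaa",
})


def extract_candidates(tokens: List[Token]) -> List[str]:
    """Return 'V1stem V2lemma' strings for adjacent pairs matching H*."""
    found = []
    for v1, v2 in zip(tokens, tokens[1:]):
        v1_is_stem = v1.pos == "VERB" and v1.form == "stem"
        v2_is_vector = v2.pos == "VERB" and v2.lemma in VECTOR_VERBS
        if v1_is_stem and v2_is_vector:
            found.append(f"{v1.surface} {v2.lemma}")
    return found


if __name__ == "__main__":
    # us-ne jaldii jaldii khaa li-aa  '(S)he ate quickly.'
    group1 = [
        Token("us-ne", "vah", "PRON", "erg"),
        Token("jaldii", "jaldii", "ADV", ""),
        Token("jaldii", "jaldii", "ADV", ""),
        Token("khaa", "khaanaa", "VERB", "stem"),
        Token("li-aa", "lenaa", "VERB", "perf"),
    ]
    print(extract_candidates(group1))  # ['khaa lenaa']
```

Because the check relies entirely on the part-of-speech tag and the form label of each token, a mis-tagged word directly yields a false candidate, which is consistent with part-of-speech ambiguity being reported as a source of precision loss.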

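The precision bounds reported in Table 2 can be re-derived from the counts A, B and D. The sketch below is a worked check rather than part of the original method; the function name `precision_bounds` is an assumption. It treats the D cases (possible LCpdVs with insufficient context) as wrong for the lower bound and as correct for the upper bound, which is the reading that reproduces the reported fractions 467/480 and 989/1001. The C cases (different POS) do not enter either ratio.

```python
from typing import Tuple


def precision_bounds(a: int, b: int, d: int) -> Tuple[float, float]:
    """Return (minimum, maximum) precision from validated counts.

    a: confirmed LCpdVs
    b: sequences judged not to be LCpdVs
    d: possible LCpdVs whose contexts were insufficient to decide
    """
    total = a + b + d
    minimum = a / total        # every undecided case counted as wrong
    maximum = (a + d) / total  # every undecided case counted as correct
    return minimum, maximum


if __name__ == "__main__":
    # Counts (A, B, D) taken from Table 2.
    counts = {"BBC": (423, 13, 44), "CIIL": (953, 12, 36)}
    for corpus, (a, b, d) in counts.items():
        low, high = precision_bounds(a, b, d)
        print(f"{corpus}: min={low:.2f} max={high:.2f}")
    # BBC: min=0.88 max=0.97
    # CIIL: min=0.95 max=0.99
```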