Grammar Extraction From Treebanks For Hindi and Telugu
is being used for parsing (Bharati et al., 2009a)

3. To generalize verb argument structure information over the extracted verb frames to address sparsity in the annotated corpora

4. To aid in the validation of treebanks by detecting different types of annotation errors using the extracted grammars

3. Related Work

In this section we briefly survey some of the work on grammar extraction and on generalization using syntactic similarity. We also mention a few details about the two Indian language treebanks that we used. Syntactic alternation can be an important criterion while generalizing verbs, so we briefly discuss how syntactic alternation in Hindi differs from English.

3.1. Grammar Extraction

The role of grammars in NLP is more extensive than is generally supposed. Xia (2000) points out that the task of treebanking for a language bears much similarity to the task of manually crafting a grammar. The treebank of a language contains an implicit grammar for that language, and statistical NLP systems trained over a treebank make use of this implicit grammar. This is why grammar-driven approaches and data-driven or statistical approaches are not necessarily mutually exclusive. It is well known that the traditional approach of manually crafting a high-quality, large-coverage grammar takes tremendous human effort to build and maintain. In addition, the traditional approach does not provide for flexibility, consistency and generalization. To address these limitations of the traditional approach to grammar development, Xia (2001) presents two alternative approaches that generate grammars automatically, one from descriptions (LexOrg) and the other from treebanks (LexTract).

The LexTract system extracts explicit grammars in the TAG formalism from a treebank. It is not, however, limited to the TAG formalism, as it can also extract CFGs from a treebank. Large-scale treebanks such as the English Penn Treebank (PTB) are not based on existing grammars; instead, they were manually annotated following annotation guidelines. Since the process of creating annotation guidelines is similar to the process of building a grammar by hand, it can be assumed that an implicit grammar, hidden in the annotation guidelines, generates the structures in the treebank. This implicit grammar can be called a treebank grammar. As suggested by Xia, the task of grammar extraction using LexTract can be seen as the task of converting this implicit treebank grammar to an explicit TAG grammar. LexTract builds an LTAG grammar in two stages. First, it converts the annotated phrase structure trees in the PTB into LTAG derived trees. In the second stage, it decomposes these derived trees into a set of elementary trees which form the basic units of an LTAG grammar. It also extracts derivation trees which provide information about the order of operations necessary to build the corresponding derived trees. In her work, Xia has demonstrated the process for treebanks of three languages: English, Chinese and Korean. She also showed that grammars extracted using LexTract have several applications. They can be used as stand-alone grammars for languages that do not have existing grammars. They can be used to enhance the coverage of already existing grammars. They can be used to compare grammars of different languages. The derivation trees extracted using LexTract can be used to train statistical parsers and taggers. LexTract can also help detect certain kinds of annotation errors and thereby semi-automate the process of treebank validation. A major advantage of the LexTract approach to grammar development is that it can provide valuable statistical information in the form of weights associated with primitive elements.

The work we present in this paper is on the same lines as the LexTract approach to grammar development, but on a much smaller scale. It is meant to be the first step towards building a LexTract-like system for extracting CPG grammars for Indian languages. Since we worked with dependency treebanks of Hindi and Telugu, we chose a dependency grammar formalism known as Computational Paninian Grammar (CPG). In fact, the annotation guidelines followed to annotate the treebanks are based on this grammar (Bharati et al., 2009b). As such, the grammar extraction process is much more straightforward than the one in LexTract. In the next section, we give a brief outline of the CPG formalism, where we define the basic terminology and briefly discuss the components of a CPG grammar.

3.2. Generalization Based on Syntactic Similarity

The problem of sparse data in PropBank has been previously addressed using syntactic similarity based generalization of semantic roles across verbs (Gordon and Swanson, 2007). We try to address the data sparseness problem by generalizing over argument structure across syntactically similar verbs to arrive at an automatic verb classification. Gordon and Swanson (2007) define syntactic similarity for phrase structure trees using the notion of a parse tree path (Gildea and Jurafsky, 2002). Gildea and Jurafsky define a parse tree path as 'the path from the target word through the parse tree to the constituent in question, represented as a string of parse tree non-terminals linked by symbols indicating upward and downward movement through the tree'. For example, in 'He ate some pancakes', the path from the target verb 'ate' to the subject 'He' is VB↑VP↑S↓NP. This parse tree path feature is used to represent the syntactic relationships between a predicate and its arguments in a parse tree. The syntactic context of a verb is extracted as the set of all possible parse tree paths from the parse trees of sentences containing that verb. The syntactic context of a verb is then converted into a feature vector representation. The syntactic similarity between two verbs is calculated using different distance measures such as Euclidean distance, the Chi-square statistic, cosine similarity, etc. In our work, we present an analogous measure of syntactic similarity for the dependency structures in the Indian Language (IL) Treebanks, which is described in section 5. We characterize the syntactic context of a verb using a karaka frame representation. The notion of karakas is explained in the next section.
3.3. Syntactic Alternations in Hindi

Syntactic alternations of a verb have been claimed to reflect its underlying semantic properties. Levin's classification of English verbs (Levin, 1993), based on this assumption, demonstrates how the syntactic alternation behavior of a verb can be correlated to its semantic properties, thereby leading to a semantic classification. There have also been several attempts at automatically identifying distinct clusters of verbs that behave similarly using clustering algorithms. These empirically derived clusters were then compared against Levin's classification (Merlo and Stevenson, 2001).

The following are some linguistic aspects of verb alternation behavior that we encountered in Hindi:

• In Hindi, the inchoative-transitive alternation pattern cannot be considered an alternation of the same verb stem. The verb stems in such constructions, although morphologically related, are mostly distinct. This is illustrated in the examples below:

Inchoative:
darawAzA KulA
door-3PSg-Nom open
'The door opened.'

Transitive:
Atifa-ne darawAzA KolA
Atif-3PSg-Erg door-3PSg open
'Atif opened the door.'

• Similarly, the diathesis alternation pattern discussed by Levin is not exhibited by Hindi verbs.

• Since Hindi is a morphologically rich, free word order language, the alternations are not with respect to the position of the constituent, as is the case in English. In Hindi, alternations are with respect to the case endings (or the post-positions) of the nouns, which are called vibhaktis in CPG.

• Post-position or vibhakti alternation is determined by the form that the verb stem takes in a particular construction. In other words, the arguments of a verb are realized using different case endings or vibhaktis based on the tense, aspect and modality (TAM) features of the verb. This is illustrated in the examples below:

abhaya rotI KatA hE
Abhay-Nom-3PSgM bread eat-pres.simp.-3PSgM
'Abhay eats bread.'

abhaya-ne rotI KAyI
Abhay-Erg bread-3PSgF eat-past.simp.-3PSgF
'Abhay ate bread.'

abhaya-ne rotI-ko KAyA
Abhay-Erg bread-Acc eat-past.simp.-default
'Abhay ate bread.'

In the above sentences, the nominal vibhaktis (case endings or post-positions) change according to the TAM and agreement features of the verb. This co-variation of vibhaktis with the verb's inflectional features holds not only for finite verb forms but also for non-finite verb forms. All this information is exploited in the CPG formalism in a systematic way, as discussed in the next section.

3.4. Indian Language Treebanks

In this sub-section, we give a very brief overview of the treebanks used in our work. We worked with treebanks of two Indian languages, Hindi and Telugu. The treebanks for Hindi and Telugu contain 2403 and 1226 sentences respectively. The development of these treebanks is an ongoing effort. The Hindi treebank is part of a multi-level resource development project (Bhatt et al., 2009). Some of the salient features of the annotation process employed in the development of these treebanks are as follows:

• The syntactic structure of sentences is based on the dependency representation scheme.

• Dependency relations in the Hindi treebank are annotated on top of a manually POS-tagged and chunked corpus. In the Telugu treebank, the POS-tagging and chunking were not performed manually.

• Dependency relations are defined between chunk heads.

• The dependency tagset used to annotate dependency relations is based on the CPG formalism, which we discuss in section 4.

4. Computational Paninian Grammar

In this section, we give a brief overview of the Computational Paninian Grammar (CPG) formalism. We only outline details relevant to our goal of grammar extraction. See Bharati et al. (1995) for a detailed discussion of the CPG formalism and the Paninian theory on which it is based. In subsection 4.1, we introduce the basic terminology necessary for an overview of this formalism.

4.1. Terminology

• The notion of karaka relations is central to Paninian Grammar. Karaka relations are syntactico-semantic relations between the verbs and other related constituents in a sentence. Each of the participants in an activity denoted by a verbal root is assigned a distinct karaka. There are six different types of karaka relations in the Paninian grammar, as listed below:

1. k1: karta, participant central to the action denoted by the verb
2. k2: karma, participant central to the result of the action denoted by the verb
3. k3: karana, instrument essential for the action to take place
4. k4: sampradana, beneficiary/recipient of the action
5. k5: apadana, participant which remains stationary (or is the reference point) in an action involving separation/movement
6. k7: adhikarana, real or conceptual space/time (in the tagset used, k7p represents spatial location, k7t represents temporal location and k7/k7v represents conceptual location)

For example, in the following example sentence:

samIrA-ne abhaya-ko phUla diyA
Samira-Erg Abhay-Dat flower-Acc give.past.3PSgM
'Samira gave a flower to Abhay.'

Samira is the karta (k1), the flower is the karma (k2) and Abhay is the sampradana (k4). Similarly, in the following example:

Atifa ne kueM se pAnI nikAlA
Atif-Erg well-Abl water-Acc draw.3PSgM
'Atif drew water from the well.'

Atif is the karta (k1), the well is the apadana (k5) and the water is the karma (k2).

In addition to these karaka relations, there are some additional relations in the Paninian scheme, such as tadarthya (or purpose). The complete tagset can be found at http://ltrc.iiit.ac.in/MachineTrans/research/tb/dep-tagset.pdf.

• The notion of vibhakti relates to the notion of local word groups based on case ending, preposition and post-position markers. For a nominal word group, the vibhakti is the post-position (also known as parsarg) occurring after the noun. Similarly, in the case of a verbal word group, a head verb may be followed by auxiliary verbs which may remain as separate words or may combine with the head verb. This information following the head verb (in other words, the verb stem) is collectively called the vibhakti of the verb. The vibhakti of a verb contains information about the tense, aspect and modality (TAM) and also the agreement features assigned to the verb in a syntactic construction. Therefore, it can also be referred to as the TAM marker of the verb.

In the previous example sentence, the nouns 'Atifa' and 'kuAM' have the vibhaktis '-ne' and '-se' respectively. The vibhakti of the verb 'nikAla' is 'yA', which is also its TAM label.

Nominal vibhaktis have also been found to be important syntactic cues for the identification of semantic roles in the CPG scheme (Bharati et al., 2008).

4.2. Components of CPG: Demand Frames and Transformation Frames

A key aspect of Paninian grammar (CPG) is that the verb group containing a finite verb is the most important word group (equivalent to the notion of a 'head') of a sentence. For other word groups in the sentence dependent on this head, the vibhakti information of the word group is used to map it to an appropriate karaka relation. This karaka-vibhakti mapping depends on the main verb and its TAM label. The mapping is represented by two templates: the default karaka chart (also known as the basic demand frame) and the karaka chart transformation (also known as the transformation frame). The default demand frame defines the mapping for a verb or a class of verbs with respect to a basic reference TAM label. It specifies the karaka relations selected by the verb along with the vibhaktis allowed by the basic TAM label. The basic reference TAM label in CPG is chosen to be 'tA hE', which is equivalent to the Present Indefinite/Simple Present. For any other TAM label of that verb or verb class, a transformation rule is defined that can be applied to the default demand frame to obtain the appropriate karaka-vibhakti mapping for that TAM combination. The transformation rules can affect the default demand frame in three ways, each defined as an operation in CPG:

1. Insert: a new karaka relation is inserted into the demand frame along with its vibhakti mapping

2. Delete: an existing karaka relation is deleted from the default demand frame

3. Update: a karaka-vibhakti mapping entry in the default demand frame is updated by modifying the vibhakti information according to the new TAM label
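To make the three operations concrete, the following Python sketch applies a transformation frame to a default demand frame. The dictionary encoding, the rule format and the function name are our own illustrative assumptions rather than the paper's implementation; the frame contents loosely follow the 'de'/'yA' example discussed next.

import copy

# A demand frame maps each karaka relation selected by a verb to the
# vibhakti allowed under the basic reference TAM label 'tA hE'.
# Contents are illustrative, loosely following the ditransitive 'de' (to give).
default_frame = {
    "k1": "0",    # karta: unmarked under the basic TAM label
    "k2": "0",    # karma: unmarked
    "k4": "ko",   # sampradana: marked with 'ko'
}

# A transformation frame for a TAM label, encoded here as a list of
# (operation, karaka, vibhakti) edits; this rule format is an assumption.
transform_yA = [
    ("update", "k1", "ne"),   # perfective 'yA' puts ergative 'ne' on the karta
]

def apply_transformation(frame, rules):
    """Apply Insert/Delete/Update operations to a copy of a demand frame."""
    out = copy.deepcopy(frame)
    for op, karaka, vibhakti in rules:
        if op == "insert":
            out[karaka] = vibhakti      # add a new karaka-vibhakti entry
        elif op == "delete":
            out.pop(karaka, None)       # drop an existing karaka
        elif op == "update":
            out[karaka] = vibhakti      # modify the vibhakti mapping
    return out

print(apply_transformation(default_frame, transform_yA))
# {'k1': 'ne', 'k2': '0', 'k4': 'ko'}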
The default demand frame and a transformation frame for the ditransitive verb 'de' (to give) and the TAM label 'yA' are shown below as an example.

Figure 1: Basic demand frame for the verb 'de' (to give)

5. Syntactic Similarity for Dependency Structures

A key concept in much of the previous work on semantic role labeling is the notion of a parse tree path (Gildea and Jurafsky, 2002). The parse tree path representation is based on PTB-style phrase structure (PS) trees. Gordon and Swanson (2007) define syntactic similarity between words using the parse tree path representations of their syntactic contexts.

In the case of dependency parse structures, however, the notion of a parse tree path is not required. This is because dependency structures are based on the idea that the syntactic structure of a sentence consists of binary asymmetrical modifier-modified relations between the words of that sentence. Therefore, in a dependency structure, the information about the syntactic relationship between a predicate and its arguments is trivially represented as a parent-child relationship between a head (modified) and its dependents (modifiers). For any predicate in a dependency structure, an argument frame representation, which contains information about the categorial type, the semantic role (in the case of labeled dependencies) and other kinds of information about each of the arguments, can be extracted readily from the tree. Such an argument frame representation in a dependency structure is equivalent to the parse tree path representation for PS trees. We refer to these argument frame representations of predicates extracted from a dependency parse tree as karaka frame representations. Figure 3 shows the dependency structure of a sentence containing the verb 'KA' (to eat).

Figure 3: Dependency structure containing 'KA' (to eat)

The karaka frames extracted for the verb 'KA' in this parse tree are [NP k1 ne] and [NP k2 0]. These extracted karaka frames characterize the syntactic context of the verb 'KA' (to eat) for this sentence. The set of all karaka frames extracted from each sentence containing a given verb characterizes the possible syntactic contexts of that verb. The next step is to represent this syntactic context as a feature vector. This is done simply by tabulating the frequencies of each distinct karaka frame into a feature vector representation. The resulting feature vectors are normalized by dividing the frequencies by the number of instances of that particular verb stem. The syntactic similarity between two verbs is calculated as the distance between their feature vector representations using a variety of distance metrics such as Euclidean distance, Manhattan distance and cosine similarity.

6. Extracting a CPG Grammar from the Treebank

In this section, we describe the various steps through which CPG grammars for Hindi and Telugu were extracted from the treebanks. An illustrative code sketch of these steps and of the similarity computation of section 5 is given after the discussion of the four cases below.

1. The verb nodes are identified from each sentence in the treebank. For each verb node in the dependency structure of a sentence, the subtree rooted at that node is extracted. Karaka frames for this subtree are extracted as shown in the previous section. A karaka frame corresponding to each child of the verb node is obtained. An extracted karaka frame contains the following information about the dependent (modifier): category type, relation type and post-position. Information about the verb stem is also included in the frame. In other words, the karaka frames extracted for a verb node are lexicalized. At the end of this step, we have a set of karaka frames corresponding to each verb instance in the treebank. We call these frames 'verb instance frames'.

2. A list of distinct verb stems based on their surface form is compiled.

3. For each distinct verb stem in the verb stem list, we take all its instances in the treebank, along with the corresponding verb instance frames extracted in step 1. Two instance frames for a particular verb stem can differ from one another in one of the following ways:

• The same karakas are realized in different instance frames by NPs marked with different post-positions

• Some karakas which are present in some instance frames are absent in the others

• The same karakas are realized in the differing instance frames by different categories of chunks

• Two instance frames differ in most or all of the karakas
Case 1 reflects the well-known phenomenon of vibhakti or post-position alternation for a verb triggered by a difference in the TAM marker (section 3.3.). Case 2 relates to the distinction between mandatory and non-mandatory dependents of a predicate. Case 3 corresponds to differences in the category type of an argument for the same verb. Case 4 can be attributed solely to verb ambiguity and is the most difficult problem to address.

Language                        Hindi   Telugu
Sentences                        2403     1226
Verb Types                       1238      391
Verb Tokens                      5051     1616
Tokens per Type                  4.07     4.13
Verbs with Single Instances       799      199
Verbs with Multiple Instances     439      192
Complex Predicates                934      122
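As a concrete illustration of steps 1 and 3 above and of the feature-vector construction of section 5, the Python sketch below extracts lexicalized karaka frames from toy dependency trees and compares two verbs by cosine similarity. The tree encoding, the field names and the toy data (including the second verb 'pI') are hypothetical assumptions made for the sketch; the [category, relation, post-position] frame format follows the description above, and the 'KA' frames mirror the Figure 3 example.

import math
from collections import Counter

# Toy dependency node: chunk category, relation to head, post-position, children.
# This encoding is an assumption for the sketch; treebank formats differ.
def node(cat, rel, psp, children=()):
    return {"cat": cat, "rel": rel, "psp": psp, "children": list(children)}

def karaka_frames(verb_node):
    """One lexicalized frame per child of the verb node: (category, relation, post-position)."""
    return [(c["cat"], c["rel"], c["psp"]) for c in verb_node["children"]]

# Two toy instances of 'KA' (to eat), the first matching Figure 3's
# frames [NP k1 ne] and [NP k2 0]; 'pI' is a hypothetical second verb.
instances = {
    "KA": [
        node("VGF", "root", "yA", [node("NP", "k1", "ne"), node("NP", "k2", "0")]),
        node("VGF", "root", "tA_hE", [node("NP", "k1", "0"), node("NP", "k2", "0")]),
    ],
    "pI": [
        node("VGF", "root", "yA", [node("NP", "k1", "ne"), node("NP", "k2", "0")]),
    ],
}

def feature_vector(verb):
    """Tabulate frame frequencies over all instances, normalized by instance count."""
    counts = Counter(f for inst in instances[verb] for f in karaka_frames(inst))
    n = len(instances[verb])
    return {frame: c / n for frame, c in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

print(cosine(feature_vector("KA"), feature_vector("pI")))  # ~0.866 on this toy data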
smaller in Telugu. A total of 18 transformation (TAM) frames were extracted from the Telugu treebank.

While inferring transformation (TAM) frames for Hindi, in the case of verbs with a single instance, annotated instances of similar verbs were also used for inference. In Table 2, we list 9 high-frequency verbs in Hindi extracted from the treebank. For each of these verbs, we list the 3 most similar verbs obtained after applying our method for estimating syntactic similarity. We also provide the cosine similarity value for each pair. It is interesting to note that the similarity that exists among the verbs in each set listed in Table 2 is of different types. The sets of similar verbs corresponding to the first three verbs ('to say', 'to tell' and 'to give') in the table reflect a semantic similarity and can be assigned a common underlying semantic property. It is interesting that the similarity of syntactic context in the case of these verbs correlates with a semantic similarity. This is on the same lines as Levin's proposal for verb classification (Levin, 1993). However, not all sets of verbs exhibit this level of similarity. The verb sets corresponding to the last four verbs in Table 2 ('to come', 'to make', 'to go' and 'to see') are similar only with respect to their surface syntactic properties. There were also erroneous sets, such as the set for the Hindi verb 'to take'.

8. Detecting Annotation Errors

The treebanks that we worked with are relatively recent developments and their validation process is still at a very early stage. The grammar extraction process that we follow can help semi-automate this process of treebank validation. The statistical information associated with each of the extracted karaka frames for a particular verb is helpful for detecting treebanking errors. Karaka frames with very low frequency are identified as containing one of the following four main types of annotation errors:

• POS-tagging errors, where a word is marked with the wrong POS tag. These errors are detected at various stages during the extraction.

• Chunking errors, where a chunk is marked with the wrong chunk tag.

• Errors during manual morphological annotation or automatic analysis: errors of this type include errors in the identification of verb stems, TAM labels and vibhaktis (case endings or post-positions) of nouns.

• Argument structure annotation errors: errors of this type are the most crucial and are difficult to detect during manual validation as the error frequency is very low.

As an example, for the intransitive verb 'jA' (to go), we show how argument structure annotation errors can be identified from the extracted frames:

jA
k1   NP        0    0.347826
k1   NULL__NP  0    0.065217
------------------------------------
k1s  JJP       0    0.021739
k2   NP        ko   0.065217

In the above example, the relative frequency of k1s and k2 is below the threshold for error identification, whereas the total relative frequency of k1 is above that threshold. Thus, in this case, k1s and k2 are identified as annotation errors.
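A minimal sketch of this thresholding in Python, assuming frames are stored as (relation, category, vibhakti) tuples with their relative frequencies; the threshold value is ours for illustration, since the paper does not state the value used:

# Relative frequencies of extracted frames for 'jA', from the listing above.
frames_jA = {
    ("k1", "NP", "0"): 0.347826,
    ("k1", "NULL__NP", "0"): 0.065217,
    ("k1s", "JJP", "0"): 0.021739,
    ("k2", "NP", "ko"): 0.065217,
}

THRESHOLD = 0.10  # illustrative value only

def suspect_frames(frames, threshold=THRESHOLD):
    """Flag frames whose relation's total relative frequency falls below the threshold."""
    totals = {}
    for (rel, _cat, _psp), freq in frames.items():
        totals[rel] = totals.get(rel, 0.0) + freq  # aggregate per karaka relation
    return [f for f in frames if totals[f[0]] < threshold]

print(suspect_frames(frames_jA))
# [('k1s', 'JJP', '0'), ('k2', 'NP', 'ko')] -- matching the errors identified above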
In Table 3, we show the frequencies of these four types of errors over all the extracted instances of a sample of 25 randomly selected verbs.

Type of error         Frequency
POS-tagging                   0
Chunking                      1
Morphological                 3
Argument Structure           22

Table 3: Error statistics

The table shows that the number of argument structure annotation errors is much higher than that of the other types of errors. This is not surprising given the complexity of the dependency annotation task as compared to tasks such as POS-tagging, chunking and morphological analysis. Further, the argument structure errors that were discovered in the course of grammar extraction were cases of genuine confusability even for trained annotators. This shows that such instances can be incorporated into the annotation guidelines to reduce annotation errors in the future.

Apart from the above, we also discovered a small percentage (0.3%) of sentences in which no verb was found during the extraction process. The number of errors was much larger in the case of the Telugu treebank, as it has not yet been subjected to any kind of validation.

9. Conclusions and Future Work

We present a system that can extract grammars in the CPG formalism from dependency treebanks for Hindi and Telugu, and we discuss the various issues involved in the extraction process.

In order to address the issue of data sparseness, we explore a generalization approach based on the syntactic similarity of verbs. We define the notion of syntactic similarity of verbs in a dependency representation using the karaka frame representation. The definition is relevant for dependency representations in any formalism. Applying this syntactic similarity measure to the verbs extracted from the treebanks, we obtain pair-wise similarities over the entire set of verbs. Using this similarity database, the verbs can be clustered and an unsupervised verb classification can be obtained. The resulting verb clusters can be compared against earlier work on the theoretical classification of Hindi verbs into verb classes (Begum et al., 2008b), which is one of our immediate future goals.

We also show how statistical information obtained during the extraction process can enable the detection of different kinds of annotation errors. A detailed study of how to incorporate the information provided by the grammar extraction process into a treebank validation system is also part of our future work.

The system that we present in this paper is still under development.
10. References

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics, pages 86–90, Montreal, Quebec, Canada. Association for Computational Linguistics.

R. Begum, S. Husain, A. Dhwaj, D.M. Sharma, L. Bai, and R. Sangal. 2008a. Dependency annotation scheme for Indian languages. In Proceedings of the International Joint Conference on Natural Language Processing.

Rafiya Begum, Samar Husain, Lakshmi Bai, and Dipti Misra Sharma. 2008b. Developing Verb Frames for Hindi. In Proceedings of the Sixth International Language Resources and Evaluation Conference (LREC'08), Morocco.

Akshar Bharati, Vineet Chaitanya, and Rajeev Sangal. 1995. Natural Language Processing: A Paninian Perspective. Prentice-Hall of India.

Akshar Bharati, Samar Husain, Bharat Ambati, Sambhav Jain, Dipti Misra Sharma, and Rajeev Sangal. 2008. Two semantic features make all the difference in parsing accuracy. In Proceedings of the 6th International Conference on Natural Language Processing (ICON-08), CDAC Pune, India.

Akshar Bharati, Samar Husain, Dipti Misra Sharma, and Rajeev Sangal. 2009a. Two stage constraint based hybrid approach to free word order language dependency parsing. In Proceedings of the 11th International Conference on Parsing Technologies (IWPT'09), pages 77–80, Paris, France.

Akshar Bharati, Dipti Misra Sharma, Samar Husain, Lakshmi Bai, Rafiya Begum, and Rajeev Sangal. 2009b. AnnCorra: Treebanks for Indian Languages, Guidelines for Annotating Hindi Treebank.

Rajesh Bhatt, Bhuvana Narasimhan, Martha Palmer, Owen Rambow, Dipti Sharma, and Fei Xia. 2009. A multi-representational and multi-layered treebank for Hindi/Urdu. In Proceedings of the Third Linguistic Annotation Workshop, held in conjunction with ACL-IJCNLP, Singapore.

D. Gildea and D. Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.

A. Gordon and R. Swanson. 2007. Generalizing semantic role annotations across syntactically similar verbs. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, page 192.

E. Hajicova. 1998. Prague Dependency Treebank: From Analytic to Tectogrammatical Annotation. In Proceedings of TSD'98.

Beth Levin. 1993. English Verb Classes and Alternations. The University of Chicago Press.

M.P. Marcus, B. Santorini, and M.A. Marcinkiewicz. 1994. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

P. Merlo and S. Stevenson. 2001. Automatic Verb Classification based on Statistical Distribution of Argument Structure. Computational Linguistics, 27(3):373–408.

M. Palmer, O. Rambow, R. Bhatt, D.M. Sharma, B. Narasimhan, and F. Xia. 2009. Hindi Syntax: Annotating Dependency, Lexical Predicate-Argument Structure, and Phrase Structure. In Proceedings of ICON-2009: 7th International Conference on Natural Language Processing, Hyderabad.

Fei Xia, Martha Palmer, and Aravind Joshi. 2000. A uniform method of grammar extraction and its applications. In Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 53–62, Hong Kong.

Fei Xia. 2001. Automatic Grammar Generation from Two Different Perspectives. Ph.D. thesis, University of Pennsylvania.