Grammar Extraction From Treebanks For Hindi and Telugu
is being used for parsing (Bharati et al., 2009a)

3. To generalize verb argument structure information over the extracted verb frames to address sparsity in the annotated corpora

4. To aid in the validation of treebanks by detecting different types of annotation errors using the extracted grammars

3. Related Work

In this section we briefly survey some of the work on grammar extraction and on generalization using syntactic similarity. We also mention a few details about the two Indian language treebanks that we used. Syntactic alternation can be an important criterion while generalizing verbs, so we briefly discuss how syntactic alternation in Hindi differs from English.

3.1. Grammar Extraction

The role of grammars in NLP is more extensive than is generally supposed. Xia (2000) points out that the task of treebanking for a language bears much similarity to the task of manually crafting a grammar. The treebank of a language contains an implicit grammar for that language, and statistical NLP systems trained over a treebank make use of this implicit grammar. This is why grammar-driven approaches and data-driven or statistical approaches are not necessarily mutually exclusive. It is well known that the traditional approach of manually crafting a high-quality, large-coverage grammar takes tremendous human effort to build and maintain. In addition, the traditional approach does not provide for flexibility, consistency and generalization. To address these limitations of the traditional approach to grammar development, Xia (2001) presents two alternative approaches that generate grammars automatically, one from descriptions (LexOrg) and the other from treebanks (LexTract).

The LexTract system extracts explicit grammars in the TAG formalism from a treebank. It is not, however, limited to the TAG formalism, as it can also extract CFGs from a treebank. Large-scale treebanks such as the English Penn Treebank (PTB) are not based on existing grammars; instead, they were manually annotated following annotation guidelines. Since the process of creating annotation guidelines is similar to the process of building a grammar by hand, it can be assumed that an implicit grammar, hidden in the annotation guidelines, generates the structures in the treebank. This implicit grammar can be called a treebank grammar. As suggested by Xia, the task of grammar extraction using LexTract can be seen as the task of converting this implicit treebank grammar to an explicit TAG grammar. LexTract builds an LTAG grammar in two stages. First, it converts the annotated phrase structure trees in the PTB into LTAG derived trees. In the second stage, it decomposes these derived trees into a set of elementary trees which form the basic units of an LTAG grammar. It also extracts derivation trees which provide information about the order of operations necessary to build the corresponding derived trees. In her work, Xia has demonstrated the process for treebanks of three languages: English, Chinese and Korean. She also showed that grammars extracted using LexTract have several applications. They can be used as stand-alone grammars for languages that do not have existing grammars. They can be used to enhance the coverage of already existing grammars. They can be used to compare grammars of different languages. The derivation trees extracted using LexTract can be used to train statistical parsers and taggers. LexTract can also help detect certain kinds of annotation errors and thereby semi-automate the process of treebank validation. A major advantage of the LexTract approach to grammar development is that it can provide valuable statistical information in the form of weights associated with primitive elements.

The work we present in this paper is on the same lines as the LexTract approach to grammar development, but on a much smaller scale. It is meant to be the first step towards building a LexTract-like system for extracting CPG grammars for Indian languages. Since we worked with dependency treebanks of Hindi and Telugu, we chose a dependency grammar formalism known as Computational Paninian Grammar (CPG). In fact, the annotation guidelines followed to annotate the treebanks are based on this grammar (Bharati et al., 2009b). As such, the grammar extraction process is much more straightforward than the one in LexTract. In the next section, we give a brief outline of the CPG formalism, where we define the basic terminology and briefly discuss the components of a CPG grammar.

3.2. Generalization Based on Syntactic Similarity

The problem of sparse data in PropBank has been previously addressed using syntactic similarity based generalization of semantic roles across verbs (Gordon and Swanson, 2007). We try to address the data sparseness problem by generalizing over argument structure across syntactically similar verbs to arrive at an automatic verb classification. Gordon and Swanson (2007) define syntactic similarity for phrase structure trees using the notion of a parse tree path (Gildea and Jurafsky, 2002). Gildea and Jurafsky define a parse tree path as 'the path from the target word through the parse tree to the constituent in question, represented as a string of parse tree non-terminals linked by symbols indicating upward and downward movement through the tree'. For example, in 'He ate some pancakes', the path from the target verb 'ate' to the subject 'He' is VB↑VP↑S↓NP. This parse tree path feature is used to represent the syntactic relationships between a predicate and its arguments in a parse tree. The syntactic context of a verb is extracted as the set of all possible parse tree paths from the parse trees of sentences containing that verb. The syntactic context of a verb is then converted into a feature vector representation. The syntactic similarity between two verbs is calculated using different distance measures such as Euclidean distance, the Chi-square statistic, cosine similarity, etc. In our work, we present an analogous measure of syntactic similarity for the dependency structures in the Indian Language (IL) Treebanks, which is described in section 5. We characterize the syntactic context of a verb using a karaka frame representation. The notion of karakas is explained in the next section.
3.3. Syntactic Alternations in Hindi

Syntactic alternations of a verb have been claimed to reflect its underlying semantic properties. Levin's classification of English verbs (Levin, 1993), based on this assumption, demonstrates how the syntactic alternation behavior of a verb can be correlated to its semantic properties, thereby leading to a semantic classification. There have also been several attempts at automatically identifying distinct clusters of verbs that behave similarly using clustering algorithms. These empirically derived clusters were then compared against Levin's classification (Merlo and Stevenson, 2001).

The following are some linguistic aspects of verb alternation behavior that we encountered in Hindi:

• In Hindi, the inchoative-transitive alternation pattern cannot be considered an alternation of the same verb stem. The verb stems in such constructions, although morphologically related, are mostly distinct. This is illustrated in the examples below:

Inchoative:
darawAzA KulA
door-3PSg-Nom open
'The door opened.'

Transitive:
Atifa-ne darawAzA KolA
Atif-3PSg-Erg door-3PSg open
'Atif opened the door.'

• Similarly, the diathesis alternation pattern discussed by Levin is not exhibited by Hindi verbs.

• Since Hindi is a morphologically rich, free word order language, the alternations are not with respect to the position of the constituent, as is the case in English. In Hindi, alternations are with respect to the case endings (or the post-positions) of the nouns, which are called vibhaktis in CPG.

• Post-position or vibhakti alternation is determined by the form that the verb stem takes in a particular construction. In other words, the arguments of a verb are realized using different case endings or vibhaktis based on the tense, aspect and modality (TAM) features of the verb. This is illustrated in the examples below:

abhaya rotI KatA hE
Abhay-Nom-3PSgM bread eat-pres.simp.-3PSgM
'Abhay eats bread.'

abhaya-ne rotI KAyI
Abhay-Erg bread-3PSgF eat-past.simp.-3PSgF
'Abhay ate bread.'

abhaya-ne rotI-ko KAyA
Abhay-Erg bread-Acc eat-past.simp.-default
'Abhay ate bread.'

In the above sentences, the nominal vibhaktis (case endings or post-positions) change according to the TAM and agreement features of the verb. This co-variation of vibhaktis with the verb's inflectional features holds not only for finite verb forms but also for non-finite verb forms. All this information is exploited in the CPG formalism in a systematic way, as discussed in the next section.

3.4. Indian Language Treebanks

In this sub-section, we give a very brief overview of the treebanks used in our work. We worked with treebanks of two Indian languages, Hindi and Telugu. The treebanks for Hindi and Telugu contain 2403 and 1226 sentences respectively. The development of these treebanks is an ongoing effort. The Hindi treebank is part of a multi-level resource development project (Bhatt et al., 2009). Some of the salient features of the annotation process employed in the development of these treebanks are as follows:

• The syntactic structure of sentences is based on the dependency representation scheme.

• Dependency relations in the Hindi treebank are annotated on top of a manually POS-tagged and chunked corpus. In the Telugu treebank, the POS-tagging and chunking were not performed manually.

• Dependency relations are defined between chunk heads.

• The dependency tagset used to annotate dependency relations is based on the CPG formalism, which we discuss in section 4.

4. Computational Paninian Grammar

In this section, we give a brief overview of the Computational Paninian Grammar (CPG) formalism. We only outline details relevant to our goal of grammar extraction. See Bharati et al. (1995) for a detailed discussion of the CPG formalism and the Paninian theory on which it is based. In subsection 4.1, we introduce the basic terminology necessary for an overview of this formalism.

4.1. Terminology

• The notion of karaka relations is central to Paninian Grammar. Karaka relations are syntactico-semantic relations between the verbs and other related constituents in a sentence. Each of the participants in an activity denoted by a verbal root is assigned a distinct karaka. There are six different types of karaka relations in the Paninian grammar, as listed below:

1. k1: karta, participant central to the action denoted by the verb
2. k2: karma, participant central to the result of the action denoted by the verb
3. k3: karana, instrument essential for the action to take place
4. k4: sampradana, beneficiary/recipient of the action
5. k5: apadana, participant which remains stationary (or is the reference point) in an action involving separation/movement
6. k7: adhikarana, real or conceptual space/time (in the tagset used, k7p represents spatial location, k7t represents temporal location and k7/k7v represents conceptual location)

For example, in the following example sentence:

samIrA-ne abhaya-ko phUla diyA
Samira-Erg Abhay-Dat flower-Acc give.past.3PSgM
'Samira gave a flower to Abhay.'

Samira is the karta (k1), the flower is the karma (k2) and Abhay is the sampradana (k4). Similarly, in the following example:

Atifa ne kueM se pAnI nikAlA
Atif-Erg well-Abl water-Acc draw.3PSgM
'Atif drew water from the well.'

Atif is the karta (k1), the well is the apadana (k5) and the water is the karma (k2).

In addition to these karaka relations, there are some additional relations in the Paninian scheme, such as tadarthya (or purpose). The complete tagset can be found at http://ltrc.iiit.ac.in/MachineTrans/research/tb/dep-tagset.pdf.

• The notion of vibhakti relates to the notion of local word groups based on case ending, preposition and post-position markers. For a nominal word group, the vibhakti is the post-position (also known as parsarg) occurring after the noun. Similarly, in the case of a verbal word group, a head verb may be followed by auxiliary verbs which may remain as separate words or may combine with the head verb. This information following the head verb (in other words, the verb stem) is collectively called the vibhakti of the verb. The vibhakti of a verb contains information about the tense, aspect and modality (TAM) and also the agreement features assigned to the verb in a syntactic construction. Therefore, it can also be referred to as the TAM marker of the verb.

In the previous example sentence, the nouns 'Atifa' and 'kuAM' have the vibhaktis '-ne' and '-se' respectively. The vibhakti of the verb 'nikAla' is 'yA', which is also its TAM label.

Nominal vibhaktis have also been found to be important syntactic cues for the identification of semantic roles in the CPG scheme (Bharati et al., 2008).

4.2. Components of CPG: Demand Frames and Transformation Frames

A key aspect of Paninian grammar (CPG) is that the verb group containing a finite verb is the most important word group (equivalent to the notion of a 'head') of a sentence. For other word groups in the sentence dependent on this head, the vibhakti information of the word group is used to map it to an appropriate karaka relation. This karaka-vibhakti mapping depends on the main verb and its TAM label. The mapping is represented by two templates: the default karaka chart (also known as the basic demand frame) and the karaka chart transformation (also known as the transformation frame). The default demand frame defines the mapping for a verb or a class of verbs with respect to a basic reference TAM label. It specifies the karaka relations selected by the verb along with the vibhaktis allowed by the basic TAM label. The basic reference TAM label in CPG is chosen to be 'tA hE', which is equivalent to the Present Indefinite/Simple Present. For any other TAM label of that verb or verb class, a transformation rule is defined that can be applied to the default demand frame to obtain the appropriate karaka-vibhakti mapping for that TAM combination. The transformation rules can affect the default demand frame in three ways, each defined as an operation in CPG:

1. Insert: a new karaka relation is inserted into the demand frame along with its vibhakti mapping

2. Delete: an existing karaka relation is deleted from the default demand frame

3. Update: a karaka-vibhakti mapping entry in the default demand frame is updated by modifying the vibhakti information according to the new TAM label
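To make the three operations concrete, the following Python sketch applies a transformation frame to a default demand frame. The dictionary encoding, the rule format and the function name are our own illustrative assumptions rather than the paper's implementation; the frame contents loosely follow the 'de'/'yA' example discussed next.

import copy

# A demand frame maps each karaka relation selected by a verb to the
# vibhakti allowed under the basic reference TAM label 'tA hE'.
# Contents are illustrative, loosely following the ditransitive 'de' (to give).
default_frame = {
    "k1": "0",    # karta: unmarked under the basic TAM label
    "k2": "0",    # karma: unmarked
    "k4": "ko",   # sampradana: marked with 'ko'
}

# A transformation frame for a TAM label, encoded here as a list of
# (operation, karaka, vibhakti) edits; this rule format is an assumption.
transform_yA = [
    ("update", "k1", "ne"),   # perfective 'yA' puts ergative 'ne' on the karta
]

def apply_transformation(frame, rules):
    """Apply Insert/Delete/Update operations to a copy of a demand frame."""
    out = copy.deepcopy(frame)
    for op, karaka, vibhakti in rules:
        if op == "insert":
            out[karaka] = vibhakti      # add a new karaka-vibhakti entry
        elif op == "delete":
            out.pop(karaka, None)       # drop an existing karaka
        elif op == "update":
            out[karaka] = vibhakti      # modify the vibhakti mapping
    return out

print(apply_transformation(default_frame, transform_yA))
# {'k1': 'ne', 'k2': '0', 'k4': 'ko'}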
The default demand frame and a transformation frame for the ditransitive verb 'de' (to give) and the TAM label 'yA' are shown below as an example.

Figure 1: Basic demand frame for the verb 'de' (to give)

5. Syntactic Similarity for Dependency Structures

A key concept in much of the previous work on semantic role labeling is the notion of a parse tree path (Gildea and Jurafsky, 2002). The parse tree path representation is based on PTB-style phrase structure (PS) trees. Gordon and Swanson (2007) define syntactic similarity between words using the parse tree path representations of their syntactic contexts.

In the case of dependency parse structures, however, the notion of a parse tree path is not required. This is because dependency structures are based on the idea that the syntactic structure of a sentence consists of binary asymmetrical modifier-modified relations between the words of that sentence. Therefore, in a dependency structure, the information about the syntactic relationship between a predicate and its arguments is trivially represented as a parent-child relationship between a head (modified) and its dependents (modifiers). For any predicate in a dependency structure, an argument frame representation, which contains information about the categorial type, the semantic role (in the case of labeled dependencies) and other kinds of information about each of the arguments, can be extracted readily from the tree. Such an argument frame representation in a dependency structure is equivalent to the parse tree path representation for PS trees. We refer to these argument frame representations of predicates extracted from a dependency parse tree as karaka frame representations. Figure 3 shows the dependency structure of a sentence containing the verb 'KA' (to eat).

Figure 3: Dependency structure containing 'KA' (to eat)

The karaka frames extracted for the verb 'KA' in this parse tree are [NP k1 ne] and [NP k2 0]. These extracted karaka frames characterize the syntactic context of the verb 'KA' (to eat) for this sentence. The set of all karaka frames extracted from each sentence containing a given verb characterizes the possible syntactic contexts of that verb. The next step is to represent this syntactic context as a feature vector. This is done simply by tabulating the frequencies of each distinct karaka frame into a feature vector representation. The resulting feature vectors are normalized by dividing the frequencies by the number of instances of that particular verb stem. The syntactic similarity between two verbs is calculated as the distance between their feature vector representations using a variety of distance metrics such as Euclidean distance, Manhattan distance and cosine similarity.

6. Extracting a CPG Grammar from the Treebank

In this section, we describe the various steps through which CPG grammars for Hindi and Telugu were extracted from the treebanks. An illustrative code sketch of these steps and of the similarity computation of section 5 is given after the discussion of the four cases below.

1. The verb nodes are identified from each sentence in the treebank. For each verb node in the dependency structure of a sentence, the subtree rooted at that node is extracted. Karaka frames for this subtree are extracted as shown in the previous section. A karaka frame corresponding to each child of the verb node is obtained. An extracted karaka frame contains the following information about the dependent (modifier): category type, relation type and post-position. Information about the verb stem is also included in the frame. In other words, the karaka frames extracted for a verb node are lexicalized. At the end of this step, we have a set of karaka frames corresponding to each verb instance in the treebank. We call these frames 'verb instance frames'.

2. A list of distinct verb stems based on their surface form is compiled.

3. For each distinct verb stem in the verb stem list, we take all its instances in the treebank, along with the corresponding verb instance frames extracted in step 1. Two instance frames for a particular verb stem can differ from one another in one of the following ways:

• The same karakas are realized in different instance frames by NPs marked with different post-positions

• Some karakas which are present in some instance frames are absent in the others

• The same karakas are realized in the differing instance frames by different categories of chunks

• Two instance frames differ in most or all of the karakas
Case 1 reflects the well-known phenomenon of vibhakti or post-position alternation for a verb triggered by a difference in the TAM marker (section 3.3.). Case 2 relates to the distinction between mandatory and non-mandatory dependents of a predicate. Case 3 corresponds to differences in the category type of an argument for the same verb. Case 4 can be attributed solely to verb ambiguity and is the most difficult problem to address.

Language                        Hindi   Telugu
Sentences                        2403     1226
Verb Types                       1238      391
Verb Tokens                      5051     1616
Tokens per Type                  4.07     4.13
Verbs with Single Instances       799      199
Verbs with Multiple Instances     439      192
Complex Predicates                934      122
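As a concrete illustration of steps 1 and 3 above and of the feature-vector construction of section 5, the Python sketch below extracts lexicalized karaka frames from toy dependency trees and compares two verbs by cosine similarity. The tree encoding, the field names and the toy data (including the second verb 'pI') are hypothetical assumptions made for the sketch; the [category, relation, post-position] frame format follows the description above, and the 'KA' frames mirror the Figure 3 example.

import math
from collections import Counter

# Toy dependency node: chunk category, relation to head, post-position, children.
# This encoding is an assumption for the sketch; treebank formats differ.
def node(cat, rel, psp, children=()):
    return {"cat": cat, "rel": rel, "psp": psp, "children": list(children)}

def karaka_frames(verb_node):
    """One lexicalized frame per child of the verb node: (category, relation, post-position)."""
    return [(c["cat"], c["rel"], c["psp"]) for c in verb_node["children"]]

# Two toy instances of 'KA' (to eat), the first matching Figure 3's
# frames [NP k1 ne] and [NP k2 0]; 'pI' is a hypothetical second verb.
instances = {
    "KA": [
        node("VGF", "root", "yA", [node("NP", "k1", "ne"), node("NP", "k2", "0")]),
        node("VGF", "root", "tA_hE", [node("NP", "k1", "0"), node("NP", "k2", "0")]),
    ],
    "pI": [
        node("VGF", "root", "yA", [node("NP", "k1", "ne"), node("NP", "k2", "0")]),
    ],
}

def feature_vector(verb):
    """Tabulate frame frequencies over all instances, normalized by instance count."""
    counts = Counter(f for inst in instances[verb] for f in karaka_frames(inst))
    n = len(instances[verb])
    return {frame: c / n for frame, c in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

print(cosine(feature_vector("KA"), feature_vector("pI")))  # ~0.866 on this toy data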
smaller in Telugu. A total of 18 transformation (TAM) frames were extracted from the Telugu treebank.

While inferring transformation (TAM) frames for Hindi, in the case of verbs with a single instance, annotated instances of similar verbs were also used for inference. In Table 2, we list 9 high-frequency verbs in Hindi extracted from the treebank. For each of these verbs, we list the 3 most similar verbs obtained after applying our method for estimating syntactic similarity. We also provide the cosine similarity value for each pair. It is interesting to note that the similarity that exists among the verbs in each set listed in Table 2 is of different types. The sets of similar verbs corresponding to the first three verbs ('to say', 'to tell' and 'to give') in the table reflect a semantic similarity and can be assigned a common underlying semantic property. It is interesting that the similarity of syntactic context in the case of these verbs correlates with a semantic similarity. This is on the same lines as Levin's proposal for verb classification (Levin, 1993). However, not all sets of verbs exhibit this level of similarity. The verb sets corresponding to the last four verbs in Table 2 ('to come', 'to make', 'to go' and 'to see') are similar only with respect to their surface syntactic properties. There were also erroneous sets, such as the set for the Hindi verb 'to take'.

8. Detecting Annotation Errors

The treebanks that we worked with are relatively recent developments and their validation process is still at a very early stage. The grammar extraction process that we follow can help semi-automate this process of treebank validation. The statistical information associated with each of the extracted karaka frames for a particular verb is helpful for detecting treebanking errors. Karaka frames with very low frequency are identified as containing one of the following four main types of annotation errors:

• POS-tagging errors, where a word is marked with the wrong POS tag. These errors are detected at various stages during the extraction.

• Chunking errors, where a chunk is marked with the wrong chunk tag.

• Errors during manual morphological annotation or automatic analysis: errors of this type include errors in the identification of verb stems, TAM labels and vibhaktis (case endings or post-positions) of nouns.

• Argument structure annotation errors: errors of this type are the most crucial and are difficult to detect during manual validation as the error frequency is very low.

As an example, for the intransitive verb 'jA' (to go), we show how argument structure annotation errors can be identified from the extracted frames:

jA
k1   NP        0    0.347826
k1   NULL__NP  0    0.065217
------------------------------------
k1s  JJP       0    0.021739
k2   NP        ko   0.065217

In the above example, the relative frequency of k1s and k2 is below the threshold for error identification, whereas the total relative frequency of k1 is above that threshold. Thus, in this case, k1s and k2 are identified as annotation errors.
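A minimal sketch of this thresholding in Python, assuming frames are stored as (relation, category, vibhakti) tuples with their relative frequencies; the threshold value is ours for illustration, since the paper does not state the value used:

# Relative frequencies of extracted frames for 'jA', from the listing above.
frames_jA = {
    ("k1", "NP", "0"): 0.347826,
    ("k1", "NULL__NP", "0"): 0.065217,
    ("k1s", "JJP", "0"): 0.021739,
    ("k2", "NP", "ko"): 0.065217,
}

THRESHOLD = 0.10  # illustrative value only

def suspect_frames(frames, threshold=THRESHOLD):
    """Flag frames whose relation's total relative frequency falls below the threshold."""
    totals = {}
    for (rel, _cat, _psp), freq in frames.items():
        totals[rel] = totals.get(rel, 0.0) + freq  # aggregate per karaka relation
    return [f for f in frames if totals[f[0]] < threshold]

print(suspect_frames(frames_jA))
# [('k1s', 'JJP', '0'), ('k2', 'NP', 'ko')] -- matching the errors identified above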
In Table 3, we show the frequencies of these four types of errors over all the extracted instances of a sample of 25 randomly selected verbs.

Type of error         Frequency
POS-tagging                   0
Chunking                      1
Morphological                 3
Argument Structure           22

Table 3: Error statistics

The table shows that the number of argument structure annotation errors is much higher than that of the other types of errors. This is not surprising given the complexity of the dependency annotation task as compared to tasks such as POS-tagging, chunking and morphological analysis. Further, the argument structure errors that were discovered in the course of grammar extraction were cases of genuine confusability even for trained annotators. This shows that such instances can be incorporated into the annotation guidelines to reduce annotation errors in the future.

Apart from the above, we also discovered a small percentage (0.3%) of sentences in which no verb was found during the extraction process. The number of errors was much larger in the case of the Telugu treebank, as it has not yet been subjected to any kind of validation.

9. Conclusions and Future Work

We present a system that can extract grammars in the CPG formalism from dependency treebanks for Hindi and Telugu, and we discuss the various issues involved in the extraction process.

In order to address the issue of data sparseness, we explore a generalization approach based on the syntactic similarity of verbs. We define the notion of syntactic similarity of verbs in a dependency representation using the karaka frame representation. The definition is relevant for dependency representations in any formalism. Applying this syntactic similarity measure to the verbs extracted from the treebanks, we obtain pair-wise similarities over the entire set of verbs. Using this similarity database, the verbs can be clustered and an unsupervised verb classification can be obtained. The resulting verb clusters can be compared against earlier work on the theoretical classification of Hindi verbs into verb classes (Begum et al., 2008b), which is one of our immediate future goals.

We also show how statistical information obtained during the extraction process can enable the detection of different kinds of annotation errors. A detailed study of how to incorporate the information provided by the grammar extraction process into a treebank validation system is also part of our future work.

The system that we present in this paper is still under development.
10. References

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics, pages 86–90, Montreal, Quebec, Canada. Association for Computational Linguistics.

R. Begum, S. Husain, A. Dhwaj, D.M. Sharma, L. Bai, and R. Sangal. 2008a. Dependency annotation scheme for Indian languages. In Proceedings of the International Joint Conference on Natural Language Processing.

Rafiya Begum, Samar Husain, Lakshmi Bai, and Dipti Misra Sharma. 2008b. Developing Verb Frames for Hindi. In Proceedings of the Sixth International Language Resources and Evaluation Conference (LREC'08), Morocco.

Akshar Bharati, Vineet Chaitanya, and Rajeev Sangal. 1995. Natural Language Processing: A Paninian Perspective. Prentice-Hall of India.

Akshar Bharati, Samar Husain, Bharat Ambati, Sambhav Jain, Dipti Misra Sharma, and Rajeev Sangal. 2008. Two semantic features make all the difference in parsing accuracy. In Proceedings of the 6th International Conference on Natural Language Processing (ICON-08), CDAC Pune, India.

Akshar Bharati, Samar Husain, Dipti Misra Sharma, and Rajeev Sangal. 2009a. Two stage constraint based hybrid approach to free word order language dependency parsing. In Proceedings of the 11th International Conference on Parsing Technologies (IWPT'09), pages 77–80, Paris, France.

Akshar Bharati, Dipti Misra Sharma, Samar Husain, Lakshmi Bai, Rafiya Begum, and Rajeev Sangal. 2009b. AnnCorra: Treebanks for Indian Languages, Guidelines for Annotating Hindi Treebank.

Rajesh Bhatt, Bhuvana Narasimhan, Martha Palmer, Owen Rambow, Dipti Sharma, and Fei Xia. 2009. A multi-representational and multi-layered treebank for Hindi/Urdu. In Proceedings of the Third Linguistic Annotation Workshop, held in conjunction with ACL-IJCNLP, Singapore.

D. Gildea and D. Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.

A. Gordon and R. Swanson. 2007. Generalizing semantic role annotations across syntactically similar verbs. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, page 192.

E. Hajicova. 1998. Prague Dependency Treebank: From Analytic to Tectogrammatical Annotation. In Proceedings of TSD'98.

Beth Levin. 1993. English Verb Classes and Alternations. The University of Chicago Press.

M.P. Marcus, B. Santorini, and M.A. Marcinkiewicz. 1994. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

P. Merlo and S. Stevenson. 2001. Automatic Verb Classification based on Statistical Distribution of Argument Structure. Computational Linguistics, 27(3):373–408.

M. Palmer, O. Rambow, R. Bhatt, D.M. Sharma, B. Narasimhan, and F. Xia. 2009. Hindi Syntax: Annotating Dependency, Lexical Predicate-Argument Structure, and Phrase Structure. In Proceedings of ICON-2009: 7th International Conference on Natural Language Processing, Hyderabad.

Fei Xia, Martha Palmer, and Aravind Joshi. 2000. A uniform method of grammar extraction and its applications. In Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 53–62, Hong Kong.

Fei Xia. 2001. Automatic Grammar Generation from Two Different Perspectives. Ph.D. thesis, University of Pennsylvania.