13. Treebanks
1. Introduction
2. Treebank design
3. Treebank development
4. Treebank usage
5. Conclusion
6. Literature
1. Introduction
A treebank can be defined as a linguistically annotated corpus that includes some grammatical analysis beyond the part-of-speech level. The term treebank appears to have
been coined by Geoffrey Leech (Sampson 2003) and obviously alludes to the fact that
the most common way of representing the grammatical analysis is by means of a tree
structure. However, in current usage, the term is in no way restricted to corpora containing tree-shaped representations, but applies to all kinds of grammatically analyzed corpora.
It is customary to restrict the application of the term treebank to corpora where the
grammatical analysis is the result of manual annotation or post-editing. This is in contrast to the term parsed corpus, which is more often used for automatically analyzed
corpora, whether the analysis has been manually corrected or not. This is also the usage
that will be adopted here, although it is worth pointing out that the two terms are
sometimes used interchangeably in the literature (cf. Abeillé 2003b).
Treebanks have been around in some shape or form at least since the 1970s. One of
the earliest efforts to produce a syntactically annotated corpus was performed by Ulf
Teleman and colleagues at Lund University, resulting in close to 300,000 words of both
written and spoken Swedish, manually annotated with both phrase structure and grammatical functions, an impressive achievement at the time but unfortunately documented
only in Swedish (cf. Teleman 1974; Nivre 2002). However, it is only in the last ten to
fifteen years that treebanks have appeared on a large scale for a wide range of languages,
mostly developed using a combination of automatic processing and manual annotation
or post-editing. In this article, we will not attempt to give a comprehensive inventory of
available treebanks but focus on theoretical and methodological issues, referring to specific treebanks only to exemplify the points made. A fairly representative overview of
available treebanks for a number of languages can be found in Abeillé (2003a), together
with a discussion of certain methodological issues. In addition, proceedings from the
annual workshops on Treebanks and Linguistic Theories (TLT) contain many useful references (Hinrichs/Simov 2002; Nivre/Hinrichs 2003; Kübler et al. 2004). Cf. also article 20
for some of the more well-known and influential treebanks.
The rest of this article is structured as follows. We begin, in section 2, by discussing
design issues for treebanks, in particular the choice of annotation scheme. We move on,
in section 3, to the development of treebanks, discussing the division of labor between
manual and automatic analysis, as well as tools to be used in the development process. Section 4 deals with the use of treebanks in linguistic research and natural language processing, and section 5 concludes the article.
2. Treebank design
Ideally, the design of a treebank should be motivated by its intended usage, whether
linguistic research or language technology development (cf. section 4 below), in the same
way that any software design should be informed by a requirements analysis (cf. article
9 on design strategies). However, in actual practice, there are a number of other factors
that influence the design, such as the availability of data and analysis tools. Moreover,
given that the development of a treebank is a very labor-intensive task, there is usually
also a desire to design the treebank in such a way that it can serve several purposes
simultaneously. Thus, as observed by Abeillé (2003b), the majority of large treebank
projects have emerged as the result of a convergence between computational linguistics
and corpus linguistics, with only partly overlapping goals. It is still a matter of ongoing
debate to what extent it is possible to cater for different needs without compromising
the usefulness for each individual use, and different design choices can to some extent
be seen to represent different standpoints in this debate. We will return to this problem
in relation to annotation schemes in section 2.2. But first we will consider the choice of
corpus material.
on a subset of the Brown Corpus of American English (Kučera/Francis 1967), which is
a typical balanced corpus. By and large, however, the majority of available treebanks
for written language are based on contemporary newspaper text, which has the practical
advantage of being relatively easily accessible. An important case in point is the Wall
Street Journal section of the Penn Treebank (Marcus et al. 1993), which has been very
influential as a model for treebanks across a wide range of languages.
Although most treebanks developed so far have been based on more or less contemporary data from a single language, there are also exceptions to this pattern. On the one
hand, there are historical treebanks, based on data from earlier periods in the development of a language, such as the Penn-Helsinki Parsed Corpus of Middle English (Kroch/
Taylor 2000) and the Partially Parsed Corpus of Medieval Portuguese (Rocio et al. 2003).
On the other hand, there are parallel treebanks based on texts in one language and their
translations in other languages. The Prague Czech-English Dependency Treebank has been developed for the specific purpose of machine translation at Charles University in Prague (Čmejrek et al. 2004), and several other projects are emerging in this area (cf. Cyrus et al. 2003;
Volk/Samuelsson 2004).
Finally, we have to consider the issue of corpus size. Despite recent advances in automating the annotation process, linguistic annotation is still a very labor-intensive activity.
Consequently, there is an inevitable tradeoff in corpus design between the amount of
data that can be included and the amount of annotation that can be applied to the data.
Depending on the intended usage, it may be preferable to build a smaller treebank with
a more detailed annotation, such as the SUSANNE corpus (Sampson 1995), or a larger
treebank with a less detailed annotation, such as the original bracketed version of the
Penn Treebank (Marcus et al. 1993). Because the annotation of grammatical structure is
even more expensive than annotation at lower levels, treebanks in general tend to be
one or two orders of magnitude smaller than corresponding corpora without syntactic
annotation. Thus, whereas an ordinary corpus of one million running words is not considered very big today, there are only a few treebanks that reach this size, and most of
them are considerably smaller.
Fig. 13.1: Word-level annotation (part-of-speech tags and lemmas) of the sentence "The grand jury took a swipe at the State Welfare Department's handling of federal funds granted for child welfare services in foster homes."
In the following, we will not discuss word-level annotation but concentrate on the annotation of syntactic (and to some extent semantic) structure, since this is what distinguishes treebanks from other annotated corpora. Moreover, word-level annotation tends
to be rather similar across different treebank annotation schemes.
The choice of annotation scheme for a large-scale treebank is influenced by many
different factors. One of the most central considerations is the relation to linguistic
theory. Should the annotation scheme be theory-specific or theory-neutral? If the first
of these options is chosen, which theoretical framework should be adopted? If we opt
for the second, how do we achieve broad consensus, given that truly theory-neutral
annotation is impossible? The answers to these questions interact with other factors, in
particular the grammatical characteristics of the language that is being analyzed, and the
tradition of descriptive grammar that exists for this language. In addition, the relation to
annotation schemes used for other languages is relevant, from the point of view of comparative studies or development of parallel treebanks. To this we may add the preferences of different potential user groups, ranging from linguistic researchers and language
technology developers to language teachers and students at various levels of education.
Finally, when embarking on a large-scale treebank project, researchers usually cannot
afford to disregard the resources and tools for automatic and interactive annotation that
exist for different candidate annotation schemes.
The number of treebanks available for different languages is growing steadily and with it the number of different annotation schemes. Broadly speaking, we can distinguish three main kinds of annotation in current practice:
Constituency annotation
Functional annotation
Semantic annotation
In addition, we can distinguish between (more or less) theory-neutral and theory-specific
annotation schemes, a dimension that cuts across the three types of annotation. It should
also be noted that the annotation found in many if not most of the existing treebanks
actually combines two or even all three of these categories. We will treat the categories in
the order in which they are listed above, which also roughly corresponds to the historical
development of treebank annotation schemes.
The annotation of constituent structure, often referred to as bracketing, is the main
kind of annotation found in early large-scale projects such as the Lancaster Parsed Corpus (Garside et al. 1992) and the original Penn Treebank (Marcus et al. 1993). Normally,
this kind of annotation consists of part-of-speech tagging for individual word tokens
and annotation of major phrase structure categories such as NP, VP, etc. Figure 13.2
shows a representative example, taken from the IBM Paris Treebank using a variant of
the Lancaster annotation scheme.
[N Vous_PPSA5MS N]
[V accédez_VINIP5
[P à_PREPA
[N cette_DDEMFS session_NCOFS N]
P]
[Pv à_PREP31 partir_PREP32 de_PREP33
[N la_DARDFS fenêtre_NCOFS
[A Gestionnaire_AJQFS
[P de_PREPD
[N tâches_NCOFP
N]
P]
A]
N]
Pv]
V]
Fig. 13.2: Constituency annotation in the IBM Paris Treebank
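Bracketed annotations of this kind are straightforward to process mechanically. As an illustration, the following Python sketch (an illustrative helper, not taken from any treebank toolkit) reads the Penn Treebank's Lisp-style variant of bracketing into a nested (label, children) structure:

```python
def parse_bracketed(s):
    """Parse a Penn-style bracketed string into a nested (label, children)
    tuple; a preterminal's child is the plain word string."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]          # constituent or part-of-speech label
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())   # nested constituent
            else:
                children.append(tokens[pos])  # terminal word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return parse()

tree = parse_bracketed(
    "(S (NP (DT The) (NN jury)) (VP (VBD took) (NP (DT a) (NN swipe))))"
)
```

The same recursive-descent idea underlies most tools that read Penn Treebank files, although real readers must also handle empty categories and coindexation.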
Fig. 13.3: Dependency annotation in the Prague Dependency Treebank: in the Czech sentence "Kominík vymetá komíny." ('The chimney-sweep sweeps chimneys.'), the subject Kominík (Sb) and the object komíny (Obj) depend on the verb vymetá, and the sentence-final period is labeled AuxK.
Other examples of treebanks based primarily on dependency analysis are the METU Treebank of Turkish (Oflazer et al. 2003), the Danish Dependency Treebank (Kromann
2003), the Eus3LB Corpus of Basque (Aduriz et al. 2003), the Turin University Treebank
of Italian (Bosco/Lombardo 2004), and the parsed corpus of Japanese described in Kurohashi/Nagao (2003).
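Computationally, a dependency analysis like the one in Figure 13.3 is usually stored as a table in which every token points to the index of its head, with 0 reserved for an artificial root. A small Python sketch of this representation (the label Pred for the root verb is an assumption based on common Prague-style practice):

```python
# The PDT example "Kominík vymetá komíny." as a head-indexed token list.
sentence = [
    # (index, form, head, label)
    (1, "Kominík", 2, "Sb"),    # subject, governed by the verb
    (2, "vymetá",  0, "Pred"),  # main verb, attached to the artificial root
    (3, "komíny",  2, "Obj"),   # object, governed by the verb
    (4, ".",       2, "AuxK"),  # sentence-final punctuation
]

def dependents(tokens, head_index):
    """Return the forms of all tokens whose head is the given index."""
    return [form for idx, form, head, label in tokens if head == head_index]
```

For example, `dependents(sentence, 2)` collects everything governed by the verb. This flat encoding is what makes dependency treebanks easy to store line-by-line in plain text files.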
The trend towards more functionally oriented annotation schemes is also reflected in
the extension of constituency-based schemes with annotation of grammatical functions.
Cases in point are SUSANNE (Sampson 1995), which is a development of the Lancaster
annotation scheme mentioned above, and Penn Treebank II (Marcus et al. 1994), which
adds functional tags to the original phrase structure annotation. A combination of constituent structure and grammatical functions along these lines is currently the dominant
paradigm in treebank annotation and exists in many different variations. Adapted versions of the Penn Treebank II scheme are found in the Penn Chinese Treebank (Xue et al.
2004), in the Penn Korean Treebank (Han et al. 2002) and in the Penn Arabic Treebank
(Maamouri/Bies 2004), as well as in a treebank of Spanish (Moreno et al. 2003). A
similar combination of constituency and grammatical functions is also used in the ICE-GB Corpus of British English (Nelson et al. 2002).
A different way of combining constituency and functional annotation is represented
by the TIGER annotation scheme for German (Brants et al. 2002), developed from the
earlier NEGRA scheme, which integrates the annotation of constituency and dependency in a graph where node labels represent phrasal categories while edge labels represent syntactic functions, and which allows crossing branches in order to model discontinuous constituents. Another scheme that combines constituent structure with functional
annotation while allowing discontinuous constituents is the VISL (Visual Interactive
Syntax Learning) scheme, originally developed for pedagogical purposes and applied to
22 languages on a small scale, subsequently used in developing larger treebanks in Portuguese (Afonso et al. 2002) and Danish (Bick 2003). Yet another variation is found in
the Italian Syntactic-Semantic Treebank (Montemagni et al. 2003), which employs two
independent layers of annotation, one for constituent structure, one for dependency
structure.
From functional annotation, it is only a small step to a shallow semantic analysis,
such as the annotation of predicate-argument structure found in the Proposition Bank
(Kingsbury/Palmer 2003). The Proposition Bank is based on the Penn Treebank and
adds a layer of annotation where predicates and their arguments are analyzed in terms
of a frame-based lexicon. The Prague Dependency Treebank, in addition to the surface-oriented dependency structure exemplified in Figure 13.3, also provides a layer of tectogrammatical analysis involving case roles, which can be described as a semantically oriented deep syntactic analysis (cf. Hajičová 1998). The Turin University Treebank also
adds annotation of semantic roles to the dependency-based annotation of grammatical
functions (Bosco/Lombardo 2004), and the Sinica treebank of Chinese uses a combination of constituent structure and functional annotation involving semantic roles (Chen
et al. 2003).
Other examples of semantic annotation are the annotation of word senses in the
Italian Syntactic-Semantic Treebank (Montemagni et al. 2003) and in the Hellenic National Treebank of Greek (Stamou et al. 2003). Discourse semantic phenomena are annotated in the RST Discourse Treebank (Carlson et al. 2002), the German TIGER Treebank (Kunz/Hansen-Schirra 2003), and the Penn Discourse Treebank (Miltsakaki et al.
2004). Despite these examples, semantic annotation has so far played a rather marginal
role in the development of treebanks, but it can be expected to become much more
important in the future.
Regardless of whether the annotation concerns constituent structure, functional
structure or semantic structure, there is a growing interest in annotation schemes that
3. Treebank development
The methods and tools for treebank development have evolved considerably from the
very first treebank projects, where all annotation was done manually, to the present-day
situation, which is characterized by a more or less elaborate combination of manual
work and automatic processing, supported by emerging standards and customized software tools. In section 3.1., we will discuss basic methodological issues in treebank development, including the division of labor between manual work and automatic processing.
In section 3.2., we will then give a brief overview of available tools and standards in the
area. We will focus on the process of syntactic annotation, since this is what distinguishes
treebank development from corpus development in general.
3.1. Methodology
One of the most important considerations in the annotation of a treebank is to ensure
consistency, i. e. to ensure that the same (or similar) linguistic phenomena are annotated
in the same (or similar) ways throughout the corpus, since this is a critical requirement
in many applications of treebanks, be it frequency-based linguistic studies, parser evaluation or induction of grammars (cf. section 4 below). This in turn requires explicit and
well-documented annotation guidelines, which can be used in the training of human
annotators, but which can also serve as a source of information for future users of
the treebank. Besides documenting the general principles of annotation, including the
annotation scheme as described in section 2.2., the guidelines need to contain detailed
examples of linguistic phenomena and their correct annotation. Among linguistic phenomena that are problematic for any annotation scheme, we can mention coordination
structures, discontinuous constituents, and different kinds of multi-word expressions.
The need to have a rich inventory of examples means that the annotation guidelines for a
large treebank project will usually amount to several hundred pages (cf. Sampson 2003).
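Part of this consistency checking can be automated. One idea, in the spirit of the variation-detection approach of Dickinson/Meurers (2003), is to flag word sequences that recur in the corpus with different annotations; the sketch below illustrates the principle on part-of-speech labels (the data format is invented for illustration):

```python
from collections import defaultdict

def find_variation(annotated_sentences, n=2):
    """Flag word n-grams that occur more than once with differing label
    sequences. Each sentence is a list of (word, label) pairs."""
    seen = defaultdict(set)
    for sent in annotated_sentences:
        words = [w for w, _ in sent]
        labels = [t for _, t in sent]
        for i in range(len(sent) - n + 1):
            seen[tuple(words[i:i + n])].add(tuple(labels[i:i + n]))
    # keep only n-grams annotated in more than one way
    return {gram: tags for gram, tags in seen.items() if len(tags) > 1}

corpus = [
    [("a", "DT"), ("swipe", "NN"), ("at", "IN")],
    [("a", "DT"), ("swipe", "NN"), ("at", "RP")],  # inconsistent tag for "at"
]
inconsistent = find_variation(corpus, n=2)
```

Not every flagged variation is an error, of course; some reflects genuine ambiguity, so the output is a worklist for human review rather than an automatic correction.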
Another important methodological issue in treebank development is the division of
labor between automatic annotation performed by computational analyzers and human
annotation or post-editing. Human annotation was the only feasible solution in early
treebank projects, such as Teleman (1974) and Järborg (1986) for Swedish, but has the
drawback of being labor-intensive and therefore expensive for large volumes of data. In
addition, there is the problem of ensuring consistency across annotators if several people
are involved. Fully automatic annotation has the advantage of being both inexpensive
and consistent but currently cannot be used without introducing a considerable proportion of errors, which typically increases with the complexity of the annotation scheme.
Hence, fully automatic annotation is the preferred choice only when the amount of data
to be annotated makes manual annotation or post-editing prohibitively expensive, as in
the 200 million word corpus of the Bank of English (Järvinen 2003). In addition, fully automatic analysis of a larger section of the treebank can be combined with manual post-correction for smaller sections, as in the Danish Arboretum, which contains a "botanical garden" of a few hundred thousand words completely corrected, a "forest" of one million words partially corrected, and a "jungle" of nine million words with automatic analysis only (Bick 2003).
Given the complementary advantages and drawbacks of human and automated annotation, most treebank projects today use a combination of automatic analysis and
manual work in order to make the process as efficient as possible while maintaining the
highest possible accuracy. The traditional way of combining automated and manual
processing is to perform syntactic parsing (complete or partial) followed by human post-editing to correct errors in the parser output. This methodology was used, for example, in the development of the Penn Treebank (Taylor et al. 2003) and the Prague Dependency Treebank (Böhmová et al. 2003). One variation on this theme is to use human
disambiguation instead of human post-correction, i. e. to let the human annotator
choose the correct analysis from a set of possible analyses produced by a nondeterministic parser. This approach is used in the TreeBanker (Carter 1997) and in the development
of the LinGO Redwood Treebanks (Oepen et al. 2002).
Regardless of the exact implementation of this methodology, human post-editing or
parse selection runs the risk of biasing the annotation towards the output of the automatic analyzer, since human editors have a tendency to accept the proposed analysis
even in doubtful cases. The desire to reduce this risk was one of the motivating factors
behind the methodology for interactive corpus annotation developed by Thorsten Brants
and colleagues in the German NEGRA project (Brants et al. 2003), which uses a cascade
their own tools suited to their own needs. To some extent this can be explained by the
fact that different projects use different annotation schemes, motivated by properties of
the particular language analyzed and the purpose of the annotation, and that not all
tools are compatible with all annotation schemes (or software platforms). However, it
probably also reflects the lack of maturity of the field and the absence of a widely accepted standard for the encoding of treebank annotation. While there have been several
initiatives to standardize corpus encoding in general (cf. article 22), these recommendations have either not extended to the level of syntactic annotation or have not gained
widespread acceptance in the field. Instead, there exist several de facto standards set by
the most influential treebank projects, in particular the Penn Treebank, but also the
Prague Dependency Treebank for dependency representations. Another popular encoding standard is TIGER-XML (König/Lezius 2003), originally developed within the German TIGER project, which can be used as a general interchange format although it
imposes certain restrictions on the form of the annotation.
As observed by Ide/Romary (2003), there is a widely recognized need for a general
framework that can accommodate different annotation schemes and facilitate the sharing of resources as well as the development of reusable tools. It is part of the objective
of the ISO/TC 37/SC 4 Committee on Language Resource Management to develop such
a framework, building on the model presented in Ide/Romary (2003), but at the time of
writing there is no definite proposal available.
4. Treebank usage
Empirical linguistic research provided most of the early motivation for developing treebanks, and linguistic research continues to be one of the most important usage areas for
parsed corpora. We discuss linguistic research in section 4.1. below. In recent years,
however, the use of treebanks in natural language processing, including research as well
as technical development, has increased dramatically and has become the primary driving force behind the development of new treebanks. This usage is the topic of section
4.2. (cf. also article 35).
The use of treebanks is not limited to linguistic research and natural language processing, although these have so far been the dominant areas. In particular, there is a
great potential for pedagogical uses of treebanks, both in language teaching and in the
teaching of linguistic theory. A good example is the Visual Interactive Syntax Learning
(VISL) project at the University of Southern Denmark, which has developed teaching
treebanks for 22 languages with a number of different teaching tools including interactive games such as Syntris, based on the well-known computer game Tetris (see
http://visl.edu.dk).
An important methodological issue in treebank evaluation is the way in which performance of a parser is measured relative to a manually annotated treebank sample (a
so-called gold standard). An obvious metric to use is the proportion of sentences where
the parser output completely matches the gold standard annotation (the exact match
criterion). However, it can be argued that this is a relatively crude evaluation metric,
since an error in the analysis of a single word or constituent will have the same impact
on the result as the failure to produce any analysis whatsoever. Consequently, the most
widely used evaluation metrics measure various kinds of partial correspondence between
the parser output and the gold standard parse.
The most well-known evaluation metrics are the PARSEVAL measures (Black et al.
1991), which are based on the number of matching constituents between the parser output and the gold standard, and which have been widely used in parser evaluation using
data from the Penn Treebank. As an alternative to the constituency-based PARSEVAL
measures, several researchers have proposed evaluation schemes based on dependency
relations and argued that these provide a better way of comparing parsers that use
different representations (Lin 1998; Carroll et al. 1998).
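At their core, the PARSEVAL measures reduce to set comparisons over labeled constituent spans. A schematic Python implementation (span extraction from trees is omitted; spans are (label, start, end) triples over word positions):

```python
def parseval(gold_spans, parsed_spans):
    """Labeled bracketing precision, recall and F1 between a gold-standard
    and a parser-produced set of (label, start, end) constituent spans."""
    gold, parsed = set(gold_spans), set(parsed_spans)
    correct = len(gold & parsed)
    precision = correct / len(parsed) if parsed else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: the parser mislabels one constituent (NP 4-6 as PP 4-6).
gold = [("S", 0, 6), ("NP", 0, 2), ("VP", 2, 6), ("NP", 4, 6)]
parsed = [("S", 0, 6), ("NP", 0, 2), ("VP", 2, 6), ("PP", 4, 6)]
p, r, f = parseval(gold, parsed)
```

Because three of the four brackets match, precision and recall are both 0.75 here, whereas the exact match criterion would simply score the sentence as wrong.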
A very successful use of treebanks during the last decade has been the induction of
probabilistic grammars for parsing, with lexicalized probabilistic models like those of
Collins (1999) and Charniak (2000) representing the current state of the art. An even
more radical approach is Data-Oriented Parsing (Bod 1998), which eliminates the traditional notion of grammar completely and uses a probabilistic model defined directly on
the treebank. But there has also been great progress in broad-coverage parsing using so-called deep grammars, where treebanks are mainly used to induce statistical models for
parse selection (see, e. g., Riezler et al. 2002; Toutanova et al. 2002). In fact, one of the
most significant results in research on syntactic parsing during the last decade is arguably
the conclusion that treebanks are indispensable in order to achieve robust broad-coverage parsing, regardless of which basic parsing methodology is assumed.
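In its simplest form, treebank grammar induction (cf. Charniak 1996) just reads context-free rules off the trees and estimates their probabilities by relative frequency. A toy Python sketch, assuming trees encoded as nested (label, children) tuples with plain strings as terminals:

```python
from collections import Counter, defaultdict

def count_rules(tree, counts):
    """Recursively count CFG rules lhs -> rhs from a (label, children) tree."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for c in children:
        if isinstance(c, tuple):
            count_rules(c, counts)

def estimate_pcfg(trees):
    """Maximum-likelihood estimates: P(A -> b) = count(A -> b) / count(A)."""
    counts = Counter()
    for t in trees:
        count_rules(t, counts)
    lhs_totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

trees = [
    ("S", [("NP", ["jury"]), ("VP", [("V", ["took"]), ("NP", ["swipe"])])]),
    ("S", [("NP", ["jury"]), ("VP", [("V", ["slept"])])]),
]
pcfg = estimate_pcfg(trees)
```

On this two-tree corpus, the rule VP -> V NP receives probability 0.5 and NP -> jury receives 2/3. State-of-the-art lexicalized models refine this basic scheme with head annotation and smoothing, but the estimation principle is the same.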
Besides using treebanks to induce grammars or optimize syntactic parsers, it is possible to induce other linguistic resources that are relevant for natural language processing. One important example is the extraction of subcategorization frames (cf. Briscoe/
Carroll 1997). Cf. also articles 35 and 39.
5. Conclusion
Treebanks have already been established as a very valuable resource both in linguistic
research and in natural language processing. In the future, we can expect their usefulness
to increase even more, with improved methods for treebank development and usage,
with more advanced tools built on universal standards, and with new kinds of annotation being added. Treebanks with semantic-pragmatic annotation have only begun to
emerge and will play an important role in the development of natural language understanding. Parallel treebanks, which hardly exist at the moment, will provide an invaluable resource for research on translation as well as the development of better methods
for machine translation. Spoken language treebanks, although already in existence, will
be developed further to increase our understanding of the structure of spoken discourse
and lead to enhanced methods in speech technology.
6. Literature
Abeillé, A. (ed.) (2003a), Treebanks: Building and Using Parsed Corpora. Dordrecht: Kluwer.
Abeillé, A. (2003b), Introduction. In: Abeillé 2003a, xiii–xxvi.
Abeillé, A./Clément, L./Toussenel, F. (2003), Building a Treebank for French. In: Abeillé 2003a, 165–187.
Abney, S. (1991), Parsing by Chunks. In: Berwick, R./Abney, S./Tenny, C. (eds.), Principle-Based Parsing: Computation and Psycholinguistics. Dordrecht: Kluwer, 257–278.
Aduriz, I./Aranzabe, M. J./Arriola, J. M./Atutxa, A./Díaz de Ilarraza, A./Garmendia, A./Oronoz, M. (2003), Construction of a Basque Dependency Treebank. In: Nivre/Hinrichs 2003, 201–204.
Afonso, S./Bick, E./Haber, R./Santos, D. (2002), Floresta Sintá(c)tica: A Treebank for Portuguese. In: Proceedings of the Third International Conference on Language Resources and Evaluation. Las Palmas, Spain, 1698–1703.
Bick, E. (2003), Arboretum, a Hybrid Treebank for Danish. In: Nivre/Hinrichs 2003, 9–20.
Black, E./Abney, S./Flickinger, D./Gdaniec, C./Grishman, R./Harrison, P./Hindle, D./Ingria, R./Jelinek, F./Klavans, J./Liberman, M./Marcus, M./Roukos, S./Santorini, B./Strzalkowski, T. (1991), A Procedure for Quantitatively Comparing the Syntactic Coverage of English Grammars. In: Proceedings of the DARPA Speech and Natural Language Workshop. Pacific Grove, CA, 306–311.
Bod, R. (1998), Beyond Grammar: An Experience-based Theory of Language. Stanford, CA: CSLI Publications.
Böhmová, A./Hajič, J./Hajičová, E./Hladká, B. (2003), The PDT: A 3-level Annotation Scenario. In: Abeillé 2003a, 103–127.
Bosco, C./Lombardo, V. (2004), Dependency and Relational Structure in Treebank Annotation. In: Proceedings of the Workshop on Recent Advances in Dependency Grammar. Geneva, Switzerland, 9–16.
Brants, T./Plaehn, O. (2000), Interactive Corpus Annotation. In: Proceedings of the Second International Conference on Language Resources and Evaluation. Athens, Greece, 453–459.
Brants, S./Dipper, S./Hansen, S./Lezius, W./Smith, G. (2002), The TIGER Treebank. In: Hinrichs/Simov 2002, 24–42.
Brants, T./Skut, W./Uszkoreit, H. (2003), Syntactic Annotation of a German Newspaper Corpus. In: Abeillé 2003a, 73–87.
Briscoe, E./Carroll, J. (1997), Automatic Extraction of Subcategorization from Corpora. In: Proceedings of the Fifth Conference on Applied Natural Language Processing. Washington, DC, 356–363.
Cahill, A./McCarthy, M./Van Genabith, J./Way, A. (2002), Evaluating F-structure Annotation for the Penn-II Treebank. In: Hinrichs/Simov 2002, 43–60.
Carlson, L./Marcu, D./Okurowski, M. E. (2002), RST Discourse Treebank. Philadelphia, PA: Linguistic Data Consortium.
Carroll, J./Briscoe, E./Sanfilippo, A. (1998), Parser Evaluation: A Survey and a New Proposal. In: Proceedings of the First International Conference on Language Resources and Evaluation (LREC). Granada, Spain, 447–454.
Carter, D. (1997), The TreeBanker: A Tool for Supervised Training of Parsed Corpora. In: Proceedings of the ACL Workshop on Computational Environments for Grammar Development and Linguistic Engineering. Madrid, Spain, 9–15.
Charniak, E. (1996), Tree-bank Grammars. In: Proceedings of AAAI/IAAI, 1031–1036.
Charniak, E. (2000), A Maximum-Entropy-Inspired Parser. In: Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics. Seattle, WA, 132–139.
Chen, K./Luo, C./Chang, M./Chen, F./Chen, C./Huang, C./Gao, Z. (2003), Sinica Treebank. In: Abeillé 2003a, 231–248.
Chomsky, N. (1965), Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.
Čmejrek, M./Cuřín, J./Havelka, J./Hajič, J./Kuboň, V. (2004), Prague Czech-English Dependency Treebank: Syntactically Annotated Resources for Machine Translation. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation. Lisbon, Portugal, 1597–1600.
Collins, M. (1999), Head-driven Statistical Models for Natural Language Parsing. PhD Thesis,
University of Pennsylvania.
Collins, M./Hajič, J./Brill, E./Ramshaw, L./Tillmann, C. (1999), A Statistical Parser of Czech. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 505–512.
Cyrus, L./Feddes, H./Schumacher, F. (2003), FuSe – A Multi-layered Parallel Treebank. In: Nivre/Hinrichs 2003, 213–216.
Dickinson, M./Meurers, W. D. (2003), Detecting Inconsistencies in Treebanks. In: Nivre/Hinrichs 2003, 45–56.
Garside, R./Leech, G./Varadi, T. (compilers) (1992), Lancaster Parsed Corpus. A Machine-readable
Syntactically Analyzed Corpus of 144,000 Words. Available for Distribution through ICAME.
Bergen: The Norwegian Computing Centre for the Humanities.
Hajič, J. (1998), Building a Syntactically Annotated Corpus: The Prague Dependency Treebank. In: Issues of Valency and Meaning. Prague: Karolinum, 106–132.
Hajičová, E. (1998), Prague Dependency Treebank: From Analytic to Tectogrammatical Annotation. In: Proceedings of the First Workshop on Text, Speech, Dialogue. Brno, Czech Republic, 45–50.
Han, C./Han, N./Ko, S. (2002), Development and Evaluation of a Korean Treebank and its Application to NLP. In: Proceedings of the Third International Conference on Language Resources and Evaluation. Las Palmas, Spain, 1635–1642.
Hindle, D. (1994), A Parser for Text Corpora. In: Zampolli, A. (ed.), Computational Approaches to the Lexicon. New York: Oxford University Press, 103–151.
Hinrichs, E./Simov, K. (eds.) (2002), Proceedings of the First Workshop on Treebanks and Linguistic
Theories. Sozopol, Bulgaria.
Hinrichs, E. W./Bartels, J./Kawata, Y./Kordoni, V./Telljohann, H. (2000), The Tübingen Treebanks for Spoken German, English and Japanese. In: Wahlster, W. (ed.), Verbmobil: Foundations of Speech-to-Speech Translation. Berlin: Springer, 552–576.
Hockenmaier, J./Steedman, M. (2002), Acquiring Compact Lexicalized Grammars from a Cleaner Treebank. In: Proceedings of the Third International Conference on Language Resources and Evaluation. Las Palmas, Spain, 1974–1981.
Ide, N./Romary, L. (2003), Encoding Syntactic Annotation. In: Abeillé 2003a, 281–296.
Järborg, J. (1986), Manual för syntaggning. Göteborg University: Department of Swedish.
Järvinen, T. (2003), Bank of English and Beyond. In: Abeillé 2003a, 43–59.
Kingsbury, P./Palmer, M. (2003), PropBank: The Next Level of TreeBank. In: Nivre/Hinrichs 2003, 105–116.
König, E./Lezius, W. (2003), The TIGER Language – A Description Language for Syntax Graphs. Formal Definition. Technical Report, IMS, University of Stuttgart.
Kroch, A./Taylor, A. (2000), Penn-Helsinki Parsed Corpus of Middle English. URL: http://www.ling.upenn.edu/mideng/.
Kromann, M. T. (2003), The Danish Dependency Treebank and the DTAG Treebank Tool. In: Nivre/Hinrichs 2003, 217–220.
Kübler, S./Nivre, J./Hinrichs, E./Wunsch, H. (eds.) (2004), Proceedings of the Third Workshop on Treebanks and Linguistic Theories. Tübingen, Germany.
Kučera, H./Francis, W. (1967), Computational Analysis of Present-day American English. Providence, RI: Brown University Press.
Kunz, K./Hansen-Schirra, S. (2003), Coreference Annotation of the TIGER Treebank. In: Nivre/Hinrichs 2003, 221–224.
Kurohashi, S./Nagao, M. (2003), Building a Japanese Parsed Corpus. In: Abeillé 2003a, 249–260.
Stamou, S./Andrikopoulos, V./Christodoulakis, D. (2003), Towards Developing a Semantically Annotated Treebank Corpus for Greek. In: Nivre/Hinrichs 2003, 225–228.
Taylor, A./Marcus, M./Santorini, B. (2003), The Penn Treebank: An Overview. In: Abeillé 2003a, 5–22.
Teleman, U. (1974), Manual för grammatisk beskrivning av talad och skriven svenska. Lund: Studentlitteratur.
Toutanova, K./Manning, C. D./Shieber, S. M./Flickinger, D./Oepen, S. (2002), Parse Disambiguation for a Rich HPSG Grammar. In: Hinrichs/Simov 2002, 253–263.
Ule, T./Simov, K. (2004), Unexpected Productions May Well Be Errors. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation. Lisbon, Portugal, 1795–1798.
Vilnat, A./Paroubek, P./Monceaux, L./Robba, I./Gendner, V./Illouz, G./Jardino, M. (2003), EASY or How Difficult Can It Be to Define a Reference Treebank for French. In: Nivre/Hinrichs 2003, 229–232.
Volk, M./Samuelsson, Y. (2004), Bootstrapping Parallel Treebanks. In: Proceedings of the 5th International Workshop on Linguistically Interpreted Corpora. Geneva, Switzerland, 63–70.
Wallis, S. (2003), Completing Parsed Corpora. In: Abeillé 2003a, 61–71.
Wouden, T. van der/Hoekstra, H./Moortgat, M./Renmans, B./Schuurman, I. (2002), Syntactic Analysis in the Spoken Dutch Corpus. In: Proceedings of the Third International Conference on Language Resources and Evaluation. Las Palmas, Spain, 768–773.
Xue, N./Xia, F./Chiou, F.-D./Palmer, M. (2004), The Penn Chinese Treebank: Phrase Structure Annotation of a Large Corpus. In: Natural Language Engineering 11, 207–238.