Treebank Annotation 
By – 
Mohit Jasapara – 2012EEB1059 
Aashish Kholiya – 2012MEB1083 
Treebank 
 The term treebank was coined by linguist Geoffrey Leech in the 1980s because 
both syntactic and semantic structure are commonly represented compositionally 
as a tree structure. 
 In linguistics, a treebank is a parsed text corpus that annotates syntactic or 
semantic sentence structure. 
 In simple words, treebanks are collections of manually checked syntactic analyses 
of sentences. 
Construction 
 Treebanks are often created on top of a corpus that has already been annotated 
with part-of-speech tags. 
 Treebanks are sometimes enhanced with semantic or other linguistic information. 
 Treebanks can be created completely manually, where linguists annotate each 
sentence with syntactic structure, or semi-automatically, where a parser assigns 
some syntactic structure which linguists then check and, if necessary, correct. 
Construction 
 In practice, fully checking and completing the parsing of natural language corpora 
is a labour-intensive project that can take teams of graduate linguists several years. 
 The level of annotation detail and the breadth of the linguistic sample determine 
the difficulty of the task and the length of time required to build a treebank. 
Construction 
 Some treebanks follow a specific linguistic theory in their syntactic annotation 
(e.g. the BulTreeBank follows HPSG), but most try to be less theory-specific. 
However, two main groups can be distinguished: treebanks that annotate phrase 
structure (for example, the Penn Treebank or ICE-GB) and those that annotate 
dependency structure (for example, the Prague Dependency Treebank or the 
Quranic Arabic Dependency Treebank). 
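To make the two annotation styles concrete, the sketch below encodes the sentence John loves Mary both ways. The labels (S, NP, VP, nsubj, obj), the tuple encoding and the helper names are illustrative assumptions and are not taken from any of the treebanks named above.

```python
# Phrase structure: nested (label, children...) tuples; leaves are plain strings.
phrase_tree = ("S",
               ("NP", "John"),
               ("VP", "loves", ("NP", "Mary")))

# Dependency structure: one (dependent, relation, head) arc per word.
# The root word has head None. Relation names are illustrative.
dependency_arcs = [
    ("John", "nsubj", "loves"),
    ("loves", "root", None),
    ("Mary", "obj", "loves"),
]


def leaves(node):
    """Return the words covered by a phrase-structure node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def constituents(node):
    """List (label, words) for every phrase in the tree, top-down."""
    if isinstance(node, str):
        return []
    found = [(node[0], leaves(node))]
    for child in node[1:]:
        found.extend(constituents(child))
    return found


def dependents(head, arcs):
    """Return the words attached directly to `head`."""
    return [dep for dep, _rel, h in arcs if h == head]


for label, words in constituents(phrase_tree):
    print(label, words)
print("dependents of 'loves':", dependents("loves", dependency_arcs))
```

The phrase-structure encoding groups words into nested units, while the dependency encoding has no phrase nodes at all and only links each word to its head.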
Construction 
 It is important to clarify the distinction between the formal representation and the 
file format used to store the annotated data. 
 Treebanks are necessarily constructed according to a particular grammar. The same 
grammar may be implemented by different file formats. 
Construction 
For example, the syntactic analysis for John loves Mary, shown in the figure on the 
right, may be represented by simple labelled brackets in a text file, like this (following 
the Penn Treebank notation): 
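A minimal sketch of such a bracketing, and of how it can be read back into a tree, might look as follows. The tags (S, NP, VP, NNP, VBZ) follow common Penn Treebank usage, but the exact bracketing shown is an assumption based on the sentence, and `parse_brackets` is a hypothetical helper rather than part of any treebank toolkit.

```python
import re

# Assumed bracketing for "John loves Mary" in Penn Treebank style.
BRACKETED = """
(S (NP (NNP John))
   (VP (VBZ loves)
       (NP (NNP Mary)))
   (. .))
"""


def parse_brackets(text):
    """Parse one labelled-bracket tree into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        node = [tokens[pos]]  # the label follows the opening bracket
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read_node())  # nested phrase
            else:
                node.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume ')'
        return node

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def words(node):
    """Collect the leaf words of a parsed tree, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1:] for w in words(child)]


tree = parse_brackets(BRACKETED)
print(tree)
print(words(tree))
```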
Construction 
 This type of representation is popular because it is light on resources, and the tree 
structure is relatively easy to read without software tools. However, as corpora 
become increasingly complex, other file formats may be preferred. Alternatives 
include treebank-specific XML schemes, numbered indentation and various types 
of standoff notation. 
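As one illustration of these alternatives, the sketch below stores the analysis of John loves Mary in a standoff style: the text is left untouched and the annotations point into it by offset. The record layout and names are hypothetical, chosen only for this example.

```python
text = "John loves Mary"

# Token layer: (start, end, part-of-speech tag), character offsets into `text`.
tokens = [
    (0, 4, "NNP"),
    (5, 10, "VBZ"),
    (11, 15, "NNP"),
]

# Phrase layer: (label, first_token, last_token), indices into `tokens`.
phrases = [
    ("S", 0, 2),
    ("NP", 0, 0),
    ("VP", 1, 2),
    ("NP", 2, 2),
]


def phrase_text(phrase):
    """Recover the surface string a phrase covers from the raw text."""
    _label, first, last = phrase
    start = tokens[first][0]
    end = tokens[last][1]
    return text[start:end]


for phrase in phrases:
    print(phrase[0], repr(phrase_text(phrase)))
```

Because each layer is kept apart from the text, further layers can be added without rewriting the existing ones.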
Applications 
Computational perspective 
 From a computational perspective, treebanks have been used to engineer 
state-of-the-art natural language processing systems such as part-of-speech 
taggers, parsers, semantic analyzers and machine translation systems. 
 Most computational systems utilize gold-standard treebank data. 
 However, an automatically parsed corpus that is not corrected by human linguists 
can still be useful. 
Applications 
 An automatically parsed corpus can provide evidence of rule frequency for a parser. 
 A parser may be improved by applying it to large amounts of text and gathering 
rule frequencies. 
 However, only by correcting and completing a corpus by hand is it possible to 
identify rules absent from the parser's knowledge base. In addition, frequencies 
from a hand-corrected corpus are likely to be more accurate. 
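Gathering rule frequencies can be sketched as follows. The two toy trees, the nested-tuple encoding and the function name `rules` are assumptions made for illustration; a real treebank would supply the trees.

```python
from collections import Counter

# Toy parsed corpus: (label, children...) tuples; leaves are plain strings.
corpus = [
    ("S", ("NP", ("NNP", "John")),
          ("VP", ("VBZ", "loves"), ("NP", ("NNP", "Mary")))),
    ("S", ("NP", ("NNP", "Mary")),
          ("VP", ("VBZ", "sleeps"))),
]


def rules(node):
    """Yield one 'parent -> children' rule per phrase, skipping word-level nodes."""
    children = node[1:]
    if all(isinstance(child, str) for child in children):
        return  # preterminal such as (NNP John): lexical, not a phrase rule
    rhs = " ".join(c if isinstance(c, str) else c[0] for c in children)
    yield f"{node[0]} -> {rhs}"
    for child in children:
        if not isinstance(child, str):
            yield from rules(child)


counts = Counter(rule for tree in corpus for rule in rules(tree))
total = sum(counts.values())
for rule, n in counts.most_common():
    # Relative frequency of each rule among all phrase rules observed.
    print(f"{rule:15} {n}  {n / total:.2f}")
```

A rule that never appears in `counts` is simply unseen in the parsed data, which is why rules missing from the parser itself can only surface once a human corrects the trees.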
Applications 
Corpus linguistics 
 In corpus linguistics, treebanks are used to study syntactic phenomena; for 
example, diachronic corpora can be used to study the time course of syntactic 
change. 
 Once parsed, a corpus will contain frequency evidence showing how common 
different grammatical structures are in use. 
 Treebanks also provide evidence of coverage and support the discovery of new, 
unanticipated grammatical phenomena. 
Applications 
 Interaction research is particularly fruitful as further layers of annotation, e.g. 
semantic or pragmatic, are added to a corpus. 
 It is then possible to evaluate the impact of non-syntactic phenomena on 
grammatical choices. 
Applications 
Theoretical linguistics and psycholinguistics 
 Another use of treebanks in theoretical linguistics and psycholinguistics is 
interaction evidence. 
 A completed treebank can help linguists carry out experiments on how the 
decision to use one grammatical construction tends to influence the decision to 
form others, and to try to understand how speakers and writers make decisions as 
they form sentences. 
Penn Treebank Project 
 The Penn Treebank Project annotates naturally occurring text for linguistic 
structure. 
 Most notably, it produces skeletal parses showing rough syntactic and semantic 
information: a bank of linguistic trees. 
 It also annotates text with part-of-speech tags and, for the Switchboard corpus of 
telephone conversations, dysfluency annotation. 
 It is located in the LINC Laboratory of the Computer and Information Science 
Department at the University of Pennsylvania. 
Penn Treebank Project 
 The Linguistic Data Consortium (LDC) provides tools and formats for creating and 
managing linguistic annotations. 
 'Linguistic annotation' covers any descriptive or analytic notations applied to raw 
language data. 
 The Penn Treebank is a human-annotated and partially 'skeletally' parsed corpus 
consisting of over 4.5 million words of American English. 
 It includes the Brown Corpus (retagged) and the Wall Street Journal Corpus, as well 
as Department of Energy abstracts, Dow Jones Newswire stories, Department of 
Agriculture bulletins, Library of America texts, MUC-3 messages, IBM Manual 
sentences, WBUR radio transcripts, and ATIS sentences. 
