Do we need annotated corpora
in the era of the data deluge?

Martin Wynne
martin.wynne@oucs.ox.ac.uk
researchsupport@oucs.ox.ac.uk
Oxford e-Research Centre &
IT Services (formerly OUCS) &
Faculty of Linguistics, Philology and Phonetics,
University of Oxford

ACRH2
Lisbon
Thursday 29th November 2012
Annotated Corpora for Research in the Humanities
Problems with annotation

It can:
• lead to circular reasoning
• be incorrect
• be inconsistent
• follow a particular theory
• have a specific level of granularity
• use a particular tag-set
• introduce subjective interpretations
The data deluge
The case for the corpus today
(against “the web as corpus”)

The spoken corpus: spoken, and other non-computer-mediated data
The historical corpus: pre-internet data (beyond books)
The specialised corpus: with integrity, provenance and controlled
sampling and representativeness
The annotated corpus: adding and sharing linguistic annotation
The web corpus: filtering and organising the data deluge (aka "the web
for corpus")

The case for the corpus today
(against “the web as corpus”)

But we do need to go beyond the finite text corpus:
● speech
● video
● the language of the internet - new genres, new media, new modes
● capturing the context, especially other data streams
● engaging with the non-finite corpora (aka "the web as corpus")

Image by James Cridland from Flickr. Some rights reserved.
Annotation - Why?

• To perform identification, categorization and analysis
of features of the text
• It enables certain types of search and analysis,
especially beyond the word form (e.g. “search for all
inflected forms of cause as a verb”)
• It can be the foundation for further automatic analysis
of a corpus (e.g. POS tags can be used for parsing)
• Preserving the analysis, enabling replicability of
research, and reusability of the annotated corpus
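
The search example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Token` structure, the toy corpus and its tags are all hypothetical and hand-written, not the output of any real tagger or the format of any real corpus.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str   # surface form as it appears in the text
    lemma: str  # dictionary headword (hypothetical hand annotation)
    pos: str    # coarse part-of-speech tag (hypothetical tag-set)


# Toy corpus, annotated by hand purely for illustration.
corpus = [
    Token("The", "the", "DET"),
    Token("cause", "cause", "NOUN"),
    Token("was", "be", "VERB"),
    Token("unclear", "unclear", "ADJ"),
    Token("but", "but", "CONJ"),
    Token("smoking", "smoking", "NOUN"),
    Token("causes", "cause", "VERB"),
    Token("harm", "harm", "NOUN"),
    Token("and", "and", "CONJ"),
    Token("has", "have", "VERB"),
    Token("caused", "cause", "VERB"),
    Token("many", "many", "ADJ"),
    Token("causes", "cause", "NOUN"),
    Token("for", "for", "ADP"),
    Token("concern", "concern", "NOUN"),
]


def find_lemma_as_pos(tokens, lemma, pos):
    """Return (index, word) for every token with the given lemma and POS tag."""
    return [(i, t.word) for i, t in enumerate(tokens)
            if t.lemma == lemma and t.pos == pos]


# Annotation-based search: only the verbal uses of "cause".
verb_hits = find_lemma_as_pos(corpus, "cause", "VERB")

# Word-form search on the raw text: cannot tell nouns from verbs.
word_form_hits = [(i, t.word) for i, t in enumerate(corpus)
                  if t.word.lower().startswith("caus")]

print(verb_hits)
print(word_form_hits)
```

The word-form search returns every string beginning with "caus", nouns included, while the annotation-based search keeps only the tokens tagged as verbs; the difference between the two result lists is what the lemma and POS layers buy.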

Annotation: less than the text?

“Annotation of a text is a procedure which loses
information. There is no point in arguing that the
information is in the computer's memory somewhere
- annotation is the substitution of a general
category for a specific item, and with respect to
that area of the classification, the item has lost its
uniqueness.”
(John Sinclair, personal communication, 2001)

Annotation: how?

• Annotations should be separable
• Detailed and explicit documentation should be
provided
• Annotation practices should be linguistically
consensual
• Annotation should observe standards
(Leech 2005)

http://www.ota.ox.ac.uk/documents/creating/dlc/
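
One way to read "annotations should be separable" is as stand-off annotation: the text is stored untouched and each tag points into it by character offsets. The sketch below assumes that reading; the `Span` class, the layer name and the tags are hypothetical and do not follow any particular standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # character offset where the annotated stretch begins
    end: int    # character offset just past its last character
    layer: str  # name of the annotation layer, e.g. "pos" (hypothetical)
    value: str  # the tag itself


# The text is never modified; annotations live in a separate list.
text = "Speak low, if you speak love."

annotations = [
    Span(0, 5, "pos", "VERB"),
    Span(6, 9, "pos", "ADV"),
    Span(18, 23, "pos", "VERB"),
    Span(24, 28, "pos", "NOUN"),
]


def render_inline(text, spans, layer):
    """Build a word_TAG view of one layer without changing `text` itself."""
    out, cursor = [], 0
    selected = sorted((s for s in spans if s.layer == layer),
                      key=lambda s: s.start)
    for s in selected:
        out.append(text[cursor:s.start])                 # untagged material
        out.append(f"{text[s.start:s.end]}_{s.value}")  # tagged stretch
        cursor = s.end
    out.append(text[cursor:])
    return "".join(out)


print(render_inline(text, annotations, "pos"))
print(text)  # the original is still available, tags removed at no cost
```

Because the tagged view is generated on demand, a user who rejects the tag-set can discard the `annotations` list and still have the plain text, and a second layer with a different theory behind it can sit alongside the first.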

Annotation standards?

Use of standards can help to ensure successful:
• interpretation,
• interchange,
• preservation,
• incorporation into other resources,
• processing by generic software.
Using standards is also a way of resolving tricky encoding decisions,
and of justifying and documenting your decisions.

Potential problems with annotation

1. Annotation is liable to be subjective and inconsistent
2. Annotation is sometimes intellectual and painstaking,
sometimes trivial and automatic
3. Annotation leads to digital silos
4. Annotation makes building a shared services
infrastructure difficult
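
Point 1 can be made concrete by measuring how far two annotators agree on the same tokens. The sketch below uses Cohen's kappa, a standard chance-corrected agreement statistic that is brought in here for illustration and is not named in the slides; the two label lists are invented toy data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Proportion of items on which the two annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator assigned labels at random
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented POS decisions for ten tokens; the annotators differ on two of them.
annotator_a = ["NOUN", "VERB", "NOUN", "VERB", "NOUN",
               "NOUN", "VERB", "NOUN", "VERB", "NOUN"]
annotator_b = ["NOUN", "VERB", "VERB", "VERB", "NOUN",
               "NOUN", "NOUN", "NOUN", "VERB", "NOUN"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 2))  # about 0.58 for this toy data
```

Raw agreement here is 80%, but with only two labels a good part of that would arise by chance, which is why the corrected figure is noticeably lower.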

Interoperability and sustainability
for digital textual scholarship

Well-known problems with digital resources in the humanities:
• fragmentation of communities, resources, tools;
• lack of connectedness and interoperability;
• sustainability of online services;
• lack of deployment of tools as reliable and available services.
There is a potential solution in distributed, federated infrastructure
services.

The CLARIN Vision
A researcher in Darmstadt, from his desktop computer, can:
• do a single sign-on, with local authentication, and then:
• search for, find and obtain authorization to use corpora in Oxford,
Prague and Berlin
• select the precise dataset to work on, and save that selection
• run semantic analysis tools from Budapest and statistical tools from
Tübingen over the dataset
• use computational power from the local, national or other
computing centre where necessary
• obtain advice and support for carrying out all technical and
methodological procedures
• save the workflow and results of the analysis, and share those
results with collaborators in Paris, Vienna and Zagreb
• discuss and iteratively adapt and re-run the analyses with
collaborators
Silos or fishtanks??

Let's talk about fishtanks rather than silos...
There are lots of fishtanks out there, some very elaborate, big, pretty...
But they're all in different places and
unconnected.
And if I want to keep a fish I have to
build a fishtank (or put it in yours)...
And who's going to carry on feeding
the fish?
Let's not all make our own fishtanks.

Wouldn't it be better to have an ecosystem where we can all set our
fishes free?

You can access all of the riches of the deep and it's a lot easier to get
into fish research

CLARIN
http://www.clarin.eu/

Infrastructure services for research in the humanities and
social sciences using language resources and tools.
Services to include:
• Access and identity federation
• Network of service centres
• Concept and component metadata registries
• Federated resource discovery
• Federated search across resources
• SOA for connecting tools
• PID services

Bamboo
http://www.project-bamboo.org/

Project Bamboo is building applications and shared
infrastructure for humanities research, principally:
• Research environments for humanities scholars
• Infrastructure allowing librarians and technologists to support
humanities scholarship
• Evolution of shared applications for the curation and exploration of
widely distributed content collections
• Build a community for uptake, expansion and sustainability

DARIAH
http://www.dariah.eu

Enhance and support digitally-enabled research across the
humanities and arts.
DARIAH is working with communities of practice to:
• Explore and apply ICT-based methods and tools
• Improve research opportunities and outcomes through linking
distributed digital source materials of many kinds
• Exchange knowledge, expertise, methodologies and practices
across domains and disciplines
Corpus Linguistics

Player One (a man)

Player Two (a woman)

[Enter two players]

What news, Borachio?
[Don John, Much Ado About Nothing, I, 3]

I came yonder from a great supper: I can give
you intelligence of an intended marriage.
[Borachio, Much Ado About Nothing, I, 3]

They say the lady is fair; 'tis a truth,
I can bear them witness; and virtuous;
'tis so, I cannot reprove it
[Benedick, Much Ado About Nothing, II, 3]

A married man! that's most intolerable.
[Earl of Warwick, Henry VI Part I, V, 4]

Yet hasty marriage seldom proveth well.
[Richard III, Henry VI Part III, IV, 1]

Is the single man therefore blessed?
No; as a wall'd town is more worthier than a
village, so is the forehead of a married man
more honourable than the bare brow of a
bachelor
[Touchstone, As You Like It, III, 3]

Many a good hanging prevents a bad
marriage
[Feste, Twelfth Night, I, 5]

By this marriage, All little jealousies, which
now seem great,
And all great fears, which now import their
dangers,
Would then be nothing
[Agrippa, Antony and Cleopatra, II, 2]

I may chance have some odd quirks and
remnants of wit broken on me, because I
have railed so long against marriage: but doth
not the appetite alter? a man loves the meat
in his youth that he cannot endure in his age.
[Benedick, Much Ado About Nothing, II, 3]

They are in the very wrath of love, and they
will together. Clubs cannot part them.
[Rosalind, As You Like It, V, 2]

Speak low, if you speak love.
[Don Pedro, Much Ado About Nothing, II, 1]

Data-intensive Humanities

Nature 474, 436-440 (2011) | doi:10.1038/474436a
"[There is] a monolithic conception of social space, according to which
it would suffice to have the right information to make the right decisions.
But in point of fact, information itself is far from homogenous and no
purely quantitative approach is satisfying. Having ever greater amounts
of information at our fingertips not only does not make us more
virtuous, as Rousseau already predicted, but it does not even make us
more knowledgeable."
[Tzvetan Todorov, In Defence of the Enlightenment, 2009]

The simple challenge then...

... to transform the Humanities by promoting shared digital services,
facilities, resources and tools, without destroying the justification and
arguments for the Humanities for the Humanities' sake, and thus
accidentally contributing to the decline and eventual destruction of
civilization

The 'take-home messages'

● in the era of the data deluge, web science and digital scholarship,
we need to rethink the case for the corpus today, and the case
for doing annotation
● we need an ecosystem, not separate 'fishtanks'
● annotation risks more fragmentation
● we need to follow the physical sciences in deciding priorities &
adopting standards, reducing complexity and variety, to promote
shared facilities and infrastructures
● but, at the same time, we need to avoid arguments for scientism
and instrumentalism, and to defend the humanities
