Microscope, macroscope and zoom lens:
close, distant and scalable reading in the Humanities
Digital Humanities Summer School, University of Oxford, 7th July 2023
Martin Wynne
Senior Researcher in Corpus Linguistics
Faculty of Linguistics, Philology and Phonetics
https://orcid.org/0000-0002-4155-0530
martin.wynne@ling-phil.ox.ac.uk
Martin Wynne Text Analysis 2
Summary
● What is text analysis?
● Close reading, distant reading, textual interpretation
● Corpus linguistics: the vanguard of digital humanities
Types of text analysis
● the study of rhetoric ('how language is used to persuade')
● close reading ('focus on the words')
● stylistics ('study of the language of literature')
● stylometry ('quantifying aspects of the language of texts, especially for authorship attribution and investigating genres')
● corpus linguistics ('developing and analysing large electronic datasets representative of particular language varieties')
● distant reading ('studying and doing things with more text than you can read')
● macroanalysis (‘plotting features in large corpora over time’)
● discourse analysis ('categorizing and analysing structural elements of discourse')
● critical corpus discourse analysis ('using corpus linguistic methods to reveal hidden agendas and motivations in texts')
● deconstruction ('what the words are not saying or failing to say')
● forensic linguistics (‘gathering legal evidence about attribution and meaning’)
● qualitative social science ('annotation and analysis of interviews, survey results, etc.')
● ...and more...
My meaning of text analysis
A diverse, open and fluid set of methods, datasets and tools, to be
used in support of a variety of research processes, with the aim of
interpreting texts
My sort of text analysis
...is not just for linguists, but also not just for literary scholars,
historians, political scientists, sociologists, journalists, activists,
forensic scientists…
...and it can be useful for all or any of them
“Text analysis tools aid the interpreter asking questions of texts”
Geoffrey Rockwell
https://web.archive.org/web/20150410205354/http://tada.mcmaster.ca/Main/WhatTA
Methods and Techniques
Search
- search large texts quickly
- search a collection of texts or corpus
- search enhanced by linguistic annotation
- complex searches
Analyze
- patterns of words
- collocations
- expanded co-text around words
- wordlists, keywords
- clusters, ngrams
Compare
- compare texts
- compare sections of a text with each other
- compare a text with a reference corpus
Visualize
- concordances
- distribution of features in a text or corpus
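Several of the techniques listed above (wordlists, concordances, clusters/ngrams) can be illustrated with a few lines of Python. This is only a minimal sketch using the standard library: the tokenisation rule, the function names and the sample sentences are assumptions made for illustration, and tools such as AntConc or #LancsBox do considerably more.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def kwic(tokens, node, width=4):
    """Return (left co-text, node, right co-text) for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def ngrams(tokens, n=2):
    """Count contiguous sequences of n tokens (clusters)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Invented sample text, for illustration only.
sample = ("In the aftermath of the war the city was rebuilt. "
          "In the aftermath of the election the party split.")
tokens = tokenize(sample)

wordlist = Counter(tokens)                      # frequency wordlist
for left, node, right in kwic(tokens, "aftermath"):
    print(f"{left:>25} | {node} | {right}")     # one concordance line per hit
print(ngrams(tokens, 3).most_common(3))         # most frequent 3-word clusters
```

Aligning the node word in a column, as a concordance does, is what makes recurrent patterns to its left and right visible at a glance.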
Close Reading
Elaine Showalter describes close reading
as:
“...slow reading, a deliberate attempt to
detach ourselves from the magical power of
story-telling and pay attention to language,
imagery, allusion, intertextuality, syntax
and form.”
It is, in her words, ‘a form of
defamiliarisation we use in order to break
through our habitual and casual reading
practices’ (Teaching Literature, p.98).
Further introductory reading:
● https://www.york.ac.uk/english/writing-at-york/writing-resources/close-reading/
● https://writingcenter.fas.harvard.edu/pages/how-do-close-reading
● http://theliterarylink.com/closereading.html
Close Reading
● Traditional criticism (biographical, social, historical, psychological)
...the new paradigm...
● Practical criticism, New Criticism (concentrate on the ‘words on the page’)
● Hermeneutics (theory and practice of interpretation)
● Interpretation (always provisional, never final)
● Inductive reasoning (not deductive and mathematical, based on experience and probabilities)
...the next new paradigm...
● New Historicism (literature must be understood in its historical context)
...
https://www.english.cam.ac.uk/classroom/pracrit.htm
https://www.oxfordbibliographies.com/view/document/obo-9780190221911/obo-9780190221911-0015.xml
Close reading as the paradigm for
text-based humanities scholarship
But what do you do with a million books?
There are only about 30,000 days in a human life -- at a book a
day, it would take 30 lifetimes to read a million books and our
research libraries contain more than ten times that number. Only
machines can read through the 400,000 books already publicly
available for free download from the Open Content Alliance.
Gregory Crane, “What do you do with a million books?”
D-Lib Magazine, March 2006
And 5 million books?
We constructed a corpus of digitized texts containing about 4% of all books ever
printed. Analysis of this corpus enables us to investigate cultural trends
quantitatively. We survey the vast terrain of “culturomics” focusing on linguistic
and cultural phenomena that were reflected in the English language between
1800 and 2000. We show how this approach can provide insights about fields as
diverse as lexicography, the evolution of grammar, collective memory, the
adoption of technology, the pursuit of fame, censorship, and historical
epidemiology. “Culturomics” extends the boundaries of rigorous quantitative
inquiry to a wide array of new phenomena spanning the social sciences and the
humanities.
www.sciencexpress.org / 16 December 2010
Distant reading: where distance, let me
repeat it, is a condition of knowledge: it
allows you to focus on units that are much
smaller or much larger than the text:
devices, themes, tropes—or genres and
systems. And if, between the very small
and the very large, the text itself
disappears, well, it is one of those cases
when one can justifiably say, less is more.
If we want to understand the system in its
entirety, we must accept losing something.
We always pay a price for theoretical
knowledge: reality is infinitely rich;
concepts are abstract, are poor. But it’s
precisely this ‘poverty’ that makes it
possible to handle them, and therefore to
know. This is why less is actually more.
Franco Moretti, “Conjectures on World
Literature” Distant Reading, 2013.
Distant Reading
A canon of 200 novels, for instance,
sounds very large for 19th-century
Britain (and is much larger than the
current one), but is still less than 1% of
the novels that were actually published
[…] and close reading won’t help here, a
novel a day every day of the year would
take a century or so … And it’s not even
a matter of time, but of method: a field
this large cannot be understood by
stitching together separate bits of
knowledge about individual cases,
because it isn’t a sum of individual
cases: it’s a collective system, that
should be grasped as such, as a whole.
Franco Moretti, Graphs, Maps, Trees:
Abstract Models for Literary History, 2005
What are we ultimately aiming for
when it comes to digital scholarship in the Humanities?
Ways to combine close reading with
big data approaches.
From “distant”
(not) reading to
close reading and
back again...
Digital Humanities
as a locus for
“scalable” reading
practices
DATA: digitally
assisted text
analysis
Martin Mueller,
Northwestern
What do you need to know in order to move to
interpretation?
1. You need to know what’s in your dataset.
2. You need to know how to find what you are looking for.
3. You need to know how to make sense of what you find.
Software tools
● AntConc
● Sketch Engine
● CQPweb
● #LancsBox
● English-corpora.org
● KonText
● Voyant Tools
● CliC
● Hansard at Huddersfield
● ...and more
Finding resources
● CLARIN Virtual Language Observatory (https://vlo.clarin.eu/)
● CLARIN Resource Families (https://www.clarin.eu/resource-families/)
Corpus Query Tools:
a CLARIN Resource Family
https://www.clarin.eu/resource-families/corpus-query-tools
The 'aftermath' of the seminar
Subject: Les Francais des Corpus – Aftermath
Dear colleagues,
First, many thanks for presenting at/attending
the Francais des Corpus Workshop and for making
it such a success.
I promised I would keep you in touch with one
another and hope that the full list of your e-mail
addresses above makes that possible.
…
KWIC concordance from Written BNC2014 generated in #LancsBox X
(a representative corpus of British English released in 2021).
'aftermath'
Collocates:
War
Gulf
coup
World
disaster
Tiananmen
death
revolution
defeat
Chernobyl
affair
riots
battle
massacre
wars
election
Crisis
events
explosion
invasion
trial
fire
June
Square
victory
accident
attempt
Significant collocates in the British National Corpus
(a representative corpus of British English released in 1994).
BNCWeb parameters:
There are 1486 different types in your collocation database
for the query "[word="aftermath"%c] [word="of"%c]".
(Your query "aftermath of" returned 544 hits in 337 different texts)
The selected range was 1 to 4.
Corpus basis for calculation: the whole BNC.
Type of calculation: Log-likelihood
Tag restriction: any noun
Collocates occur at least 5 times in the whole BNC.
Words collocate at least 5 times.
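The parameters above (a span of 1 to 4 positions after the node, log-likelihood as the statistic, minimum frequency thresholds) can be mimicked on a small scale. The sketch below is an assumption-laden illustration, not a reconstruction of how BNCweb computes its scores: it uses Dunning's log-likelihood (G2) on a 2x2 contingency table, counts only positions to the right of the node, ignores the noun tag restriction, and runs on an invented toy text.

```python
import math
from collections import Counter


def log_likelihood(o11, o12, o21, o22):
    """Dunning's G2 statistic for a 2x2 contingency table of observed counts."""
    total = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            if observed[i][j] > 0:  # empty cells contribute nothing
                expected = rows[i] * cols[j] / total
                g2 += observed[i][j] * math.log(observed[i][j] / expected)
    return 2 * g2


def collocates(tokens, node, span=4, min_count=1):
    """Score words found 1..span positions to the right of `node`."""
    positions = set()  # a set, so overlapping windows are not counted twice
    for i, tok in enumerate(tokens):
        if tok == node:
            positions.update(range(i + 1, min(i + 1 + span, len(tokens))))
    in_window = Counter(tokens[p] for p in positions)
    freq = Counter(tokens)
    window_total = len(positions)
    outside_total = len(tokens) - window_total
    scores = {}
    for word, o11 in in_window.items():
        if o11 < min_count:
            continue
        o12 = window_total - o11     # other words inside the windows
        o21 = freq[word] - o11       # this word outside the windows
        o22 = outside_total - o21    # other words outside the windows
        scores[word] = log_likelihood(o11, o12, o21, o22)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented toy text, for illustration only.
toy = ("aftermath of war " * 5 + "peace and quiet " * 20).split()
for word, score in collocates(toy, "aftermath", span=2):
    print(f"{word:10} {score:.2f}")
```

Note that G2 measures how surprising a co-occurrence count is in either direction, so a high score can also mean that a word is rarer near the node than expected; in practice the observed and expected counts are compared before a word is called a collocate.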
J. R. Firth (1890-1960)
“The complete meaning of a word is
always contextual, and no study of
meaning apart from context can be taken
seriously.”
J. R. Firth (1935). "The Technique of Semantics." Transactions of the Philological Society,
36-72; p. 37 (reprinted in Firth (1957)).
“You shall know a word by the company
it keeps.”
J. R. Firth (1957). "Papers in Linguistics, 1934-1951". Oxford: Oxford University Press.
What is a corpus?
“…a collection of pieces of language, selected and
ordered according to explicit linguistic criteria in
order to be used as a sample of the language.”
(Sinclair 1996)
What is Corpus Linguistics?
(1) Focus on linguistic performance, rather than competence
(2) Focus on linguistic description, rather than linguistic universals
(3) Focus on quantitative, as well as qualitative models of language
(4) Focus on a more empiricist, rather than rationalist view of
scientific inquiry.
(Leech 1992)
AntConc: explore your own texts and corpora
● Download for free from https://www.laurenceanthony.net/software/antconc/
● Use with any 'plain' text
● Multilingual capabilities
● Does not interpret mark-up or metadata
#LancsBox
Download for free from https://lancsbox.lancs.ac.uk/
● Works with your own data or existing corpora
● Visualizes language data
● Analyses data in any language
● Automatically annotates data for part-of-speech (for some languages)
● Wizard tool produces a prose report
● Works with major operating systems (Windows, Mac, Linux)
● Latest version #LancsBox X launched 2023
CQPweb:
Online interface for indexed corpora
http://cqpweb.lancs.ac.uk
...but now also with a new feature
to upload data, in limited ways...
SketchEngine: an online interface for
your corpus
https://www.sketchengine.eu/
Access to Sketch Engine is by paid subscription. Individual licences are available from €6.56
per month, with free trials available.
A new opportunity
"It is not easy to justify assertions about the alleged frequency or infrequency of
some particular belief or attitude in the past. How many examples does one need to
cite in order to prove the point? Lacking any satisfactory method of quantifying
these matters, all I can do is to record my impressions after long immersion in the
period."
Keith Thomas, The Ends of Life, Oxford University Press, 2009.
“But the sad truth is that much of what it has taken me a lifetime to build up by
painful accumulation can now be achieved by a moderately diligent student in the
course of a morning.”
Keith Thomas, Diary, London Review of Books, 10 June 2010.
Some (more or less) testable assertions
Tudor
● “The idea of a "Tudor era" in history is a misleading invention, claims an Oxford University
historian. Cliff Davies says his research shows the term "Tudor" was barely ever used
during the time of Tudor monarchs.” (http://www.bbc.co.uk/news/education-18240901,
May 2012)
Holocaust
● “I will argue that “The Holocaust” is an ideological representation of the Nazi
holocaust...Until recently, however, the Nazi holocaust barely figured in American life.
Between the end of World War II and the late 60s, only a handful of books and films
touched on the subject”. (Norman Finkelstein, The Holocaust Industry. Verso, 2000.)
State
● “...no political writer before the middle of the sixteenth century used the word 'state' in
anything like its modern political sense” [referring to the machinery of government and
social control] (Quentin Skinner, The Foundations of Modern Political Thought, Cambridge
University Press, 1978).
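Assertions like these become testable once the frequency of a word can be tracked across dated texts. The sketch below shows one simple way to do that: normalised frequency (hits per million tokens) grouped into periods. The function name, the 50-year period and the three tiny "documents" are all invented for illustration; they are not evidence for or against any of the claims quoted above, and raw counts of a word form say nothing about the sense in which it is used.

```python
from collections import defaultdict


def per_million_by_period(docs, term, period=50):
    """Return {period start year: occurrences of `term` per million tokens}.

    `docs` is an iterable of (year, tokens) pairs.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, tokens in docs:
        start = year - year % period          # e.g. 1560 -> 1550
        totals[start] += len(tokens)
        hits[start] += sum(1 for t in tokens if t == term)
    return {p: 1_000_000 * hits[p] / totals[p]
            for p in sorted(totals) if totals[p]}


# Invented toy documents, for illustration only.
docs = [
    (1520, "the king and the realm".split()),
    (1560, "the state and the crown".split()),
    (1590, "the state of the state".split()),
]
print(per_million_by_period(docs, "state"))
```

Normalising by the number of tokens in each period matters because the amount of surviving (and digitized) text varies greatly over time; a concordance of the hits is then needed to check which sense of the word each occurrence carries.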
Annotation
Annotation of texts should include structural markup, metadata, and linguistic
annotation, including:
- Standardized metadata for basic categories such as language, relevant dates,
author, title and text type;
- Part-of-speech tagging;
- Lemmatization; and
- Modernized (or otherwise normalized) forms
...and these can be the basis for further levels of annotation, such as:
- semantic tags
- named entity recognition
- etc.
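To make the layers of annotation listed above concrete, the sketch below represents each token with its original form, a normalized form, a lemma and a part-of-speech tag, alongside text-level metadata. The class, the field names, the tag labels and the example sentence are assumptions for illustration only; real corpora typically encode this information in formats such as TEI XML or vertical (one token per line) files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str   # spelling as it appears in the source text
    norm: str   # modernized / normalized spelling
    lemma: str  # dictionary headword
    pos: str    # part-of-speech tag (tag names here are illustrative)


def find(tokens, **criteria):
    """Return positions of tokens whose attributes match all given criteria."""
    return [i for i, t in enumerate(tokens)
            if all(getattr(t, key) == value for key, value in criteria.items())]


# Invented metadata and sentence, for illustration only.
metadata = {"language": "en", "date": "1590", "author": "unknown",
            "title": "Example", "text_type": "prose"}
sentence = [
    Token("The", "the", "the", "DET"),
    Token("Kinge", "king", "king", "NOUN"),
    Token("loueth", "loveth", "love", "VERB"),
    Token("his", "his", "his", "DET"),
    Token("subiects", "subjects", "subject", "NOUN"),
]

print(find(sentence, lemma="love"))   # matches regardless of spelling or inflection
print(find(sentence, pos="NOUN"))     # search by grammatical category
```

Searching on the lemma or normalized layer is what allows one query to retrieve variant spellings and inflected forms that a plain string search would miss.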
Digital scholarship in the Humanities
and Digital Science
Issues and assumptions in scientific research:
● Consensus (and compromise) about funding priorities
● Adoption of technical standards
● Standards for the representation of knowledge and interpretations (agreement on concepts and categories!)
● Reproducibility and replicability of research
● Sharing of generic tools
● Curation of tools and data in professional service centres
● Support for software sustainability
● Promotion of interoperability of resources and tools
● Sharing research outputs
● Research leading to an accumulation of knowledge
● Increasingly data-driven research
CLARIN ERIC in members and centres
Official membership
• 23 members
• 3 observers
• 1 linked party
A distributed network of >60 centres
25 CTS certified data centres,
strong focus on FAIRness & interoperability
• federated login:
• central metadata harvesting for easy discovery:
• chained services:
• language data - in written, spoken, video or multimodal form
• advanced tools - to discover, explore, exploit, annotate, analyse
or combine data sets, wherever they are located
CLARIN corpus resources and tools
Corpora: at least 4130 - see VLO (https://vlo.clarin.eu/)!
Online interfaces:
● Corpuscle
● Korp
● KonText
● NoSketch Engine
● D* (Diacollo demo)
● TEITOK
Federated content search: https://contentsearch.clarin.eu/
Resource Families:
● 13 curated guides to different types of corpora and how to get them
● Coming soon: Desktop corpus tools and Online corpus tools
Online and desktop tools for corpus analysis
“Corpus, concordance, collocation”
Diachronic collocations in a text collection: DiaCollo from the Deutsches Textarchiv
Types of Text Analysis: Further Reading
● Baker, P (2006), Using Corpora in Discourse Analysis, London: Continuum [summary and further information at https://www.lancaster.ac.uk/staff/bakerjp/usingcorpora.htm]
● Baker, P (2012), ‘Acceptable Bias? Using Corpus Linguistics Methods with Critical Discourse Analysis’, Critical Discourse Studies 9.3: 247-56.
● Bode, K (2017), ‘The Equivalence of “Close” and “Distant” Reading; or, Toward a New Object for Data-Rich Literary History’, Modern Language Quarterly 78 (1): 77–106. DOI 10.1215/00267929-3699787
● Cheng, W (2013), ‘Corpus-based linguistic approaches to critical discourse analysis’, in The Encyclopedia of Applied Linguistics (pp. 1-8), Wiley-Blackwell. https://doi.org/10.1002/9781405198431.wbeal0262 [full book chapter available from https://www.researchgate.net/publication/262070226]
● Gadd, Ian (2009), ‘The Use and Misuse of Early English Books Online’, Literature Compass 6/3: 680–692. https://doi.org/10.1111/j.1741-4113.2009.00632.x
● Hamed, D (2020), ‘Keywords and collocations in US presidential discourse since 1993: a corpus-assisted analysis’, Journal of Humanities and Applied Social Sciences, Vol. 3 No. 2, 2021, pp. 137-158, Emerald Publishing Limited, 2632-279X. DOI 10.1108/JHASS-01-2020-0019
● Kichuk, Diana (2007), ‘Metamorphosis: Remediation in Early English Books Online (EEBO)’, Literary and Linguistic Computing 22.3: 291–303 [available from https://hfroehlich.files.wordpress.com/2016/07/lit-linguist-computing-2007-kichuk-291-303.pdf]
● Leech, G N, & Short, M H (1981), Style in Fiction, London: Longman.
● Mahlberg, M (2013), Corpus Stylistics and Dickens’s Fiction, Routledge.
● Martin, Shawn (2007), ‘EEBO, Microfilm, and Umberto Eco: Historical Lessons and Future Directions for Building Electronic Collections’, Microform & Imaging Review 36.4: 159–64 [available from https://repository.upenn.edu/cgi/viewcontent.cgi?article=1072&context=library_papers]
● Showalter, E (2002), Teaching Literature, London: Wiley-Blackwell.
● Sinclair, J (1991), Corpus, Concordance, Collocation, Oxford: OUP.
● Rockwell, G (2005), ‘What is Text Analysis’ [https://web.archive.org/web/20150410205354/http://tada.mcmaster.ca/Main/WhatTA]
● Underwood, Ted (2015), ‘Seven ways humanists are using computers to understand text’ [blog post at https://tedunderwood.com/2015/06/04/seven-ways-humanists-are-using-computers-to-understand-text/]
● Unsworth, John (2008), ‘How Not To Read A Million Books’, with Tanya Clement, Sara Steger, and Kirsten Uszkalo, Harvard University, Cambridge, MA (October 2008) [blog post at https://people.brandeis.edu/~unsworth/hownot2read.rutgers.html]
● Text Analysis in ‘Tooling up for Digital Humanities’ blog at http://toolingup.stanford.edu/?page_id=981
● More information about the Text Creation Partnership: https://quod.lib.umich.edu/e/eebogroup/

More Related Content

PDF
Forty Years of the OTA
Martin Wynne
 
PDF
Corpus Approaches to the Language of Literature 2008
Martin Wynne
 
PDF
Annotated Corpora for Research in the Humanities
Martin Wynne
 
PPT
eMargin Presentation given to Skills Funding Agency
RDUES
 
PPTX
Towards greater transparency in digital literary analysis
John Lavagnino
 
PPT
Discourse Analysis for Social Research
Dominik Lukes
 
PPT
What can a corpus tell us about discourse
Pascual Pérez-Paredes
 
PPT
eMargin at #tagginganna workshop, Leicester
RDUES
 
Forty Years of the OTA
Martin Wynne
 
Corpus Approaches to the Language of Literature 2008
Martin Wynne
 
Annotated Corpora for Research in the Humanities
Martin Wynne
 
eMargin Presentation given to Skills Funding Agency
RDUES
 
Towards greater transparency in digital literary analysis
John Lavagnino
 
Discourse Analysis for Social Research
Dominik Lukes
 
What can a corpus tell us about discourse
Pascual Pérez-Paredes
 
eMargin at #tagginganna workshop, Leicester
RDUES
 

Similar to MacroMicroZoom.pdf (20)

PPTX
Extreme Reading
Giorgio Guzzetta
 
PPTX
DeconstructionTheory.pptx
jeannmontejo1
 
PPTX
The New Past, and a Speculative Future, of Literature: A Brief Discussion of ...
NatGustafsonSundell
 
PDF
Word Toys Poetry And Technics 1st Edition Brian Kim Stefans
amorssellie
 
PPTX
MDST 3703 F10 Seminar 3
Rafael Alvarado
 
PPTX
Mdst3703 culturomics-2012-11-01
Rafael Alvarado
 
PPT
Decline (and disappearance) - the negative side of recent change in standard ...
VARIENG, University of Helsinki
 
PPTX
Comparative Literature in the Age of Digital Humanities _ On Possible Future ...
InsiyafatemaAlvani
 
PDF
Using Corpora In Discourse Analysis Paul Baker
aabiouses
 
PPT
Pliny: 4 perspectives
John Bradley
 
PPT
Large-Scale Computational Research in Arts & Humanities
Jisc
 
PDF
Reading for Meaning Strategies with Subtext and Actively Learn
edutechandy
 
PPTX
Comparative Literature in the Age of Digital Humanities _ On Possible Future ...
Hina Parmar
 
PPTX
close reading close reading close reading
brandonh22978
 
PPTX
Zoss High-Level Text Analysis and Techniques
DukeDigitalScholarship
 
PPT
eMargin Presentation to BA English Students
RDUES
 
PPTX
Reading-and-Writing-Skills-Lesson-1.pptx
JoshuaQuimpoReyes
 
PPTX
Laurel Stvan dh ant_conc 2/27/13
Jessica C. Murphy
 
PDF
Discourses as Measurable Networks
Simon Lindgren
 
PPTX
From work to text by Roland barthes in English literature
bhaliyaarjanbhai867
 
Extreme Reading
Giorgio Guzzetta
 
DeconstructionTheory.pptx
jeannmontejo1
 
The New Past, and a Speculative Future, of Literature: A Brief Discussion of ...
NatGustafsonSundell
 
Word Toys Poetry And Technics 1st Edition Brian Kim Stefans
amorssellie
 
MDST 3703 F10 Seminar 3
Rafael Alvarado
 
Mdst3703 culturomics-2012-11-01
Rafael Alvarado
 
Decline (and disappearance) - the negative side of recent change in standard ...
VARIENG, University of Helsinki
 
Comparative Literature in the Age of Digital Humanities _ On Possible Future ...
InsiyafatemaAlvani
 
Using Corpora In Discourse Analysis Paul Baker
aabiouses
 
Pliny: 4 perspectives
John Bradley
 
Large-Scale Computational Research in Arts & Humanities
Jisc
 
Reading for Meaning Strategies with Subtext and Actively Learn
edutechandy
 
Comparative Literature in the Age of Digital Humanities _ On Possible Future ...
Hina Parmar
 
close reading close reading close reading
brandonh22978
 
Zoss High-Level Text Analysis and Techniques
DukeDigitalScholarship
 
eMargin Presentation to BA English Students
RDUES
 
Reading-and-Writing-Skills-Lesson-1.pptx
JoshuaQuimpoReyes
 
Laurel Stvan dh ant_conc 2/27/13
Jessica C. Murphy
 
Discourses as Measurable Networks
Simon Lindgren
 
From work to text by Roland barthes in English literature
bhaliyaarjanbhai867
 
Ad

More from Martin Wynne (8)

PDF
CLARIN Supporting Horizon Europe proposals
Martin Wynne
 
PDF
CLARIN - Corpora, corpus tools and collaboration
Martin Wynne
 
PDF
Forty-five Years of the OTA
Martin Wynne
 
PDF
Exploring rhetoric in the Electronic Enlightenment
Martin Wynne
 
PDF
Corpus Linguistics for Language Teaching and Learning
Martin Wynne
 
PDF
Big data and Digital Transformations in the Humanities
Martin Wynne
 
PDF
Hacking EEBO: colour terms
Martin Wynne
 
PDF
When will there be a digital revolution in the humanities?
Martin Wynne
 
CLARIN Supporting Horizon Europe proposals
Martin Wynne
 
CLARIN - Corpora, corpus tools and collaboration
Martin Wynne
 
Forty-five Years of the OTA
Martin Wynne
 
Exploring rhetoric in the Electronic Enlightenment
Martin Wynne
 
Corpus Linguistics for Language Teaching and Learning
Martin Wynne
 
Big data and Digital Transformations in the Humanities
Martin Wynne
 
Hacking EEBO: colour terms
Martin Wynne
 
When will there be a digital revolution in the humanities?
Martin Wynne
 
Ad

Recently uploaded (20)

DOCX
SAROCES Action-Plan FOR ARAL PROGRAM IN DEPED
Levenmartlacuna1
 
PPTX
IMMUNIZATION PROGRAMME pptx
AneetaSharma15
 
PDF
Phylum Arthropoda: Characteristics and Classification, Entomology Lecture
Miraj Khan
 
PDF
5.EXPLORING-FORCES-Detailed-Notes.pdf/8TH CLASS SCIENCE CURIOSITY
Sandeep Swamy
 
PPTX
Care of patients with elImination deviation.pptx
AneetaSharma15
 
PDF
3.The-Rise-of-the-Marathas.pdfppt/pdf/8th class social science Exploring Soci...
Sandeep Swamy
 
PPTX
NOI Hackathon - Summer Edition - GreenThumber.pptx
MartinaBurlando1
 
PPTX
PREVENTIVE PEDIATRIC. pptx
AneetaSharma15
 
PPTX
family health care settings home visit - unit 6 - chn 1 - gnm 1st year.pptx
Priyanshu Anand
 
PPTX
How to Manage Global Discount in Odoo 18 POS
Celine George
 
PPTX
Understanding operators in c language.pptx
auteharshil95
 
PDF
7.Particulate-Nature-of-Matter.ppt/8th class science curiosity/by k sandeep s...
Sandeep Swamy
 
PPTX
Tips Management in Odoo 18 POS - Odoo Slides
Celine George
 
PDF
What is CFA?? Complete Guide to the Chartered Financial Analyst Program
sp4989653
 
PPTX
An introduction to Prepositions for beginners.pptx
drsiddhantnagine
 
PDF
UTS Health Student Promotional Representative_Position Description.pdf
Faculty of Health, University of Technology Sydney
 
PPTX
vedic maths in python:unleasing ancient wisdom with modern code
mistrymuskan14
 
PPT
Python Programming Unit II Control Statements.ppt
CUO VEERANAN VEERANAN
 
PPTX
How to Manage Leads in Odoo 18 CRM - Odoo Slides
Celine George
 
PPTX
Congenital Hypothyroidism pptx
AneetaSharma15
 
SAROCES Action-Plan FOR ARAL PROGRAM IN DEPED
Levenmartlacuna1
 
IMMUNIZATION PROGRAMME pptx
AneetaSharma15
 
Phylum Arthropoda: Characteristics and Classification, Entomology Lecture
Miraj Khan
 
5.EXPLORING-FORCES-Detailed-Notes.pdf/8TH CLASS SCIENCE CURIOSITY
Sandeep Swamy
 
Care of patients with elImination deviation.pptx
AneetaSharma15
 
3.The-Rise-of-the-Marathas.pdfppt/pdf/8th class social science Exploring Soci...
Sandeep Swamy
 
NOI Hackathon - Summer Edition - GreenThumber.pptx
MartinaBurlando1
 
PREVENTIVE PEDIATRIC. pptx
AneetaSharma15
 
family health care settings home visit - unit 6 - chn 1 - gnm 1st year.pptx
Priyanshu Anand
 
How to Manage Global Discount in Odoo 18 POS
Celine George
 
Understanding operators in c language.pptx
auteharshil95
 
7.Particulate-Nature-of-Matter.ppt/8th class science curiosity/by k sandeep s...
Sandeep Swamy
 
Tips Management in Odoo 18 POS - Odoo Slides
Celine George
 
What is CFA?? Complete Guide to the Chartered Financial Analyst Program
sp4989653
 
An introduction to Prepositions for beginners.pptx
drsiddhantnagine
 
UTS Health Student Promotional Representative_Position Description.pdf
Faculty of Health, University of Technology Sydney
 
vedic maths in python:unleasing ancient wisdom with modern code
mistrymuskan14
 
Python Programming Unit II Control Statements.ppt
CUO VEERANAN VEERANAN
 
How to Manage Leads in Odoo 18 CRM - Odoo Slides
Celine George
 
Congenital Hypothyroidism pptx
AneetaSharma15
 

MacroMicroZoom.pdf

  • 1. Microscope, macroscope and zoom lens: close, distant and scalable reading in the Humanities Digital Humanities Summer School, University of Oxford, 7th July 2023 Martin Wynne Senior Researcher in Corpus Linguistics Faculty of Linguistics, Philology and Phonetics https://orcid.org/0000-0002-4155-0530 [email protected]
  • 2. Martin Wynne Text Analysis 2 Summary ● What is text analysis? ● Close reading, distant reading, textual interpretation ● Corpus linguistics: the vanguard of digital humanities
  • 3. Martin Wynne Text Analysis 3 Types of text analysis ● the study of rhetoric ('how language is used to persuade') ● close reading ('focus on the words') ● stylistics ('study of the language of literature') ● stylometry ('quantifying aspects of the language of texts, especially for authorship attribution and investigating genres') ● corpus linguistics ('developing and analysing large electronic datasets representative of particular language varieties') ● distant reading ('studying and doing things with more text than you can read') ● macroanalysis (‘plotting features in large corpora over time’) ● discourse analysis ('categorizing and analysing structural elements of discourse') ● critical corpus discourse analysis ('using corpus linguistic methods to reveal hidden agendas and motivations in texts') ● deconstruction ('what the words are not saying or failing to say') ● forensic linguistics (‘gathering legal evidence about attribution and meaning’) ● qualitative social science ('annotation and analysis of interviews, survey results, etc.') ● ...and more...
  • 4. Martin Wynne Text Analysis 4 My meaning of text analysis A diverse, open and fluid set of methods, datasets and tools, to be used in support of a variety of research processes, with the aim of interpreting texts
  • 5. Martin Wynne Text Analysis 5 ...is not just for linguists, but also not for literary scholars, historians, political scientists, sociologists, journalists, activists, forensic scientists… ...and it can be useful for all or any of them My sort of text analysis
  • 6. Martin Wynne Text Analysis 6 “Text analysis tools aid the interpreter asking questions of texts” Geoffrey Rockwell https://web.archive.org/web/20150410205354/http://tada.mcmaster.ca/Main/WhatTA
  • 7. Martin Wynne Text Analysis 7 Methods and Techniques Search - search large texts quickly - search a collection of texts or corpus - search enhanced by linguistic annotation - complex searches Analyze - patterns of words - collocations - expanded co-text around words - wordlists, keywords - clusters, ngrams Compare - compare texts - compare sections of a text with each other - compare a text with a reference corpus Visualize - concordances - distribution of features in a text or corpus
  • 8. Close Reading Elaine Showalter describes close reading as: “...slow reading, a deliberate attempt to detach ourselves from the magical power of story-telling and pay attention to language, imagery, allusion, intertextuality, syntax and form.” It is, in her words, ‘a form of defamiliarisation we use in order to break through our habitual and casual reading practices’ (Teaching Literature, p.98). Further introductory reading: ● https://www.york.ac.uk/english/writing-at-york/writing-resources/close-reading/ ● https://writingcenter.fas.harvard.edu/pages/how-do-close-reading ● http://theliterarylink.com/closereading.html
  • 9. Close Reading ● Traditional criticism (biographical, social, historical, psychological) ...the new paradigm... ● Practical criticism, New Criticism (concentrate on the ‘words on the page’) ● Hermeneutics (theory and practice of interpretation) ● Interpretation (always provisional, never final) ● Inductive reasoning (not deductive and mathematical, based on experience and probabilities) ...the next new paradigm... ● New Historicism (literature must be understood in its historical context) ... https://www.english.cam.ac.uk/classroom/pracrit.htm https://www.oxfordbibliographies.com/view/document/obo-9780190221911/obo-978019022191 1-0015.xml
  • 10. Close reading as the paradigm for text-based humanities scholarship
  • 11. But what do you do with a million books? There are only about 30,000 days in a human life -- at a book a day, it would take 30 lifetimes to read a million books and our research libraries contain more than ten times that number. Only machines can read through the 400,000 books already publicly available for free download from the Open Content Alliance.  Gregory Crane, “What do you do with a million books?” D-Lib Magazine, March 2006
  • 12. And 5 million books? We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of “culturomics” focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. “Culturomics” extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities. www.sciencexpress.org / 16 December 2010
  • 14. Distant reading: where distance, let me repeat it, is a condition of knowledge: it allows you to focus on units that are much smaller or much larger than the text: devices, themes, tropes—or genres and systems. And if, between the very small and the very large, the text itself disappears, well, it is one of those cases when one can justifiably say, less is more. If we want to understand the system in its entirety, we must accept losing something. We always pay a price for theoretical knowledge: reality is infinitely rich; concepts are abstract, are poor. But it’s precisely this ‘poverty’ that makes it possible to handle them, and therefore to know. This is why less is actually more. Franco Moretti, “Conjectures on World Literature” Distant Reading, 2013.
  • 15. Distant Reading A canon of 200 novels, for instance, sounds very large for 19th-century Britain (and is much larger than the current one), but is still less than 1% of the novels that were actually published […] and close reading won’t help here, a novel a day every day of the year would take a century or so … And it’s not even a matter of time, but of method: a field this large cannot be understood by stitching together separate bits of knowledge about individual cases, because it isn’t a sum of individual cases: it’s a collective system, that should be grasped as such, as a whole. Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History, 2005
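The two figures in this passage can be checked against each other. The sketch below assumes that "a novel a day ... would take a century or so" implies roughly 365 × 100 novels; that reading is an inference from the quotation, not a number Moretti gives here.

```python
# Assumption: "a novel a day ... a century or so" means about 365 * 100 novels.
NOVELS_PER_DAY = 1
DAYS_PER_YEAR = 365
YEARS = 100
CANON_SIZE = 200  # figure given in the quotation

implied_total = NOVELS_PER_DAY * DAYS_PER_YEAR * YEARS
canon_share = CANON_SIZE / implied_total

print(implied_total)         # 36500
print(f"{canon_share:.2%}")  # 0.55%
```

On that reading a 200-novel canon is about half of one percent of the field, consistent with "less than 1%".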
  • 18. What are we ultimately aiming for when it comes to digital scholarship in the Humanities? Ways to combine close reading with big data approaches.
  • 19. From “distant” (not) reading to close reading and back again... Digital Humanities as a locus for “scalable” reading practices DATA: digitally assisted text analysis Martin Mueller, Northwestern
  • 20. What do you need to know in order to move to interpretation? 1. You need to know what’s in your dataset. 2. You need to know how to find what you are looking for. 3. You need to know how to make sense of what you find.
  • 21. Software tools ● AntConc ● Sketch Engine ● CQPweb ● #LancsBox ● English-corpora.org ● KonText ● Voyant Tools ● CliC ● Hansard at Huddersfield ● ...and more
  • 22. Finding resources ● CLARIN Virtual Language Observatory (https://vlo.clarin.eu/) ● CLARIN Resource Families (https://www.clarin.eu/resource-families/)
  • 23. Corpus Query Tools: a CLARIN Resource Family https://www.clarin.eu/resource-families/corpus-query-tools
  • 24. The 'aftermath' of the seminar Subject: Les Francais des Corpus – Aftermath Dear colleagues, First, many thanks for presenting at/attending the Francais des Corpus Workshop and for making it such a success. I promised I would keep you in touch with one another and hope that the full list of your e-mail addresses above makes that possible. …
  • 25. KWIC concordance from the Written BNC2014 (a representative corpus of British English released in 2021), generated in #LancsBox X.
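A KWIC (keyword in context) concordance lists every occurrence of a node word with a fixed amount of text on either side, aligned on the node. The sketch below shows the idea only; it is not how #LancsBox builds its concordances, the function name `kwic` is illustrative, and the sample sentences are invented rather than taken from the BNC2014.

```python
import re

def kwic(text, keyword, width=30):
    """Return keyword-in-context lines: left context, node word, right context."""
    lines = []
    # Whole-word, case-insensitive match of the node word.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the node words line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Invented sample text, for illustration only.
sample = ("In the aftermath of the storm, the village was quiet. "
          "Few spoke of the aftermath, and fewer still of its causes.")
for line in kwic(sample, "aftermath", width=20):
    print(line)
```

Aligning the node word in one column is what lets a reader scan the left and right contexts for recurring patterns.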
  • 26. 'aftermath' Collocates: War Gulf coup World disaster Tiananmen death revolution defeat Chernobyl affair riots battle massacre wars election Crisis events explosion invasion trial fire June Square victory accident attempt Significant collocates in the British National Corpus (a representative corpus of British English released in 1994). BNCWeb parameters: There are 1486 different types in your collocation database for the query "[word="aftermath"%c] [word="of"%c]". (Your query "aftermath of" returned 544 hits in 337 different texts) The selected range was 1 to 4. Corpus basis for calculation: the whole BNC. Type of calculation: Log-likelihood Tag restriction: any noun Collocates occur at least 5 times in the whole BNC. Words collocate at least 5 times.
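The collocates above are ranked by log-likelihood. A minimal sketch of that statistic for a single node-collocate pair follows, assuming the counts have already been arranged in a 2×2 contingency table; how BNCweb derives its table from the window and frequency settings is not reproduced here, and the counts in the usage lines are invented.

```python
import math

def log_likelihood(o11, o12, o21, o22):
    """Log-likelihood (G2) for a 2x2 contingency table of observed counts."""
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of node and collocate.
            expected = rows[i] * cols[j] / n
            if observed[i][j] > 0:  # a zero cell contributes nothing
                g2 += observed[i][j] * math.log(observed[i][j] / expected)
    return 2 * g2

# Invented counts: collocate inside the window, collocate elsewhere,
# other words inside the window, other words elsewhere.
score = log_likelihood(30, 70, 970, 98_930)
print(round(score, 2))
```

The score is zero when the observed counts match what independence predicts and grows as they diverge, so it measures strength of evidence for an association rather than its direction.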
  • 27. J. R. Firth (1890-1960) “The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously.” J. R. Firth (1935). "The Technique of Semantics." Transactions of the Philological Society, 36-72; p. 37 (Reprinted in Firth (1957).) “You shall know a word by the company it keeps.” J. R. Firth (1957). "Papers in Linguistics, 1934-1951". Oxford: Oxford University Press.
  • 28. What is a corpus? “…a collection of pieces of language, selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.” (Sinclair 1996)
  • 29. What is Corpus Linguistics? (1) Focus on linguistic performance, rather than competence (2) Focus on linguistic description, rather than linguistic universals (3) Focus on quantitative, as well as qualitative models of language (4) Focus on a more empiricist, rather than rationalist view of scientific inquiry. (Leech 1992)
  • 30. AntConc: explore your own texts and corpora ● Download for free from https://www.laurenceanthony.net/software/antconc/ ● Use with any 'plain text' ● Multilingual capabilities ● Does not interpret mark-up or metadata
  • 31. #LancsBox Download for free from https://lancsbox.lancs.ac.uk/ ● Works with your own data or existing corpora ● Visualizes language data ● Analyses data in any language ● Automatically annotates data for part-of-speech (for some languages) ● Wizard tool produces a prose report ● Works with major operating systems (Windows, Mac, Linux) ● Latest version #LancsBox X launched 2023
  • 32. CQPweb: Online interface for indexed corpora http://cqpweb.lancs.ac.uk ...but now also with a new feature to upload data, in limited ways...
  • 33. Sketch Engine: an online interface for your corpus https://www.sketchengine.eu/ Access to Sketch Engine is by paid subscription. Individual licences are available from €6.56 per month, with free trials available.
  • 34. A new opportunity "It is not easy to justify assertions about the alleged frequency or infrequency of some particular belief or attitude in the past. How many examples does one need to cite in order to prove the point? Lacking any satisfactory method of quantifying these matters, all I can do is to record my impressions after long immersion in the period." Keith Thomas, The Ends of Life, Oxford University Press, 2009. “But the sad truth is that much of what it has taken me a lifetime to build up by painful accumulation can now be achieved by a moderately diligent student in the course of a morning.” Keith Thomas, Diary, London Review of Books, 10 June 2010.
  • 35. Some (more or less) testable assertions Tudor ● “The idea of a "Tudor era" in history is a misleading invention, claims an Oxford University historian. Cliff Davies says his research shows the term "Tudor" was barely ever used during the time of Tudor monarchs.” (http://www.bbc.co.uk/news/education-18240901, May 2012) Holocaust ● “I will argue that “The Holocaust” is an ideological representation of the Nazi holocaust...Until recently, however, the Nazi holocaust barely figured in American life. Between the end of World War II and the late 60s, only a handful of books and films touched on the subject”. (Norman Finkelstein, The Holocaust Industry. Verso, 2000.) State ● “...no political writer before the middle of the sixteenth century used the word 'state' in anything like its modern political sense” [referring to the machinery of government and social control] (Quentin Skinner, The Foundations of Modern Political Thought, Cambridge University Press, 1978).
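Claims like these can be turned into a frequency question: how often does a term occur in each period, relative to the amount of text from that period? The sketch below assumes a corpus supplied as (year, text) pairs; the function name and the toy data are hypothetical. A plain string match cannot separate the senses of a word such as 'state', nor cope with historical spelling variation, so it is a starting point for reading the hits, not a test of the claim in itself.

```python
import re
from collections import defaultdict

def per_million_by_period(documents, term, period=50):
    """documents: iterable of (year, text). Returns {period_start: hits per million words}."""
    hits = defaultdict(int)
    tokens = defaultdict(int)
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    for year, text in documents:
        start = (year // period) * period  # first year of the period
        hits[start] += len(pattern.findall(text))
        tokens[start] += len(re.findall(r"\w+", text))
    # Normalize by period size so that periods with more text are comparable.
    return {p: 1_000_000 * hits[p] / tokens[p] for p in sorted(tokens) if tokens[p]}

# Toy data, invented for illustration.
docs = [(1510, "the king and the state"),
        (1590, "the state of the realm"),
        (1620, "a b c d")]
print(per_million_by_period(docs, "state", period=100))
# {1500: 200000.0, 1600: 0.0}
```

Normalizing per million words matters because surviving text is unevenly distributed over time; raw counts would mostly reflect how much text each period contributes.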
  • 36. Annotation Annotation of texts should include structural markup, metadata, and linguistic annotation, including: - Standardized metadata for basic categories such as language, relevant dates, author, title and text type; - Part-of-speech tagging; - Lemmatization; and - Modernized (or otherwise normalized) forms ...and these can be the basis for further levels of annotation, such as: - semantic tags - named entity recognition - etc.
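One way to picture these annotation layers is as fields on each token plus metadata on each document. The classes below are a hypothetical sketch, not a standard format; the field names, the tag values and the example record are all invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    form: str        # spelling as it appears in the source text
    normalized: str  # modernized (or otherwise normalized) spelling
    lemma: str       # dictionary headword
    pos: str         # part-of-speech tag (placeholder tagset)

@dataclass
class Document:
    # Basic metadata categories named on the slide.
    title: str
    author: str
    date: str
    language: str
    text_type: str
    tokens: list = field(default_factory=list)

def find_by_lemma(doc, lemma):
    """Return the original spellings of every token with the given lemma."""
    return [t.form for t in doc.tokens if t.lemma == lemma]

# Invented example record.
doc = Document(
    title="Example text", author="Unknown", date="1600",
    language="en", text_type="verse",
    tokens=[Token("loued", "loved", "love", "VERB"),
            Token("her", "her", "she", "PRON"),
            Token("loue", "love", "love", "NOUN")],
)
print(find_by_lemma(doc, "love"))  # ['loued', 'loue']
```

Searching on the lemma layer retrieves variant spellings and inflected forms in one query, which a search on the raw text would miss.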
  • 39. Digital scholarship in the Humanities and Digital Science Issues and assumptions in scientific research: ● Consensus (and compromise) about funding priorities ● Adoption of technical standards ● Standards for the representation of knowledge and interpretations (agreement on concepts and categories!) ● Reproducibility and replicability of research ● Sharing of generic tools ● Curation of tools and data in professional service centres ● Support for software sustainability ● Promotion of interoperability of resources and tools ● Sharing research outputs ● Research leading to an accumulation of knowledge ● Increasingly data-driven research
  • 40. CLARIN ERIC in members and centres Official membership • 23 members • 3 observers • 1 linked party A distributed network of >60 centres 25 CTS certified data centres, strong focus on FAIRness & interoperability • federated login: • central metadata harvesting for easy discovery: • chained services: • language data - in written, spoken, video or multimodal form • advanced tools - to discover, explore, exploit, annotate, analyse or combine data sets, wherever they are located
  • 41. CLARIN corpus resources and tools Corpora: at least 4130 - see VLO (https://vlo.clarin.eu/) ! Online interfaces: ● Corpuscle ● Korp ● KonText ● NoSketch Engine ● D* (Diacollo demo) ● TEITOK Federated content search: https://contentsearch.clarin.eu/ Resource Families: ● 13 curated guides to different types of corpora and how to get them ● Coming soon: Desktop corpus tools and Online corpus tools
  • 43. Online and desktop tools for corpus analysis “Corpus, concordance, collocation”
  • 44. Diachronic collocations in a text collection: DiaCollo from the Deutsches Textarchiv
  • 45. Diachronic collocations in a text collection: DiaCollo from the Deutsches Textarchiv
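A diachronic collocation profile asks which words occur near a node word in each period, so that changes in its company can be followed over time. The sketch below shows the general idea with raw co-occurrence counts in a fixed window; it is not DiaCollo's implementation, and the function name, parameters and toy data are assumptions for illustration.

```python
from collections import Counter, defaultdict

def collocates_by_period(documents, node, window=4, period=10):
    """documents: iterable of (year, list_of_tokens).

    Returns {period_start: Counter of words within `window` tokens of `node`}.
    """
    profile = defaultdict(Counter)
    for year, tokens in documents:
        start = (year // period) * period  # first year of the period
        lowered = [t.lower() for t in tokens]
        for i, tok in enumerate(lowered):
            if tok == node:
                # Words to the left and right of the node, excluding the node itself.
                context = lowered[max(0, i - window):i] + lowered[i + 1:i + 1 + window]
                profile[start].update(context)
    return profile

# Toy data, invented for illustration.
docs = [(1805, "the aftermath of war".split()),
        (1912, "aftermath of the strike".split())]
profile = collocates_by_period(docs, "aftermath", window=2, period=100)
print(profile[1800].most_common())
print(profile[1900].most_common())
```

In practice the raw counts for each period would be passed through an association measure, such as log-likelihood, before the periods are compared.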
  • 48. Types of Text Analysis: Further Reading ● Baker, P (2006), Using Corpora in Discourse Analysis, London: Continuum [summary and further information at https://www.lancaster.ac.uk/staff/bakerjp/usingcorpora.htm ] ● Baker, P (2012), ‘Acceptable Bias? Using Corpus Linguistics Methods with Critical Discourse Analysis’, Critical Discourse Studies 9.3 (2012): 247-56. ● Bode, K (2017), The Equivalence of “Close” and “Distant” Reading; or, Toward a New Object for Data-Rich Literary History, Modern Language Quarterly (2017) 78 (1): 77–106. DOI 10.1215/00267929-3699787 ● Cheng, W. (2013). ‘Corpus-based linguistic approaches to critical discourse analysis’. In The Encyclopedia of Applied Linguistics (pp. 1-8). Wiley-Blackwell. https://doi.org/10.1002/9781405198431.wbeal0262 [full book chapter available from https://www.researchgate.net/publication/262070226] ● Gadd, Ian. ‘The Use and Misuse of Early English Books Online’ in Literature Compass 6/3 (2009): 680–692 https://doi.org/10.1111/j.1741-4113.2009.00632.x ● Hamed, D (2020), ‘Keywords and collocations in US presidential discourse since 1993: a corpus-assisted analysis’, in Journal of Humanities and Applied Social Sciences, Vol. 3 No. 2, 2021 pp. 137-158 Emerald Publishing Limited 2632-279X DOI 10.1108/JHASS-01-2020-0019 ● Kichuk, Diana. ‘Metamorphosis: Remediation in Early English Books Online (EEBO)’. Literary and Linguistic Computing 22.3 (2007): 291–303. [available from https://hfroehlich.files.wordpress.com/2016/07/lit-linguist-computing-2007-kichuk-291-303.pdf ] ● Leech, G. N., & Short, M. H. (1981). Style in Fiction. London: Longman. ● Mahlberg, M. (2013), Corpus Stylistics and Dickens’s Fiction, Routledge. ● Martin, Shawn. ‘EEBO, Microfilm, and Umberto Eco: Historical Lessons and Future Directions for Building Electronic Collections’. Microform & Imaging Review 36.4 (2007): 159–64 [available from https://repository.upenn.edu/cgi/viewcontent.cgi?article=1072&context=library_papers ] ● Showalter, E (2002), Teaching Literature, London: Wiley-Blackwell. ● Sinclair, J (1991), Corpus, Concordance, Collocation, Oxford: OUP. ● Rockwell, G (2005), ‘What is Text Analysis’ [https://web.archive.org/web/20150410205354/http://tada.mcmaster.ca/Main/WhatTA] ● Underwood, Ted (2015), Seven ways humanists are using computers to understand text. (blog post at https://tedunderwood.com/2015/06/04/seven-ways-humanists-are-using-computers-to-understand-text/ ) ● John Unsworth, “How Not To Read A Million Books,” with Tanya Clement, Sara Steger, and Kirsten Uszkalo, Harvard University, Cambridge, MA (October 2008) [blog post at https://people.brandeis.edu/~unsworth/hownot2read.rutgers.html ] ● Text Analysis in ‘Tooling up for Digital Humanities’ blog at http://toolingup.stanford.edu/?page_id=981 ● More information about the Text Creation Partnership https://quod.lib.umich.edu/e/eebogroup/