Quantitative Corpus Linguistics with R, Second Edition
List of Contents

5.3.3 The Reduction of to be Before Verbs 212
5.3.4 Verb Collexemes After Must 215
5.3.5 Noun Collocates After Speed Adjectives in COCA (Fiction) 218
5.3.6 Collocates of Will and Shall in COHA (1810–1890) 221
5.3.7 Split Infinitives 225
5.4 Other Applications 228
5.4.1 Corpus Conversion: the ICE-GB 228
5.4.2 Three Indexing Applications 231
5.4.3 Playing With CELEX 235
5.4.4 Match All Numbers 237
5.4.5 Retrieving Adjective Sequences From Untagged Corpora 237
5.4.6 Type-Token Ratios/Vocabulary Growth: Hamlet vs. Macbeth 242
5.4.7 Hyphenated Forms and Their Alternative Spellings 248
5.4.8 Lexical Frequency Profiles 251
5.4.9 CHAT Files 1: Eve’s MLUs and ttrs 257
5.4.10 CHAT Files 2: Merging Multiple Files 263

6 Next Steps . . .  269

Appendix 271
Index 272
Figures

2.1 Representational format of corpus files and data frames 15
3.1 Representational format of corpus files and data frames 25
3.2 The contents of <_qclwr2/_inputfiles/dat_vector-a.txt> 37
3.3 The contents of <_qclwr2/_inputfiles/dat_vector-b.txt> 38
3.4 An example data frame 52
3.5 A few words from the BNC World Edition (SGML format) 108
3.6 The XML representation of “It” in the BNC World Edition 117
3.7 The XML representation of the number of sentence tags in the
BNC World Edition 118
3.8 A hypothetical XML representation of “It” in the BNC World Edition 118
3.9 The topmost hierarchical parts of BNC World Edition:
<corp_D94_part.xml> 119
3.10 The header of <corp_D94_part.xml> 120
3.11 The stext part of <corp_D94_part.xml> 122
3.12 The teiHeader/profileDesc part of <corp_H00.xml> 126
4.1 Interaction plot for GramRelation × ClauseType 1 146
4.2 Interaction plot for GramRelation × ClauseType 2 146
4.3 Interaction plot for GramRelation × ClauseType 3 147
4.4 Mosaic plot of the distribution of verb-particle constructions
in Gries (2003a) 156
4.5 Association plot of the distribution of verb-particle constructions
in Gries (2003a) 159
4.6 Average fictitious temperatures of two cities 162
4.7 Plots of the temperatures of the two cities 165
4.8 Plots of the lengths of subjects and objects 168
4.9 Scatterplots of the lengths of words in syllables and words 172
5.1 Dispersion results for perl in its Wikipedia entry 183
5.2 The annotation of even when as a multi-word unit 204
5.3 The first few lines of <wlp_fic_1990.txt> 218
5.4 Desired result of transforming Figure 5.2 into the COCA format
of Figure 5.3 221
5.5 A 3L-3R collocate display of shall as a modal verb (1810–1819
in COHA) 223
5.6 Three sentences from ICE-GB Release 2 229
5.7 The same three sentences after processing 230
5.8 Problematic (for us) uses of brackets in the ICE-GB Release 2 230
5.9 Three lines from CELEX, <EPL.CD> 235
5.10 Three lines from CELEX, <ESL.CD> 235
5.11 Thirty-nine character strings representing numbers to match and
one not to match 237
5.12 The first six lines of <BROWN1_A.txt> 241
5.13 The vocabulary-growth curve of tokens 243
5.14 The first ten lines of <baseword1.txt> 252
5.15 Excerpt of a file from the CHILDES database annotated in
the CHAT format 258
5.16 Intended output of the case study 263
Tables

2.1 Examples of differently ordered frequency lists 12
2.2 A collocate display of alphabetic based on the BNC 16
2.3 A collocate display of alphabetical based on the BNC 17
2.4 An example display of a concordance of before and after (sentence display) 18
2.5 An example display of a concordance of before and after (tabular) 19
4.1 Fictitious data set for a study on constituent lengths 145
4.2 A bad data frame 149
4.3 A better data frame 150
4.4 Observed distribution of verb-particle constructions in Gries (2003a) 152
4.5 Expected distribution of verb-particle constructions in Gries (2003a) 152
4.6 Observed distribution of alphabetical and order in the BNC 159
4.7 Structure of a quantitative corpus-linguistic paper 174
5.1 The frequency of w (=“perl”) and all other words in two ‘corpus files’ 197
5.2 Frequencies of sentences with and without alphabetical and order 208
5.3 Frequencies of must/other admit/other in BNC K 215
5.4 Desired co-occurrence results to be extracted from COCA: fiction 218
Acknowledgments

This book is dedicated to the people who have been so kind as to be part of what I might self-deprecatingly call my ‘support network’; they are, in alphabetical order of last names: PMC, SCD, MIF, BH, S[LW], MN, H[RW], and DS – I am very grateful for all you’ve done and for all your tolerance over the last year or so! I wish to thank the team at Routledge
for their interest in, and support of, a second edition of this textbook; also, I am grateful
to the members of my corpus linguistics and statistics newsgroups for their questions,
suggestions, and feedback on various issues and topics that have now made it into this
second edition. Finally, I am grateful to many students and participants of classes, summer
schools, and workshops/bootcamps where parts of this book were used.
1 Introduction

1.1 Why Another Introduction to Corpus Linguistics?


In some sense at least, this book is an introduction to corpus linguistics. If you are a little familiar with the field, this probably immediately triggers the question “Why yet another introduction to corpus linguistics?” This is a valid question because, given the upsurge of studies using corpus data in linguistics, there are already quite a few very good introductions available. Do we really need another one? Predictably, I think the answer is still “yes” and “yes, even a second edition,” and the reason is that this introduction is radically different from every other introduction to corpus linguistics out there. For example, there are a lot of things that are regularly dealt with at length in introductions to corpus linguistics that I will not talk about much:

• the history of corpus linguistics: Kaeding, Fries, early 1m-word corpora, up to the contemporary giga corpora and the still lively web-as-corpus discussion;
• how to compile corpora: size, sampling, balancedness, representativity;
• how to create corpus markup and annotation: lemmatization, tagging, parsing;
• kinds and examples of corpora: synchronic vs. diachronic, annotated vs. unannotated;
• what kinds of corpus-linguistic research have been done.

That is to say, rather than telling you about the discipline of corpus linguistics – its history,
its place in linguistics, its contributions to different fields, etc. – with this book, I will ‘only’
teach you how to do corpus-linguistic data processing with the programming language
R (see McEnery and Hardie 2011 for an excellent recent introduction). In other words,
this book presupposes that you know what you would like to explore but gives you tools
to do it that go beyond what most commonly used tools can offer and, thus, hopefully
also open up your minds about how to approach your corpus-linguistic questions. This
is important since, to me, corpus linguistics is a method of analysis, so talking about how
to do things should enjoy a high priority (see Gries 2010 and the rest of that special issue,
as well as Gries 2011 for my subjective takes on this matter). Therefore, I will mostly be
concerned with:

• aspects of how exactly data are retrieved from corpora to be used in linguistically informed analyses, specifically how to obtain from corpora frequency lists, dispersion information, collocation displays, concordances, etc. (see Chapter 2 for explanation and exemplification of these terms);
• aspects of data manipulation and evaluation: how to process and convert corpus data; how to save various kinds of results; how to import them into a spreadsheet program for further annotation; how to analyze results statistically; how to represent the results graphically; and how to report your results.
A second important characteristic of this book is that it only uses freely available software:

• R, the corpus linguist’s all-purpose tool (cf. R Core Team 2016): software that is a calculator, a statistics program, a (statistical) graphics program, and a programming language at the same time. The versions used in this book are R (www.r-project.org) and the freely available Microsoft R Open 3.3.1 (https://mran.revolutionanalytics.com/open, the versions for Ubuntu 16.04 LTS (or Mint 18) and Microsoft Windows 10);
• RStudio 0.99.1294 (www.rstudio.com);
• LibreOffice 5.2.0.4 (www.libreoffice.org).

The choice of these software tools, especially the decision to use R, has a number of important implications, which should be mentioned early on. As I just mentioned, R is a full-fledged multi-purpose programming language and, thus, a very powerful tool. However, this degree of power does come at a cost: In the beginning, it is undoubtedly more difficult to do things with R than with ready-made (free or commercial) concordancing software that has been written specifically for corpus-linguistic applications. For example, if you want to generate a frequency list of a corpus or a concordance of a word in a corpus with R, you must write a small script or a little bit of code in a programming language, which is the technical way of saying you write lines of text that are instructions to R. If you do not need pretty output, this script may consist of just a few lines, but it will often also be longer than that. On the other hand, if you have a ready-made concordancer, you click a few buttons (and enter a search term) to get the job done. One may therefore ask: why go through the trouble of learning R? There are a variety of very good reasons for this, some of them related to corpus linguistics, some more general.
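To give you a first taste of what such a script looks like – a minimal sketch in which the file name <corpus.txt> and the tokenization regex are made up for illustration – here is a word frequency list in a few lines of R:

corpus.lines <- readLines("corpus.txt")   # read a (hypothetical) corpus file
# switch to lower case and split at runs of non-letters to get word tokens
words <- unlist(strsplit(tolower(corpus.lines), "[^a-z]+"))
words <- words[nchar(words) > 0]          # discard empty strings left by splitting
sort(table(words), decreasing = TRUE)     # frequency list, most frequent types first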
First, let me address this very argument, which is often made against using R (or other programming languages): why spend a lot of time and effort learning a programming language if you can get results from ready-made software within minutes? With regard to the time that goes into learning R, yes, there is a learning curve. However, that curve may not be as steep as you think: Many participants in my bootcamps and other workshops develop a first good understanding of R that allows them to begin to proceed on their own within just a few days. Plus, being able to program is an extremely useful skill not only for academic purposes, but also for jobs outside of academia; I would go so far as to say that learning to program is extremely useful in how it develops, or hones, a particular way of analytical and rigorous thinking that is useful in general. With regard to the time that goes into writing a script, much of that work usually needs to be undertaken only once. As you will see below, once you have written your first few scripts while going through this book, you can usually reuse (parts of) them for many different tasks and corpora, and the amount of time that is required to perform a particular task becomes very similar to that of using a ready-made program. In fact, nearly all corpus-linguistic tasks in my own research are done with (somewhat adjusted) scripts or small snippets of code from this book. In addition, once you explore how to write your own functions (see Section 3.10), you can easily write versatile or specialized functions of your own; I will make several of those available in subsequent chapters. This way, the actual effort of generating a frequency list, a collocate display, a dispersion plot, etc. often reduces to about the time you need with a concordance program. In fact, R may even be faster than competing applications: For example, some concordance programs read in the corpus files once before they are processed and then again for performing the actual task – R requires only one pass and may, therefore, outperform some competitors in terms of processing time.
Another point related to the notion that programming knowledge is useful: The knowledge you will acquire by working through this book is quite general, and I mean that in a good way. This is because you will not be restricted to just one particular software application (or even one version of one particular software application) and its restricted set of features. Rather, you will acquire knowledge of a programming language and regular expressions which will allow you to use many different utilities and to understand scripts in other programming languages, such as Perl or Python. (At the same time, I think R is simpler than Perl or Python, but can also interface with them via RSPerl and RSPython, respectively; see www.omegahat.org.) For example, if you ever come across scripts by other people or decide to turn to these languages yourself, you will benefit from knowing R in a way that no ready-made concordancing software would allow for. If you are already a bit familiar with corpus-linguistic work, you may now think “but why turn to R and not use Perl or Python (especially since you say Perl and Python are similar anyway and many people already use one of these languages)?” This is a good question, and I myself used Perl for corpus processing before I turned to R. However, I think I also have a good answer to why one should use R instead. First, the issue of speed is much less of a problem than one may think. R is fast enough and stable enough for most applications (especially if you heed some of the advice given in Sections 3.6.3 and 3.10). Thus, if a script takes a bit of time, you can simply run it over lunch, while you are in class, or even overnight and collect the results afterwards. Second, R has other advantages. The main one is probably that, in addition to text-processing capabilities, R offers a large number of ready-made functions for the statistical evaluation and graphical representation of data, which allows you to perform just about all corpus-linguistic tasks within only one programming environment. You can do your data processing, data retrieval, annotation, statistical evaluation, graphical representation . . . everything within just one environment, whereas if you wanted to do all these things in Perl or Python, you would require a huge amount of separate programming. Consider a very simple example: R has a function called table that generates a frequency table. To achieve the same in Perl, you would either have to write a small loop that counts the elements in an array and increments their frequencies in a hash in a stepwise fashion or, later and more cleverly, program a subroutine which you would then always call upon. While this is no problem with a one-dimensional frequency list, it is much harder with multidimensional frequency tables: Perl’s arrays of arrays or hashes of arrays etc. are not for the faint-hearted, whereas R’s table is easy to handle, and additional functions (xtabs, ftable, etc.) allow you to handle such tables very easily. I believe learning one environment can be sufficiently hard for beginners, and therefore recommend using the more comprehensive environment with the greater number of simpler functions, which to me clearly is R. And, once you have mastered the fundamentals of R and face situations in which you need maximal computational power, switching to Perl or Python in a limited number of cases will be easier for you anyway, especially since much of these languages’ syntax is similar and the regular expressions used in this book are all Perl compatible. (Let me tell you, though, that in all my years using R, there were a mere two instances where I had to switch to Perl, and that was only because I didn’t yet know how to solve a particular problem in R.)
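As a minimal sketch of this point (the two annotation vectors below are invented for illustration), cross-tabulation in R really is a one-liner, and ftable flattens higher-dimensional tables into a readable display:

# two hypothetical annotation vectors, one element per corpus hit
tense  <- c("past", "past", "pres", "pres", "past")
clause <- c("main", "sub",  "main", "main", "sub")
table(tense)                  # a one-dimensional frequency table
table(tense, clause)          # a two-dimensional cross-tabulation
ftable(table(tense, clause))  # a 'flat' display, useful for 3+ dimensions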
Second, by learning to do your analyses with a programming language, you usually have more control over what you are actually doing: Different concordance programs have different settings or different ways of handling searches that are not always obvious to the (inexperienced) user. For instance, ready-made concordance tools often have slightly different settings that specify what ‘a word’ is, which means you can get different results if you have different programs perform the same search on the same corpus. Yes, those settings can usually be tweaked, but that means that such a ready-made application actually requires the same attention to detail as R – and with a programming language, all of your methodological choices are right there in the code for everyone to see and replicate.
Third, if you use a particular concordancing program, you are at the mercy of its
developers. If the developers change its behavior, its output, or its default settings,
you can only hope that this is documented well and/or does not affect your results. There
have been cases where even silent over-the-internet updates have changed the output of
such software from one day to the next. Worse, developers might even discontinue the
development of a tool altogether – and let us not even consider how sorry the state of the
discipline of corpus linguistics would be if a majority of its practitioners were dependent
on not even a handful of ready-made corpus tools and websites that allow you to search
a corpus online. Somewhat polemically speaking, being able to enter a URL and type in a
search word shouldn’t make you a corpus linguist.
The fourth and maybe most important reason for learning a programming language such as R is that a programming language is a much more versatile tool than any ready-made software application. For instance, many ready-made corpus tools can only offer the functionality they aim to provide for corpora with particular formats, and then can only provide a small number of kinds of output. R, as a programming language, can handle pretty much any input and can generate pretty much any output you want – in fact, in my bootcamps, I tell participants on day 1 that I don’t want to hear any questions that begin with “Can R . . . ?” because the answer is “Yes”. For instance, with R you can readily use the CELEX database, CHAT files from language acquisition corpora, the very hierarchically layered annotation of XML corpora, previously generated frequency lists for corpora you no longer have access to, literature files from Project Gutenberg or similar sites, tabular corpus files such as those from the Corpus of Contemporary American English (http://corpus.byu.edu/coca) or the Corpus of Historical American English (http://corpus.byu.edu/coha), and so on and so forth. You can use files of whatever encoding, meaning that data from any language/writing system can be straightforwardly processed, and R’s general data-processing capabilities are mostly only limited by your working memory and abilities (rather than, for instance, by the number of rows your spreadsheet software can handle). With very few exceptions, R works identically on all three major operating systems: Linux/Unix, Windows, and Mac OS X. Once you have mastered the basic mechanisms, there is basically no limit to what you can do with it, both in terms of linguistic processing and statistical evaluation.
But there are also additional important advantages to the fact that R is an open-source tool/programming language. For instance, there is a large number of functions and packages contributed by users all over the world. These often allow effective shortcuts that are not, or hardly, possible with ready-made applications, which you cannot tweak as you wish. Also, in contrast to commercial concordance software, bug-fixes are usually available very quickly. And a final, obvious, and very down-to-earth advantage of using open-source software is of course that it comes free of charge. Any student or any department’s computer lab can afford it without expensive licenses, temporally limited or functionally restricted licenses, or irritating ads and nag screens. All this makes a strong case for the choice of software made here.

1.2 Outline of the Book


This book has changed quite a bit from the first edition; it is now structured as follows.
Chapter 2 defines the notion of a corpus and provides a brief overview of what I consider to be the most central corpus-linguistic methods, namely frequency lists, dispersion, collocations, and concordances; in addition, I briefly mention different kinds of annotation. The main change here is the addition of some discussion of the important notion of dispersion.
Chapter 3 introduces the fundamentals of R, covering a variety of functions from different domains, but the area which receives most consideration is that of text processing. There are many small changes in the code and the examples (for instance, I now introduce free-spacing), but the main differences from the first edition consist of: (1) a revision of the section on Unicode, which is now more comprehensive; (2) the addition of a new section specifically discussing how to get the most out of XML data using dedicated packages that can parse the hierarchical structure of XML documents; (3) an improved version of my exact.matches function; and (4) a new section on how to write your own functions for text processing and other things – this is taken up a lot in Chapter 5.
Chapter 4 is what used to be Chapter 5 in the first edition. It introduces you to
some fundamental aspects of statistical thinking and testing. The questions to be covered
in this chapter include: What are hypotheses? How do I check whether my results are
noteworthy? How might I visualize results? Given considerations of space and focus, this
chapter is informative, I hope, but still short.
The main chapter of this edition, Chapter 5, is brand new and, in a sense, brings it all together: More than 30 case studies in 27 sections illustrate various aspects of how the methods introduced in Chapters 3 and 4 can be applied to corpus data. Using a variety of different kinds of corpora, corpus-derived data, and other data, you will learn in detailed step-by-step instructions how to write your own programs in R for corpus-linguistic analyses, text processing, and some statistical analysis and visualization. Every single analysis is discussed on multiple levels of abstraction, and altogether more than 6,000 lines of code, nearly every one of them commented, help you delve deeply into how powerful a tool R can be for your work.
Finally, Chapter 6 is a very brief conclusion that points you to a handful of useful R
packages that you might consider exploring next.
Before we begin, a few short comments on the nature of this book are necessary. This
book is kind of a sister publication to my introduction to statistics for linguists (Gries
2013), and shares with it multiple characteristics. For instance, and as already mentioned,
this introduction to corpus linguistics is different from every other introduction to corpus
linguistics I know in how it doesn’t even attempt to survey the discipline, but focuses on R
programming for corpus linguists. This has two consequences. On the one hand, this book
is not a book that requires much previous knowledge: It presupposes only basic (corpus-)linguistic knowledge and no mathematical or programming knowledge.
On the other hand, this book is an attempt to teach you a lot about how to be a good corpus linguist. As a good corpus linguist, you have to combine many different methodological skills (and many equally important analytical skills that I will not be concerned with here). Many of these methodological skills are addressed here, such as some very basic knowledge of computers (operating systems, file types, etc.), data management, regular expressions, some elementary programming skills, some elementary knowledge of statistics, etc. What you must know, therefore, is that (1) nobody has ever learned all of this just by reading – you must do things – and (2) this is not an easy book that you can read for ten minutes at a time in bed before you fall asleep. What these two things mean is that you really must read this book while you are sitting at your computer so you can run the code, see what it does, and work on the examples. This is particularly important because the code file from the companion website contains more than 6,500 lines of code and a huge amount of extra commentary to help you understand the code much better than you could from just reading the book; this is especially relevant for Chapter 5! You will need practice to master all the concepts introduced here, but you will be rewarded by acquiring skills that give you access to a variety of data and approaches you may not have considered accessible to you – at least that’s what happened to me when I at one point decided to leave behind the ready-made tools I had become used to. Undergrads in my corpus classes without prior programming experience have quickly learned to write small programs that do things better than many concordance programs, and you can do the same.
In order to facilitate your learning process, there are four different ways in which I try to help you get more out of this book. First, there are small Think Breaks. These are small assignments which you should try to complete before you read on; answers to them follow immediately in the text. Second, there are exercise boxes with small assignments. Ideally, you should complete these and check your answers in the answer key before you read any further, but it is not always necessary to complete them right away to understand what follows, so you can also return to them later at your own leisure. Third, there are many boxes with recommendations for further study/exploration, which typically mention functions that you do not need for the section in which they are first mentioned, but many of which are used at a later stage (often this will be Chapter 5) – which means I really encourage you to follow up on them soon after you have encountered them. Fourth, and in addition to the above, I would like to encourage you to go to the companion website for this book at http://tinyurl.com/QuantCorpLingWithR, as well as the Google group “CorpLing with R”, which I created and maintain. You need to go to the companion website to get all the files that belong with this book, but if you also become a member of the Google group:

• you can send questions about corpus linguistics with R to the list and, hopefully, get useful responses from some kind soul(s);
• post suggestions for revisions of this book there;
• inform me and the other readers of errors you find and, of course, be informed when other people or I find errata.

Thus, while this is not an easy book, I hope these aids help you to become a good corpus linguist. If you work through the whole book, you will be able to do a large number of things you could not even do with commercial concordancing software; many of the scripts you find here are taken from actual research, and are in fact simplified versions of scripts I have used myself for published papers. In addition, if you also take up the many recommendations for further exploration that are scattered throughout the book, you will probably keep finding new and more efficient ways of applying these skills.

References
Gries, Stefan Th. (2010). Corpus linguistics and theoretical linguistics: A love–hate relationship?
Not necessarily . . . International Journal of Corpus Linguistics 15(3), 327–343.
Gries, Stefan Th. (2011). Methodological and interdisciplinary stance in corpus linguistics. In
Geoffrey Barnbrook, Vander Viana, & Sonia Zyngier (Eds.), Perspectives on corpus linguistics:
Connections and controversies (pp. 81–98). Amsterdam: John Benjamins.
Gries, Stefan Th. (2013). Statistics for linguistics with R. 2nd rev. and ext. ed. Berlin: De Gruyter
Mouton.
McEnery, Tony, & Andrew Hardie. (2011). Corpus linguistics: Method, theory, and practice.
Cambridge: Cambridge University Press.
R Core Team. (2016). R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria. Retrieved from www.R-project.org.
2 The Four Central Corpus-Linguistic Methods

This last point leads me, with some slight trepidation, to make a comment on our field in general, an informal observation based largely on a number of papers I have read as submissions in recent months. In particular, we seem to be witnessing as well a shift in the way some linguists find and utilize data – many papers now use corpora as their primary data, and many use internet data.
(Joseph 2004: 382)

In this chapter you will learn what a corpus is (plural: corpora) and what the four methods
are to which nearly all aspects of corpus-linguistic work can be reduced in some way.

2.1 Corpora
Before we start to actually look at corpus linguistics, we have to clarify our terminology a little. In this book, I will distinguish between a corpus, a text archive, and an example collection, even though the actual programming tasks do not differ between them.

2.1.1 What Is a Corpus?


In this book, the notion of a corpus refers to a machine-readable collection of (spoken or written) texts that were produced in a natural communicative setting, and in which the collection of texts is compiled with the intention (1) to be representative and balanced with respect to a particular language, variety, register, or genre and (2) to be analyzed linguistically. The parts of this definition need some further clarification themselves:

• “Machine-readable” refers to the fact that nowadays virtually all corpora are stored in the form of plain ASCII or Unicode text files that can be loaded, manipulated, and processed platform-independently. This does not mean, however, that corpus linguists only deal with raw text files – quite the contrary: Some corpora are shipped with sophisticated retrieval software that makes it possible to look for precisely defined lexical, syntactic, or other patterns. It does mean, however, that you would have a hard time finding corpora on paper, in the form of punch cards, or digitally in HTML or Microsoft Word document formats; probably the most widely used format consists of text files with a Unicode UTF-8 encoding and XML annotation.
• “Produced in a natural communicative setting” means that the texts were spoken or written for some authentic communicative purpose, but not for the purpose of putting them into a corpus. For example, many corpora consist to a large degree of newspaper articles. These meet the criterion of having been produced in a natural setting because journalists write articles to be published in newspapers and to communicate something to their readers, not because they want to fill a linguist’s corpus. Similarly, if I obtained permission to record all of a particular person’s conversations in one week, then hopefully, while the person and his interlocutors usually are aware of their conversations being recorded, I will obtain authentic conversations rather than conversations produced only for the sake of my corpus.
• I use “representative [ . . . ] with respect to a particular language, variety . . . ” here to refer to the fact that the different parts of the linguistic variety I am interested in should all be manifested in the corpus (at least if you want to generalize much beyond your sample, e.g., to the language in general). For example, if I was interested in phonological reduction patterns in the speech of adolescent Californians and recorded only parts of their conversations with several people from their peer group, my corpus would not be representative in the above sense because it would not reflect the fact that some sizable proportion of the speech of adolescent Californians may also consist of dialogs with a parent, a teacher, etc., which would therefore also have to be included.
• I use “balanced with respect to a particular language, variety . . . ” to mean that not only should all parts of which a variety consists be sampled into the corpus, but also that the proportion with which a particular part is represented in a corpus should reflect the proportion the part makes up in this variety and/or the importance of the part in this variety (at least if you want to generalize much beyond your sample, e.g., to the language in general). For example, if I know that dialogs make up 65 percent of the speech of adolescent Californians, approximately 65 percent of my corpus should consist of dialog recordings. This example already shows that this criterion is more of a theoretical ideal: How would one even measure the proportion that dialogs make up of the speech of adolescent Californians? We can only record a tiny sample of all adolescent Californians, and how would we measure the proportion of dialogs? In terms of time? In terms of sentences? In terms of words? And how would we measure the importance of a particular linguistic variety? The implicit assumption that conversational speech is somehow the primary object of interest in linguistics also prevails in corpus linguistics, which is why corpora often aim at including as much spoken language as possible – but, on the other hand, a single newspaper headline read by millions of people may have a much larger influence on every reader’s linguistic system than 20 hours of dialog. In sum, balanced corpora are a theoretical ideal corpus compilers constantly bear in mind, but the ultimate and exact way of compiling a balanced corpus has remained mysterious so far.

It is useful to point out, however, that the above definition of a corpus describes perhaps the prototype, which implies that there are many other corpora that differ from this prototype along a variety of dimensions. For instance, the TIMIT Acoustic-Phonetic Continuous Speech Corpus is made up of audio recordings of 630 speakers of eight major dialects of American English, where each speaker read phonetically rich sentences – not exactly a natural communicative setting. Or consider the DCIEM Map Task Corpus, which consists of unscripted dialogs in which one interlocutor describes a route on a map to the other after both interlocutors were subjected to 60 hours of sleep deprivation and one of three drug treatments – again, hardly a normal situation. Even a genre as widely used as newspaper text – journalese – is not necessarily close to the prototypical corpus, given how newspaper writing is created much more deliberately and consciously than many other texts – plus such texts often come with linguistically arbitrary restrictions regarding their length, are often not written by a single person, and are heavily edited. Thus, the notion of corpus is really a rather diverse one.
Many people would prefer to consider newspaper data not corpora, but text archives. Those would be databases of texts which

• may not have been produced in a natural setting;
• have often not been compiled for the purposes of linguistic analysis; and
• have often not been intended to be representative and/or balanced with respect to a particular linguistic variety or speech community.

As the above discussion already indicated, however, the distinction between corpora and text archives is often blurred. It is theoretically easy to make, but in practice often not adhered to very strictly and, again, has very few implications for the kinds of (R) programming the two require. For example, if a publisher of a popular computing periodical makes all the issues of the previous year available on their website, then the first criterion is met, but not the last three. However, because of their availability and size, many corpus linguists use such archives as resources, and as long as one bears their limitations in mind in terms of representativity etc., there is little reason not to.
Finally, an example collection is just what the name says it is – a collection of examples that, typically, the person who compiled the examples came across and noted down. For example, much psycholinguistic research in the 1970s was based on collections of speech errors compiled by the researchers themselves and/or their helpers. Occasionally, people refer to such collections as error corpora, but we will not use the term corpus for these. It is easy to see how such collections compare to corpora. On the one hand, for example, some errors – while occurring frequently in authentic speech – are more difficult to perceive than others and thus hardly ever make it into a collection. This would be an analog to the balancedness problem outlined above. On the other hand, the perception of errors is contingent on the acuity of the researcher, while, with corpus research, the corpus compilation would not be contingent on a particular person’s perceptual skills. Finally, because of the scarcity of speech errors, usually all speech errors perceived (in a particular amount of time) are included in the collection, whereas, at least usually and ideally, corpus compilers are more picky and select the material to be included with an eye to the criteria of representativity and balancedness outlined above.1 Be that as it may, if only for the sake of terminological clarity, it is useful to distinguish the notions of corpora, text archives, and example collections.

2.1.2 What Kinds of Corpora Are There?


Corpora differ in a variety of ways. There are a few distinctions you should be familiar with
if only to be able to find the right corpus for what you want to investigate. The most basic
distinction is that between general corpora and specific corpora. The former intend to be
representative and balanced for a language as a whole – within the above-mentioned limits,
that is – while the latter are by design restricted to a particular variety, register, genre, etc.
Another important distinction is that between raw corpora and annotated corpora. Raw corpora consist of files only containing the corpus material (see (1) in the example below), while annotated corpora also contain additional information. Annotated corpora are very often annotated according to the standards of the Text Encoding Initiative (TEI, www.tei-c.org/index.xml) or the Corpus Encoding Standard (CES, www.cs.vassar.edu/CES), and have two parts. The first part is called the header, which provides information that is typically characterized as markup. This is information about (1) the text itself, e.g., where the corpus data come from, which language is represented in the file, which (part of a) newspaper or book has been included, who recorded whom, where, and when, who has the copyright, what annotation comes with the file; and information about (2) its formatting, printing, processing, etc. Markup refers to objectively codable information – the statement that there is a paragraph in a text or that a particular speaker is female can typically be made without doubt; this is different from annotation, which is usually specifically linguistic information – e.g., part-of-speech (POS) tagging, semantic information, pragmatic information, etc. – and which is less objective (for instance, because linguists may disagree about POS tags for specific words). This information helps users to quickly determine, e.g., whether a particular file is part of the register one wishes to investigate or not.
The second part is called the body and contains the corpus data proper – i.e., what people actually said or wrote – as well as linguistic information that is usually based on some linguistic theory: Parts of speech or syntactic patterns, for example, can be matters of debate. In what follows I will briefly (and non-exhaustively!) discuss and exemplify a few common annotation schemes (see Wynne 2005; McEnery, Xiao, & Tono 2006: A.3 and A.4; Beal, Corrigan, & Hermann 2007a, 2007b; Gries & Newman 2013 for more discussion).
First, a corpus may be lemmatized such that each word in the corpus is followed (or preceded) by its lemma, i.e., the form under which you would look it up in a dictionary (see (2)). A corpus may have so-called part-of-speech tags so that each word in the corpus is followed by an abbreviation giving the word’s POS and sometimes also some morphological information (see (3)). A corpus may also be phonologically annotated (see (4)). Then, a corpus may be syntactically parsed, i.e., contain information about the syntactic structures of the text/utterances (see (5)). Finally, and as a last example, a corpus may contain several different annotations on different lines (or tiers) at the same time, a format especially common in language acquisition corpora (see (6)).

(1) I did get a postcard from him.
(2) I_I did_do get_get a_a postcard_postcard from_from him_he._punct
(3) I<PersPron> did<VerbPast> get<VerbInf> a<Det> postcard<NounSing>
from<Prep> him<PersPron>.<punct>
(4) [@:]·I·^did·get·a·!p\ostcard·fr/om·him#·-·-
(5) <Subject,·NP>
I<PersPron>
<Predicate,·VP>
did<Verb>
get<Verb>
<DirObject,·NP>
a<Det>
postcard<NounSing>
<Adverbial,·PP>
from<Prep>
him<PersPron>.
(6) *CHI: I did get a postcard from him
%mor: pro|I·v|do&PAST·v|get·det|a·n|postcard·prep|from·
pro|him·.
%lex: get
%syn: trans

Other annotation includes that with regard to semantic characteristics, stylistic aspects, anaphoric relations (co-reference annotation), etc. Nowadays, most corpora come in the form of XML files, and we will explore many examples involving XML annotation in the chapters to come. As is probably obvious from the above, annotation can sometimes be done completely automatically (possibly with human error-checking) or semi-automatically, or must be done completely manually. POS tagging, probably the most frequent kind of annotation, is usually done automatically, and for English, taggers are claimed to achieve accuracy rates of 97 percent – a number that I sometimes find hard to believe when I look at corpora, but that is a different story.
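To anticipate the kind of text processing taught in Chapter 3, here is a hedged sketch of how one might retrieve word–tag pairs from an annotation format like the one in (3); the tag names are just the made-up ones from that example:

x <- "I<PersPron> did<VerbPast> get<VerbInf> a<Det> postcard<NounSing> from<Prep> him<PersPron>"
# match a stretch of characters that are not brackets or spaces, followed by a <...> tag
hits <- regmatches(x, gregexpr("[^<> ]+<[^<>]+>", x, perl=TRUE))[[1]]
# split each match at the angle brackets into a word column and a tag column
do.call(rbind, strsplit(hits, "[<>]"))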
Then, there is a difference between diachronic corpora and synchronic corpora. The former aim at representing how a language/variety changes over time, while the latter provide, so to speak, a snapshot of a language/variety at one particular point in time. Yet another distinction is that between monolingual corpora and parallel corpora. As you might already guess from the names, the former have been compiled to provide information about one particular language/variety, whereas the latter ideally provide the same text in several different languages. Examples include translations from EU Parliament debates into the 23 languages of the European Union, or the Canadian Hansard corpus, containing Canadian Parliament debates in English and French. Again, ideally, a parallel corpus does not just have the translations in different languages, but has the translations sentence-aligned, such that for every sentence in language L1, you can automatically retrieve its translation in the languages L2 to Ln.
The next distinction to be mentioned here is that of static corpora vs. dynamic/monitor corpora. Static corpora have a fixed size (e.g., the Brown corpus, the LOB corpus, the British National Corpus), whereas dynamic corpora do not, since they may be constantly extended with new material (e.g., the Bank of English).
The final distinction I would like to mention at least briefly involves the encoding of the corpus files. Given especially the predominance of work on English in corpus linguistics, until rather recently many corpora came in the so-called ASCII (American Standard Code for Information Interchange) character encoding, an encoding scheme that encodes 2^7 = 128 characters as numbers and that is largely based on the Western alphabet. In such corpora, special characters that were not part of the ASCII character inventory were often paraphrased, e.g., “é” was paraphrased as “&eacute;”. However, the number of corpora for many more languages has been increasing steadily, and given the large number of characters that writing systems such as Chinese have, this is not a practical approach. As such, language-specific character encodings were developed (e.g., ISO 8859-1 for Western European languages vs. ISO 2022 for Chinese/Japanese/Korean languages). However, in the interest of overcoming the compatibility problems that arose because different languages used different character encodings, the field of corpus linguistics has been moving towards using only one unified (i.e., not language-specific) multilingual character encoding in the form of Unicode (most notably UTF-8). This development is in tandem with the move toward XML corpus annotation and, more generally, with UTF-8 becoming the most widely used character encoding on the internet.
Now that you know a bit about the kinds of corpora that exist, there is one other really important point to be made. While we will see below that corpus linguistics has a lot to offer to the analyst, it is worth pointing out that, strictly speaking at least, the only thing corpora can provide is information on frequencies. Put differently, there is no meaning in corpora, and no functions – only:

• frequencies of occurrence of items – i.e., how often do morphemes, words, grammatical patterns, etc. occur in (parts of) a corpus?; and
• frequencies of co-occurrence of items – i.e., how often do morphemes occur with particular words? How often do particular words occur in a certain grammatical construction? etc.
It is up to the researcher to interpret these frequencies of occurrence and co-occurrence in meaningful or functional terms. The assumption underlying basically all corpus-based analyses, however, is that formal differences reflect functional differences: Different frequencies of (co-)occurrence of formal elements are supposed to reflect functional regularities, where functional is understood here in a very broad sense as anything – be it semantic, discourse-pragmatic, etc. – that is intended to perform a particular communicative function. On a very general level, the frequency information a corpus offers is exploited in four different ways, which will be the subject of this chapter: frequency lists (Section 2.2), dispersion (Section 2.3), lexical co-occurrence lists/collocations (Section 2.4), and concordances (Section 2.5).

2.2 Frequency Lists


The most basic corpus-linguistic tool is the frequency list. You generate a frequency list when you want to know how often something – usually words – occurs in a corpus. Thus, a frequency list of a corpus is usually a two-column table with all the words occurring in the corpus in one column and the frequencies with which they occur in the other. Since the notion of word is a little ambiguous here, it is useful to introduce a common distinction between (word) type and (word) token. The string “the word and the phrase” contains five (word) tokens (“the”, “word”, “and”, “the”, and “phrase”), but only four (word) types (“the”, “word”, “and”, and “phrase”), of which one (“the”) occurs twice. In this parlance, a frequency list lists the types in one column and their token frequencies in the other; often you will find the expression type frequency referring to the number of different types attested in a corpus (or in a ‘slot’ such as a syntactically defined slot in a grammatical construction).
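A quick sketch of this distinction in R, using the example string from above:

tokens <- unlist(strsplit("the word and the phrase", " "))  # split at spaces
length(tokens)          # 5: the number of word tokens
freqs <- table(tokens)  # the frequency list: types and their token frequencies
length(freqs)           # 4: the number of word types
freqs                   # "the" occurs twice, all other types once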
Typically, one out of three different sorting styles is used: frequency order (ascending
or, more typically, descending; see the left panel of Table 2.1), alphabetical (ascending or
descending), and occurrence (each word occurs in a position reflecting its first occurrence
in the corpus).
Apart from this simple form in the leftmost panel, there are other varieties of frequency lists that are sometimes found. First, a frequency list may provide the frequencies of all words together with the words with their letters reversed. This may not seem particularly useful at first, but even a brief look at the second panel of Table 2.1 clarifies that this kind of display can sometimes be very helpful because it groups together words that share a particular suffix – here the adverb marker -ly. Second, a frequency list may not list each individual word token and its frequency, but so-called n-grams, i.e., sequences of n words (here: bigrams).
Table 2.1 Examples of differently ordered frequency lists

Words  Freq.    Words           Freq.  Bigrams    Freq.  Words  Tags  Freq.
the    62,580   yllufdaerd      80     of the     4,892  the    AT0   6,069
of     35,958   yllufecaep      1      in the     3,006  of     PRF   4,106
and    27,789   yllufecarg      5      to the     1,751  a      AT0   2,823
to     25,600   yllufecruoser   8      on the     1,228  and    CJC   2,602
a      21,843   yllufeelg       1      and the    1,114  in     PRP   2,449
in     19,446   yllufeow        1      for the    906    to     TO0   1,678
that   10,296   ylluf           2      at the     832    is     VBZ   1,589
is     9,938    yllufepoh       8      to be      799    to     PRP   1,135
was    9,740    ylluferac       87     with the   783    for    PRP   916
for    8,799    yllufesoprup    1      from the   720    be     VBI   874
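To illustrate the n-gram idea in R – a minimal sketch with a made-up token vector – bigrams can be generated by pasting every token onto its successor:

words <- c("of", "the", "in", "the", "of", "the")  # hypothetical word tokens
# pair all tokens but the last with all tokens but the first
bigrams <- paste(words[-length(words)], words[-1])
sort(table(bigrams), decreasing = TRUE)            # a bigram frequency list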
Another Random Scribd Document
with Unrelated Content
Me, jotka olimme tulleet laivaan Callaossa, saimme lähteä. Charles
Clarck oli luvannut paikan Stonefieldille ja irlantilaiselle. Fred oli
Mackin pyynnöstä päättänyt mennä hänen kotiinsa Skotlantiin. Mack
oli luvannut Fredille kauniin sisarensa, jos tämä vielä oli otettavissa.
Peg oli päättänyt toistaiseksi jäädä Columbiaan, ja Lelu koetti
pysytellä lähellä häntä, jotta taas yhdessä voisivat halkaisijanpuomin
päältä pyydystää bonitoja. Kapteeni Franckilla oli tarkoitus järjestää
siten, että nappula saisi olla harjoitteluaikansa Columbiassa, jonka
jälkeen kapteeni kustantaisi pojan koulunkäynnin kapteeniksi saakka.
Chimborazohan oli luvannut pitää huolen geishan tyttärestä.

Entä Chimb itse, mihin hän joutuu? Meille neljälle on käynyt


selväksi, että meidän täytyy erota Chimbistä. Kaikki rakastamme
Andein tytärtä, mutta hän istuu korkealla oksalla, ja me seisomme
puun juurella.

— Täällä tulee kylmä, mennään keittiöön, sanoi Chimb. Istuimme


siellä vähän aikaa ääneti, sitten sanoi Andein tytär:

— Minunhan piti kertoa teille ja keskustella kanssanne viime yönä


asiasta, joka koskee oikeastaan minua, mutta myöskin teitä. Intiaani
saa itkeä, mutta ei nauraa.

Juuri kun Chimbin piti jatkaa, päästyään liikutuksensa vallasta,


tuntui tärähdys, joka johtui siitä, että vene oli tullut kosketukseen
laivamme sivun kanssa.

— Uudenvuodenyönä ovat irlantilaiset liikkeellä. Muuan heistä on


sikahumalassa. Hänet aiotaan nostaa laivaan. Veneessä on myös
nainen, huomautti Mack, kurkistettuaan laidan yli.
XXI.

Mr. Cristopher oli päässyt Atlantin yli. Queenstownissa hän on ollut


viikkokausia ja sähköttänyt vimmatusti, sinne ja tänne. Hänen
matkatoverinsa eivät ole jättäneet häntä rauhaan, ei yöllä eikä
päivällä. Hänet on ajettu pois hotellista ja viety putkaan. Täältä
hänet on siirretty sairaalaan. Putkassa ei oltu niin viisaita, että olisi
pidetty hänen matkatovereitansa siellä, sillä nehän ovat syypäät mr.
Cristopherin rikoksiin ja hullutuksiin, vaan ne saivat seurata
herraansa. Kuultuaan sairaan hourivan, on lääkäri sanonut, ettei hän
parane ennen kuin eräs Columbia-niminen laiva saapuu
Queenstowniin.

*****

Mies, joka nostettiin Columbian kannelle, ei ollutkaan pahantekijä


eikä humalainen, vaan sairas. Mukana ollut sairaanhoitajatar kertoi,
mitä tautia sairas poti. Tietysti Chimb oli heti muuttunut
sairaanhoitajattareksi ja tarjoutui auttamaan. Me neljä astuimme
Chimbin viittauksesta salonkiin, jossa sairas nyt istui pirteämmän
näköisenä kuin laivaan tullessaan.
— Tunnetteko minua, pojat? Tuumimme ja tarkastimme vähän
aikaa.

— Mr. Malcolm! Joko Lima—Oroyan rautatie ulottuu Irlantiin asti?


Että olette rakentanut rautateitä kuilujen ja kohisevien koskien ylitse,
sen tiedämme, mutta että olette rakentanut rautatien Atlantin alitse,
sitä emme olisi uskaltaneet uskoa, sanoi Mack.

— Minä olen tullut tänne Suuren Hengen opastamana, pyytämään


Chimborazo Lauricochaa vaimokseni. Kuukausi sen jälkeen kun olitte
vieneet hänet, pääsin selville siitä, että rakastan häntä. Rakastan
ensimmäisen ja viimeisen kerran elämässäni. Jonkin ajan sen jälkeen
sain varmuuden siitä, että hän myöskin, millä kohtaa maapalloa
mahtoi ollakaan, oli ruvennut rakastamaan minua, kertoi mr.
Malcolm.

— Se oli sen jälkeen, kun olimme jättäneet Falklannin saaret,


sanoi
Chimb.

— Siis, sanoi mr. Malcolm ja ojensi kätensä Chimbille.

Chimb nousi, tuli ja suuteli meidän jokaisen neljän otsaa ja kysyi:

— Annatteko minut mr. Malcolmille?

— Annamme, vastasimme.

Chimb otti nyt mr. Malcolmin käden ja veti suunsa hymyyn —


intiaani nauroi. Tämän jälkeen poistuivat mr. Malcolmin matkatoverit.
*** END OF THE PROJECT GUTENBERG EBOOK ANDEIN TYTÄR ***

Updated editions will replace the previous one—the old editions will
be renamed.

Creating the works from print editions not protected by U.S.


copyright law means that no one owns a United States copyright in
these works, so the Foundation (and you!) can copy and distribute it
in the United States without permission and without paying
copyright royalties. Special rules, set forth in the General Terms of
Use part of this license, apply to copying and distributing Project
Gutenberg™ electronic works to protect the PROJECT GUTENBERG™
concept and trademark. Project Gutenberg is a registered trademark,
and may not be used if you charge for an eBook, except by following
the terms of the trademark license, including paying royalties for use
of the Project Gutenberg trademark. If you do not charge anything
for copies of this eBook, complying with the trademark license is
very easy. You may use this eBook for nearly any purpose such as
creation of derivative works, reports, performances and research.
Project Gutenberg eBooks may be modified and printed and given
away—you may do practically ANYTHING in the United States with
eBooks not protected by U.S. copyright law. Redistribution is subject
to the trademark license, especially commercial redistribution.

START: FULL LICENSE


THE FULL PROJECT GUTENBERG LICENSE
PLEASE READ THIS BEFORE YOU DISTRIBUTE OR USE THIS WORK

To protect the Project Gutenberg™ mission of promoting the free
distribution of electronic works, by using or distributing this work (or
any other work associated in any way with the phrase “Project
Gutenberg”), you agree to comply with all the terms of the Full
Project Gutenberg™ License available with this file or online at
www.gutenberg.org/license.

Section 1. General Terms of Use and Redistributing Project
Gutenberg™ electronic works

1.A. By reading or using any part of this Project Gutenberg™
electronic work, you indicate that you have read, understand, agree
to and accept all the terms of this license and intellectual property
(trademark/copyright) agreement. If you do not agree to abide by all
the terms of this agreement, you must cease using and return or
destroy all copies of Project Gutenberg™ electronic works in your
possession. If you paid a fee for obtaining a copy of or access to a
Project Gutenberg™ electronic work and you do not agree to be
bound by the terms of this agreement, you may obtain a refund
from the person or entity to whom you paid the fee as set forth in
paragraph 1.E.8.

1.B. “Project Gutenberg” is a registered trademark. It may only be
used on or associated in any way with an electronic work by people
who agree to be bound by the terms of this agreement. There are a
few things that you can do with most Project Gutenberg™ electronic
works even without complying with the full terms of this agreement.
See paragraph 1.C below. There are a lot of things you can do with
Project Gutenberg™ electronic works if you follow the terms of this
agreement and help preserve free future access to Project
Gutenberg™ electronic works. See paragraph 1.E below.

1.C. The Project Gutenberg Literary Archive Foundation (“the
Foundation” or PGLAF), owns a compilation copyright in the
collection of Project Gutenberg™ electronic works. Nearly all the
individual works in the collection are in the public domain in the
United States. If an individual work is unprotected by copyright law
in the United States and you are located in the United States, we do
not claim a right to prevent you from copying, distributing,
performing, displaying or creating derivative works based on the
work as long as all references to Project Gutenberg are removed. Of
course, we hope that you will support the Project Gutenberg™
mission of promoting free access to electronic works by freely
sharing Project Gutenberg™ works in compliance with the terms of
this agreement for keeping the Project Gutenberg™ name associated
with the work. You can easily comply with the terms of this
agreement by keeping this work in the same format with its attached
full Project Gutenberg™ License when you share it without charge
with others.

1.D. The copyright laws of the place where you are located also
govern what you can do with this work. Copyright laws in most
countries are in a constant state of change. If you are outside the
United States, check the laws of your country in addition to the
terms of this agreement before downloading, copying, displaying,
performing, distributing or creating derivative works based on this
work or any other Project Gutenberg™ work. The Foundation makes
no representations concerning the copyright status of any work in
any country other than the United States.

1.E. Unless you have removed all references to Project Gutenberg:

1.E.1. The following sentence, with active links to, or other
immediate access to, the full Project Gutenberg™ License must
appear prominently whenever any copy of a Project Gutenberg™
work (any work on which the phrase “Project Gutenberg” appears,
or with which the phrase “Project Gutenberg” is associated) is
accessed, displayed, performed, viewed, copied or distributed:

This eBook is for the use of anyone anywhere in the United
States and most other parts of the world at no cost and with
almost no restrictions whatsoever. You may copy it, give it away
or re-use it under the terms of the Project Gutenberg License
included with this eBook or online at www.gutenberg.org. If you
are not located in the United States, you will have to check the
laws of the country where you are located before using this
eBook.

1.E.2. If an individual Project Gutenberg™ electronic work is derived
from texts not protected by U.S. copyright law (does not contain a
notice indicating that it is posted with permission of the copyright
holder), the work can be copied and distributed to anyone in the
United States without paying any fees or charges. If you are
redistributing or providing access to a work with the phrase “Project
Gutenberg” associated with or appearing on the work, you must
comply either with the requirements of paragraphs 1.E.1 through
1.E.7 or obtain permission for the use of the work and the Project
Gutenberg™ trademark as set forth in paragraphs 1.E.8 or 1.E.9.

1.E.3. If an individual Project Gutenberg™ electronic work is posted
with the permission of the copyright holder, your use and distribution
must comply with both paragraphs 1.E.1 through 1.E.7 and any
additional terms imposed by the copyright holder. Additional terms
will be linked to the Project Gutenberg™ License for all works posted
with the permission of the copyright holder found at the beginning
of this work.

1.E.4. Do not unlink or detach or remove the full Project
Gutenberg™ License terms from this work, or any files containing a
part of this work or any other work associated with Project
Gutenberg™.

1.E.5. Do not copy, display, perform, distribute or redistribute this
electronic work, or any part of this electronic work, without
prominently displaying the sentence set forth in paragraph 1.E.1
with active links or immediate access to the full terms of the Project
Gutenberg™ License.

1.E.6. You may convert to and distribute this work in any binary,
compressed, marked up, nonproprietary or proprietary form,
including any word processing or hypertext form. However, if you
provide access to or distribute copies of a Project Gutenberg™ work
in a format other than “Plain Vanilla ASCII” or other format used in
the official version posted on the official Project Gutenberg™ website
(www.gutenberg.org), you must, at no additional cost, fee or
expense to the user, provide a copy, a means of exporting a copy, or
a means of obtaining a copy upon request, of the work in its original
“Plain Vanilla ASCII” or other form. Any alternate format must
include the full Project Gutenberg™ License as specified in
paragraph 1.E.1.

1.E.7. Do not charge a fee for access to, viewing, displaying,
performing, copying or distributing any Project Gutenberg™ works
unless you comply with paragraph 1.E.8 or 1.E.9.

1.E.8. You may charge a reasonable fee for copies of or providing
access to or distributing Project Gutenberg™ electronic works
provided that:

• You pay a royalty fee of 20% of the gross profits you derive
from the use of Project Gutenberg™ works calculated using the
method you already use to calculate your applicable taxes. The
fee is owed to the owner of the Project Gutenberg™ trademark,
but he has agreed to donate royalties under this paragraph to
the Project Gutenberg Literary Archive Foundation. Royalty
payments must be paid within 60 days following each date on
which you prepare (or are legally required to prepare) your
periodic tax returns. Royalty payments should be clearly marked
as such and sent to the Project Gutenberg Literary Archive
Foundation at the address specified in Section 4, “Information
about donations to the Project Gutenberg Literary Archive
Foundation.”

• You provide a full refund of any money paid by a user who
notifies you in writing (or by e-mail) within 30 days of receipt
that s/he does not agree to the terms of the full Project
Gutenberg™ License. You must require such a user to return or
destroy all copies of the works possessed in a physical medium
and discontinue all use of and all access to other copies of
Project Gutenberg™ works.

• You provide, in accordance with paragraph 1.F.3, a full refund of
any money paid for a work or a replacement copy, if a defect in
the electronic work is discovered and reported to you within 90
days of receipt of the work.

• You comply with all other terms of this agreement for free
distribution of Project Gutenberg™ works.

1.E.9. If you wish to charge a fee or distribute a Project Gutenberg™
electronic work or group of works on different terms than are set
forth in this agreement, you must obtain permission in writing from
the Project Gutenberg Literary Archive Foundation, the manager of
the Project Gutenberg™ trademark. Contact the Foundation as set
forth in Section 3 below.

1.F.

1.F.1. Project Gutenberg volunteers and employees expend
considerable effort to identify, do copyright research on, transcribe
and proofread works not protected by U.S. copyright law in creating
the Project Gutenberg™ collection. Despite these efforts, Project
Gutenberg™ electronic works, and the medium on which they may
be stored, may contain “Defects,” such as, but not limited to,
incomplete, inaccurate or corrupt data, transcription errors, a
copyright or other intellectual property infringement, a defective or
damaged disk or other medium, a computer virus, or computer
codes that damage or cannot be read by your equipment.

1.F.2. LIMITED WARRANTY, DISCLAIMER OF DAMAGES - Except for
the “Right of Replacement or Refund” described in paragraph 1.F.3,
the Project Gutenberg Literary Archive Foundation, the owner of the
Project Gutenberg™ trademark, and any other party distributing a
Project Gutenberg™ electronic work under this agreement, disclaim
all liability to you for damages, costs and expenses, including legal
fees. YOU AGREE THAT YOU HAVE NO REMEDIES FOR
NEGLIGENCE, STRICT LIABILITY, BREACH OF WARRANTY OR
BREACH OF CONTRACT EXCEPT THOSE PROVIDED IN PARAGRAPH
1.F.3. YOU AGREE THAT THE FOUNDATION, THE TRADEMARK
OWNER, AND ANY DISTRIBUTOR UNDER THIS AGREEMENT WILL
NOT BE LIABLE TO YOU FOR ACTUAL, DIRECT, INDIRECT,
CONSEQUENTIAL, PUNITIVE OR INCIDENTAL DAMAGES EVEN IF
YOU GIVE NOTICE OF THE POSSIBILITY OF SUCH DAMAGE.

1.F.3. LIMITED RIGHT OF REPLACEMENT OR REFUND - If you
discover a defect in this electronic work within 90 days of receiving
it, you can receive a refund of the money (if any) you paid for it by
sending a written explanation to the person you received the work
from. If you received the work on a physical medium, you must
return the medium with your written explanation. The person or
entity that provided you with the defective work may elect to provide
a replacement copy in lieu of a refund. If you received the work
electronically, the person or entity providing it to you may choose to
give you a second opportunity to receive the work electronically in
lieu of a refund. If the second copy is also defective, you may
demand a refund in writing without further opportunities to fix the
problem.

1.F.4. Except for the limited right of replacement or refund set forth
in paragraph 1.F.3, this work is provided to you ‘AS-IS’, WITH NO
OTHER WARRANTIES OF ANY KIND, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR ANY PURPOSE.

1.F.5. Some states do not allow disclaimers of certain implied
warranties or the exclusion or limitation of certain types of damages.
If any disclaimer or limitation set forth in this agreement violates the
law of the state applicable to this agreement, the agreement shall be
interpreted to make the maximum disclaimer or limitation permitted
by the applicable state law. The invalidity or unenforceability of any
provision of this agreement shall not void the remaining provisions.

1.F.6. INDEMNITY - You agree to indemnify and hold the Foundation,
the trademark owner, any agent or employee of the Foundation,
anyone providing copies of Project Gutenberg™ electronic works in
accordance with this agreement, and any volunteers associated with
the production, promotion and distribution of Project Gutenberg™
electronic works, harmless from all liability, costs and expenses,
including legal fees, that arise directly or indirectly from any of the
following which you do or cause to occur: (a) distribution of this or
any Project Gutenberg™ work, (b) alteration, modification, or
additions or deletions to any Project Gutenberg™ work, and (c) any
Defect you cause.

Section 2. Information about the Mission of Project Gutenberg™

Project Gutenberg™ is synonymous with the free distribution of
electronic works in formats readable by the widest variety of
computers including obsolete, old, middle-aged and new computers.
It exists because of the efforts of hundreds of volunteers and
donations from people in all walks of life.

Volunteers and financial support to provide volunteers with the
assistance they need are critical to reaching Project Gutenberg™’s
goals and ensuring that the Project Gutenberg™ collection will
remain freely available for generations to come. In 2001, the Project
Gutenberg Literary Archive Foundation was created to provide a
secure and permanent future for Project Gutenberg™ and future
generations. To learn more about the Project Gutenberg Literary
Archive Foundation and how your efforts and donations can help,
see Sections 3 and 4 and the Foundation information page at
www.gutenberg.org.

Section 3. Information about the Project Gutenberg Literary
Archive Foundation

The Project Gutenberg Literary Archive Foundation is a non-profit
501(c)(3) educational corporation organized under the laws of the
state of Mississippi and granted tax exempt status by the Internal
Revenue Service. The Foundation’s EIN or federal tax identification
number is 64-6221541. Contributions to the Project Gutenberg
Literary Archive Foundation are tax deductible to the full extent
permitted by U.S. federal laws and your state’s laws.

The Foundation’s business office is located at 809 North 1500 West,
Salt Lake City, UT 84116, (801) 596-1887. Email contact links and up
to date contact information can be found at the Foundation’s website
and official page at www.gutenberg.org/contact

Section 4. Information about Donations to the Project Gutenberg
Literary Archive Foundation

Project Gutenberg™ depends upon and cannot survive without
widespread public support and donations to carry out its mission of
increasing the number of public domain and licensed works that can
be freely distributed in machine-readable form accessible by the
widest array of equipment including outdated equipment. Many
small donations ($1 to $5,000) are particularly important to
maintaining tax exempt status with the IRS.

The Foundation is committed to complying with the laws regulating
charities and charitable donations in all 50 states of the United
States. Compliance requirements are not uniform and it takes a
considerable effort, much paperwork and many fees to meet and
keep up with these requirements. We do not solicit donations in
locations where we have not received written confirmation of
compliance. To SEND DONATIONS or determine the status of
compliance for any particular state visit www.gutenberg.org/donate.

While we cannot and do not solicit contributions from states where
we have not met the solicitation requirements, we know of no
prohibition against accepting unsolicited donations from donors in
such states who approach us with offers to donate.

International donations are gratefully accepted, but we cannot make
any statements concerning tax treatment of donations received from
outside the United States. U.S. laws alone swamp our small staff.

Please check the Project Gutenberg web pages for current donation
methods and addresses. Donations are accepted in a number of
other ways including checks, online payments and credit card
donations. To donate, please visit: www.gutenberg.org/donate.

Section 5. General Information About Project Gutenberg™
electronic works

Professor Michael S. Hart was the originator of the Project
Gutenberg™ concept of a library of electronic works that could be
freely shared with anyone. For forty years, he produced and
distributed Project Gutenberg™ eBooks with only a loose network of
volunteer support.

Project Gutenberg™ eBooks are often created from several printed
editions, all of which are confirmed as not protected by copyright in
the U.S. unless a copyright notice is included. Thus, we do not
necessarily keep eBooks in compliance with any particular paper
edition.

Most people start at our website which has the main PG search
facility: www.gutenberg.org.

This website includes information about Project Gutenberg™,
including how to make donations to the Project Gutenberg Literary
Archive Foundation, how to help produce our new eBooks, and how
to subscribe to our email newsletter to hear about new eBooks.