
Chapter2-answers

The document provides answers to exercises from Brezina's book on statistics in corpus linguistics, detailing the identification of tokens, types, lemmas, and lexemes in various examples. It also includes calculations of relative frequencies, predictions using Zipf's law, and statistical measures such as range, standard deviation, and Juilland's D. Additionally, the book serves as a practical guide for understanding statistical principles in linguistic research and offers supplementary online resources.


Materials from Brezina, V. (2018). Statistics in Corpus Linguistics: A Practical Guide. Cambridge: Cambridge
University Press.
PHOTOCOPIABLE

Chapter 2: Exercises – answers


1) Identify the no. of tokens, types, lemmas and lexemes.

a)

Tokens (26): The; City; is; braced; for; far; worse; figures; to; come; in; the; coming; months; unless; the; Government; recovery; package; produces; a; startling; turn; round; in; optimism

Types (23)¹: the; city; is; braced; for; far; worse; figures; to; come; in; coming; months; unless; government; recovery; package; produces; a; startling; turn; round; optimism

Lemmas (23)²: the; City; be; brace; for; far; bad; figure; to; come; in; coming; month; unless; Government; recovery; package; produce; a; startling; turn; round; optimism

Lexemes (23): THE; CITY; BE; BRACE; FOR; FAR; BAD; FIGURE; TO; COME; IN; COMING; MONTH; UNLESS; GOVERNMENT; RECOVERY; PACKAGE; PRODUCE; A; STARTLING; TURN; ROUND; OPTIMISM

b)

Tokens (29)³: Of; 354; fifth-; and; sixth-formers; who; left; Sharon's; school; in; the; summer; of; 1981; forty; had; found; real; jobs; by; 18; November; four; of; these; having; entered; military; service

Types (27)⁴: of; 354; fifth-; and; sixth-formers; who; left; sharon's; school; in; the; summer; 1981; forty; had; found; real; jobs; by; 18; november; four; these; having; entered; military; service

Lemmas (24)⁵: Of; <NUMBER>; fifth-; and; sixth-formers; who; leave; Sharon; school; in; the; summer; forty; have; find; real; job; by; November; four; these; enter; military; service

Lexemes (24): OF; <NUMBER>; FIFTH-; AND; SIXTH-FORMERS; WHO; LEAVE; SHARON; SCHOOL; IN; THE; SUMMER; FORTY; HAVE; FIND; REAL; JOB; BY; NOVEMBER; FOUR; THESE; ENTER; MILITARY; SERVICE

¹ An alternative solution: 24 if the case-sensitive option is selected – The and the would be counted as two types.
² Alternative solutions: a) 22 if turn round is understood as one lexical unit b) 22 if coming is lumped under the headword come.
³ An alternative solution: 30 if the hyphen is considered a token separator; in that case sixth and formers would be considered as two tokens.
⁴ An alternative solution: 28 if the case-sensitive option is selected – Of and of would be counted as two types.
⁵ An alternative solution: 25 if the possessive suffix ’s is counted as a separate lemma.


c)

Tokens (14): Erm; erm; erm; but; yeah; and; people; er; have; great; areas; of; that; taken

Types (12)⁶: erm; but; yeah; and; people; er; have; great; areas; of; that; taken

Lemmas (12): erm; but; yeah; and; people; er; have; great; area; of; that; take

Lexemes (10)⁷: BUT; YEAH; AND; PEOPLE; HAVE; GREAT; AREA; OF; THAT; TAKE

d) This is a very specific example which includes meta-linguistic comments on the meanings/uses of the
form bow.

Tokens (26): Homonyms; are; headwords; to; different; entries; that; are; spelt; in; the; same; way; e.g.; bow; the; weapon; bow; the; action; bow; the; verb; expressing; the; action

Types (18): homonyms; are; headwords; to; different; entries; that; spelt; in; the; same; way; e.g.; bow; weapon; action; verb; expressing

Lemmas (19): Homonyms; be; headword; to; different; entry; that; spell; in; the; same; way; e.g.; bow; weapon; action; bow; verb; expressing

Lexemes (20): Homonyms; be; headword; to; different; entry; that; spell; in; the; same; way; e.g.; bow; weapon; bow; action; bow; verb; expressing
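
The token and type counts in (a) and (b) can be checked with a few lines of Python. This is only a sketch: it assumes plain whitespace tokenisation (so sixth-formers stays one token) and takes the token sequences listed above as input; the function name `count_tokens_and_types` is made up for illustration. Lemmas and lexemes need linguistic judgement and are not attempted here.

```python
def count_tokens_and_types(text, case_sensitive=False):
    """Return (number of tokens, number of types) for a whitespace-tokenised text."""
    tokens = text.split()  # hyphenated forms stay together as single tokens
    # Case-insensitive counting folds "The" and "the" into one type.
    forms = tokens if case_sensitive else [t.lower() for t in tokens]
    return len(tokens), len(set(forms))


# Token sequences as listed in answers (a) and (b), without punctuation.
SENTENCE_A = (
    "The City is braced for far worse figures to come in the coming months "
    "unless the Government recovery package produces a startling turn round in optimism"
)
SENTENCE_B = (
    "Of 354 fifth- and sixth-formers who left Sharon's school in the summer of 1981 "
    "forty had found real jobs by 18 November four of these having entered military service"
)

print(count_tokens_and_types(SENTENCE_A))                       # (26, 23)
print(count_tokens_and_types(SENTENCE_A, case_sensitive=True))  # (26, 24)
print(count_tokens_and_types(SENTENCE_B))                       # (29, 27)
print(count_tokens_and_types(SENTENCE_B, case_sensitive=True))  # (29, 28)
```

The case-sensitive results correspond to the alternative solutions given in the footnotes for the type counts.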

2) and 3) –

4) Calculate the relative frequencies.

a) muggle: 0.2 per 10k

b) intriguingly: 0.3 per million

c) worse: 49.6 per million
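
A relative frequency is the absolute frequency divided by the corpus size and multiplied by a basis of normalisation (10,000 or 1,000,000 above). The sketch below shows the arithmetic only. The absolute frequencies and corpus sizes behind the three answers are not reproduced in this answer sheet, so the numbers in the example are hypothetical, as is the function name `relative_frequency`.

```python
def relative_frequency(absolute_freq, corpus_size, basis=1_000_000):
    """Normalise an absolute frequency to occurrences per `basis` tokens."""
    return absolute_freq / corpus_size * basis


# Hypothetical example (not taken from the exercise): 50 hits in a 1,000,000-token corpus.
print(relative_frequency(50, 1_000_000, basis=10_000))  # about 0.5 per 10k
print(relative_frequency(50, 1_000_000))                # about 50 per million
```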

⁶ An alternative solution: 13 if the case-sensitive option is selected – Erm and erm would be counted as two types.
⁷ The paralinguistic hesitation sounds (erm and er) in this utterance from a transcript of spoken conversation were
excluded because they do not have a semantic meaning.


5) Use Zipf’s law to predict absolute frequencies.

rank      word            absolute frequency
1.        the             6,041,234
2.        of              3,020,617
3.        and             2,013,745
4.        to              1,510,309
5.        a               1,208,247
10.       was             604,123
50.       so              120,825
100.      way             60,412
1,000.    limited         6,041
10,000.   conveniently    604
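
The predicted values above are consistent with the simplest form of Zipf's law, in which the frequency at rank r is the frequency of the top-ranked word divided by r. The following sketch reproduces the table under that assumption; the function name `zipf_prediction` is illustrative. It rounds halves upwards, because Python's built-in `round()` rounds halves to the nearest even number and would give 1,510,308 instead of 1,510,309 for rank 4.

```python
TOP_FREQUENCY = 6_041_234  # absolute frequency of the rank-1 word "the"


def zipf_prediction(rank, top_frequency=TOP_FREQUENCY):
    """Predicted absolute frequency at `rank`, i.e. top_frequency / rank, rounded half up."""
    # Integer arithmetic: floor((2f + r) / 2r) equals f / r rounded half up.
    return (2 * top_frequency + rank) // (2 * rank)


for rank in (1, 2, 3, 4, 5, 10, 50, 100, 1_000, 10_000):
    print(f"{rank:>6,}. {zipf_prediction(rank):>10,}")
```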

6) N.B. Zipf’s law is only an approximation and the actual absolute frequencies in the table below differ
to some extent from the predicted ones.

rank      word            absolute frequency
1.        the             6,041,234
2.        of              3,042,376
3.        and             2,616,708
4.        to              2,593,729
5.        a               2,164,238
10.       was             881,473
50.       so              239,116
100.      way             95,701
1,000.    limited         10,312
10,000.   conveniently    622

7) Calculate the Range, the Standard deviation, the Coefficient of variation and Juilland’s D.

Note that the first step is to convert all absolute frequencies to relative frequencies as seen in the
table below.


BNC section                        Total no. of tokens   some (RF)   smile (RF)   theory (RF)   chance (RF)
Fiction and verse                  16,143,913            1,525       341          21            164
Newspapers                         9,412,174             1,118       32           28            275
Non-academic prose and biography   24,178,674            1,785       16           164           91
Academic prose                     15,778,028            1,920       4            418           58
Other written material             22,390,782            1,691       22           57            148
Spoken                             10,409,858            1,978       11           35            109

a) Range

some: 6

smile: 6

theory: 6

chance: 6

b) Standard deviation

some: 287.74

smile: 121.06

theory: 141.54

chance: 69.46

c) the Coefficient of variation

some: 0.17

smile: 1.71

theory: 1.17

chance: 0.49


d) Juilland’s D

some: 0.92

smile: 0.24

theory: 0.47

chance: 0.78
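
The four measures in exercise 7 can be recomputed from the relative frequencies in the table. The sketch below makes three assumptions, each of which reproduces the answers given: range is the number of corpus sections in which the word occurs, the standard deviation is the population version (dividing by the number of sections n), and Juilland's D is 1 − CV / √(n − 1), where CV is the standard deviation divided by the mean. The function name `dispersion` is illustrative.

```python
import math

# Relative frequencies from the exercise 7 table, one value per BNC section.
RELATIVE_FREQUENCIES = {
    "some": [1525, 1118, 1785, 1920, 1691, 1978],
    "smile": [341, 32, 16, 4, 22, 11],
    "theory": [21, 28, 164, 418, 57, 35],
    "chance": [164, 275, 91, 58, 148, 109],
}


def dispersion(values):
    """Return (range, standard deviation, coefficient of variation, Juilland's D)."""
    n = len(values)
    mean = sum(values) / n
    corpus_range = sum(1 for v in values if v > 0)  # sections where the word occurs
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population SD
    cv = sd / mean
    juilland_d = 1 - cv / math.sqrt(n - 1)
    return corpus_range, sd, cv, juilland_d


for word, values in RELATIVE_FREQUENCIES.items():
    corpus_range, sd, cv, juilland_d = dispersion(values)
    print(f"{word}: range={corpus_range} SD={sd:.2f} CV={cv:.2f} D={juilland_d:.2f}")
```

A D close to 1 (some) indicates an even spread across the six sections; a D close to 0 (smile) indicates that the word is concentrated in a few of them.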

8) Use Juilland’s U usage coefficient to rank the words some, smile, theory and chance according to their
relative importance.

     word     Juilland's D   AF (whole corpus)   Juilland's U (Juilland's D × AF)
1.   some     0.92           167,050             153,686.00
2.   chance   0.78           12,809              9,991.02
3.   theory   0.47           12,809              6,020.23
4.   smile    0.24           6,848               1,643.52
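
The ranking follows directly from multiplying each word's Juilland's D by its absolute frequency in the whole corpus. A minimal sketch using the values from the table (the names `WORDS` and `juilland_u` are illustrative):

```python
# (Juilland's D, absolute frequency in the whole corpus), as given in the table.
WORDS = {
    "some": (0.92, 167_050),
    "chance": (0.78, 12_809),
    "theory": (0.47, 12_809),
    "smile": (0.24, 6_848),
}


def juilland_u(d, absolute_freq):
    """Juilland's usage coefficient: dispersion-weighted absolute frequency."""
    return d * absolute_freq


ranking = sorted(WORDS, key=lambda w: juilland_u(*WORDS[w]), reverse=True)
for position, word in enumerate(ranking, start=1):
    print(f"{position}. {word}: U = {juilland_u(*WORDS[word]):,.2f}")
```

Because chance and theory have the same absolute frequency here, their order is decided entirely by D.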

9) Calculate the ARF of the selected words in the BE06 corpus (985,628 tokens):

a) frigid: ARF = 1.02

b) chemistry: ARF = 3.17

c) porn: ARF = 4.6
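
For orientation, average reduced frequency is commonly defined as ARF = (1/v) × Σ min(dᵢ, v), where v is the corpus size divided by the word's absolute frequency and the dᵢ are the distances between consecutive occurrences, with the first distance wrapping around from the last occurrence to the first. The sketch below implements that definition. The positions of frigid, chemistry and porn in BE06 are not part of this answer sheet, so the example uses a hypothetical 100-token corpus, and the function name is made up for illustration.

```python
def average_reduced_frequency(positions, corpus_size):
    """ARF from the token positions of a word (any consistent indexing, ascending)."""
    f = len(positions)
    v = corpus_size / f  # average distance if the word were spread perfectly evenly
    # The first distance wraps around the end of the corpus (cyclic definition).
    distances = [positions[0] + corpus_size - positions[-1]]
    distances += [b - a for a, b in zip(positions, positions[1:])]
    return sum(min(d, v) for d in distances) / v


# Hypothetical data: four occurrences in a 100-token corpus.
print(average_reduced_frequency([10, 35, 60, 85], 100))  # evenly spread: 4.0
print(average_reduced_frequency([10, 11, 12, 13], 100))  # one cluster: about 1.12
```

An evenly distributed word keeps its full absolute frequency, while a word whose occurrences are bunched together is reduced towards 1.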


Brezina, V. (2018). Statistics in Corpus Linguistics: A Practical Guide. Cambridge: Cambridge University Press.

Do you use language corpora in your research or study, but find that you struggle with statistics? This practical
introduction will equip you to understand the key principles of statistical thinking and apply these concepts to
your own research, without the need for prior statistical knowledge. The book gives step-by-step guidance through
the process of statistical analysis and provides multiple examples of how statistical techniques can be used to
analyse and visualise linguistic data. It also includes a useful selection of discussion questions and exercises
which you can use to check your understanding.

The book comes with a Companion website, which provides additional materials (answers to
exercises, datasets, advanced materials, teaching slides etc.) and Lancaster Stats Tools online, a free
click-and-analyse statistical tool for easy calculation of the statistical measures discussed in the book.
