Statistically Understanding Text
Example
Nouns: john, room, answer
Adjectives: happy, new, large, grey
Full verbs: search, grow, hold, have
Adverbs: really, completely, very, also, enough
Examples
Our friends called us yesterday and asked if we’d like to visit them next
month.
The best time to study is early in the morning or late in the evening.
Most of these words are short, but they play important grammatical roles:
determiners, prepositions, conjunctions, pronouns, etc.
Shakespeare
https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
News articles
https://www.kaggle.com/uciml/news-aggregator-dataset
Amazon reviews
https://www.kaggle.com/bittlingmayer/amazonreviews
Types
Number of distinct words in the corpus (size of vocabulary).
Tokens
Total number of running words in the corpus.
Example
They picnicked by the pool, then lay back on the grass and looked at the stars.
Tokens: 16
Types: 14 ("the" occurs three times, so two repeats are discounted)
TTR
The type/token ratio (TTR) is the ratio of the number of different words
(types) to the number of running words (tokens) in a given text or corpus.
This index indicates how often, on average, a new ‘word form’ appears in
the text or corpus.
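A minimal Python sketch of these counts, using the example sentence above (the regex tokenizer is a simplifying assumption; real tokenizers handle punctuation and case more carefully):

import re

def tokenize(text):
    # Lowercase, then take maximal alphabetic runs as tokens.
    return re.findall(r"[a-z]+", text.lower())

text = ("They picnicked by the pool, then lay back on the grass "
        "and looked at the stars.")
tokens = tokenize(text)
types = set(tokens)

print(len(tokens))               # 16 tokens
print(len(types))                # 14 types ("the" occurs three times)
print(len(types) / len(tokens))  # TTR = 0.875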
Zipf’s Law
A relationship between the frequency of a word ( f ) and its position in the list
(its rank r).
f ∝ 1/r
or, equivalently, there is a constant k such that
f · r = k
i.e. the 50th most common word should occur three times as often as the
150th most common word.
Let
pr denote the probability of word of rank r
N denote the total number of word occurrences
p_r = f / N = A / r
Empirically, the value of A is close to 0.1 for English corpora.
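A quick empirical check of the law (a sketch; corpus.txt is a placeholder for any plain-text corpus, e.g. the Shakespeare file linked above):

import re
from collections import Counter

tokens = re.findall(r"[a-z]+",
                    open("corpus.txt", encoding="utf-8").read().lower())
counts = Counter(tokens)
N = len(tokens)

# If Zipf's law holds, f * r stays roughly constant
# and the estimate A = (f / N) * r stays near 0.1.
for r, (word, f) in enumerate(counts.most_common(10), start=1):
    print(f"{r:>3} {word:<10} f*r = {f * r:>8}  A = {f * r / N:.3f}")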
Empirical Support
Zipf also observed that the number of meanings m of a word grows with its
frequency, roughly m ∝ √f, i.e. m ∝ 1/√r:
Rank ≈ 10000: average 2.1 meanings
Rank ≈ 5000: average 3 meanings
Rank ≈ 2000: average 4.6 meanings
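As a quick consistency check (an added note): if m ∝ 1/√r, then m · √r should be roughly constant across these rows, and indeed 2.1 · √10000 ≈ 210, 3 · √5000 ≈ 212, and 4.6 · √2000 ≈ 206.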
Zipf also related word length l to frequency f:
l ∝ 1/f
(frequent words tend to be short)
Given the first law (f ∝ 1/r), this implies
l ∝ r
i.e. word length tends to increase with rank.
How does the size of the overall vocabulary (number of unique words) grow
with the size of the corpus?
Heaps’ Law
Let |V| be the size of the vocabulary and N the number of tokens.
|V| = K · N^β
Typically
K ≈ 10-100
β ≈ 0.4-0.6 (roughly square-root growth)
An example empirical fit: K = 15.3, β = 0.56
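K and β can be estimated from a corpus with a log-log linear fit, since log|V| = log K + β · log N. A sketch (corpus.txt is again a placeholder):

import re
import numpy as np

tokens = re.findall(r"[a-z]+",
                    open("corpus.txt", encoding="utf-8").read().lower())

# Sample (N, |V|) at 20 prefix sizes of the running text.
sizes = np.linspace(1000, len(tokens), 20, dtype=int)
vocab = [len(set(tokens[:n])) for n in sizes]

# Fit log|V| = log K + beta * log N; polyfit returns (slope, intercept).
beta, logK = np.polyfit(np.log(sizes), np.log(vocab), 1)
print(f"K = {np.exp(logK):.1f}, beta = {beta:.2f}")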
Tom Sawyer
Exercise (a starter sketch follows this list):
Download the Tom Sawyer text.
Compute tokens, types, and TTR.
Check whether Zipf's laws for meanings and length hold.
Plot Heaps' law.
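A starter sketch for this exercise, assuming the novel is saved locally as tom_sawyer.txt (the filename is an assumption; the text can be downloaded from Project Gutenberg). Checking the meanings law would additionally require a sense inventory such as WordNet, which is omitted here.

import re
from collections import Counter
import matplotlib.pyplot as plt

# Tokenize the novel (lowercase alphabetic runs; a simplifying assumption).
tokens = re.findall(r"[a-z]+",
                    open("tom_sawyer.txt", encoding="utf-8").read().lower())
counts = Counter(tokens)
print("tokens:", len(tokens),
      "types:", len(counts),
      "TTR:", round(len(counts) / len(tokens), 3))

# Length law (l ∝ 1/f): average word length should grow down the rank list.
ranked = counts.most_common()
for lo, hi in [(0, 100), (100, 1000), (1000, len(ranked))]:
    band = ranked[lo:hi]
    avg_len = sum(len(w) for w, _ in band) / len(band)
    print(f"ranks {lo + 1}-{hi}: average length {avg_len:.2f}")

# Heaps' law: plot vocabulary size |V| against running token count N.
sizes = range(1000, len(tokens), 1000)
vocab = [len(set(tokens[:n])) for n in sizes]
plt.loglog(list(sizes), vocab)
plt.xlabel("N (tokens)")
plt.ylabel("|V| (types)")
plt.title("Heaps' law on Tom Sawyer")
plt.show()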