Statistically Understanding Text
Example
Nouns: john, room, answer
Adjectives: happy, new, large, grey
Full verbs: search, grow, hold, have
Adverbs: really, completely, very, also, enough
Examples
Our friends called us yesterday and asked if we’d like to visit them next
month.
The best time to study is early in the morning or late in the evening.
Most of these words are short, but they play important grammatical roles:
determiners, prepositions, conjunctions, pronouns, etc.
Shakespeare
https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
News articles
https://www.kaggle.com/uciml/news-aggregator-dataset
Amazon reviews
https://www.kaggle.com/bittlingmayer/amazonreviews
Types
Number of distinct words in the corpus (size of vocabulary).
Tokens
Total number of running words in the corpus.
Example
They picnicked by the pool, then lay back on the grass and looked at the stars.
Tokens: 16
Types: 14 ("the" occurs three times, so two repeats are discounted)
TTR
The type/token ratio (TTR) is the ratio of the number of different words
(types) to the number of running words (tokens) in a given text or corpus.
This index indicates how often, on average, a new ‘word form’ appears in
the text or corpus.
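A minimal Python sketch of these counts, using the example sentence above (the regex tokenizer is a simplifying assumption; real tokenizers handle punctuation and case more carefully):

import re

def tokenize(text):
    # Lowercase, then take maximal alphabetic runs as tokens.
    return re.findall(r"[a-z]+", text.lower())

text = ("They picnicked by the pool, then lay back on the grass "
        "and looked at the stars.")
tokens = tokenize(text)
types = set(tokens)

print(len(tokens))               # 16 tokens
print(len(types))                # 14 types ("the" occurs three times)
print(len(types) / len(tokens))  # TTR = 0.875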
Zipf’s Law
A relationship between the frequency of a word ( f ) and its position in the list
(its rank r).
f ∝ 1/r
or, equivalently, there is a constant k such that
f · r = k
i.e. the 50th most common word should occur three times as often as the
150th most common word.
Let
pr denote the probability of word of rank r
N denote the total number of word occurrences
p_r = f / N = A / r
Empirically, the value of A is close to 0.1 for English corpora.
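A quick empirical check of the law (a sketch; corpus.txt is a placeholder for any plain-text corpus, e.g. the Shakespeare file linked above):

import re
from collections import Counter

tokens = re.findall(r"[a-z]+",
                    open("corpus.txt", encoding="utf-8").read().lower())
counts = Counter(tokens)
N = len(tokens)

# If Zipf's law holds, f * r stays roughly constant
# and the estimate A = (f / N) * r stays near 0.1.
for r, (word, f) in enumerate(counts.most_common(10), start=1):
    print(f"{r:>3} {word:<10} f*r = {f * r:>8}  A = {f * r / N:.3f}")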
Empirical Support
Zipf also observed that the number of meanings m of a word grows with its
frequency, roughly m ∝ √f, i.e. m ∝ 1/√r:
Rank ≈ 10000: average 2.1 meanings
Rank ≈ 5000: average 3 meanings
Rank ≈ 2000: average 4.6 meanings
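As a quick consistency check (an added note): if m ∝ 1/√r, then m · √r should be roughly constant across these rows, and indeed 2.1 · √10000 ≈ 210, 3 · √5000 ≈ 212, and 4.6 · √2000 ≈ 206.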
Zipf also related word length l to frequency f:
l ∝ 1/f
(frequent words tend to be short)
Given the first law (f ∝ 1/r), this implies
l ∝ r
i.e. word length tends to increase with rank.
How does the size of the overall vocabulary (number of unique words) grow
with the size of the corpus?
Heaps’ Law
Let |V| be the size of the vocabulary and N the number of tokens.
|V| = K · N^β
Typically
K ≈ 10-100
β ≈ 0.4-0.6 (roughly square-root growth)
An example empirical fit: K = 15.3, β = 0.56
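K and β can be estimated from a corpus with a log-log linear fit, since log|V| = log K + β · log N. A sketch (corpus.txt is again a placeholder):

import re
import numpy as np

tokens = re.findall(r"[a-z]+",
                    open("corpus.txt", encoding="utf-8").read().lower())

# Sample (N, |V|) at 20 prefix sizes of the running text.
sizes = np.linspace(1000, len(tokens), 20, dtype=int)
vocab = [len(set(tokens[:n])) for n in sizes]

# Fit log|V| = log K + beta * log N; polyfit returns (slope, intercept).
beta, logK = np.polyfit(np.log(sizes), np.log(vocab), 1)
print(f"K = {np.exp(logK):.1f}, beta = {beta:.2f}")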
Tom Sawyer
Exercise (a starter sketch follows this list):
Download the Tom Sawyer text.
Compute tokens, types, and TTR.
Check whether Zipf's laws for meanings and length hold.
Plot Heaps' law.
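A starter sketch for this exercise, assuming the novel is saved locally as tom_sawyer.txt (the filename is an assumption; the text can be downloaded from Project Gutenberg). Checking the meanings law would additionally require a sense inventory such as WordNet, which is omitted here.

import re
from collections import Counter
import matplotlib.pyplot as plt

# Tokenize the novel (lowercase alphabetic runs; a simplifying assumption).
tokens = re.findall(r"[a-z]+",
                    open("tom_sawyer.txt", encoding="utf-8").read().lower())
counts = Counter(tokens)
print("tokens:", len(tokens),
      "types:", len(counts),
      "TTR:", round(len(counts) / len(tokens), 3))

# Length law (l ∝ 1/f): average word length should grow down the rank list.
ranked = counts.most_common()
for lo, hi in [(0, 100), (100, 1000), (1000, len(ranked))]:
    band = ranked[lo:hi]
    avg_len = sum(len(w) for w, _ in band) / len(band)
    print(f"ranks {lo + 1}-{hi}: average length {avg_len:.2f}")

# Heaps' law: plot vocabulary size |V| against running token count N.
sizes = range(1000, len(tokens), 1000)
vocab = [len(set(tokens[:n])) for n in sizes]
plt.loglog(list(sizes), vocab)
plt.xlabel("N (tokens)")
plt.ylabel("|V| (types)")
plt.title("Heaps' law on Tom Sawyer")
plt.show()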