Words and Corpora (J+M)
Basic Text Processing
How many words in a sentence?
"I do uh main- mainly business data processing"
◦ Fragments, filled pauses
"Seuss’s cat in the hat is different from other cats!"
◦ Lemma: same stem, part of speech, rough word sense
◦ cat and cats = same lemma
◦ Wordform: the full inflected surface form
◦ cat and cats = different wordforms
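To make the distinction concrete, here is a minimal Python sketch using NLTK's WordNetLemmatizer (assuming the WordNet data has been downloaded); both wordforms map to one lemma:

```python
# Minimal sketch: two wordforms, one lemma, via NLTK's WordNetLemmatizer.
# Assumes the WordNet data is installed: nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))  # -> cat   (wordform "cats")
print(lemmatizer.lemmatize("cat"))   # -> cat   (wordform "cat")
# Different wordforms, same lemma: cat
```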
How many words in a sentence?
they lay back on the San Francisco grass and looked at the stars and their
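A small Python sketch makes the token/type distinction in this sentence concrete (simple whitespace splitting is assumed to suffice here):

```python
# Sketch: tokens (instances) vs. types (distinct words) for the example.
sentence = ("they lay back on the San Francisco grass "
            "and looked at the stars and their")
tokens = sentence.split()

print(len(tokens))       # 15 tokens
print(len(set(tokens)))  # 13 types ("the" and "and" each appear twice)
```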
Corpus datasheets
Motivation:
• Why was the corpus collected?
• By whom?
• Who funded it?
Situation: In what situation was the text written?
Collection process: If it is a subsample, how was it sampled? Was there consent? Pre-processing?
+ Annotation process, language variety, demographics, etc.
Basic Text Processing: Word tokenization
Text Normalization
Counting word types with sort | uniq -c (note that upper- and lower-case forms are counted separately):
1945 A
72 AARON
19 ABBESS
5 ABBOT
25 Aaron
6 Abate
1 Abates
5 Abbess
6 Abbey
3 Abbot
...
The first step: tokenizing
tr -sc 'A-Za-z' '\n' < shakes.txt | head
THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...
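For comparison, a rough Python equivalent of this tokenizing step (a sketch, assuming the same shakes.txt file):

```python
# Sketch: extract maximal runs of ASCII letters, like tr -sc 'A-Za-z' '\n'.
import re

with open("shakes.txt") as f:
    text = f.read()

tokens = re.findall(r"[A-Za-z]+", text)
print(tokens[:9])  # ['THE', 'SONNETS', 'by', 'William', 'Shakespeare', ...]
```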
The second step: sorting
tr -sc 'A-Za-z' '\n' < shakes.txt | sort | head
A
A
A
A
A
A
A
A
A
...
More counting
Merging upper and lower case, then sorting the counts:
tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
23243 the
22225 i
18618 and
16339 to
15687 of
12780 a
12163 you
10839 my
10005 in
8954 d

What happened here?
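The stray d is a clitic left over from contractions like I'd: splitting on non-letters breaks i'd into i and d. A minimal Python sketch of the same case-folding and counting pipeline (again assuming shakes.txt):

```python
# Sketch: lowercase, split on non-letters, count, sort by frequency.
import re
from collections import Counter

with open("shakes.txt") as f:
    text = f.read().lower()          # merge upper and lower case

counts = Counter(re.findall(r"[a-z]+", text))
for word, n in counts.most_common(10):
    print(n, word)                   # e.g. 23243 the ... 8954 d
```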
Issues in Tokenization
Can't just blindly remove punctuation:
◦ m.p.h., Ph.D., AT&T, cap’n
◦ prices ($45.55)
◦ dates (01/02/06)
◦ URLs (http://www.northwestern.edu)
◦ hashtags (#nlproc)
◦ email addresses ([email protected])
Clitic: a part of a word that can't stand on its own
◦ the 're in we're ("are"), French j' in j'ai (je), l' in l'honneur (le)
When should multiword expressions (MWE) be words?
◦ New York, rock ’n’ roll
Tokenization in NLTK
Bird, Loper and Klein (2009), Natural Language Processing with Python. O’Reilly
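As one illustration, a lightly adapted sketch of the regexp_tokenize example from that book, which handles several of the cases above (abbreviations, hyphenated words, prices):

```python
# Sketch: rule-based tokenization with nltk.regexp_tokenize,
# adapted from Bird, Loper & Klein (2009).
import nltk

text = "That U.S.A. poster-print costs $12.40..."
pattern = r'''(?x)           # verbose regexp: allow comments and whitespace
      (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*           # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?     # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                 # ellipsis
    | [][.,;"'?():-_`]       # punctuation marks as separate tokens
'''
print(nltk.regexp_tokenize(text, pattern))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```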
Tokenization in languages without spaces
Many languages (like Chinese, Japanese, Thai) don't use spaces to separate words!
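A small Python sketch of two common options: the character-as-token baseline often used for Chinese NLP, and a dedicated word segmenter (the third-party jieba library here; the exact segmentation shown is illustrative):

```python
# Sketch: tokenizing Chinese text, which has no spaces between words.
import jieba  # third-party segmenter: pip install jieba

text = "姚明进入总决赛"  # "Yao Ming reaches the finals"

# Option 1: character-as-token baseline (common for Chinese NLP)
print(list(text))        # ['姚', '明', '进', '入', '总', '决', '赛']

# Option 2: dictionary/statistics-based word segmentation
print(jieba.lcut(text))  # e.g. ['姚明', '进入', '总决赛']
```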
BPE: merges learned from the training corpus, in order:
◦ Merge e r to er
◦ Merge er _ to er_
◦ Merge n e to ne
The next merges are: (ne, w), (new, er_), (low, _)
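A compact sketch of the BPE learning loop on a toy corpus in the textbook's style (the corpus counts here are assumptions for illustration; _ is the end-of-word symbol):

```python
# Sketch: learning BPE merges from a toy corpus.
# Each key is a word as space-separated symbols; "_" marks end of word.
import re
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return pairs

def apply_merge(vocab, pair):
    """Merge every standalone occurrence of the pair into one symbol."""
    pat = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pat.sub("".join(pair), w): f for w, f in vocab.items()}

vocab = {"l o w _": 5, "l o w e s t _": 2,
         "n e w e r _": 6, "w i d e r _": 3, "n e w _": 2}

merges = []
for _ in range(3):                    # learn the first three merges
    pairs = pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    vocab = apply_merge(vocab, best)

print(merges)  # [('e', 'r'), ('er', '_'), ('n', 'e')]
```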
BPE token segmenter algorithm
On the test data, run each merge learned from the
training data:
◦ Greedily
◦ In the order we learned them
◦ (test frequencies don't play a role)
So: merge every e r to er, then merge er _ to er_, etc.
Result:
◦ Test set "n e w e r _" would be tokenized as a full word
◦ Test set "l o w e r _" would be two tokens: "low er_"
Properties of BPE tokens
Usually include frequent words
And frequent subwords
• Which are often morphemes like -est or -er
A morpheme is the smallest meaning-bearing unit of a
language
• unlikeliest has 3 morphemes un-, likely, and -est
Basic Text Processing: Word Normalization and other issues
Word Normalization
Putting words/tokens in a standard format
◦ U.S.A. or USA
◦ uhhuh or uh-huh
◦ Fed or fed
◦ am, is, be, are
Case folding
Applications like IR: reduce all letters to lower case
◦ Since users tend to use lower case
◦ Possible exception: upper case in mid-sentence?
◦ e.g., General Motors
◦ Fed vs. fed
◦ SAIL vs. sail