2 TextProc 2023
Basic Text
Processing
Regular expressions are used everywhere
◦ Part of every text processing task
◦ Not a general NLP solution (for that we use large NLP
systems we will see in later lectures)
◦ But very useful as part of those systems (e.g., for pre-
processing or text formatting)
◦ Necessary for data analysis of text data
◦ A widely used tool in industry and academia
Regular expressions
A formal language for specifying text strings
How can we search for mentions of these cute animals in text?
◦ woodchuck
◦ woodchucks
◦ Woodchuck
◦ Woodchucks
◦ Groundhog
◦ groundhogs
Regular Expressions: Disjunctions
Letters inside square brackets []
Pattern Matches
[wW]oodchuck Woodchuck, woodchuck
[1234567890] Any one digit
Ranges using the dash [A-Z]
Pattern Matches
[A-Z] An upper case letter Drenched Blossoms
[a-z] A lower case letter my beans were impatient
[0-9] A single digit Chapter 1: Down the Rabbit Hole
Regular Expressions: Negation in Disjunction
Caret as first character in [] negates the list
◦ Note: Caret means negation only when it's first in []
◦ Special characters (., *, +, ?) lose their special meaning inside []
Pattern Matches
groundhog|woodchuck woodchuck
yours|mine yours
a|b|c = [abc]
[gG]roundhog|[Ww]oodchuck Woodchuck
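The bracket, range, negation, and pipe patterns above can be tried out in Python. This is a minimal sketch using the standard `re` module; the sample strings are made up for illustration.

```python
import re

text = "Woodchucks and groundhogs: is a woodchuck a Groundhog?"

# Square brackets: either case of the first letter.
bracket_hits = re.findall(r"[wW]oodchuck", text)

# Pipe: either animal, combined with brackets for case.
pipe_hits = re.findall(r"[gG]roundhog|[wW]oodchuck", text)

# Range: first upper case letter in the string.
first_upper = re.search(r"[A-Z]", "we saw Drenched Blossoms").group()

# Caret first inside brackets negates: first character that is not upper case.
first_not_upper = re.search(r"[^A-Z]", "Oyfn pripetchik").group()

print(bracket_hits)                  # ['Woodchuck', 'woodchuck']
print(pipe_hits)                     # ['Woodchuck', 'groundhog', 'woodchuck', 'Groundhog']
print(first_upper, first_not_upper)  # D y
```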
Wildcards, optionality, repetition: . ? * +
Pattern        Matches                       Examples
beg.n          Any char                      begin, begun, beg3n, beg n
woodchucks?    Optional s                    woodchuck, woodchucks
to*            0 or more of previous char    t, to, too, tooo
to+            1 or more of previous char    to, too, tooo, toooo
Kleene *, Kleene + (named after Stephen C Kleene)
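A quick way to check the wildcard, optionality, and repetition operators is to test whole candidate strings against each pattern. The helper name `full_matches` and the candidate lists are assumptions made for this sketch.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# . matches any single character (including a digit or a space)
any_char = full_matches(r"beg.n", ["begin", "begun", "beg3n", "beg n", "begn"])

# ? makes the previous character optional
optional_s = full_matches(r"woodchucks?", ["woodchuck", "woodchucks", "woodchuckss"])

# * means zero or more of the previous character
kleene_star = full_matches(r"to*", ["t", "to", "too", "tooo", "o"])

# + means one or more of the previous character
kleene_plus = full_matches(r"to+", ["t", "to", "too", "tooo", "toooo"])

print(any_char)     # ['begin', 'begun', 'beg3n', 'beg n']
print(optional_s)   # ['woodchuck', 'woodchucks']
print(kleene_star)  # ['t', 'to', 'too', 'tooo']
print(kleene_plus)  # ['to', 'too', 'tooo', 'toooo']
```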
Regular Expressions: Anchors ^ $
Pattern Matches
^[A-Z] Palo Alto
^[^A-Za-z] 1 “Hello”
\.$ The end.
.$ The end? The end!
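The anchor patterns can be checked the same way. A small sketch with Python's `re.search`; note that an unescaped `.` before `$` matches any final character, while `\.` matches only a literal period.

```python
import re

starts_upper = re.search(r"^[A-Z]", "Palo Alto")          # upper case letter at the start
starts_non_letter = re.search(r"^[^A-Za-z]", "1 Hello")   # non-letter at the start
ends_with_period = re.search(r"\.$", "The end.")          # literal period at the end
no_final_period = re.search(r"\.$", "The end?")           # no period at the end: no match
any_final_char = re.search(r".$", "The end!")             # any character at the end

print(starts_upper.group())       # P
print(starts_non_letter.group())  # 1
print(ends_with_period.group())   # .
print(no_final_period)            # None
print(any_final_char.group())     # !
```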
A note about Python regular expressions
◦ Regex and Python both use backslash "\" for
special characters. You must type extra backslashes!
◦ "\\d+" to search for 1 or more digits
◦ "\n" in Python means the "newline" character, not a
"slash" followed by an "n". Need "\\n" for two characters.
◦ Instead: use Python's raw string notation for regex:
◦ r"[tT]he"
◦ r"\d+" matches one or more digits
◦ instead of "\\d+"
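The difference between ordinary and raw strings is easy to verify directly. A short sketch; the sample sentence is made up.

```python
import re

# "\n" is one character (newline); "\\n" and r"\n" are two (backslash, n).
newline_len = len("\n")
escaped_len = len("\\n")
raw_len = len(r"\n")

# The raw string and the doubled-backslash string are the same pattern.
same_pattern = ("\\d+" == r"\d+")

digits = re.findall(r"\d+", "Chapter 1: pages 10 to 42")

print(newline_len, escaped_len, raw_len)  # 1 2 2
print(same_pattern)                       # True
print(digits)                             # ['1', '10', '42']
```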
The iterative process of writing regexes
Find me all instances of the word “the” in a text.
the
Misses capitalized examples
[tT]he
Incorrectly returns other or Theology
\W[tT]he\W
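The three attempts can be compared on one sentence. This is a sketch on a made-up string. The last pattern, with the word-boundary anchor `\b`, is not on the slide; it is shown as one possible further refinement, because `\W` needs an actual character on each side and so misses a match at the very start of the string.

```python
import re

text = "The dog saw the other dog"

attempt_1 = re.findall(r"the", text)           # misses "The", hits inside "other"
attempt_2 = re.findall(r"[tT]he", text)        # finds "The", still hits "other"
attempt_3 = re.findall(r"\W[tT]he\W", text)    # no hit in "other", but misses initial "The"
with_boundary = re.findall(r"\b[tT]he\b", text)  # assumption: \b as an alternative

print(attempt_1)      # ['the', 'the']
print(attempt_2)      # ['The', 'the', 'the']
print(attempt_3)      # [' the ']
print(with_boundary)  # ['The', 'the']
```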
False positives and false negatives
The process we just went through was based on fixing two kinds of errors:
1. Not matching things that we should have matched (The): false negatives
2. Matching strings that we should not have matched (other, Theology): false positives
More Regular Expressions:
Substitutions and ELIZA
Substitutions
Substitution in Python and UNIX commands:
s/regexp1/pattern/
e.g.:
s/colour/color/
Capture Groups
• Say we want to put angles around all numbers:
the 35 boxes → the <35> boxes
• Use parens () to "capture" a pattern into a
numbered register (1, 2, 3…)
• Use \1 to refer to the contents of the register
s/([0-9]+)/<\1>/
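In Python the s/.../.../ substitutions correspond to `re.sub`. A minimal sketch of the two examples above; the second input string is made up.

```python
import re

# s/colour/color/
american = re.sub(r"colour", "color", "my favourite colour")

# s/([0-9]+)/<\1>/ : \1 refers back to what the parentheses captured
bracketed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")

print(american)   # my favourite color
print(bracketed)  # the <35> boxes
```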
Capture groups: multiple registers
/the (.*)er they (.*), the \1er we \2/
Matches
the faster they ran, the faster we ran
But not
the faster they ran, the faster we ate
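The two-register pattern can be checked in Python, where `\1` and `\2` work the same way inside the pattern. A sketch using `re.fullmatch` on the two sentences above.

```python
import re

pattern = r"the (.*)er they (.*), the \1er we \2"

same = re.fullmatch(pattern, "the faster they ran, the faster we ran")
different = re.fullmatch(pattern, "the faster they ran, the faster we ate")

print(same.groups())  # ('fast', 'ran')
print(different)      # None
```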
But suppose we don't want to capture?
Parentheses have a double function: grouping terms, and
capturing
Non-capturing groups: add a ?: after paren:
/(?:some|a few) (people|cats) like some \1/
matches
◦ some cats like some cats
but not
◦ some cats like some some
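A short check of the non-capturing group in Python: because `(?:some|a few)` does not fill a register, `\1` refers to `(people|cats)`. The third test sentence is made up.

```python
import re

pattern = re.compile(r"(?:some|a few) (people|cats) like some \1")

cats = pattern.fullmatch("some cats like some cats")
wrong = pattern.fullmatch("some cats like some some")
people = pattern.fullmatch("a few people like some people")

print(pattern.groups)   # 1  (only one capturing group)
print(cats.group(1))    # cats
print(wrong)            # None
print(people.group(1))  # people
```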
Lookahead assertions
(?= pattern) is true if pattern matches, but is
zero-width; doesn't advance character pointer
(?! pattern) true if a pattern does not match
How to match, at the beginning of a line, any single
word that doesn’t start with “Volcano”:
/^(?!Volcano)[A-Za-z]+/
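Both kinds of lookahead can be tried in Python. A sketch with made-up input strings; the positive-lookahead pattern (words followed by a comma) is an illustration, not from the slide.

```python
import re

# Negative lookahead: a line-initial word that does not start with "Volcano".
not_volcano = re.compile(r"^(?!Volcano)[A-Za-z]+")
rejected = not_volcano.search("Volcanoes erupt")
accepted = not_volcano.search("Mountains erode")

# Positive lookahead is zero-width: the comma is required but not consumed.
before_comma = re.findall(r"[A-Za-z]+(?=,)", "red, green, blue")

print(rejected)          # None
print(accepted.group())  # Mountains
print(before_comma)      # ['red', 'green']
```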
Simple Application: ELIZA
Early NLP system that imitated a Rogerian
psychotherapist
◦ Joseph Weizenbaum, 1966.
they lay back on the San Francisco grass and looked at the stars
and their
Motivation:
• Why was the corpus collected?
• By whom?
• Who funded it?
Situation: In what situation was the text written?
Collection process: If it is a subsample how was it sampled? Was
there consent? Pre-processing?
+Annotation process, language variety, demographics, etc.
Words and Corpora
Word tokenization
Text Normalization
1945 A
72 AARON
19 ABBESS
5 ABBOT
... ...
25 Aaron
6 Abate
1 Abates
5 Abbess
6 Abbey
3 Abbot
... ...
The first step: tokenizing
tr -sc 'A-Za-z' '\n' < shakes.txt | head
THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...
The second step: sorting
tr -sc 'A-Za-z' '\n' < shakes.txt | sort | head
A
A
A
A
A
A
A
A
A
...
More counting
Merging upper and lower case
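tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c
Sorting the counts
tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r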
23243 the
22225 i
18618 and
16339 to
15687 of
12780 a
12163 you
10839 my
10005 in
8954 d        What happened here?
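The same counting can be sketched in Python. This is an approximation of the tr pipeline, not an exact equivalent: it lower-cases the text and keeps maximal runs of ASCII letters, so everything else (including apostrophes) acts as a separator. The function name `word_counts` and the sample text are made up for illustration.

```python
import re
from collections import Counter

def word_counts(text):
    """Count lower-cased runs of ASCII letters, roughly like the tr pipeline."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

sample = "I'd say the cat'd sit. The end."
counts = word_counts(sample)

# The apostrophe is not a letter, so "I'd" and "cat'd" each leave behind a token "d".
print(counts["d"])    # 2
print(counts["the"])  # 2
```

On a real file, `word_counts(open("shakes.txt").read()).most_common(10)` would give the most frequent tokens, assuming a local file of that name.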
Issues in Tokenization
Can't just blindly remove punctuation:
◦ m.p.h., Ph.D., AT&T, cap’n
◦ prices ($45.55)
◦ dates (01/02/06)
◦ URLs (http://www.stanford.edu)
◦ hashtags (#nlproc)
◦ email addresses ([email protected])
Clitic: a word that doesn't stand on its own
◦ "are" in we're, French "je" in j'ai, "le" in l'honneur
When should multiword expressions (MWE) be words?
◦ New York, rock ’n’ roll
Tokenization in NLTK
Fig. 2.12 shows an example of a basic regular expression that can be used to tokenize with the nltk.regexp_tokenize function of the Python-based Natural Language Toolkit (NLTK) (Bird et al. 2009; http://www.nltk.org).
Bird, Loper and Klein (2009), Natural Language Processing with Python. O’Reilly
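A regular-expression tokenizer of this kind can be sketched with the standard library alone. The pattern below is an illustrative assumption, not the pattern from the figure, and it uses `re.findall` rather than NLTK so that it runs without extra dependencies. All groups are non-capturing, so `findall` returns whole tokens.

```python
import re

# Verbose mode: whitespace and # comments inside the pattern are ignored.
TOKEN_PATTERN = re.compile(r"""
      (?:[A-Z]\.)+           # abbreviations such as U.S.A.
    | \$?\d+(?:\.\d+)?%?     # prices and percentages such as $12.40 or 82%
    | \w+(?:-\w+)*           # words, optionally with internal hyphens
    | \.\.\.                 # ellipsis
    | [.,;"'?():_-]          # single punctuation marks as separate tokens
""", re.VERBOSE)

def tokenize(text):
    """Return the tokens found by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("That U.S.A. poster-print costs $12.40...")
print(tokens)  # ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```

The order of the alternatives matters: the first alternative that matches at a position wins, so the price pattern is tried before the general word pattern.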
Figure 2.13 The token learner part of the BPE algorithm for taking a corpus broken up
into individual characters or bytes, and learning a vocabulary by iteratively merging tokens.
Byte Pair Encoding (BPE) Addendum
Most subword algorithms are run inside space-
separated tokens.
So we commonly first add a special end-of-word
symbol '__' before space in training corpus
Next, separate into letters.
BPE token learner
Corpus and vocabulary after merging e r to er:
corpus               vocabulary
5 l o w _            _, d, e, i, l, n, o, r, s, t, w, er
2 l o w e s t _
6 n e w er _
3 w i d er _
2 n e w _
Merge er _ to er_
Now the most frequent pair is er _, which we merge; our system has learned that there should be a token for word-final er, represented as er_:
corpus               vocabulary
5 l o w _            _, d, e, i, l, n, o, r, s, t, w, er, er_
2 l o w e s t _
6 n e w er_
3 w i d er_
2 n e w _
Merge n e to ne
Next n e (total count of 8) get merged to ne:
corpus               vocabulary
5 l o w _            _, d, e, i, l, n, o, r, s, t, w, er, er_, ne
2 l o w e s t _
6 ne w er_
3 w i d er_
2 ne w _
The next merges are:
If we continue, the next merges are:
Merge          Current Vocabulary
(ne, w)        _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new
(l, o)         _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo
(lo, w)        _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low
(new, er_)     _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low, newer_
(low, _)       _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low, newer_, low_
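The token learner can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are made up, the corpus is the five-word example with `_` as the end-of-word symbol, and ties between equally frequent pairs are broken by taking the pair seen first (an assumption; other tie-breaking rules are possible).

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols by the joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(corpus, num_merges):
    """corpus maps space-separated symbol strings to counts; returns the merge list."""
    words = {tuple(word.split()): count for word, count in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # first-seen pair wins a tie
        merges.append(best)
        words = {merge_pair(symbols, best): count for symbols, count in words.items()}
    return merges

corpus = {
    "l o w _": 5,
    "l o w e s t _": 2,
    "n e w e r _": 6,
    "w i d e r _": 3,
    "n e w _": 2,
}
merges = learn_bpe(corpus, 8)
print(merges)
```

With this tie-breaking rule the eight merges come out in the order shown above, starting with (e, r) and ending with (low, _).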
Once we’ve learned our vocabulary, the token parser is used to tokenize a test sentence. The token parser just runs on the test data the merges we have learned.
Note that there can be ties; we could have instead chosen to merge r _ first, since that pair is as frequent as e r.
BPE token segmenter algorithm
On the test data, run each merge learned from the
training data:
◦ Greedily
◦ In the order we learned them
◦ (test frequencies don't play a role)
So: merge every e r to er, then merge er _ to er_, etc.
Result:
◦ Test set "n e w e r _" would be tokenized as a full word
◦ Test set "l o w e r _" would be two tokens: "low er_"
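The segmenter can be sketched as follows: apply each learned merge, in order, to the characters of a test word. The merge list is hard-coded from the worked example, the function name `segment` is made up, and the sketch assumes the input word does not itself contain `_`.

```python
# Merges in the order they were learned in the worked example.
MERGES = [
    ("e", "r"), ("er", "_"), ("n", "e"), ("ne", "w"),
    ("l", "o"), ("lo", "w"), ("new", "er_"), ("low", "_"),
]

def segment(word, merges=MERGES):
    """Split word into characters plus end-of-word '_', then replay the merges."""
    symbols = list(word) + ["_"]
    for a, b in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(segment("newer"))  # ['newer_']
print(segment("lower"))  # ['low', 'er_']
```

Test-set frequencies play no role here: the only inputs are the word and the ordered merge list.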
Properties of BPE tokens
Usually include frequent words
And frequent subwords
• Which are often morphemes like -est or -er
A morpheme is the smallest meaning-bearing unit of a
language
• unlikeliest has 3 morphemes un-, likely, and -est
Word Normalization and
other issues
Word Normalization
Putting words/tokens in a standard format
◦ U.S.A. or USA
◦ uhhuh or uh-huh
◦ Fed or fed
◦ am, is, be, are
Case folding
Applications like IR: reduce all letters to lower case
◦ Since users tend to use lower case
◦ Possible exception: upper case in mid-sentence?
◦ e.g., General Motors
◦ Fed vs. fed
◦ SAIL vs. sail
Porter Stemmer
Stemming a passage that ends "... and the written notes." produces the following stemmed output:
Thi wa not the map we found in Billi Bone s chest but an accur copi complet in all thing name and height and sound with the singl except of the red cross and the written note
Based on a series of rewrite rules run in series
◦ A cascade, in which the output of each pass is fed as input to the next pass
Some sample rules:
ATIONAL → ATE (e.g., relational → relate)
ING → ε if stem contains vowel (e.g., motoring → motor)
SSES → SS (e.g., grasses → grass)
Detailed rule lists for the Porter stemmer, as well as code (in Java, Python, etc.) can be found on Martin Porter’s homepage; see also the original paper (Porter, 1980).
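The three sample rules can be written as a tiny Python function. This is only a sketch of these three rules in a single pass, not the Porter stemmer: the real algorithm has many more rules, ordered steps, and further conditions on the stem. The function name and the simplified vowel test (a, e, i, o, u only) are assumptions made for illustration.

```python
import re

VOWEL = re.compile(r"[aeiou]")  # simplification: y is never treated as a vowel

def apply_sample_rules(word):
    """Apply the first matching rule among the three sample rewrite rules."""
    if word.endswith("ational"):          # ATIONAL -> ATE
        return word[:-len("ational")] + "ate"
    if word.endswith("sses"):             # SSES -> SS
        return word[:-2]
    if word.endswith("ing"):              # ING -> (empty) if the stem contains a vowel
        stem = word[:-3]
        if VOWEL.search(stem):
            return stem
    return word

print(apply_sample_rules("relational"))  # relate
print(apply_sample_rules("motoring"))    # motor
print(apply_sample_rules("grasses"))     # grass
print(apply_sample_rules("sing"))        # sing  (stem "s" has no vowel)
```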
Simple stemmers can be useful in cases where we need to collapse across differ-
Dealing with complex morphology is necessary
for many languages
◦ e.g., the Turkish word:
◦ Uygarlastiramadiklarimizdanmissinizcasina
◦ ‘(behaving) as if you are among those whom we could not civilize’
◦ Uygar ‘civilized’ + las ‘become’
+ tir ‘cause’ + ama ‘not able’
+ dik ‘past’ + lar ‘plural’
+ imiz ‘p1pl’ + dan ‘abl’
+ mis ‘past’ + siniz ‘2pl’ + casina ‘as if’
Sentence Segmentation
!, ? mostly unambiguous but period “.” is very ambiguous
◦ Sentence boundary
◦ Abbreviations like Inc. or Dr.
◦ Numbers like .02% or 4.3
Common algorithm: Tokenize first: use rules or ML to
classify a period as either (a) part of the word or (b) a
sentence-boundary.
◦ An abbreviation dictionary can help
Sentence segmentation can then often be done by rules
based on this tokenization.
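The rule-based approach can be sketched in Python: tokenize on whitespace, then decide for each token that ends in a period whether the period belongs to the word. The abbreviation set here is a tiny hypothetical dictionary, and the function name is made up. A real system would need a much larger list or a classifier, and this sketch cannot handle an abbreviation that also ends a sentence.

```python
ABBREVIATIONS = {"inc.", "dr.", "mr.", "mrs.", "e.g."}  # hypothetical, far from complete

def split_sentences(text):
    """Split text into sentences using whitespace tokens and simple rules."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(("!", "?")):
            boundary = True                       # mostly unambiguous
        elif token.endswith("."):
            boundary = token.lower() not in ABBREVIATIONS
        else:
            boundary = False                      # e.g. 4.3 or .02% do not end in "."
        if boundary:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

result = split_sentences("Dr. Smith joined Acme Inc. in May. Profits rose 4.3 percent! Why?")
print(result)
# ['Dr. Smith joined Acme Inc. in May.', 'Profits rose 4.3 percent!', 'Why?']
```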