2 TextProc 2023

Speech and Language Processing (3rd ed. draft) Dan Jurafsky and James H. Martin Chap 2a Regular Expressions, Text Normalization, Edit Distance 2: Text Processing

Uploaded by

khcheng

Regular Expressions

Basic Text
Processing
Regular expressions are used everywhere
◦ Part of every text processing task
◦ Not a general NLP solution (for that we use large NLP
systems we will see in later lectures)
◦ But very useful as part of those systems (e.g., for pre-
processing or text formatting)
◦ Necessary for data analysis of text data
◦ A widely used tool in industry and academics

Regular expressions
A formal language for specifying text strings
How can we search for mentions of these cute animals in text?

◦ woodchuck
◦ woodchucks
◦ Woodchuck
◦ Woodchucks
◦ Groundhog
◦ groundhogs
Regular Expressions: Disjunctions
Letters inside square brackets []

Pattern Matches
[wW]oodchuck Woodchuck, woodchuck
[1234567890] Any one digit
Ranges using the dash [A-Z]
Pattern Matches
[A-Z] An upper case letter Drenched Blossoms
[a-z] A lower case letter my beans were impatient
[0-9] A single digit Chapter 1: Down the Rabbit Hole
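A small sketch of bracket disjunctions and ranges using Python's `re` module. The sample sentence is made up for illustration.

```python
import re

text = "Woodchucks and a woodchuck met in Chapter 1."

# [wW] accepts either case for the first letter only.
woodchucks = re.findall(r"[wW]oodchuck", text)
# [A-Z] matches one upper-case letter at a time; [0-9] matches one digit.
capitals = re.findall(r"[A-Z]", text)
digits = re.findall(r"[0-9]", text)

print(woodchucks, capitals, digits)
```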
Regular Expressions: Negation in Disjunction
Caret as first character in [] negates the list
◦ Note: Caret means negation only when it's first in []
◦ Special characters (., *, +, ?) lose their special meaning inside []

Pattern Matches Examples

[^A-Z] Not an upper case letter Oyfn pripetchik
[^Ss] Neither ‘S’ nor ‘s’ “I have no exquisite reason”
[^.] Not a period Our resident Djinn
[e^] Either e or ^ Look up ^ now
Regular Expressions: Convenient aliases
Pattern Expansion Matches Examples
\d [0-9] Any digit Fahrenheit 451
\D [^0-9] Any non-digit Blue Moon
\w [a-zA-Z0-9_] Any alphanumeric or _ Daiyu
\W [^\w] Not alphanumeric or _ Look!
\s [ \r\t\n\f] Whitespace (space, tab) Look␣up
\S [^\s] Not whitespace Look up
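The aliases can be tried out directly in Python; a short sketch on a made-up string. Note that for Python 3 `str` patterns, `\d`, `\w` and `\s` also match non-ASCII digits, letters and whitespace unless the `re.ASCII` flag is given, so the expansions in the table describe the ASCII case.

```python
import re

text = "Fahrenheit 451, Look up!"

digits = re.findall(r"\d", text)    # each digit separately
words = re.findall(r"\w+", text)    # runs of alphanumerics or _
chunks = re.findall(r"\S+", text)   # runs of non-whitespace

print(digits, words, chunks)
```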
Regular Expressions: More Disjunction
Groundhog is another name for woodchuck!
The pipe symbol | for disjunction

Pattern Matches
groundhog|woodchuck woodchuck
yours|mine yours
a|b|c = [abc]
[gG]roundhog|[Ww]oodchuck Woodchuck
Wildcards, optionality, repetition: . ? * +
Pattern        Matches                       Examples
beg.n          Any char                      begin begun beg3n beg n
woodchucks?    Optional s                    woodchuck woodchucks
to*            0 or more of previous char    t to too tooo
to+            1 or more of previous char    to too tooo toooo
Kleene *, Kleene + (named after Stephen C Kleene)
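A quick way to check these operators is to test whole candidate strings with `re.fullmatch`. The helper name `matches` and the candidate lists are illustrative only.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

optional_s = matches(r"woodchucks?", ["woodchuck", "woodchucks", "woodchuckss"])
star = matches(r"to*", ["t", "to", "too", "tooo"])
plus = matches(r"to+", ["t", "to", "too"])
wildcard = matches(r"beg.n", ["begin", "begun", "beg3n", "beg n", "begn"])

print(optional_s, star, plus, wildcard)
```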
Regular Expressions: Anchors ^ $

Pattern Matches
^[A-Z] Palo Alto
^[^A-Za-z] 1 “Hello”
\.$ The end.
.$ The end? The end!
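The anchor patterns above, sketched with `re.search` on short made-up strings:

```python
import re

# ^ anchors at the start of the string, $ at the end.
starts_with_capital = re.search(r"^[A-Z]", "Palo Alto") is not None
starts_with_non_letter = re.search(r"^[^A-Za-z]", "1 apple") is not None
ends_with_period = re.search(r"\.$", "The end.") is not None
question_mark_rejected = re.search(r"\.$", "The end?") is None
# An unescaped . before $ accepts any final character.
last_char = re.search(r".$", "The end?").group()

print(starts_with_capital, starts_with_non_letter,
      ends_with_period, question_mark_rejected, last_char)
```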
A note about Python regular expressions
◦ Regex and Python both use backslash "\" for
special characters. You must type extra backslashes!
◦ "\\d+" to search for 1 or more digits
◦ "\n" in Python means the "newline" character, not a
"slash" followed by an "n". Need "\\n" for two characters.
◦ Instead: use Python's raw string notation for regex:
◦ r"[tT]he"
◦ r"\d+" matches one or more digits
◦ instead of "\\d+"
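A short check of the difference between ordinary and raw string literals (the sample text is invented):

```python
import re

# The raw string r"\d+" and the escaped string "\\d+" are the same two-part pattern.
same_pattern = (r"\d+" == "\\d+")

# "\n" is one newline character; r"\n" is a backslash followed by the letter n.
newline_length = len("\n")
raw_length = len(r"\n")

numbers = re.findall(r"\d+", "Chapter 12, page 345")

print(same_pattern, newline_length, raw_length, numbers)
```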

The iterative process of writing regexes
Find me all instances of the word “the” in a text.

the
Misses capitalized examples

[tT]he
Incorrectly returns other or Theology

\W[tT]he\W
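The three attempts can be compared on a small test sentence (invented here). The last pattern, using the word-boundary anchor `\b`, is one possible further refinement and is not taken from the slide; it is included because `\W[tT]he\W` still misses "The" at the very start of the text, where there is no preceding character to match `\W`.

```python
import re

text = "The other day the theology student said: the end"

attempt1 = re.findall(r"the", text)           # misses "The", hits "other", "theology"
attempt2 = re.findall(r"[tT]he", text)        # finds "The" but keeps the false hits
attempt3 = re.findall(r"\W[tT]he\W", text)    # needs a non-word char on both sides
refined = re.findall(r"\b[tT]he\b", text)     # assumption: \b as a refinement

print(len(attempt1), len(attempt2), attempt3, refined)
```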
False positives and false negatives
The process we just went through was based on
fixing two kinds of errors:
1. Not matching things that we should have matched
(The)
False negatives

2. Matching strings that we should not have matched

(there, then, other)
False positives
Characterizing work on NLP
In NLP we are always dealing with these kinds of errors.
Reducing the error rate for an application often
involves two antagonistic efforts:
◦ Increasing coverage (or recall) (minimizing false negatives).
◦ Increasing accuracy (or precision) (minimizing false positives)
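The trade-off can be made concrete by scoring two patterns against a hand-labelled answer. This is only a sketch: the sentence, the gold start offsets, and the helper name `precision_recall` are assumptions made for the example.

```python
import re

def precision_recall(predicted, gold):
    """Precision and recall of predicted match offsets against gold offsets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

text = "The other day the theology student said: the end"
gold_starts = {0, 14, 41}  # hand-labelled start offsets of the word "the"

def starts(pattern):
    return [m.start() for m in re.finditer(pattern, text)]

lowercase_only = precision_recall(starts(r"the"), gold_starts)
either_case = precision_recall(starts(r"[tT]he"), gold_starts)

print(lowercase_only, either_case)
```

Allowing the capital T removes the false negative (recall rises to 1.0), while the false positives inside "other" and "theology" remain and keep precision below 1.0.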
Regular expressions play a surprisingly large role

Widely used in both academics and industry

1. Part of most text processing tasks, even for big
neural language model pipelines
◦ including text formatting and pre-processing
2. Very useful for data analysis of any text data

More Regular Expressions:
Substitutions and ELIZA
Basic Text
Processing
Substitutions
Substitution in Python and UNIX commands:

s/regexp1/pattern/
e.g.:
s/colour/color/
Capture Groups
• Say we want to put angles around all numbers:
the 35 boxes → the <35> boxes
• Use parens () to "capture" a pattern into a
numbered register (1, 2, 3…)
• Use \1 to refer to the contents of the register
s/([0-9]+)/<\1>/
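In Python the `s/.../.../` substitutions correspond to `re.sub`. A minimal sketch with invented input strings:

```python
import re

# s/colour/color/
american = re.sub(r"colour", "color", "my favourite colour")

# s/([0-9]+)/<\1>/  : \1 in the replacement refers back to the first group.
bracketed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")

print(american)
print(bracketed)
```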
Capture groups: multiple registers
/the (.*)er they (.*), the \1er we \2/
Matches
the faster they ran, the faster we ran
But not
the faster they ran, the faster we ate
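The same pattern in Python, where `\1` and `\2` inside the pattern must repeat exactly what the first and second groups captured:

```python
import re

pattern = re.compile(r"the (.*)er they (.*), the \1er we \2")

good = pattern.search("the faster they ran, the faster we ran")
bad = pattern.search("the faster they ran, the faster we ate")

print(good.groups())   # contents of registers 1 and 2
print(bad)             # no match: register 2 holds "ran", not "ate"
```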
But suppose we don't want to capture?
Parentheses have a double function: grouping terms, and
capturing
Non-capturing groups: add a ?: after paren:
/(?:some|a few) (people|cats) like some \1/
matches
◦ some cats like some cats
but not
◦ some cats like some some
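A short Python check of the non-capturing group. Because `(?:some|a few)` does not fill a register, `\1` refers to `(people|cats)`:

```python
import re

pattern = re.compile(r"(?:some|a few) (people|cats) like some \1")

match = pattern.fullmatch("some cats like some cats")
no_match = pattern.fullmatch("some cats like some some")

print(match.group(1))
print(no_match)
```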
Lookahead assertions
(?= pattern) is true if pattern matches, but is
zero-width; doesn't advance character pointer
(?! pattern) true if a pattern does not match
How to match, at the beginning of a line, any single
word that doesn’t start with “Volcano”:
/^(?!Volcano)[A-Za-z]+/
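Both kinds of lookahead in Python, on invented strings. The positive-lookahead example (words followed by a comma) is an extra illustration, not from the slide.

```python
import re

# Negative lookahead: a word at the start of the line that does not begin with "Volcano".
not_volcano = re.compile(r"^(?!Volcano)[A-Za-z]+")
rejected = not_volcano.match("Volcanoes erupt")
accepted = not_volcano.match("Mountains erode").group()

# Positive lookahead: the comma must follow, but is not part of the match (zero-width).
before_commas = re.findall(r"\w+(?=,)", "red, green, blue")

print(rejected, accepted, before_commas)
```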
Simple Application: ELIZA
Early NLP system that imitated a Rogerian
psychotherapist
◦ Joseph Weizenbaum, 1966.

Uses pattern matching to match, e.g.,:

◦ “I need X”
and translates them into, e.g.
◦ “What would it mean to you if you got X?”
Simple Application: ELIZA
Men are all alike.
IN WHAT WAY
They're always bugging us about something or other.
CAN YOU THINK OF A SPECIFIC EXAMPLE
Well, my boyfriend made me come here.
YOUR BOYFRIEND MADE YOU COME HERE
He says I'm depressed much of the time.
I AM SORRY TO HEAR YOU ARE DEPRESSED
How ELIZA works
s/.* I’M (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* I AM (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY?/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE?/
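A rough Python sketch of this rule cascade, not Weizenbaum's original program. Assumptions made here: the input is upper-cased, curly apostrophes are normalised, trailing punctuation is stripped, the text is padded with spaces so that the ` word ` patterns also work at the edges, the first matching rule wins, and the fallback reply is hypothetical.

```python
import re

RULES = [
    (r".* I'M (DEPRESSED|SAD) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* I AM (DEPRESSED|SAD) .*", r"WHY DO YOU THINK YOU ARE \1"),
    (r".* ALL .*", "IN WHAT WAY?"),
    (r".* ALWAYS .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE?"),
]

def respond(utterance):
    """Return the reply of the first rule whose pattern matches the utterance."""
    cleaned = utterance.upper().replace("’", "'").rstrip(".!?")
    text = " " + cleaned + " "
    for pattern, reply in RULES:
        match = re.fullmatch(pattern, text)
        if match:
            return match.expand(reply)  # fills in \1 from the capture group
    return "PLEASE GO ON"  # hypothetical fallback, not on the slide

print(respond("Men are all alike."))
print(respond("He says I'm depressed much of the time."))
```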
Words and Corpora
Basic Text
Processing
How many words in a sentence?
"I do uh main- mainly business data processing"
◦ Fragments, filled pauses
"Seuss’s cat in the hat is different from other cats!"
◦ Lemma: same stem, part of speech, rough word sense
◦ cat and cats = same lemma
◦ Wordform: the full inflected surface form
◦ cat and cats = different wordforms
How many words in a sentence?

they lay back on the San Francisco grass and looked at the stars
and their

Type: an element of the vocabulary.

Token: an instance of that type in running text.
How many?
◦ 15 tokens (or 14)
◦ 13 types (or 12) (or 11?)
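The first pair of counts can be reproduced with plain whitespace splitting; a minimal sketch. The alternative counts (14 tokens, 12 or 11 types) depend on decisions this sketch does not make, such as treating "San Francisco" as one token or "they" and "their" as one lemma.

```python
sentence = ("they lay back on the San Francisco grass "
            "and looked at the stars and their")

tokens = sentence.split()   # every running word is a token
types = set(tokens)         # distinct wordforms; "the" and "and" each occur twice

print(len(tokens), len(types))
```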
How many words in a corpus?
N = number of tokens
V = vocabulary = set of types, |V| is size of vocabulary
Heaps' Law = Herdan's Law: |V| = kN^β, where often .67 < β < .75
i.e., vocabulary size grows with > square root of the number of word tokens
This relationship between the number of types |V| and the number of tokens N is called Herdan's Law (Herdan, 1960) or Heaps' Law (Heaps, 1978) after its discoverers (in linguistics and information retrieval respectively); k and β are positive constants, and 0 < β < 1.

Corpus                             Tokens = N     Types = |V|
Switchboard phone conversations    2.4 million    20 thousand
Shakespeare                        884,000        31 thousand
COCA                               440 million    2 million
Google N-grams                     1 trillion     13+ million
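A small numeric illustration of the power-law relationship between vocabulary size and corpus size. The constants `K` and `BETA` are hypothetical values chosen for the example, not fitted to any of the corpora in the table.

```python
def heaps_vocab_size(n_tokens, k, beta):
    """Predicted number of types |V| = k * N**beta."""
    return k * n_tokens ** beta

K, BETA = 10.0, 0.7  # hypothetical constants, with 0 < beta < 1

small = heaps_vocab_size(1_000_000, K, BETA)
large = heaps_vocab_size(4_000_000, K, BETA)

# Quadrupling N multiplies |V| by 4**0.7 (about 2.64): more than the
# factor of 2 that square-root growth would give, but far less than 4.
growth = large / small
print(round(growth, 2))
```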
Corpora
Words don't appear out of nowhere!
A text is produced by
• a specific writer(s),
• at a specific time,
• in a specific variety,
• of a specific language,
• for a specific function.
Corpora vary along dimensions like
◦ Language: 7097 languages in the world
◦ Variety, like African American Language varieties.
◦ AAE Twitter posts might include forms like "iont" (I don't)
◦ Code switching, e.g., Spanish/English, Hindi/English:
S/E: Por primera vez veo a @username actually being hateful! It was beautiful:)
[For the first time I get to see @username actually being hateful! it was beautiful:) ]
H/E: dost tha or ra- hega ... dont wory ... but dherya rakhe
[“he was and will remain a friend ... don’t worry ... but have faith”]
◦ Genre: newswire, fiction, scientific articles, Wikipedia
◦ Author Demographics: writer's age, gender, ethnicity, SES
Corpus datasheets
Gebru et al (2020), Bender and Friedman (2018)

Motivation:
• Why was the corpus collected?
• By whom?
• Who funded it?
Situation: In what situation was the text written?
Collection process: If it is a subsample how was it sampled? Was
there consent? Pre-processing?
+Annotation process, language variety, demographics, etc.
Word tokenization
Basic Text
Processing
Text Normalization

Every NLP task requires text normalization:

1. Tokenizing (segmenting) words
2. Normalizing word formats
3. Segmenting sentences
Space-based tokenization
A very simple way to tokenize
◦ For languages that use space characters between words
◦ Arabic, Cyrillic, Greek, Latin, etc., based writing systems
◦ Segment off a token between instances of spaces
Unix tools for space-based tokenization
◦ The "tr" command
◦ Inspired by Ken Church's UNIX for Poets
◦ Given a text file, output the word tokens and their frequencies
Simple Tokenization in UNIX
(Inspired by Ken Church’s UNIX for Poets.)
Given a text file, output the word tokens and their frequencies
tr -sc 'A-Za-z' '\n' < shakes.txt     Change all non-alpha to newlines
  | sort                                Sort in alphabetical order
  | uniq -c                             Merge and count each type

1945 A              25 Aaron
  72 AARON           6 Abate
  19 ABBESS          1 Abates
   5 ABBOT           5 Abbess
 ...                 6 Abbey
                     3 Abbot
                   ...
The first step: tokenizing
tr -sc 'A-Za-z' '\n' < shakes.txt | head

THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...
The second step: sorting
tr -sc 'A-Za-z' '\n' < shakes.txt | sort | head

A
A
A
A
A
A
A
A
A
...
More counting
Merging upper and lower case
tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c

Sorting the counts

tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r

23243 the
22225 i
18618 and
16339 to
15687 of
12780 a
12163 you
10839 my
10005 in
8954 d
What happened here?
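For comparison, a rough Python equivalent of the lower-case, split-on-non-letters, count, and sort pipeline. The function name `word_frequencies` and the sample text are made up. Because every non-letter is treated as a separator, a contraction such as "I'd" is split at the apostrophe, which is one way a stray token like "d" ends up high in the counts.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lower-case the text, keep runs of a-z as tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens).most_common()  # sorted by descending count

sample = "I'd go. The cat and the hat; I'd stay."
frequencies = word_frequencies(sample)

for word, count in frequencies:
    print(count, word)
```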
Issues in Tokenization
Can't just blindly remove punctuation:
◦ m.p.h., Ph.D., AT&T, cap’n
◦ prices ($45.55)
◦ dates (01/02/06)
◦ URLs (http://www.stanford.edu)
◦ hashtags (#nlproc)
◦ email addresses ([email protected])
Clitic: a word that doesn't stand on its own
◦ "are" in we're, French "je" in j'ai, "le" in l'honneur
When should multiword expressions (MWE) be words?
◦ New York, rock ’n’ roll
Tokenization in NLTK
Bird, Loper and Klein (2009), Natural Language Processing with Python. O’Reilly
For example, Fig. 2.12 shows an example of a basic regular expression that can be used to tokenize with the nltk.regexp_tokenize function of the Python-based Natural Language Toolkit (NLTK) (Bird et al. 2009; http://www.nltk.org).

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)         # set flag to allow verbose regexps
...     (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
...   | \w+(?:-\w+)*           # words with optional internal hyphens
...   | \$?\d+(?:\.\d+)?%?     # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.                 # ellipsis
...   | [][.,;"'?():-_`]       # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
Figure 2.12 A Python trace of regular expression tokenization in the NLTK Python-based
natural language processing toolkit (Bird et al., 2009), commented for readability; the (?x)
verbose flag tells Python to strip comments and whitespace. Figure from Chapter 3 of Bird
Tokenization in languages without spaces
Many languages (like Chinese, Japanese, Thai) don't
use spaces to separate words!

How do we decide where the token boundaries should be?
Word tokenization in Chinese
Chinese words are composed of characters called
"hanzi" (or sometimes just "zi")
Each one represents a meaning unit called a morpheme.
Each word has on average 2.4 of them.
But deciding what counts as a word is complex and not
agreed upon.
How to do word tokenization in Chinese?

姚明进入总决赛 “Yao Ming reaches the finals”

3 words?
姚明 进入 总决赛
YaoMing reaches finals
5 words?
姚 明 进入 总 决赛
Yao Ming reaches overall finals
7 characters? (don't use words at all):
姚 明 进 入 总 决 赛
Yao Ming enter enter overall decision game
Word tokenization / segmentation
So in Chinese it's common to just treat each character
(zi) as a token.
• So the segmentation step is very simple
In other languages (like Thai and Japanese), more
complex word segmentation is required.
• The standard algorithms are neural sequence models
trained by supervised machine learning.
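Treating each character as a token needs no trained model; in Python it is a one-liner, sketched here on the example sentence:

```python
sentence = "姚明进入总决赛"

# Each hanzi becomes its own token.
char_tokens = list(sentence)

print(char_tokens, len(char_tokens))
```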
Byte Pair Encoding
Basic Text
Processing
Another option for text tokenization
Instead of
• white-space segmentation
• single-character segmentation
Use the data to tell us how to tokenize.
Subword tokenization (because tokens can be parts
of words as well as whole words)
Subword tokenization

Three common algorithms:

◦ Byte-Pair Encoding (BPE) (Sennrich et al., 2016)
◦ Unigram language modeling tokenization (Kudo, 2018)
◦ WordPiece (Schuster and Nakajima, 2012)
All have 2 parts:
◦ A token learner that takes a raw training corpus and induces
a vocabulary (a set of tokens).
◦ A token segmenter that takes a raw test sentence and
tokenizes it according to that vocabulary
Byte Pair Encoding (BPE) token learner
Let vocabulary be the set of all individual characters
= {A, B, C, D,…, a, b, c, d….}
Repeat:
◦ Choose the two symbols that are most frequently
adjacent in the training corpus (say 'A', 'B')
◦ Add a new merged symbol 'AB' to the vocabulary
◦ Replace every adjacent 'A' 'B' in the corpus with 'AB'.
Until k merges have been done.
BPE token learner algorithm
function BYTE-PAIR ENCODING(strings C, number of merges k) returns vocab V

V ← all unique characters in C                          # initial set of tokens is characters

for i = 1 to k do                                        # merge tokens k times
    tL, tR ← Most frequent pair of adjacent tokens in C
    tNEW ← tL + tR                                       # make new token by concatenating
    V ← V + tNEW                                         # update the vocabulary
    Replace each occurrence of tL, tR in C with tNEW     # and update the corpus
return V

Figure 2.13 The token learner part of the BPE algorithm for taking a corpus broken up
into individual characters or bytes, and learning a vocabulary by iteratively merging tokens.
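A minimal Python sketch of the token learner, under some assumptions that are not part of the figure: the corpus is given as a dictionary from word to count, `_` is appended to each word as the end-of-word symbol, and ties between equally frequent pairs go to the pair that was counted first. The names `learn_bpe` and `merge_pair` are illustrative.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols by the merged token."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(word_counts, k):
    """Return (vocabulary, merges) after at most k merges."""
    corpus = {tuple(word) + ("_",): n for word, n in word_counts.items()}
    vocab = sorted({symbol for word in corpus for symbol in word})
    merges = []
    for _ in range(k):
        pairs = Counter()
        for symbols, n in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += n
        if not pairs:
            break  # every word is already a single token
        best = max(pairs, key=pairs.get)  # first pair with the highest count
        merges.append(best)
        vocab.append(best[0] + best[1])
        corpus = {merge_pair(symbols, best): n for symbols, n in corpus.items()}
    return vocab, merges

counts = {"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}
vocab, merges = learn_bpe(counts, 8)
print(merges)
```

Counting over word types weighted by frequency, rather than over the raw token stream, gives the same pair counts and keeps merges from crossing word boundaries.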
Byte Pair Encoding (BPE) Addendum
Most subword algorithms are run inside space-
separated tokens.
So we commonly first add a special end-of-word
symbol '__' before space in training corpus
Next, separate into letters.
BPE token learner

Original (very fascinating🙄) corpus:

low low low low low lowest lowest newer newer newer newer newer newer wider wider wider new new

The algorithm is usually run inside words (not merging across word boundaries), so the input corpus is first white-space-separated to give a set of strings, each corresponding to the characters of a word, plus a special end-of-word symbol _, and its counts. Let’s see its operation on the following tiny input corpus of 18 word tokens with counts for each word (the word low appears 5 times, the word newer 6 times, and so on), which would have a starting vocabulary of 11 letters.

Add end-of-word tokens, resulting in this vocabulary:

corpus               vocabulary
5  l o w _           _, d, e, i, l, n, o, r, s, t, w
2  l o w e s t _
6  n e w e r _
3  w i d e r _
2  n e w _

The BPE algorithm first counts all pairs of adjacent symbols: the most frequent is the pair e r because it occurs in newer (frequency of 6) and wider (frequency of 3) for a total of 9 occurrences.[1] We then merge these symbols, treating er as one symbol, and count again:

Merge e r to er

corpus               vocabulary
5  l o w _           _, d, e, i, l, n, o, r, s, t, w, er
2  l o w e s t _
6  n e w er _
3  w i d er _
2  n e w _

Now the most frequent pair is er _, which we merge; our system has learned that there should be a token for word-final er, represented as er_:

Merge er _ to er_

corpus               vocabulary
5  l o w _           _, d, e, i, l, n, o, r, s, t, w, er, er_
2  l o w e s t _
6  n e w er_
3  w i d er_
2  n e w _

Next n e (total count of 8) get merged to ne:

Merge n e to ne

corpus               vocabulary
5  l o w _           _, d, e, i, l, n, o, r, s, t, w, er, er_, ne
2  l o w e s t _
6  ne w er_
3  w i d er_
2  ne w _

If we continue, the next merges are:

Merge         Current Vocabulary
(ne, w)       _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new
(l, o)        _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo
(lo, w)       _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low
(new, er_)    _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low, newer_
(low, _)      _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low, newer_, low_

Once we’ve learned our vocabulary, the token parser is used to tokenize a test sentence. The token parser just runs on the test data the merges we have learned.

[1] Note that there can be ties; we could have instead chosen to merge r _ first, since that also has a frequency of 9.
BPE token segmenter algorithm
On the test data, run each merge learned from the
training data:
◦ Greedily
◦ In the order we learned them
◦ (test frequencies don't play a role)
So: merge every e r to er, then merge er _ to er_, etc.
Result:
◦ Test set "n e w e r _" would be tokenized as a full word
◦ Test set "l o w e r _" would be two tokens: "low er_"
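A sketch of the segmenter in Python. The merge list is the one learned from the tiny training corpus (e r, er _, n e, and so on); the function name `segment` and the use of `_` as the end-of-word symbol are assumptions of the sketch.

```python
MERGES = [("e", "r"), ("er", "_"), ("n", "e"), ("ne", "w"),
          ("l", "o"), ("lo", "w"), ("new", "er_"), ("low", "_")]

def segment(word, merges):
    """Apply each learned merge, in training order, to one test word."""
    symbols = list(word) + ["_"]
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(segment("newer", MERGES))   # seen in training: one full-word token
print(segment("lower", MERGES))   # unseen word: two subword tokens
```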
Properties of BPE tokens
Usually include frequent words
And frequent subwords
• Which are often morphemes like -est or -er
A morpheme is the smallest meaning-bearing unit of a
language
• unlikeliest has 3 morphemes un-, likely, and -est
Word Normalization and
other issues
Basic Text
Processing
Word Normalization
Putting words/tokens in a standard format
◦ U.S.A. or USA
◦ uhhuh or uh-huh
◦ Fed or fed
◦ am, is, be, are
Case folding
Applications like IR: reduce all letters to lower case
◦ Since users tend to use lower case
◦ Possible exception: upper case in mid-sentence?
◦ e.g., General Motors
◦ Fed vs. fed
◦ SAIL vs. sail

For sentiment analysis, MT, Information extraction

◦ Case is helpful (US versus us is important)
Lemmatization

Represent all words as their lemma, their shared root

= dictionary headword form:
◦ am, are, is → be
◦ car, cars, car's, cars' → car
◦ Spanish quiero (‘I want’), quieres (‘you want’)
→ querer ‘want’
◦ He is reading detective stories
→ He be read detective story
Lemmatization is done by Morphological Parsing
Morphemes:
◦ The small meaningful units that make up words
◦ Stems: The core meaning-bearing units
◦ Affixes: Parts that adhere to stems, often with grammatical
functions
Morphological Parsers:
◦ Parse cats into two morphemes cat and s
◦ Parse Spanish amaren (‘if in the future they would love’) into
morpheme amar ‘to love’, and the morphological features
3PL and future subjunctive.
Stemming
Reduce terms to stems, chopping off affixes crudely
Original text:
This was not the map we found in Billy Bones’s chest, but an accurate copy, complete in all things-names and heights and soundings-with the single exception of the red crosses and the written notes.

which produces the following stemmed output:
Thi wa not the map we found in Billi Bone s chest but an accur copi complet in all thing name and height and sound with the singl except of the red cross and the written note .

Porter Stemmer
Based on a series of rewrite rules run in series
◦ A cascade, in which output of each pass fed to next pass
Some sample rules:
ATIONAL → ATE (e.g., relational → relate)
ING → ε if stem contains vowel (e.g., motoring → motor)
SSES → SS (e.g., grasses → grass)
Detailed rule lists for the Porter stemmer, as well as code (in Java, Python, etc.) can be found on Martin Porter’s homepage; see also the original paper (Porter, 1980).
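The three sample rules can be sketched as simple suffix rewrites in Python. This is not the full Porter stemmer: the real algorithm has many more rules and extra conditions on the stem, and the function name `apply_sample_rules` is made up for the example.

```python
import re

def apply_sample_rules(word):
    """Apply the first of the three sample rewrite rules that fits the word."""
    if word.endswith("ational"):
        return word[:-len("ational")] + "ate"      # ATIONAL -> ATE
    if word.endswith("sses"):
        return word[:-2]                           # SSES -> SS
    if word.endswith("ing") and re.search(r"[aeiou]", word[:-3]):
        return word[:-3]                           # ING -> (empty) if stem has a vowel
    return word

for example in ["relational", "grasses", "motoring", "sing"]:
    print(example, "->", apply_sample_rules(example))
```

The vowel condition is what keeps "sing" intact: stripping -ing would leave the stem "s", which contains no vowel.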
Dealing with complex morphology is necessary
for many languages
◦ e.g., the Turkish word:
◦ Uygarlastiramadiklarimizdanmissinizcasina
◦ ‘(behaving) as if you are among those whom we could not civilize’
◦ Uygar ‘civilized’ + las ‘become’
+ tir ‘cause’ + ama ‘not able’
+ dik ‘past’ + lar ‘plural’
+ imiz ‘p1pl’ + dan ‘abl’
+ mis ‘past’ + siniz ‘2pl’ + casina ‘as if’
Sentence Segmentation
!, ? mostly unambiguous but period “.” is very ambiguous
◦ Sentence boundary
◦ Abbreviations like Inc. or Dr.
◦ Numbers like .02% or 4.3
Common algorithm: Tokenize first: use rules or ML to
classify a period as either (a) part of the word or (b) a
sentence-boundary.
◦ An abbreviation dictionary can help
Sentence segmentation can then often be done by rules
based on this tokenization.
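A toy rule-based version of this idea in Python: tokenize on whitespace, then treat a token-final period, question mark or exclamation mark as a sentence boundary unless the token is in an abbreviation dictionary. The dictionary contents and the function name `split_sentences` are hypothetical; a real system would need a far larger list or a trained classifier.

```python
ABBREVIATIONS = {"dr.", "inc.", "mr.", "mrs."}  # hypothetical tiny dictionary

def split_sentences(text):
    """Split text into sentences using whitespace tokens and simple rules."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_with_stop = token.endswith((".", "!", "?"))
        if ends_with_stop and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without final punctuation
        sentences.append(" ".join(current))
    return sentences

example = "Dr. Smith paid 4.3 dollars to Acme Inc. yesterday. Was it worth it? Yes!"
for sentence in split_sentences(example):
    print(sentence)
```

Numbers like 4.3 cause no trouble here because the period is inside the token rather than at its end.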
