NLP_Week_02

The document provides an overview of text normalization and tokenization in natural language processing (NLP), detailing processes such as word segmentation, morphology, and the importance of language models. It discusses techniques like stemming, lemmatization, and Byte-Pair Encoding for effective tokenization, along with challenges in tokenization across different languages. Additionally, it covers fundamental concepts of probability relevant to language models, including joint and conditional probabilities, and introduces Bayes' theorem.


CSCS 366 – Intro. to NLP
Faizad Ullah

Text Normalization

Tokenization
• Before almost any natural language processing of a text, the text has to be normalized, a task called text normalization.
1. Tokenizing (segmenting) words
2. Normalizing word formats
3. Segmenting sentences
Tokenization
• Tokenization: to extract linguistic units of interest from running text
• Linguistic unit?
• Character, word, sentence, paragraph, …
• The most common is the word
• The 1989 edition of the Oxford English Dictionary had 615,000 entries.
Words
• Type: an element of the vocabulary; the types are the distinct words in a corpus.
• If the set of words in the vocabulary is V, the number of types is the vocabulary size |V|.
• Token/Instance: an occurrence of a type in running text.
• The word instances are the total number N of running words.


Words
• How many words/tokens and types are in the following sentence?
• He stepped out into the hall, was delighted to encounter a water brother.
• This sentence has 13 words if we don’t count punctuation marks as words, 15 if we count punctuation.
• Whether we treat the period (“.”), comma (“,”), and so on as words depends on the task.
Words
• I do uh main- mainly business data processing.
• This utterance has two kinds of disfluencies.
• The broken-off word main- is called a fragment.
• Words like uh and um are called fillers or filled pauses.
• Do we consider these to be words?
• It depends on the application.
Morphology
• A morpheme is the smallest meaning-bearing unit of a language; for example, the word unwashable has the morphemes un-, wash, and -able.
• Some languages, like Japanese, don’t have spaces between words, so word tokenization becomes more difficult.
Word Normalization
• Word normalization is the task of putting words or tokens in a standard format.
• The world has 7,097 languages at the time of this writing, according to the online Ethnologue catalog (Simons and Fennig, 2018).
• It is important to test algorithms on more than one language, and particularly on languages with different properties; by contrast there is an unfortunate current tendency for NLP algorithms to be developed or tested just on English (Bender, 2019).
• Code switching: using more than one language within a single communicative act.
How many words?

N = number of tokens/instances
V = vocabulary = the set of types
|V| = the size of the vocabulary
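A minimal sketch of counting tokens and types (assuming Python, a simple regex tokenizer, lowercasing before counting types, and a made-up toy sentence):

```python
import re

sentence = "The cat sat on the mat and the dog sat on the rug."

tokens = re.findall(r"\w+", sentence.lower())  # word instances, punctuation ignored
types = set(tokens)                            # the vocabulary V

print(len(tokens))  # N = 13
print(len(types))   # |V| = 8  ("the", "sat", and "on" each occur more than once)
```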
Issues in Tokenization

• Finland’s capital → Finland Finlands Finland’s ?
• what’re, I’m, isn’t → what are, I am, is not
• Hewlett-Packard → Hewlett Packard ?
• state-of-the-art → state of the art ?
• lowercase → lower-case lowercase lower case ?
• San Francisco → one token or two?
• m.p.h., Ph.D. → ??
Word Tokenization in Chinese

• Also called Word Segmentation

• Chinese words are composed of characters

• Characters are generally 1 syllable and 1 morpheme.

• Standard baseline segmentation algorithm:

• Maximum Matching (also called Greedy)


Maximum Matching Word Segmentation
• Given a wordlist of Chinese, and a string:
1. Start a pointer at the beginning of the string
2. Find the longest word in the dictionary that matches the string starting at the pointer
3. Move the pointer over that word in the string
4. Go to step 2
Maximum Matching Word Segmentation
• Thecatinthehat → the cat in the hat
• Thetabledownthere → the table down there (the intended segmentation)
  → theta bled own there (what greedy maximum matching actually produces)
• Doesn’t generally work in English!
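A toy Python sketch of the greedy procedure above, assuming a small hand-built wordlist (real Chinese segmenters use a large dictionary and language-specific handling); it reproduces both example segmentations:

```python
def max_match(text, wordlist):
    """Greedy left-to-right maximum matching segmentation."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest possible match first; fall back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in wordlist or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

wordlist = {"the", "table", "theta", "bled", "own", "down", "there", "cat", "in", "hat"}
print(max_match("thecatinthehat", wordlist))     # ['the', 'cat', 'in', 'the', 'hat']
print(max_match("thetabledownthere", wordlist))  # ['theta', 'bled', 'own', 'there']
```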


Example
• He sat on the chair, but he likes sitting on the floor.
• N = ?
• V = ?
• Normalization: lowercasing, stemming, lemmatization, stopword removal, punctuation removal, vectorization
Stemming and Lemmatization
• Stemming: reduces words to their base or root form by chopping off affixes (e.g., "running" → "run").
• Lemmatization: converts words to their dictionary form (e.g., "better" → "good" or "running" → "run") using context and linguistic rules.
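A small illustration using NLTK's Porter stemmer and WordNet lemmatizer (assuming NLTK is installed and the WordNet data has been downloaded; the outputs in the comments are the usual ones, but details can vary across NLTK versions):

```python
# pip install nltk ; then: import nltk; nltk.download("wordnet")
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                   # 'run'   (affix chopped off)
print(stemmer.stem("studies"))                   # 'studi' (a stem need not be a real word)
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'   (verb lemma)
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'  (adjective lemma)
```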
Stemming and Lemmatization
• He sat on the chair but he likes sitting on the floor
• he sit <SW> <SW> chair but he like sit <SW> <SW> floor
• N = ?
• V = ?
• <DATE>, <UNK>: special placeholder tokens (for dates and unknown words, respectively)
Byte-Pair Encoding: A Bottom-up Tokenization Algorithm

Byte-Pair Encoding
• BPE is most commonly used by large language models for word tokenization.
• Instead of defining tokens as words (whether delimited by spaces or more complex algorithms), or as characters (as in Chinese), we can use our data to automatically tell us what the tokens should be.
• NLP algorithms often learn some facts about language from one corpus (a training corpus) and then use these facts to make decisions about a separate test corpus and its language.
Byte-Pair Encoding
• Thus, if our training corpus contains, say, the words low, new, and newer, but not lower, then if the word lower appears in our test corpus, our system will not know what to do with it.
• To deal with this unknown word problem, modern tokenizers automatically induce sets of tokens that include tokens smaller than words, called subwords.
Byte-Pair Encoding Algorithm
• The BPE algorithm starts with a vocabulary containing only individual
characters.
• It scans the training corpus to find the two symbols that are most
frequently adjacent (e.g., ‘A’ and ‘B’).
• A new merged symbol (e.g., ‘AB’) is added to the vocabulary, and every
occurrence of adjacent ‘A’ and ‘B’ is replaced with ‘AB’ in the corpus.
• This process of counting and merging continues, forming longer character
strings until k merges have been completed, resulting in k novel tokens.
• k is a parameter of the algorithm, determining the number of new tokens.
• The final vocabulary consists of the original set of characters plus the k
new symbols.
Byte-Pair Encoding Algorithm
• The algorithm is usually run inside words (not merging across word boundaries).
• The input corpus is first white-space-separated to give a set of strings, each corresponding to the characters of a word plus a special end-of-word symbol (e.g., _), and its counts.
Byte-Pair Encoding Algorithm
Corpus of 18 word tokens with counts for each word (the word low appears 5 times, the word newer 6 times, and so on), which would have a starting vocabulary of 11 letters.
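A minimal Python sketch of the BPE token learner described above, run on the example corpus (the full breakdown low ×5, lowest ×2, newer ×6, wider ×3, new ×2 is assumed from the stated counts); `_` marks the end of a word:

```python
from collections import Counter

# Assumed example corpus: each word is a tuple of characters plus the end-of-word marker "_".
corpus = {("l", "o", "w", "_"): 5, ("l", "o", "w", "e", "s", "t", "_"): 2,
          ("n", "e", "w", "e", "r", "_"): 6, ("w", "i", "d", "e", "r", "_"): 3,
          ("n", "e", "w", "_"): 2}

def get_pair_counts(corpus):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

k = 8  # number of merges (the algorithm's parameter)
merges = []
for _ in range(k):
    pairs = get_pair_counts(corpus)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges)  # e.g. the first merges: ('e','r'), ('er','_'), ('n','e'), ('ne','w'), ...
```

The final vocabulary is the original characters plus the k merged symbols, exactly as described on the previous slide.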
Sentence Segmentation

Sentence Segmentation
• !, ? are relatively unambiguous
• Period “.” is quite ambiguous
• Sentence boundary
• Abbreviations like Inc. or Dr.
• Numbers like .02% or 4.3
• Build a binary classifier
• Looks at a “.”
• Decides End-of-Sentence/Not-End-of-Sentence
• Classifiers: hand-written rules, regular expressions, or
machine learning
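A toy rule-based sketch of such a classifier (assuming Python; the abbreviation list and the example sentence are illustrative only, and real systems typically learn this decision from data):

```python
import re

ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "inc.", "e.g.", "i.e."}

def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]", text):
        end = match.end()
        words = text[:end].split()
        token = words[-1].lower() if words else ""   # word containing the candidate boundary
        following = text[end:].lstrip()[:1]          # first non-space character after it
        if match.group() == "." and (token in ABBREVIATIONS
                                     or following.islower() or following.isdigit()):
            continue  # likely an abbreviation or a number such as 4.3, not a boundary
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():                         # trailing material without final punctuation
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived at 4.30 p.m. He paid $3.5 million to Acme Inc. yesterday."))
# ['Dr. Smith arrived at 4.30 p.m.', 'He paid $3.5 million to Acme Inc. yesterday.']
```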
Determining if a word is end-of-sentence
Language Models

Language Models
• A language model (LM) is a machine learning model that predicts upcoming words.
• More formally, a language model assigns a probability to each possible next word, or equivalently gives a probability distribution over possible next words.
• Language models can also assign a probability to an entire sentence.
Language Models
• Thus, an LM could tell us that the following sequence has a much higher probability of appearing in a text:
• all of a sudden I notice three guys standing on the sidewalk
• than does this same set of words in a different order:
• on guys all I of notice sidewalk three a sudden standing the
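A minimal count-based sketch of the idea that an LM gives a probability distribution over possible next words (this is a toy bigram estimate, not the method of any particular model; the words appended after the slide's example sentence are invented so that "the" has more than one possible continuation):

```python
from collections import Counter, defaultdict

# Toy corpus: the slide's example sentence plus a few invented words.
corpus = ("all of a sudden I notice three guys standing on the sidewalk "
          "standing on the corner of the street").split()

# Count which words follow which.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_distribution(prev):
    """Relative-frequency estimate of P(next word | previous word)."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_distribution("the"))       # {'sidewalk': 0.33.., 'corner': 0.33.., 'street': 0.33..}
print(next_word_distribution("standing"))  # {'on': 1.0}
```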


Basic Probability

Probability

• Fair Coin Toss:
• Probability of heads: ½ → P(H) = 0.5
• Probability of tails: ½ → P(T) = 0.5
• The fair coin toss universe has only two outcomes; there is no other possibility.
Probability

• Fair die roll
• Probability of getting a 6: 1/6 → P(‘6’) ≈ 0.167
• There are 6 possible outcomes in this universe, each equally likely.


Joint Probability
• Joint probability is a statistical measure of the likelihood of two events occurring together, at the same point in time.
• Suppose we throw a white die and a black die simultaneously. What is the probability that the outcome sums to 3?
• (1,2) and (2,1) are the only two of the 36 possibilities that sum to 3.
• So: P(sum = 3) = 2/36


Conditional Probability
• Conditional probability is the probability of an event or outcome occurring, given that a previous event or outcome has occurred.
• Now suppose we have already thrown the black die and got a 2.
• What is the probability of “sums to 3” given this event?
• So: P(sum = 3 | black die shows 2) = 1/6
• Only one possibility out of the 6 remaining outcomes gives a sum of 3.
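A short sanity check of these two numbers by enumerating the 36 equally likely outcomes (a sketch in Python; outcomes are written as (white, black) pairs):

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))        # all 36 (white, black) rolls

sum_is_3   = [o for o in outcomes if sum(o) == 3]      # {(1,2), (2,1)}
black_is_2 = [o for o in outcomes if o[1] == 2]        # 6 outcomes
both       = [o for o in sum_is_3 if o[1] == 2]        # {(1,2)}

print(len(sum_is_3) / len(outcomes))   # P(sum = 3) = 2/36 ≈ 0.056
print(len(both) / len(black_is_2))     # P(sum = 3 | black = 2) = 1/6 ≈ 0.167
```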
Conditional Probability
• A Universe with all possible outcomes
• Interested in some subset of them (some event)
• Assume we are studying diabetes:
• We observe people and see whether they have diabetes or not
• If we take as our Universe, all the people participating in our
study, then there are two possible outcomes for any individual:
Either they have diabetes, or they do not have diabetes
• We can then split our universe in two events:
• The event “people with diabetes” (designated as 𝐴)
• The event “people with no diabetes” (designated as ~𝐴)
Conditional Probability

• So, what is the probability that a randomly chosen person has diabetes?
• It is the number of elements in A divided by the number of elements in U (the universe).
• We denote the number of elements of A as |A| (the cardinality of A).
• We define the probability of A as: P(A) = |A| / |U|
• Since A can have at most the same number of elements as U, the probability P(A) can be at most 1.
Conditional Probability
• Let’s say there is a new screening test
that is supposed to measure
something
• That test will be “positive” for some
people, and “negative” for others.
• If we take the event B to be “people
for whom the test is positive”
• What is the probability that the test
will be “positive” for a randomly
selected person?
The Two Events Jointly
• What happens if we put them together?
• We can then compute the probability of both events occurring together as: P(A ∩ B) = |A ∩ B| / |U|
The Two Events Jointly
• We are dealing with:
• An entire Universe (all people)
• The event A (people with diabetes)
• The event B (people for whom the test is positive)
• There is also an overlap, the event AB (A ∩ B): “People with diabetes and with a positive test result.”
• There is also the event B − AB: “People with a positive test result and without diabetes.”
• And the event A − AB: “People with diabetes and with a negative test result.”


Conditional Probability
• “Given that the test is positive for a randomly selected individual, what is the probability that said individual has diabetes?”
• In terms of our Venn diagram: given that we are in region B, what is the probability that we are in region AB?
• Or stated differently: “If we make region B our new Universe, what is the probability of A?”
• The notation for this is P(A|B) (the probability of A given B).
Conditional Probability

• Let us convert the counts to probabilities: P(A|B) = |AB| / |B|
• Dividing both the numerator and the denominator by |U|, we get:
• P(A|B) = P(AB) / P(B)   (Equation 1)
• What we’ve effectively done is change the Universe from U (all people) to B (people for whom the test is positive), but we are still dealing with probabilities defined in U.
Conditional Probability
• Now let’s ask the converse question:
• “Given that a randomly selected individual has diabetes (event A), what is the probability that the test is positive for that individual (event B)?”
• P(B|A) = |AB| / |A| = P(AB) / P(A)   (Equation 2)
The Bayes Theorem
• Now we have everything we need to derive Bayes’ theorem. Putting Equations 1 and 2 together:
• From Equation 1: P(AB) = P(A|B) P(B). From Equation 2: P(AB) = P(B|A) P(A).
• Therefore P(A|B) P(B) = P(B|A) P(A), which gives P(A|B) = P(B|A) P(A) / P(B).
• Which is to say, P(A ∩ B) is the same whether you’re looking at it from the point of view of A or B.
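As a quick numeric check, using the dice example from earlier (taking A = “sum is 3” and B = “black die shows 2”):
• P(A) = 2/36, P(B) = 6/36 = 1/6, and P(B|A) = 1/2 (the black die shows 2 in exactly one of the two outcomes (1,2) and (2,1)).
• Bayes’ theorem then gives P(A|B) = P(B|A) P(A) / P(B) = (1/2)(2/36) / (1/6) = (1/36) / (1/6) = 1/6, which matches the conditional probability computed directly on the earlier slide.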
Independence
• If the probability of occurrence of an event A is not affected
by the occurrence of another event B, then A and B are said
to be independent events.
• A = “Today is Friday”
• B = “Heads on fair coin”
• If A and B are independent:
• P(A∩B) = P(A)P(B)
• Or stated a bit differently:
• P(A|B) = P(A) if P(B) > 0 and P(B|A) = P(B) if P(A) > 0
• P(A|B) = P(A∩B) / P(B) is not defined when P(B) = 0
• P(B|A) = P(A∩B) / P(A) is not defined when P(A) = 0
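A quick check of P(A∩B) = P(A)P(B) by enumeration, using exact fractions (a sketch assuming the two-dice setup from earlier; the particular events are chosen here only for illustration):

```python
from fractions import Fraction
from itertools import product

outcomes = set(product(range(1, 7), repeat=2))    # all 36 (white, black) rolls

A = {o for o in outcomes if o[0] % 2 == 0}        # "white die is even"
B = {o for o in outcomes if o[1] == 2}            # "black die shows 2"

p_a  = Fraction(len(A), 36)                       # 1/2
p_b  = Fraction(len(B), 36)                       # 1/6
p_ab = Fraction(len(A & B), 36)                   # 1/12

print(p_ab == p_a * p_b)                          # True: A and B are independent
```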
Independence and Mutual Exclusion
Summary
• For independent events A and B: P(A ∩ B) = P(A) P(B)
• For independent events A, B and C: P(A ∩ B ∩ C) = P(A) P(B) P(C)
• For dependent events A and B: P(A ∩ B) = P(A) P(B|A)
• For dependent events A, B and C: P(A ∩ B ∩ C) = P(A) P(B|A) P(C|A ∩ B)


Sources
• https://web.stanford.edu/~jurafsky/slp3/3.pdf
