This document discusses different types of text data and methods for processing text for machine learning models. It covers the bag-of-words representation, removing stopwords, stemming and lemmatization, and topic modeling, and closes with a summary.
Machine Learning Road Map
◼ Overview
◼ Types of Data Represented as Strings
◼ Bag of Words
◼ Stopwords
◼ Stemming & Lemmatization
◼ Topic Modeling
◼ Summary
Overview
◼ If we want to classify an email message as either a legitimate email or spam, the content of the email will certainly contain important information for this classification task.
◼ Or maybe we want to learn about the opinion of a politician on the topic of immigration; an individual's speeches or tweets might provide useful information.
◼ In customer service, we often want to find out whether a message is a complaint or an inquiry.
◼ We can use the subject line and content of a message to automatically determine the customer's intent, which allows us to send the message to the appropriate department, or even send a fully automatic reply.
◼ Text data is usually represented as strings, made up of characters. In any of the examples just given, the length of the text data will vary.
◼ This makes text clearly very different from the numeric features we've discussed so far; we will need to process the data before we can apply our machine learning algorithms to it.
Types of Data Represented as Strings
There are four kinds of string data you might see:
◼ Categorical data
◼ Free strings that can be semantically mapped to categories
◼ Structured string data
◼ Text data
Categorical Data
◼ Categorical data is data that comes from a fixed list. Say you collect data via a survey where you ask people their favorite color, with a drop-down menu that allows them to select from "red," "green," "blue," "yellow," "black," "white," "purple," and "pink."
◼ This will result in a dataset with exactly eight different possible values, which clearly encode a categorical variable.

Free Strings
◼ Now imagine that instead of providing a drop-down menu, you provide a text field for users to enter their own favorite colors.
◼ Many people might respond with a color name like "black" or "blue." Others might make typographical errors, use different spellings like "gray" and "grey," or use more evocative and specific names like "midnight blue."
◼ You will also have some very strange entries, which are hard to map to colors automatically (or at all). The responses you can obtain from a text field belong to the second category in the list: free strings that can be semantically mapped to categories.
Structured String Data
◼ Often, manually entered values do not correspond to fixed categories but still have some underlying structure, like addresses, names of places or people, dates, telephone numbers, or other identifiers.
◼ These kinds of strings are often very hard to parse, and their treatment is highly dependent on context and domain.
◼ A systematic treatment of these cases is beyond the scope of this course.
Text Data
◼ The final category of string data is freeform text data that consists of phrases or sentences.
◼ Examples include tweets, chat logs, and hotel reviews, as well as the collected works of Shakespeare, the content of Wikipedia, or the Project Gutenberg collection of 50,000 e-books.
◼ All of these collections contain information mostly as sentences composed of words.
◼ For simplicity's sake, let's assume all our documents are in one language, English.
◼ In the context of text analysis, the dataset is often called the corpus, and each data point, represented as a single text, is called a document.
◼ These terms come from the information retrieval (IR) and natural language processing (NLP) communities, both of which deal mostly with text data.
Bag of Words
◼ One of the simplest but most effective and commonly used ways to represent text for machine learning is the bag-of-words representation.
◼ When using this representation, we discard most of the structure of the input text, like chapters, paragraphs, sentences, and formatting, and only count how often each word appears in each text in the corpus.
◼ Discarding the structure and counting only word occurrences leads to the mental image of representing text as a "bag."

Computing the bag-of-words representation for a corpus of documents consists of the following three steps (a code sketch follows this list):
◼ Tokenization. Split each document into the words that appear in it (called tokens), for example by splitting on whitespace and punctuation.
◼ Vocabulary building. Collect a vocabulary of all words that appear in any of the documents, and number them (say, in alphabetical order).
◼ Encoding. For each document, count how often each of the words in the vocabulary appears in this document.
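These three steps are implemented together by scikit-learn's CountVectorizer. A minimal sketch on a made-up two-document corpus (the corpus and variable names are illustrative, not from the original slides):

```python
from sklearn.feature_extraction.text import CountVectorizer

# A toy corpus: each string is one "document"
bards_words = ["The fool doth think he is wise,",
               "but the wise man knows himself to be a fool"]

# fit() performs tokenization and vocabulary building
vect = CountVectorizer()
vect.fit(bards_words)
print("Vocabulary:", vect.vocabulary_)

# transform() performs the encoding step: a sparse matrix of
# per-document word counts
bag_of_words = vect.transform(bards_words)
print("Dense representation:\n", bag_of_words.toarray())
```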
Stopwords
◼ Another way to get rid of uninformative words is to discard words that are too frequent to be informative.
◼ There are two main approaches: using a language-specific list of stopwords, or discarding words that appear too frequently.
◼ scikit-learn has a built-in list of English stopwords in the feature_extraction.text module:
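The code listing that followed on the original slide is not preserved; a minimal sketch of what inspecting and applying the built-in list typically looks like:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer

print("Number of stop words:", len(ENGLISH_STOP_WORDS))
print("Every 20th stopword:", sorted(ENGLISH_STOP_WORDS)[::20])

# The built-in list is applied during vectorization by name
vect = CountVectorizer(stop_words="english")
```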
Where to Find a Stop Word List?
There are established stop word lists that you can easily plug in and use. Some of these lists come out of NLP research work, and some are manually curated by different people. Here are a few for you to try in different languages (a sketch of plugging one in follows the list):
◼ English stop words (https://github.com/igorbrigadir/stopwords/blob/master/en/terrier.txt)
◼ Russian stop words (https://github.com/stopwords-iso/stopwords-ru/blob/master/stopwords-ru.txt)
◼ French stop words (https://github.com/stopwords-iso/stopwords-fr/blob/master/stopwords-fr.txt)
◼ Spanish stop words (https://github.com/stopwords-iso/stopwords-es/blob/master/stopwords-es.txt)
◼ German stop words (https://github.com/stopwords-iso/stopwords-de/blob/master/stopwords-de.txt)
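A brief sketch of plugging in one of these lists, assuming the English (terrier) list linked above has been saved locally; the filename is hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: the terrier.txt list was downloaded and saved locally
with open("terrier.txt", encoding="utf-8") as f:
    custom_stopwords = [line.strip() for line in f if line.strip()]

# Pass the custom list in place of the built-in "english" list
vect = CountVectorizer(stop_words=custom_stopwords)
```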
Stemming and Lemmatization
◼ As we know, the vocabulary includes words like "replace", "replaced", "replacement", "replaces", and "replacing", which are different verb forms plus a noun relating to the verb "to replace."
◼ Just as with singular and plural forms of a noun, treating different verb forms and related words as distinct tokens is disadvantageous for building a model that generalizes well.
◼ For the purposes of a bag-of-words model, the semantics of "replace" and "replaces" are so close that distinguishing them will only increase overfitting, and not allow the model to fully exploit the training data.
Cont..
◼ This problem can be overcome by representing each word using its word stem, which involves identifying all the words that share that stem. If this is done with a rule-based heuristic, like dropping common suffixes, the process is usually referred to as stemming.
◼ If instead a dictionary of known word forms is used (an explicit and human-verified system), and the role of the word in the sentence is taken into account, the process is referred to as lemmatization, and the standardized form of the word is called the lemma.
◼ Both processing methods, lemmatization and stemming, are forms of normalization that try to extract some normal form of a word. Another interesting case of normalization is spelling correction, which can be helpful in practice but is outside the scope of this course.

Cont..
◼ To get a better understanding of normalization, let's compare a method for stemming, the Porter stemmer (a widely used collection of heuristics, here imported from the NLTK package), to lemmatization as implemented in the spacy package:
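The slide's code listing is not preserved; below is a sketch of the comparison setup, assuming spacy's small English model has been installed (python -m spacy download en_core_web_sm):

```python
import nltk
import spacy

# Load spacy's English pipeline (assumes "en_core_web_sm" is installed)
en_nlp = spacy.load("en_core_web_sm")
# Instantiate nltk's Porter stemmer
stemmer = nltk.stem.PorterStemmer()

def compare_normalization(doc):
    # Tokenize the document with spacy
    doc_spacy = en_nlp(doc)
    # Print the lemmas found by spacy
    print("Lemmatization:")
    print([token.lemma_ for token in doc_spacy])
    # Print the tokens produced by the Porter stemmer
    print("Stemming:")
    print([stemmer.stem(token.norm_.lower()) for token in doc_spacy])
```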
Cont..
◼ We will compare lemmatization and the Porter stemmer on a sentence designed to show some of the differences:
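A reconstruction of the slide's missing input and output, using the compare_normalization sketch above (the exact output depends on the spacy and NLTK versions):

```python
compare_normalization("Our meeting today was worse than yesterday, "
                      "I'm scared of meeting the clients tomorrow.")
# Approximate output:
# Lemmatization:
# ['our', 'meeting', 'today', 'be', 'bad', 'than', 'yesterday', ',',
#  'i', 'be', 'scared', 'of', 'meet', 'the', 'client', 'tomorrow', '.']
# Stemming:
# ['our', 'meet', 'today', 'wa', 'wors', 'than', 'yesterday', ',',
#  'i', "'m", 'scare', 'of', 'meet', 'the', 'client', 'tomorrow', '.']
```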
◼ Stemming is always restricted to trimming the word to a stem, so "was" becomes "wa", while lemmatization can retrieve the correct base verb form, "be". Similarly, lemmatization can normalize "worse" to "bad", while stemming produces "wors".
Cont..
◼ Another major difference is that stemming reduces both occurrences of "meeting" to "meet".
◼ Using lemmatization, the first occurrence of "meeting" is recognized as a noun and left as is, while the second occurrence is recognized as a verb and reduced to "meet".
◼ In general, lemmatization is a much more involved process than stemming, but it usually produces better results when used for normalizing tokens for machine learning.
Topic Modeling and Document Clustering
◼ One particular technique that is often applied to text data is topic modeling, an umbrella term for the task of assigning each document to one or multiple topics, usually without supervision.
◼ A good example is news data, which might be categorized into topics like "politics," "sports," "finance," and so on.
◼ If each document is assigned a single topic, this is the task of clustering, which we discussed in Chapter 3.

Cont..
◼ If instead each document can have more than one topic, the task relates to decomposition methods: each of the components we learn then corresponds to one topic, and the coefficients of the components in the representation of a document tell us how strongly related that document is to a particular topic.
◼ Often, when people talk about topic modeling, they refer to one particular decomposition method called Latent Dirichlet Allocation (often LDA for short).

Latent Dirichlet Allocation (LDA)
◼ Intuitively, the LDA model tries to find groups of words (the topics) that appear together frequently. LDA also requires that each document can be understood as a "mixture" of a subset of the topics.
◼ Going back to the example of news articles, we might have a collection of articles about sports, politics, and finance, written by two specific authors.
◼ In a politics article, we might expect to see words like "governor," "vote," and "party," while in a sports article we might expect words like "team," "score," and "season." Words in each of these groups will likely appear together, while it is less likely that, for example, "team" and "governor" will appear together.
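A minimal sketch of fitting LDA with scikit-learn's LatentDirichletAllocation; the toy corpus, the number of topics, and the preprocessing choices here are illustrative assumptions, not part of the original slides:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Illustrative toy corpus; in practice this would be, e.g., news articles
docs = ["the governor announced a new vote in the party",
        "the team celebrated the score at the end of the season",
        "the party will vote on the budget the governor proposed",
        "a late score gave the team the win this season"]

# Bag-of-words counts, dropping English stopwords
vect = CountVectorizer(stop_words="english")
X = vect.fit_transform(docs)

# Fit LDA with an assumed number of topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # per-document topic mixtures

# Show the most heavily weighted words for each topic
feature_names = vect.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[::-1][:4]]
    print(f"Topic {topic_idx}:", top_words)
```

Each row of doc_topics gives the topic mixture for the corresponding document, connecting the "mixture of topics" idea above to the learned components.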
Summary
◼ In this chapter we talked about the basics of processing text, also known as natural language processing (NLP), with an example application classifying movie reviews.
◼ The tools discussed here should serve as a great starting point when trying to process text data.
◼ In particular, for text classification tasks such as spam and fraud detection or sentiment analysis, bag-of-words representations provide a simple and powerful solution.
Cont..
◼ As is often the case in machine learning, the representation of the data is key in NLP applications, and inspecting the tokens and n-grams that are extracted can give powerful insights into the modeling process.
◼ In text-processing applications, it is often possible to introspect models in a meaningful way, as we saw in this chapter, for both supervised and unsupervised tasks.
◼ You should take full advantage of this ability when using NLP-based methods in practice.

Thank You!