Module 8 - Text - Update

This document provides an overview of text analytics and natural language processing. It discusses how companies can use text analytics to understand customer sentiment from reviews, complaints and surveys. It describes some of the challenges in analyzing textual data, such as ambiguity and context. It then covers common text analytics techniques like bag-of-words modeling, preprocessing text through cleaning, stemming and removing stop words. Finally, it discusses more advanced techniques like sentiment analysis using machine learning models trained on labeled text data from sources like Amazon Mechanical Turk.

Uploaded by

Vanessa Wong

IIMT 2641 Introduction to Business Analytics

Module 8: Text Analytics

Textual Information

Product Review, Customer Complaints, Internal Survey

Understanding People
§ Many companies maintain online presences

§ Managing public perception is essential in an age of instant communication

§ Reacting to changing sentiment, identifying offensive posts, determining topics of interest…

§ How can we use analytics to address this?

Using Text as Data

§ Until now, our data has typically been

– Structured
– Numerical
– Categorical

§ Reviews, Tweets, Surveys are

– Loosely structured
– Textual
– Poor spelling, non-traditional grammar
– Multilingual

Text Analytics

§ We have discussed why people care about textual data, but how do we handle it?

§ Humans can’t keep up with Internet-scale volumes of data

– ~500 Million Tweets every day!

§ Even at a small scale, the cost and time required may be prohibitive

How Can Computers Help?

§ Computers need to understand text

§ This field is called Natural Language Processing (NLP)

§ The goal is to understand and derive meaning from human language

§ In 1950, Alan Turing proposed a test of machine intelligence: a computer program passes it if it can take part in a real-time conversation and cannot be distinguished from a human

History of Natural Language Processing

§ Some progress: “chatterbots” like ELIZA

§ Initial focus on understanding grammar

§ Focus now shifting towards statistical, machine learning techniques that learn from large bodies of text

§ Modern “artificial intelligences”: Apple’s Siri and Google Now

Why is it Hard?
§ Computers need to understand text

§ Ambiguity:
– “I put my bag in the car. It is large and blue”
– “It” = bag? “It” = car?

§ Context
– Homonyms (words that are pronounced or spelled the same way but have different meanings), metaphors
– Sarcasm (irony)

§ In this lecture, we’ll see how we can build analytics models using text as our data

Sentiment Mining – Team Health Survey
§ An anonymous company conducted a team health survey and received 300 responses.

§ The firm wants to monitor how people feel about their team and the working environment.

“We're a fun team that works well together and is constantly learning
together. We are small which allows us to move quickly and have a
process with very little overhead.”

§ Challenge: Can we correctly classify the response as being happy, frustrated, or angry?

A Bag of Words
§ Fully understanding text is difficult
§ Simpler approach: Count the number of times each word appears

§ “This course is great. I would recommend this course to my friends.”

THIS   COURSE   GREAT   …   WOULD   FRIENDS
2      2        1       …   1       1
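
The counting step above can be sketched in a few lines. This is a minimal Python illustration rather than the course's R code; the function name `bag_of_words` is an assumption made for this example, and the sketch lower-cases the text and keeps only letters so that "This" and "this" are counted together.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count how often each word appears, ignoring case and punctuation."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

counts = bag_of_words("This course is great. I would recommend this course to my friends.")

# Print the counts for the words shown in the table above.
for word in ["this", "course", "great", "would", "friends"]:
    print(word, counts[word])
```

Each distinct word becomes one feature, and its count is the feature's value for that document.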
A Simple but Effective Approach
§ One feature for each word - a simple approach, but effective

§ Used as a baseline in text analytics projects and natural language processing

§ Not the whole story though - preprocessing can dramatically improve performance!

Cleaning Up Irregularities
§ Text data often has many inconsistencies that will cause algorithms trouble

§ Computers are very literal by default – Apple, APPLE, and ApPLe will all be counted separately.

§ Change all words to either lower-case or upper-case

Apple   APPLE   ApPLe  →  apple   apple   apple  (count of “apple”: 3)

Cleaning Up Irregularities
• Punctuation also causes problems – basic approach is to remove everything that isn’t a, b, …, z

• Sometimes punctuation is meaningful

• Twitter: @apple is a message to Apple, #apple is about Apple
• Web addresses: www.website.com/somepage.html
• Should tailor the approach to the specific problem

@Apple   APPLE!   --apple--  →  apple   apple   apple  (count of “apple”: 3)
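
The two cleaning steps so far (changing case and removing punctuation) can be sketched as below. This is a hedged Python illustration, not the course's R script. The `normalize` function and its `keep_prefixes` option are hypothetical names introduced here to show how the approach might be tailored when a leading "@" or "#" is meaningful.

```python
import re
from collections import Counter

def normalize(token, keep_prefixes=""):
    """Lower-case a token and drop every character that is not a-z.

    keep_prefixes is a hypothetical option: characters such as "@" or "#"
    that are kept when they are the first character of the token.
    """
    token = token.lower()
    prefix = token[0] if token and token[0] in keep_prefixes else ""
    return prefix + re.sub(r"[^a-z]", "", token)

tokens = ["@Apple", "APPLE!", "--apple--"]

# Basic approach: all three tokens collapse to "apple".
print(Counter(normalize(t) for t in tokens))

# Tailored approach: the Twitter mention stays distinct from the plain word.
print(Counter(normalize(t, keep_prefixes="@#") for t in tokens))
```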
Removing Unhelpful Terms
• Many words are used frequently but are only meaningful within a sentence: “stop words”
• Examples: the, is, at, which…
• Unlikely to improve machine learning prediction quality
• Remove to reduce size of data

• What about two words at a time? Removing stop words may cause problems by changing the meaning.

• “The Who” → “ ”
• “Take That” → “Take”
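
A small sketch of stop-word removal, including the two problem cases above. It is written in Python for illustration, and the stop list is a tiny hand-picked set assumed for this example; the lists shipped with text-mining packages are much longer.

```python
# A tiny, hand-picked stop list for illustration only.
STOP_WORDS = {"the", "is", "at", "which", "who", "that", "a", "to"}

def remove_stop_words(text):
    """Split on whitespace, lower-case, and drop any token in STOP_WORDS."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The service is great"))  # ordinary sentence
print(remove_stop_words("The Who"))               # the band name disappears
print(remove_stop_words("Take That"))             # only "take" survives
```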
Stemming
• Do we need to draw a distinction between the following words?
argue argued argues arguing
• Could all be represented by a common stem, argu
• The algorithmic process of performing this reduction is called stemming

• Many ways to approach the problem

Stemming
• Could build a database of words and their stems
• Pro: handles exceptions
• Con: won’t handle new words, bad for the Internet!

• Can write a rule-based algorithm

• e.g. if a word ends in “ed”, “ing”, or “ly”, remove that ending
• Pro: handles new/unknown words well
• Con: many exceptions, misses pairs like child and children (but would get other plurals: dog and dogs)
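
The rule-based option can be sketched as a naive suffix stripper. This Python example is only an illustration of the idea and is much simpler than a production stemmer. The "s" rule and the minimum stem length are assumptions added here so that the plural example works and very short words are left alone.

```python
SUFFIXES = ("ing", "ed", "ly", "s")  # "s" is added here to cover simple plurals
MIN_STEM_LENGTH = 3                  # hypothetical guard against over-stripping

def naive_stem(word):
    """Strip the first matching suffix, if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word

for w in ["walked", "walking", "quickly", "dogs", "child", "children"]:
    print(w, "->", naive_stem(w))
```

The rules merge "dog" and "dogs" but leave "child" and "children" as two different terms, which is the kind of exception noted above.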
Stemming
• The second option is widely popular
• “Porter Stemmer” by Martin Porter in 1980, still used!
• It is a process for removing the commoner morphological and inflexional endings from words
• Stemmers have been written for many languages
• Packages are available for many programming languages
• Other options include machine learning (train algorithms to recognize the roots of words) and combinations of the above

Real example from the data (before and after stemming):

“byfar the best customer care serviceI haveever received”
“byfar the best custom care servic I haveeverreceiv”
Cleaning Up Irregularities in R
§ Install required library
§ Read the data

In R, a Corpus is a collection of text document(s) to apply text mining or NLP routines on.
§ Cleaning the text data (removing special characters from the text)
– Replace "/", "@" and "|" with a space
– Convert the text to lower case
– Remove numbers
– Remove common stop words (“the”, “is”, “at”, “on” in English)
– Remove your own stop words (in this case, ‘team’, ‘company’, ‘health’)
– Remove punctuation
– Eliminate extra white space
– Stem the text (stemming is the process of reducing a word to its root form)

If you get a warning about ‘transformation drops documents’, it can be ignored.
§ Calculate the word frequency
§ Visualization
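
The cleaning steps and the word-frequency count can be sketched end to end as follows. This is a Python illustration of the same sequence, not the R script used in the course. The stop lists and the two survey responses are invented for the example, and the stemming step is left out to keep the sketch short.

```python
import re
from collections import Counter

# Hypothetical stop lists for illustration; a real run would use a fuller list.
COMMON_STOP_WORDS = {"the", "is", "at", "on", "a", "and", "our", "we"}
CUSTOM_STOP_WORDS = {"team", "company", "health"}
STOP_PATTERN = re.compile(
    r"\b(?:" + "|".join(sorted(COMMON_STOP_WORDS | CUSTOM_STOP_WORDS)) + r")\b"
)

def clean(text):
    """Apply the cleaning steps in the order listed above (stemming omitted)."""
    text = re.sub(r"[/@|]", " ", text)        # replace "/", "@" and "|" with a space
    text = text.lower()                       # convert to lower case
    text = re.sub(r"\d+", "", text)           # remove numbers
    text = STOP_PATTERN.sub("", text)         # remove common and custom stop words
    text = re.sub(r"[^\w\s]", "", text)       # remove punctuation
    return re.sub(r"\s+", " ", text).strip()  # eliminate extra white space

# Invented responses standing in for the survey data.
responses = [
    "Our team is GREAT at planning/delivery, 5 stars!!",
    "Planning is great; the company supports us.",
]
cleaned = [clean(r) for r in responses]
frequency = Counter(word for doc in cleaned for word in doc.split())

print(cleaned)
print(frequency.most_common(2))
```

The resulting `frequency` table is what a bar chart or word cloud of the most common terms would be drawn from.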
Word Association
§ Correlation is a statistical technique that can demonstrate whether, and how
strongly, pairs of variables are related.

§ ‘feel’ is a high-frequency word, but ‘feel good’ and ‘feel bad’ are totally
different.

§ Which words occur most often in association with the most frequently
occurring words in the survey responses?
Word Association in R

This script shows which words are most frequently associated with ‘feel’. Here corlimit = 0.25 is the lower correlation threshold; set it lower to see more words, or higher to see fewer.
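
To make the threshold concrete, here is a minimal Python sketch. It assumes the association score is the Pearson correlation between per-document term counts, which is one common way to compute it and not necessarily the exact method used by the R script. The function names and the five short responses are invented for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if undefined)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def find_associations(docs, term, corlimit=0.25):
    """Return {other_term: correlation} for terms whose correlation is >= corlimit."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({w for doc in tokenized for w in doc})
    counts = {w: [doc.count(w) for doc in tokenized] for w in vocabulary}
    if term not in counts:
        return {}
    result = {}
    for other in vocabulary:
        if other == term:
            continue
        r = pearson(counts[term], counts[other])
        if r >= corlimit:
            result[other] = round(r, 2)
    return result

# Invented responses for illustration.
docs = ["feel good together", "feel good", "feel bad", "work together", "work alone"]
print(find_associations(docs, "feel"))                # lower threshold: more words
print(find_associations(docs, "feel", corlimit=0.5))  # higher threshold: fewer words
```

In this toy data both "good" and "bad" are associated with "feel", which is why the frequency of "feel" alone says little about sentiment.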
Word Association in R

The output indicates that “integr” (the root of the word “integrity”) and “synergi” (the root of the words “synergy”, “synergies”, etc.) have a high “correlation” with the word “good”.

This can be interpreted as the most frequently occurring word (“good”) being used in a positive context. Similarly, the root of the word “together” is highly correlated with the word “work”. This indicates that most responses say that teams “work together”, which can also be interpreted in a positive context.

Word Association in R

Find terms associated with words that occur at least 50 times.
Sentiment Analysis with Machine Learning
• Text data is publicly available
• Scrape website, or
• Use special interface for programmers (API)

• Need to construct the outcome variable for each response (label each one as positive, negative or neutral sentiment)
• Millions of sentences online
• Two people may disagree over the correct classification
• One option is to use Amazon Mechanical Turk

Amazon Mechanical Turk
• Break tasks down into small components and distribute online

• People can sign up to perform the tasks for a fee

• Pay workers, e.g. $0.02 per classified response
• Amazon MTurk serves as a broker and takes a small cut

• Many tasks require human intelligence (like classifying the sentiment of a response), but may be time-consuming or require building otherwise unneeded capacity for the creator of the task, so it is appealing to outsource the job

• After collecting the data, we train models to predict the sentiment.
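
As one possible illustration of the final step, the sketch below trains a small multinomial Naive Bayes classifier on labeled responses. The slides do not say which model is used, so Naive Bayes is an assumption made here. The four labeled responses are invented stand-ins for labels collected from human workers, and the code is Python rather than R.

```python
import math
from collections import Counter, defaultdict

def train(labeled):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of responses
    for text, label in labeled:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    word_counts, label_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented labeled responses for illustration.
labeled = [
    ("great team fun work", "positive"),
    ("love working together", "positive"),
    ("too many meetings frustrating", "negative"),
    ("bad process slow", "negative"),
]
model = train(labeled)
print(predict(model, "fun team"))
print(predict(model, "slow meetings"))
```

With real data the bag-of-words features would come from the cleaned text, and the model would be evaluated on responses held out from training.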

Sentiment Analysis Today
• Over 7,000 research articles have been written on this topic

• Hundreds of start-ups are developing sentiment analysis solutions

• Many websites perform real-time analysis of tweets

• “tweetfeel” shows trends given any term
• “The Stock Sonar” shows sentiment and stock prices

Sentiment Analysis in R
§ We use the Syuzhet package for generating sentiment scores

§ Syuzhet has four sentiment dictionaries and offers a method for accessing the sentiment extraction tool developed in the NLP group at Stanford.

The get_nrc_sentiment function returns a data frame in which each row represents a sentence from the original file.

The data frame has ten columns (one column for each of the eight emotions, one
column for positive sentiment and one for negative sentiment).
• Zero occurrences of words associated with emotions of anger, disgust, fear, sadness and surprise
• One occurrence each of words associated with emotions of anticipation and joy
• Two occurrences of words associated with emotions of trust
• Total of one occurrence of words associated with negative emotions
• Total of two occurrences of words associated with positive emotions
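
The idea behind such a row can be sketched with a lexicon lookup. The Python example below uses a tiny word-to-category lexicon invented for this illustration; it is not the NRC lexicon, and the sentence is a made-up response, so the counts differ from the row described above.

```python
import re

EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust"]
CATEGORIES = EMOTIONS + ["negative", "positive"]

# Toy lexicon invented for this example; real lexicon entries may differ.
TOY_LEXICON = {
    "fun": ["joy", "positive"],
    "team": ["trust"],
    "trust": ["trust", "positive"],
    "blame": ["anger", "negative"],
}

def emotion_row(sentence):
    """Count lexicon hits per category for one sentence (one row, ten columns)."""
    row = {category: 0 for category in CATEGORIES}
    for word in re.findall(r"[a-z]+", sentence.lower()):
        for category in TOY_LEXICON.get(word, []):
            row[category] += 1
    return row

row = emotion_row("We are a fun team and we trust each other.")
print(row)
```

Summing such rows over all responses gives the per-emotion totals that a bar plot of the eight emotions would show.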
Cleaning the result
Visualization
§ The plot shows the total number of instances of words in the text associated with each of the eight emotions.

This plot demonstrates that words associated with the positive emotion of “trust” occurred about five hundred times in the text, whereas words associated with the negative emotion of “disgust” occurred fewer than 25 times.

Visualization

This shows that words associated with the positive emotion of “trust” constitute over 35% of all the meaningful words in this text.

On the other hand, the emotion of “disgust” has the shortest bar, indicating that words associated with this negative emotion constitute less than 2% of all the meaningful words in this text.

Different methods in Syuzhet package
The get_sentiment function accepts two arguments: a character vector (of sentences or words) and a method.

The sentiment scores of all meaningful words in the first response (line) in the text file add up to 2.60. [For every word, the sentiment score under the syuzhet method is a decimal ranging from -1 (most negative) to +1 (most positive).]

The median value is 1.6, which is above zero, so the overall sentiment across all the responses can be interpreted as positive.
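
A dictionary-based score of this kind can be sketched as a sum of per-word scores followed by a median over responses. The Python example below uses invented word scores and invented responses, so its numbers do not match the 2.60 and 1.6 reported above.

```python
import re
from statistics import median

# Toy word scores in [-1, 1], invented for illustration; these are not the
# values in the syuzhet dictionary.
TOY_SCORES = {"fun": 0.75, "well": 0.5, "learning": 0.5,
              "slow": -0.25, "frustrating": -0.75}

def sentence_score(sentence):
    """Sum the scores of all words found in the lexicon; unknown words add 0."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(TOY_SCORES.get(w, 0.0) for w in words)

# Invented responses for illustration.
responses = [
    "We're a fun team that works well together and is constantly learning.",
    "Releases are slow.",
    "Too many meetings, frustrating.",
]
scores = [sentence_score(r) for r in responses]
print(scores)
print(median(scores))
```

A median above zero would suggest that the typical response is positive; in this toy data the median is below zero.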
Different methods in Syuzhet package

bing – binary scale with -1 indicating negative and +1 indicating positive sentiment

afinn – integer scale ranging from -5 to +5

Compare Different methods in Syuzhet package

Since these different methods use different scales, it’s better to transform their output to a common scale before comparing them.

You can perform this simple scale conversion with R's built-in sign function, which converts all positive numbers to 1 and all negative numbers to -1, and leaves zeros at 0.
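
The same conversion can be sketched in Python, which has no built-in sign function, so a small one is defined here. The three score lists are hypothetical values for the same three responses under each method, invented to show how the comparison works.

```python
def sign(x):
    """Return 1 for positive numbers, -1 for negative numbers, and 0 for zero."""
    return (x > 0) - (x < 0)

# Hypothetical scores for the same three responses under three methods.
method_scores = {
    "syuzhet": [2.60, -0.25, 0.0],
    "bing": [3, -1, 0],
    "afinn": [7, 1, 0],
}

# Put every method on the common scale of -1, 0, +1.
signs = {name: [sign(s) for s in scores] for name, scores in method_scores.items()}
for name, values in signs.items():
    print(name, values)

# Flag the responses on which the methods disagree after conversion.
disagreements = [i for i in range(3) if len({signs[m][i] for m in signs}) > 1]
print(disagreements)
```

After conversion only the direction of the sentiment is compared, so differences in magnitude between the scales no longer matter.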
Takeaway messages
• Analytical sentiment analysis can replace more labor-intensive methods like
polling

• Text analytics can deal with the massive amounts of unstructured data
being generated on the internet

• Computers are becoming more and more capable of interacting with humans and performing human tasks
