Module 8 - Text - Update

This document provides an overview of text analytics and natural language processing. It discusses how companies can use text analytics to understand customer sentiment from reviews, complaints and surveys. It describes some of the challenges in analyzing textual data, such as ambiguity and context. It then covers common text analytics techniques like bag-of-words modeling, preprocessing text through cleaning, stemming and removing stop words. Finally, it discusses more advanced techniques like sentiment analysis using machine learning models trained on labeled text data from sources like Amazon Mechanical Turk.

Uploaded by

Vanessa Wong

IIMT 2641 Introduction to Business Analytics

Module 8: Text Analytics

Textual Information

Product Review, Customer Complaints, Internal Survey


Understanding People
§ Many companies maintain online presences

§ Managing public perception is essential in an age of instant communication

§ Reacting to changing sentiment, identifying offensive posts, determining topics of interest…

§ How can we use analytics to address this?


Using Text as Data

§ Until now, our data has typically been
– Structured
– Numerical
– Categorical

§ Reviews, Tweets, Surveys are
– Loosely structured
– Textual
– Prone to poor spelling and non-traditional grammar
– Often multilingual
Text Analytics

§ We have discussed why people care about textual data, but how do we handle it?

§ Humans can’t keep up with Internet-scale volumes of data
– ~500 Million Tweets every day!

§ Even at a small scale, the cost and time required may be prohibitive
How Can Computers Help?

§ Computers need to understand text

§ This field is called Natural Language Processing (NLP)

§ The goal is to understand and derive meaning from human language

§ In 1950, Alan Turing proposed a test of machine intelligence: a computer program passes it if it can take part in a real-time conversation and cannot be distinguished from a human
History of Natural Language Processing

§ Some progress: “chatterbots” like ELIZA

§ Initial focus on understanding grammar

§ Focus shifting now towards statistical, machine learning techniques that learn from large bodies of text

§ Modern “artificial intelligences”: Apple’s Siri and Google Now
Why is it Hard?
§ Computers need to understand text

§ Ambiguity:
– “I put my bag in the car. It is large and blue”
– “It” = bag? “It” = car?

§ Context
– Homonyms (words that are pronounced or spelled the same way but have different meanings), metaphors
– Sarcasm (irony)

§ In this lecture, we’ll see how we can build analytics models using text as our data
Sentiment Mining – Team Health Survey
§ An anonymous company conducted a team health survey and received 300 responses.

§ The firm wants to monitor how people feel about their team and the working environment.

“We're a fun team that works well together and is constantly learning together. We are small which allows us to move quickly and have a process with very little overhead.”

§ Challenge: Can we correctly classify the response as being happy, frustrated, or angry?
A Bag of Words
§ Fully understanding text is difficult
§ Simpler approach: Count the number of times each word appears

§ “This course is great. I would recommend this course to my friends.”

THIS   COURSE   GREAT   …   WOULD   FRIENDS
2      2        1       …   1       1
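The counts in the table can be reproduced with a few lines of base R. This is a minimal sketch rather than the course's own script: it lower-cases the sentence, drops everything except letters and spaces, splits on spaces, and tabulates. The variable names are illustrative.

```r
# Bag-of-words counts for the example sentence (base R only)
sentence <- "This course is great. I would recommend this course to my friends."

# Keep letters and spaces only, lower-case, then split into words
cleaned <- tolower(gsub("[^A-Za-z ]", "", sentence))
words <- strsplit(cleaned, " +")[[1]]

# One count per distinct word: this is the bag of words
word_counts <- table(words)
print(word_counts)
```

Each distinct word becomes one feature, so a collection of responses turns into a table with one row per response and one column per word.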
A Simple but Effective Approach
§ One feature for each word - a simple approach, but effective

§ Used as a baseline in text analytics projects and natural language processing

§ Not the whole story though - preprocessing can dramatically improve performance!
Cleaning Up Irregularities
§ Text data often has many inconsistencies that will cause algorithms trouble

§ Computers are very literal by default – Apple, APPLE, and ApPLe will all be counted separately.

§ Change all words to either lower-case or upper-case

Before:  Apple   APPLE   ApPLe
After:   apple   apple   apple     (count of “apple”: 3)
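A short base R sketch of the effect of case folding, using the three spellings from the slide. The variable names are illustrative.

```r
# The same word in three different casings
tokens <- c("Apple", "APPLE", "ApPLe")

# Counted literally, each spelling is a separate word
raw_counts <- table(tokens)

# After converting to lower case, all three collapse into one word
folded_counts <- table(tolower(tokens))

print(raw_counts)
print(folded_counts)
```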
Cleaning Up Irregularities
• Punctuation also causes problems – basic approach is to remove everything that isn’t a,b,…,z

• Sometimes punctuation is meaningful
– Twitter: @apple is a message to Apple, #apple is about Apple
– Web addresses: www.website.com/somepage.html
• Should tailor approach to the specific problem

Before:  @Apple   APPLE!   --apple--
After:   apple    apple    apple     (count of “apple”: 3)
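The two options (strip all punctuation, or keep the characters that carry meaning) might look like this in base R. The choice to keep only "@" and "#" is an assumption made for a Twitter-style example; other problems will need a different set.

```r
tokens <- c("@Apple", "APPLE!", "--apple--")

# Basic approach: remove everything that is not a letter, then lower-case
strip_all <- tolower(gsub("[^A-Za-z]", "", tokens))

# Tailored approach: also keep "@" and "#" so mentions and hashtags survive
keep_tags <- tolower(gsub("[^A-Za-z@#]", "", tokens))

print(strip_all)
print(keep_tags)
```

With the basic approach all three tokens become the same word; with the tailored approach the mention stays distinguishable from the plain word.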
Removing Unhelpful Terms
• Many words are frequently used but are only meaningful in a sentence - “stop words”
• Examples: the, is, at, which…
• Unlikely to improve machine learning prediction quality
• Remove to reduce size of data

• Caution: when two words belong together (e.g. a name), removing stop words can change or destroy the meaning
– “The Who” → “ ” (nothing left)
– “Take That” → “Take”
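A minimal sketch of stop-word removal in base R, including the two band names that go wrong. The stop-word list here is a tiny illustrative one (an assumption); real lists are much longer.

```r
# Tiny illustrative stop-word list (real lists contain far more words)
stop_words <- c("the", "is", "at", "which", "who", "that")

remove_stop_words <- function(text) {
  words <- strsplit(tolower(text), " +")[[1]]
  kept <- words[!words %in% stop_words]
  paste(kept, collapse = " ")
}

remove_stop_words("This course is great")  # stop word "is" is dropped
remove_stop_words("The Who")               # every word is a stop word
remove_stop_words("Take That")             # only "take" survives
```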
Stemming
• Do we need to draw a distinction between the following words?
argue argued argues arguing
• Could all be represented by a common stem, argu
• Algorithmic process of performing this reduction is called stemming

• Many ways to approach the problem


Stemming
• Could build a database of words and their stems
– Pro: handles exceptions
– Con: won’t handle new words, bad for the Internet!

• Can write a rule-based algorithm
– e.g. if a word ends in “ed”, “ing”, or “ly”, remove the ending
– Pro: handles new/unknown words well
– Con: many exceptions, misses words like child and children (but would get other plurals: dog and dogs)
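The rule-based idea can be sketched in one line of base R. This toy function applies only the example rule above (strip a trailing "ed", "ing", or "ly"); it is not a real stemmer, and the function name is made up for illustration.

```r
# Toy rule-based stemmer: strip a trailing "ed", "ing", or "ly"
simple_stem <- function(word) sub("(ed|ing|ly)$", "", word)

# Works on regular forms, including words it has never seen before
regular <- simple_stem(c("argued", "arguing", "quickly"))

# Exceptions: "children" is left untouched, and "red" is wrongly cut to "r"
exceptions <- simple_stem(c("children", "red"))

print(regular)
print(exceptions)
```

The second call shows both kinds of failure: an irregular plural the rule cannot reach, and a short word the rule damages.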
Stemming
• The second option is widely popular
– “Porter Stemmer” by Martin Porter in 1980, still used!
– It is a process for removing the commoner morphological and inflexional endings from words
– Stemmers have been written for many languages
– Packages are available for many programming languages
• Other options include machine learning (train algorithms to recognize the roots of words) and combinations of the above

Real example from data (before and after stemming):


“byfar the best customer care serviceI haveever received”
“byfar the best custom care servic I haveeverreceiv”
Cleaning Up Irregularities in R
§ Install required library
§ Read the data

In R, a Corpus is a collection of text document(s) to apply text mining or NLP routines on.
§ Cleaning the text data (removing special characters from the text)
– Replace "/", "@" and "|" with a space
– Convert the text to lower case
– Remove numbers
– Remove common stop words (“the, is, at, on” in English)
– Remove your own stop words (in this case, ‘team’, ‘company’, ‘health’)
– Remove punctuation
– Eliminate extra white spaces
– Text stemming (the process of reducing a word to its root form)

If you get the warning ‘transformation drops documents’, you can ignore it.
§ Calculate the word frequency
§ Visualization
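The cleaning steps and the word-frequency count can be sketched end to end in base R. This is not the course script: the slide does not name its library (the Corpus terminology suggests the tm package, which is an assumption), so the sketch below avoids packages altogether. The two responses are made up, the common stop-word list is cut down to a handful of words, and stemming is left out because it needs a stemmer library.

```r
# Two made-up survey responses
responses <- c(
  "The team is FUN and works well together!",
  "Company/team health: 3 meetings | works on overhead @ HQ"
)

common_stop_words <- c("the", "is", "at", "on", "and")   # tiny illustrative list
custom_stop_words <- c("team", "company", "health")      # from the slide
stop_pattern <- paste0(
  "\\b(", paste(c(common_stop_words, custom_stop_words), collapse = "|"), ")\\b"
)

clean_text <- function(x) {
  x <- gsub("[/@|]", " ", x)                   # replace "/", "@" and "|" with a space
  x <- tolower(x)                              # convert to lower case
  x <- gsub("[0-9]+", "", x)                   # remove numbers
  x <- gsub(stop_pattern, "", x, perl = TRUE)  # remove common and custom stop words
  x <- gsub("[[:punct:]]", "", x)              # remove punctuation
  x <- gsub("[[:space:]]+", " ", x)            # collapse extra white space
  trimws(x)
}

cleaned <- clean_text(responses)

# Word frequency across all responses, most frequent first
word_freq <- sort(table(unlist(strsplit(cleaned, " "))), decreasing = TRUE)

print(cleaned)
print(word_freq)
```

For the visualization step, something like `barplot(head(word_freq, 10))` gives a quick bar chart of the most frequent words.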
Word Association
§ Correlation is a statistical technique that can demonstrate whether, and how
strongly, pairs of variables are related.

§ ‘feel’ is a high-frequency word, but ‘feel good’ and ‘feel bad’ are totally
different.

§ Which words occur most often in association with the most frequently
occurring words in the survey responses?
Word Association in R

This script shows which words are most frequently associated with ‘feel’. Here corlimit = 0.25 is the lower correlation limit (threshold): set it lower to see more words, or higher to see fewer.
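To make the correlation threshold concrete, here is a base R sketch of the idea. The parameter name corlimit matches the findAssocs function in the tm package; the helper below is a hypothetical stand-in written for illustration, and the five short documents are made up. It builds a document-term matrix of word counts and keeps the terms whose correlation with the chosen word is at least corlimit.

```r
# Five made-up, already-cleaned responses
docs <- c(
  "feel good team",
  "feel good work",
  "feel bad process",
  "work together",
  "work together good"
)

# Document-term matrix: one row per document, one column per word
tokens <- strsplit(docs, " ")
terms <- sort(unique(unlist(tokens)))
dtm <- t(vapply(
  tokens,
  function(w) as.numeric(table(factor(w, levels = terms))),
  numeric(length(terms))
))
colnames(dtm) <- terms

# Hypothetical helper: terms whose correlation with `term` is at least `corlimit`
find_assoc <- function(dtm, term, corlimit) {
  r <- cor(dtm)[term, ]
  r <- r[names(r) != term]
  sort(r[r >= corlimit], decreasing = TRUE)
}

assoc <- find_assoc(dtm, "work", corlimit = 0.25)
print(assoc)
```

In this toy data only "together" clears the threshold for "work", because the two words tend to appear in the same responses.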
Word Association in R

The output indicates that “integr” (which is the root for word “integrity”) and
“synergi” (which is the root for words “synergy”, “synergies”, etc.) have a high
“correlation” with the word “good”.

This suggests that the most frequently occurring word (“good”) is used in a positive context.
Similarly, the root of the word “together” is highly correlated with the word “work”.
This indicates that most responses are saying that teams “work together”, which can also be interpreted in a positive context.
Word Association in R

Find terms associated with words that occur at least 50 times.
Sentiment Analysis with Machine Learning
• Text data is publicly available
• Scrape website, or
• Use special interface for programmers (API)

• Need to construct the outcome variable for each response (label them as
positive, negative or neutral sentiment)
• Millions of sentences online
• Two people may disagree over the correct classification
• One option is to use Amazon Mechanical Turk
Amazon Mechanical Turk
• Break tasks down into small components and distribute online

• People can sign up to perform the tasks for a fee
– Pay workers, e.g. $0.02 per classified response
– Amazon MTurk serves as a broker, takes a small cut

• Many tasks require human intelligence (such as classifying the sentiment of a response), but may be time-consuming or require building otherwise unneeded capacity for the creator of the task, so it is appealing to outsource the job

• After collecting data, we train models to predict the sentiments.
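The slides do not say which model is trained on the labeled responses. As one possible illustration, the sketch below fits a very small naive Bayes classifier on bag-of-words counts in base R. The four labeled responses are made up, and all function names are illustrative.

```r
# Made-up labeled responses, standing in for data labeled by human workers
train <- data.frame(
  text = c("great fun team", "love the fun work",
           "slow painful process", "painful meetings and slow work"),
  label = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

tokenize <- function(x) strsplit(tolower(x), " +")
vocab <- sort(unique(unlist(tokenize(train$text))))
classes <- c("positive", "negative")

# Log-probability of each word within a class, with add-one smoothing
word_log_prob <- function(label) {
  words <- unlist(tokenize(train$text[train$label == label]))
  counts <- as.numeric(table(factor(words, levels = vocab))) + 1
  setNames(log(counts / sum(counts)), vocab)
}
log_prob <- lapply(setNames(classes, classes), word_log_prob)
log_prior <- sapply(classes, function(l) log(mean(train$label == l)))

# Score each class and return the most likely one
predict_sentiment <- function(text) {
  words <- tokenize(text)[[1]]
  words <- words[words %in% vocab]   # ignore words never seen in training
  scores <- sapply(classes, function(l) log_prior[[l]] + sum(log_prob[[l]][words]))
  names(which.max(scores))
}

predict_sentiment("fun work")
predict_sentiment("slow process")
```

With only four training responses this is a toy; the point is the workflow of labeled text in, predicted sentiment out.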


Sentiment Analysis Today
• Over 7,000 research articles have been written on this topic

• Hundreds of start-ups are developing sentiment analysis solutions

• Many websites perform real-time analysis of tweets
– “tweetfeel” shows trends given any term
– “The Stock Sonar” shows sentiment and stock prices
Sentiment Analysis in R
§ We use the Syuzhet package for generating sentiment scores

§ Syuzhet has four sentiment dictionaries and offers a method for accessing the sentiment extraction tool developed in the NLP group at Stanford.

The get_nrc_sentiment function returns a data frame with each row representing a sentence from the original file.

The data frame has ten columns (one column for each of the eight emotions, one column for positive sentiment and one for negative sentiment).

Reading one row of the output:
• Zero occurrences of words associated with emotions of anger, disgust, fear, sadness and surprise
• One occurrence each of words associated with emotions of anticipation and joy
• Two occurrences of words associated with emotions of trust
• Total of one occurrence of words associated with negative emotions
• Total of two occurrences of words associated with positive emotions
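To show how such a row comes about, here is a base R sketch of lexicon-based emotion counting. It does not call Syuzhet: the lexicon below is a toy with made-up word-to-emotion mappings (not the real NRC entries), chosen so that the counts follow the same pattern as the row described above. The function name count_emotions is hypothetical.

```r
# The ten output columns: eight emotions plus negative and positive
categories <- c("anger", "anticipation", "disgust", "fear", "joy",
                "sadness", "surprise", "trust", "negative", "positive")

# Toy lexicon (made-up mappings, for illustration only)
toy_lexicon <- data.frame(
  word = c("fun", "fun", "fun", "team", "learning", "learning", "overhead"),
  category = c("joy", "anticipation", "positive", "trust",
               "trust", "positive", "negative"),
  stringsAsFactors = FALSE
)

# One row per sentence, one column per category
count_emotions <- function(sentences) {
  rows <- lapply(sentences, function(s) {
    words <- strsplit(tolower(gsub("[^A-Za-z ]", "", s)), " +")[[1]]
    hits <- unlist(lapply(words, function(w) {
      toy_lexicon$category[toy_lexicon$word == w]
    }))
    setNames(as.integer(table(factor(hits, levels = categories))), categories)
  })
  as.data.frame(do.call(rbind, rows))
}

emotions <- count_emotions("We are a fun team that keeps learning, with little overhead.")
print(emotions)
```

A word can belong to several categories at once, which is why the positive and negative totals are counted separately from the eight emotions.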
Cleaning the result
Visualization
§ The plot shows the total number of instances of words in the text associated with each of the eight emotions.

This plot demonstrates that words associated with the positive emotion of “trust” occurred about five hundred times in the text, whereas words associated with the negative emotion of “disgust” occurred fewer than 25 times.
Visualization

This shows that words associated with the positive emotion of “trust” constitute over 35% of all the meaningful words in this text.

On the other hand, the emotion of “disgust” has the shortest bar, indicating that words associated with this negative emotion constitute less than 2% of all the meaningful words in this text.
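The percentages are simply each emotion's total divided by the sum over all eight emotions. A short base R sketch follows; the totals are made-up numbers chosen only to be consistent with the figures quoted above (about 500 for trust, fewer than 25 for disgust), not the actual survey results.

```r
# Made-up emotion totals, consistent with the figures quoted in the slides
emotion_totals <- c(anger = 60, anticipation = 230, disgust = 20, fear = 70,
                    joy = 290, sadness = 90, surprise = 100, trust = 500)

# Share of each emotion among all emotion-bearing words
emotion_share <- emotion_totals / sum(emotion_totals)

print(round(100 * sort(emotion_share, decreasing = TRUE), 1))
```

A horizontal bar chart of these shares, for example `barplot(sort(emotion_share), horiz = TRUE)`, gives the kind of plot described here.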
Different methods in Syuzhet package
The get_sentiment function accepts two arguments: a character vector (of sentences or words) and a method.

The sentiment scores of all meaningful words in the first response (line) of the text file add up to 2.60. [For every word, the syuzhet method gives a decimal sentiment score ranging from -1 (most negative) to +1 (most positive).]

The median value is 1.6, which is above zero, so the typical sentiment across all the responses can be interpreted as positive.
Different methods in Syuzhet package

bing – binary scale with -1 indicating negative and +1 indicating positive sentiment

afinn – integer scale ranging from -5 to +5


Compare Different methods in Syuzhet package

Since these different methods use different scales, it’s better to transform their output
to a common scale before comparing them.

This simple scale conversion can be done with R's built-in sign function, which maps all positive numbers to 1, all negative numbers to -1, and all zeros to 0.
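A small sketch of the conversion. Apart from the 2.60 quoted earlier for the first response, the scores below are made up to stand in for the output of the three methods on four responses.

```r
# Scores for four responses under three methods
# (only the 2.60 comes from the slides; the rest are illustrative)
syuzhet_scores <- c(2.60, -0.75, 0, 1.10)
bing_scores    <- c(3, -1, 0, 1)
afinn_scores   <- c(5, -2, 1, 2)

# sign() puts all three on the same -1 / 0 / +1 scale
comparison <- rbind(
  syuzhet = sign(syuzhet_scores),
  bing    = sign(bing_scores),
  afinn   = sign(afinn_scores)
)
print(comparison)
```

After the conversion the rows can be compared column by column; here the methods agree on every response except the third.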
Takeaway messages
• Analytical sentiment analysis can replace more labor-intensive methods like polling

• Text analytics can deal with the massive amounts of unstructured data being generated on the internet

• Computers are becoming more and more capable of interacting with humans and performing human tasks
