Module 8 - Text - Update
Textual Information
§ We have discussed why people care about textual data, but how do we
handle it?
§ Even at a small scale, the cost and time required to read and code text by
hand may be prohibitive
How Can Computers Help?
§ Ambiguity:
– “I put my bag in the car. It is large and blue”
– “It” = bag? “It” = car?
§ Context
– Homonyms (words that are pronounced or spelled the same way but have
different meanings), metaphors
– Sarcasm (irony)
§ In this lecture, we’ll see how we can build analytics models using text as
our data
Sentiment Mining – Team Health Survey
§ An anonymous company conducted a team health survey and received 300
responses.
§ The firm wants to monitor how people feel about their team and their
working environment.
“We're a fun team that works well together and is constantly learning
together. We are small which allows us to move quickly and have a
process with very little overhead.”
§ Computers are very literal by default – Apple, APPLE, and ApPLe will all
be counted separately.
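A minimal sketch of the case problem in base R. The three spellings come from the bullet above; lower-casing with tolower is one common normalization step, shown here as an illustration rather than as the exact preprocessing used for the survey.

```r
# Three spellings of the same word, as a computer first sees them.
words <- c("Apple", "APPLE", "ApPLe")

# Counted as-is, each spelling is treated as its own token.
raw_counts <- table(words)

# Lower-casing first collapses them into a single token.
clean_counts <- table(tolower(words))

print(raw_counts)
print(clean_counts)
```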
§ ‘feel’ is a high-frequency word, but ‘feel good’ and ‘feel bad’ are totally
different.
§ Which words occur most often in association with the most frequently
occurring words in the survey responses?
Word Association in R
This script shows which words are most frequently associated with ‘feel’. Here
corlimit = 0.25 is the lower correlation threshold: set it lower to see more words,
or higher to see fewer.
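To make the threshold concrete, here is a rough base-R sketch of the idea behind word association: correlate the count column of a target term with the column of every other term, then keep those at or above corlimit. The small document-term matrix and its counts are invented for illustration and are not the survey data.

```r
# Toy document-term matrix: rows are responses, columns are term counts.
# The counts are invented for illustration, not taken from the survey.
dtm <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    0, 0, 1,
    1, 0, 1,
    0, 0, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("feel", "good", "busi"))
)

# Correlate the target term's column with every other term's column.
target <- "feel"
others <- setdiff(colnames(dtm), target)
assoc <- sapply(others, function(t) cor(dtm[, target], dtm[, t]))

# Keep only terms at or above the threshold, strongest first.
corlimit <- 0.25
assoc <- sort(assoc[assoc >= corlimit], decreasing = TRUE)
print(round(assoc, 2))

# With the tm package, the comparable call on a real DocumentTermMatrix
# would be: findAssocs(dtm, "feel", corlimit = 0.25)
```

Lowering corlimit in this sketch lets weakly or negatively related terms through, which is why a lower threshold shows more words.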
Word Association in R
The output indicates that “integr” (the stem of the word “integrity”) and
“synergi” (the stem of words such as “synergy” and “synergies”) are highly
correlated with the word “good”.
This suggests that the most frequently occurring word (“good”) is used in a positive sense.
Similarly, the stem of the word “together” is highly correlated with the word “work”.
This indicates that many responses say that teams “work together”, which can be
interpreted as a positive context.
Word Association in R
Find terms associated with words that occur at least 50 times.
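A two-step sketch of this in base R, under the assumption that "occur at least 50 times" means the total count of a term across all responses. The matrix, its counts, and the name min_freq are invented for illustration.

```r
# Toy document-term matrix (invented counts): 4 responses x 4 stemmed terms.
dtm <- matrix(
  c(20, 18, 1, 0,
    15, 20, 0, 2,
    10,  5, 3, 0,
    12, 14, 0, 1),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("work", "togeth", "slow", "meet"))
)

# Step 1: total count per term, then keep terms at or above the cutoff.
min_freq <- 50
term_totals <- colSums(dtm)
frequent <- names(term_totals[term_totals >= min_freq])

# Step 2: for each frequent term, correlate it with all other terms.
assoc <- lapply(frequent, function(term) {
  others <- setdiff(colnames(dtm), term)
  sort(sapply(others, function(t) cor(dtm[, term], dtm[, t])),
       decreasing = TRUE)
})
names(assoc) <- frequent
print(assoc)

# With tm, the comparable calls would be findFreqTerms(dtm, lowfreq = 50)
# followed by findAssocs() on the terms it returns.
```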
Sentiment Analysis with Machine Learning
• Text data is publicly available
• Scrape websites, or
• Use an application programming interface (API)
• Need to construct the outcome variable for each response (label them as
positive, negative or neutral sentiment)
• Millions of sentences online
• Two people may disagree over the correct classification
• One option is to use Amazon Mechanical Turk
Amazon Mechanical Turk
• Break tasks down into small components and distribute online
§ Syuzhet has four sentiment dictionaries and offers a method for accessing
the sentiment extraction tool developed in the NLP group at Stanford.
The get_nrc_sentiment function returns a data frame in which each row represents a
sentence from the original file.
The data frame has ten columns: one for each of the eight emotions, one for
positive sentiment, and one for negative sentiment.
• Zero occurrences of words associated with the emotions of anger, disgust, fear,
sadness and surprise
• One occurrence each of words associated with the emotions of anticipation and joy
• Two occurrences of words associated with the emotion of trust
• A total of one occurrence of words associated with negative sentiment
• A total of two occurrences of words associated with positive sentiment
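As a reading aid, the counts in the bullets above can be laid out as one row of the ten-column data frame. The values are typed in by hand from those bullets rather than computed, and the "net" figure is only an illustrative summary, not something the slides calculate.

```r
# One row shaped like the ten-column output, filled with the counts listed
# in the bullets above (entered by hand, not computed here).
row <- data.frame(
  anger = 0, anticipation = 1, disgust = 0, fear = 0, joy = 1,
  sadness = 0, surprise = 0, trust = 2, negative = 1, positive = 2
)

emotions <- c("anger", "anticipation", "disgust", "fear",
              "joy", "sadness", "surprise", "trust")

# Emotions with at least one matching word in this response.
present <- emotions[unlist(row[1, emotions]) > 0]

# An illustrative net polarity: positive count minus negative count.
net <- row$positive - row$negative

print(present)
print(net)
```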
Cleaning the result
Visualization
§ The plot shows the total number of instances of words in the text,
associated with each of the eight emotions.
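The totals behind such a plot can be sketched as column sums over the eight emotion columns. The three rows of counts below are invented, and the bar chart is drawn only in an interactive session.

```r
# Toy emotion counts for three responses (invented values).
nrc <- data.frame(
  anger = c(0, 1, 0), anticipation = c(1, 0, 2), disgust = c(0, 0, 0),
  fear = c(0, 1, 0), joy = c(1, 0, 1), sadness = c(0, 2, 0),
  surprise = c(0, 0, 1), trust = c(2, 1, 1),
  negative = c(1, 3, 0), positive = c(2, 1, 3)
)

# Keep the eight emotion columns and total each one across all responses.
emotion_totals <- sort(colSums(nrc[, 1:8]), decreasing = TRUE)
print(emotion_totals)

# Draw the bar chart only when running interactively.
if (interactive()) {
  barplot(emotion_totals, las = 2, ylab = "Word count")
}
```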
The sentiment scores of all meaningful words in the first response (line) of the
text file add up to 2.60. [With the syuzhet method, each word's sentiment score is a
decimal between -1 (most negative) and +1 (most positive).]
The median value is 1.6, which is above zero, so the overall sentiment across all
the responses can be interpreted as positive.
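The arithmetic can be sketched in base R as follows. All numbers are invented and are not the survey's scores; with the syuzhet package, per-response scores would normally come from get_sentiment(text, method = "syuzhet").

```r
# Hypothetical per-word scores for one response, each within [-1, 1].
word_scores <- c(0.5, 0.75, -0.25, 1.0)

# The response score is the sum of its word scores.
response_score <- sum(word_scores)

# Hypothetical scores for five responses; the median summarizes them.
all_scores <- c(2.60, 0.50, -0.75, 1.25, 3.10)
typical <- median(all_scores)

print(response_score)
print(typical)
print(typical > 0)  # TRUE means the typical response leans positive
```

Because a response score is a sum over words, it can fall outside the -1 to +1 range that applies to individual words.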
Different methods in Syuzhet package
bing – binary scale with -1 indicating negative and +1 indicating positive sentiment
Since these different methods use different scales, it’s better to transform their output
to a common scale before comparing them.
R's built-in sign function performs this simple scale conversion: it maps all positive
numbers to 1 and all negative numbers to -1, and leaves zeros at 0.
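A short sketch of that conversion, using invented scores for four responses from two methods. The variable names and the agreement measure are assumptions added for illustration.

```r
# Hypothetical scores for the same four responses from two methods.
syuzhet_scores <- c(2.60, -0.75, 0, 1.25)
bing_scores    <- c(3, -1, 0, -1)

# sign() puts both on a common -1 / 0 / +1 scale.
common <- data.frame(
  syuzhet = sign(syuzhet_scores),
  bing    = sign(bing_scores)
)

# Share of responses where the two methods agree on direction.
agreement <- mean(common$syuzhet == common$bing)

print(common)
print(agreement)
```

The conversion discards magnitude, so it supports comparing the direction of sentiment across methods but not its strength.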
Takeaway messages
• Analytical sentiment analysis can replace more labor-intensive methods like
polling
• Text analytics can deal with the massive amounts of unstructured data
being generated on the internet