Class Outline
• Introduction: Unstructured Data Analysis
• Word-level Analysis
– Vector Space Model
– TF-IDF

• Beyond Word-level Analysis: Natural
Language Processing (NLP)
• Text Mining Demonstration in R: Mining
Twitter Data
Background: Text Mining – New MR Tool!
• Text data is everywhere – books, news, articles, financial analysis,
blogs, social networking, etc.
• According to estimates, 80% of the world’s data is in “unstructured text
format”
• We need methods to extract, summarize, and analyze useful
information from unstructured/text data
• Text mining seeks to automatically discover useful knowledge from
this massive amount of data
• Text mining is an active area of research in both industry and
academia
What is Text Mining?
• Use of computational techniques to extract high quality
information from text

• Extract and discover knowledge hidden in text automatically

• KDD definition: “discovery by computer of new previously unknown
information, by automatically extracting information from a usually
large amount of different unstructured textual resources”
Text Mining Tasks
• 1. Document Categorization (Supervised Learning)
• 2. Document Clustering/Organization (Unsupervised Learning)
• 3. Summarization (key words, indices, etc)
• 4. Visualization (word cloud, maps)
• 5. Numeric prediction (stock market prediction based on news text)
Features of Text Data
• High dimensionality
• Large number of features
• Multiple ways to represent the same concept
• Highly redundant data
• Unstructured data
• Easy for humans, hard for machine
• Abstract ideas hard to represent
• Huge amount of data to be processed
– Automation is required
Acquiring Texts
• Existing digital corpora: e.g. XML (high quality text and metadata)
– http://www.hathitrust.org/htrc

• Other digital sources (e.g. Web, twitter, Amazon consumer reviews)
– Through API: e.g. tweets
– Websites without APIs can be “scraped”
– Generally requires custom programming (Perl, Python, etc) or software tools
(e.g. Web extractor pro)

• Undigitized text
– Scanned and subjected to Optical Character Recognition (OCR)
– Time and labor intensive
– Error-prone
Word-level Analysis: Vector Space Model
• Documents are treated as a “bag” of words or terms
• Any document can be represented as a vector: a list of terms and
their associated weights
– D = {(t1, w1), (t2, w2), …, (tn, wn)}
– ti: i-th term
– wi: weight for the i-th term

• Weight is a measure of the importance of a term in terms of its
information content
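
A minimal Python sketch of the bag-of-words representation described above. The two sample documents and the choice of raw counts as weights are assumptions made for illustration; the slides do not prescribe a particular weighting at this point.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Return a sparse term -> weight mapping, using raw counts as weights."""
    terms = re.findall(r"[a-z']+", text.lower())
    return Counter(terms)

# Hypothetical documents, not taken from the slides.
docs = {
    "d1": "The cat chased the dog",
    "d2": "The dog chased the mouse",
}

# Sparse form: each document stores only the terms it contains.
vectors = {doc_id: bag_of_words(text) for doc_id, text in docs.items()}

# Dense form: one slot per vocabulary term, zero where the term is absent.
vocabulary = sorted(set().union(*vectors.values()))
dense_d1 = [vectors["d1"][term] for term in vocabulary]

print(vocabulary)
print(dense_d1)
```

With a realistic vocabulary of many thousands of terms, almost every slot of the dense vector would be zero, which is why the sparse mapping is usually the form that gets stored.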
Vector Space Model: Bag of Words Representation
• Each document: Sparse high-dimensional vector!
TF-IDF: Definition
TF-IDF: Example
• TF: Consider a document containing 100 words wherein the word cow
appears 3 times. Following the previously defined formulas, what is
the term frequency (TF) for cow?
– TF(cow,d1) = 3.

• IDF: Now assume we have 10 million documents and cow appears in
one thousand of these. What is the inverse document frequency of
the term, cow?
– IDF(cow) = log10(10,000,000/1,000) = log10(10,000) = 4

• TF-IDF score?
– TF-IDF = 3 x 4 = 12 (Product of TF and IDF)
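
The arithmetic in this example can be checked with a short Python sketch. It assumes what the worked numbers imply: TF is the raw count of the term in the document, and IDF is the base-10 logarithm of the total number of documents divided by the number of documents containing the term. The function name tf_idf is made up for illustration.

```python
import math

def tf_idf(term_count, n_docs, doc_freq):
    """TF-IDF with raw-count TF and base-10 IDF, as in the cow example."""
    idf = math.log10(n_docs / doc_freq)
    return term_count * idf

# cow appears 3 times in d1, and in 1,000 of 10,000,000 documents
score = tf_idf(term_count=3, n_docs=10_000_000, doc_freq=1_000)
print(round(score, 6))
```

Other variants exist (for example, dividing the count by the document length, or using the natural logarithm); they change the scale of the scores but not the idea that rare terms get larger weights.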
Application 1: Document Search with Query
Document ID   Cat     Dog     Mouse   Fish    Horse   Cow     Matching Scores
d1            0.397   0.397   0.000   0.475   0.000   0.000   1.268
d2            0.352   0.301   0.680   0.000   0.000   0.000   0.653
d3            0.301   0.363   0.000   0.000   0.669   0.741   0.664
d4            0.376   0.352   0.636   0.558   0.000   0.000   1.286
d5            0.301   0.301   0.000   0.426   0.544   0.544   1.028
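
A sketch of how the matching scores in the table above could be computed. The weights are copied from the table. The query itself is not stated in the slides; the terms cat, dog, and fish are an assumption, chosen because summing those three weights reproduces the listed scores (d1 sums to 1.269 rather than the listed 1.268, presumably because the displayed weights are rounded).

```python
# Term weights per document, copied from the table.
weights = {
    "d1": {"cat": 0.397, "dog": 0.397, "mouse": 0.000, "fish": 0.475, "horse": 0.000, "cow": 0.000},
    "d2": {"cat": 0.352, "dog": 0.301, "mouse": 0.680, "fish": 0.000, "horse": 0.000, "cow": 0.000},
    "d3": {"cat": 0.301, "dog": 0.363, "mouse": 0.000, "fish": 0.000, "horse": 0.669, "cow": 0.741},
    "d4": {"cat": 0.376, "dog": 0.352, "mouse": 0.636, "fish": 0.558, "horse": 0.000, "cow": 0.000},
    "d5": {"cat": 0.301, "dog": 0.301, "mouse": 0.000, "fish": 0.426, "horse": 0.544, "cow": 0.544},
}

# Assumed query: inferred from the scores, not given in the slides.
query = ["cat", "dog", "fish"]

def matching_score(doc_weights, query_terms):
    """Sum the document's weights for the terms that appear in the query."""
    return sum(doc_weights.get(term, 0.0) for term in query_terms)

# Rank documents from best to worst match.
ranking = sorted(weights, key=lambda d: matching_score(weights[d], query), reverse=True)

for doc_id in ranking:
    print(doc_id, round(matching_score(weights[doc_id], query), 3))
```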
Application 2: Word Frequencies – Zipf’s Law
• Idea: We use a few words very often, and most words very rarely,
because it’s more effort to use a rare word.

• Zipf’s Law: Product of frequency of word and its rank is [reasonably]
constant

• Empirically demonstrable; holds up over different languages
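
One way to check Zipf's Law on a text is to rank the words by frequency and look at the product of rank and frequency. The sketch below does this in Python; the sample sentence is a made-up placeholder and far too short to show the effect, so it only illustrates the mechanics.

```python
from collections import Counter
import re

def rank_frequency(text):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    ranked = counts.most_common()
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]

# Placeholder text; substitute a large corpus to examine the law in practice.
sample = "the cat and the dog and the bird"
rows = rank_frequency(sample)

for row in rows:
    print(row)
```

On a large corpus, the last column would be expected to stay within roughly the same order of magnitude across the top-ranked words if the law holds.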
Application 3: Word Cloud - Budweiser Example

http://people.duke.edu/~el113/Visualizations.html
Problems with Word-level Analysis: Sentiment
• Sentiment is often expressed subtly, which makes it difficult to identify
from the individual terms of a sentence or document considered in
isolation
– A positive or negative sentiment word may have opposite orientations in
different application domains. (“This camera sucks.” -> negative; “This vacuum
cleaner really sucks.” -> positive)
– A sentence containing sentiment words may not express any sentiment. (e.g.
“Can you tell me which Sony camera is good?”)
– Sarcastic sentences with or without sentiment words are hard to deal with. (e.g.
“What a great car! It stopped working in two days.”)
– Many sentences without sentiment words can also imply opinions. (e.g. “This
washer uses a lot of water.” -> negative)

• We have to consider the overall context (semantics of each sentence
or document)
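
The failure modes listed above can be reproduced with a deliberately naive word-counting scorer. This Python sketch is an illustration only: the tiny POSITIVE and NEGATIVE word lists are hypothetical, and the scorer is not the method of any tool named in these slides.

```python
import re

# Hypothetical mini-lexicon, just large enough for the slide's examples.
POSITIVE = {"good", "great"}
NEGATIVE = {"sucks"}

def naive_sentiment(sentence):
    """Label a sentence by counting lexicon words, ignoring all context."""
    words = re.findall(r"[a-z']+", sentence.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

examples = [
    "This camera sucks.",
    "This vacuum cleaner really sucks.",
    "Can you tell me which Sony camera is good?",
    "What a great car! It stopped working in two days.",
    "This washer uses a lot of water.",
]
labels = [naive_sentiment(s) for s in examples]

for sentence, label in zip(examples, labels):
    print(label, "<-", sentence)
```

Only the first sentence gets the label a human reader would give it. The vacuum cleaner comes out negative, the question and the sarcastic remark come out positive, and the washer complaint comes out neutral.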
Natural Language Processing (NLP) to the Rescue!
• NLP: a field of computer science, artificial intelligence, and
linguistics, concerned with the interactions between computers and
human (natural) languages.
• Key idea: Use statistical “machine learning” to automatically learn
the language from data!
• Major tasks in NLP
– Automatic summarization
– Part-of-speech tagging (POS tagging)
– Relationship extraction
– Sentiment analysis
– Topic segmentation and recognition
– Machine translation
Demonstration: POS Tagging – 1/2
• http://cogcomp.cs.illinois.edu/demo/pos/results.php
Demonstration: POS Tagging – 2/2
Demonstration: Sentence-level Sentiment – 1/3
• Stanford Sentiment Analyzer
– http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
Demonstration: Sentence-level Sentiment – 2/3
• Review 1: This movie doesn’t care about cleverness, wit or any other
kind of intelligent humor. -> Negative
Demonstration: Sentence-level Sentiment – 3/3
• There are slow and repetitive parts, but it has just enough spice to
keep it interesting. -> Positive
• Text Mining Demonstration in R: Mining
Twitter Data
Twitter Mining in R – 1/2

Step 0) Install “R” and Packages
R program: http://www.r-project.org/
Package: http://cran.r-project.org/web/packages/tm/index.html
Package: http://cran.r-project.org/web/packages/twitteR/index.html
Package: http://cran.r-project.org/web/packages/wordcloud/index.html
Manual: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

Step 1) Retrieving Text from Twitter: Twitter API
(Using twitteR)
Twitter Mining in R – 2/2
Step 2) Transforming Text

Step 3) Stemming Words
Step 4) Build a Term-Document Matrix
Step 5) Frequent Terms and Associations

Step 6) Word Cloud
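
The R code for these steps did not survive in this transcript. As a stand-in, the Python sketch below walks through Steps 2 to 5 using only the standard library. It is not the tm or twitteR API: the stop-word list, the crude suffix-stripping stemmer, the function names, and the three sample tweets are all assumptions made for illustration. Retrieval from the Twitter API (Step 1) and the word cloud (Step 6) are left out.

```python
import re
from collections import Counter

# Hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "in", "of", "to", "for", "with", "on"}

def transform(text):
    """Step 2: lowercase, drop URLs, mentions and hashtags, keep letters, remove stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[@#]\w+", " ", text)
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Step 3: strip a few common suffixes (a rough stand-in, not the Porter stemmer)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def term_document_matrix(docs):
    """Step 4: rows are terms, columns are documents, cells are counts."""
    counts = [Counter(crude_stem(t) for t in transform(d)) for d in docs]
    terms = sorted(set().union(*counts))
    return terms, [[c[t] for c in counts] for t in terms]

def frequent_terms(terms, matrix, min_freq):
    """Step 5: terms whose total count across documents is at least min_freq."""
    return [t for t, row in zip(terms, matrix) if sum(row) >= min_freq]

# Made-up tweets standing in for text retrieved in Step 1.
tweets = [
    "Mining tweets with R is fun! http://example.com",
    "Text mining finds patterns in tweets @someone",
    "R packages for text mining",
]
terms, matrix = term_document_matrix(tweets)
print(frequent_terms(terms, matrix, min_freq=2))
```

The per-term totals that feed frequent_terms are also the weights a word cloud would use to size each word.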
Software for Text Mining
• A number of academic and commercial software packages are available:
– 1. Open source packages in R – e.g. tm
• R program: http://www.r-project.org/
• Package: http://cran.r-project.org/web/packages/tm/index.html
• Manual: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

– 2. Stanford CoreNLP
• http://nlp.stanford.edu/software/corenlp.shtml

– 3. SAS Text Miner
– 4. IBM SPSS
– 5. Boos Texter
– 6. StatSoft
– 7. AeroText

• Text Data is everywhere – you can mine it to gain insights!
The 10 Best Tips to Get SoundCloud Likes & Grow Your Profile Fast.pdf
Sociofire
 
Cracking LinkedIn's Algorithm in 2025 to up your content game.
Cracking LinkedIn's Algorithm in 2025 to up your content game.Cracking LinkedIn's Algorithm in 2025 to up your content game.
Cracking LinkedIn's Algorithm in 2025 to up your content game.
Udit Goenka
 
How brands can use memes to connect with younger audiences.pdf
How brands can use memes to connect with younger audiences.pdfHow brands can use memes to connect with younger audiences.pdf
How brands can use memes to connect with younger audiences.pdf
iM4U Digital Marketing Agency
 
Super AI Review: The First SuperModel™ Uniting Every AI Model Ever Created in...
Super AI Review: The First SuperModel™ Uniting Every AI Model Ever Created in...Super AI Review: The First SuperModel™ Uniting Every AI Model Ever Created in...
Super AI Review: The First SuperModel™ Uniting Every AI Model Ever Created in...
SOFTTECHHUB
 
2025-04 - VWO Webinar - Alignment and Focus_ The Key to Delivering Business I...
2025-04 - VWO Webinar - Alignment and Focus_ The Key to Delivering Business I...2025-04 - VWO Webinar - Alignment and Focus_ The Key to Delivering Business I...
2025-04 - VWO Webinar - Alignment and Focus_ The Key to Delivering Business I...
VWO
 

Introduction to Text Mining

  • 1. Class Outline • Introduction: Unstructured Data Analysis • Word-level Analysis – Vector Space Model – TF-IDF • Beyond Word-level Analysis: Natural Language Processing (NLP) • Text Mining Demonstration in R: Mining Twitter Data
  • 2. Background: Text Mining – New MR Tool! • Text data is everywhere – books, news, articles, financial analysis, blogs, social networking, etc • According to estimates, 80% of the world’s data is in “unstructured text format” • We need methods to extract, summarize, and analyze useful information from unstructured/text data • Text mining seeks to automatically discover useful knowledge from this massive amount of data • Text mining is an active research area in both industry and academia
  • 3. What is Text Mining? • Use of computational techniques to extract high-quality information from text • Extract and discover knowledge hidden in text automatically • KDD definition: “discovery by computer of new previously unknown information, by automatically extracting information from a usually large amount of different unstructured textual resources”
  • 4. Text Mining Tasks • 1. Document Categorization (Supervised Learning) • 2. Document Clustering/Organization (Unsupervised Learning) • 3. Summarization (key words, indices, etc) • 4. Visualization (word cloud, maps) • 5. Numeric prediction (stock market prediction based on news text)
  • 5. Features of Text Data • High dimensionality • Large number of features • Multiple ways to represent the same concept • Highly redundant data • Unstructured data • Easy for humans, hard for machine • Abstract ideas hard to represent • Huge amount of data to be processed – Automation is required
  • 6. Acquiring Texts • Existing digital corpora: e.g. XML (high quality text and metadata) – http://www.hathitrust.org/htrc • Other digital sources (e.g. Web, Twitter, Amazon consumer reviews) – Through API: e.g. tweets – Websites without APIs can be “scraped” – Generally requires custom programming (Perl, Python, etc) or software tools (e.g. Web extractor pro) • Undigitized text – Scanned and subjected to Optical Character Recognition (OCR) – Time and labor intensive – Error-prone
  • 7. Word-level Analysis: Vector Space Model • Documents are treated as a “bag” of words or terms • Any document can be represented as a vector: a list of terms and their associated weights – D = {(t1, w1), (t2, w2), …, (tn, wn)} – ti: i-th term – wi: weight for the i-th term • The weight measures the importance of a term in terms of its information content
  • 8. Vector Space Model: Bag of Words Representation • Each document: Sparse high-dimensional vector!
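The bag-of-words representation on slides 7 and 8 can be illustrated with a minimal Python sketch. The two sample sentences, the tokenizer regex, and the function names are assumptions made for illustration; the weights here are raw term counts, which is only one possible choice of wi.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a term -> count mapping; word order is discarded."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


def to_vector(bag, vocabulary):
    """Lay a bag out as a list of weights, one slot per vocabulary term."""
    return [bag.get(term, 0) for term in vocabulary]


# Hypothetical documents, not taken from the slides.
docs = {
    "d1": "The cat chased the dog.",
    "d2": "The dog chased the mouse and the fish.",
}

bags = {doc_id: bag_of_words(text) for doc_id, text in docs.items()}
vocabulary = sorted(set().union(*bags.values()))
vectors = {doc_id: to_vector(bag, vocabulary) for doc_id, bag in bags.items()}

print(vocabulary)
for doc_id, vector in vectors.items():
    print(doc_id, vector)
```

With a realistic vocabulary of tens of thousands of terms, most slots in each vector are zero, which is why slide 8 calls the representation sparse and high-dimensional.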
  • 10. TF-IDF: Example • TF: Consider a document containing 100 words wherein the word cow appears 3 times. Following the previously defined formulas, what is the term frequency (TF) for cow? – TF(cow,d1) = 3. • IDF: Now assume we have 10 million documents and cow appears in one thousand of these. What is the inverse document frequency of the term, cow? – IDF(cow) = log10(10,000,000/1,000) = log10(10,000) = 4 • TF-IDF score? – TF-IDF = 3 × 4 = 12 (Product of TF and IDF)
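The arithmetic on slide 10 can be checked with a short Python sketch. It assumes, as the slide's numbers imply, that TF is the raw count of the term in the document and that the logarithm is base 10; the function and parameter names are illustrative.

```python
import math


def tf_idf(term_count, n_documents, n_documents_with_term):
    """Return (TF, IDF, TF-IDF) using raw-count TF and a base-10 IDF."""
    tf = term_count
    idf = math.log10(n_documents / n_documents_with_term)
    return tf, idf, tf * idf


# Slide 10: "cow" appears 3 times; 1,000 of 10,000,000 documents contain it.
tf, idf, score = tf_idf(3, 10_000_000, 1_000)
print(tf, idf, score)
```

A common alternative, not the one used on the slide, divides the count by the document length (3/100 = 0.03), which would scale the final score by the same factor.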
  • 11. Application 1: Document Search with Query
      Document ID   Cat     Dog     Mouse   Fish    Horse   Cow     Matching Scores
      d1            0.397   0.397   0.000   0.475   0.000   0.000   1.268
      d2            0.352   0.301   0.680   0.000   0.000   0.000   0.653
      d3            0.301   0.363   0.000   0.000   0.669   0.741   0.664
      d4            0.376   0.352   0.636   0.558   0.000   0.000   1.286
      d5            0.301   0.301   0.000   0.426   0.544   0.544   1.028
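The slide does not state the query, but the matching scores on slide 11 equal the sum of each document's Cat, Dog, and Fish weights to within rounding. The sketch below therefore assumes a query of {cat, dog, fish} and a score defined as the sum of the matching weights; both are inferences from the numbers, not something the slide confirms.

```python
# Term weights per document, read from the slide 11 table in column order
# Cat, Dog, Mouse, Fish, Horse, Cow.
TERMS = ["cat", "dog", "mouse", "fish", "horse", "cow"]
ROWS = {
    "d1": [0.397, 0.397, 0.000, 0.475, 0.000, 0.000],
    "d2": [0.352, 0.301, 0.680, 0.000, 0.000, 0.000],
    "d3": [0.301, 0.363, 0.000, 0.000, 0.669, 0.741],
    "d4": [0.376, 0.352, 0.636, 0.558, 0.000, 0.000],
    "d5": [0.301, 0.301, 0.000, 0.426, 0.544, 0.544],
}
weights = {doc_id: dict(zip(TERMS, row)) for doc_id, row in ROWS.items()}


def matching_score(doc_weights, query_terms):
    """Sum the document's weights over the query terms (assumed scoring rule)."""
    return sum(doc_weights.get(term, 0.0) for term in query_terms)


query = ["cat", "dog", "fish"]  # assumed query, inferred from the scores
scores = {doc_id: matching_score(w, query) for doc_id, w in weights.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for doc_id in ranking:
    print(doc_id, round(scores[doc_id], 3))
```

Under these assumptions d4 ranks first and d2 last, matching the order implied by the Matching Scores column.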
  • 12. Application 2: Word Frequencies – Zipf’s Law • Idea: We use a few words very often, and most words very rarely, because it’s more effort to use a rare word. • Zipf’s Law: Product of frequency of word and its rank is [reasonably] constant • Empirically demonstrable; holds up over different languages
  • 13. Application 2: Word Frequencies – Zipf’s Law
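Zipf's Law from slide 12 can be explored by ranking words by frequency and multiplying each frequency by its rank. The sketch below is illustrative only: the sample sentence and function name are assumptions, and a text this small will not show the near-constant product that appears in large corpora.

```python
import re
from collections import Counter


def rank_frequency_table(text, top_n=10):
    """Return (rank, word, frequency, rank * frequency) for the top words."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(top_n), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows


# Hypothetical toy text; replace it with a large corpus to test the law.
sample = "the cat and the dog and the fish saw the cow"
table = rank_frequency_table(sample)

for rank, word, freq, product in table:
    print(rank, word, freq, product)
```

If the law holds for a corpus, the last column stays roughly constant across ranks.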
  • 14. Application 3: Word Cloud - Budweiser Example http://people.duke.edu/~el113/Visualizations.html
  • 15. Problems with Word-level Analysis: Sentiment • Sentiment is often expressed subtly, so it is hard to identify from the terms of a sentence or document when each term is considered in isolation – A positive or negative sentiment word may have opposite orientations in different application domains. (“This camera sucks.” -> negative; “This vacuum cleaner really sucks.” -> positive) – A sentence containing sentiment words may not express any sentiment. (e.g. “Can you tell me which Sony camera is good?”) – Sarcastic sentences with or without sentiment words are hard to deal with. (e.g. “What a great car! It stopped working in two days.”) – Many sentences without sentiment words can also imply opinions. (e.g. “This washer uses a lot of water.” -> negative) • We have to consider the overall context (semantics of each sentence or document)
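The failure modes on slide 15 can be reproduced with a deliberately naive word-level scorer. The sketch below assumes a tiny hypothetical lexicon (the word lists are not from the slides) and simply counts positive and negative words, which is exactly the kind of in-isolation analysis the slide warns about.

```python
import re

# Hypothetical sentiment lexicon, chosen only to cover the slide's examples.
POSITIVE = {"good", "great"}
NEGATIVE = {"sucks"}


def word_level_sentiment(sentence):
    """Label a sentence by counting lexicon words, ignoring all context."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


examples = [
    "This camera sucks.",
    "This vacuum cleaner really sucks.",
    "Can you tell me which Sony camera is good?",
    "What a great car! It stopped working in two days.",
    "This washer uses a lot of water.",
]
labels = [word_level_sentiment(s) for s in examples]

for sentence, label in zip(examples, labels):
    print(label, "<-", sentence)
```

Only the first sentence gets the label the slide expects: the vacuum cleaner review is scored negative, the neutral question positive, the sarcastic review positive, and the washer complaint neutral.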
  • 16. Natural Language Processing (NLP) to the Rescue! • NLP is a field of computer science, artificial intelligence, and linguistics, concerned with the interactions between computers and human (natural) languages. • Key idea: Use statistical “machine learning” to automatically learn the language from data! • Major tasks in NLP – Automatic summarization – Part-of-speech tagging (POS tagging) – Relationship extraction – Sentiment analysis – Topic segmentation and recognition – Machine translation
  • 17. Demonstration: POS Tagging – 1/2 • http://cogcomp.cs.illinois.edu/demo/pos/results.php
  • 19. Demonstration: Sentence-level Sentiment – 1/3 • Stanford Sentiment Analyzer – http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
  • 20. Demonstration: Sentence-level Sentiment – 2/3 • Review 1: This movie doesn’t care about cleverness, wit or any other kind of intelligent humor. -> Negative
  • 21. Demonstration: Sentence-level Sentiment – 3/3 • There are slow and repetitive parts, but it has just enough spice to keep it interesting. -> Positive
  • 22. Text Mining Demonstration in R: Mining Twitter Data
  • 23. Twitter Mining in R – 1/2 • Step 0) Install “R” and Packages – R program: http://www.r-project.org/ – Package: http://cran.r-project.org/web/packages/tm/index.html – Package: http://cran.r-project.org/web/packages/twitteR/index.html – Package: http://cran.r-project.org/web/packages/wordcloud/index.html – Manual: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf • Step 1) Retrieving Text from Twitter: Twitter API (Using twitteR)
  • 24. Twitter Mining in R – 2/2 • Step 2) Transforming Text • Step 3) Stemming Words • Step 4) Build a Term-Document Matrix • Step 5) Frequent Terms and Associations • Step 6) Word Cloud
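Steps 2 to 5 on slide 24 can be sketched without any Twitter access. The example below is written in Python rather than R and does not reproduce the tm, twitteR, or wordcloud calls used in the demonstration; the sample tweets, the stopword list, and the suffix-stripping stemmer are all hypothetical stand-ins (a real pipeline would use a proper stemmer such as Porter's).

```python
import re
from collections import Counter

# Hypothetical stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "for", "with", "on"}


def transform(text):
    """Step 2: lowercase, drop URLs and non-letters, remove stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return [t for t in text.split() if t not in STOPWORDS]


def crude_stem(token):
    """Step 3: strip a few common suffixes (a crude stand-in for a stemmer)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def term_document_matrix(docs):
    """Step 4: rows are terms, columns are documents, cells are counts."""
    counts = [Counter(crude_stem(t) for t in transform(d)) for d in docs]
    terms = sorted(set().union(*counts))
    return terms, [[c[t] for c in counts] for t in terms]


def frequent_terms(terms, matrix, min_freq):
    """Step 5: keep terms whose total count reaches min_freq."""
    totals = {t: sum(row) for t, row in zip(terms, matrix)}
    return {t: n for t, n in totals.items() if n >= min_freq}


# Hypothetical tweets standing in for text retrieved in Step 1.
tweets = [
    "Mining tweets with R is fun! http://example.com",
    "Text mining finds patterns in tweets",
    "R packages for text mining and word clouds",
]
terms, matrix = term_document_matrix(tweets)
frequent = frequent_terms(terms, matrix, min_freq=2)
print(frequent)
```

The resulting term totals are the kind of input a word cloud (Step 6) is drawn from, with font size scaled by frequency. Note that crude stemming produces non-words such as "min" for "mining", which is normal for stems.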
  • 25. Software for Text Mining • A number of academic and commercial software packages are available: – 1. Open source packages in R – e.g. tm • R program: http://www.r-project.org/ • Package: http://cran.r-project.org/web/packages/tm/index.html • Manual: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf – 2. Stanford CoreNLP • http://nlp.stanford.edu/software/corenlp.shtml – 3. SAS TextMiner – 4. IBM SPSS – 5. BoosTexter – 6. StatSoft – 7. AeroText • Text Data is everywhere – you can mine it to gain insights!