Text Mining Notes

Chapter 20 of GBUS515 discusses text mining, focusing on the transition from structured quantitative data to unstructured text data. It outlines various applications of text mining, such as fraud detection in insurance claims and medical triage, and explains key processes like tokenization and preprocessing to convert text into a structured format suitable for predictive modeling. The chapter also introduces concepts like Latent Semantic Indexing (LSI) to reduce vocabulary and extract meaningful patterns from text data.


GBUS515 – Business Intelligence and Information Systems

Chapter 20 - Text Mining

Instructor – Dr. Sunita Goel


Adapted from Shmueli, Bruce & Patel, Data Mining for Business Analytics, 3e

© Galit Shmueli and Peter Bruce 2010


Text data
— Up to now we have been dealing with structured quantitative data
— Numerical
— Binary (yes/no)
— Multicategory
— Now we turn to unstructured text
Applications of Text Mining
— Insurance fraud – notes in claim forms can be mined and transformed into predictor variables for a predictive model
— The model is trained on prior claims in two classes – found to be fraudulent, and not found to be fraudulent
— The model is then applied to new claims
Applications, cont.
— Maintenance or support tickets often contain text fields
— These fields could be mined to classify a ticket in several ways:
— How urgent is it?
— How much time will it take to fix?
— What category of technician is needed to fix it?
Applications, cont.
— Medical triage/diagnosis
— Clinics could use patient online appointment request forms to route requests to:
— Admin asst.
— Nurse
— Doctor
What exactly is text mining?
— Classify (label) thousands of documents
— Extension of predictive modeling (our focus)
— Extract meaning from a single document – interpreting it like a human reads language
— “Natural language processing” (very ambitious, not predictive modeling, not our focus)
Classification (labeling) and clustering
— No attempt to extract the overall meaning of a single document
— Focus is on assigning a label or class to numerous documents
— As with numerical data mining, the goal is to do better than guessing
“Bag-of-words”
— Grammar, syntax, punctuation, and word order are ignored
— The document is considered as a “bag of words”
— This approach is, nonetheless, effective when the goal is to decide which category or cluster a document falls into
— A typical application is supervised learning
— Requires lots of documents (a corpus)*
— Does not need 100% accuracy

*“Corpus” often refers to a fixed standard set of documents that many researchers can use to develop and tune text mining algorithms.
The spreadsheet model of text
— Columns are terms
— Rows are documents
— Cells indicate presence/absence (or frequency) of terms in documents
— Consider the two sentences:
— S1 First we consider the spreadsheet model
— S2 Then we consider another model
Here is the resulting spreadsheet, using presence/absence:

      first  we  consider  the  spreadsheet  model  then  another
S1      1     1      1      1        1          1      0      0
S2      0     1      1      0        0          1      1      1
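The spreadsheet above can be reproduced in a few lines of code. A minimal sketch in pure Python (the sentence strings simply restate S1 and S2):

```python
# Build a presence/absence term-document matrix for the two example sentences.
docs = {
    "S1": "first we consider the spreadsheet model",
    "S2": "then we consider another model",
}

# Vocabulary: every distinct term across the corpus, in first-seen order.
vocab = []
for text in docs.values():
    for term in text.split():
        if term not in vocab:
            vocab.append(term)

# One row per document, one column per term; 1 = term present, 0 = absent.
matrix = {
    doc: [1 if term in text.split() else 0 for term in vocab]
    for doc, text in docs.items()
}

print(vocab)
print(matrix["S1"])
print(matrix["S2"])
```

Each row of `matrix` is one spreadsheet row; real implementations differ mainly in scale, not in the idea.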
Need to turn text into a matrix
— For the two documents (sentences S1 and S2) that we looked at earlier, the process of producing a matrix is simple. We had
— Words
— Spaces
— Periods
— Each word is preceded or followed by a space or period – a delimiter
— Real text is more complicated
Lots of things to process besides words…
— Numbers (including dates, percents, monetary amounts), e.g. from Google Annual Report 2014:

We considered the historical trends in currency exchange rates and determined that it was reasonably possible that
changes in exchange rates of 20% could be experienced in the near term. If the U.S. dollar weakened by 20% at
December 31, 2013 and 2014, the amount recorded in AOCI related to our foreign exchange options before tax
effect would have been approximately $4 million and $686 million lower at December 31, 2013 and December
31, 2014, and the total amount of expense recorded as interest and other income, net, would have been
approximately $123 million and $90 million higher in the years ended December 31, 2013 and December 31,
2014. If the U.S. dollar strengthened by 20% at December 31, 2013 and December 31, 2014, the amount
recorded in accumulated AOCI related to our foreign exchange options before tax effect would have been
approximately $1.7 billion and $2.5 billion higher at December 31, 2013 and December 31, 2014, and the total
amount of expense recorded as interest and other income, net, would have been approximately $120 million and
$164 million higher in the years ended December 31, 2013 and December 31, 2014.
Email addresses, URLs, stray characters introduced by file conversions, …

Sender: Distribution list for statistical items of interest <[email protected]>


From: "Massimini, Vince" <[email protected]> Subject: Comparing the Maximal Procedure to Permuted
Blocks Randomization To: <[email protected]> Precedence: list List-Help:
<mailto:[email protected]?body=INFO%20WSS-ELECTRONIC-MAIL-LIST>
For more WSS events, see washstat.org WSS Public Health/Biostatistics Section and NCI Division of
Cancer Preventi= on Jointly Sponsored Event: =20
SPEAKER: Vance W. Berger, PhD National Cancer Institute and University of = Maryland Baltimore County
and Klejda Bejleri, BS Biometry and Statistics, D= epartment of Biological Statistics and Computational
Biology, Cornell Unive= rsity, Ithaca, NY 14853 =20 TITLE: Comparing the Maximal Procedure to
Permuted Blocks Randomization
TIME AND PLACE: Monday, June 8th NCI Shady Grove, 9609 Medical Center Drive= , Rockville MD Room
5E30/32. Bring photo ID, allow time to get through secu= rity
Proper nouns & terms specific to a particular field

From TechSmith corporate information: All-In-One Capture, Camtasia, Camtasia Studio, Camtasia Relay, Coach's Eye, DubIt, EnSharpen, Enterprise Wide, Expressshow, Jing, Morae, Rich Recording Technology (RRT), Snagit, Screencast.com, ScreenChomp, Show The World, SmartFocus, TechSmith, TechSmith and T Design logo, TechSmith Fuse, TechSmith Relay, TSCC, and UserVue are marks or registered marks of TechSmith Corporation.

From medical journal: Eight hundred elderly women and men from the population-based
Framingham Osteoporosis Study had BMD assessed in 1988-1989 and again in 1992-
1993. BMD was measured at femoral neck, trochanter, Ward's area, radial shaft,
ultradistal radius, and lumbar spine using Lunar densitometers. (Risk Factors for Longitudinal Bone
Loss in Elderly Men and Women: The Framingham Osteoporosis Study, Journal of Bone and Mineral Research Volume
15, Issue 4, pages 710–720, April 2000)
Tokenization
— We need to move from a mass of text to useful predictor information
— The first step is to separate out and identify individual terms
— The process by which you identify delimiters and use them to separate terms is called tokenization. The resulting terms are also called tokens.
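A toy tokenizer makes the idea concrete. This sketch treats any run of non-alphanumeric characters (spaces, periods, commas, other punctuation) as a delimiter; production tokenizers handle many more cases:

```python
import re

def tokenize(text):
    """Split text into tokens, treating non-word characters as delimiters."""
    # \W+ matches runs of non-word characters; lowercase for consistency.
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]

print(tokenize("First we consider the spreadsheet model."))
```

The trailing period disappears because it is a delimiter, not a token.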
Preprocessing
— Goal – reduction of text (also called vocabulary reduction) without losing meaning or predictive power
— Stemming
— Reducing multiple variants of a word to a common core
— Travel, traveling, traveled, etc. -> travel
— Ignore case
— Frequency filters can eliminate terms that
— Appear in nearly all documents
— Appear in hardly any documents
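The stemming step can be sketched with a deliberately crude suffix-stripper; real systems use algorithms such as Porter stemming, but the idea is the same (the suffix list here is a toy assumption):

```python
# A toy stemmer: strip a few common suffixes, ignoring case.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    word = word.lower()  # ignore case
    for suffix in SUFFIXES:
        # Only strip if a reasonable stem (3+ characters) remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["Travel", "traveling", "traveled", "travels"]])
```

All four variants collapse to the common core "travel", shrinking the vocabulary by three terms.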
Preprocessing, cont.
— All punctuation characters are treated as delimiters
— Punctuation characters are sometimes also considered terms of length 1, but not in XLMiner
— Default in XLMiner is to drop all terms <=2 characters long
— Default is also to drop all terms that are on a stoplist
— XLMiner comes with a default stoplist
— User can edit it
Preprocessing, cont.
— Frequency vs. presence/absence
— Normalization, when the presence of a type of term might be important but we don’t need the specific term. For example:
— Replace [email protected] with “email token”
— Replace www.domain.com with “url token”
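This kind of normalization is a simple substitution pass. A minimal sketch with simplified, illustrative regex patterns (real email and URL grammars are far more complex; the address and domain below are invented examples):

```python
import re

# Replace any email address or URL with a generic token, so the model sees
# "an email was present" rather than the specific address.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
URL = re.compile(r"\bwww\.[\w.-]+\b|\bhttps?://\S+\b")

def normalize(text):
    text = EMAIL.sub("emailtoken", text)
    text = URL.sub("urltoken", text)
    return text

print(normalize("Contact help@example.com or visit www.example.com today"))
```

The output keeps the surrounding words but collapses the specifics into two reusable tokens.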
Post-reduction matrix
— Rows are still records, columns are the reduced set of terms
— Options for cell entries:
— 0/1 (presence/absence)
— Frequency count
— TF-IDF (term frequency – inverse document frequency)
— TF = frequency of the term in the document
— IDF = log of the inverse of the proportion of documents containing that term
— There are varying definitions of both TF and IDF, hence of TF-IDF
— Bottom line:
— TF-IDF is high where a rare term is present or frequent in a document
— TF-IDF is near zero where a term is absent from a document, or abundant across all documents
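One common definition, sketched in code (definitions vary, as noted above; this uses raw counts for TF and log(N/df) for IDF, over the two example sentences from earlier):

```python
import math

# Corpus: the two example sentences, already tokenized.
docs = [
    "first we consider the spreadsheet model".split(),
    "then we consider another model".split(),
]
N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term)                       # term frequency in this document
    df = sum(1 for d in docs if term in d)     # number of documents with the term
    return tf * math.log(N / df) if df else 0.0

# "spreadsheet" is in only one of two documents: idf = log(2) > 0
print(round(tf_idf("spreadsheet", docs[0]), 3))
# "model" is in every document: idf = log(1) = 0, so TF-IDF is 0
print(round(tf_idf("model", docs[0]), 3))
```

This shows the bottom line numerically: the rare term scores high, the ubiquitous term scores zero.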
From terms to concepts – Latent Semantic Indexing (LSI)
— The post-reduction term/document matrix is often still huge – too big for easy processing
— Recall how, with principal components, we derived a small set of synthetic predictor variables, each of which was a linear combination of “like-minded” original variables.
— Latent semantic indexing does something similar for text – it maps multiple terms to a small set of concepts.
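As a rough sketch of the mechanics (not XLMiner's exact implementation), LSI can be computed as a truncated singular value decomposition of the term-document matrix; the counts below are invented for illustration:

```python
import numpy as np

# Rows = documents, columns = terms (battery, headlights, pads, squeaky);
# invented counts for two "repair issue" document groups.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],   # alternator-problem document
    [1.0, 2.0, 0.0, 0.0],   # alternator-problem document
    [0.0, 0.0, 3.0, 1.0],   # brake-problem document
    [0.0, 0.0, 1.0, 3.0],   # brake-problem document
])

k = 2  # number of concepts to keep
U, s, Vt = np.linalg.svd(X, full_matrices=False)
concepts = U[:, :k] * s[:k]   # each document scored on k concepts

print(concepts.shape)          # 4 documents, 2 concept scores each
```

Four term columns collapse into two concept columns; documents about the same repair issue end up with nearly identical concept scores.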
Intuitive explanation - LSI
For example: if we inspected our document collection, we might find that each
time the term “alternator” appeared in an automobile document, the
document also included the terms “battery” and “headlights.” Or each time
the term “brake” appeared in an automobile document, the terms “pads”
and “squeaky” also appeared. However, there is no detectable pattern
regarding the use of the terms “alternator” and “brake” together.
Documents including “alternator” might or might not include “brake” and
documents including “brake” might or might not include “alternator.” Our
four terms, battery, headlights, pads, and squeaky describe two different
automobile repair issues: failing brakes and a bad alternator.

Analytic Solver Platform, XLMiner Platform, Data Mining User Guide, 2014, Frontline Systems, p. 245
Extracting Meaning?
— It may be possible to use the concepts to identify themes in the document corpus, and clusters of documents sharing those themes.
— Often, however, the concepts do not map in obvious fashion to meaningful themes.
— Their key contribution is simply reducing the vocabulary – instead of a matrix with thousands of columns, we can deal with just a dozen or two.
The concept-document matrix

A small portion of the document-concept matrix. The concepts may or may not carry meaning that we can figure out; in any case they are mathematical constructs from hundreds, perhaps thousands, of terms, and can be used as the variables in a predictive model.
A predictive model
— Now we have a clean, structured dataset similar to what we have used in our numerical data mining:
— Class identifications (labels) for training
— Numerical predictors
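At this point any standard classifier applies. A minimal sketch using a nearest-centroid rule on invented two-concept scores (the labels and values are made up purely for illustration, loosely echoing the fraud example):

```python
import numpy as np

# Training set: documents reduced to 2 concept scores, with known labels.
X_train = np.array([[2.1, 0.1], [1.9, 0.2],   # class 1 ("fraudulent")
                    [0.2, 2.0], [0.1, 1.8]])  # class 0 ("not fraudulent")
y_train = np.array([1, 1, 0, 0])

# One centroid (mean concept vector) per class.
centroids = {c: X_train[y_train == c].mean(axis=0) for c in (0, 1)}

def predict(x):
    # Assign the class whose centroid is closest in concept space.
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

print(predict(np.array([2.0, 0.0])))   # near the class-1 centroid
```

A new claim's concept scores are compared to each class centroid, mirroring how the trained fraud model would be applied to new claims.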
Chapter Exercises
(Updated in Canvas)
