Text Mining Notes

Chapter 20 of GBUS515 discusses text mining, focusing on the transition from structured quantitative data to unstructured text data. It outlines various applications of text mining, such as fraud detection in insurance claims and medical triage, and explains key processes like tokenization and preprocessing to convert text into a structured format suitable for predictive modeling. The chapter also introduces concepts like Latent Semantic Indexing (LSI) to reduce vocabulary and extract meaningful patterns from text data.


GBUS515 – Business Intelligence and Information Systems

Chapter 20 - Text Mining

Instructor – Dr. Sunita Goel


Adapted from Shmueli, Bruce & Patel, Data Mining for Business Analytics, 3e

© Galit Shmueli and Peter Bruce 2010


Text data
— Up to now we have been dealing with structured quantitative data
— Numerical
— Binary (yes/no)
— Multicategory
— Now we turn to unstructured text
Applications of Text Mining
— Insurance fraud – notes in claim forms can be mined and transformed into predictor variables for a predictive model
— The model is trained on prior claims in two classes – found to be fraudulent, and not found to be fraudulent
— The model is then applied to new claims
Applications, cont.
— Maintenance or support tickets often contain text fields
— These fields could be mined to classify a ticket in several ways:
— How urgent is it?
— How much time will it take to fix?
— What category of technician is needed to fix it?
Applications, cont.
— Medical triage/diagnosis
— Clinics could use patient online appointment request forms to route requests to:
— Admin asst.
— Nurse
— Doctor
What exactly is text mining?
— Classify (label) thousands of documents
— Extension of predictive modeling (our focus)
— Extract meaning from a single document – interpreting it like a human reads language
— “Natural language processing” (very ambitious, not predictive modeling, not our focus)
Classification (labeling) and clustering
— No attempt to extract the overall meaning of a single document
— Focus is on assigning a label or class to numerous documents
— As with numerical data mining, the goal is to do better than guessing
“Bag-of-words”
— Grammar, syntax, punctuation, and word order are ignored
— The document is considered as a “bag of words”
— This approach is, nonetheless, effective when the goal is to decide which category or cluster a document falls into
— A typical application is supervised learning
— Requires lots of documents (a corpus)*
— Does not need 100% accuracy

*“Corpus” often refers to a fixed standard set of documents that many researchers can use to develop and tune text mining algorithms.
The spreadsheet model of text
— Columns are terms
— Rows are documents
— Cells indicate presence/absence (or frequency) of terms in documents
— Consider the two sentences:
— S1 First we consider the spreadsheet model
— S2 Then we consider another model
Here is the resulting spreadsheet, using presence/absence:

      first  we  consider  the  spreadsheet  model  then  another
S1      1     1      1      1        1          1      0      0
S2      0     1      1      0        0          1      1      1
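The spreadsheet above can be reproduced in a few lines of code. A minimal sketch in pure Python (the sentence strings simply restate S1 and S2):

```python
# Build a presence/absence term-document matrix for the two example sentences.
docs = {
    "S1": "first we consider the spreadsheet model",
    "S2": "then we consider another model",
}

# Vocabulary: every distinct term across the corpus, in first-seen order.
vocab = []
for text in docs.values():
    for term in text.split():
        if term not in vocab:
            vocab.append(term)

# One row per document, one column per term; 1 = term present, 0 = absent.
matrix = {
    doc: [1 if term in text.split() else 0 for term in vocab]
    for doc, text in docs.items()
}

print(vocab)
print(matrix["S1"])
print(matrix["S2"])
```

Each row of `matrix` is one spreadsheet row; real implementations differ mainly in scale, not in the idea.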
Need to turn text into a matrix
— For the two documents (sentences S1 and S2) that we looked at earlier, the process of producing a matrix is simple. We had
— Words
— Spaces
— Periods
— Each word is preceded or followed by a space or period – a delimiter
— Real text is more complicated
Lots of things to process besides words…
— Numbers (including dates, percents, monetary amounts), e.g. from Google Annual Report 2014:

We considered the historical trends in currency exchange rates and determined that it was reasonably possible that
changes in exchange rates of 20% could be experienced in the near term. If the U.S. dollar weakened by 20% at
December 31, 2013 and 2014, the amount recorded in AOCI related to our foreign exchange options before tax
effect would have been approximately $4 million and $686 million lower at December 31, 2013 and December
31, 2014, and the total amount of expense recorded as interest and other income, net, would have been
approximately $123 million and $90 million higher in the years ended December 31, 2013 and December 31,
2014. If the U.S. dollar strengthened by 20% at December 31, 2013 and December 31, 2014, the amount
recorded in accumulated AOCI related to our foreign exchange options before tax effect would have been
approximately $1.7 billion and $2.5 billion higher at December 31, 2013 and December 31, 2014, and the total
amount of expense recorded as interest and other income, net, would have been approximately $120 million and
$164 million higher in the years ended December 31, 2013 and December 31, 2014.
Email addresses, URLs, stray characters introduced by file conversions, …

Sender: Distribution list for statistical items of interest <[email protected]>


From: "Massimini, Vince" <[email protected]> Subject: Comparing the Maximal Procedure to Permuted
Blocks Randomization To: <[email protected]> Precedence: list List-Help:
<mailto:[email protected]?body=INFO%20WSS-ELECTRONIC-MAIL-LIST>
For more WSS events, see washstat.org WSS Public Health/Biostatistics Section and NCI Division of
Cancer Preventi= on Jointly Sponsored Event: =20
SPEAKER: Vance W. Berger, PhD National Cancer Institute and University of = Maryland Baltimore County
and Klejda Bejleri, BS Biometry and Statistics, D= epartment of Biological Statistics and Computational
Biology, Cornell Unive= rsity, Ithaca, NY 14853 =20 TITLE: Comparing the Maximal Procedure to
Permuted Blocks Randomization
TIME AND PLACE: Monday, June 8th NCI Shady Grove, 9609 Medical Center Drive= , Rockville MD Room
5E30/32. Bring photo ID, allow time to get through secu= rity
Proper nouns & terms specific to a particular field

From TechSmith corporate information: All-In-One Capture, Camtasia, Camtasia Studio, Camtasia Relay, Coach's Eye, DubIt, EnSharpen, Enterprise Wide, Expressshow, Jing, Morae, Rich Recording Technology (RRT), Snagit, Screencast.com, ScreenChomp, Show The World, SmartFocus, TechSmith, TechSmith and T Design logo, TechSmith Fuse, TechSmith Relay, TSCC, and UserVue are marks or registered marks of TechSmith Corporation.

From medical journal: Eight hundred elderly women and men from the population-based
Framingham Osteoporosis Study had BMD assessed in 1988-1989 and again in 1992-
1993. BMD was measured at femoral neck, trochanter, Ward's area, radial shaft,
ultradistal radius, and lumbar spine using Lunar densitometers. (Risk Factors for Longitudinal Bone
Loss in Elderly Men and Women: The Framingham Osteoporosis Study, Journal of Bone and Mineral Research Volume
15, Issue 4, pages 710–720, April 2000)
Tokenization
— We need to move from a mass of text to useful predictor information
— The first step is to separate out and identify individual terms
— The process by which you identify delimiters and use them to separate terms is called tokenization. The resulting terms are also called tokens.
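A toy tokenizer makes the idea concrete. This sketch treats any run of non-alphanumeric characters (spaces, periods, commas, other punctuation) as a delimiter; production tokenizers handle many more cases:

```python
import re

def tokenize(text):
    """Split text into tokens, treating non-word characters as delimiters."""
    # \W+ matches runs of non-word characters; lowercase for consistency.
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]

print(tokenize("First we consider the spreadsheet model."))
```

The trailing period disappears because it is a delimiter, not a token.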
Preprocessing
— Goal – reduction of text (also called vocabulary reduction) without losing meaning or predictive power
— Stemming
— Reducing multiple variants of a word to a common core
— Travel, traveling, traveled, etc. -> travel
— Ignore case
— Frequency filters can eliminate terms that
— Appear in nearly all documents
— Appear in hardly any documents
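The stemming step can be sketched with a deliberately crude suffix-stripper; real systems use algorithms such as Porter stemming, but the idea is the same (the suffix list here is a toy assumption):

```python
# A toy stemmer: strip a few common suffixes, ignoring case.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    word = word.lower()  # ignore case
    for suffix in SUFFIXES:
        # Only strip if a reasonable stem (3+ characters) remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["Travel", "traveling", "traveled", "travels"]])
```

All four variants collapse to the common core "travel", shrinking the vocabulary by three terms.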
Preprocessing, cont.
— All punctuation characters are treated as delimiters
— Punctuation characters are sometimes also considered terms of length 1, but not in XLMiner
— Default in XLMiner is to drop all terms <=2 characters long
— Default is also to drop all terms that are on a stoplist
— XLMiner comes with a default stoplist
— User can edit it
Preprocessing, cont.
— Frequency vs. presence/absence
— Normalization, when the presence of a type of term might be important but we don’t need the specific term. For example:
— Replace [email protected] with “email token”
— Replace www.domain.com with “url token”
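This kind of normalization is a simple substitution pass. A minimal sketch with simplified, illustrative regex patterns (real email and URL grammars are far more complex; the address and domain below are invented examples):

```python
import re

# Replace any email address or URL with a generic token, so the model sees
# "an email was present" rather than the specific address.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
URL = re.compile(r"\bwww\.[\w.-]+\b|\bhttps?://\S+\b")

def normalize(text):
    text = EMAIL.sub("emailtoken", text)
    text = URL.sub("urltoken", text)
    return text

print(normalize("Contact help@example.com or visit www.example.com today"))
```

The output keeps the surrounding words but collapses the specifics into two reusable tokens.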
Post-reduction matrix
— Rows are still records, columns are the reduced set of terms
— Options for cell entries:
— 0/1 (presence/absence)
— Frequency count
— TF-IDF (term frequency – inverse document frequency)
— TF = frequency of the term in the document
— IDF = log of the inverse of the proportion of documents containing that term
— There are varying definitions of both TF and IDF, hence of TF-IDF
— Bottom line:
— TF-IDF is high where a rare term is present or frequent in a document
— TF-IDF is near zero where a term is absent from a document, or abundant across all documents
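One common definition, sketched in code (definitions vary, as noted above; this uses raw counts for TF and log(N/df) for IDF, over the two example sentences from earlier):

```python
import math

# Corpus: the two example sentences, already tokenized.
docs = [
    "first we consider the spreadsheet model".split(),
    "then we consider another model".split(),
]
N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term)                       # term frequency in this document
    df = sum(1 for d in docs if term in d)     # number of documents with the term
    return tf * math.log(N / df) if df else 0.0

# "spreadsheet" is in only one of two documents: idf = log(2) > 0
print(round(tf_idf("spreadsheet", docs[0]), 3))
# "model" is in every document: idf = log(1) = 0, so TF-IDF is 0
print(round(tf_idf("model", docs[0]), 3))
```

This shows the bottom line numerically: the rare term scores high, the ubiquitous term scores zero.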
From terms to concepts – Latent Semantic Indexing (LSI)
— The post-reduction term/document matrix is often still huge – too big for easy processing
— Recall how, with principal components, we derived a small set of synthetic predictor variables, each of which was a linear combination of “like-minded” original variables.
— Latent semantic indexing does something similar for text – it maps multiple terms to a small set of concepts.
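As a rough sketch of the mechanics (not XLMiner's exact implementation), LSI can be computed as a truncated singular value decomposition of the term-document matrix; the counts below are invented for illustration:

```python
import numpy as np

# Rows = documents, columns = terms (battery, headlights, pads, squeaky);
# invented counts for two "repair issue" document groups.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],   # alternator-problem document
    [1.0, 2.0, 0.0, 0.0],   # alternator-problem document
    [0.0, 0.0, 3.0, 1.0],   # brake-problem document
    [0.0, 0.0, 1.0, 3.0],   # brake-problem document
])

k = 2  # number of concepts to keep
U, s, Vt = np.linalg.svd(X, full_matrices=False)
concepts = U[:, :k] * s[:k]   # each document scored on k concepts

print(concepts.shape)          # 4 documents, 2 concept scores each
```

Four term columns collapse into two concept columns; documents about the same repair issue end up with nearly identical concept scores.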
Intuitive explanation - LSI
For example: if we inspected our document collection, we might find that each
time the term “alternator” appeared in an automobile document, the
document also included the terms “battery” and “headlights.” Or each time
the term “brake” appeared in an automobile document, the terms “pads”
and “squeaky” also appeared. However, there is no detectable pattern
regarding the use of the terms “alternator” and “brake” together.
Documents including “alternator” might or might not include “brake” and
documents including “brake” might or might not include “alternator.” Our
four terms, battery, headlights, pads, and squeaky describe two different
automobile repair issues: failing brakes and a bad alternator.

Analytic Solver Platform, XLMiner Platform, Data Mining User Guide, 2014, Frontline Systems, p. 245
Extracting Meaning?
— It may be possible to use the concepts to identify themes in the document corpus, and clusters of documents sharing those themes.
— Often, however, the concepts do not map in obvious fashion to meaningful themes.
— Their key contribution is simply reducing the vocabulary – instead of a matrix with thousands of columns, we can deal with just a dozen or two.
The concept-document matrix

A small portion of the document-concept matrix. The concepts may or may not carry meaning that we can figure out; in any case they are mathematical constructs from hundreds, perhaps thousands, of terms, and can be used as the variables in a predictive model.
A predictive model
— Now we have a clean, structured dataset similar to what we have used in our numerical data mining:
— Class identifications (labels) for training
— Numerical predictors
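At this point any standard classifier applies. A minimal sketch using a nearest-centroid rule on invented two-concept scores (the labels and values are made up purely for illustration, loosely echoing the fraud example):

```python
import numpy as np

# Training set: documents reduced to 2 concept scores, with known labels.
X_train = np.array([[2.1, 0.1], [1.9, 0.2],   # class 1 ("fraudulent")
                    [0.2, 2.0], [0.1, 1.8]])  # class 0 ("not fraudulent")
y_train = np.array([1, 1, 0, 0])

# One centroid (mean concept vector) per class.
centroids = {c: X_train[y_train == c].mean(axis=0) for c in (0, 1)}

def predict(x):
    # Assign the class whose centroid is closest in concept space.
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

print(predict(np.array([2.0, 0.0])))   # near the class-1 centroid
```

A new claim's concept scores are compared to each class centroid, mirroring how the trained fraud model would be applied to new claims.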
Chapter Exercises
(Updated in Canvas)
