Text Mining Notes
“Corpus” often refers to a fixed, standard set of documents that many researchers
can use to develop and tune text mining algorithms.
The spreadsheet model of text
Columns are terms
Rows are documents
Cells indicate presence/absence (or frequency) of terms in documents
Consider the two sentences:
S1 First we consider the spreadsheet model
S2 Then we consider another model
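The two sentences above can be turned into rows of the spreadsheet model directly. A minimal Python sketch (not from the source, using a deliberately simple whitespace tokenizer) that builds the presence/absence term-document matrix:

```python
# Build a presence/absence term-document matrix for the two example sentences.
docs = {
    "S1": "First we consider the spreadsheet model",
    "S2": "Then we consider another model",
}

# Lowercase and split on whitespace (a deliberately simple tokenizer).
tokens = {name: text.lower().split() for name, text in docs.items()}

# Columns are terms (the combined vocabulary), rows are documents.
vocabulary = sorted({t for toks in tokens.values() for t in toks})

matrix = {
    name: [1 if term in toks else 0 for term in vocabulary]
    for name, toks in tokens.items()
}

print(vocabulary)
print("S1:", matrix["S1"])
print("S2:", matrix["S2"])
```

Note that the two rows agree on the shared terms (“we”, “consider”, “model”) and differ elsewhere, which is exactly the information later steps will use.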
We considered the historical trends in currency exchange rates and determined that it was reasonably possible that
changes in exchange rates of 20% could be experienced in the near term. If the U.S. dollar weakened by 20% at
December 31, 2013 and 2014, the amount recorded in AOCI related to our foreign exchange options before tax
effect would have been approximately $4 million and $686 million lower at December 31, 2013 and December
31, 2014, and the total amount of expense recorded as interest and other income, net, would have been
approximately $123 million and $90 million higher in the years ended December 31, 2013 and December 31,
2014. If the U.S. dollar strengthened by 20% at December 31, 2013 and December 31, 2014, the amount
recorded in accumulated AOCI related to our foreign exchange options before tax effect would have been
approximately $1.7 billion and $2.5 billion higher at December 31, 2013 and December 31, 2014, and the total
amount of expense recorded as interest and other income, net, would have been approximately $120 million and
$164 million higher in the years ended December 31, 2013 and December 31, 2014.
Email addresses, URLs, stray characters introduced by file conversions, …
From a medical journal: Eight hundred elderly women and men from the population-based
Framingham Osteoporosis Study had BMD assessed in 1988-1989 and again in 1992-1993.
BMD was measured at femoral neck, trochanter, Ward's area, radial shaft,
ultradistal radius, and lumbar spine using Lunar densitometers. (Risk Factors for Longitudinal Bone
Loss in Elderly Men and Women: The Framingham Osteoporosis Study, Journal of Bone and Mineral Research,
Volume 15, Issue 4, pages 710–720, April 2000)
Tokenization
We need to move from a mass of text to useful
predictor information
The first step is to separate out and identify
individual terms
The process by which you identify delimiters and use
them to separate terms is called tokenization. The
resulting terms are also called tokens.
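A small sketch of delimiter-based tokenization (not from the source; it treats any run of non-word characters as a delimiter, which is one common but crude choice):

```python
import re

text = "Email addresses, URLs, and punctuation complicate tokenization."

# Treat any run of non-word characters as a delimiter.
tokens = re.split(r"\W+", text.lower())
tokens = [t for t in tokens if t]  # drop empty strings left by leading/trailing delimiters

print(tokens)
```

Real tokenizers have to decide harder cases (is “U.S.” one token or three? does “1988-1989” split?), so the delimiter set is a design choice, not a given.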
Preprocessing
Goal – reduction of text (also called vocabulary
reduction) without losing meaning or predictive
power
Stemming
Reducing multiple variants of a word to a common
core
Travel, traveling, traveled, etc. -> travel
Ignore case
Frequency filters can eliminate terms that
Appear in nearly all documents
Appear in hardly any documents
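Both reduction steps can be sketched in a few lines. The stemmer below is a deliberately crude suffix-stripper written for illustration (real systems use something like the Porter stemmer), and the frequency-filter thresholds are arbitrary example values:

```python
from collections import Counter

# A deliberately crude suffix-stripping stemmer, for illustration only;
# production systems use e.g. the Porter stemmer.
def crude_stem(word):
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: len(word) - len(suffix)]
    return word

words = ["travel", "traveling", "traveled", "travels"]
stems = [crude_stem(w.lower()) for w in words]
print(stems)  # all four variants collapse to "travel"

# Frequency filter: drop terms that appear in nearly all documents
# or in hardly any (thresholds here are arbitrary example values).
docs = [
    ["travel", "plan", "budget"],
    ["travel", "hotel", "budget"],
    ["travel", "flight"],
]
df = Counter(term for doc in docs for term in set(doc))  # document frequency
n = len(docs)
kept = sorted(t for t, c in df.items() if 0.33 <= c / n <= 0.67)
print(kept)  # "travel" is dropped because it appears in every document
```

Terms appearing everywhere (like “travel” here) carry no discriminating power, and terms appearing almost nowhere are too sparse to generalize from, which is why both ends get filtered.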
Preprocessing, cont.
(Analytic Solver Platform, XLMiner Platform, Data Mining User Guide, 2014,
Frontline Systems, p. 245)
Extracting Meaning?
It may be possible to use the concepts to identify
themes in the document corpus, and clusters of
documents sharing those themes.
Often, however, the concepts do not map in obvious
fashion to meaningful themes.
Their key contribution is simply reducing the
vocabulary – instead of a matrix with thousands of
columns, we can deal with just a dozen or two.
The concept document matrix
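The source does not show how the concepts are derived (tools like XLMiner typically obtain them via latent semantic analysis). As a stand-in, here is a sketch where hand-picked concept groupings collapse many term columns into a few concept columns; the documents, terms, and groupings are all invented for illustration:

```python
# Collapse a term-document matrix (many columns) into a concept-document
# matrix (few columns) by summing term frequencies within each concept.
# The concept groupings here are hand-picked for illustration; in practice
# they would come from a technique such as latent semantic analysis.
term_doc = {
    "doc1": {"travel": 3, "hotel": 2, "budget": 1},
    "doc2": {"stock": 4, "bond": 1, "budget": 2},
}
concepts = {
    "tourism": {"travel", "hotel"},
    "finance": {"stock", "bond", "budget"},
}

concept_doc = {
    doc: {
        concept: sum(freqs.get(term, 0) for term in terms)
        for concept, terms in concepts.items()
    }
    for doc, freqs in term_doc.items()
}

print(concept_doc)
```

Each document is now described by two numbers instead of five, which is the vocabulary-reduction payoff described above, whether or not the concept columns map cleanly to human-readable themes.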