Introduction To Text Mining
DATA MINING
• Text Mining: the discovery by computer of new, previously unknown information, by automatically extracting information from a usually large amount of different unstructured textual resources
What is Text Mining?
• What does "previously unknown" mean?
– It implies discovering genuinely new information.
– Discovering new knowledge vs. merely finding patterns is like the difference between a detective following clues to find the criminal and analysts looking at crime statistics to assess overall trends in a particular crime.
Text Mining
[Diagram: Text Mining in relation to neighboring areas such as Statistics and Web Mining]
• Document Clustering
• Text Characteristics
[Text mining process diagram: Text → Text Preprocessing → Text Transformation (Attribute Generation) → Attribute Selection → Data Mining / Pattern Discovery → Interpretation / Evaluation]
Document Clustering
• Large volume of textual data
– Billions of documents must be handled in an efficient manner.
• Dependency
– Words and phrases create context for each other.
Text Characteristics
• Ambiguity
– Word ambiguity
– Sentence ambiguity
• Noisy data
– Erroneous data
– Intentionally misleading data.
• Unstructured text
– Chat rooms, normal speech, …
Text Characteristics
[Text mining process diagram, as above]
Text Preprocessing
• Text cleanup
– e.g., remove ads from web pages, normalize text converted from binary formats, deal with tables, figures and formulas, …
• Tokenization
– Splitting up a string of characters into a set of tokens
– Need to deal with issues like the following (illustrated in the sketch after this list):
• Apostrophes, e.g., “John’s sick”, is it 1 or 2 tokens?
• Hyphens, e.g., database vs. data-base vs. data base.
• How should we deal with “C++”, “A/C”, “:-)”, “…”?
• Is the amount of white space significant?
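As a concrete illustration, here is a minimal sketch of a regex-based tokenizer that encodes one possible set of answers to these questions (keep contractions, hyphenated words, "C++", "A/C" and ":-)" as single tokens, and ignore the amount of white space). The pattern and the example strings are illustrative assumptions, not a standard.

```python
import re

# Illustrative tokenizer: one possible set of answers to the questions above.
# The pattern below is an assumption for demonstration, not a standard.
TOKEN_PATTERN = re.compile(r"""
      :-\)                        # a simple emoticon, kept as one token
    | [A-Za-z]+\+\+               # identifiers such as C++
    | [A-Za-z]+/[A-Za-z]+         # abbreviations such as A/C
    | [A-Za-z]+'[A-Za-z]+         # contractions/possessives: "John's" stays one token
    | [A-Za-z]+(?:-[A-Za-z]+)*    # words, optionally hyphenated: data-base
    | \d+                         # numbers
""", re.VERBOSE)

def tokenize(text):
    # findall returns non-overlapping matches left to right;
    # any amount of white space between tokens is simply skipped.
    return TOKEN_PATTERN.findall(text)

print(tokenize("John's sick"))         # ["John's", 'sick']
print(tokenize("database data-base"))  # ['database', 'data-base']
print(tokenize("C++ A/C :-)"))         # ['C++', 'A/C', ':-)']
```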
Text Processing
• Part-of-speech (POS) tagging
– The process of marking up the words in a text with their corresponding parts of speech
– Rule based
• Depends on grammatical rules
– Statistically based
• Relies on different word order probabilities
• Needs a manually tagged corpus for machine learning (see the tagger sketch below)
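A minimal sketch of statistical POS tagging using NLTK's pre-trained tagger; it assumes the nltk package is installed and its tokenizer and tagger models have been downloaded, and is only meant to show what the output of this step looks like.

```python
import nltk

# Assumes the required NLTK models are available, e.g. after:
#   nltk.download('punkt')
#   nltk.download('averaged_perceptron_tagger')
sentence = "Text mining extracts new information from unstructured documents."
tokens = nltk.word_tokenize(sentence)   # tokenization step from above
print(nltk.pos_tag(tokens))             # e.g. [('Text', 'NN'), ('mining', 'NN'), ...]
```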
• Text Representation
• Feature Selection
[Text mining process diagram, as above]
Attribute Generation
• Text Representation:
– A text document is represented by the words (features/attributes) it contains and their occurrences (values of the attributes)
– Two main approaches to document representation (see the sketch below):
• “Bag of words”
• Vector Space
Grobelnik, M. and Mladenic, D. Text-Mining Tutorial. In Proceedings of Learning Methods for Text Understanding and Mining, Grenoble, France, January 26–29, 2004.
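As an illustration of the bag-of-words / vector-space idea, here is a minimal sketch using scikit-learn's CountVectorizer (assumed installed, version 1.0 or later for get_feature_names_out); the tiny corpus is invented for demonstration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents; each becomes a row of word counts over the shared vocabulary.
docs = [
    "text mining finds new information in text",
    "data mining finds patterns in structured data",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse documents-by-terms matrix

print(vectorizer.get_feature_names_out())   # the attributes (words)
print(X.toarray())                          # their values (occurrence counts)
```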
“Bag of Words”: Word Weighting
• In the “bag of words” representation each word is represented as a separate variable having a numeric weight
• The most popular weighting scheme is normalized word frequency, tfidf:

tfidf(w) = tf(w) · log( N / df(w) )

• tf(w) – term frequency (number of occurrences of word w in the document)
• df(w) – document frequency (number of documents containing word w)
• N – number of all documents
• tfidf(w) – relative importance of the word in the document
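A small worked sketch of the weighting defined above, tfidf(w) = tf(w) · log(N / df(w)); the three-document corpus and the whitespace tokenization are illustrative assumptions.

```python
import math
from collections import Counter

# Toy corpus: three already-tokenized documents (illustrative only).
docs = [
    "text mining finds new information in text".split(),
    "data mining finds patterns in structured data".split(),
    "text classification assigns labels to documents".split(),
]

N = len(docs)                                              # number of all documents
df = Counter(word for doc in docs for word in set(doc))    # document frequency df(w)

def tfidf(word, doc):
    tf = doc.count(word)                                   # term frequency tf(w) in this document
    return tf * math.log(N / df[word])

print(tfidf("text", docs[0]))     # 2 * log(3/2): frequent here, but appears in 2 of 3 docs
print(tfidf("mining", docs[0]))   # 1 * log(3/2)
print(tfidf("new", docs[0]))      # 1 * log(3/1): appears only in this document
```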
• Stemming
– Identifies a word by its root
– Reduces dimensionality (number of features, i.e. words), e.g. flying, flew → fly
– Two common algorithms (see the sketch after the example below):
• Porter’s Algorithm.
• KSTEM Algorithm.
Feature Selection
• Stemming Examples
– Original Text
• Document will describe marketing strategies carried out by U.S. companies for their agricultural chemicals, report predictions for market share of such chemicals, or report market statistics for agrochemicals.
Grobelnik, M. and Mladenic, D. Text-Mining Tutorial. In Proceedings of Learning Methods for Text Understanding and Mining, Grenoble, France, January 26–29, 2004.
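A minimal sketch of stemming with NLTK's implementation of Porter's algorithm (the nltk package is assumed to be installed; the word list is taken loosely from the example text above).

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["marketing", "strategies", "companies", "agricultural", "chemicals", "predictions"]
print({w: stemmer.stem(w) for w in words})
# Exact outputs depend on the implementation, e.g. "marketing" -> "market",
# "strategies" -> "strategi"; variants of a word are merged into one feature.
```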
Text Mining Process
• Reduce dimensionality
• Remove irrelevant attributes
[Text mining process diagram, as above]
Attribute Selection
• Further reduction of dimensionality
– Learners have difficulty addressing tasks with high dimensionality
– Scarcity of resources and feasibility issues also call for a further cutback of attributes
• Irrelevant features
– Not all features help! (see the selection sketch below)
• e.g., the existence of a noun in a news article is unlikely to help classify it as “politics” or “sport”
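A minimal sketch of one common attribute-selection approach, a chi-squared filter that keeps only the terms most associated with the class labels; it assumes scikit-learn is installed, and the four documents and their labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Tiny labelled corpus (invented) for a politics-vs-sport classification task.
docs = [
    "the minister announced a new election law",
    "parliament debated the budget proposal",
    "the striker scored twice in the final",
    "the coach praised the goalkeeper after the match",
]
labels = ["politics", "politics", "sport", "sport"]

X = CountVectorizer().fit_transform(docs)        # full bag-of-words matrix
selector = SelectKBest(chi2, k=5)                # keep the 5 most class-informative words
X_reduced = selector.fit_transform(X, labels)

print(X.shape, "->", X_reduced.shape)            # dimensionality before and after
```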
Text Mining Process
• Structured Database
• Application-dependent
• Classic Data Mining techniques
[Text mining process diagram, as above]
Data Mining
• At this point the text mining process merges with the traditional data mining process
[Text mining process diagram, as above]
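To make the hand-off concrete, here is a minimal sketch of applying a classic data mining technique (a Naive Bayes classifier) to the structured representation produced by the earlier stages; scikit-learn is assumed to be installed, and the corpus and labels are the same invented toy examples as above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = [
    "the minister announced a new election law",
    "parliament debated the budget proposal",
    "the striker scored twice in the final",
    "the coach praised the goalkeeper after the match",
]
labels = ["politics", "politics", "sport", "sport"]

# Text transformation + attribute generation + a classic classifier in one pipeline.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)

print(model.predict(["the goalkeeper saved a penalty"]))  # likely ['sport'] on this toy data
```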
Interpretation and Evaluation
• What to do next?
– Terminate
• Results well-suited for application at hand.
– Iterate
• Results not satisfactory but significant.
• The results generated are used as part of the input for one or more earlier stages.
Using text in Medical Hypothesis Discovery
• Example
• 8 students
• 9 texts from each student
• Fixed subjects (3 argumentative, 3 descriptive, 3 fiction)
• About 1,000 words per text
Profiling
Hans van Halteren, Linguistic Profiling for Author Recognition and Verification. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona, Spain, July 21–26, 2004.
Results with Syntactic Features
• The Amazon Parser (http://lands.let.kun.nl/~dreumel/amacas.en.html) is used to extract syntactic features (details about the parser are in Dutch)
• The size of the feature vector is about 900k counts
• Best result is 25% at D = 1.3, S = 1.4
• Worse than lexical feature analysis
…So Combine the Features
Hans van Halteren, Linguistic Profiling for Author Recognition and Verification. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona, Spain, July 21–26, 2004.
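The following is a generic sketch of combining two feature sets by concatenating their vectors; it is not van Halteren's actual profiling method, it uses character n-grams merely as a stand-in for syntactic features, and it assumes scikit-learn and scipy are installed.

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

docs = ["John flies to Paris every week", "Mary flew to Rome last year"]

# Lexical features: plain word counts.
lexical = CountVectorizer(analyzer="word").fit_transform(docs)

# Stand-in for syntactic features (the real work would use parser output);
# character n-gram counts are used here only to show the combination step.
syntactic_proxy = CountVectorizer(analyzer="char", ngram_range=(2, 3)).fit_transform(docs)

# Combine by concatenating the two vector spaces column-wise.
combined = sp.hstack([lexical, syntactic_proxy])
print(lexical.shape, syntactic_proxy.shape, combined.shape)
```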
Concluding Remarks
• The first issue that can be addressed is “parameter setting”
• There is no dynamic parameter setting scheme
• Experiments with other corpora might also yield interesting results
• Different kinds of feature selection may provide better results
ARROWSMITH
discovery from complementary literatures
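As a schematic illustration only (not the ARROWSMITH system itself), the core idea of discovery from complementary literatures is to look for intermediate terms shared by two otherwise disjoint sets of documents; the toy titles below are invented, and the stopword list is an assumption for demonstration.

```python
# Toy sketch of the complementary-literatures idea: terms that occur in both
# literature A and literature C may suggest a hidden A-C connection.
# The titles are invented; a real system works over MEDLINE-scale text.
a_literature = [
    "dietary fish oil reduces blood viscosity",
    "fish oil lowers platelet aggregation",
]
c_literature = [
    "raynaud disease patients show high blood viscosity",
    "platelet aggregation observed in raynaud disease",
]

STOPWORDS = {"in", "the", "of", "show", "high", "lowers", "reduces", "observed", "patients", "dietary"}

def terms(titles):
    return {word for title in titles for word in title.split()}

linking_terms = (terms(a_literature) & terms(c_literature)) - STOPWORDS
print(linking_terms)   # e.g. {'blood', 'viscosity', 'platelet', 'aggregation'}
```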
Project links:
http://kiwi.uchicago.edu/
http://arrowsmith.psych.uic.edu/arrowsmith_uic/index.html
References:
Swanson DR, Smalheiser NR, Torvik VI.