An Introduction To Text Mining
Ravindra Jaju
Data Mining
What: Looking for information in (usually large amounts of) data
- Mainly two kinds of activities: descriptive and predictive
- Example of a descriptive activity: clustering
- Example of a predictive activity: classification
<1, 1, 0, 0, 1, 0>  <0, 0, 1, 1, 0, 1>
- These could be two customers' baskets, containing (milk, bread, butter) and (shaving cream, razor, after-shave lotion) respectively
- Or they could be two documents: "Java programming language" and "India beat Pakistan"
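A minimal sketch of this encoding in Python; the vocabulary ordering below is an assumption chosen so that the two baskets reproduce the vectors above:

```python
# Documents/baskets as binary vectors over a fixed vocabulary:
# a term present in the document maps to 1, an absent term to 0.
vocabulary = ["milk", "bread", "shaving cream", "razor",
              "butter", "after-shave lotion"]

def to_binary_vector(items, vocabulary):
    return [1 if term in items else 0 for term in vocabulary]

basket_1 = {"milk", "bread", "butter"}
basket_2 = {"shaving cream", "razor", "after-shave lotion"}

print(to_binary_vector(basket_1, vocabulary))  # [1, 1, 0, 0, 1, 0]
print(to_binary_vector(basket_2, vocabulary))  # [0, 0, 1, 1, 0, 1]
```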
Data representation
- Most of the mining algorithms work only with numeric data
- Hence, all data are represented as numbers so that they can lend themselves to the algorithms
- Whether it is sales figures, crime rates, text, or images, one has to find a suitable way to transform the data into numbers
The transformation to 1's and 0's hides all the relationships between "Java" and "language", and "India" and "Pakistan", which humans can make out (how?)
- As we have seen, data transformation (from a text/word to some index number, in this case) means that there is some information loss
- One big challenge in this field today is finding a good data representation for input to the mining algorithms
- Capturing the meaning of sentences is an important issue as well; grammar, parts of speech, and time sense could be the easy part!
- Automatically finding out who the "he" in "He is the President" is, given a document, is hard. And president of what? Well ...
[Figure: documents plotted in a term space, with axes for terms such as "Java" and "Indonesia" and origin at (0, 0, 0)]
- We have the option of scaling the components for these words, or removing them from the corpus completely
- In general, we prefer to remove the stopwords and scale the remaining words
- Important words should be scaled up, and unimportant ones scaled down
- One widely used scaling factor: TF-IDF
- TF-IDF is the product of the Term Frequency and the Inverse Document Frequency of a word
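A small sketch of one common TF-IDF variant (raw term frequency times the log of the inverse document frequency; actual weighting schemes vary across systems):

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF of `term` in one document, relative to a corpus.

    tf  = raw count of the term in the document
    idf = log(N / df), where df = number of documents containing the term
    """
    tf = doc_tokens.count(term)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [["java", "programming", "language"],
          ["java", "indonesia"],
          ["india", "pakistan", "cricket"]]
print(tf_idf("java", corpus[0], corpus))         # common term -> lower weight
print(tf_idf("programming", corpus[0], corpus))  # rarer term -> higher weight
```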
Document/Term Clustering
- Given a large set, group similar entities
Text Classification
- Given a document, find what topic it talks about
Information Retrieval
- Search engines
Information Extraction
- Question answering
Clustering
- Partitioning
- Hierarchical
  - Agglomerative
  - Divisive
Clustering (contd.)
Partitioning
- Divide the input data into k partitions
- Examples: K-means, K-medoids
Hierarchical clustering
- Agglomerative: each data point starts out as a cluster of its own; keep merging similar clusters till we get a single cluster
- Divisive: the opposite of agglomerative; start with a single cluster and keep splitting
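A minimal k-means sketch (toy 2-D points for illustration; real inputs would be high-dimensional document vectors such as TF-IDF vectors):

```python
import random

def k_means(points, k, iters=100):
    """Plain k-means on lists of floats; a sketch, not an optimized version."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = [sum(xs) / len(cluster) for xs in zip(*cluster)]
    return centroids, clusters

points = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
centroids, clusters = k_means(points, k=2)
```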
Idea
- Frequent terms carry more information about the cluster they might belong to
- Highly correlated frequent terms probably belong to the same cluster
- Candidate clusters are then generated from F = {F1, ..., Fk}, where each Fi is a set of frequent terms which occur together
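A rough, brute-force sketch of finding such co-occurring frequent term sets (the paper uses an Apriori-style search; `min_support` and the toy documents here are illustrative):

```python
from itertools import combinations

def frequent_term_sets(docs, min_support, max_size=3):
    """Find term sets that occur together in at least `min_support` docs."""
    vocabulary = sorted(set(t for d in docs for t in d))
    frequent = []
    for size in range(1, max_size + 1):
        for terms in combinations(vocabulary, size):
            support = sum(1 for d in docs if set(terms) <= d)
            if support >= min_support:
                frequent.append((terms, support))
    return frequent

docs = [{"java", "programming"}, {"java", "programming", "language"},
        {"java", "indonesia"}, {"india", "pakistan"}]
print(frequent_term_sets(docs, min_support=2))
```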
Classification
A tree
- Each node represents an evaluation of some attribute of the data
- Each edge represents a choice for the value of the attribute its node evaluates: the decision taken
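A toy illustration of that structure, using made-up attributes (`contains_java`, `contains_indonesia`) that anticipate the example a few slides ahead:

```python
# Toy decision tree: each internal node tests one attribute, and each
# outgoing edge corresponds to one value of that attribute.
tree = {
    "attribute": "contains_java",
    "edges": {
        True:  {"attribute": "contains_indonesia",
                "edges": {True: "travel", False: "computer-programming"}},
        False: "other",
    },
}

def classify(doc_attrs, node):
    while isinstance(node, dict):             # internal node: evaluate
        value = doc_attrs[node["attribute"]]  # the attribute it tests
        node = node["edges"][value]           # follow the chosen edge
    return node                               # leaf: the class label

print(classify({"contains_java": True, "contains_indonesia": False}, tree))
# -> computer-programming
```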
In practice, the exact values of the probabilities of each event are unknown, and are estimated from the samples
Naïve Bayes
- Assume all the wj events are independent
- The RHS of Bayes' rule then expands to a product: P(c | w1, ..., wn) ∝ P(c) * Π_j P(wj | c)
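A minimal multinomial Naïve Bayes sketch under this independence assumption, with add-one smoothing as one common way of estimating the probabilities from samples:

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (tokens, class). Returns priors and word counts."""
    class_counts = Counter(c for _, c in labeled_docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, c in labeled_docs:
        word_counts[c].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify_nb(tokens, class_counts, word_counts, vocab):
    total_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c in class_counts:
        # log P(c) + sum_j log P(wj | c), with add-one (Laplace) smoothing
        score = math.log(class_counts[c] / total_docs)
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in tokens:
            score += math.log((word_counts[c][w] + 1) / denom)
        if score > best_score:
            best, best_score = c, score
    return best

docs = [(["java", "program", "code"], "computing"),
        (["java", "indonesia", "travel"], "travel")]
model = train_nb(docs)
print(classify_nb(["java", "code"], *model))  # -> computing
```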
- If "java" and "program" occur together, then boost the probability of the class computer-programming
- If "java" and "indonesia" occur together, then the document is more likely about some other class
Problem?
- How do we come up with correlations like the above?
An example
Problem setting
- Labeling documents is a manual process
- A lot more unlabeled documents are available as compared to labeled ones
- Unlabeled documents contain information which could help in the classification activity
An example (contd.)
Expectation Maximization
- A useful technique for estimating hidden parameters
- In the previous example, the class labels were missing from some documents
- It consists of two steps, repeated till convergence (and convergence does occur):
  - E-step: set z^(k+1) = E[z | D; θ^(k)]
  - M-step: set θ^(k+1) = arg max_θ P(θ | D; z^(k+1))
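A compact sketch of this loop for semi-supervised Naïve Bayes, in the spirit of Nigam et al.; smoothing and the convergence test are simplified, and the variable names are illustrative:

```python
import numpy as np

def em_nb(X_lab, y_lab, X_unl, n_classes, iters=20):
    """X_*: document-term count matrices; y_lab: labels for X_lab.
    E-step: soft class labels for unlabeled docs; M-step: refit parameters."""
    Z = np.eye(n_classes)[y_lab]                 # labeled docs: one-hot labels
    resp = np.full((len(X_unl), n_classes), 1.0 / n_classes)  # initial guess
    X = np.vstack([X_lab, X_unl])
    for _ in range(iters):
        # M-step: estimate theta (priors, word probs) from all soft labels.
        W = np.vstack([Z, resp])                 # doc-class responsibilities
        priors = W.sum(axis=0) / W.sum()
        word_probs = (W.T @ X) + 1.0             # add-one smoothing
        word_probs /= word_probs.sum(axis=1, keepdims=True)
        # E-step: recompute class posteriors for the unlabeled documents.
        log_post = np.log(priors) + X_unl @ np.log(word_probs).T
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
    return priors, word_probs, resp

X_lab = np.array([[2, 0], [0, 2]]); y_lab = np.array([0, 1])
X_unl = np.array([[1, 0], [0, 3]])
print(em_nb(X_lab, y_lab, X_unl, n_classes=2)[2].round(2))
```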
Another example
Contd.
Idea
- Find a direction which maximizes the separation between classes. Why?
  - To reduce noise, or rather, to enhance the differences between classes
- Project the data points onto this direction
- For all data points not separated by this vector, choose another
Contd.
- Project all the document vectors into the space with these vectors as the basis vectors
- Now induce a decision tree on this projected representation
- The number of attributes is highly reduced
- Since this representation nicely separates the data points (documents), accuracy increases
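One way to compute such a direction is Fisher's linear discriminant; below is a two-class numpy sketch (the paper computes multiple projections, this shows a single one, and the toy points are made up):

```python
import numpy as np

def fisher_direction(X0, X1):
    """Direction maximizing between-class separation relative to
    within-class scatter: w = Sw^-1 (mu1 - mu0)."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    # Small ridge term keeps Sw invertible for near-degenerate data.
    w = np.linalg.solve(Sw + 1e-6 * np.eye(len(mu0)), mu1 - mu0)
    return w / np.linalg.norm(w)

X0 = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]])   # class 0 points
X1 = np.array([[3.0, 0.5], [3.2, 0.7], [2.8, 0.3]])   # class 1 points
w = fisher_direction(X0, X1)
print(X0 @ w, X1 @ w)   # projections onto w: the two classes separate
```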
- The WWW is a huge, directed graph, with documents as nodes and hyperlinks as the directed edges
- Apart from the text itself, this graph structure carries a lot of information about the usefulness of the nodes
- For example:
- 10 random, average people on the street say Mr. T. Ache is a good dentist
- 5 reputed doctors, including dentists, recommend Mr. P. Killer as a better dentist
- Who would you choose?
Kleinberg's HITS
- HITS: Hypertext Induced Topic Selection
- Nodes on the web can be categorized into two types: hubs and authorities
- Authorities are nodes which one refers to for definitive information about a topic
- Hubs point to authorities
- HITS computes hub and authority scores on a sub-universe of the web
HITS (contd.)
- Update rules: A(p) = Σ H(q) over all pages q linking to p, and H(p) = Σ A(q) over all pages q that p links to
- Repeat the above till convergence, normalizing the scores at each step
- Nodes with high A scores are relevant
- Relevant to what? Can we use this for efficient retrieval for a query?
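A minimal power-iteration sketch of HITS on a small adjacency-list graph, assuming the standard update rules above with per-round normalization:

```python
def hits(graph, iters=50):
    """graph: {node: [nodes it links to]}. Returns (hub, authority) scores."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # Authority score: sum of hub scores of the pages pointing to it.
        auth = {n: sum(hub[u] for u in graph if n in graph[u]) for n in nodes}
        # Hub score: sum of authority scores of the pages it points to.
        hub = {n: sum(auth[v] for v in graph.get(n, [])) for n in nodes}
        # Normalize so the scores don't blow up across iterations.
        na, nh = sum(auth.values()) or 1.0, sum(hub.values()) or 1.0
        auth = {n: a / na for n, a in auth.items()}
        hub = {n: h / nh for n, h in hub.items()}
    return hub, auth

graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(graph)   # a1 gets the highest authority score
```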
PageRank
- Similar to HITS, but all pages have only one score: a rank
  R(u) = c * Σ_{v ∈ Bu} R(v)/Nv
- Bu is the set of pages linking to u, and Nv is the number of out-links of v; c is a scaling factor (< 1)
- The higher the rank of the pages linking to a page, the higher is its own rank!
- To handle rank sinks (sets of pages which do not link outside themselves), the formula is modified as
  R(u) = c * Σ_{v ∈ Bu} R(v)/Nv + c * E(u)
- E(u), a distribution of rank over some set of pages, acts as a rank source (what kind of pages?)
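A minimal sketch of this iteration, assuming a uniform rank source E (one common choice):

```python
def pagerank(graph, c=0.85, iters=50):
    """graph: {page: [pages it links to]}.
    R(u) = c * sum over v in Bu of R(v)/Nv, plus a uniform rank source."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    N = len(nodes)
    rank = {n: 1.0 / N for n in nodes}
    for _ in range(iters):
        new_rank = {n: (1.0 - c) / N for n in nodes}  # uniform rank source E
        for v, out_links in graph.items():
            for u in out_links:
                # v passes an equal share of its rank to each page it links to.
                new_rank[u] += c * rank[v] / len(out_links)
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))   # C ends up with the highest rank
```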
The above was inferred from a set of documents, with some human help
References
- Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber
- Principles of Data Mining, David J. Hand et al.
- Text Classification from Labeled and Unlabeled Documents using EM, Kamal Nigam et al.
- Fast and accurate text classification via multiple linear discriminant projections, S. Chakrabarti et al.
- Frequent Term-Based Text Clustering, Florian Beil et al.
- The PageRank Citation Ranking: Bringing Order to the Web, Lawrence Page and Sergey Brin
- Untangling Text Data Mining, Marti A. Hearst, http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
- And others