
An Introduction to Text Mining

Ravindra Jaju

Outline of the presentation


- Introduction: what makes text stand apart from other kinds of data?
- Classification
- Clustering
- Mining on the Web

Data Mining
- What: looking for information in usually large amounts of data
- Mainly two kinds of activities: descriptive and predictive
- Example of a descriptive activity: clustering
- Example of a predictive activity: classification

What kind of data is this?

<1, 1, 0, 0, 1, 0>
<0, 0, 1, 1, 0, 1>

- It could be two customers' baskets, containing (milk, bread, butter) and (shaving cream, razor, after-shave lotion) respectively
- Or it could be two documents: "Java programming language" and "India beat Pakistan"

And what kind of data is this?


<550000, 155>
<750000, 115>
<120000, 165>

- Data about people: <income, IQ> pairs!

Data representation

- Humans understand data in various forms:
  - Text
  - Sales figures
  - Images
- Computers understand only numbers

Working with data

- Most mining algorithms work only with numeric data
- Hence, all data are represented as numbers so that they lend themselves to the algorithms
- Whether it is sales figures, crime rates, text, or images, one has to find a suitable way to transform the data into numbers

Text mining: working with numbers


- "Java Programming Language" and "India beat Pakistan", or
- <1, 1, 0, 0, 1, 0> and <0, 0, 1, 1, 0, 1>

The transformation to 1's and 0's hides all the relationships between "Java" and "Language", and between "India" and "Pakistan", which humans can make out (how?)

Text mining: working with numbers (contd.)

- As we have seen, data transformation (from text/words to index numbers in this case) means that there is some information loss
- One big challenge in this field today is to find a good data representation for input to the mining algorithms

Text Representation Issues

- Each word has a dictionary meaning, or meanings
  - Run: (1) the verb, (2) the noun, in cricket
  - Cricket: (1) the game, (2) the insect
- Each word is used in various senses
  - "Tendulkar made 100 runs"
  - "Because of an injury, Tendulkar cannot run and will need a runner between the wickets"
- Capturing the meaning of sentences is an important issue as well
  - Grammar, parts of speech, tense: these could be the "easy" parts!
  - Finding out automatically who the "he" in "He is the President" refers to, given a document, is hard. And president of what? Well ...

Text Representation Issues (contd.)

- In general, it is hard to capture these features from a text document
  - One, it is difficult to extract them automatically
  - Two, even if we did it, it won't scale!
- One simplification is to represent each document as a vector of words
  - We have already seen examples
  - Each document is represented as a vector, and each component of the vector represents some quantity related to a single word

The Document Vector

- "Java Programming Language"
  - <1, 1, 0, 0, 1, 0, 0> (document A)
- "India beat Pakistan"
  - <0, 0, 1, 1, 0, 1, 0> (document B)
- "India beat Australia"
  - <0, 0, 1, 1, 0, 0, 1> (document C)
- What vector operation can you think of to find two similar documents? How about the dot product?
- As we can easily verify, documents B and C have a higher dot product than any other pair (a sketch follows below)


More on document similarity

- The dot product, or the cosine of the angle between two vectors, is a measure of similarity
- Documents about related topics should have higher similarity (a cosine sketch follows below)

[Figure: document vectors in a three-dimensional term space with axes "Java", "Language", and "Indonesia", drawn from the origin <0, 0, 0>]
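A sketch of the cosine measure over such vectors (plain Python; this is the standard formula, not code from the talk):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Documents about related topics score higher:
print(cosine([0, 0, 1, 1, 0, 1, 0], [0, 0, 1, 1, 0, 0, 1]))  # ~0.67 (B vs C)
print(cosine([1, 1, 0, 0, 1, 0, 0], [0, 0, 1, 1, 0, 1, 0]))  # 0.0  (A vs B)
```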


Document Similarity (contd.)

- How about distance measures?
- The cosine similarity measure will not capture the inter-cluster distances!

Further refinements to the DV representation

- Not all words are equally important
  - the, is, and, to, he, she, it (why?)
  - Of course, these words could be important in certain contexts
- We have the option of scaling the components for these words, or removing them from the corpus completely
  - In general, we prefer to remove the stopwords and scale the remaining words
- Important words should be scaled upwards, and vice versa
- One widely used scaling factor is TF-IDF: the product of Term Frequency and Inverse Document Frequency for a word (a sketch follows below)
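A minimal TF-IDF sketch, assuming the raw term count for TF and log(N/df) for IDF (one common variant among several; names are illustrative):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each word in each tokenized document by tf(w, d) * log(N / df(w))."""
    n = len(docs)
    df = Counter()                      # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)               # term frequency within the document
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights

docs = [d.lower().split() for d in
        ["java programming language", "india beat pakistan", "india beat australia"]]
for w in tf_idf(docs):
    print(w)
# A word occurring in every document gets weight log(N/N) = 0;
# rarer words are scaled upwards.
```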


Text Mining: Moving Further


- Document/term clustering
  - Given a large set, group similar entities
- Text classification
  - Given a document, find what topic it talks about
- Information retrieval
  - Search engines
- Information extraction
  - Question answering

Clustering (Descriptive Activity)


- Activity: group together similar documents
- Techniques used:
  - Partitioning
  - Hierarchical (agglomerative, divisive)
  - Grid-based
  - Model-based

Clustering (contd.)

- Partitioning
  - Divide the input data into k partitions
  - K-means, K-medoids (a K-means sketch follows below)
- Hierarchical clustering
  - Agglomerative: each data point starts out as a cluster representative; keep merging similar clusters till we get a single cluster
  - Divisive: the opposite of agglomerative
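A plain K-means sketch (the standard algorithm, not tied to any particular library; random initialization is one of several common choices):

```python
import random

def k_means(points, k, iters=20):
    """Assign each point to its nearest centroid, then recompute
    each centroid as the mean of its cluster; repeat."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster emptied out
                centroids[i] = [sum(xs) / len(cluster) for xs in zip(*cluster)]
    return centroids, clusters

points = [[1, 1], [1, 2], [0, 1], [8, 8], [9, 8], [9, 9]]
centroids, clusters = k_means(points, k=2)
print(centroids)  # two centroids, one near each group of points
```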

Frequent term-based text clustering

- Idea
  - Frequent terms carry more information about the cluster they might belong to
  - Highly correlated frequent terms probably belong to the same cluster
- D = {D1, ..., Dn} is the set of documents, and each Dj is a subset of T, the set of all terms
- Candidate clusters are then generated from F = {F1, ..., Fk}, where each Fi is a set of frequent terms which occur together (see the sketch below)
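A deliberately simplified sketch of the candidate-generation idea (brute-force enumeration of small term sets; the actual algorithm in the Beil et al. paper is considerably more refined):

```python
from itertools import combinations

def candidate_clusters(docs, min_support=2, max_size=2):
    """Term sets occurring together in at least min_support documents;
    the documents covering a frequent term set form a candidate cluster."""
    terms = sorted({t for doc in docs for t in doc})
    candidates = {}
    for size in range(1, max_size + 1):
        for term_set in combinations(terms, size):
            cover = [i for i, doc in enumerate(docs) if set(term_set) <= doc]
            if len(cover) >= min_support:
                candidates[term_set] = cover
    return candidates

docs = [set("india beat pakistan".split()),
        set("india beat australia".split()),
        set("java programming language".split())]
print(candidate_clusters(docs))
# {('beat',): [0, 1], ('india',): [0, 1], ('beat', 'india'): [0, 1]}
```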

Classification

- The problem statement
  - Given: a set of documents, each with a label, called the class label for that document
  - Given: a classifier which learns from the above data set
  - For a new, unseen document, the classifier should be able to predict with a high degree of accuracy the correct class to which the new document belongs

Decision Tree Classifier

- A tree
  - Each node represents some kind of an evaluation of an attribute of the data
  - Each edge, the decision taken
- The evaluation at each node is some kind of an information gain measure
  - Reduction in entropy means more information gained
  - Entropy: E(X) = -Σ_i p_i log2(p_i), where p_i is the probability of the i-th class
- Each edge represents a choice for the value of the attribute its node represents
- Good for text mining, but doesn't scale (a sketch of the entropy computation follows below)
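A small sketch of the entropy and information-gain computations (standard formulas; the split encoding is illustrative):

```python
import math

def entropy(probs):
    """E(X) = -sum_i p_i * log2(p_i) over a class distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(parent, children):
    """Reduction in entropy after a split; children is a list of
    (weight, class distribution) pairs whose weights sum to 1."""
    return entropy(parent) - sum(w * entropy(dist) for w, dist in children)

# Splitting a 50/50 node into two pure children gains a full bit:
print(entropy([0.5, 0.5]))                                         # 1.0
print(information_gain([0.5, 0.5], [(0.5, [1.0]), (0.5, [1.0])]))  # 1.0
```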

Statistical (Bayesian) Classification

- For document-class data, we calculate the probabilities of occurrence of events
- Bayes' Theorem: P(c|d) = P(c) · P(d|c) / P(d)
  - Given a document d, the probability that it belongs to a class c is given by the above formula
- In practice, the exact probabilities of each event are unknown, and are estimated from the samples

Naïve Bayes Classification

- Probability of the document event d
  - P(d) = P(w1, ..., wn), where the wi are the words
  - The RHS is generally a headache: we have to consider the inter-dependence of each of the wi events
- Naïve Bayes: assume all the wi events are independent; the RHS then expands to Π_i P(wi)
- Most Bayesian text classifiers work with this simplification (a sketch follows below)
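A minimal multinomial Naïve Bayes sketch with add-one (Laplace) smoothing, working in log space to avoid underflow (class and method names are illustrative):

```python
import math
from collections import Counter

class NaiveBayes:
    def fit(self, docs, labels):
        """Estimate class priors and per-class word counts from tokenized docs."""
        self.classes = set(labels)
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for doc, c in zip(docs, labels):
            self.counts[c].update(doc)
            self.vocab.update(doc)

    def predict(self, doc):
        """argmax_c of log P(c) + sum_w log P(w|c), with add-one smoothing."""
        def log_posterior(c):
            total = sum(self.counts[c].values())
            return math.log(self.prior[c]) + sum(
                math.log((self.counts[c][w] + 1) / (total + len(self.vocab)))
                for w in doc)
        return max(self.classes, key=log_posterior)

nb = NaiveBayes()
nb.fit([["java", "programming"], ["india", "beat", "pakistan"]],
       ["tech", "sports"])
print(nb.predict(["java", "language"]))  # tech
```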



Bayesian Belief Networks


- This is an intermediate approach: not all words are independent
  - If "java" and "program" occur together, boost the probability of the class "computer programming"
  - If "java" and "indonesia" occur together, the document is more likely about some other class
- Problem?
  - How do we come up with correlations like the above?

Other classification techniques

- Support Vector Machines
  - Find the best discriminant plane between two classes
- k-Nearest Neighbour
- Association rule mining
- Neural networks
- Case-based reasoning

An example

"Text Classification from Labeled and Unlabeled Documents with Expectation Maximization"

- Problem setting
  - Labeling documents is a manual process
  - Many more unlabeled documents are available compared to labeled ones
  - Unlabeled documents contain information which could help in the classification activity

An example (contd.)

- Train a classifier with the labeled documents
  - Say, a Naïve Bayes classifier
  - This classifier estimates the model parameters (the prior probabilities of the various events)
- Now classify the unlabeled documents
  - Assuming the applied labels to be correct, re-estimate the model parameters
- Repeat the above step till convergence

Expectation Maximization
- A useful technique for estimating hidden parameters
- In the previous example, the class labels were missing from some documents
- Consists of two steps:
  - E-step: set z(k+1) = E[z | D; θ(k)]
  - M-step: set θ(k+1) = argmax_θ P(θ | D; z(k+1))
- The above steps are repeated till convergence, and convergence does occur (a sketch of the loop follows below)
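A sketch of the loop in the Nigam et al. setting, using hard labels for simplicity (the paper uses soft, probabilistic labels; the fit/predict classifier interface here is the hypothetical one sketched earlier):

```python
def em_train(clf, labeled_docs, labels, unlabeled_docs, max_iters=10):
    """Semi-supervised training: bootstrap from labeled data, then
    alternately label the unlabeled docs (E-step) and re-estimate the
    model parameters from everything (M-step)."""
    clf.fit(labeled_docs, labels)                    # initial estimate
    for _ in range(max_iters):
        guessed = [clf.predict(d) for d in unlabeled_docs]        # E-step
        clf.fit(labeled_docs + unlabeled_docs, labels + guessed)  # M-step
        # A full implementation would track the likelihood and stop
        # once it converges instead of running a fixed number of rounds.
    return clf
```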

Another example

"Fast and Accurate Text Classification via Multiple Linear Discriminant Projections"

Contd.

- Idea
  - Find a direction which maximizes the separation between classes. Why?
  - To reduce noise, or rather, to enhance the differences between classes
- The vector corresponding to this direction is the Fisher discriminant
- Project the data points onto this vector
- For all data points not separated by this vector, choose another

Contd.

- Repeat till all the data are separable
  - Note: we are looking at the 2-class case; this easily extends to multiple classes
- Project all the document vectors into the space with these vectors as the basis
- Now induce a decision tree on this projected representation
  - The number of attributes is greatly reduced
  - Since this representation nicely separates the data points (documents), accuracy increases (a sketch of the Fisher direction follows below)
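A sketch of computing one Fisher discriminant direction for two classes, w = Sw⁻¹(μ1 − μ2), using numpy (the small ridge term is an added numerical safeguard, not part of the formula; the data is made up):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant: w = Sw^-1 (mu1 - mu2),
    where Sw is the pooled within-class scatter matrix."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    Sw = S1 + S2 + 1e-6 * np.eye(X1.shape[1])  # ridge for invertibility
    return np.linalg.solve(Sw, mu1 - mu2)

X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])  # class 1
X2 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 8.0]])  # class 2
w = fisher_direction(X1, X2)
print(X1 @ w)  # projections of class 1 ...
print(X2 @ w)  # ... clearly separated from class 2 along w
```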

Web Text Mining

- The WWW is a huge, directed graph, with documents as nodes and hyperlinks as the directed edges
- Apart from the text itself, this graph structure carries a lot of information about the usefulness of the nodes
- For example:
  - 10 random, average people on the street say Mr. T. Ache is a good dentist
  - 5 reputed doctors, including dentists, recommend Mr. P. Killer as the better dentist
  - Whom would you choose?

Kleinberg's HITS

- HITS: Hypertext Induced Topic Selection
- Nodes on the web can be categorized into two types: hubs and authorities
  - Authorities are nodes which one refers to for definitive information about a topic
  - Hubs point to authorities
- HITS computes the hub and authority scores on a sub-universe of the web
  - How does one collect this sub-universe?

HITS (contd.)

- The basic steps:
  - A(u) = Σ H(v), over all v pointing to u
  - H(u) = Σ A(v), over all v pointed to by u
- Repeat the above till convergence
- Nodes with high A scores are relevant
  - Relevant to what? Can we use this for efficient retrieval for a query? (a sketch follows below)
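A power-iteration sketch of HITS on a small adjacency map (the graph and node names are made up for illustration):

```python
def hits(links, iters=50):
    """links maps each node to the set of nodes it points to.
    Returns (authority, hub) score dicts, normalized each round."""
    nodes = set(links) | {v for targets in links.values() for v in targets}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # A(u) = sum of H(v) over all v pointing to u
        auth = {u: sum(hub[v] for v in nodes if u in links.get(v, ()))
                for u in nodes}
        # H(u) = sum of A(v) over all v pointed to by u
        hub = {u: sum(auth[v] for v in links.get(u, ())) for u in nodes}
        na = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        nh = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        auth = {n: a / na for n, a in auth.items()}
        hub = {n: h / nh for n, h in hub.items()}
    return auth, hub

links = {"p1": {"good"}, "p2": {"good"}, "p3": {"good", "other"}}
auth, hub = hits(links)
print(max(auth, key=auth.get))  # 'good': the most-pointed-to authority
```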

PageRank

- Similar to HITS, but all pages have only one score: a rank
  - R(u) = c Σ_{v ∈ B_u} R(v)/N_v
  - B_u is the set of pages linking to u, N_v is the number of outgoing links of v, and c is a scaling factor (< 1)
- The higher the rank of the pages linking to a page, the higher its own rank!
- To handle rank sinks (documents which do not link outside a set of pages), the formula is modified as
  - R'(u) = c Σ_{v ∈ B_u} R'(v)/N_v + c E(u)
  - E(u) is a vector over some set of pages, and acts as a rank source (what kind of pages?) (see the sketch below)
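A simplified PageRank sketch where the rank source is taken as uniform over all pages (one common choice; the paper leaves E open):

```python
def pagerank(links, c=0.85, iters=50):
    """links maps each page to the set of pages it links to.
    R(u) = (1 - c)/N + c * sum over v in B_u of R(v)/N_v."""
    nodes = set(links) | {v for targets in links.values() for v in targets}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - c) / n for u in nodes}    # uniform rank source E(u)
        for v, outs in links.items():
            if outs:
                share = c * rank[v] / len(outs)  # v's rank, split over its links
                for u in outs:
                    new[u] += share
        rank = new
    return rank

links = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}}
print(pagerank(links))  # 'c' ranks highest: both 'a' and 'b' link to it
```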

Some more topics which we haven't touched

- Using external dictionaries
  - WordNet
- Using language-specific techniques
  - Computational linguistics
  - Use grammar for judging the sense of a query in the information retrieval scenario
- Other interesting techniques
  - Latent Semantic Indexing: finding the latent information in documents using linear algebra techniques

Some more comments


- Some purists do not consider most of the current activities in the text mining field as real text mining
- For example, see Marti Hearst's write-up, "Untangling Text Data Mining"

Some more comments (contd.)

- One example that she mentions:
  - stress is associated with migraines
  - stress can lead to loss of magnesium
  - calcium channel blockers prevent some migraines
  - magnesium is a natural calcium channel blocker
  - spreading cortical depression (SCD) is implicated in some migraines
  - high levels of magnesium inhibit SCD
  - migraine patients have high platelet aggregability
  - magnesium can suppress platelet aggregability
- The above was inferred from a set of documents, with some human help

References

- Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber
- Principles of Data Mining, David J. Hand et al.
- "Text Classification from Labeled and Unlabeled Documents using EM", Kamal Nigam et al.
- "Fast and Accurate Text Classification via Multiple Linear Discriminant Projections", S. Chakrabarti et al.
- "Frequent Term-Based Text Clustering", Florian Beil et al.
- "The PageRank Citation Ranking: Bringing Order to the Web", Lawrence Page and Sergey Brin
- "Untangling Text Data Mining", Marti A. Hearst, http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
- And others

