An Introduction To Text Mining
Ravindra Jaju
Data Mining
What: Looking for information in (usually large amounts of) data
- Mainly two kinds of activities: descriptive and predictive
- Example of a descriptive activity: clustering
- Example of a predictive activity: classification
<1, 1, 0, 0, 1, 0>  <0, 0, 1, 1, 0, 1>
- These could be two customers' baskets, containing (milk, bread, butter) and (shaving cream, razor, after-shave lotion) respectively
- Or they could be two documents: "Java programming language" and "India beat Pakistan"
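A minimal sketch of this encoding in Python; the vocabulary ordering below is an assumption chosen so that the two baskets reproduce the vectors above:

```python
# Documents/baskets as binary vectors over a fixed vocabulary:
# a term present in the document maps to 1, an absent term to 0.
vocabulary = ["milk", "bread", "shaving cream", "razor",
              "butter", "after-shave lotion"]

def to_binary_vector(items, vocabulary):
    return [1 if term in items else 0 for term in vocabulary]

basket_1 = {"milk", "bread", "butter"}
basket_2 = {"shaving cream", "razor", "after-shave lotion"}

print(to_binary_vector(basket_1, vocabulary))  # [1, 1, 0, 0, 1, 0]
print(to_binary_vector(basket_2, vocabulary))  # [0, 0, 1, 1, 0, 1]
```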
Data representation
- Most of the mining algorithms work only with numeric data
- Hence, all data are represented as numbers so that they can lend themselves to the algorithms
- Whether it is sales figures, crime rates, text, or images, one has to find a suitable way to transform the data into numbers
The transformation to 1's and 0's hides all the relationships between "Java" and "language", and "India" and "Pakistan", which humans can make out (how?)
- As we have seen, data transformation (from a text/word to some index number, in this case) means that there is some information loss
- One big challenge in this field today is finding a good data representation for input to the mining algorithms
- Capturing the meaning of sentences is an important issue as well; grammar, parts of speech, and time sense could be the easy part!
- Automatically finding out who the "he" in "He is the President" is, given a document, is hard. And president of what? Well ...
[Figure: documents plotted in a term space, with axes for terms such as "Java" and "Indonesia" and origin at (0, 0, 0)]
- We have the option of scaling the components for these words, or removing them from the corpus completely
- In general, we prefer to remove the stopwords and scale the remaining words
- Important words should be scaled up, and unimportant ones scaled down
- One widely used scaling factor: TF-IDF
- TF-IDF is the product of the Term Frequency and the Inverse Document Frequency of a word
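A small sketch of one common TF-IDF variant (raw term frequency times the log of the inverse document frequency; actual weighting schemes vary across systems):

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF of `term` in one document, relative to a corpus.

    tf  = raw count of the term in the document
    idf = log(N / df), where df = number of documents containing the term
    """
    tf = doc_tokens.count(term)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [["java", "programming", "language"],
          ["java", "indonesia"],
          ["india", "pakistan", "cricket"]]
print(tf_idf("java", corpus[0], corpus))         # common term -> lower weight
print(tf_idf("programming", corpus[0], corpus))  # rarer term -> higher weight
```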
Document/Term Clustering
- Given a large set, group similar entities
Text Classification
- Given a document, find what topic it talks about
Information Retrieval
- Search engines
Information Extraction
- Question answering
Clustering
- Partitioning
- Hierarchical
  - Agglomerative
  - Divisive
Clustering (contd.)
Partitioning
- Divide the input data into k partitions
- Examples: K-means, K-medoids
Hierarchical clustering
- Agglomerative: each data point starts out as a cluster of its own; keep merging similar clusters till we get a single cluster
- Divisive: the opposite of agglomerative; start with a single cluster and keep splitting
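A minimal k-means sketch (toy 2-D points for illustration; real inputs would be high-dimensional document vectors such as TF-IDF vectors):

```python
import random

def k_means(points, k, iters=100):
    """Plain k-means on lists of floats; a sketch, not an optimized version."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = [sum(xs) / len(cluster) for xs in zip(*cluster)]
    return centroids, clusters

points = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
centroids, clusters = k_means(points, k=2)
```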
Idea
- Frequent terms carry more information about the cluster they might belong to
- Highly correlated frequent terms probably belong to the same cluster
- Candidate clusters are then generated from F = {F1, ..., Fk}, where each Fi is a set of frequent terms which occur together
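A rough, brute-force sketch of finding such co-occurring frequent term sets (the paper uses an Apriori-style search; `min_support` and the toy documents here are illustrative):

```python
from itertools import combinations

def frequent_term_sets(docs, min_support, max_size=3):
    """Find term sets that occur together in at least `min_support` docs."""
    vocabulary = sorted(set(t for d in docs for t in d))
    frequent = []
    for size in range(1, max_size + 1):
        for terms in combinations(vocabulary, size):
            support = sum(1 for d in docs if set(terms) <= d)
            if support >= min_support:
                frequent.append((terms, support))
    return frequent

docs = [{"java", "programming"}, {"java", "programming", "language"},
        {"java", "indonesia"}, {"india", "pakistan"}]
print(frequent_term_sets(docs, min_support=2))
```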
Classification
A tree
- Each node represents an evaluation of some attribute of the data
- Each edge represents a choice for the value of the attribute its node evaluates: the decision taken
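A toy illustration of that structure, using made-up attributes (`contains_java`, `contains_indonesia`) that anticipate the example a few slides ahead:

```python
# Toy decision tree: each internal node tests one attribute, and each
# outgoing edge corresponds to one value of that attribute.
tree = {
    "attribute": "contains_java",
    "edges": {
        True:  {"attribute": "contains_indonesia",
                "edges": {True: "travel", False: "computer-programming"}},
        False: "other",
    },
}

def classify(doc_attrs, node):
    while isinstance(node, dict):             # internal node: evaluate
        value = doc_attrs[node["attribute"]]  # the attribute it tests
        node = node["edges"][value]           # follow the chosen edge
    return node                               # leaf: the class label

print(classify({"contains_java": True, "contains_indonesia": False}, tree))
# -> computer-programming
```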
In practice, the exact values of the probabilities of each event are unknown, and are estimated from the samples
Naïve Bayes
- Assume all the wj events are independent
- The RHS of Bayes' rule then expands to a product: P(c | w1, ..., wn) ∝ P(c) * Π_j P(wj | c)
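A minimal multinomial Naïve Bayes sketch under this independence assumption, with add-one smoothing as one common way of estimating the probabilities from samples:

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (tokens, class). Returns priors and word counts."""
    class_counts = Counter(c for _, c in labeled_docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, c in labeled_docs:
        word_counts[c].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify_nb(tokens, class_counts, word_counts, vocab):
    total_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c in class_counts:
        # log P(c) + sum_j log P(wj | c), with add-one (Laplace) smoothing
        score = math.log(class_counts[c] / total_docs)
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in tokens:
            score += math.log((word_counts[c][w] + 1) / denom)
        if score > best_score:
            best, best_score = c, score
    return best

docs = [(["java", "program", "code"], "computing"),
        (["java", "indonesia", "travel"], "travel")]
model = train_nb(docs)
print(classify_nb(["java", "code"], *model))  # -> computing
```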
- If "java" and "program" occur together, then boost the probability of the class computer-programming
- If "java" and "indonesia" occur together, then the document is more likely about some other class
Problem?
- How do we come up with correlations like the above?
An example
Problem setting
- Labeling documents is a manual process
- A lot more unlabeled documents are available as compared to labeled ones
- Unlabeled documents contain information which could help in the classification activity
An example (contd.)
Expectation Maximization
- A useful technique for estimating hidden parameters
- In the previous example, the class labels were missing from some documents
- It consists of two steps, repeated till convergence (and convergence does occur):
  - E-step: set z^(k+1) = E[z | D; θ^(k)]
  - M-step: set θ^(k+1) = arg max_θ P(θ | D; z^(k+1))
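A compact sketch of this loop for semi-supervised Naïve Bayes, in the spirit of Nigam et al.; smoothing and the convergence test are simplified, and the variable names are illustrative:

```python
import numpy as np

def em_nb(X_lab, y_lab, X_unl, n_classes, iters=20):
    """X_*: document-term count matrices; y_lab: labels for X_lab.
    E-step: soft class labels for unlabeled docs; M-step: refit parameters."""
    Z = np.eye(n_classes)[y_lab]                 # labeled docs: one-hot labels
    resp = np.full((len(X_unl), n_classes), 1.0 / n_classes)  # initial guess
    X = np.vstack([X_lab, X_unl])
    for _ in range(iters):
        # M-step: estimate theta (priors, word probs) from all soft labels.
        W = np.vstack([Z, resp])                 # doc-class responsibilities
        priors = W.sum(axis=0) / W.sum()
        word_probs = (W.T @ X) + 1.0             # add-one smoothing
        word_probs /= word_probs.sum(axis=1, keepdims=True)
        # E-step: recompute class posteriors for the unlabeled documents.
        log_post = np.log(priors) + X_unl @ np.log(word_probs).T
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
    return priors, word_probs, resp

X_lab = np.array([[2, 0], [0, 2]]); y_lab = np.array([0, 1])
X_unl = np.array([[1, 0], [0, 3]])
print(em_nb(X_lab, y_lab, X_unl, n_classes=2)[2].round(2))
```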
Another example
Contd.
Idea
- Find a direction which maximizes the separation between classes. Why?
  - To reduce noise, or rather, to enhance the differences between classes
- Project the data points onto this direction
- For all data points not separated by this vector, choose another
Contd.
- Project all the document vectors into the space with these vectors as the basis vectors
- Now induce a decision tree on this projected representation
- The number of attributes is highly reduced
- Since this representation nicely separates the data points (documents), accuracy increases
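One way to compute such a direction is Fisher's linear discriminant; below is a two-class numpy sketch (the paper computes multiple projections, this shows a single one, and the toy points are made up):

```python
import numpy as np

def fisher_direction(X0, X1):
    """Direction maximizing between-class separation relative to
    within-class scatter: w = Sw^-1 (mu1 - mu0)."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    # Small ridge term keeps Sw invertible for near-degenerate data.
    w = np.linalg.solve(Sw + 1e-6 * np.eye(len(mu0)), mu1 - mu0)
    return w / np.linalg.norm(w)

X0 = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]])   # class 0 points
X1 = np.array([[3.0, 0.5], [3.2, 0.7], [2.8, 0.3]])   # class 1 points
w = fisher_direction(X0, X1)
print(X0 @ w, X1 @ w)   # projections onto w: the two classes separate
```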
- The WWW is a huge, directed graph, with documents as nodes and hyperlinks as the directed edges
- Apart from the text itself, this graph structure carries a lot of information about the usefulness of the nodes
- For example:
- 10 random, average people on the street say Mr. T. Ache is a good dentist
- 5 reputed doctors, including dentists, recommend Mr. P. Killer as a better dentist
- Who would you choose?
Kleinberg's HITS
- HITS: Hypertext Induced Topic Selection
- Nodes on the web can be categorized into two types: hubs and authorities
- Authorities are nodes which one refers to for definitive information about a topic
- Hubs point to authorities
- HITS computes hub and authority scores on a sub-universe of the web
HITS (contd.)
- Update rules: A(p) = Σ H(q) over all pages q linking to p, and H(p) = Σ A(q) over all pages q that p links to
- Repeat the above till convergence, normalizing the scores at each step
- Nodes with high A scores are relevant
- Relevant to what? Can we use this for efficient retrieval for a query?
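A minimal power-iteration sketch of HITS on a small adjacency-list graph, assuming the standard update rules above with per-round normalization:

```python
def hits(graph, iters=50):
    """graph: {node: [nodes it links to]}. Returns (hub, authority) scores."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # Authority score: sum of hub scores of the pages pointing to it.
        auth = {n: sum(hub[u] for u in graph if n in graph[u]) for n in nodes}
        # Hub score: sum of authority scores of the pages it points to.
        hub = {n: sum(auth[v] for v in graph.get(n, [])) for n in nodes}
        # Normalize so the scores don't blow up across iterations.
        na, nh = sum(auth.values()) or 1.0, sum(hub.values()) or 1.0
        auth = {n: a / na for n, a in auth.items()}
        hub = {n: h / nh for n, h in hub.items()}
    return hub, auth

graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(graph)   # a1 gets the highest authority score
```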
PageRank
- Similar to HITS, but all pages have only one score: a rank
  R(u) = c * Σ_{v ∈ Bu} R(v)/Nv
- Bu is the set of pages linking to u, and Nv is the number of out-links of v; c is a scaling factor (< 1)
- The higher the rank of the pages linking to a page, the higher is its own rank!
- To handle rank sinks (sets of pages which do not link outside themselves), the formula is modified as
  R(u) = c * Σ_{v ∈ Bu} R(v)/Nv + c * E(u)
- E(u), a distribution of rank over some set of pages, acts as a rank source (what kind of pages?)
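A minimal sketch of this iteration, assuming a uniform rank source E (one common choice):

```python
def pagerank(graph, c=0.85, iters=50):
    """graph: {page: [pages it links to]}.
    R(u) = c * sum over v in Bu of R(v)/Nv, plus a uniform rank source."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    N = len(nodes)
    rank = {n: 1.0 / N for n in nodes}
    for _ in range(iters):
        new_rank = {n: (1.0 - c) / N for n in nodes}  # uniform rank source E
        for v, out_links in graph.items():
            for u in out_links:
                # v passes an equal share of its rank to each page it links to.
                new_rank[u] += c * rank[v] / len(out_links)
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))   # C ends up with the highest rank
```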
The above was inferred from a set of documents, with some human help
References
- Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber
- Principles of Data Mining, David J. Hand et al.
- Text Classification from Labeled and Unlabeled Documents using EM, Kamal Nigam et al.
- Fast and accurate text classification via multiple linear discriminant projections, S. Chakrabarti et al.
- Frequent Term-Based Text Clustering, Florian Beil et al.
- The PageRank Citation Ranking: Bringing Order to the Web, Lawrence Page and Sergey Brin
- Untangling Text Data Mining, Marti A. Hearst, http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
- And others