DMDM Part 2


Classification
• Classification methods are supervised.
• Widely used for prediction purposes.
• Example: based on an email's content, email providers use classification to decide whether incoming email messages are spam.
Classification
• Classification methods are supervised.
• They start with a training set of pre-labelled observations and learn how the attributes of these observations contribute to the classification of future unlabelled observations.
Classification
• Classification methods are supervised.
• Example: existing marketing, sales, and customer demographic data can be used to develop a classifier that assigns a "purchase" or "no-purchase" label to potential future customers.
Classification
• A credit card company receives thousands of applications for new cards. Each application contains information about an applicant:
  – age
  – marital status
  – annual salary
  – outstanding debts
  – credit rating
  – etc.
• Problem: to decide whether an application should be approved, i.e. to classify applications into two categories, approved and not approved.

Classification
• Fundamental classification methods:
  • Classification Trees / Decision Trees
  • Naïve Bayes
Decision Tree and Classification Task

• A decision tree helps us classify data.
  – Internal nodes test an attribute
  – Edges correspond to the values of an attribute
  – External (leaf) nodes give the outcome of the classification
• Classification is made by posing questions, starting from the root node and following the edges down to a terminal node.
Decision Tree and Classification Task

Example : Classification
Name          | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class
Human         | Warm | hair     | yes | no   | no  | yes | no  | Mammal
Python        | Cold | scales   | no  | no   | no  | no  | yes | Reptile
Salmon        | Cold | scales   | no  | yes  | no  | no  | no  | Fish
Whale         | Warm | hair     | yes | yes  | no  | no  | no  | Mammal
Frog          | Cold | none     | no  | semi | no  | yes | yes | Amphibian
Komodo        | Cold | scales   | no  | no   | no  | yes | no  | Reptile
Bat           | Warm | hair     | yes | no   | yes | yes | yes | Mammal
Pigeon        | Warm | feathers | no  | no   | yes | yes | no  | Bird
Cat           | Warm | fur      | yes | no   | no  | yes | no  | Mammal
Leopard Shark | Cold | scales   | yes | yes  | no  | no  | no  | Fish
Turtle        | Cold | scales   | no  | semi | no  | yes | no  | Reptile
Penguin       | Warm | feathers | no  | semi | no  | yes | no  | Bird
Porcupine     | Warm | quills   | yes | no   | no  | yes | yes | Mammal
Eel           | Cold | scales   | no  | yes  | no  | no  | no  | Fish
Salamander    | Cold | none     | no  | semi | no  | yes | yes | Amphibian

What are the class labels of Dragon and Shark?

Decision Tree and Classification Task
Example : Classification
• Suppose a new species is discovered as follows.

Name: Gila Monster | Body Temperature: cold | Skin Cover: scales | Gives Birth: no | Aquatic Creature: no | Aerial Creature: no | Has Legs: yes | Hibernates: yes | Class: ?

• A decision tree induced from the data in the example above can be used to classify it.

Decision Tree and Classification Task
• The example illustrates how we can solve a classification problem by asking a series of questions about the attributes.
  – Each time we receive an answer, a follow-up question is asked, until we reach a conclusion about the class label of the test record.
• The series of questions and their answers can be organized in the form of a decision tree
  – a hierarchical structure consisting of nodes and edges.
• Once a decision tree is built, it can be applied to any test record to classify it.
Decision Tree
• Also known as Prediction Trees
• Input variables
• Output variable
• Nodes (test points)
• Leaf nodes
Decision Tree
• Decision tree varieties:
• Classification Trees: usually apply to output variables that are categorical, often binary in nature: yes/no, purchase or no purchase, etc.
• Regression Trees: apply to output variables that are numeric or continuous, e.g. the predicted price of a consumer good.
Decision Tree: Example
Day Outlook Temperature Humidity Wind Play Tennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No

Outlook
├─ Sunny → Humidity
│   ├─ High → No
│   └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
    ├─ Strong → No
    └─ Weak → Yes
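This tree can also be grown programmatically. Below is a minimal sketch, assuming scikit-learn and pandas are installed; the DataFrame simply re-enters the fourteen rows of the table above, and one-hot encoding is one reasonable way to feed categorical attributes to scikit-learn's tree (which may split on different attributes than the hand-built tree).

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The fourteen Play Tennis records from the table above.
data = pd.DataFrame({
    "Outlook":     ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                    "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                    "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity":    ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                    "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":        ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                    "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
    "PlayTennis":  ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                    "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

X = pd.get_dummies(data.drop(columns="PlayTennis"))   # one-hot encode the categoricals
y = data["PlayTennis"]
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Classify a new day: Sunny, Cool, High humidity, Strong wind.
new_day = pd.get_dummies(pd.DataFrame([{"Outlook": "Sunny", "Temperature": "Cool",
                                        "Humidity": "High", "Wind": "Strong"}]))
print(tree.predict(new_day.reindex(columns=X.columns, fill_value=0)))  # ['No']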
Information Gain and Impurity
• Entropy (impurity): E(S) = −Σ p_i · log2(p_i), where p_i is the proportion of class i in the set S.
• Information Gain: Gain(S, A) = E(S) − Σ (|S_v|/|S|) · E(S_v), summed over the values v of attribute A.
Decision Tree
• General algorithm: the objective is to construct a tree T from a training set S.
• Entropy: measures the impurity of a set of examples.
• Information Gain: measures the reduction in impurity achieved by splitting on an attribute; the attribute with the highest gain is chosen at each node.
Entropy
• Entropy of the class attribute (PlayTennis: 9 Yes, 5 No):
  p(No) = 5/14 = 0.36, p(Yes) = 9/14 = 0.64
  E(PlayTennis) = −0.64·log2(0.64) − 0.36·log2(0.36) ≈ 0.940
Entropy
• Entropy of two attributes (the class attribute split by a predictor attribute X):
  E(T, X) = Σ P(c)·E(c), summed over the values c of X.
• E(PlayTennis, Outlook) = (5/14)·0.971 + (4/14)·0 + (5/14)·0.971 = 0.693
Information Gain
• Find the most homogeneous branch: split on the attribute with the largest gain.
• Gain(PlayTennis, Outlook) = E(PlayTennis) − E(PlayTennis, Outlook)
• = 0.940 − 0.693 = 0.247
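A quick way to verify these numbers is to compute them directly. A minimal pure-Python sketch, using the class labels and Outlook values from the table above:

from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attr, labels):
    """Gain = E(labels) minus the weighted entropy of each attribute-value subset."""
    n = len(labels)
    remainder = sum(
        (attr.count(v) / n) * entropy([l for a, l in zip(attr, labels) if a == v])
        for v in set(attr)
    )
    return entropy(labels) - remainder

play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]

print(round(entropy(play), 3))             # 0.94
print(round(info_gain(outlook, play), 3))  # 0.247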
Information Gain
• Likewise, Gain(PlayTennis, Humidity) = 0.151, Gain(PlayTennis, Wind) = 0.048, Gain(PlayTennis, Temperature) = 0.029.
Continuing to split
• Outlook has the largest gain, so it becomes the root node; each branch is then split recursively on the remaining attributes until the leaves are pure, giving the tree shown earlier.
Decision Tree to Decision Rules
• A decision tree can easily be transformed to a
set of rules.
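For instance, the tree built above yields one rule per root-to-leaf path:

IF Outlook = Sunny AND Humidity = High THEN PlayTennis = No
IF Outlook = Sunny AND Humidity = Normal THEN PlayTennis = Yes
IF Outlook = Overcast THEN PlayTennis = Yes
IF Outlook = Rain AND Wind = Strong THEN PlayTennis = No
IF Outlook = Rain AND Wind = Weak THEN PlayTennis = Yes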
Overfitting
• In decision trees, overfitting occurs when the tree is designed to fit all samples in the training data set perfectly.
• The tree ends up with branches governed by strict, overly specific rules.
• This hurts accuracy when predicting samples that are not part of the training set.
• An overfitted tree is excessively dependent on irrelevant features of the training data.
Pruning
• Goal: Prevent overfitting to noise in the
data

Unpruned Decision Tree

[Figure: an unpruned tree for the weather data. The root splits on Outlook (Sunny / Overcast / Rainy); branches then split on Humidity (High / Normal), Windy (True / False) and Temp (Hot / Mild / Cool), producing many small leaves.]


Pruning
• Pruning is the process of removing the parts of the tree which add very little to the classification power of the tree.
• Pruning is done with two things in mind:
  • reducing the complexity
  • reducing the chances of overfitting
• Pruning should reduce the size of a learning tree without reducing predictive accuracy.
Examples of Applications for Decision Trees

• Categorizing a customer bank loan application on such factors as income level, years in the present job, timeliness of credit card payments and existence of a criminal record.
• Prioritizing patients for emergency room treatment based on age, gender, blood pressure, temperature, heart rate, severity of pain and other vital measurements.
• Using demographic data to determine the effect of a limited advertising budget on the number of likely buyers of a certain product.
Decision Tree
• Advantages & Shortcomings
• They are useful for variable selection, with the
most important predictors usually showing up
at the top of the tree.
• Decision trees are easy to use and explain with
simple math, no complex formulas. They
present visually all of the decision alternatives
for quick comparisons in a format that is easy
to understand with only brief explanations.
A Practice Example
age   | income | student | credit_rating | buys_computer
<=30  | high   | no      | fair          | no
<=30  | high   | no      | excellent     | no
31…40 | high   | no      | fair          | yes
>40   | medium | no      | fair          | yes
>40   | low    | yes     | fair          | yes
>40   | low    | yes     | excellent     | no
31…40 | low    | yes     | excellent     | yes
<=30  | medium | no      | fair          | no
<=30  | low    | yes     | fair          | yes
>40   | medium | yes     | fair          | yes
<=30  | medium | yes     | excellent     | yes
31…40 | medium | no      | excellent     | yes
31…40 | high   | yes     | fair          | yes
>40   | medium | no      | excellent     | no
Exercise
• Perform Naïve Bayes classification in Orange with the dataset
• datasets\weather.csv
• Create DT and apply it for predicting unclassified
data
• datasets\weather-test.csv
• Apply DT for Loan Prediction
• datasets\loan-prediction-problem-
dataset\train_u6lujuX_CVtuZ9i.csv
Confusion Matrix

• A confusion matrix is a table used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.
• It allows visualization of the performance of an algorithm.
• Most performance measures are computed from the confusion matrix.
Confusion Matrix

• The numbers of correct and incorrect predictions are summarized with count values and broken down by class. This is the key to the confusion matrix.
Confusion Matrix

n=14       | Predicted Yes | Predicted No
Actual Yes | 9             | 0
Actual No  | 0             | 5

n=14       | Predicted Yes | Predicted No
Actual Yes | TP            | FN
Actual No  | FP            | TN
Evaluation Measures

n=14       | Predicted Yes | Predicted No
Actual Yes | TP = 4        | FN = 5
Actual No  | FP = 2        | TN = 3

• Accuracy: Overall, how often is the classifier correct? (TP+TN)/total = 7/14 = 0.5
• Misclassification Rate: Overall, how often is it wrong? (FP+FN)/total = (2+5)/14 = 0.5; equivalent to 1 − Accuracy; also known as "Error Rate"
• True Positive Rate: When it's actually yes, how often does it predict yes? TP/actual yes = 4/9 ≈ 0.44; also known as "Sensitivity" or "Recall"
• True Negative Rate: When it's actually no, how often does it predict no? TN/actual no = 3/5 = 0.6; equivalent to 1 − False Positive Rate; also known as "Specificity"
• Precision: When it predicts yes, how often is it correct? TP/predicted yes = 4/6 ≈ 0.67
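These measures are simple to script. A minimal sketch using the counts from the matrix above:

# Confusion-matrix measures for the counts above: TP=4, FN=5, FP=2, TN=3.
TP, FN, FP, TN = 4, 5, 2, 3
total = TP + FN + FP + TN

accuracy    = (TP + TN) / total   # 0.5
error_rate  = (FP + FN) / total   # 0.5, i.e. 1 - accuracy
recall      = TP / (TP + FN)      # 0.44, true positive rate / sensitivity
specificity = TN / (TN + FP)      # 0.6, true negative rate
precision   = TP / (TP + FP)      # 0.67

print(f"accuracy={accuracy:.2f}  recall={recall:.2f}  "
      f"specificity={specificity:.2f}  precision={precision:.2f}")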
Ensemble Learning
• Ensemble learning is the process by which multiple models, such as classifiers or experts, are generated and combined to solve a particular computational intelligence problem.
• It is used to improve the performance of a model, or to reduce the risk of an unfortunate selection of a poor or weak model.
• Instead of choosing just one model, combining the outputs of multiple models (for example, simply averaging them) can reduce the risk of an unfortunate selection of a particularly poorly performing classifier.
• Useful when the training data is too large.
• Everyday examples: seeking several opinions before a medical procedure, reading multiple reviews for a product, etc.
Ensemble Learning
• No Free Lunch Theorem: there is no algorithm that is always the most accurate.
• Generate a group of base learners which, when combined, has higher accuracy.
The general procedure:
Step 1: From the original training data D, create multiple data sets D1, D2, …, Dt-1, Dt.
Step 2: Build multiple classifiers C1, C2, …, Ct-1, Ct, one per data set.
Step 3: Combine the classifiers into a single ensemble classifier C*.
Conditions for Ensemble Methods

• Training the same classifier on the same training data several times would give the same result for most machine learning algorithms.
  – Exception: methods where the training involves some randomness.
• Combining such identical classifiers would make no sense; the members must differ.
Diversity
• To be effective, the individual experts must exhibit some level of diversity among themselves.
• Diversity can be achieved in several ways.

How to produce diverse classifiers?

• We can combine different learning algorithms ("hybridization")
  – E.g. we can train a GMM, an SVM, a k-NN, … over the same data, and then combine their outputs
• We can combine the same learning algorithm trained over different subsets of the training data
  – We can also try using different subsets of the features
• For certain algorithms we can use the same algorithm over the same data, but with a different weighting over the data instances
Ensemble Learning: Aggregation Methods

• There are several methods to combine (aggregate) the outputs of the various classifiers.
• When the output is a class label:
  – Majority voting
  – Weighted majority voting (e.g. we can weight each classifier by its reliability, which also has to be estimated somehow, of course)
• When the output is numeric (e.g. a probability estimate d_j for each class c_j):
  – We can combine the d_j scores by taking their (weighted) mean, product, minimum, maximum, … (a voting sketch follows)
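As an illustration, a minimal sketch of majority and weighted majority voting over predicted class labels; the predictions and reliability weights below are made up:

from collections import Counter

def majority_vote(labels):
    """Return the most common predicted label."""
    return Counter(labels).most_common(1)[0][0]

def weighted_vote(labels, weights):
    """Add each classifier's weight to its predicted label; return the top label."""
    scores = Counter()
    for label, weight in zip(labels, weights):
        scores[label] += weight
    return scores.most_common(1)[0][0]

preds = ["spam", "spam", "ham"]               # three classifiers' outputs (made up)
print(majority_vote(preds))                   # spam
print(weighted_vote(preds, [0.5, 0.6, 2.0]))  # ham: the reliable third model wins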
Bagging: Bootstrap Aggregating

• Bagging is one of the earliest, most intuitive and perhaps the simplest ensemble-based algorithms, with good performance.
• Diversity of classifiers in bagging is obtained by using bootstrapped replicas of the training data:
• different training data subsets are randomly drawn, with replacement, from the entire training dataset.
• Each training data subset is used to train a different classifier of the same type.
• Individual classifiers are then combined by taking a simple majority vote of their decisions.
• For any given instance, the class chosen by the largest number of classifiers is the ensemble decision.
Bagging: Bootstrap Aggregating
• How bagging works as a method of increasing accuracy:
• Suppose that you are a patient and would like to have a diagnosis made based on your symptoms.
• Instead of asking one doctor, you may choose to ask several.
• If a certain diagnosis occurs more often than any other, you may choose it as the final or best diagnosis.
• That is, the final diagnosis is made based on a majority vote. Now replace each doctor with a classifier, and you have the basic idea behind bagging.
• Intuitively, a majority vote made by a large group of doctors may be more reliable than one made by a small group.
Bagging

• Sampling with replacement:

Original Data     | 1 2 3 4 5 6 7 8 9 10
Bagging (Round 1) | 7 8 10 8 2 5 10 10 5 9
Bagging (Round 2) | 1 4 9 1 2 3 2 7 3 2
Bagging (Round 3) | 1 8 5 10 5 5 9 6 3 7

• Build a classifier on each bootstrap sample (a minimal sketch follows).
Random Forest

• Random Forest is an extension of bagging for decision trees that can be used for classification or regression.
• Random Forests are an ensemble of k Decision Trees.
Random Forest
1. At the current node, randomly select p features from the D available features; p is usually much smaller than the total number of features D.
2. Split the current node using the best split point found among those p features, and reduce the set of available features from this node on.
3. Repeat steps 1 to 2 until either a maximum tree depth l has been reached or the splitting metric reaches some extremum.
4. Repeat steps 1 to 3 for each of the k trees in the forest.
5. Vote or aggregate on the output of each tree in the forest (a minimal sketch follows).
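A minimal scikit-learn sketch, assuming it is installed; max_features plays the role of p above and n_estimators the role of k, and the toy dataset and values are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # k: number of trees in the forest
    max_features="sqrt",  # p: features considered at each split
    max_depth=None,       # no depth limit l: grow until leaves are pure
    random_state=0,
)
print(cross_val_score(forest, X, y, cv=5).mean())  # mean accuracy over 5 folds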
Contd..
• Mexico has very high segregation in annual income between rich and poor. Our task is to come up with an accurate predictive algorithm to estimate the annual income bracket of an individual in Mexico. The income brackets are as follows:
  1. Below $40,000
  2. $40,000 – 150,000
  3. More than $150,000
• The following information is available for each individual:
  1. Age, 2. Gender, 3. Highest educational qualification, 4. Working industry, 5. Residence in Metro/Non-metro
• We need to come up with an algorithm to give an accurate prediction for an individual who has the following traits:
  1. Age: 35 years, 2. Gender: Male, 3. Highest Educational Qualification: Diploma holder, 4. Industry: Manufacturing, 5. Residence: Metro
Contd..
• Following are the outputs of the 5 different models.
Contd..
• Using these 5 models, we need to come up with a single set of probabilities of belonging to each of the salary classes.
• For simplicity, we will just take a mean of the probabilities in this case study. Other than the simple mean, a voting method can also be used to come up with the final prediction (a sketch of the mean follows).
• To come up with the final prediction, let's locate the following profile in each model:
  1. Age: 35 years, 2. Gender: Male, 3. Highest Educational Qualification: Diploma holder, 4. Industry: Manufacturing, 5. Residence: Metro
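Taking the mean of the per-class probabilities is a one-liner. A sketch with hypothetical outputs for the five models (the actual values from the slide's figure are not reproduced, so these numbers are made up):

import numpy as np

# Hypothetical per-model probabilities for the three salary brackets
# (<$40K, $40K-150K, >$150K) for the profile above; values are made up.
model_probs = np.array([
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.65, 0.25, 0.10],
    [0.55, 0.35, 0.10],
    [0.75, 0.15, 0.10],
])

mean_probs = model_probs.mean(axis=0)  # simple mean across the 5 models
print(mean_probs)                      # combined probability for each bracket
print(mean_probs.argmax())             # index of the predicted bracket (0 here)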
Random Forest Applications

• Banking: Banks serve many loyal customers as well as fraudulent ones. Random forest analysis helps determine whether a customer is loyal or fraudulent: the algorithm identifies fraudulent transactions from series of patterns in the data.
• Medicines: Medicines need a complex combination of specific chemicals, and random forests can be used to identify good combinations. Machine learning also makes it easier to detect and predict the drug sensitivity of a medicine, and helps identify a patient's disease by analysing the patient's medical record.
• Stock Market: Machine learning also plays a role in stock market analysis. Random forests can be used to analyse the behaviour of the stock market and to show the expected loss or profit from purchasing a particular stock.
• E-Commerce: When it is difficult to recommend what products a customer should see, a random forest can suggest the products a customer is most likely to want, using purchase patterns and the customer's product interests to recommend similar products.
Boosting
• Boosting is an ensemble technique that
attempts to create a strong classifier from a
number of weak classifiers.
• The idea of boosting is to train weak
learners sequentially, each trying to correct
its predecessor.
Boosting
• The basic idea of boosting is to generate a series of base learners which complement each other.
  – For this, we force each learner to focus on the mistakes of the previous learner.
• In boosting, a model is built from the training data, then a second model is created that attempts to correct the errors of the first model. Models are added until the training set is predicted perfectly or a maximum number of models is reached.
• AdaBoost was the first successful boosting algorithm developed for binary classification.
Boosting
• An iterative procedure to adaptively change
distribution of training data by focusing more
on previously misclassified records
– Initially, all N records are assigned equal weights
– Unlike bagging, weights may change at the end of each boosting round
Boosting (contd.)
• We represent the importance of each sample by assigning weights to
the samples
• Correct classification → smaller weights
• Misclassified samples → larger weights
• The weights can influence the algorithm in two ways
– Boosting by sampling: the weights influence the resampling process
• This is a more general solution
– Boosting by weighting: the weights influence the learner
• Boosting also makes the aggregation process more clever: We will
aggregate the base learners using weighted voting
– Better weak classifier gets a larger weight
– We iteratively add new base learners, and iteratively increase the accuracy of the
combined model
Boosting
• Records that are wrongly classified will have their weights increased.
• Records that are classified correctly will have their weights decreased.

Original Data      | 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1) | 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) | 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3) | 4 4 8 10 4 5 4 6 3 4

• Example 4 is hard to classify.
• Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds (a minimal AdaBoost sketch follows).
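A minimal AdaBoost sketch, assuming scikit-learn is installed; depth-1 trees ("stumps") are the usual weak learners, and the toy dataset and parameter values are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each round fits a stump to the reweighted data, then increases the weights
# of the records that the stump misclassified.
boost = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # the weak learner
    n_estimators=50,                      # number of boosting rounds
    random_state=0,
).fit(X_tr, y_tr)

print("AdaBoost accuracy:", boost.score(X_te, y_te))  # weighted vote of 50 stumps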
Stacking
• Stacking or Stacked Generalization is an ensemble
machine learning algorithm.
• It uses a meta-learning algorithm to learn how to
best combine the predictions from two or more
base machine learning algorithms.
• The benefit of stacking is that it can harness the
capabilities of a range of well-performing models
on a classification or regression task and make
predictions that have better performance than
any single model in the ensemble.
Stacking
• Unlike bagging, in stacking, the models are
typically different (e.g. not all decision trees)
and fit on the same dataset (e.g. instead of
samples of the training dataset).
• Unlike boosting, in stacking, a single model is
used to learn how to best combine the
predictions from the contributing models (e.g.
instead of a sequence of models that correct
the predictions of prior models).



Stacking
• Let m models be trained on a dataset of n samples.
• Stacking (or stacked generalization) builds the models in the ensemble using different learning algorithms (e.g. one neural network, one decision tree, …), as opposed to bagging or boosting, which train various incarnations of the same learner (e.g. all decision trees).
• The outputs of the models are combined to compute the ultimate prediction of any instance.
Stacking
• Question: how to build a heterogeneous ensemble consisting of different types of models (e.g., a decision tree and a neural network)?
• Problem: the models can be vastly different in accuracy.
• Idea: to combine the predictions of the base learners, do not just vote; instead, use a meta learner.
• In stacking, the base learners are also called level-0 models.
• The meta learner is called the level-1 model.
• The predictions of the base learners are the input to the meta learner.
• The base learners are usually different learning schemes (a minimal sketch follows).
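A minimal scikit-learn sketch of a heterogeneous stack, assuming it is installed: two level-0 models whose predictions feed a logistic-regression level-1 meta learner (the toy dataset and model choices are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()),  # level-0 models
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(),            # level-1 meta learner
).fit(X_tr, y_tr)

print("stacking accuracy:", stack.score(X_te, y_te))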

Linear Regression
• Linear regression is perhaps one of the most well-known and well-understood algorithms in statistics and machine learning.
• Linear regression was developed in the field of statistics and is studied as a model for understanding the relationship between input and output numerical variables.
• Linear regression is a linear model, i.e. a model that assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, y can be calculated from a linear combination of the input variables (x).
Linear Regression
• When there is a single input variable (x), the method is referred to as simple linear regression.
• When there are multiple input variables, the method is referred to as multiple linear regression.
• The linear equation assigns one scale factor to each input value or column, called a coefficient and represented by the Greek letter beta (β). One additional coefficient is also added, giving the line an additional degree of freedom (e.g. moving up and down on a two-dimensional plot); it is often called the intercept or the bias coefficient.
• For example, in a simple regression problem (a single x and a single y), the form of the model would be:
• y = β0 + β1·x
• In higher dimensions, when we have more than one input (x), the line is called a plane or a hyper-plane. The representation is therefore the form of the equation together with the specific values used for the coefficients (e.g. β0 and β1 in the above example).
Making Predictions with Linear Regression

• Imagine we are predicting weight (y) from height (x). Our linear regression model representation for this problem would be:
• y = B0 + B1 * x1, or weight = B0 + B1 * height
• where B0 is the bias coefficient and B1 is the coefficient for the height column. We use a learning technique to find a good set of coefficient values. Once found, we can plug in different height values to predict the weight.
• For example, let's use B0 = 0.1 and B1 = 0.5. Let's plug them in and calculate the weight (in kilograms) for a person with a height of 182 centimeters:
• weight = 0.1 + 0.5 * 182
• weight = 91.1
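Finding B0 and B1 from data is a simple least-squares problem. A minimal numpy sketch with made-up height/weight pairs:

import numpy as np

# Made-up training data: heights (cm) and weights (kg).
heights = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weights = np.array([52.0, 58.0, 65.0, 72.0, 80.0])

# Solve weight = B0 + B1 * height by ordinary least squares.
A = np.column_stack([np.ones_like(heights), heights])
(B0, B1), *_ = np.linalg.lstsq(A, weights, rcond=None)

print(f"B0 = {B0:.2f}, B1 = {B1:.2f}")
print("predicted weight at 182 cm:", B0 + B1 * 182)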
Making Predictions with Linear Regression

• Regression models are used to describe relationships between variables by fitting a line to the observed data. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.
• Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable. You can use multiple linear regression when you want to know:
• How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
• The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).
Logistic Regression
• Logistic regression is a supervised learning classification
algorithm used to assign observations into a discrete set of
classes.
• The nature of target or dependent variable is dichotomous,
which means there would be only two possible classes.
• In simple words, the dependent variable is binary in nature
having data coded as either 1 (stands for success/yes) or 0
(stands for failure/no).
• Some of the examples of classification problems are Email
spam or not spam, Online transactions Fraud or not Fraud,
Tumor Malignant or Benign etc.
• Logistic regression transforms its output using the logistic
sigmoid function to return a probability value.
Logistic Regression
• Logistic Regression is a machine learning algorithm used for classification problems; it is a predictive analysis algorithm based on the concept of probability.
• Logistic Regression passes its output through the 'Sigmoid function', also known as the 'logistic function', instead of using a linear function directly, and uses a more complex cost function than linear regression.
• The hypothesis of logistic regression limits its output to between 0 and 1. Linear functions fail to represent it, as they can take values greater than 1 or less than 0, which is not possible under the hypothesis of logistic regression.
What is the Sigmoid Function?
• In order to map predicted values to probabilities, we use the Sigmoid function: σ(z) = 1 / (1 + e^(−z)). The function maps any real value into another value between 0 and 1. In machine learning, we use the sigmoid to map predictions to probabilities.
• For logistic regression we modify the linear equation a little, i.e.
• σ(Z) = σ(β₀ + β₁X)
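A minimal sketch of the sigmoid and the resulting prediction rule; the coefficient values here are made up:

import numpy as np

def sigmoid(z):
    """Map any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -4.0, 0.8  # made-up coefficients
x = 6.0             # a single input value

p = sigmoid(b0 + b1 * x)  # probability of class 1, about 0.69 here
print(p)
print("Class 1" if p >= 0.5 else "Class 2")  # threshold at 0.5 (the decision boundary)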
Regression Models
• Binary Logistic Regression Model − the simplest form of logistic regression is binary or binomial logistic regression, in which the target or dependent variable can have only 2 possible types, either 1 or 0.
• Multinomial Logistic Regression Model − another useful form of logistic regression is multinomial logistic regression, in which the target or dependent variable can have 3 or more possible unordered types, i.e. types having no quantitative significance. For example, these variables may represent "Type A", "Type B" or "Type C".
Decision Boundary
• We expect our classifier to give us a set of outputs or classes based on probability when we pass the inputs through a prediction function that returns a probability score between 0 and 1.
• With the threshold set at 0.5 (as shown in the graph), if the prediction function returns a value of 0.7 then the observation is classified as Class 1; if it returns a value of 0.2 then the observation is classified as Class 2.
Sample Size
• Very small samples suffer from large sampling error.
• A very large sample size decreases the chance of error.
• Logistic regression requires a larger sample size than multiple regression.
• Hosmer and Lemeshow recommended a sample size greater than 400.
Exercise
Exercise: predicting heart disease
1. The data contains:
->age - age in years
->sex - (1 = male; 0 = female)
->cp - chest pain type
->trestbps - resting blood pressure (in mm Hg on admission to the hospital)
->chol - serum cholesterol in mg/dl
->fbs - (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
->restecg - resting electrocardiographic results
->thalach - maximum heart rate achieved
->exang - exercise induced angina (1 = yes; 0 = no)
->oldpeak - ST depression induced by exercise relative to rest
->slope - the slope of the peak exercise ST segment
->ca - number of major vessels (0-3) colored by fluoroscopy
->thal - 3 = normal; 6 = fixed defect; 7 = reversible defect
->target - have disease or not (1 = yes, 0 = no)
Web Mining
• Discovering knowledge from and about the WWW is one of the basic abilities of an intelligent agent.

WWW → Knowledge
Data Mining and Web Mining

• Data mining: turn data into knowledge.
• Web mining: apply data mining techniques to extract and uncover knowledge from web documents and services.
WWW Specifics
• Web: A huge, widely-distributed, highly
heterogeneous, semi-structured,
hypertext/hypermedia, interconnected
information repository
• Web is a huge collection of documents plus
– Hyper-link information
– Access and usage information

Web Mining Taxonomy

• Web Content Mining
  – Web Page Content Mining
  – Search Result Mining
• Web Structure Mining
  – Capturing the Web's structure using link interconnections
• Web Usage Mining
  – General Access Pattern Mining
  – Customized Usage Tracking
Web Content Mining
• Text mining
• Data mining in text: finding something useful and surprising in a text collection
• Text mining vs. information retrieval
Types of text mining
• Keyword (or term) based association analysis
  – Collects sets of keywords or terms that often occur together and then discovers the association relationships among them. First, it preprocesses the text data by parsing, stemming, removing stop words, etc.; once the data is preprocessed, it runs association mining algorithms (a preprocessing sketch follows this list).
• Automatic document (topic) classification
  – Used for the automatic classification of huge numbers of online text documents like web pages, emails, etc.
• Similarity detection
  – cluster documents by a common author
  – cluster documents containing information from a common source
• Sequence analysis: predicting a recurring event, discovering trends
• Anomaly detection: finding information that violates usual patterns
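As an illustration of the preprocessing step mentioned above, a minimal pure-Python sketch that tokenizes, removes stop words (no stemming here), and counts keyword co-occurrence within documents; the stop-word list and documents are made up:

import re
from collections import Counter
from itertools import combinations

STOP_WORDS = {"the", "a", "in", "of", "and", "is", "to", "has", "been"}  # tiny made-up list

docs = [
    "The game in London was horrific",
    "England has been beating France in the game",
]

def preprocess(text):
    """Lowercase, tokenize, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

# Count how often pairs of keywords occur together in the same document.
pair_counts = Counter()
for doc in docs:
    terms = sorted(set(preprocess(doc)))
    pair_counts.update(combinations(terms, 2))

print(pair_counts.most_common(3))  # the most frequently co-occurring keyword pairs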
Types of text mining (cont.)
• discovery of frequent phrases
• text segmentation (into logical chunks)
• event detection and tracking

Text Classification: An Example

Training set (label: Hooligan?):
1. "An English football fan …" — Yes
2. "During a game in Italy …" — Yes
3. "England has been beating France …" — Yes
4. "Italian football fans were cheering …" — No
5. "An average USA salesman earns 75K" — No
6. "The game in London was horrific" — Yes
7. "Manchester city is likely to win the championship" — Yes
8. "Rome is taking the lead in the football league" — Yes

Test set (labels to predict):
• "A Danish football fan …" — ?
• "Turkey is playing vs. France. The Turkish fans …" — ?

The training set is used to learn a classifier (model), which is then applied to label the test set.
What is Clustering?

Given:
• A source of textual documents
• A similarity measure (e.g., how many words are common in these documents)

Find:
• Several clusters of documents that are relevant to each other

[Figure: the clustering system takes the document source and the similarity measure and groups the documents into clusters.]
Information Retrieval

Given:
• A source of textual documents
• A user query (text based)

Find:
• A set (ranked) of documents that are relevant to the query

[Figure: the IR system takes the document source and the query and returns ranked documents.]
Intelligent Information Retrieval
• Meaning of words
  – Synonyms: "buy" / "purchase"
  – Ambiguity: "bat" (baseball vs. mammal)
• Order of words in the query
  – "hot dog stand in the amusement park" vs. "hot amusement stand in the dog park"
• User dependency for the data
  – direct feedback
  – indirect feedback
• Authority of the source
  – IBM is more likely to be an authoritative source than any other company.
Intelligent Web Search
• Combine the intelligent IR tools:
  – meaning of words
  – order of words in the query
  – user dependency for the data
  – authority of the source
• With the unique web features:
  – retrieve hyper-link information
  – utilize hyper-links as input
What is Information Extraction?

Given:
• A source of textual documents
• A well defined, limited query (text based)

Find:
• Sentences with relevant information
• Extract the relevant information and ignore non-relevant information (important!)
• Link related information and output it in a predetermined format
Querying Extracted Information

[Figure: from a source of documents, the extraction system answers queries (e.g. Query 1: job title; Query 2: salary), combines the query results, and returns ranked documents with the relevant pieces of information (Relevant Info 1, 2, 3).]
• Thank You.
