DMDM Part 2
Classification
• Classification Methods are supervised
• Widely used for prediction
• Example:
• Email providers use classification to decide,
based on an email’s content, whether an
incoming message is spam.
Classification
• Classification Methods are supervised
• They start with a training set of pre-labelled
observations and learn how the attributes
of these observations contribute to the
classification of future unlabeled observations
Classification
• Classification Methods are supervised
• Example:
• Existing marketing, sales, and customer
demographic data can be used to develop a
classifier that assigns a “purchase” or
“no-purchase” label to potential future customers.
Classification
• A credit card company receives thousands of applications
for new cards. Each application contains information
about an applicant,
– age
– Marital status
– annual salary
– outstanding debts
– credit rating
– etc.
• Problem: to decide whether an application should be
approved, i.e. to classify applications into two categories,
approved and not approved.
Classification
• Fundamental Classification Methods :
• Classification Trees/Decision Trees
• Naive Bayes
Decision Tree and Classification Task
• Such a classification is, in fact, made by posing a series of questions,
starting at the root node and following the answers down to a terminal (leaf) node.
Decision Tree and Classification Task
Example : Classification
Name           Body Temperature  Skin Cover  Gives Birth  Aquatic Creature  Aerial Creature  Has Legs  Hibernates  Class
Human          Warm              hair        yes          no                no               yes       no          Mammal
Python         Cold              scales      no           no                no               no        yes         Reptile
Salmon         Cold              scales      no           yes               no               no        no          Fish
Whale          Warm              hair        yes          yes               no               no        no          Mammal
Frog           Cold              none        no           semi              no               yes       yes         Amphibian
Komodo         Cold              scales      no           no                no               yes       no          Reptile
Bat            Warm              hair        yes          no                yes              yes       yes         Mammal
Pigeon         Warm              feathers    no           no                yes              yes       no          Bird
Cat            Warm              fur         yes          no                no               yes       no          Mammal
Leopard shark  Cold              scales      yes          yes               no               no        no          Fish
Turtle         Cold              scales      no           semi              no               yes       no          Reptile
Penguin        Warm              feathers    no           semi              no               yes       no          Bird
Porcupine      Warm              quills      yes          no                no               yes       yes         Mammal
Eel            Cold              scales      no           yes               no               no        no          Fish
Salamander     Cold              none        no           semi              no               yes       yes         Amphibian
Decision Tree and Classification Task
Example : Classification
• Suppose, a new species is discovered as follows.
Name           Body Temperature  Skin Cover  Gives Birth  Aquatic Creature  Aerial Creature  Has Legs  Hibernates  Class
Decision Tree and Classification Task
• The example illustrates how we can solve a classification problem by asking a
series of questions about the attributes.
– Each time we receive an answer, a follow-up question is asked until we reach a
conclusion about the class label of the test record.
• The series of questions and their answers can be organized in the form of a
decision tree
– A hierarchical structure consisting of nodes and edges
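As a rough sketch of the "series of questions" idea, the function below hard-codes one possible question sequence for the vertebrate example. The particular questions and their order are an assumption made for illustration (the slides do not show the tree); they were chosen only so that the rows copied from the table above are classified correctly.

```python
# Hand-built question sequence (an illustrative assumption, not the tree
# from the slides). Each `if` is one question asked at a node.
def classify(body_temp, skin_cover, gives_birth, aquatic):
    if body_temp == "warm":                     # root question
        # follow-up question for warm-blooded animals
        return "Mammal" if gives_birth == "yes" else "Bird"
    if skin_cover == "none":                    # cold-blooded, no skin cover
        return "Amphibian"
    # cold-blooded with scales: fully aquatic animals are fish
    return "Fish" if aquatic == "yes" else "Reptile"

# A few rows copied from the table:
# (name, body temperature, skin cover, gives birth, aquatic, class)
rows = [
    ("Human",  "warm", "hair",     "yes", "no",   "Mammal"),
    ("Python", "cold", "scales",   "no",  "no",   "Reptile"),
    ("Salmon", "cold", "scales",   "no",  "yes",  "Fish"),
    ("Whale",  "warm", "hair",     "yes", "yes",  "Mammal"),
    ("Frog",   "cold", "none",     "no",  "semi", "Amphibian"),
    ("Pigeon", "warm", "feathers", "no",  "no",   "Bird"),
    ("Turtle", "cold", "scales",   "no",  "semi", "Reptile"),
]

# Names of rows the question sequence gets wrong (empty if all are correct)
mistakes = [r[0] for r in rows if classify(*r[1:5]) != r[5]]
print(mistakes)
```

Each root-to-leaf path through the `if` statements corresponds to one path from the root node to a leaf node of a decision tree.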
Decision Tree
• Also known as Prediction Trees
• Input variable
• Output variable
• Nodes (test points)
• Leaf Nodes
Decision Tree
• Decision Tree Varieties:
• Classification Trees: usually apply to output
variables that are categorical, often binary in
nature: yes/no, purchase or not purchase, etc.
• Regression Trees: apply to output variables
that are numeric or continuous, e.g. the predicted
price of a consumer good.
Decision Tree: Example
Day Outlook Temperature Humidity Wind Play Tennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
Outlook
• Play Tennis = No: 5/14 = 0.36
• Play Tennis = Yes: 9/14 = 0.64
Entropy
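• Entropy using two attributes (the target T after splitting on attribute X):
E(T, X) = Σ P(c) · E(c), summed over the values c of X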
Information Gain
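• Finding the most homogeneous branches: split on the attribute with the
largest information gain
• Gain(Play Tennis, Outlook) = E(Play Tennis) − E(Play Tennis, Outlook)
= 0.940 − 0.693 = 0.247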
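The gain for Outlook can be checked with a short sketch. The rows are copied from the Play Tennis table; the function names `entropy` and `split_entropy` are illustrative choices, not taken from the slides.

```python
from collections import Counter
from math import log2

# Rows copied from the Play Tennis table:
# (Outlook, Temperature, Humidity, Wind, Play Tennis)
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def split_entropy(rows, col):
    """Weighted average entropy of the class after splitting on column `col`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row[-1])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

base = entropy([row[-1] for row in DATA])
gains = {name: base - split_entropy(DATA, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)   # attribute with the largest information gain
print(round(base, 3), round(gains["Outlook"], 3), best)
```

The unrounded split entropy for Outlook is about 0.6935, which the slide shows as 0.693; the gain rounds to 0.247 either way, and Outlook has the largest gain of the four attributes.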
Information Gain
Continuing to split
Decision Tree
Decision Tree to Decision Rules
• A decision tree can easily be transformed to a
set of rules.
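A minimal sketch of the transformation: every root-to-leaf path becomes one IF ... THEN rule. The tree itself is not reproduced in the text, so the nested structure below is an assumption (the tree usually obtained for the Play Tennis table); the `(attribute, branches)` encoding is also just an illustrative choice.

```python
# Assumed tree for the Play Tennis table (not shown in the slides).
# An inner node is (attribute, {value: child}); a leaf is a class label.
TREE = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain": ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def tree_to_rules(node, conditions=()):
    """Return one IF ... THEN rule per root-to-leaf path."""
    if isinstance(node, str):                    # leaf: emit the finished rule
        tests = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {tests} THEN Play Tennis = {node}"]
    attribute, branches = node
    rules = []
    for value, child in branches.items():        # one branch per attribute value
        rules += tree_to_rules(child, conditions + ((attribute, value),))
    return rules

rules = tree_to_rules(TREE)
for rule in rules:
    print(rule)
```

The assumed tree has five leaves, so it yields five rules, for example `IF Outlook = Sunny AND Humidity = High THEN Play Tennis = No`.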
Overfitting
• In decision trees, overfitting occurs when
the tree is built to fit every sample in the
training data set perfectly.
• The tree then ends up with branches that
encode overly specific rules.
• This reduces accuracy when predicting
samples that are not part of the
training set.
• An overfitted tree depends excessively on
irrelevant features of the training data
Overfitting
Pruning
• Goal: Prevent overfitting to noise in the
data
Unpruned Decision Tree
[Figure: unpruned tree that splits on Outlook, then Humidity (High / Normal), then Windy, then Temperature (Hot / Mild / Cool)]
Measures
             Predicted YES  Predicted NO
Actual YES   TP = 4         FN = 5
Actual NO    FP = 2         TN = 3
• True Positive Rate: When it's actually yes, how often does it predict yes?
TP/actual yes = 4/9 = 0.44, also known as "Sensitivity" or "Recall"
• True Negative Rate: When it's actually no, how often does it predict no?
TN/actual no = 3/5 = 0.6, equivalent to 1 minus the False Positive Rate, also
known as "Specificity"
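The two rates can be reproduced from the four confusion-matrix counts. TP, FP and TN are the values quoted above; FN = 5 is derived here from "actual yes = 9" and is not stated explicitly in the text.

```python
# Confusion-matrix counts for the example
tp, fn = 4, 5    # actual yes = tp + fn = 9 (fn is derived, see lead-in)
fp, tn = 2, 3    # actual no  = fp + tn = 5

tpr = tp / (tp + fn)    # true positive rate: sensitivity / recall
tnr = tn / (tn + fp)    # true negative rate: specificity
fpr = fp / (fp + tn)    # false positive rate = 1 - specificity

print(round(tpr, 2), round(tnr, 2), round(fpr, 2))
```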
Step 1: Create multiple data sets D1, D2, ..., Dt-1, Dt
Step 2: Build multiple classifiers C1, C2, ..., Ct-1, Ct
Step 3: Combine the classifiers into C*
Conditions for Ensemble Methods
Original Data 1 2 3 4 5 6 7 8 9 10
Bagging (Round 1) 7 8 10 8 2 5 10 10 5 9
Bagging (Round 2) 1 4 9 1 2 3 2 7 3 2
Bagging (Round 3) 1 8 5 10 5 5 9 6 3 7
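Each bagging round in the table is a bootstrap sample: ten record ids drawn uniformly with replacement from the original ten, so some ids repeat and others are left out. The sketch below draws such samples and shows one common way to combine classifiers, a majority vote. The seed and the helper names are arbitrary assumptions, and the samples it draws will not match the rows of the table.

```python
import random
from collections import Counter

def bootstrap_sample(records, rng):
    """Draw len(records) items uniformly with replacement (one bagging round)."""
    return [rng.choice(records) for _ in records]

def majority_vote(predictions):
    """Combine the base classifiers' predictions for one record."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)            # fixed seed, chosen arbitrarily for repeatability
original = list(range(1, 11))     # record ids 1..10, as in the table
rounds = [bootstrap_sample(original, rng) for _ in range(3)]
for i, sample in enumerate(rounds, start=1):
    print(f"Bagging (Round {i})", sample)

# Three hypothetical base classifiers voting on a single record
print(majority_vote(["Yes", "No", "Yes"]))
```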
Random forest Applications
• Banking Sector: Banks serve a very large customer base that includes both loyal and fraudulent customers. A random forest
can be used to classify whether a customer is likely to be loyal or fraudulent, and to flag fraudulent transactions by
recognizing patterns across a series of transactions.
• Medicine: Medicines require a complex combination of specific chemicals, and a random forest can help identify promising
combinations. Machine learning has also made it easier to detect and predict the drug sensitivity of a medicine, and it
helps to identify a patient’s disease by analyzing the patient’s medical record.
• Stock Market: A random forest can be used to analyze the behavior of the stock market and to estimate the expected loss
or profit from purchasing a particular stock.
• E-Commerce: When it is difficult to decide which products to recommend to a customer, a random forest can be used.
By following the patterns in a customer’s product interests, the system can suggest similar products that the customer
is more likely to want.
Boosting
• Boosting is an ensemble technique that
attempts to create a strong classifier from a
number of weak classifiers.
• The idea of boosting is to train weak
learners sequentially, each trying to correct
its predecessor.
Boosting
• The basic idea of boosting is to generate a series of base learners
which complement each other
– For this, we will force each learner to focus on the mistakes of the
previous learner
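One well-known way to "force each learner to focus on the mistakes of the previous learner" is to reweight the training records after every round, as AdaBoost does. The slides do not name a specific algorithm, so the following is only a sketch under that assumption: the weak learners are one-threshold decision stumps, and the ten data points are made up for illustration.

```python
from math import exp, log

def stump_predict(stump, x):
    """A decision stump: one threshold test on a single numeric input."""
    threshold, polarity = stump
    return polarity if x < threshold else -polarity

def best_stump(xs, ys, weights):
    """Weak learner: the stump with the lowest weighted error."""
    best, best_err = None, float("inf")
    for threshold in [x + 0.5 for x in xs[:-1]]:   # xs assumed sorted, spaced by 1
        for polarity in (+1, -1):
            stump = (threshold, polarity)
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(stump, x) != y)
            if err < best_err:
                best, best_err = stump, err
    return best, best_err

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n                      # start with equal weights
    model = []                                   # list of (alpha, stump)
    for _ in range(rounds):
        stump, err = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * log((1 - err) / err)       # voting weight of this learner
        # Raise the weight of misclassified records, lower the rest
        weights = [w * exp(-alpha * y * stump_predict(stump, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
        model.append((alpha, stump))
    return model

def predict(model, x):
    """Weighted vote of all weak learners."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in model)
    return 1 if score >= 0 else -1

xs = list(range(10))                             # made-up toy data
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs] == ys)
```

No single stump can separate this toy data (the best one misclassifies three of the ten points), but the weighted vote of three stumps classifies all ten correctly.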
Linear Regression
• Linear regression is perhaps one of the most well
known and well understood algorithms in statistics and
machine learning.
• Linear regression was developed in the field of statistics
and is studied as a model for understanding the
relationship between input and output numerical
variables.
• Linear regression is a linear model, i.e. a model that
assumes a linear relationship between the input
variables (x) and the single output variable (y). More
specifically, y can be calculated from a linear
combination of the input variables (x).
Linear Regression
• When there is a single input variable (x), the method is referred to
as simple linear regression.
• When there are multiple input variables, the method is referred to as multiple
linear regression.
• The linear equation assigns one scale factor to each input value or column,
called a coefficient and represented by the Greek letter beta (β).
One additional coefficient is also added, giving the line an additional
degree of freedom (e.g. moving up and down on a two-dimensional plot);
it is often called the intercept or the bias coefficient.
• For example, in a simple regression problem (a single x and a single y), the
form of the model would be:
• y = β0 + β1*x
• In higher dimensions, when we have more than one input (x), the line
becomes a plane or a hyper-plane. The representation of the model is therefore
the form of the equation together with the specific values used for the
coefficients (e.g. β0 and β1 in the above example)
Making Predictions with Linear Regression
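Once β0 and β1 are known, a prediction is just the evaluation of y = β0 + β1*x. The sketch below estimates the two coefficients with ordinary least squares, a common fitting method that is assumed here rather than taken from the slides, and then predicts y for a new x. The data points and function names are made up for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares estimates of b0 (intercept) and b1 (slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx                 # slope
    b0 = mean_y - b1 * mean_x      # intercept
    return b0, b1

def predict(b0, b1, x):
    """Evaluate the fitted line y = b0 + b1 * x."""
    return b0 + b1 * x

xs = [1, 2, 3, 4, 5]               # made-up inputs
ys = [3, 5, 7, 9, 11]              # generated from y = 1 + 2x, no noise
b0, b1 = fit_simple_linear_regression(xs, ys)
print(b0, b1, predict(b0, b1, 6))
```

Because the toy data lie exactly on a line, the fit recovers β0 = 1 and β1 = 2, and the prediction for x = 6 is 13.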
[Figure labels: WWW, Knowledge]
Data Mining and Web Mining
WWW Specifics
• Web: A huge, widely-distributed, highly
heterogeneous, semi-structured,
hypertext/hypermedia, interconnected
information repository
• Web is a huge collection of documents plus
– Hyper-link information
– Access and usage information
Web Mining taxonomy
Types of text mining
• Keyword (or term) based association analysis
– Collects sets of keywords or terms that often occur together and then
discovers the association relationships among them. The text is first
preprocessed by parsing, stemming, removing stop words, etc.; association
mining algorithms are then applied to the preprocessed data.
• Automatic document (topic) classification
– Used to automatically classify large numbers of online text documents
such as web pages, emails, etc.
• Similarity detection
– cluster documents by a common author
– cluster documents containing information from a common source
• Sequence analysis: predicting a recurring event, discovering trends
• Anomaly detection: find information that violates usual patterns
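For the first item in the list, keyword-based association analysis, the preprocessing and co-occurrence counting can be sketched as follows. This is a simplified illustration: stemming is omitted, the stop-word list is a tiny made-up sample, and only keyword pairs are counted instead of running a full association mining algorithm.

```python
from collections import Counter
from itertools import combinations

STOP_WORDS = {"the", "a", "an", "is", "in", "of", "to", "and"}   # tiny illustrative list

def preprocess(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (t.strip(".,;:!?\"'()") for t in text.lower().split())
    return {t for t in tokens if t and t not in STOP_WORDS}

def frequent_pairs(documents, min_support):
    """Keyword pairs that co-occur in at least `min_support` documents."""
    counts = Counter()
    for doc in documents:
        counts.update(combinations(sorted(preprocess(doc)), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}

docs = [                                   # made-up documents
    "The football league game in London.",
    "London is leading the football league.",
    "An average salesman earns a salary.",
]
pairs = frequent_pairs(docs, min_support=2)
print(pairs)
```

With these three documents, the pairs that co-occur in two documents are (football, league), (football, london) and (league, london).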
Types of text mining (cont.)
• discovery of frequent phrases
• text segmentation (into logical chunks)
• event detection and tracking
Text Classification: An Example
Ex#  Text                                               Hooligan
     … cheering …
5    An average USA salesman earns 75K                  No
6    The game in London was horrific                    Yes
7    Manchester city is likely to win the championship  Yes
8    Rome is taking the lead in the football league     Yes
[Figure labels: Training Set, Learn Classifier, Model, Test Set]
What is Clustering ?
Given:
• A source of textual documents
• A similarity measure
– e.g., how many words are common in these documents
Find:
• Several clusters of documents that are relevant to each other
[Figure labels: Documents source, Similarity measure, Clustering System, clusters of documents]
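The similarity measure mentioned on this slide, "how many words are common", is simple to write down. The grouping step below is a deliberately naive greedy scheme, assumed here only to show the measure in use; it is not a clustering algorithm taken from the slides, and the documents are made up.

```python
def words(text):
    """The set of lowercase words in a document."""
    return set(text.lower().split())

def common_words(doc_a, doc_b):
    """Similarity measure from the slide: number of words two documents share."""
    return len(words(doc_a) & words(doc_b))

def greedy_clusters(documents, min_common):
    """Put each document in the first cluster whose first member shares at
    least `min_common` words with it; otherwise start a new cluster."""
    clusters = []
    for doc in documents:
        for cluster in clusters:
            if common_words(doc, cluster[0]) >= min_common:
                cluster.append(doc)
                break
        else:                                   # no cluster was similar enough
            clusters.append([doc])
    return clusters

docs = [                                        # made-up documents
    "football league results",
    "stock market prices",
    "football league table",
    "stock market crash",
]
clusters = greedy_clusters(docs, min_common=2)
print(clusters)
```

Here the two football documents end up in one cluster and the two stock market documents in another.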
Information Retrieval
Given:
• A source of textual documents
• A user query (text based)
Find:
• A ranked set of documents that are relevant to the query
[Figure labels: Documents source, Query, IR System, Ranked Documents]
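A minimal sketch of the retrieval task: score every document against the text query and return the matching documents best first. The scoring rule (total occurrences of the query terms) is an assumption picked for brevity; practical IR systems use more refined weighting. The documents and query are made up.

```python
from collections import Counter

def score(query, document):
    """Total number of occurrences of the query terms in the document."""
    term_counts = Counter(document.lower().split())
    return sum(term_counts[term] for term in set(query.lower().split()))

def rank(query, documents):
    """Documents with a non-zero score, best match first."""
    scored = [(score(query, d), d) for d in documents]
    ordered = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [d for s, d in ordered if s > 0]

docs = [                                   # made-up documents
    "rome leads the football league",
    "average salesman salary",
    "football everywhere",
]
ranked = rank("football league", docs)
print(ranked)
```

The first document matches both query terms and is ranked first, the third matches one term, and the second is not returned at all.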
Intelligent Information Retrieval
• Meaning of words
– Synonyms: “buy” / “purchase”
– Ambiguity: “bat” (baseball vs. mammal)
• Order of words in the query
– hot dog stand in the amusement park
– hot amusement stand in the dog park
• User dependency for the data
– direct feedback
– indirect feedback
• Authority of the source
– IBM is more likely to be an authorized source than any
other company.
Intelligent Web Search
Combine the intelligent IR tools
• meaning of words
• order of words in the query
• user dependency for the data
• authority of the source
With the unique web features
• retrieve hyper-link information
• utilize hyper-links as input
What is Information Extraction?
Given:
• A source of textual documents
• A well defined limited query (text based)
Find:
• Sentences with relevant information
• Extract the relevant information and
ignore non-relevant information (important!)
• Link related information and output in a
predetermined format
Querying Extracted Information
[Figure labels: Documents source; Query 1 (e.g. job title); Query 2 (e.g. salary); Extraction System; Relevant Info 1, 2, 3; Combine Query Results; Ranked Documents]
• Thank You.