Lecture 3 Ver2
• As we discussed earlier, there are two main types of PR tasks according to the type and
method of learning:
1. Supervised Learning
2. Unsupervised Learning
• In addition, there are other types, such as:
3. Semi-supervised Learning: A combination of a small amount of labeled data and a large amount
of unlabeled data. (Example: Training a system on a few labeled medical images and a large
dataset of unlabeled ones to improve diagnostic accuracy.)
4. Reinforcement Learning: Feedback from actions (rewards or penalties) but not specific labeled
examples. (Example: a robot learns to navigate a maze by receiving rewards for reaching the goal
and penalties for hitting walls, improving its strategy over time based on this feedback.)
Pattern Recognition Systems can also be divided into several types, depending on
the knowledge needed to be extracted from the available data.
1. Classification
2. Clustering
3. Regression
4. Dimensionality Reduction
6. Association Rule Learning
Other Types (Assignment)
TASK: For each of the PR tasks above, write the knowledge to be extracted, the purpose, and some algorithms with their theory of operation in one sentence, and give an example.
Data Collection and Preprocessing
In a Pattern Recognition (PR) system, data collection and preprocessing are the
foundation steps that directly impact the success of pattern recognition tasks.
Next is a brief explanation of each step along with examples of algorithms used.
DATA COLLECTION
Definition: Data collection involves gathering the raw data required for training and testing the pattern recognition model. The quality and quantity of this data are critical for the effectiveness of the system.
Types of Data:
• Sensor Data: Collected from physical sensors (e.g., camera images, audio recordings, biometric sensors).
• Historical Data: Data from previous instances stored in databases, often used for supervised learning tasks (e.g., medical records, transaction logs).
• Generated Data: Synthetic data created for specific needs (e.g., simulated datasets for rare events).
Measures for evaluating classifiers:
• Classification accuracy
• Total cost/benefit – when different errors involve different costs
• Lift and ROC curves
• Error in numeric predictions
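As a quick illustration of the cost/benefit measure above, a cost-weighted evaluation can rank classifiers differently than plain accuracy does. The sketch below uses invented costs and error counts, purely for illustration:

```python
# Illustration: when different errors carry different costs, two classifiers
# with the SAME accuracy can differ sharply in total cost.
COST_FP = 1    # cost of a false alarm (made-up value)
COST_FN = 10   # cost of a missed positive, e.g. an undetected disease (made-up)

def total_cost(fp, fn):
    """Total cost of a classifier's errors on some test set."""
    return fp * COST_FP + fn * COST_FN

# Two classifiers, each making 20 errors on the same data:
print(total_cost(fp=15, fn=5))   # 65  -> preferred, despite equal accuracy
print(total_cost(fp=5, fn=15))   # 155
```

Both classifiers are 98% accurate on 1000 records, yet the first costs far less under this cost matrix.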
Classifier error rate
Evaluation on “LARGE” data
• If many examples are available (thousands, including several hundred from each class), then a simple evaluation is sufficient:
• Randomly split the data into training and test sets (usually 2/3 for training, 1/3 for testing)
• Build a classifier using the training set and evaluate it using the test set.
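The random split described above can be sketched in a few lines of Python. The 2/3 ratio follows the slide; the function name, seed, and toy data are placeholders:

```python
import random

def split_train_test(records, seed=42):
    """Randomly split records into a training set (~2/3) and a test set (~1/3)."""
    shuffled = list(records)                 # copy, so the input is untouched
    random.Random(seed).shuffle(shuffled)    # random order before splitting
    cut = len(shuffled) * 2 // 3             # 2/3 boundary for the training set
    return shuffled[:cut], shuffled[cut:]

data = list(range(300))                      # stand-in for 300 labelled records
train, test = split_train_test(data)
print(len(train), len(test))                 # 200 100
```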
Classification Step 1: Split data into train and test sets
(Diagram: the past data, with results known (+/–), is split into a training set and a testing set.)
Classification Step 2: Build a model on the training set
(Diagram: the training set feeds a model builder; the testing set is held aside.)
Classification Step 3: Evaluate on test set (re-train?)
(Diagram: the model built from the training set produces Y/N predictions for the testing set; these are compared with the known results to evaluate the model.)
Handling unbalanced data
• Sometimes, classes have very unequal frequency:
• Attrition prediction: 97% stay, 3% attrite (in a month)
• Medical diagnosis: 90% healthy, 10% disease
• eCommerce: 99% don’t buy, 1% buy
• Security: >99.99% of Americans are not terrorists
• A similar situation arises with multiple classes
• A majority-class classifier can be 97% correct, but useless
Balancing unbalanced data
• With two classes, a good approach is to build BALANCED train and test sets, and train the model on a balanced set:
• Randomly select the desired number of minority-class instances
• Add an equal number of randomly selected majority-class instances
• Generalize “balancing” to multiple classes:
• Ensure that each class is represented with approximately equal proportions in the train and test sets
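A minimal sketch of the balancing recipe above, using undersampling of the larger class. The attrition-style proportions follow the earlier example; the function and variable names are made up:

```python
import random

def balance_classes(records, label_of, seed=0):
    """Undersample so that every class appears with equal counts."""
    rng = random.Random(seed)
    by_class = {}
    for rec in records:                                # group records by label
        by_class.setdefault(label_of(rec), []).append(rec)
    n = min(len(recs) for recs in by_class.values())   # minority-class size
    balanced = []
    for recs in by_class.values():
        balanced.extend(rng.sample(recs, n))           # n random picks per class
    rng.shuffle(balanced)
    return balanced

# Attrition-style data: 97% "stay", 3% "attrite"
records = [("stay", i) for i in range(970)] + [("attrite", i) for i in range(30)]
balanced = balance_classes(records, label_of=lambda rec: rec[0])
print(len(balanced))    # 60 (30 from each class)
```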
A note on parameter tuning
• It is important that the test data is not used in any way to create the classifier
• Some learning schemes operate in two stages:
• Stage 1: builds the basic structure
• Stage 2: optimizes parameter settings
• The test data can’t be used for parameter tuning!
• Proper procedure uses three sets: training data, validation data, and test data
• Validation data is used to optimize parameters
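The three-set procedure can be sketched as follows. The 60/20/20 proportions are a common choice, not something the slides specify:

```python
import random

def split_train_valid_test(records, seed=1):
    """Split into training (60%), validation (20%) and test (20%) sets.
    The validation set is for parameter tuning; the test set is never
    touched until the final evaluation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    a = n * 60 // 100                    # end of the training portion
    b = n * 80 // 100                    # end of the validation portion
    return shuffled[:a], shuffled[a:b], shuffled[b:]

train, valid, test = split_train_valid_test(list(range(100)))
print(len(train), len(valid), len(test))    # 60 20 20
```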
Once evaluation is complete, all the
data can be used to build the final
classifier
Classification: Train, Validation, Test split
(Diagram: data with known results is split three ways; a model builder is trained on the training set and evaluated on the validation set to tune parameters, and the final model then gets a final evaluation on the held-out final test set.)
Evaluating Classification & Predictive Performance
Multiple methods are available to classify or predict.
Accuracy Measures (Classification)
Misclassification error
The generic confusion matrix for a class C:

                 Predicted C    Predicted not C
Actual C             TP               FN
Actual not C         FP               TN
Example
Using the following confusion matrix, calculate:
1- Accuracy
2- Error Rate
3- Sensitivity
4- Specificity

Classification Confusion Matrix
                  Predicted Class
Actual Class        1        0
     1             201       85
     0              25     2689

(Here 201 1’s are correctly classified as “1”, and 85 1’s are misclassified as “0”.)

Solution:
TP = 201, FP = 25, TN = 2689, FN = 85
Overall error rate = (25 + 85)/3000 = 3.67%
Accuracy = 1 – err = (201 + 2689)/3000 = 96.33%
Sensitivity = 201/(201 + 85) = 70.28%
Specificity = 2689/(2689 + 25) = 99.08%
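The solution can be checked in Python directly from the matrix entries:

```python
# Re-computing the measures from the confusion matrix in the example above.
TP, FN, FP, TN = 201, 85, 25, 2689
total = TP + FN + FP + TN               # 3000 records

error_rate  = (FP + FN) / total         # (25 + 85) / 3000
accuracy    = (TP + TN) / total         # (201 + 2689) / 3000
sensitivity = TP / (TP + FN)            # true-positive rate
specificity = TN / (TN + FP)            # true-negative rate

print(f"error={error_rate:.2%} acc={accuracy:.2%} "
      f"sens={sensitivity:.2%} spec={specificity:.2%}")
# error=3.67% acc=96.33% sens=70.28% spec=99.08%
```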
Most DM algorithms classify via a 2-step process. For each record:
1. Compute the probability of belonging to class “1”
2. Compare it to a cutoff value, and classify accordingly
Table 1: Actual class and estimated probability of “1” for 24 records

Actual Class   Prob. of "1"      Actual Class   Prob. of "1"
     1            0.996               1            0.506
     1            0.988               0            0.471
     1            0.984               0            0.337
     1            0.980               1            0.218
     1            0.948               0            0.199
     1            0.889               0            0.149
     1            0.848               0            0.048
     0            0.762               0            0.038
     1            0.707               0            0.025
     1            0.681               0            0.022
     1            0.656               0            0.016
     0            0.622               0            0.004

• If the cutoff is 0.50: 13 records are classified as “1”
• If the cutoff is 0.80: seven records are classified as “1”
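The two cutoff counts can be verified against the probabilities in Table 1:

```python
# Estimated probabilities of class "1" for the 24 records in Table 1.
probs = [0.996, 0.988, 0.984, 0.980, 0.948, 0.889, 0.848, 0.762,
         0.707, 0.681, 0.656, 0.622, 0.506, 0.471, 0.337, 0.218,
         0.199, 0.149, 0.048, 0.038, 0.025, 0.022, 0.016, 0.004]

def classify(probs, cutoff):
    """Step 2 of the process: compare each probability to the cutoff."""
    return [1 if p >= cutoff else 0 for p in probs]

print(sum(classify(probs, 0.50)))   # 13 records classified as "1"
print(sum(classify(probs, 0.80)))   # 7 records classified as "1"
```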
Cutoff probability value for success (updatable): 0.25
Different cutoff values produce different confusion matrices:

Cutoff = 0.25:
                   Predicted owner   Predicted non-owner
Actual owner             11                   1
Actual non-owner          4                   8

A higher cutoff (fewer records classified as “owner”):
                   Predicted owner   Predicted non-owner
Actual owner              7                   5
Actual non-owner          1                  11
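From the two matrices, the overall accuracy at each cutoff can be computed. The matrix layout below is an assumption based on the slide: rows are actual classes, columns are predicted classes:

```python
# Accuracy at the two cutoffs shown above (24 riding-mower-style records).
def accuracy(matrix):
    """Fraction of correct predictions; the diagonal holds the correct ones."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

low_cutoff  = [[11, 1], [4, 8]]    # cutoff 0.25: 15 records predicted "owner"
high_cutoff = [[7, 5], [1, 11]]    # higher cutoff: only 8 predicted "owner"

print(f"{accuracy(low_cutoff):.1%}")    # 79.2%
print(f"{accuracy(high_cutoff):.1%}")   # 75.0%
```

Note that the lower cutoff gives slightly better overall accuracy here, but the preferred cutoff still depends on the relative costs of the two error types.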
Assignment: Prepare a report to:
• Discuss the theory of operation of the discriminant analysis classifier, including:
1. Idea of operation
2. Mathematical formulation
3. Advantages and disadvantages
4. Python instructions for using this classifier, with a detailed description of its function including all input and output parameters
• Write the Python instructions used in data splitting and in constructing a confusion matrix, with a detailed explanation of the functions used
• Use any of the datasets available in Python to write a full program that classifies the dataset using a discriminant analysis classifier. Split the data into 70% training and 30% testing, and construct the resulting confusion matrix. Obtain as many performance measurements as you can. Each step in your program should be accompanied by a screenshot of its output.