Lecture 3 Ver2

Uploaded by

Abo dahab
CSE 588 - Pattern Recognition

Lecture 3

Dr. Dina Salem

• Introduction to Types of Pattern Recognition Tasks
• Step 1: Data Collection and Preprocessing
• Step 5: Model Evaluation
Introduction to Types of Pattern Recognition Tasks
Types of PR Tasks

• As we discussed earlier, there are two main types of PR tasks according to the type and
method of learning:
1. Supervised Learning
2. Unsupervised Learning
• In addition, there are other types, such as:
3. Semi-supervised Learning: A combination of a small amount of labeled data and a large amount
of unlabeled data. (Example: Training a system on a few labeled medical images and a large
dataset of unlabeled ones to improve diagnostic accuracy.)
4. Reinforcement Learning: Feedback from actions (rewards or penalties) but not specific labeled
examples. (Example: a robot learns to navigate a maze by receiving rewards for reaching the goal
and penalties for hitting walls, improving its strategy over time based on this feedback.)
Pattern Recognition Systems can also be divided into several types, depending on
the knowledge needed to be extracted from the available data.
1. Classification

• Type of Learning: Supervised


• Knowledge Needed: Labeled data with predefined categories or classes.
• Purpose: Assign new data to one of the predefined classes by learning from labeled training
data.
• Example: Identifying whether an email is spam or not based on learned patterns from labeled
emails.
• Some Algorithms:
1. Decision Trees: Learn rules from labeled data to classify new instances.
2. Support Vector Machines (SVM): Find the optimal boundary (hyperplane) that separates
different classes.
3. Neural Networks (NN): Use layers of neurons to model complex decision boundaries for
classification tasks.
4. k-Nearest Neighbors (k-NN): Classify new data points based on the majority class of their
nearest neighbors in the dataset.
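
As a rough illustration of the k-NN idea listed above, the following sketch classifies a new point by majority vote among its nearest labeled neighbors. The function name `knn_predict`, the two made-up features, and the toy data are assumptions for illustration only, not part of the lecture.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every labeled training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up data: two hypothetical features per email (e.g. link count, exclamation marks)
train_points = [(0, 1), (1, 0), (1, 1), (8, 9), (9, 8), (9, 9)]
train_labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

prediction_a = knn_predict(train_points, train_labels, (0.5, 0.5))
prediction_b = knn_predict(train_points, train_labels, (8.5, 8.5))
print(prediction_a, prediction_b)
```

Because the labels are known in advance, this is supervised learning: the model only "learns" by storing labeled examples and comparing new data to them.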

2. Clustering

• Type of Learning: Unsupervised


• Knowledge Needed: No labeled data; the algorithm identifies inherent patterns or groupings in
the data.
• Purpose: Discover natural groupings (clusters) of similar data points without predefined
categories.
• Example: Grouping customers into distinct segments based on purchasing behaviors, even if the
segments aren’t predefined.
• Some Algorithms:
1. k-Means: Partitions data into k clusters by minimizing the variance within each cluster.
2. Hierarchical Clustering: Builds a tree of clusters based on distance between data points.
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Clusters data based on
density, identifying core points and outliers (noise).
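
The k-Means idea above can be sketched in a few lines. This is a minimal version of the usual assign-then-update loop, assuming hand-picked starting centroids and made-up 2-D points; the name `k_means` and the fixed iteration count are illustrative choices, not the lecture's.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and recomputing centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Made-up, unlabeled points forming two visible groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = k_means(points, centroids=[(1, 1), (9, 8)])
print(centroids)
print(clusters)
```

No labels are used anywhere: the groups emerge only from distances between points, which is what makes this unsupervised.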

3. Regression

• Type of Learning: Supervised


• Knowledge Needed: Labeled data with continuous target values (instead of classes).
• Purpose: Predict continuous values based on input features by learning from labeled training
data.
• Example: Predicting house prices based on patterns related to square footage, location, and
other features.
• Some Algorithms:
1. Linear Regression: Models the relationship between input variables and a continuous output by
fitting a linear equation.
2. Polynomial Regression: Extends linear regression to model non-linear relationships.
3. Support Vector Regression (SVR): Similar to SVM, but predicts continuous output instead of
classifying data points.
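
To make the linear regression entry concrete, here is a small least-squares sketch for one input variable. The helper `fit_line` and the area/price numbers are invented for illustration (the prices are constructed to follow price = 2 * area + 10 exactly).

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one input variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training data: floor area and price
areas = [50, 80, 100, 120]
prices = [110, 170, 210, 250]

slope, intercept = fit_line(areas, prices)
predicted_price = slope * 90 + intercept  # continuous prediction for an unseen area
print(slope, intercept, predicted_price)
```

Unlike classification, the output is a continuous value rather than one of a fixed set of classes.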

4. Dimensionality Reduction

• Type of Learning: Unsupervised


• Knowledge Needed: Data without explicit labels but with patterns to reduce redundancy or
find the most informative features.
• Purpose: Reduce the number of features in the data while retaining the most important
structures or patterns.
• Example: Simplifying a dataset of images for facial recognition to reduce the number of pixels
analyzed, while still capturing the most important features needed for recognizing faces.
• Some Algorithms:
1. Principal Component Analysis (PCA): Identifies the directions (principal components) along
which the data has the most variance, reducing the dimensionality of the data.
2. t-SNE (t-Distributed Stochastic Neighbor Embedding): Reduces dimensionality while preserving
the relationships between data points for visualization
5. Outlier Detection

• Type of Learning: Supervised or Unsupervised


• Knowledge Needed: In unsupervised cases, patterns of normal behavior are inferred from the
data; in supervised cases, labeled data contains examples of outliers or anomalies.
• Purpose: Identify data points that deviate significantly from the majority, indicating potential
anomalies or noise.
• Example: Identifying unusual sensor readings in an industrial machine by detecting values that
fall far outside the normal operating range.
• Some Algorithms:
1. Isolation Forest: Constructs trees to isolate anomalies from the rest of the data based on their
uniqueness.
2. One-Class SVM: Learns a model of the "normal" class and flags instances that don't fit as
outliers.
3. LOF (Local Outlier Factor): Measures the local density deviation of data points to identify
anomalies.

6. Association Rule Learning

• Type of Learning: Unsupervised


• Knowledge Needed: No explicit labels; the algorithm identifies patterns of co-occurrence or
frequent associations between variables.
• Purpose: Discover relationships between variables, typically in transactional datasets.
• Example: In a retail store, discovering that customers who buy bread often also buy butter.
• Some Algorithms:
1. Apriori Algorithm: Finds frequent itemsets and generates association rules based on those
itemsets.
2. FP-Growth (Frequent Pattern Growth): A more efficient algorithm for finding frequent itemsets
without candidate generation.
3. Eclat: Uses a depth-first search strategy to find frequent itemsets for association rule learning
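
Algorithms such as Apriori rank itemsets and rules using measures like support and confidence. The sketch below computes both for the bread and butter example; the five transactions are made up, and `support` and `confidence` are illustrative helper names that follow the standard definitions.

```python
# Made-up shopping baskets
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    return support(antecedent | consequent) / support(antecedent)

support_bread_butter = support({"bread", "butter"})
confidence_bread_to_butter = confidence({"bread"}, {"butter"})
print(support_bread_butter, confidence_bread_to_butter)
```

Here bread and butter appear together in 3 of 5 baskets, and 3 of the 4 baskets with bread also contain butter, so the rule "bread → butter" has support 0.6 and confidence 0.75.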

Other Types (Assignment)

• Other Types include:


1. Density Estimation Algorithms (Unsupervised Learning)
2. Sequence Analysis Algorithms (Supervised or Unsupervised Learning)
3. Topic Modeling Algorithms (Unsupervised Learning)
4. Reinforcement Learning Algorithms (Knowledge Gained from Interaction)

TASK: For each of the PR tasks listed above, write the knowledge needed, the purpose, some
algorithms (each with a one-sentence theory of operation), and an example.

Data Collection and Preprocessing

In a Pattern Recognition (PR) system, data collection and preprocessing are the
foundational steps that directly affect the success of pattern recognition tasks.
Each step is explained briefly below, along with examples of the algorithms used.
Data Collection

• Definition: Data collection involves gathering the raw data required for training
and testing the pattern recognition model. The quality and quantity of this data
are critical for the effectiveness of the system.
• Types of Data:
1. Sensor Data: Collected from physical sensors (e.g., camera images, audio recordings, biometric
sensors).
2. Historical Data: Data from previous instances stored in databases, often used for supervised learning
tasks (e.g., medical records, transaction logs).
3. Generated Data: Synthetic data created for specific needs (e.g., simulated datasets for rare events).
• Challenges: Ensuring data is representative, diverse, and large enough to cover
the variations in patterns.
• Tools/Techniques:
1. Manual Data Entry: Human-driven data entry processes.
2. Automated Sensors: Devices like cameras, microphones, or IoT sensors that
automatically collect data.
3. Web Scraping: Algorithms to extract data from websites or online resources.
Pre-Processing

• Definition: Preprocessing refers to transforming raw data into a clean and
structured format suitable for pattern recognition algorithms. It is essential to
reduce noise, handle missing values, normalize, and enhance data quality.
• Importance: Preprocessing ensures that the data used in the PR system is clean,
consistent, and optimized for the algorithms, leading to better pattern
recognition accuracy and performance.
Pre-Processing Key Steps

1. Data Cleaning: Removing or fixing incomplete, inconsistent, or noisy data.
• Example Algorithm: Outlier removal techniques (e.g., Z-score for outlier detection).
2. Data Transformation: Converting data into a suitable format, such as normalizing
numerical values or encoding categorical variables.
• Example Algorithm: Min-Max Normalization (scales data to the range [0,1]) or Z-score
normalization.
3. Dimensionality Reduction: Reducing the number of features or variables in the dataset
while retaining important information.
• Example Algorithms: Principal Component Analysis (PCA) and t-SNE (t-distributed
Stochastic Neighbor Embedding).
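
The Z-score and Min-Max techniques named in steps 1 and 2 can be sketched with the standard library alone. The sensor readings are made up, and the outlier threshold of 2.5 standard deviations is an assumption chosen for this tiny sample (thresholds around 3 are common for larger datasets).

```python
import statistics

def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_scores(values):
    """Express each value as a number of standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

# Made-up sensor readings with one suspicious value
readings = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 12.0, 11.0, 50.0]

scaled = min_max_normalize(readings)
z = z_scores(readings)

# Data cleaning: flag readings whose z-score magnitude exceeds the assumed threshold
THRESHOLD = 2.5
outliers = [v for v, score in zip(readings, z) if abs(score) > THRESHOLD]
cleaned = [v for v, score in zip(readings, z) if abs(score) <= THRESHOLD]
print(outliers)
```

Note that a single extreme value also compresses the Min-Max scaled values of all other readings toward 0, which is one reason cleaning usually comes before normalization.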
Pre-Processing Key Steps (cont.)

4. Feature Extraction/Selection: Identifying and selecting the most relevant features for
pattern recognition to reduce redundancy and computational complexity.
• Example Algorithms: Lasso regression (for feature selection), Mutual Information.
5. Handling Missing Data: Dealing with incomplete data by either imputing values
or removing instances with missing attributes.
• Example Algorithms: K-Nearest Neighbor (KNN) Imputation (predicting missing values),
Mean/Median Imputation.
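
Mean/median imputation, the simplest option in step 5, might look like the sketch below. Missing entries are represented as `None`, and the helper name `impute` and the age values are assumptions for illustration.

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.fmean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

# Made-up attribute with two missing entries and one large value
ages = [20, None, 30, 40, None, 90]

mean_filled = impute(ages, "mean")      # missing entries become 45.0
median_filled = impute(ages, "median")  # missing entries become 35.0
print(mean_filled)
print(median_filled)
```

The two strategies disagree here because the value 90 pulls the mean upward; the median is less sensitive to such extreme values.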
Examples of Pre-Processing Algorithms

1. Image Preprocessing:
• Gaussian Smoothing: Reduces noise in images.
• Histogram Equalization: Enhances contrast in images.
2. Text Data Preprocessing:
• Tokenization: Splits text into words or sentences.
• Stop Word Removal: Removes common but unimportant words (e.g., "the", "is").
• Stemming/Lemmatization: Reduces words to their base form (e.g., "running" to "run").
3. Time-Series Data Preprocessing:
• Resampling: Adjusts the frequency of time-series data (e.g., from daily to monthly).
• Smoothing: Applies moving averages to reduce volatility.
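
Two of the steps above, tokenization with stop word removal and moving-average smoothing, are simple enough to sketch directly. The stop word list is a tiny illustrative subset, and the function names and sample inputs are assumptions.

```python
import re

# Tiny illustrative stop word list; real lists are much longer
STOP_WORDS = {"the", "is", "a", "of", "and"}

def preprocess_text(text):
    """Lowercase, split into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def moving_average(series, window=3):
    """Average each run of `window` consecutive values to damp short-term swings."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

tokens = preprocess_text("The model is running on the test set.")
smoothed = moving_average([1, 5, 3, 7, 5], window=3)
print(tokens)    # ['model', 'running', 'on', 'test', 'set']
print(smoothed)  # [3.0, 5.0, 5.0]
```

Stemming or lemmatization would go one step further and map "running" to "run"; that step usually relies on a language-specific library and is left out of this sketch.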
Evaluation

• How predictive is the model we learned?


• Error on the training data is not a good indicator of performance on future data
Q: Why?
A: Because new data will probably not be exactly the same as the training data!
• Overfitting (fitting the training data too precisely) usually leads to poor results on new
data
Evaluation issues
Possible evaluation measures:

• Classification Accuracy
• Total cost/benefit – when different errors involve different costs
• Lift and ROC curves
• Error in numeric predictions

How reliable are the predicted results?

Classifier error rate

Natural performance measure for classification problems: error rate


• Success: instance’s class is predicted correctly
• Error: instance’s class is predicted incorrectly
• Error rate: proportion of errors made over the whole set of instances
Training set error rate is far too optimistic:
• You can find patterns even in random data

Evaluation on “LARGE” data
• If many (thousands) of examples are available, including several hundred
examples from each class, then a simple evaluation is sufficient
• Randomly split data into training and test sets (usually 2/3 for train, 1/3
for test)
• Build a classifier using the train set and evaluate it using the test set.
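
A random 2/3 to 1/3 split as described above can be sketched as follows. The function name `train_test_split` and the fixed seed are illustrative assumptions (the seed only makes the shuffle reproducible), and the records are placeholders.

```python
import random

def train_test_split(records, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the records, then cut it into train and test parts."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder records standing in for labeled examples
records = list(range(3000))
train, test = train_test_split(records)
print(len(train), len(test))
```

Shuffling before cutting matters: if the data is stored in some order (for example by class or by date), taking the first two thirds without shuffling would give unrepresentative train and test sets.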

Classification Step 1: Split data into train and test sets
[Figure: data with known results ("the past", labeled + and -) is split into a training set and a testing set.]

Classification Step 2: Build a model on a training set
[Figure: the training set (known + and - results from the past) feeds a model builder; the testing set is held aside.]

Classification Step 3: Evaluate on test set (Re-train?)
[Figure: the model builder, trained on the training set, produces + / - predictions for the testing set; these are evaluated (Y/N) against the known results.]

Handling unbalanced data
• Sometimes, classes have very unequal frequency
• Attrition prediction: 97% stay, 3% attrite (in a month)
• Medical diagnosis: 90% healthy, 10% disease
• eCommerce: 99% don’t buy, 1% buy
• Security: >99.99% of Americans are not terrorists
• Similar situation with multiple classes
• Majority class classifier can be 97% correct, but useless
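
The last point can be checked with a few lines, using the 97% / 3% attrition split from the list above on a hypothetical set of 100 customers. The variable names are illustrative.

```python
from collections import Counter

# 100 hypothetical customers following the 97% stay / 3% attrite split
labels = ["stay"] * 97 + ["attrite"] * 3

# A majority class classifier predicts the most frequent label for everyone
majority_class = Counter(labels).most_common(1)[0][0]
predictions = [majority_class] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Fraction of actual attriters that the classifier manages to flag
detected = sum(p == "attrite" and y == "attrite" for p, y in zip(predictions, labels))
minority_sensitivity = detected / labels.count("attrite")

print(accuracy, minority_sensitivity)
```

The accuracy is 0.97, yet not a single customer who leaves is identified, which is why accuracy alone is a poor measure on unbalanced data.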

Balancing unbalanced data
• With two classes, a good approach is to build BALANCED train and test sets, and train
model on a balanced set
• randomly select desired number of minority class instances
• add equal number of randomly selected majority class
• Generalize “balancing” to multiple classes
• Ensure that each class is represented with approximately equal proportions in train
and test
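
One way to sketch the two-class balancing procedure above is random sampling of an equal number of records per class. The helper `balance_two_classes`, the seed, and the buy / no_buy records are assumptions for illustration; `n_per_class` must not exceed the size of the minority class.

```python
import random

def balance_two_classes(records, n_per_class, seed=0):
    """records: list of (features, label). Sample n_per_class records from each class."""
    rng = random.Random(seed)
    # Group the records by their class label
    by_label = {}
    for record in records:
        by_label.setdefault(record[1], []).append(record)
    # Draw the same number of records at random from every class
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n_per_class))
    rng.shuffle(balanced)
    return balanced

# Made-up eCommerce data: 10 buyers among 1000 visitors (1% buy)
records = [(i, "buy") for i in range(10)] + [(i, "no_buy") for i in range(10, 1000)]
balanced = balance_two_classes(records, n_per_class=10)

n_buy = sum(1 for _, label in balanced if label == "buy")
print(len(balanced), n_buy)
```

The same grouping loop works unchanged for more than two classes, which is the generalization mentioned in the last bullets.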

A note on parameter tuning
• It is important that the test data is not used in any way to create the classifier
• Some learning schemes operate in two stages:
• Stage 1: builds the basic structure
• Stage 2: optimizes parameter settings
• The test data can’t be used for parameter tuning!
• Proper procedure uses three sets: training data, validation data, and test data
• Validation data is used to optimize parameters
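
The three-set procedure can be illustrated with a small nearest-neighbor classifier whose parameter k is tuned on the validation set only. Everything here is a made-up sketch: the 1-D data, the candidate values of k, and the helper names are assumptions.

```python
from collections import Counter

def knn_predict(train, x, k):
    """train: list of (value, label). Majority vote among the k nearest values."""
    nearest = sorted(train, key=lambda rec: abs(rec[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def error_rate(train, data, k):
    """Proportion of records in `data` that are misclassified."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in data)
    return wrong / len(data)

# Made-up 1-D data, already split into three disjoint sets
train = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 0), (6, 1), (7, 1), (8, 1)]
validation = [(1.4, 0), (2.4, 0), (4.4, 1), (6.6, 1)]
test = [(0.5, 0), (3.4, 0), (7.6, 1), (9.0, 1)]

# Parameter tuning: pick k using the validation set only
candidates = [1, 3, 5]
best_k = min(candidates, key=lambda k: error_rate(train, validation, k))

# The test set is used once, after k has been fixed
test_error = error_rate(train, test, best_k)
print(best_k, test_error)
```

If k were chosen by looking at the test error instead, the reported test error would no longer be an honest estimate of performance on new data.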

Making the most of the data

• Once evaluation is complete, all the data can be used to build the final classifier
• Generally, the larger the training data the better the classifier (but returns diminish)
• The larger the test data the more accurate the error estimate

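Classification: Train, Validation, Test split
[Figure: data with known results is split into a training set, a validation set, and a final test set. The model builder is trained on the training set; its predictions on the validation set are evaluated (Y/N) and fed back to the model builder; the final model then gets a final evaluation on the final test set.]
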
Evaluating Classification & Predictive Performance

Why Evaluate?

• Multiple methods are available to classify or predict
• For each method, multiple choices are available for settings
• To choose the best model, we need to assess each model’s performance

Accuracy Measures (Classification)

Misclassification error

• Error = classifying a record as belonging to one class when it belongs to another class.
• Error rate = percent of misclassified records out of the total records in the validation data

Separation of Records

• “High separation of records” means that using predictor variables attains low error
• “Low separation of records” means that using predictor variables does not improve much on
the naïve rule (classifying every record as the majority class)

Confusion Matrix

Classification Confusion Matrix
                Predicted Class
Actual Class    1       0
1               201     85
0               25      2689

• 201 1’s correctly classified as “1”
• 85 1’s incorrectly classified as “0”
• 25 0’s incorrectly classified as “1”
• 2689 0’s correctly classified as “0”

Confusion matrix glossary
• In a 2-class problem where the class is either C or not C the confusion
matrix looks like this:
                Classifier Output
True Class      C       not C
C               TP      FN
not C           FP      TN

• TP is the number of true positives. It’s a C, and classifier output is C


• FN is the number of false negatives. It’s a C, and classifier output is not C.
• TN is the number of true negatives. It’s not C, and classifier output is not C.
• FP is the number of false positives. It’s not C, and classifier output is C.
Performance measurements

• Accuracy: The accuracy of a measurement is how close a result comes to the true value. For a
classifier:
Accuracy = (number of correctly classified patterns)/(total number of patterns) = (TP+TN)/(TP+TN+FP+FN)
• Error Rate: (sum of misclassified records)/(total records) = (FP+FN)/(TP+TN+FP+FN)
• Sensitivity (True Positive Rate, TPR): measures the proportion of positives that are correctly
identified.
TPR = TP/P = TP/(TP+FN)
• Specificity (True Negative Rate, TNR): measures the proportion of negatives that are correctly
identified.
TNR = TN/N = TN/(TN+FP)

Example
Using the specified confusion matrix, calculate:
1- Accuracy
2- Error Rate
3- Sensitivity
4- Specificity

Classification Confusion Matrix
                Predicted Class
Actual Class    1       0
1               201     85
0               25      2689

Solution:
TP=201, FP=25, TN=2689, FN=85
Overall error rate = (25+85)/3000 = 3.67%
Accuracy = 1 – err = (201+2689)/3000 = 96.33%
Sensitivity = 201/(201+85) = 70.28%
Specificity = 2689/(2689+25) = 99.08%
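
The arithmetic in this example can be checked with a short script that applies the four formulas to the counts from the confusion matrix. The function name `classification_metrics` is an illustrative choice.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the four measures from the cells of a 2-class confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

# Counts taken from the confusion matrix in the example
metrics = classification_metrics(tp=201, fn=85, fp=25, tn=2689)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

With these counts, the sensitivity 201/286 comes out at about 70.28%, far lower than the 96.33% accuracy: the classifier misses nearly 30% of the actual 1's even though it is right on most records overall.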
Cutoff for classification

Most data mining (DM) algorithms classify via a 2-step process. For each record:
1. Compute probability of belonging to class “1”
2. Compare to cutoff value, and classify accordingly
• Default cutoff value is 0.50
If >= 0.50, classify as “1”
If < 0.50, classify as “0”
• Can use different cutoff values
• Typically, error rate is lowest for cutoff = 0.50

Cutoff Table

Actual Class    Prob. of "1"    Actual Class    Prob. of "1"
1               0.996           1               0.506
1               0.988           0               0.471
1               0.984           0               0.337
1               0.980           1               0.218
1               0.948           0               0.199
1               0.889           0               0.149
1               0.848           0               0.048
0               0.762           0               0.038
1               0.707           0               0.025
1               0.681           0               0.022
1               0.656           0               0.016
0               0.622           0               0.004

• If cutoff is 0.50: 13 records are classified as “1”
• If cutoff is 0.80: seven records are classified as “1”

Confusion Matrix for Different Cutoffs

Cut off Prob.Val. for Success (Updatable): 0.25

Classification Confusion Matrix
                Predicted Class
Actual Class    owner   non-owner
owner           11      1
non-owner       4       8

Cut off Prob.Val. for Success (Updatable): 0.75

Classification Confusion Matrix
                Predicted Class
Actual Class    owner   non-owner
owner           7       5
non-owner       1       11
Assignment: Prepare a report to:
• Discuss the theory of operation of the discriminant analysis classifier, including:
1. Idea of operation
2. Mathematical formulation
3. Advantages and disadvantages
4. The Python instruction for using this classifier, with a detailed description of its
function including all input and output parameters
• Write the Python instructions used for splitting data and for constructing a confusion matrix,
with a detailed explanation of the functions used
• Use any of the datasets available in Python to write a full program that classifies the dataset
using the discriminant analysis classifier. Split the data into 70% training and 30% testing, and
construct the resulting confusion matrix. Obtain as many performance measurements as you can.
Each step in your program should be accompanied by a screenshot of its output.
