
Pattern Recognition Summary

Lecture 1
❖ Supervised Learning has training set, features (predictors, or input), and outcome
(response or output).
❖ Unsupervised Learning observe only the features and have no outcome. We need to
cluster data or organize it.
❖ A pattern is a set of objects, processes or events which consist of both deterministic
and stochastic components.
❖ Machine learning is a method of teaching computers to learn from data.
o Pattern recognition and machine learning fields can be used to create systems
that can automatically detect and respond to patterns in data.
❖ Machine Learning vs Pattern Recognition:

o Machine Learning: a method of teaching computers to learn from data; has its origins in Computer Science.
o Pattern Recognition: the process of identifying patterns in data; has its origins in Engineering.

❖ Detection vs Description:

o Detection: something happened. Examples: heard a noise, saw something interesting, non-flat signals.
o Description: what has happened? Examples: gunshot, talking, laughing, crying, etc.
❖ Features: The intrinsic traits or characteristics that tell one pattern (object) apart
from another
o Features allow focusing on the relevant, distinguishing parts of a pattern, and
enable data reduction and abstraction.
❖ Importance of Features:
o Cannot be over-stated.
o We usually don’t know which to select, what they represent, and how to tune
them.
o Classification and regression schemes are mostly trying to make the best of
whatever features are available.
o One feature is usually not descriptive.
▪ Lack of good features may be due to relevance issues, missing values,
dimensionality, and time- and space-varying characteristics.
❖ We can decide if a feature is effective through a training phase.
❖ Feature space: a D-dimensional space (D is the number of features) populated with the
feature vectors of the training samples.
❖ Decision boundary methods
o Learn the separation in the feature space.
o Examples: Cluster Centers, Decision Surfaces
❖ Parametric methods:
o Based on class sample exhibiting a certain parametric distribution.
o Learn the parameters through training.
o Example: Gaussian.
❖ Density methods:
o Does not enforce a parametric form.
o Learn the density function directly.
❖ A deterministic model is a model that assumes that the outcome of a system or process
is fully determined by its initial conditions and parameters
o does not involve any randomness or uncertainty.
o always produces the same result for the same input.
o can be useful when the system or process is well-understood, predictable, and
stable, and when the accuracy and precision of the model are important.
o Example: Crystal Structure.
❖ A stochastic model is a model that incorporates some elements of randomness or
uncertainty into the system or process.
o does not assume that the outcome of a system or process is fully determined
by its initial conditions and parameters.
o It can vary according to some probability distribution or function.
o can be useful when the system or process is complex, dynamic, and
unpredictable, and when the variability and distribution of the model are
important.
o Example: White Noise.
❖ Statistical Tests:
o t-test: Tests for the difference between the means of two independent groups.
o ANOVA: Tests for the difference between the means of three or more groups.
o F-test: Compares the variances of two groups.
o Chi-square test: Tests for relationships between categorical variables.
o Correlation analysis: Measures the strength and direction of the linear
relationship between two continuous variables.
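As a rough illustration of how these tests can be run (assuming SciPy is available; the arrays and the contingency table below are made-up sample data, not values from the lecture):

```python
import numpy as np
from scipy import stats

# Hypothetical sample data for illustration only
group_a = np.array([5.1, 4.9, 6.2, 5.8, 5.5])
group_b = np.array([6.8, 7.1, 6.5, 7.0, 6.9])
group_c = np.array([4.0, 4.3, 3.9, 4.1, 4.2])

# t-test: difference between the means of two independent groups
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# ANOVA: difference between the means of three or more groups
f_stat, anova_p = stats.f_oneway(group_a, group_b, group_c)

# Chi-square test: relationship between two categorical variables
table = np.array([[20, 15], [10, 25]])            # a 2x2 contingency table
chi2, chi_p, dof, expected = stats.chi2_contingency(table)

# Correlation: strength and direction of a linear relationship
r, corr_p = stats.pearsonr(group_a, group_b)
```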
❖ Machine Learning Models:
o Linear regression: Predicts a continuous outcome based on a linear relationship
with one or more independent variables.
o Logistic regression: Predicts a binary outcome (e.g., yes/no) based on a set of
independent variables.
o Naive Bayes: Classifies data points based on Bayes’ theorem and assuming
independence between features.
o Hidden Markov Models: Models sequential data with hidden states and
observable outputs.
❖ Density-based clustering algorithms: can deal with non-hyperspherical clusters and are
robust to outliers.
❖ Traditional vs Modern Pattern Recognition:

o Traditional:
▪ Hand-crafted features.
▪ Simple, low-level concatenation of numbers or traits.
▪ Syntactic.
▪ Feature detection and description are separate tasks from classifier design (not jointly optimized with the classifier).
o Modern:
▪ Automatically learned features.
▪ Hierarchical and complex.
▪ Semantic.
▪ Feature detection, description, and classification are jointly optimized.

❖ Error rate refers to a measure of the degree of prediction error of a model made with
respect to the true model.
❖ Two routes of Bayes Rule:
o Forward (synthesis) route: From class to sample in a class
o Backward (analysis) route: From sample to class ID (always harder).
❖ Bayes' rule turns a backward (analysis) problem into several forward (synthesis)
problems, an approach also known as analysis-by-synthesis.
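A minimal numeric sketch of this idea (the class names, priors, and likelihoods are made up for illustration): the backward question P(class | observation) is answered by combining the forward quantities P(observation | class) and P(class).

```python
# Hypothetical two-class example: priors P(class) and likelihoods P(x | class)
priors = {"gunshot": 0.1, "fireworks": 0.9}
likelihoods = {"gunshot": 0.8, "fireworks": 0.2}   # P(loud bang | class)

# Forward (synthesis) quantities combine into the backward (analysis) answer
evidence = sum(priors[c] * likelihoods[c] for c in priors)            # P(x)
posteriors = {c: priors[c] * likelihoods[c] / evidence for c in priors}
print(posteriors)   # {'gunshot': ~0.308, 'fireworks': ~0.692}
```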
❖ Types of errors:
o True positive is an outcome where the model correctly predicts the positive
class.
o True negative is an outcome where the model correctly predicts the negative
class.
o False positive is an outcome where the model incorrectly predicts the positive
class.
o False negative is an outcome where the model incorrectly predicts the negative
class.
❖ Various ways to measure error rate:
o Training & Testing error (under your control)
o Empirical error (generalization Error)
❖ Precision vs Recall:
o Precision = True Positives / (True Positives + False Positives)
o Recall = True Positives / (True Positives + False Negatives)
o In practice there is a trade-off: tuning a classifier to raise one typically lowers the other.
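A small sketch of these formulas (the confusion-matrix counts are hypothetical):

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives
p, r = precision_recall(tp=40, fp=10, fn=20)
print(p, r)   # 0.8, ~0.667
```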


❖ Mean vs Median:

o Mean:
▪ Traditional measure of center.
▪ Computed by summing the values and dividing by the number of values.
▪ Easier to compute.
▪ Prone to noise (outliers).
o Median:
▪ A resistant measure of the data's center.
▪ At least half of the ordered values are less than or equal to the median, and at least half are greater than or equal to it.
▪ If n is odd, the median is the middle ordered value; if n is even, it is the average of the two middle ordered values.
▪ Finding the median in higher dimensions is much more complex.
❖ The mean and median of data from a symmetric distribution should be close together.
❖ Spread (Variability): exists when some values are different from (above or below) the
mean.
❖ Quartiles: Three numbers which divide the ordered data into four equal sized
groups.
❖ Variance vs Covariance:

o Variance:
▪ The average squared deviation from the mean of a set of data; used to find the standard deviation.
▪ Measures the deviation from the mean for points in one dimension.
o Covariance:
▪ Measures how much each of the dimensions varies from the mean with respect to the others.
▪ Measured between two dimensions; shows whether there is a relation between them.
▪ Determines whether the relation is positive or negative, but not the degree to which the variables are related.

❖ Correlation is another way to determine how two variables are related.


❖ The covariance of a dimension with itself is the variance.
❖ Covariance Types:
o Positive Covariance: Both dimensions increase or decrease together.
o Negative Covariance: one increases the other decreases.
❖ In addition to whether variables are positively or negatively related, correlation also
tells the degree to which the variables are related.
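As a quick NumPy illustration of variance, covariance, and correlation (the data arrays are made up; np.cov and np.corrcoef treat each input array as one variable):

```python
import numpy as np

# Hypothetical paired observations of two variables
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov = np.cov(x, y)        # 2x2 covariance matrix
corr = np.corrcoef(x, y)  # 2x2 correlation matrix

print(cov[0, 0])   # variance of x (covariance of a dimension with itself)
print(cov[0, 1])   # covariance of x and y (positive: they increase together)
print(corr[0, 1])  # correlation: also shows the degree of the relationship
```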
Lecture 2
❖ Dimensionality Reduction: a way to simplify complex high-dimensional data by
summarizing it with a lower-dimensional real-valued vector.
❖ Dimensionality Reduction Solutions:
o Multi-Dimensional Scaling:
▪ Preserve distance measures.
▪ Find projection that best preserves inter-point distances.
o Principal Component Analysis (PCA)
▪ best data representation (not necessarily best separation)
▪ Find projection that maximize the variance.
o ICA (Independent Component Analysis):
▪ Very similar to PCA except that it assumes non-Gaussian features
o Fisher’s Linear Discriminant:
▪ Preserve class separation (special case of PCA)
▪ Maximizing the component axes for class-separation
❖ Feature vectors represent the features used by machine learning models as multi-
dimensional numerical values.
o Other Definition: A feature vector is an ordered list of numerical properties of
observed phenomena.
o As machine learning models can only deal with numerical values, converting any
necessary features into feature vectors is crucial.
❖ Quantitative Data vs Qualitative Data:

Qualitative Quantitative
Categorical Numerical
Humans can analyze qualitative data to machine learning models can only deal
make a decision with quantitative data

Examples: Examples:
- Gender - Age
- Religion - Height
- Marital status - Weight
- Qualifications - Income
❖ The linear model is one of the simplest models in machine learning. It assumes that the
data is linearly separable and tries to learn the weight of each feature.
o We can view linear classification models in terms of dimensionality reduction.
❖ Intuitively, good features are those with large separation of means relative to
variances.
❖ Fisher’s Linear Discriminant:
o Selects a projection that maximizes the class separation. To do that, it
maximizes the ratio of the between-class variance to the within-class
variance.
o To project the data to a smaller dimension and avoid class overlapping, it
maintains two properties:
▪ A large variance between the dataset classes, so that the projected class
means are as far apart as possible.
▪ A small variance within each of the dataset classes, so that the projected
data points of a class stay close to one another.
▪ To find the projection with these properties, it learns a weight vector
that can be calculated via w ∝ Sw⁻¹(m₂ − m₁), where Sw is the within-class
scatter matrix and m₁, m₂ are the class means.
▪ Can be used as a supervised learning classifier.
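A minimal NumPy sketch of this projection, assuming two classes stored as arrays X1 and X2 (hypothetical names and data): it computes the weight vector from the within-class scatter and the class means as described above.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction for two classes.

    X1, X2: arrays of shape (n_samples, n_features), one per class.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the two class scatter matrices
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    Sw = S1 + S2
    # w is proportional to Sw^{-1} (m2 - m1)
    w = np.linalg.solve(Sw, m2 - m1)
    return w / np.linalg.norm(w)

# Hypothetical 2-D data for two classes
rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 0.5, size=(50, 2))
X2 = rng.normal([2, 1], 0.5, size=(50, 2))
w = fisher_direction(X1, X2)
projected = X1 @ w          # 1-D projection of the class-1 samples
```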


Lecture 3
❖ Clustering: One way to summarize a complex real-valued data point with a single
categorical variable.
❖ Principal Component Analysis (PCA): An exploratory technique used to reduce the
dimensionality of the data set to 2D or 3D, can be used to:
o Reduce the number of dimensions in data.
o Find patterns in high-dimensional data.
o Visualize data of high dimensionality.
o Examples:
▪ Face recognition
▪ Image compression
▪ Gene expression analysis
❖ PCA steps to reduce dimensionality to 𝑟-dim:
o Compute Mean Vector 𝜇 and covariance matrix ∑ of original points.
o Compute eigenvectors and eigenvalues of ∑.
o Select top 𝑟 eigenvectors.
o Project points into the subspace spanned by them: 𝑦 = 𝐴(𝑥 − 𝜇), where 𝑦 is the new
point, 𝑥 is the old one, and 𝐴 is the matrix whose rows are the selected eigenvectors.
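A minimal NumPy sketch of these steps (X is a hypothetical data matrix with one point per row; this is plain eigendecomposition PCA, not an optimized library routine):

```python
import numpy as np

def pca_project(X, r):
    """Project points in X (n_samples x n_features) onto the top-r PCA subspace."""
    mu = X.mean(axis=0)                       # mean vector
    cov = np.cov(X, rowvar=False)             # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]         # sort by decreasing eigenvalue
    A = eigvecs[:, order[:r]].T               # top-r eigenvectors as rows
    return (X - mu) @ A.T                     # y = A (x - mu) for every point

# Hypothetical data: 100 points in 5 dimensions, reduced to 2
X = np.random.default_rng(1).normal(size=(100, 5))
Y = pca_project(X, r=2)
print(Y.shape)   # (100, 2)
```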
❖ Eigenvectors (𝑥) vs Eigenvalues (𝜆):
o Eigenvectors:
▪ Vectors that are only stretched by a transformation, with no rotation or shear; they do not change direction.
▪ Also called characteristic vectors.
▪ The zero vector cannot be an eigenvector.
▪ Defining equation: 𝐴𝑥 = 𝜆𝑥.
o Eigenvalues:
▪ The factor by which an eigenvector is stretched or squished: 1 means no change, 2 means doubling in length, −1 means pointing backwards.
▪ Also called characteristic values.
▪ The value zero can be an eigenvalue.
▪ Found from the characteristic equation det(𝐴 − 𝜆𝐼) = 0.
o Here 𝐴 is a square matrix and 𝐼 is the identity matrix.

❖ A vector 𝑥 can be an eigenvector of 𝐴 with eigenvalue 𝜆 only if 𝐵 = 𝐴 − 𝜆𝐼 does not
have an inverse, or equivalently det(𝐵) = 0.
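A quick check of these definitions with NumPy (the matrix is arbitrary, chosen only for illustration):

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [1.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are the eigenvectors

# Verify A x = lambda x for the first eigenpair
x, lam = eigvecs[:, 0], eigvals[0]
print(np.allclose(A @ x, lam * x))                            # True

# Equivalently, det(A - lambda I) = 0 for each eigenvalue
print(np.isclose(np.linalg.det(A - lam * np.eye(2)), 0.0))    # True
```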
❖ We say that 2 vectors are orthogonal if they are perpendicular to each other (the dot
product of the two vectors is zero).
❖ Bases of a vector space: a set of vectors in that space that can be used as coordinates
for it. The set must:
o span the vector space.
o be linearly independent.
❖ Principal Component 1 (PC1):
o The eigenvalue with the largest absolute value will indicate that the data have
the largest variance along its eigenvector, the direction along which there is
greatest variation.
o only a few directions manage to capture most of the variability in the data.
❖ PCA Disadvantages:
o While PCA simplifies the data and removes noise, it always leads to some loss of
information when we reduce dimensions.
o PCA is a linear dimensionality reduction technique, but not all real-world
datasets are linearly structured.
Lecture 4
❖ Parameter estimation is defined as the experimental determination of values of
parameters that govern the system behavior, assuming that the structure of the
process is known.
❖ A discrete distribution is one in which the data can only take on certain values.
o probabilities can be assigned to the values in the distribution.
❖ A continuous distribution is one in which data can take on any value within a specified
range (which may be infinite)
o normally described in terms of probability density, which can be converted into
the probability that a value will fall within a certain range.
❖ Parameter Estimation Approaches:
o Parametric:
▪ Algorithms that simplify the function to a known form.
▪ assume a certain parametric form and estimate the parameters.
▪ A learning model that summarizes data with a set of parameters of fixed
size is called a parametric model.
▪ Examples:
• Logistic Regression
• Linear Discriminant Analysis
• Perceptron
• Naive Bayes
• Simple Neural Networks
▪ Advantages: Simpler, speed, less data.
▪ Disadvantages: constrained, limited complexity, and poor fit.
o Nonparametric:
▪ Algorithms that do not make strong assumptions about the form of the
mapping function.
▪ good when you have a lot of data and no prior knowledge, and when you
don’t want to worry too much about choosing just the right features.
▪ does not assume a parametric form and estimate the density profile
directly.
▪ Example: K-NN, Decision Trees, SVM
▪ Advantages: Flexibility, Power, Performance.
▪ Disadvantages: More data, slower, and overfitting.
o Boundary: estimate the separation hyperplane (hypersurface) between both.
❖ Maximum Likelihood Estimator:
o batch estimator.
o Parameters have fixed but unknown values.
o The maximum likelihood estimator of the mean is the sample mean; that is, the
estimate of 𝜇 is the average value of all the data points.
❖ Bayesian estimator:
o parameters as random variables with a prior distribution.
o allows us to change the a priori distribution by incorporating measurements to
sharpen the profile.
❖ Probabilities can be estimated from the numbers of occurrence: if the number of
samples is large enough, such estimates are reliable.
o Caveat: the sampling process itself may be biased.
❖ Maximum A Posteriori (MAP): Like MLE with one additional twist:
o p(𝜃), a prior probability over parameter values, is incorporated (e.g., the
parameter is more likely to be near some value 𝜇₀, with a normal distribution).
o MLE effectively assumes a uniform prior; MAP does not necessarily.
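A hedged sketch of the difference for estimating a Gaussian mean with known variance (the prior mean mu0 and the variances are illustrative assumptions, not values from the lecture): MLE is just the sample average, while MAP pulls the estimate toward the prior mean.

```python
import numpy as np

# Hypothetical samples from a Gaussian with unknown mean and known variance
data = np.array([4.8, 5.1, 5.4, 4.9, 5.2])
sigma2 = 0.25          # assumed known data variance
mu0, tau2 = 0.0, 1.0   # assumed Gaussian prior: mean mu0, variance tau2

# MLE: the sample mean (equivalent to MAP with a uniform/flat prior)
mu_mle = data.mean()

# MAP with a Gaussian prior: precision-weighted blend of prior mean and sample mean
n = len(data)
mu_map = (n / sigma2 * mu_mle + 1 / tau2 * mu0) / (n / sigma2 + 1 / tau2)

print(mu_mle, mu_map)   # MAP is shrunk slightly toward mu0
```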
❖ MLE vs Bayesian Estimator:

o MLE:
▪ All data must be kept.
▪ Difficult to update the estimation.
▪ Difficult to incorporate other evidence.
▪ Insists on a single measurement.
▪ Faster (differentiation).
▪ Single model; the model must be known.
▪ Uses less information.
o Bayesian:
▪ Allows the freedom that parameters themselves can be random variables.
▪ Allows multiple pieces of evidence and iterative updates.
▪ Slower (integration).
▪ Multiple weighted models; an unknown model is fine.
▪ Uses more information (nonuniform prior).
❖ Bayesian classifier and MAP will in general give different results when used to classify
new samples.
❖ Bayesian classifier is optimal, but can be very expensive, especially when many
hypotheses are kept and evaluated.
❖ Gibbs: randomly pick one hypothesis according to the current posterior.
Lecture 5
❖ Supervised Learning:
o Discover patterns in the data with known target (class) or label.
o These patterns are then utilized to predict the values of the target attribute in
future data instances.
❖ Unsupervised Learning:
o The data have no target attribute.
❖ Clustering: Task of grouping a set of data points such that data points in the same
group are more similar to each other than to those in other groups; each group is known as a cluster.
o A cluster is represented by a single point, known as centroid.
▪ Centroid is computed as the means of all data points in a cluster.
▪ Cluster boundary is decided by the farthest data point in the cluster.
o The goals of clustering:
▪ Group data that are close (or similar) to each other.
▪ Identify such groupings (or clusters) in an unsupervised manner.
❖ Clustering Types:
o Exclusive Clustering: K-Means.
▪ Basic Idea: randomly initialize the k cluster centers, assign each point
to its closest center, and recompute the centers (see the sketch after this list).
▪ Properties: always converges to some solution, which can be a "local
minimum".
▪ Cons: sensitive to initial centers and outliers, and assumes that means
can be computed.
o Overlapping Clustering: Fuzzy C-Means.
▪ Each data point can belong to several clusters and is assigned a
probability score for its membership in each cluster.
▪ Pros:
• Allows a data point to be in multiple clusters.
• gives better results for overlapped data sets compared to k-
means clustering.
▪ Cons:
• Need to define the number of clusters.
• Sensitive to initial assignment of centroids. (not deterministic)
o Hierarchical Clustering: Agglomerative Clustering, Divisive Clustering.
▪ Produces a nested sequence of clusters, a tree, also called dendrogram.
▪ Agglomerative (bottom-up) “more popular”: builds the dendrogram
(tree) from the bottom level, merges the most similar (or nearest) pair
of clusters, and stops when all the data points are merged into the root
cluster.
▪ Divisive (top-down): starts with all data points in one cluster, the root,
splits the root into a set of child clusters, recursively divides each child
cluster further, and stops when every cluster contains only a single point.
▪ Pros:
• Dendrograms are great for visualization
• Provides hierarchical relations between clusters
• Shown to be able to capture concentric clusters
▪ Cons:
• Not easy to define levels for clusters.
• other clustering techniques outperform hierarchical clustering
o Probabilistic Clustering: Mixture of Gaussian Models.
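A minimal NumPy sketch of the K-Means idea referenced above (random initialization, assignment to the nearest center, recomputing means until the centers stop moving); the function and variable names are illustrative, not a library API.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Basic K-Means: returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random init
    for _ in range(n_iters):
        # Assign each point to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned points
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):   # converged (possibly a local minimum)
            break
        centers = new_centers
    return labels, centers

# Hypothetical 2-D data with two blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
labels, centers = kmeans(X, k=2)
```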
❖ Hard clustering vs. Soft Clustering:

o Hard Clustering: each data point is clustered or grouped into exactly one cluster. Example: K-Means.
o Soft Clustering: each data point can belong to multiple clusters, along with a probability score or likelihood for each. Example: Fuzzy C-Means.

❖ Problem with Euclidean distance: at high dimensions, Euclidean distance loses pretty
much all meaning.
❖ Binary attribute: an attribute that has two values or states but no ordering
relationships.
❖ We use a confusion matrix to introduce the distance functions / measures.
❖ Clustering Criteria:
o Similarity Function: use an appropriate distance function.
o Stopping Criteria:
▪ No (or minimum) re-assignments of data points to different clusters.
▪ No (or minimum) change of centroids.
▪ Minimum decrease in the sum of squared error.
o Cluster Quality
▪ Intra-cluster cohesion (compactness)
• measures how near the data points in a cluster are to the cluster
centroid.
• Sum of squared error (SSE) is a commonly used measure.
▪ Inter-cluster separation (isolation)
• different cluster centroids should be far away from one another.
❖ Normalization: technique to force the attributes to have a common value range
o Two main approaches to standardize interval scaled attributes, range and z-
score.
❖ Z-score: transforms the attribute values so that they have a mean of zero and a mean
absolute deviation of 1.
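A small sketch of this z-score variant, following the lecture's definition (dividing by the mean absolute deviation rather than the more common standard deviation); the sample values are hypothetical:

```python
import numpy as np

def z_score(values):
    """Standardize values using the mean and the mean absolute deviation."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    mad = np.abs(values - mean).mean()   # mean absolute deviation
    return (values - mean) / mad

# Hypothetical attribute values (e.g., incomes in thousands)
print(z_score([20, 30, 40, 50, 160]))
```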
❖ Clustering evaluation measures are Entropy and Purity.
o Entropy: measures the uncertainty of a random variable, it characterizes the
impurity of an arbitrary collection of examples.
▪ The higher the entropy, the more the information content.
▪ If the entropy is 0, then the outcome is “certain”
▪ If the entropy is maximum, then any outcome is equally possible.
o Purity: measures the extent that a cluster contains only one class of data.
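A hedged sketch of how these two measures could be computed for a single cluster, given the true class labels of its members (the label values are hypothetical):

```python
from collections import Counter
import math

def cluster_entropy_purity(labels):
    """Entropy and purity of one cluster from its members' true class labels."""
    counts = Counter(labels)
    n = len(labels)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)   # 0 when only one class is present
    purity = max(probs)                               # 1 when only one class is present
    return entropy, purity

print(cluster_entropy_purity(["cat", "cat", "cat", "dog"]))   # entropy ~0.81, purity 0.75
```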
Lecture 7
❖ Decision Tree: a graph that is used to represent choices and their results in the form
of a tree.
o The nodes in the graph represent an event or choice.
o The edges in the graph represent the decision rules or conditions.
o The tree is terminated by leaf nodes that represent the result of following a
combination of decisions.
o Mostly used in Machine Learning and Data Mining using Python.
o Built using recursive partitioning (divide-and-conquer):
▪ Uses the feature values to split the data into smaller subsets of similar
classes.
o Supervised learning algorithm.
o Can be used for solving regressions and classifications.
❖ Greedy Algorithm: always makes the choice that seems to be the best at the moment.
❖ Algorithms used in Decision Trees:
o ID3 (extension of D3): builds decision trees using a top-down greedy search
approach through the space of possible branches with no backtracking.
▪ It begins with the original set S as the root node.
▪ On each iteration of the algorithm, it iterates through every unused
attribute of the set S and calculates the Entropy (H) and Information
Gain (IG) of this attribute.
▪ It then selects the attribute which has the smallest resulting entropy, or
equivalently the largest information gain.
▪ The set S is then split by the selected attribute to produce a subset of
the data.
▪ The algorithm continues to recur on each subset, considering only
attributes never selected before.
❖ The Decision Tree Algorithms strengths and weaknesses:

o Strengths:
▪ More efficient than other complex models.
▪ Can be used on data with relatively few training examples or a very large number.
▪ Can handle numeric features, nominal features, or missing data.
o Weaknesses:
▪ Easy to overfit or underfit the model.
▪ Small changes in training data can result in large changes to the decision logic.
▪ Often biased towards splits on features having a large number of levels.

❖ Gini Index: a method used to select, from the n attributes of the dataset, which
attribute should be placed at the root or at an internal node.
o It measures how often a randomly chosen element would be incorrectly
identified.
o An attribute with lower Gini index should be preferred.
❖ Entropy Formula: Entropy = −p+ log₂(p+) − p− log₂(p−)
o p+ is the probability of positive examples.
o p− is the probability of negative examples.
❖ Information Gain: measures how well a given attribute separates the training examples
according to their target classification.
o Used to select among the candidate attributes at each step while growing the
tree.
o Gain is a measure of how much we can reduce uncertainty (its value lies between
0 and 1).
o Information Gain Formula (for a binary split):
Gain = Entropy(parent) − p+ ∗ Entropy(+ branch) − p− ∗ Entropy(− branch),
where p+ and p− are the fractions of the parent's examples that fall into each branch.
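A minimal sketch of entropy and information gain for a binary split, matching the formulas above (the label lists are hypothetical):

```python
import math

def entropy(labels):
    """Binary entropy of a list of class labels ('+' or '-')."""
    n = len(labels)
    p_pos = sum(1 for y in labels if y == "+") / n
    p_neg = 1 - p_pos
    return -sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

def information_gain(parent, left, right):
    """Entropy(parent) minus the weighted entropies of the two child branches."""
    n = len(parent)
    return (entropy(parent)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))

parent = ["+"] * 9 + ["-"] * 5
left, right = ["+"] * 6 + ["-"] * 1, ["+"] * 3 + ["-"] * 4
print(information_gain(parent, left, right))   # ~0.15 bits gained by this split
```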
❖ CART in Decision Trees:
o Classification Trees: used to separate the dataset into classes belonging to the
response variable.
o Regression Trees: needed when the response variable is numeric or continuous.
❖ Ways to remove overfitting in Decision Trees:
o Pruning Decision Trees:
▪ Remove the decision nodes starting from the leaf node such that the
overall accuracy is not disturbed.
▪ This is done by segregating the actual training set into two sets: training
data set, D and validation data set, V.
▪ Prepare the decision tree using the segregated training data set, D. Then
continue trimming the tree accordingly to optimize the accuracy of the
validation data set, V
o Random Forest:
▪ Has two main concepts:
• A random sampling of training data set when building trees.
• Random subsets of features considered when splitting nodes.
▪ A technique known as bagging is used to create an ensemble of trees
where multiple training sets are generated with replacement.
• In the bagging technique, N samples are drawn from the data set using
randomized sampling with replacement. Then, using a single learning
algorithm, a model is built on each sample and the models are combined.
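A short scikit-learn sketch of these two ideas, assuming scikit-learn is available (the synthetic dataset is for illustration only): a plain bagging ensemble of trees, and a random forest, which adds random feature subsets at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: many trees, each trained on a bootstrap sample drawn with replacement
# (BaggingClassifier's default base learner is a decision tree)
bagging = BaggingClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Random forest: bagging plus a random subset of features considered at each split
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt",
                                random_state=0).fit(X_train, y_train)

print(bagging.score(X_test, y_test), forest.score(X_test, y_test))
```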
