Pattern Recognition Lecture Notes
Lecture 1
❖ Supervised Learning: has a training set with features (predictors, or inputs) and an outcome
(response, or output).
❖ Unsupervised Learning: we observe only the features and have no outcome; the task is to
cluster or otherwise organize the data.
❖ A pattern is a set of objects, processes or events which consist of both deterministic
and stochastic components.
❖ Machine learning is a method of teaching computers to learn from data.
o Pattern recognition and machine learning fields can be used to create systems
that can automatically detect and respond to patterns in data.
❖ Machine Learning vs Pattern Recognition:
❖ Detection vs Description:
o Detection: something happened. Examples: heard a noise, saw something interesting, non-flat signals.
o Description: what has happened? Examples: gunshot, talking, laughing, crying, etc.
❖ Features: The intrinsic traits or characteristics that tell one pattern (object) apart
from another
o Allow focus on the relevant, distinguishing parts of a pattern, and enable data reduction
and abstraction.
❖ Importance of Features:
o Cannot be over-stated.
o We usually don’t know which to select, what they represent, and how to tune
them.
o Classification and regression schemes are mostly trying to make the best of
whatever features are available.
o One feature is usually not descriptive.
▪ Lack of usable features may be due to relevance, missing values, dimensionality,
or time- and space-varying characteristics.
❖ We can decide if a feature is effective through a training phase.
❖ Feature space: D-dimensional (D is the number of features), populated with the feature
vectors of the training samples.
❖ Decision boundary methods
o Learn the separation in the feature space.
o Examples: Cluster Centers, Decision Surfaces
❖ Parametric methods:
o Based on the class samples exhibiting a certain parametric distribution.
o Learn the parameters through training.
o Example: Gaussian.
❖ Density methods:
o Do not enforce a parametric form.
o Learn the density function directly.
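For illustration, a minimal Python sketch of the contrast between a parametric method (assume a Gaussian form and learn its parameters) and a density method (learn the density directly); the sample values and the use of NumPy/SciPy are assumptions for illustration, not part of the lecture.

```python
# Minimal sketch: parametric vs. density-based estimation of a class-conditional
# distribution for one 1-D feature (the sample values below are made up).
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=200)   # training samples of one class

# Parametric method: assume a Gaussian form and learn its parameters from training.
mu, sigma = samples.mean(), samples.std(ddof=1)
parametric_density = norm(mu, sigma)

# Density method: learn the density function directly, with no parametric form enforced.
kde = gaussian_kde(samples)

x = 4.2
print(parametric_density.pdf(x))   # p(x | class) under the Gaussian assumption
print(kde(x)[0])                   # p(x | class) from the kernel density estimate
```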
❖ A deterministic model is a model that assumes that the outcome of a system or process
is fully determined by its initial conditions and parameters
o does not involve any randomness or uncertainty.
o always produces the same result for the same input.
o can be useful when the system or process is well-understood, predictable, and
stable, and when the accuracy and precision of the model are important.
o Example: Crystal Structure.
❖ A stochastic model is a model that incorporates some elements of randomness or
uncertainty into the system or process.
o does not assume that the outcome of a system or process is fully determined
by its initial conditions and parameters.
o It can vary according to some probability distribution or function.
o can be useful when the system or process is complex, dynamic, and
unpredictable, and when the variability and distribution of the model are
important.
o Example: White Noise.
❖ Statistical Tests:
o t-test: Tests for the difference between the means of two independent groups.
o ANOVA: Tests for the difference between the means of three or more groups.
o F-test: Compares the variances of two groups.
o Chi-square test: Tests for relationships between categorical variables.
o Correlation analysis: Measures the strength and direction of the linear
relationship between two continuous variables.
❖ Machine Learning Models:
o Linear regression: Predicts a continuous outcome based on a linear relationship
with one or more independent variables.
o Logistic regression: Predicts a binary outcome (e.g., yes/no) based on a set of
independent variables.
o Naive Bayes: Classifies data points based on Bayes’ theorem and assuming
independence between features.
o Hidden Markov Models: Models sequential data with hidden states and
observable outputs.
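As a hedged illustration of two of these models, the sketch below fits a logistic regression and a naive Bayes classifier with scikit-learn; the feature values, labels, and library choice are assumptions made only for illustration.

```python
# Hedged sketch: two of the models above on made-up 2-D data using scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.1], [1.5, 1.8], [5.0, 8.0], [6.0, 9.2], [1.2, 0.7], [5.5, 8.5]])
y = np.array([0, 0, 1, 1, 0, 1])            # binary outcome (e.g., yes / no)

logreg = LogisticRegression().fit(X, y)     # logistic regression: predicts a binary outcome
nb = GaussianNB().fit(X, y)                 # naive Bayes: assumes independent features

print(logreg.predict([[4.8, 7.9]]))         # predicted class label
print(nb.predict_proba([[4.8, 7.9]]))       # class posterior probabilities
```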
❖ Density-based clustering algorithms: can deal with non-hyperspherical clusters and are
robust to outliers.
❖ Traditional vs Modern Pattern Recognition:
o Traditional: hand-crafted features; syntactic; feature detection and description are separate tasks from classifier design, not jointly optimized with the classifier.
o Modern: automatically learned features; semantic; feature detection, description, and classification are jointly optimized.
❖ Error rate refers to a measure of the degree of prediction error of a model made with
respect to the true model
❖ Two routes of Bayes Rule:
o Forward (synthesis) route: From class to sample in a class
o Backward (analysis) route: From sample to class ID (always harder).
❖ In Bayes' rule, we turn a backward (analysis) problem into several forward (synthesis)
problems, an approach also known as analysis-by-synthesis.
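A tiny numeric illustration of this idea (the priors and likelihoods below are made-up numbers): the hard backward question p(class | x) is answered by combining the easy forward quantities p(x | class) and p(class).

```python
# Tiny numeric illustration of the two routes of Bayes' rule (made-up numbers).
# Forward (synthesis): class -> sample, i.e. the likelihoods p(x | class).
# Backward (analysis): sample -> class, i.e. the posteriors p(class | x).
priors      = {"gunshot": 0.1, "talking": 0.9}    # p(class)
likelihoods = {"gunshot": 0.8, "talking": 0.05}   # p(x | class) for one observed sound x

evidence = sum(priors[c] * likelihoods[c] for c in priors)              # p(x)
posteriors = {c: priors[c] * likelihoods[c] / evidence for c in priors}
print(posteriors)   # backward answer obtained from forward quantities only
```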
❖ Types of errors:
o True positive is an outcome where the model correctly predicts the positive
class.
o True negative is an outcome where the model correctly predicts the negative
class.
o False positive is an outcome where the model incorrectly predicts the positive
class.
o False negative is an outcome where the model incorrectly predicts the negative
class.
❖ Various ways to measure error rate:
o Training and testing error (under your control).
o Empirical (generalization) error.
❖ Precision vs Recall:
o Precision = True Positives / (True Positives + False Positives)
o Recall = True Positives / (True Positives + False Negatives)
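A small Python helper that evaluates these two formulas from raw counts (the example counts are hypothetical):

```python
# Precision and recall computed from raw counts, following the formulas above.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives.
print(precision(40, 10))   # 0.8
print(recall(40, 20))      # 0.666...
```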
❖ Mean vs. Median:
o Mean:
▪ Traditional measure of center.
▪ Sum the values and divide by the number of values.
▪ Easier to compute.
▪ Prone to noise (outliers).
o Median:
▪ A resistant measure of the data’s center.
▪ At least half of the ordered values are less than or equal to the median value, and at least half are greater than or equal to it.
▪ If n is odd, the median is the middle ordered value.
▪ If n is even, the median is the average of the two middle ordered values.
❖ Variance vs. Covariance:
o Variance:
▪ The average squared deviation from the mean of a set of data; it is used to find the standard deviation.
▪ Measures the deviation from the mean for points in one dimension.
o Covariance:
▪ Determines whether the relation between two dimensions is positive or negative, but not the degree to which the variables are related.
▪ Measures how much each of the dimensions varies from the mean with respect to the others.
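A short NumPy illustration of the two quantities above (the data values are made up):

```python
# Variance and covariance with NumPy (made-up values).
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])   # one dimension
y = np.array([1.0, 3.0, 5.0, 9.0])   # a second dimension

print(np.var(x, ddof=1))   # variance: average squared deviation from the mean of x
print(np.cov(x, y))        # 2x2 covariance matrix of the two dimensions
# The sign of np.cov(x, y)[0, 1] tells whether the relation is positive or negative,
# but its magnitude alone does not measure how strongly the variables are related.
```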
❖ Qualitative vs. Quantitative data:
o Qualitative: categorical; humans can analyze qualitative data to make a decision. Examples: gender, religion, marital status, qualifications.
o Quantitative: numerical; machine learning models can only deal with quantitative data. Examples: age, height, weight, income.
❖ The linear model is one of the simplest models in machine learning. It assumes that the
data is linearly separable and tries to learn the weight of each feature.
o We can view linear classification models in terms of dimensionality reduction.
❖ Intuitively, good features are those with large separation of means relative to
variances.
❖ Fisher’s Linear Discriminant:
o Selects a projection that maximizes the class separation. To do that, it
maximizes the ratio of the between-class variance to the within-class
variance.
o to project the data to a smaller dimension and to avoid class overlapping, it
maintains two properties:
▪ A large variance among the dataset classes, so that the projected class
averages should be as far apart as possible.
▪ A small variance within each of the dataset classes, so that a small within-
class variance has the effect of keeping the projected data points closer
to one another.
▪ To find the projection with these properties, it learns a weight vector w
that can be calculated as w ∝ S_W⁻¹ (m₂ − m₁), where S_W is the within-class
scatter matrix and m₁, m₂ are the class means.
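A minimal NumPy sketch of this computation, assuming two small made-up classes of 2-D points:

```python
# Minimal NumPy sketch of Fisher's linear discriminant for two classes
# (made-up 2-D samples): w is proportional to S_W^{-1} (m2 - m1).
import numpy as np

X1 = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2]])   # class 1 samples
X2 = np.array([[6.0, 7.0], [6.5, 6.8], [7.0, 7.5]])   # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)             # class means

# Within-class scatter matrix: sum of the per-class scatter matrices.
S_w = np.cov(X1.T) * (len(X1) - 1) + np.cov(X2.T) * (len(X2) - 1)

w = np.linalg.solve(S_w, m2 - m1)                     # projection direction
w /= np.linalg.norm(w)

print(X1 @ w)   # projected class-1 samples
print(X2 @ w)   # projected class-2 samples (well separated from class 1)
```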
❖ Eigenvectors vs. Eigenvalues (for a square matrix A, with I the identity matrix):
o Eigenvectors:
▪ Vectors that are only stretched by a transformation, with no rotation or shear; they do not change direction.
▪ Also called characteristic vectors.
▪ The zero vector cannot be an eigenvector.
▪ Defined by the formula Ax = λx.
o Eigenvalues:
▪ The factor λ by which an eigenvector is stretched or squished: 1 means no change, 2 means doubling in length, −1 means pointing backwards.
▪ Also called characteristic values.
▪ The value zero can be an eigenvalue.
▪ Found from the characteristic equation det(A − λI) = 0.
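A quick NumPy check of these definitions (the matrix A below is an arbitrary example):

```python
# Quick NumPy check of the eigenvalue/eigenvector definitions: A x = lambda x.
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])                        # arbitrary square matrix

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)                                 # stretch factors (here 2 and 3)
print(eigenvectors)                                # columns are the characteristic vectors

v, lam = eigenvectors[:, 0], eigenvalues[0]
print(np.allclose(A @ v, lam * v))                 # True: the defining equation holds
```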
❖ MLE vs. Bayesian estimation:
o MLE: all the data must be kept, and it is difficult to update the estimate when new data arrive.
o Bayesian: allows the freedom that the parameters themselves can be random variables, and allows multiple pieces of evidence to be combined incrementally.
❖ Bayesian classifier and MAP will in general give different results when used to classify
new samples.
❖ Bayesian classifier is optimal, but can be very expensive, especially when many
hypotheses are kept and evaluated.
❖ Gibbs: randomly pick one hypothesis according to the current posterior.
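A hedged sketch contrasting MLE with a Bayesian/MAP treatment of a coin's bias; the coin counts and the Beta(2, 2) prior are assumptions made purely for illustration:

```python
# Hedged sketch: MLE vs. a Bayesian/MAP treatment of a coin's bias.
# The counts and the Beta(2, 2) prior are assumptions made only for illustration.
heads, tails = 7, 3

# MLE: a single point estimate; revising it later means re-estimating on all kept data.
theta_mle = heads / (heads + tails)

# Bayesian: the parameter itself is a random variable with a posterior distribution,
# so new evidence can be folded in simply by updating the Beta counts.
alpha, beta = 2, 2                                       # assumed Beta(2, 2) prior
alpha_post, beta_post = alpha + heads, beta + tails      # posterior is Beta(9, 5)
theta_map = (alpha_post - 1) / (alpha_post + beta_post - 2)   # MAP point estimate

print(theta_mle)   # 0.7
print(theta_map)   # ~0.667 (pulled toward the prior)
```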
Lecture 5
❖ Supervised Learning:
o Discover patterns in the data with known target (class) or label.
o These patterns are then utilized to predict the values of the target attribute in
future data instances.
❖ Unsupervised Learning:
o The data have no target attribute.
❖ Clustering: the task of grouping a set of data points such that data points in the same
group are more similar to each other than to points in other groups; each group is known as a cluster.
o A cluster is represented by a single point, known as centroid.
▪ Centroid is computed as the means of all data points in a cluster.
▪ Cluster boundary is decided by the farthest data point in the cluster.
o The goals of clustering:
▪ Group data that are close (or similar) to each other.
▪ Identify such groupings (or clusters) in an unsupervised manner.
❖ Clustering Types:
o Exclusive Clustering: K-Means.
▪ Basic idea: randomly initialize the k cluster centers and assign each point
to the cluster whose center is closest (see the sketch after this list).
▪ Properties: always converges to some solution, which can be a local
minimum.
▪ Cons: sensitive to initial centers and outliers; assumes that means
can be computed.
o Overlapping Clustering: Fuzzy C-Means.
▪ Each data point can belong to several clusters and is assigned a
probability (membership) score for each cluster.
▪ Pros:
• Allows a data point to be in multiple clusters.
• gives better results for overlapped data sets compared to k-
means clustering.
▪ Cons:
• Need to define the number of clusters.
• Sensitive to initial assignment of centroids. (not deterministic)
o Hierarchical Clustering: Agglomerative Clustering, Divisive Clustering.
▪ Produces a nested sequence of clusters, a tree, also called dendrogram.
▪ Agglomerative (bottom-up) “more popular”: builds the dendrogram
(tree) from the bottom level, merges the most similar (or nearest) pair
of clusters, and stops when all the data points are merged into the root
cluster.
▪ Divisive (top-down): starts with all data points in one cluster (the root),
splits the root into a set of child clusters, recursively divides each child
cluster further, and stops when every cluster contains only a single point.
▪ Pros:
• Dendrograms are great for visualization
• Provides hierarchical relations between clusters
• Shown to be able to capture concentric clusters
▪ Cons:
• Not easy to define levels for clusters.
• other clustering techniques outperform hierarchical clustering
o Probabilistic Clustering: Mixture of Gaussian Models.
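A minimal NumPy sketch of the k-means idea referenced above (the 2-D points are toy values chosen for illustration; empty-cluster handling and restarts are omitted):

```python
# Minimal NumPy sketch of k-means on toy 2-D points (illustrative values only;
# empty-cluster handling and restarts are omitted for brevity).
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    for _ in range(n_iter):
        # assign each point to the cluster whose center is closest
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each center as the mean of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # converged (possibly a local minimum)
            break
        centers = new_centers
    return centers, labels

X = np.array([[1, 1], [1.5, 2], [1, 0.5], [8, 8], [8.5, 9], [9, 8]], dtype=float)
centers, labels = kmeans(X, k=2)
print(centers)
print(labels)
```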
❖ Hard clustering vs. Soft clustering: in hard clustering each data point belongs to exactly one cluster (e.g., k-means), while in soft clustering each point has a degree of membership in several clusters (e.g., fuzzy c-means).
❖ Problems with Euclidean distance: at high dimensions, Euclidean distance loses pretty
much all of its meaning.
❖ Binary attribute: an attribute that has two values or states but no ordering
relationships.
❖ We use a confusion matrix to introduce the distance functions / measures.
❖ Clustering Criteria:
o Similarity Function: use an appropriate distance function.
o Stopping Criteria:
▪ No (or minimum) re-assignments of data points to different clusters.
▪ No (or minimum) change of centroids.
▪ Minimum decrease in the sum of squared error.
o Cluster Quality
▪ Intra-cluster cohesion (compactness)
• measures how near the data points in a cluster are to the cluster
centroid.
• Sum of squared error (SSE) is a commonly used measure.
▪ Inter-cluster separation (isolation)
• different cluster centroids should be far away from one another.
❖ Normalization: a technique to force the attributes to have a common value range.
o Two main approaches to standardize interval-scaled attributes: range (min-max) and
z-score.
❖ Z-score: transforms the attribute values so that they have a mean of zero and a mean
absolute deviation of 1.
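A small sketch of the two approaches on made-up values; the z-score here follows the lecture's mean-absolute-deviation definition, whereas the more common variant divides by the standard deviation:

```python
# Sketch of the two standardization approaches on made-up values. The z-score here
# follows the lecture's definition (mean absolute deviation); the more common
# variant divides by the standard deviation instead.
import numpy as np

x = np.array([10.0, 20.0, 30.0, 50.0])

# Range (min-max) normalization: map the values into [0, 1].
x_range = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: zero mean, unit (mean absolute) deviation.
mad = np.mean(np.abs(x - x.mean()))
x_z = (x - x.mean()) / mad

print(x_range)   # [0.   0.25 0.5  1.  ]
print(x_z)       # [-1.4 -0.6  0.2  1.8]
```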
❖ Clustering evaluation measures are Entropy and Purity.
o Entropy: measures the uncertainty of a random variable, it characterizes the
impurity of an arbitrary collection of examples.
▪ The higher the entropy, the more the information content.
▪ If the entropy is 0, then the outcome is “certain”
▪ If the entropy is maximum, then any outcome is equally possible.
o Purity: measures the extent that a cluster contains only one class of data.
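A small sketch of both measures for a single cluster, given hypothetical per-class counts of its members:

```python
# Sketch: purity and entropy of a single cluster, given hypothetical counts of the
# true classes found inside it.
import numpy as np

def purity(counts):
    """Fraction of the cluster occupied by its dominant class."""
    counts = np.asarray(counts, dtype=float)
    return counts.max() / counts.sum()

def entropy(counts):
    """Uncertainty (in bits) of the class distribution inside the cluster."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(purity([45, 5]))    # 0.9  -> the cluster contains mostly one class
print(entropy([45, 5]))   # low entropy: the outcome is close to "certain"
print(entropy([25, 25]))  # 1.0  -> maximum entropy: both classes equally likely
```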
Lecture 7
❖ Decision Tree: a graph that is used to represent choices and their results in the form
of a tree.
o The nodes in the graph represent an event or choice.
o The edges in the graph represent the decision rules or conditions.
o The tree is terminated by leaf nodes that represent the result of following a
combination of decisions.
o Mostly used in Machine Learning and Data Mining using Python.
o Built using recursive partitioning (divide-and-conquer):
▪ Uses the feature values to split the data into smaller subsets of similar
classes.
o Supervised learning algorithm.
o Can be used for solving regression and classification problems.
❖ Greedy Algorithm: always makes the choice that seems to be the best at the moment.
❖ Algorithms used in Decision Trees:
o ID3 (Iterative Dichotomiser 3): builds decision trees using a top-down greedy search
approach through the space of possible branches with no backtracking.
▪ It begins with the original set S as the root node.
▪ On each iteration of the algorithm, it iterates through every unused
attribute of the set S and calculates the entropy (H) and information
gain (IG) of that attribute.
▪ It then selects the attribute which has the smallest entropy or the largest
information gain.
▪ The set S is then split by the selected attribute to produce a subset of
the data.
▪ The algorithm continues to recur on each subset, considering only
attributes never selected before.
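A short sketch of ID3's attribute choice on hypothetical counts: compute the entropy of the parent set and subtract the weighted entropy of the subsets produced by a candidate split; the attribute with the largest resulting information gain wins.

```python
# Sketch of ID3's attribute choice on hypothetical counts: entropy of the parent set
# minus the weighted entropy of the subsets produced by a candidate split.
import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

parent = [9, 5]               # 9 positive and 5 negative examples in S
subsets = [[6, 1], [3, 4]]    # class counts after splitting on a hypothetical attribute

weights = [sum(s) / sum(parent) for s in subsets]
info_gain = entropy(parent) - sum(w * entropy(s) for w, s in zip(weights, subsets))
print(info_gain)   # ID3 keeps the attribute with the largest information gain
```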
❖ Decision tree strengths and weaknesses:
o Strengths:
▪ More efficient than other, more complex models.
▪ Can be used on data with relatively few training examples or with a very large number of them.
▪ Can handle numeric features, nominal features, and missing data.
o Weaknesses:
▪ Easy to overfit or underfit the model.
▪ Small changes in the training data can result in large changes to the decision logic.
▪ Often biased towards splits on features having a large number of levels.
❖ Gini Index: a method used to select, from the n attributes of the dataset, which attribute
would be placed at the root or at an internal node.
o It measures how often a randomly chosen element would be incorrectly
identified: Gini(S) = 1 − Σᵢ pᵢ², where pᵢ is the proportion of class i in S.
o An attribute with a lower Gini index should be preferred.
❖ Entropy formula: Entropy(S) = − Σᵢ pᵢ log₂ pᵢ, where pᵢ is the proportion of examples in S that belong to class i.