
SCHOOL OF COMPUTER TECHNOLOGY
AASD 4001
Mathematical Concepts for Machine Learning
Lecture 1
Reza Moslemi, Ph.D., P.Eng.
[email protected]

Session 1
o Introduction to NumPy
o Vector Space Analysis and Linear Algebra

Introduction to NumPy

NumPy (Numerical Python):
• An open source Python library used in almost every field of science and engineering
• A universal standard for working with numerical data in Python
• NumPy users include everyone from beginner coders to experienced researchers doing state-of-the-art scientific and industrial research and development
• NumPy arrays are used extensively in Pandas, SciPy, Matplotlib, scikit-learn, scikit-image and most other data science and machine learning Python packages.

Introduction to NumPy (cont'd)

The NumPy library contains multidimensional array and matrix data structures.
• It provides ndarray, a homogeneous n-dimensional array object, with methods to efficiently operate on it.
• NumPy can be used to perform a wide variety of mathematical operations on arrays.

It adds powerful data structures to Python that guarantee efficient calculations with arrays and matrices, and it supplies an enormous library of high-level mathematical functions that operate on these arrays and matrices.

What is an array?
• An array is the central data structure of the NumPy library
• A grid of values containing information about the raw data, how to locate an element, and how to interpret an element
• The elements are all of the same type, referred to as the array dtype

Installing NumPy
• conda install numpy
• pip install numpy

Importing NumPy
• import numpy as np

Let's switch to the jupyter notebook and open "Practice 1.ipynb" to practice NumPy!

Vector Space Analysis and Linear Algebra

Linear algebra
• The branch of mathematics concerning linear equations. In the context of machine learning, it is the mathematical toolset to work with data (often vectors, matrices, or tensors)

Vector space:
• A vector space (also called a linear space) is a set of vectors, which may be added together and multiplied ("scaled") by scalar numbers.

What are scalars, vectors, matrices and tensors?

Scalar: a scalar value is simply a number
• e.g.: 0.1, −5, 48.2, pi

Vector: an n-by-1 entity with n values (1D)
• e.g.: [2, 0, −8.3]ᵀ is a 3-by-1 vector (3 rows and 1 column)
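To make this concrete, here is a minimal NumPy sketch (the array values reuse the examples above; the layout is illustrative, not from the course notebooks):

```python
import numpy as np

# A 3-by-1 column vector (3 rows, 1 column) stored as an ndarray
v = np.array([[2.0], [0.0], [-8.3]])
print(v.shape)   # (3, 1)
print(v.dtype)   # float64: all elements of an ndarray share one dtype

# A 1D array is the more common NumPy representation of a vector
u = np.array([0.1, -5.0, 48.2, np.pi])
print(u.ndim, u.shape)   # 1 (4,)
```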

Vector Space Analysis and Linear Algebra

Matrix: an n-by-m entity with n·m values (2D)
• e.g.: [[4, 6, 75, 8.4], [−8, 5, 6, 55], [0, 0, 42, 54]] is a 3-by-4 matrix (3 rows and 4 columns)

Tensor: an n-by-m-by-l entity with n·m·l values (3D)

Vector operations:
• Addition: [8, 5, 4] + [0, −2, 14] = [8, 3, 18]
• Scalar product: 3 · [8, 5, 4] = [24, 15, 12]
• Dot product: [8, 5, 4] · [0, −2, 14] = 8·0 + 5·(−2) + 4·14 = 46
  • The result of a vector dot product (aka inner product) is a scalar (just a number!)
  • The dot product of a vector with itself is the square of its magnitude: [8, 5, 4] · [8, 5, 4] = 8·8 + 5·5 + 4·4 = 105
  • The dot product is also related to the angle between the two vectors: $\vec{A} \cdot \vec{B} = |\vec{A}||\vec{B}| \cos\theta$, where $\theta$ is the angle between the two vectors.
  • If $\vec{A} \cdot \vec{B} = 0$, the two vectors are perpendicular to each other.

Vector Space Analysis and Linear Algebra

Vector operations:
• Cross product: [8, 5, 4] × [0, −2, 14] = [5·14 − (−2)·4, −(8·14 − 4·0), 8·(−2) − 0·5] = [78, −112, −16]
  • The result of a vector cross product is another vector (NOT just a number!)
  • The resultant vector is perpendicular to both original vectors:
    [8, 5, 4] · [78, −112, −16] = 0 and [0, −2, 14] · [78, −112, −16] = 0
  • The cross product is also related to the angle between the two vectors: $\vec{A} \times \vec{B} = |\vec{A}||\vec{B}| \sin\theta \, \hat{n}$, where $\theta$ is the angle between the two vectors and $\hat{n}$ is a unit vector perpendicular to the plane containing $\vec{A}$ and $\vec{B}$.
  • If $\vec{A} \times \vec{B} = 0$, the two vectors are collinear, either in the same direction or in exactly opposite directions.

Covariance and correlation
• Covariance is a mathematical term to quantitatively measure how much two vectors are related to each other. Similarly, it measures how two vectors change with respect to one another.
• $\mathrm{cov}(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$; no need to remember this formula, as we will use Python to calculate covariance.
• What is cov(x, x)?
• A problem with covariance is that it is difficult to interpret. What is a large vs. low covariance value? Is 10, 50, or 5000 high or low?
• To solve that problem, we need a normalized metric, i.e. correlation.

Correlation
• Correlation is obtained by dividing the covariance by the standard deviations of both variables:
  $\rho_{x,y} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y}$
• Correlation is always between −1 and 1.

(Figure: scatter plots contrasting a low covariance value with a high covariance value.)
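The same operations in NumPy (a minimal sketch; the data vectors x and y are made-up illustrations):

```python
import numpy as np

a = np.array([8, 5, 4])
b = np.array([0, -2, 14])

print(a + b)            # [ 8  3 18]  element-wise addition
print(3 * a)            # [24 15 12]  scalar product
print(np.dot(a, b))     # 46          dot product (inner product)
print(np.dot(a, a))     # 105         squared magnitude
c = np.cross(a, b)
print(c)                # [  78 -112  -16]  cross product
print(np.dot(a, c), np.dot(b, c))   # 0 0: perpendicular to both

# Covariance and correlation of two example data vectors
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.0])
print(np.cov(x, y))       # 2x2 covariance matrix; cov(x, x) is the variance
print(np.corrcoef(x, y))  # correlation matrix, entries between -1 and 1
```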

Vector Space Analysis and Linear Algebra

Correlation
• $\rho_{x,y} = 1$ means that there is perfect correlation.
• $\rho_{x,y} = 0$ means that there is no correlation.
• $\rho_{x,y} = -1$ means that there is perfect inverse correlation.

Matrix operations:
• Element-wise addition/subtraction: $\begin{pmatrix} a & b \\ c & d \end{pmatrix} \pm \begin{pmatrix} e & f \\ g & h \end{pmatrix} = \begin{pmatrix} a \pm e & b \pm f \\ c \pm g & d \pm h \end{pmatrix}$
• Multiplication: $\begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} e & f \\ g & h \end{pmatrix} = \begin{pmatrix} ae + bg & af + bh \\ ce + dg & cf + dh \end{pmatrix}$
• Is AB = BA? Try with A = [[1, 2], [7, 4]] and B = [[4, 1], [−2, 0]]
• $\begin{pmatrix} \alpha & 0 \\ 0 & \alpha \end{pmatrix}$ can be used to scale a vector by $\alpha$: $\begin{pmatrix} \alpha & 0 \\ 0 & \alpha \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \alpha x \\ \alpha y \end{pmatrix} = \alpha \begin{pmatrix} x \\ y \end{pmatrix}$
• Identity matrix: $I_n$ has 1 on the main diagonal and 0 elsewhere
• Inverse of a matrix ($A^{-1}$): $AA^{-1} = A^{-1}A = I_n$
  • This simple inverse is only defined for square matrices.
  • Even for square matrices, an inverse may NOT always exist.
• Transpose of a matrix ($A'$ or $A^T$): flipping a matrix over its diagonal, i.e. switching the row and column indices of the matrix.
• Determinant of a 2-by-2 matrix: $\det A = |A| = ad - bc$
• Determinant of a 3-by-3 matrix: $\begin{vmatrix} a & b & c \\ d & e & f \\ g & h & i \end{vmatrix} = a(ei - fh) - b(di - fg) + c(dh - eg)$

Vector Space Analysis and Linear Algebra

Matrix operations:
• Inverse of a 2-by-2 matrix: $\begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$
• Inverse of a 3-by-3 matrix:
  $A^{-1} = \frac{1}{\det A} \, \mathrm{adj}\, A$, where $\mathrm{adj}\, A = C^T$ and $C_{ij} = (-1)^{i+j} M_{ij}$
  • The adjugate of a matrix is the transpose of the cofactor matrix.
  • $M_{ij}$, the (i, j) minor, is the determinant of the submatrix formed by deleting the i-th row and j-th column.

What is the inverse of the following matrix?
A = [[2, 1, 0], [0, 2, 0], [2, 0, 1]]    A⁻¹ = [[0.5, −0.25, 0], [0, 0.5, 0], [−1, 0.5, 1]]

Let's switch to the jupyter notebook and open "Practice 2.ipynb" to practice what we have learnt!
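A quick NumPy check of these facts, including the 3-by-3 inverse example above (a minimal sketch):

```python
import numpy as np

A = np.array([[1, 2], [7, 4]])
B = np.array([[4, 1], [-2, 0]])
print(A @ B)   # matrix product
print(B @ A)   # a different result: AB != BA in general

M = np.array([[2.0, 1.0, 0.0],
              [0.0, 2.0, 0.0],
              [2.0, 0.0, 1.0]])
print(np.linalg.det(M))     # 4.0
M_inv = np.linalg.inv(M)
print(M_inv)                # [[0.5 -0.25 0.], [0. 0.5 0.], [-1. 0.5 1.]]
print(M @ M_inv)            # the identity matrix (up to rounding)
print(M.T)                  # transpose
```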

SCHOOL OF COMPUTER TECHNOLOGY
AASD 4001
Mathematical Concepts for Machine Learning
Lecture 2
Reza Moslemi, Ph.D., P.Eng.

Session 2
o Mathematics of Natural Language Processing algorithms
o Regression algorithms

Mathematics of Natural Language Processing algorithms

What is Natural Language Processing (NLP)?
• Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language. NLP draws from many disciplines, including computer science and computational linguistics, in its pursuit to fill the gap between human communication and computer understanding.

Why NLP?
• There is a wealth of audio- and text-based resources. If we can use NLP to our advantage, we gain worthy information.
  • Spam/ham detection for fraud prevention in text messages/emails
  • An online business analyzing reviews of its products
  • An automaker using NLP to learn the public's feedback on its latest vehicle line
  • Analyzing social media trends: what people like, what they discuss most, etc.

The building blocks of human language are words, but machine learning algorithms usually work with vectors of features.

Text → vector of features

Let's see how it can be done.

Consider the following 2 phrases:
• College Student
• College Professor

We can use their word counts to make a vector of the features (words):
• Let's gather all the words in both phrases: (college, student, professor)
• College Student -> (1, 1, 0)
• College Professor -> (1, 0, 1)
• These vectors, representing a document as a vector of words, are often known as bag of words

BOW Matrix = [[1, 1, 0], [1, 0, 1]]

It is common to improve the bag of words by using the TF-IDF (Term Frequency – Inverse Document Frequency) method.
• TF: importance of the term in the document
• IDF: importance of the term in all documents (the corpus)

TF-IDF is a popular scoring approach used to weigh terms for NLP tasks because it assigns a value to a term according to its importance in a document, scaled by its importance across all documents in your corpus. This mathematically eliminates naturally occurring words and selects words that are more descriptive of your text.

TF = the number of times a word appears in a document divided by the total number of words in the document
• e.g.: when a 30-word document contains the term "student" 5 times, the TF for the word "student" is 5/30 = 1/6.

$\mathrm{IDF} = \log \frac{N}{df_x}$
• $df_x$: number of documents containing x
• $N$: total number of documents
• e.g.: Let's assume the size of the corpus is 100 documents. If there are 20 documents that contain the term "student", then the IDF is: log(100/20) = log 5 = 0.70

Mathematically, the TF-IDF ($W_{x,y}$) of a word x in a document y is obtained from:
$W_{x,y} = TF_{x,y} \times IDF_x$

Using bag of words with TF-IDF, we end up with a sparse matrix, where each row represents a phrase and each column represents a given word, with the TF-IDF being the value of the cell.
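A minimal sketch of this pipeline with scikit-learn's CountVectorizer and TfidfVectorizer (note that scikit-learn uses a smoothed IDF variant, so its weights differ slightly from the hand calculation above):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

phrases = ["College Student", "College Professor"]

# Bag of words: raw word counts (columns sorted alphabetically:
# college, professor, student)
bow = CountVectorizer()
counts = bow.fit_transform(phrases)
print(bow.get_feature_names_out())   # ['college' 'professor' 'student']
print(counts.toarray())              # [[1 0 1], [1 1 0]]

# TF-IDF: counts re-weighted by how rare each term is across the corpus
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(phrases)
print(weights.toarray())   # 'college' appears everywhere, so it gets a lower weight
```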

Regression algorithms

In machine learning, regression algorithms are supervised models for estimating the relationships between a dependent variable and one or more independent variables.
• Linear Regression
• Logistic Regression (for classification)
• Etc.

What is a supervised learning model?

Regression algorithms (cont'd)

Linear regression
• Let's look at this in 2D. (Figure: data points in the X-Y plane; for new x values, what are the estimated y values?)
• This idea can be readily extended to n points.
• But there could be an infinite number of lines. (Figure: candidate lines l1, l2, l3 through the same data.)
• Which one to choose?
• Constraints: ground truth, intercept, etc.

We will adopt a least squares loss function
• A metric to minimize the overall error (sum of squared residuals)
• $LSLF = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $y_i$ and $\hat{y}_i$ are the actual and estimated y-values at $x_i$, respectively.
• R-squared (coefficient of determination) indicates the goodness of fit:
  $R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$
• R-squared lies between 0 and 1
  • R² = 0 means a very bad fit
  • R² = 1 means an excellent fit
• Other metrics: RMSE, MAE, MAPE
(Figure: a fitted line in the X-Y plane with residuals e1, e2, e3, e4 between the data points and the line.)

Let's move to jupyter and open "Linear Regression.ipynb" to continue.

Logistic regression (logit regression):
• (Figure: three example classification tasks.) The first 2 are binary classification (two classes only, usually 0 and 1), whereas the last example is multiclass classification (multinomial/multi-class logistic regression, using the softmax function instead of the sigmoid)
• Logistic regression employs the so-called logistic function (sigmoid):
  $p = \frac{1}{1 + e^{-l}}$
  $\mathrm{logit:}\; l = \ln\left(\frac{p}{1 - p}\right)$
• Using the logistic function, it can predict values that lie between 0 and 1. Then, using a threshold (usually 0.5), it can classify into 2 classes.
• Logistic regression has discrete outputs, as opposed to linear regression. The dependent variable is categorical.

In order to evaluate the quality of a logistic regression model, we use the so-called confusion matrix (error matrix).

A confusion matrix is given below:

                          Actual Class
                          P      N
Predicted Class    P      TP     FP
                   N      FN     TN

where:
P = Positive; N = Negative
TP = True Positive; FP = False Positive; TN = True Negative; FN = False Negative

An example of a confusion matrix for disease diagnosis for 200 patients:

n = 200 patients          Actual Class
                          P      N
Predicted Class    P      40     15
                   N      5      140

A couple of definitions:
• Accuracy: (TP + TN) / TOTAL = 180/200 = 90%
• Misclassification error: (FP + FN) / TOTAL = 20/200 = 10%

We try to avoid FP and FN predictions.

Let's move to jupyter and open "Logistic Regression.ipynb" to discuss logistic regression in more detail.
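A minimal sketch of both models on made-up synthetic data, using scikit-learn (note that sklearn's confusion_matrix puts actual classes on the rows and predicted classes on the columns, the transpose of the table above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import r2_score, confusion_matrix

rng = np.random.default_rng(0)

# Linear regression on noisy synthetic data
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.5, size=100)
lin = LinearRegression().fit(X, y)
print(lin.coef_, lin.intercept_)      # close to 3.0 and 2.0
print(r2_score(y, lin.predict(X)))    # R^2, between 0 and 1

# Logistic regression + confusion matrix
X2 = rng.normal(0, 1, size=(200, 2))
labels = (X2[:, 0] + X2[:, 1] > 0).astype(int)   # a simple two-class rule
clf = LogisticRegression().fit(X2, labels)
pred = clf.predict(X2)                           # sigmoid output thresholded at 0.5
print(confusion_matrix(labels, pred))            # rows: actual, cols: predicted
```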

SCHOOL OF COMPUTER TECHNOLOGY
AASD 4001
Mathematical Concepts for Machine Learning
Lecture 3
Reza Moslemi, Ph.D., P.Eng.

Session 3
o Decision Tree algorithms
o Gradient Descent algorithm
o Support Vector Machines
o Clustering algorithms

Decision Tree algorithms

What is a decision tree?

Outlook     Humidity   Wind     Played
Rain        High       Strong   Yes
Overcast    High       Weak     Yes
Rain        High       Strong   No
Overcast    Normal     Weak     Yes
Rain        Normal     Weak     Yes
Sunny       High       Strong   No
Overcast    Normal     Weak     Yes
Overcast    High       Strong   Yes
Rain        Normal     Strong   Yes
Sunny       High       Weak     Yes
Rain        Normal     Strong   No
Overcast    High       Strong   Yes

Elements of a tree
• Node: split for the value of a certain attribute of the data
• Root: the node where the first split happens
• Leaves: terminal nodes predicting the outcome
• Edges: outcome of a split to the next node

How does a decision tree work?
• At each step, the DT model tries to find the attribute that minimizes the entropy of the data in the next step
• In simple terms, making the remaining data as consistent (belonging to one class) as possible
• In mathematics, this is known as a greedy algorithm
  • It makes the optimal choice at each step as it attempts to find the overall optimal way to solve the entire problem.
  • It may not find the optimal solution, but it often works well
  • There are ways to improve it (e.g. random forest)
• Although the concept seems simple, it works well in practice!

Entropy – mathematical and intuitive approach
• Degree of disorder or randomness
• Mathematically, entropy is given by $H(S) = -\sum_i p_i(S) \log_2 p_i(S)$
  • where $p_i(S)$ is the probability of an element/class 'i' in the data.
• Intuitively, if you were to choose an attribute for the root, what would you choose to minimize the entropy of the data in the second level?

Attribute A   Attribute B   Attribute C   Class
0             0             1             Y
1             0             1             Y
0             0             1             Y
1             0             1             Y
1             0             0             Y
1             1             1             Z
1             1             0             Z
0             1             0             Z
0             1             0             Z

Which attribute at the root level would minimize the entropy of the data in the second level (remaining data belonging to one class as much as possible)?
• In simple terms, which attribute makes the remaining data belong to one class as much as possible?
• Splitting over A: A=0 → YYZZ and A=1 → YYYZZ
• Splitting over B: B=0 → YYYYY and B=1 → ZZZZ
• Splitting over C: C=0 → YZZZ and C=1 → YYYYZ

The primary weakness of a DT model is its tendency to overfit.
• Not very good predictive capability

How to improve?
• Limit the depth of the tree
• Random Forest is a well-known approach to improve the performance of a single DT

Random Forest:
• A new random sample of the training data for each tree (bootstrap or bagging): bootstrapping is random sampling with replacement from the available training data. Bagging (= bootstrap aggregation) is performing it many times and training an estimator for each bootstrapped dataset
• Random subset of attributes for each tree at each level (uncorrelated sub-trees)
  • For classification, it is common to allow $m$ out of $p$ features at each split, where $m = \sqrt{p}$
  • For regression, a good default is $m = p/3$
• More robust than a single decision tree
• Reduces overfitting and prediction error

Let's switch to the jupyter notebook to make classifications/predictions using DT and RF models for the data provided in "kyphosis.csv".
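A minimal sketch that reproduces the split comparison above in Python (the helper entropy() is ours, not from the course notebooks):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H(S) = -sum_i p_i log2 p_i of a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Class labels and attributes from the 9-row example table above
classes = np.array(list("YYYYYZZZZ"))
A = np.array([0, 1, 0, 1, 1, 1, 1, 0, 0])
B = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])
C = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0])

print(entropy(classes))  # entropy before any split

for name, attr in [("A", A), ("B", B), ("C", C)]:
    # Weighted average entropy of the two child nodes after the split
    h = sum((attr == v).mean() * entropy(classes[attr == v]) for v in (0, 1))
    print(name, round(h, 3))  # B gives 0.0: both children are pure
```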

Gradient Descent algorithm

• What is a gradient?
• What is the gradient descent algorithm?
• What about the stochastic gradient descent algorithm?

Gradient:
• The gradient vector can be interpreted as the "direction and rate of fastest increase"
• Consider a room where the temperature is given by T(x, y, z). At each point in the room, the gradient of T at that point will show the direction in which the temperature rises most quickly, moving away from (x, y, z). The magnitude of the gradient will determine how fast the temperature rises in that direction.
• Consider a surface whose height above sea level at point (x, y) is H(x, y). The gradient of H at a point is a vector pointing in the direction of the steepest slope at that point.
• Mathematically:
  $\nabla f(p) = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{pmatrix}$, where $p = (x_1, \ldots, x_n)$ is a point in an n-dimensional space.
• In our case, n is the number of features in each sample.

We can use the concept of gradient to minimize any loss function:
• Take the gradient at a point p; it gives the direction in which the loss function grows fastest (steepest ascent).
• In order to minimize the function, move in exactly the opposite direction, i.e. $-\nabla f(p)$.

• There is a chance that you miss the minimum if the steps are too large.
• Move in smaller steps, i.e. $-\nabla f(p)$ multiplied by a small learning rate, usually 0.01 as a starting point.
• A learning rate that is too large can cause the model to converge too quickly to a suboptimal solution, whereas a learning rate that is too small can cause the process to get stuck.
• We can take larger steps when we are far from the minimum and smaller steps when close to the minimum. This process is called learning rate scheduling and is often carried out automatically by gradient descent libraries.
• Avoid getting stuck at local minima or saddle points
  • Randomness helps to arrive at a global minimum and avoid getting stuck at local minima

Gradient descent does not work well with large datasets. Why?
• It uses the entire training set to compute the gradient of the loss function for each iteration of gradient descent, and then updates the function.
• Let's assume we have 10,000 samples with 20 features each. For each point, we need to calculate the partial derivatives wrt 20 features.
• Therefore, for each step we have 10,000 × 20 = 200,000 sets of calculations.

Disadvantages:
• Redundant computation for the same training examples for large datasets
• Can be very slow and intractable

Stochastic gradient descent is a solution that extends the idea of gradient descent to large datasets.

Stochastic Gradient Descent (SGD):
• The same idea of using the (−) gradient direction to minimize a loss function
• Instead of using all the data points (the gradient of the all-example loss), only one training example (the gradient of a one-example loss) is used at a time for computation
• While the gradient of the "all-example loss" may push us down into a local minimum, or get us stuck at a saddle point, the gradient of the "one-example loss" might point in a different direction, and might help us steer clear of these.
• We can improve SGD performance by using a random mini-batch (a subset, a certain number of examples) of the data points, rather than just one. This is the method that is often used in practice.

Different types of gradient descent:
• Batch Gradient Descent
• Stochastic Gradient Descent
• Mini-batch Gradient Descent

Let's move to the jupyter notebook and open "SGD.ipynb" to practice SGD classification.
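A minimal sketch of batch gradient descent and its stochastic variant for a least-squares loss (synthetic data; the learning rates and step counts are illustrative):

```python
import numpy as np

# Loss L(w) = mean((X @ w - y)^2); gradient = 2/m * X.T @ (X @ w - y)
rng = np.random.default_rng(1)
m = 200
X = np.column_stack([np.ones(m), rng.uniform(-1, 1, m)])  # bias column + 1 feature
y = 2.0 + 3.0 * X[:, 1] + rng.normal(0, 0.1, m)

w = np.zeros(2)
lr = 0.1                                    # learning rate
for step in range(500):
    grad = 2 / m * X.T @ (X @ w - y)        # gradient over ALL examples
    w -= lr * grad                          # step in the -gradient direction
print(w)                                    # close to [2.0, 3.0]

# Stochastic variant: one random example per update (noisier, but cheap per step)
w = np.zeros(2)
for step in range(5000):
    i = rng.integers(m)
    grad_i = 2 * X[i] * (X[i] @ w - y[i])   # gradient of a one-example loss
    w -= 0.01 * grad_i
print(w)
```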

Support Vector Machines

What are Support Vector Machines?
• Supervised machine learning algorithms to analyze data and recognize patterns
• Used for both regression and classification
• Can perform linear and non-linear analysis (decision boundaries)
• Work well with high-dimensional data (data with more than a few features), but can be computationally expensive
• SVM chooses extreme vectors, or support vectors, to create the hyperplane
• Support vectors are defined as the data points which are closest to the hyperplane and have some effect on its position. As these vectors are supporting the hyperplane, they are named support vectors
• For non-linear decision boundaries, the data can be converted into a linear problem using higher dimensions

Example:
• Let's consider data points with 2 features, x1 and x2
• It is possible to draw several lines that completely separate the classes
• Support Vector Machines (SVM) provide a mathematical framework to determine the optimal separation between classes with the maximal margin
• It is done so that the model can efficiently predict future data points
• The data points closest to the optimal hyperplane are called support vectors
• SVM maps training examples to points in space so as to maximize the width of the gap between the two categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall

But what about datasets that are not linearly separable?
• The approach here is known as the kernel trick.
• A kernel is a mathematical transformation that projects a given datapoint (dataset) onto a higher-dimensional space; this may make it easier to classify the data, where it can be linearly divided by a plane.

Let's move to the jupyter notebook and open "SVM.ipynb" to further explore SVMs with scikit-learn.
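A minimal scikit-learn sketch contrasting a linear SVM with the RBF kernel trick on a synthetic dataset that is not linearly separable:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)

# Class depends on the distance from the origin (a ring vs. a core),
# so no straight line can separate the two classes in 2D.
X = rng.normal(0, 1, size=(300, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)     # kernel trick: implicit higher-dim space

print(linear.score(X, y))             # poor: a line cannot separate a ring
print(rbf.score(X, y))                # near 1.0
print(rbf.support_vectors_.shape)     # the support vectors define the boundary
```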

Clustering algorithms

Clustering algorithms
• Unsupervised machine learning algorithms to group/classify a set of datapoints in a dataset such that datapoints in one cluster are more similar (in some sense) to each other than to those in other clusters

Different types of clustering methods:
• Connectivity-based clustering
  • Distance based
• Centroid-based clustering
  • Represents each cluster by a single mean vector
• Distribution-based clustering
  • Modeled using statistical distributions
• Density-based clustering
  • Defines clusters as connected dense regions in the data space

One of the most common and simple clustering algorithms is known as K-Means clustering.

K-Means clustering
• Attempts to group similar clusters of data together (based on their distance from K centroids)
• Minimizes the within-cluster variance
• Unsupervised machine learning algorithm: you do not need/have target information
  • In simple terms: you only need $x_1, \ldots, x_n$ but not $y_1, \ldots, y_n$
• What are possible applications of K-Means clustering (and clustering in general)?

Mathematically:
• Given a set of datapoints $x_1, \ldots, x_n$, the data is partitioned into $k$ ($\le n$) clusters so as to minimize the within-cluster variance (squared error function):
  $\underset{C}{\arg\min} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$
  where $\mu_i$ is the mean of the points in the $i$-th cluster, $C_i$

In practice, the K-Means algorithm is as follows:
• Randomly pick $k$ ($\le n$) centroids
• Assign each data point to the closest centroid
• Recompute the new centroids based on the mean of the points in each cluster
• Iterate the above 2 steps until the centroids stop moving (below a given threshold)
• To predict the cluster of a new, unseen datapoint, calculate its distance from all centroids and assign the class of its nearest centroid

In general, for clustering algorithms one needs to be mindful of:
• Too many and/or irrelevant features can negatively affect the clustering results
• The number of clusters is not known a priori; you should decide and (probably) fine-tune it
• The resultant clusters are not necessarily grouped as you wanted them to be
• Last but not least, let's repeat that K-Means is an unsupervised machine learning algorithm, and therefore you cannot verify its performance in real-life scenarios by comparing it to a test set (training/test approach).

Let's move to the jupyter notebook and open "K-Means.ipynb" to explore this in practice.
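A minimal sketch of these steps in plain NumPy (kmeans() is our illustrative helper, not the course notebook's code; it ignores the empty-cluster edge case):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-Means following the steps listed above."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(iters):
        # Assign each point to its closest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centroids as the mean of the points in each cluster
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):     # centroids stopped moving
            break
        centroids = new
    return centroids, labels

# Two well-separated synthetic blobs
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)   # near (0, 0) and (3, 3)
```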

SCHOOL OF COMPUTER TECHNOLOGY
AASD 4001
Mathematical Concepts for Machine Learning
Lecture 4
Reza Moslemi, Ph.D., P.Eng.

Session 4
o Matrix Factorization
o Mathematics of Digital Signal Processing

Matrix Factorization

Factorization:
• $x^2 - y^2 = (x - y)(x + y)$
• $x^2 + y^2 + 2xy = (x + y)^2$

The same idea can be applied to matrices and is called matrix factorization!
• There are various methods for matrix factorization, each with its own properties and applications
  • LU (lower-upper) Decomposition
  • Cholesky Decomposition
  • QR Decomposition
  • Singular Value Decomposition (SVD)
  • Etc.

What is most relevant to us in this course, especially for recommender systems, is SVD.

What is Singular Value Decomposition (SVD)?

The singular value decomposition of a complex m-by-n matrix M is given by:
• $M = U \Sigma V^T$
where the m-by-m $U$ and n-by-n $V$ are orthogonal matrices, and $\Sigma$ is an m-by-n rectangular diagonal matrix with non-negative real numbers on the diagonal.

Orthogonal matrix:
• a square matrix whose columns and rows are orthonormal vectors
• In simple terms: $V^T V = V V^T = I$, or $V^T = V^{-1}$

An example of SVD (one valid choice of signs):
$\begin{pmatrix} 1 & 0 & 0 & 0 & 2 \\ 0 & 0 & 3 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} 3 & 0 & 0 & 0 & 0 \\ 0 & \sqrt{5} & 0 & 0 & 0 \\ 0 & 0 & 2 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ \sqrt{0.2} & 0 & 0 & 0 & \sqrt{0.8} \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ -\sqrt{0.8} & 0 & 0 & 0 & \sqrt{0.2} \end{pmatrix}$

Obviously, manual calculation of the above SVD-related matrices is cumbersome.
• Fortunately, NumPy's linear algebra library has a built-in SVD method, and we will shortly explore an SVD example with it
• It is important to know the concepts behind the libraries we are using

How does SVD relate to ML, and recommendation systems in particular?

Consider the following rating matrix (user-item; user-movie here):

           Movie #1   Movie #2   Movie #3   Movie #4
User #1    5          3          -          1
User #2    4          -          -          1
User #3    1          1          -          5
User #4    1          -          -          4
User #5    -          1          5          4

Each user has rated some movies 0-5 (the "–" in the table means that the user has not rated that movie).
• e.g.: Netflix released such a database for a competition on Kaggle (approx. 500,000 users and 17,000 movies).

Let's assume that there are n users, m movies, and k ($k \le n$, to be chosen by the algorithm designer) latent features based on which the users have rated the movies.

If we form an R matrix consisting of the data shown in the movie database, we can use an SVD-like decomposition to write:
$R \approx P Q^T = \hat{R}$
• Each row of the n-by-k $P$ matrix denotes the association between a user and the features.
• Each row of the m-by-k $Q$ matrix denotes the association between a movie and the features.

If we could find P and Q, the prediction for any user u rating any movie i would be:
$\hat{r}_{ui} = p_u \, q_i^T = \sum_{a=1}^{k} p_{ua} q_{ia}$

But how do we find the P and Q matrices?
• SVD only works for matrices without missing values, but the rating matrix has a lot of missing values
• We can form an optimization problem to solve that issue

One approach is to initialize the P and Q matrices with some random values, calculate $\hat{R}$, and calculate how different it is from the actual $R$ (the error).

In simple terms, we need to minimize the following loss function containing the squared errors over the observed ratings:
$\min_{P,Q} \sum_{u,i} \left( r_{ui} - p_u \cdot q_i^T \right)^2$
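A minimal NumPy sketch using the built-in SVD method mentioned above, applied to the example matrix:

```python
import numpy as np

M = np.array([[1, 0, 0, 0, 2],
              [0, 0, 3, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 2, 0, 0, 0]], dtype=float)

U, s, Vt = np.linalg.svd(M)        # full SVD
print(s)                            # singular values: [3. 2.236 2. 0.]
print(U.shape, Vt.shape)            # (4, 4) (5, 5)

# Reconstruct M: embed the singular values in a 4x5 rectangular Sigma
Sigma = np.zeros(M.shape)
np.fill_diagonal(Sigma, s)
print(np.allclose(M, U @ Sigma @ Vt))     # True

# Orthogonality check: V^T V = I
print(np.allclose(Vt @ Vt.T, np.eye(5)))  # True
```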

Matrix Factorization (cont'd)

How do we minimize the given cost function?
• In the previous session, we learnt about the gradient descent algorithm for loss function minimization.

For the loss function $\sum_{u,i} (r_{ui} - p_u \cdot q_i^T)^2$, the gradients are given by:
• $\nabla p_{u,k} = \sum_{i} -2 \left( r_{ui} - p_u \cdot q_i^T \right) q_{ik}$
• $\nabla q_{i,k} = \sum_{u} -2 \left( r_{ui} - p_u \cdot q_i^T \right) p_{uk}$

and the iterative update formulas for the P and Q matrices are given by:
• $P \leftarrow P - \eta \, \nabla P$
• $Q \leftarrow Q - \eta \, \nabla Q$

In order to avoid overfitting, it is common to augment the previous loss function with an extra term of the following form (aka regularization):
$\min_{P,Q} \sum_{u,i} \left( r_{ui} - p_u \cdot q_i^T \right)^2 + \lambda \left( \sum_u \lVert p_u \rVert^2 + \sum_i \lVert q_i \rVert^2 \right)$

which results in the following expressions for the gradients:
• $\nabla p_{u,k} = \sum_{i} -2 \left( r_{ui} - p_u \cdot q_i^T \right) q_{ik} + 2\lambda p_{u,k}$
• $\nabla q_{i,k} = \sum_{u} -2 \left( r_{ui} - p_u \cdot q_i^T \right) p_{uk} + 2\lambda q_{i,k}$

which will be used in the update formula.

Let's move to the jupyter notebook, open "Matrix Factorization.ipynb" and practice.
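A minimal sketch of this procedure on the rating matrix above (the vectorized updates are one possible implementation of the per-element gradient formulas; the hyperparameters are illustrative):

```python
import numpy as np

# np.nan marks the missing ("-") ratings in the 5x4 table above
R = np.array([[5, 3, np.nan, 1],
              [4, np.nan, np.nan, 1],
              [1, 1, np.nan, 5],
              [1, np.nan, np.nan, 4],
              [np.nan, 1, 5, 4]])
n_users, n_items = R.shape
k, lr, lam = 2, 0.01, 0.02          # latent features, learning rate, regularization

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, k))
Q = rng.normal(scale=0.1, size=(n_items, k))

mask = ~np.isnan(R)                 # only observed ratings contribute to the loss
for epoch in range(5000):
    E = np.where(mask, R - P @ Q.T, 0.0)    # per-entry error on observed cells
    P += lr * (2 * E @ Q - 2 * lam * P)     # step opposite the gradient
    Q += lr * (2 * E.T @ P - 2 * lam * Q)

print(np.round(P @ Q.T, 2))  # R_hat: close to R on observed cells, filled in elsewhere
```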

Mathematics of Digital Signal Processing

Digital signal processing (DSP):
• The use of digital processing units (e.g. a computer) to perform a wide variety of signal processing operations.
• The digital signal is a sequence of numbers that represent samples of a continuous variable in a domain such as time, space, or frequency.

Applications: (Figure: examples of DSP application areas.)

Obtaining a digital signal from an original analog signal:
• Sampling: converting a continuous, time-varying signal into a discrete-time signal, a sequence of real numbers
• Quantization: replacing each real number with an approximation from a finite set of discrete values
• e.g.: sampling $x(t) = t + t^2 + \frac{1}{2} t^4$ at $n = 0, 1, 2, 3, 4, 5$ gives $x[n] = \{0.00, 2.50, \ldots\}$

Exercise:
Consider a continuous signal $x(t) = 0.5t^2 + \sin(t)$. Calculate $x[n]$ for n = 0-5, up to 2 decimal points.
$x[n]_{n = 0,1,2,3,4,5} = \{0.00, 1.34, \ldots\}$

Basic operations (processing) on digital signals:

Flipping (time reversal)
• It flips the signal over the y (vertical) axis.
• A technique to focus wave energy to a selected point in space and time, localize and characterize a source of wave propagation, and/or communicate information between two points.

Time Shifting
• Time shifting is, as the name suggests, the shifting of a signal in time. This is done by adding or subtracting an integer quantity of the shift to the time variable in the function. Subtracting a fixed positive quantity from the time variable will shift the signal to the right (delay) by the subtracted quantity, while adding a fixed positive amount to the time variable will shift the signal to the left (advance) by the added quantity.

Time Scaling
• Time scaling compresses or dilates a signal by multiplying the time variable by some quantity. If that quantity is greater than one, the signal becomes narrower and the operation is called decimation. In contrast, if the quantity is less than one, the signal becomes wider and the operation is called expansion or interpolation, depending on how the gaps between values are filled.

Decimation
• In decimation, the input of the signal is changed to be $f(cn)$. The quantity used for decimation, $c$, must be an integer so that the input takes values for which a discrete function is properly defined. The decimated signal $f(cn)$ corresponds to the original signal $f(n)$ where only each c-th sample is preserved (including $f(0)$), and so we are throwing away samples of the signal.
• It is the process of reducing the sampling rate. In practice, this usually implies lowpass-filtering a signal, then throwing away some of its samples. (Also downsampling or compaction)
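A minimal NumPy sketch of sampling and of the basic operations (the signals are the two examples above; np.roll is a circular shift, a simplification of a true delay):

```python
import numpy as np

# Sampling the two example signals at integer times n = 0..5
n = np.arange(6)
x1 = n + n**2 + 0.5 * n**4
print(x1[:2])                  # [0.  2.5], matching {0.00, 2.50, ...}

x2 = 0.5 * n**2 + np.sin(n)    # the exercise signal
print(np.round(x2, 2))         # [ 0.    1.34  2.91  4.64  7.24 11.54]

# Basic operations on a digital signal f[n]
f = x2
print(f[::-1])        # flipping (time reversal)
print(np.roll(f, 2))  # circular shift by 2; a true delay would pad with zeros
print(f[::2])         # decimation with c = 2: keep every 2nd sample, incl. f[0]
```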

Mathematics of Digital Signal Processing (cont'd)

Basic operations (processing) on digital signals:

Expansion
• In expansion, the input of the signal is changed to be $f(n/c)$. We know that the signal $f(n)$ is defined only for integer values of the input n. Thus, in the expanded signal we can only place the entries of the original signal $f$ at values of $n$ that are multiples of $c$. In other words, we are spacing the values of the discrete-time signal $c - 1$ entries away from each other. Since the signal is undefined elsewhere, the standard convention in expansion is to fill in the undetermined values with zeros.
• It produces an approximation of the sequence that would have been obtained by sampling the signal at a higher rate (upsampling).

Interpolation
• In practice, we may know specific information that allows us to provide good estimates of the entries of $f(n/c)$ that are missing after expansion. This process of inferring the undefined values is known as interpolation.

Even and Odd signals

Even signal:
• The signal is flipped about the y-axis
• $x[n] = x[-n]$

Odd signal:
• The signal is flipped around the origin
• $x[n] = -x[-n]$
• At n = 0, $x[0] = -x[0] = 0$.

Periodic signals:
• $x[n] = x[n + N]$

Complex numbers
• Complex numbers shorten the equations used in DSP, and enable techniques that are difficult or impossible with real numbers alone. For instance, the Fast Fourier Transform is based on complex numbers.

Cartesian form of a complex number: $z = x + jy$
• x is the real part, Re(z)
• y is the imaginary part, Im(z)
• Imaginary unit: $j = \sqrt{-1}$

Polar form of a complex number: $z = r(\cos\theta + j\sin\theta)$
• r is the magnitude of z, $|z|$
• $\theta$ is the angle (phase) of z
• $r = \sqrt{x^2 + y^2}$, $\theta = \tan^{-1}(y/x)$
• $x = r\cos\theta$, $y = r\sin\theta$

Exponential form of a complex number: $z = re^{j\theta}$
• $\theta$ must be strictly in radians (NOT degrees)

Euler's formula:
• $e^{j\theta} = \cos\theta + j\sin\theta$
• $z = r(\cos\theta + j\sin\theta) = re^{j\theta}$
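A minimal NumPy check of these forms (Python, like DSP texts, writes the imaginary unit as j):

```python
import numpy as np

z = 3 + 4j                        # Cartesian form x + jy
r, theta = abs(z), np.angle(z)    # magnitude and phase (radians)
print(r, theta)                   # 5.0 0.927...

# Polar/exponential form round-trip: z = r * e^{j theta}
print(r * np.exp(1j * theta))     # (3+4j) up to rounding

# Euler's formula: e^{j theta} = cos(theta) + j sin(theta)
t = 0.7
print(np.exp(1j * t), np.cos(t) + 1j * np.sin(t))  # identical
```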

SCHOOL OF COMPUTER TECHNOLOGY
AASD 4001
Mathematical Concepts for Machine Learning
Lecture 5
Reza Moslemi, Ph.D., P.Eng.

Session 5
o Fourier Transforms in Signal and Image processing
  o Introduction
  o 1D and 2D Fourier transforms
  o Filtering in the frequency domain

Intro to Fourier transform and frequency domain

Fourier's contribution (the Analytic Theory of Heat, 1822):
Any function that periodically repeats itself can be expressed as a sum of sines and cosines of different frequencies, each multiplied by a different coefficient. (Figure: a signal decomposed into components f1, f2, f3, f4, recombined as a·f1 + b·f2 + c·f3 + d·f4.)

The concept that complicated functions could be represented as a sum of simple sines and cosines was not at all intuitive. When it first appeared, it was a revolutionary concept to which it took mathematicians over a century to "adjust".

Even functions that are not periodic, but whose area under the curve is finite, can be expressed as the integral of sines and cosines multiplied by a weighing function. The formulation in this case is the Fourier transform, and its utility is even greater than the Fourier series in most practical applications.

This core technology allowed for the first time practical processing and meaningful interpretation of a wide range of signals, from medical monitors and scanners to modern electronic communication.

Sinusoid

$f(x) = A \sin(2\pi u x + \varphi)$
• A: amplitude
• u: ordinary frequency, the number of cycles per unit of time (Hertz) or distance
• $\omega = 2\pi u$: angular frequency, radians per unit of time/distance
• $T = 2\pi / (2\pi u) = 1/u$: period
• $\varphi$: phase

The 1D Fourier Transform

The Fourier transform of a single-variable, continuous function f(t) is defined by the equation
$F(u) = \int_{-\infty}^{\infty} f(t)\, e^{-j2\pi ut} \, dt$   (1)
where $j = \sqrt{-1}$. Conversely, given F(u), we can obtain f(t) by the inverse Fourier transform
$f(t) = \int_{-\infty}^{\infty} F(u)\, e^{j2\pi ut} \, du$   (2)
These two equations comprise the Fourier transform pair.
• The function can be recovered from its Fourier transform.
• The domain of the Fourier transform is the frequency domain. If t is in seconds, u is in Hertz (1/second).

The Fourier transform of a discrete function of one variable f(t), t = 0, 1, 2, ..., M−1, where M is the number of samples, is given by
$F(u) = \frac{1}{M} \sum_{t=0}^{M-1} f(t)\, e^{-j2\pi ut / M}$ for u = 0, 1, 2, ..., M−1
We can obtain the original function back using the inverse DFT:
$f(t) = \sum_{u=0}^{M-1} F(u)\, e^{j2\pi ut / M}$ for t = 0, 1, 2, ..., M−1

1D Discrete Fourier transform

How do we compute $F(u) = \frac{1}{M} \sum_{t=0}^{M-1} f(t)\, e^{-j2\pi ut / M}$?
Firstly, we substitute u = 0 and sum over all values of f(t). We then substitute u = 1 in the exponential and repeat the summation over all values of t. We repeat this process for all values of u in order to obtain the complete Fourier transform. It takes approximately M² summations and multiplications to compute the DFT.
Is the transform $F(u)$ a discrete quantity? Does it have the same number of components as $f(t)$?

The concept of the frequency domain follows directly from Euler's formula
$e^{j\theta} = \cos\theta + j\sin\theta$
Substituting this formula into the equation for F(u), and knowing that $\cos(-\theta) = \cos\theta$, gives us
$F(u) = \frac{1}{M} \sum_{t=0}^{M-1} f(t) \left( \cos\frac{2\pi ut}{M} - j \sin\frac{2\pi ut}{M} \right)$ for u = 0, 1, 2, ..., M−1
Each term of the Fourier transform is composed of the sum of all values of the function f(t). The values of f(t) are multiplied by sines and cosines of various frequencies. The domain (values of u) over which F(u) ranges is called the frequency domain, because u determines the frequency of the components of the transform. Each of the M terms of F(u) is called a frequency component of the transform.

The Fourier transform is analogous to a glass prism. The prism is a physical device that separates light into various colour components, each depending on its wavelength content. When we consider light, we talk about its spectrum or frequency content. Likewise, the Fourier transform acts like a "mathematical prism" that separates a function into different components, each based on frequency. The Fourier transform characterizes a function by its frequency content.

Important DFT quantities

From the equation $F(u) = \frac{1}{M} \sum_{t=0}^{M-1} f(t) (\cos\frac{2\pi ut}{M} - j\sin\frac{2\pi ut}{M})$ it follows that the components of the Fourier transform are complex quantities. Let's express F(u) in exponential coordinates:
$F(u) = |F(u)|\, e^{-j\phi(u)}$
where $|F(u)| = [R^2(u) + I^2(u)]^{1/2}$ is called the magnitude or spectrum of the Fourier transform. R(u) and I(u) are the real and imaginary parts of F(u), respectively. The properties of the spectrum are used for image enhancement.

Another quantity that we will rely on is the power spectrum, also referred to as spectral density, defined as the square of the Fourier spectrum:
$P(u) = |F(u)|^2 = R^2(u) + I^2(u)$
and
$\phi(u) = \tan^{-1}\frac{I(u)}{R(u)}$
is the phase angle or phase spectrum of the transform. It determines how the sinusoids line up relative to one another to form f(t).

Example of 1D Fourier transform

The Fourier transform of the box function is the sinc function. In this case the Fourier transform is a real-valued function. Both f(t) and F(u) could be discrete functions (the points on the plots are linked).
It is important to keep in mind that the samples f(t) are not necessarily taken at integer values of t in a finite interval. These can be equally spaced floating-point numbers.
$F(u) = \int_{-\infty}^{\infty} f(t)\, e^{-j2\pi ut}\, dt = Aw \, \mathrm{sinc}(uw)$; $\mathrm{sinc}(t) = \frac{\sin \pi t}{\pi t}$
(Figure: the box function f(t), its transform F(u), and the magnitude |F(u)|.)
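A minimal NumPy sketch of the 1D DFT of a sampled box function (note that np.fft.fft omits the 1/M factor used in the slides' convention, so we divide by M):

```python
import numpy as np

M = 64
f = np.zeros(M)
f[:8] = 1.0                         # a box of width 8 samples

F = np.fft.fft(f) / M               # frequency components F(u), u = 0..M-1
magnitude = np.abs(F)               # spectrum |F(u)|
phase = np.angle(F)                 # phase spectrum phi(u)
power = magnitude**2                # power spectrum P(u)

print(F[0])                         # dc component = average of f = 8/64 = 0.125
f_back = np.fft.ifft(F * M)         # inverse DFT recovers the original samples
print(np.allclose(f, f_back.real))  # True
```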

2D DFT and its Inverse

Extension of the 1D Fourier transform and its inverse to 2 dimensions is straightforward. The discrete Fourier transform of f(x, y) of size M x N is given by the equation
$F(u, v) = \frac{1}{MN} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, e^{-j2\pi (ux/M + vy/N)}$   (1)
The expression must be computed for values u = 0, 1, 2, ..., M−1 and also for v = 0, 1, 2, ..., N−1.
Given F(u, v), we obtain f(x, y) via the inverse Fourier transform, given by the expression
$f(x, y) = \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} F(u, v)\, e^{j2\pi (ux/M + vy/N)}$   (2)
for x = 0, 1, 2, ..., M−1 and y = 0, 1, ..., N−1.
Equations (1)-(2) comprise the 2D discrete Fourier transform pair. The variables u and v are frequency variables, and x and y are spatial or image variables.

Visual interpretation of 2D DFT

In the inverse transform above, f(x, y) is decomposed into a weighted sum of 2D basis functions using scalar products. (Figure: the 2D sinusoidal basis images into which an input image is decomposed.)

2D Discrete Fourier Transform

We define the 2D Fourier spectrum
$|F(u, v)| = [R^2(u, v) + I^2(u, v)]^{1/2}$
phase angle
$\phi(u, v) = \tan^{-1}\frac{I(u, v)}{R(u, v)}$
and power spectrum as
$P(u, v) = |F(u, v)|^2 = R^2(u, v) + I^2(u, v)$
where R(u, v) and I(u, v) are the real and imaginary parts of F(u, v).

It is a common practice to multiply the input image function by $(-1)^{x+y}$ prior to computing the Fourier transform. It has been shown mathematically that
$\mathcal{F}\{ f(x, y)\, (-1)^{x+y} \} = F(u - M/2, \; v - N/2)$
The equation states that the origin of the Fourier transform of $f(x, y)(-1)^{x+y}$ is located at u = M/2 and v = N/2, which is the centre of the M x N area occupied by the 2D DFT. We refer to this area as the frequency domain.

2D DFT Magnitude and Phase

(Figure: an image f(x, y), its magnitude |F(u, v)| centered at (u, v) = (0, 0), and its phase φ(u, v).)
Does the phase appear informative?

(Figure: an image, its FT magnitude, its FT phase, and the inverse FT computed from the magnitude alone, ignoring the phase.)

2D DFT important properties

1. The value of the Fourier transform $F(u, v) = \frac{1}{MN}\sum_{x=0}^{M-1}\sum_{y=0}^{N-1} f(x,y)\, e^{-j2\pi(ux/M + vy/N)}$ at (u, v) = (0, 0) is
$F(0, 0) = \frac{1}{MN} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)$
which is the average of f(x, y). If f(x, y) is an image, then the value of its Fourier transform at the origin is equal to the average gray level of the image. Because both frequencies are 0 at the origin, F(0, 0) is sometimes called the dc component of the spectrum. This terminology is from electrical engineering, where "dc" refers to direct current.

2. The spectrum of a Fourier transform is symmetric about the origin:
$|F(u, v)| = |F(-u, -v)|$
|F(u, v)| decreases with higher frequencies. Hence, low frequencies contain more image information than the higher ones. Higher frequencies correspond to faster gray level changes in the image.

Filtering in the frequency domain

Filtering in the frequency domain is straightforward, and it can be illustrated by the following diagram:

f(x, y) → Pre-processing → Fourier transform → F(u, v) → Filter function H(u, v) → H(u, v)F(u, v) → Inverse Fourier transform → Post-processing → g(x, y)

H(u, v) is called a filter since it suppresses certain frequencies in the transform while leaving others unchanged. The FT of the output (filtered) image is given by
$G(u, v) = H(u, v)\, F(u, v)$
The multiplication is defined on an element-by-element basis. Each component of H multiplies both the real and imaginary parts of the corresponding component of F. The filtered image g(x, y) is obtained by taking the inverse FT of G(u, v):
$g(x, y) = \mathcal{F}^{-1}\{ G(u, v) \}$

Basic filters and their properties

Notch filter
• Used to remove repetitive "spectral" noise from an image
• A notch filter is a filter that contains nulls in its frequency response.
• They are used in many applications where specific frequency components must be eliminated.
• Assuming that the FT has been centered, we can mathematically define the notch filter for the illustration as
  $H(u, v) = \begin{cases} 0 & \text{if } v = 0 \\ 1 & \text{otherwise} \end{cases}$
• The filter would set the central vertical line to 0 and leave all other frequency components untouched.
(Figure: a satellite image of Florida and the Gulf of Mexico showing periodic noise, its Fourier spectra, the image after notch filtering, and the noise captured by the complementary notch pass filter.)

Lowpass filter / Highpass filter
(Figure: lowpass and highpass filter transfer functions.)
How can we obtain the processed image?
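A minimal NumPy sketch of the whole pipeline (forward transform, an ideal lowpass H(u, v), inverse transform) on a synthetic image; the image content and the cutoff radius are illustrative:

```python
import numpy as np

# A synthetic 64x64 image: an offset, a low-frequency wave, high-frequency stripes
M = N = 64
x, y = np.meshgrid(np.arange(N), np.arange(M))
low = np.cos(2 * np.pi * 2 * y / M)               # 2 cycles: low frequency
stripes = 0.3 * np.sin(2 * np.pi * 16 * x / N)    # 16 cycles: high frequency
f = 0.5 + low + stripes

F = np.fft.fftshift(np.fft.fft2(f))      # center the spectrum, like (-1)^(x+y)
print(abs(F[M // 2, N // 2]) / (M * N))  # dc component = average gray level = 0.5

# Ideal lowpass filter H(u, v): keep frequencies within radius 8 of the center
u, v = np.meshgrid(np.arange(N) - N // 2, np.arange(M) - M // 2)
H = (u**2 + v**2 <= 8**2).astype(float)

G = H * F                                # G(u, v) = H(u, v) F(u, v)
g = np.fft.ifft2(np.fft.ifftshift(G)).real
print(np.abs(g - (0.5 + low)).max())     # ~1e-15: the stripes were filtered out
```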

Applications: image processing

(Figure: a forensic application of frequency-domain filtering.)

Low-pass and high-pass 2D filters

These filters can be created directly in the frequency domain. Low-pass filters are known as smoothing filters. High-pass filters are known as sharpening filters.
(Figure: example transfer functions |H(u, v)| of Butterworth and Gaussian high-pass and low-pass filters, and their effect on a sample image.)

The effect of high-pass and low-pass filtering

Let F(u, v) and H(u, v) (the filter) denote the Fourier transforms of f(x, y) and h(x, y). Then the convolution theorem states that
$f(x, y) * h(x, y) \Leftrightarrow F(u, v)\, H(u, v)$
The expression on the left (spatial convolution) can be obtained by taking the inverse Fourier transform of the expression on the right. Conversely, the expression on the right can be obtained by taking the forward Fourier transform of the expression on the left.
The operation of the discrete convolution of two functions is defined by
$f(x, y) * h(x, y) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} f(m, n)\, h(x - m, y - n)$
The function h is mirrored about the origin.
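A minimal numerical check of the convolution theorem (note that the DFT version of the theorem corresponds to circular convolution, i.e. the shifted indices of h wrap around):

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=(8, 8))
h = rng.normal(size=(8, 8))

# Direct circular convolution from the definition above
g_direct = np.zeros((8, 8))
for xx in range(8):
    for yy in range(8):
        for m in range(8):
            for n in range(8):
                g_direct[xx, yy] += f[m, n] * h[(xx - m) % 8, (yy - n) % 8]

# The same result via the frequency domain: inverse FFT of F x H
g_fft = np.fft.ifft2(np.fft.fft2(f) * np.fft.fft2(h)).real
print(np.allclose(g_direct, g_fft))   # True
```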

SCHOOL OF COMPUTER TECHNOLOGY
AASD 4001
Mathematical Concepts for Machine Learning
Lecture 6
Reza Moslemi, Ph.D., P.Eng.

Session 6
o Image processing techniques (Image transformations)
  o Convolution
  o Scaling
  o Rotation
  o Translation
  o Smoothing

Convolution

In image processing, a kernel, convolution matrix, filter, or mask is a small matrix used for blurring, sharpening, embossing, edge detection, and more. This is accomplished by doing a convolution between the kernel and an image.
Convolving a filter with an image results in the extraction of features that help the computer detect an object in an image. Otherwise, by simply collapsing an image into a vector of its pixel intensities, these features would not be captured.
Convolution operations on images are widely used for image enhancement.
Before we state the mathematical definition of convolution in the context of images, let's form a conceptual understanding of it.
Convolution is a misnomer; the operation actually is cross-correlation, since the kernel is not flipped (reversed).

(Figure: a 3x3 filter, "Filter 1", sliding over a 6x6 image with a stride of 1 pixel, producing a matrix of filter responses.)
Exercise: compute the missing values in the resulting matrix (avoid border pixels!)

(Figure: the same 6x6 image convolved with Filter 1 using a stride of 2.)

Each filter detects a small pattern.

(Figure: a second kernel, "Filter 2", applied to the same 6x6 image, producing a different feature map; together the two responses form two output images.)
When you repeat the convolution operation for each filter, you extract features from the original image that will then be used in training Convolutional Neural Networks to detect objects.

(Figure: example kernels: a blurring/smoothing kernel, a sharpening kernel, and a feature detector kernel.)

Convolution/spatial filtering

(Figure: a 3x3 mask/filter and the image section under it.)
Mechanics of convolution:
1. Slide the filter from pixel to pixel in an image,
2. At each pixel (x, y), calculate the response of the filter using a mathematical formula.
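A minimal sketch with scipy.signal.convolve2d (the random 6x6 image and the vertical-edge kernel are illustrative, not the slides' Filter 1):

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
image = rng.integers(0, 10, size=(6, 6)).astype(float)   # a toy 6x6 image

# A simple 3x3 feature-detector kernel (responds to vertical edges)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

# 'valid' keeps only positions where the mask fits entirely inside the
# image (the "avoid border pixels" case): output is 4x4 for a 6x6 input
response = convolve2d(image, kernel, mode="valid")
print(response.shape)   # (4, 4)

# convolve2d flips the kernel (true convolution). CNNs and most image
# tools actually compute cross-correlation, i.e. no flip:
xcorr = convolve2d(image, kernel[::-1, ::-1], mode="valid")
```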

Convolution/spatial filtering

The response of the filter at pixel (x, y) is given by the sum of products of the filter coefficients and the corresponding image pixel values in the area covered by the filter mask. For the 3 × 3 mask shown above, the response of filtering with the filter at a point (x, y) in the image is
R = w(−1, −1) f(x − 1, y − 1) + w(−1, 0) f(x − 1, y) + . . . + w(0, 0) f(x, y) + . . . + w(1, 0) f(x + 1, y) + w(1, 1) f(x + 1, y + 1)
For a mask of size m × n, we assume that m = 2a + 1 and n = 2b + 1, a > 0, b > 0. In practice, we use filters of odd sizes, with the smallest meaningful size being 3 × 3.

In general, filtering of an image f(x, y) of size M × N with a filter mask w of size m × n is given by:
$g(x, y) = \sum_{s=-a}^{a} \sum_{t=-b}^{b} w(s, t)\, f(x + s, y + t)$   (*)
Convolution: $g(x, y) = (w * f)(x, y) = \sum_{s=-a}^{a} \sum_{t=-b}^{b} w(s, t)\, f(x - s, y - t)$
where a = (m − 1)/2 and b = (n − 1)/2.

How do we generate a complete image?
We apply this equation for x = 0, 1, ..., M−1 and y = 0, 1, 2, ..., N−1. In this way, the mask processes all pixels in the image. Such filter masks are sometimes called convolution masks. Equation (*) defines a convolution operation (truly a cross-correlation; convolution rotates the filter by 180° before performing the multiplication) on the image f(x, y). It is also referred to as spatial filtering, as opposed to frequency domain filtering (Fourier transform).

What happens when the centre of the filter approaches the border of the image? If the centre of the mask moves any closer to the border, one or more rows or columns of the mask will be located outside of the image plane.
A common approach is to do image "padding", that is, to add rows and columns of 0s (or another constant pixel value), or padding by replicating rows and columns. The padding is then stripped off at the end of the process. This keeps the size of the filtered image the same as the original. But the values of the padding will have an effect near the edges, especially if the size of the mask is large.

Geometric transformations of images

Geometric transformations are widely used for the removal of image artifacts and for image registration (geographical mapping).
It is often necessary to perform a geometric transformation of the image coordinate system in order to:
- Align images that were taken at different times or with different sensors
- Correct images for lens distortion
- Correct effects of camera orientation
- Create special effects by morphing images

In a geometric/spatial transformation, each point (x, y) of image f(x, y) is mapped to a point (u, v) in a new coordinate system:
u = f1(x, y)
v = f2(x, y)
A digital image array has an implicit grid that is mapped to discrete points in the new domain. These points may not fall on grid points in the new domain.

Affine Transformation

An affine transformation is any transformation that preserves collinearity (all points lying on a line before the transformation will lie on a line after it) and ratios of distances (the midpoint of a line segment remains the midpoint after transformation).
In general, an affine transformation is a composition of rotations, translations, magnifications, and shears:
u = c11 x + c12 y + c13
v = c21 x + c22 y + c23
The combination of all these non-zero parameters creates rotations and shears. (Figure: examples of magnifications and translations, i.e. changes in location.)

What is the difference between a linear transformation and an affine transformation?
A linear transformation fixes the origin, whereas an affine transformation need not do so. An affine transformation is the composition of a linear transformation with a translation, so while the linear part fixes the origin, the translation can map it somewhere else.

A shear in the x direction, shown in the graph below, is produced by
u = x + 0.2y
v = y
(Figure: the effect of the x-shear on a sample grid.)

This transformation produces both a shear and a rotation:
u = x + 0.2y
v = −0.3x + y
(Figure: the combined shear and rotation applied to the same grid.)

A rotation by θ is produced by
u = x cos θ + y sin θ
v = −x sin θ + y cos θ

Affine transformation

We can evolve a sequence of basic affine transformations into a complex affine transform.
Combinations of transformations are most easily described in terms of matrix operations. To use matrix operations we introduce homogeneous coordinates. These enable all affine operations to be expressed as a matrix multiplication; otherwise, translation is an exception.
The rotation of a point, straight line or an entire image on the screen about a point other than the origin is achieved by first moving the image until the point of rotation occupies the origin, then performing the rotation, then finally moving the image back to its original position.
Translation of a point by a change of coordinates cannot be combined with other transformations by using simple matrix multiplication. Such a combination is essential if we wish to rotate an image about a point other than the origin: translation, rotation, and again translation.
To combine these three transformations into a single transformation, homogeneous coordinates are used. In a homogeneous coordinate system, two-dimensional Cartesian coordinate positions (x, y) are represented by triple coordinates (h·x, h·y, h), h ≠ 0.

As such, the affine equations are expressed as:
$\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} a & b & c \\ d & e & f \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}$
An equivalent expression using matrix notation is
q = Tp
where
$q = \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}, \quad T = \begin{pmatrix} a & b & c \\ d & e & f \\ 0 & 0 & 1 \end{pmatrix}, \quad p = \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}$

The transformation matrices can be used as building blocks. How do the transformation matrices for translation and scaling look?

• Translation by (x0, y0): $\begin{pmatrix} 1 & 0 & x_0 \\ 0 & 1 & y_0 \\ 0 & 0 & 1 \end{pmatrix}$
• Scaling by s1 and s2: $\begin{pmatrix} s_1 & 0 & 0 \\ 0 & s_2 & 0 \\ 0 & 0 & 1 \end{pmatrix}$
• Rotating by θ (counter-clockwise): $\begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}$ (for a clockwise rotation, swap the signs of the sin θ terms)

Composite Affine Transformation

The transformation matrix of a sequence of affine transformations, T1, T2, T3, is:
T = T3 T2 T1
The composite transformation for the example above is
$T = T_3 T_2 T_1 = \begin{pmatrix} 0.92 & 0.39 & -1.56 \\ -0.39 & 0.92 & 2.35 \\ 0.0 & 0.0 & 1.0 \end{pmatrix}$
Any combination of affine transformations composed in this way is an affine transformation.
The inverse transform is:
$T^{-1} = T_1^{-1} T_2^{-1} T_3^{-1}$

Given rotation, scaling, and translation matrices R, S, T, obtain the formula for their product H = RST.
You will usually want to translate the center of the image to the origin of the coordinate system, do any rotations and scalings, and then translate it back.

How to Find the Transformation

Suppose that you are given a pair of images to align. You want to try an affine transform to register one to the coordinate system of the other. How do you find the transform parameters?
Find a number of points {p0, p1, . . . , pn−1} in image A that match points {q0, q1, . . . , qn−1} in image B. Use the homogeneous coordinate representation of each point as a column in matrices P and Q; then we can write:
Q = HP
We need to solve for H in order to find the appropriate transformation.
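A minimal NumPy sketch: building-block matrices, composition about an off-origin point, and recovering H from matched points by least squares (the helper names and sample points are ours):

```python
import numpy as np

def translate(x0, y0):
    return np.array([[1, 0, x0], [0, 1, y0], [0, 0, 1]], dtype=float)

def scale(s1, s2):
    return np.array([[s1, 0, 0], [0, s2, 0], [0, 0, 1]], dtype=float)

def rotate(theta):  # counter-clockwise
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]], dtype=float)

# Rotate about the point (2, 3): translate to the origin, rotate, translate back
T = translate(2, 3) @ rotate(np.pi / 4) @ translate(-2, -3)

p = np.array([2, 3, 1.0])        # homogeneous coordinates (x, y, 1)
print(T @ p)                      # the rotation centre maps to itself

# Recover an affine transform H from matched point pairs (Q = H P)
P = np.array([[0, 1, 0, 1], [0, 0, 1, 1], [1, 1, 1, 1.0]])  # 4 points as columns
Q = T @ P                         # pretend these were measured in image B
H, *_ = np.linalg.lstsq(P.T, Q.T, rcond=None)
print(np.allclose(H.T, T))        # True: least squares recovers T
```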

Intensity transformations

Intensity transformations are used to increase the contrast between certain intensity values to emphasize certain image features. For instance, you may want to reverse the black and the white intensities, or you may want to make the darks darker and the lights lighter. An application of intensity transformations is to increase the contrast between certain intensity values so that you can pick out things in an image.
Typical intensity transformation functions include:
1. Photographic negative
2. Gamma transformation
3. Logarithmic transformations
4. Contrast-stretching transformations

Photographic negative

Assume input pixel intensities are in the range [0, L], where L = 255.
(Figure: (a) Original digital mammogram f(x, y); (b) negative image obtained using the negative transformation g(x, y) = 255 − f(x, y).)

Gamma transformation/correction

With gamma correction, g(x, y) = c f(x, y)^γ, where c, γ are positive constants, we can adjust grey level components to either brighten up or darken their intensities.
γ here controls the curve. Notice that the full range of the input is mapped to the full range of the output.
What condition should γ satisfy for brightening up the image?
(Figure: gamma curves for several values of γ; an original image next to a γ-corrected one with γ = 2.0.)

Logarithmic transformations

Logarithmic transformation can be used to brighten the intensities in an image. It is used to increase the detail (contrast) of lower intensity values. It is especially useful for bringing out detail in Fourier transforms. The logarithmic transform of image f(x, y) is:
g(x, y) = c log(1 + f(x, y))
The constant c is typically used to scale the range of the log function to match the intensity range of the original image:
c = 255 / log(1 + 255)
It can also be used to further increase contrast: the higher the c, the brighter the image will appear.
Log transformation compresses the dynamic range of images with large variations in pixel values.

Image enhancement in the spatial domain: (Figure: an image before and after a log transformation.)
Image enhancement in the frequency domain: (Figure: (a) a Fourier spectrum; (b) the log-transformed spectrum, with far more visible detail.)

Contrast-stretching transformations

Contrast-stretching transformations increase the contrast between the darks and the lights. Sometimes, we want to increase the intensity around a certain grey level. As a result, dark colours become a lot darker and light colours become a lot lighter, with only a few levels of grey around the level of interest. A contrast-stretching transformation can be created with the following function:
$g(x, y) = \frac{1}{\left(1 + m / (f(x, y) + \epsilon)\right)^E}$
E controls the slope of the function and m is the mid-line where we want to switch from dark to light values. ϵ is a small constant used to prevent division by 0.
(Figure: contrast-stretching curves over the grey levels, with changing E. What is the midline point m in this plot?)

We can also change the midline m, which shifts the contrast curve. Fixing E = 4 and varying m = 0.1, 0.2, ..., 0.6 gives a family of curves.
(Figure: contrast-stretching transformations with varying m.)
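A minimal NumPy sketch of the four transformations (here f is normalized to [0, 1] for the gamma and contrast-stretch curves, and the constants m, E, γ are illustrative):

```python
import numpy as np

L = 255.0
f = np.linspace(0, L, 6)              # a few sample gray levels

negative = L - f                       # photographic negative

gamma = 0.5                            # gamma < 1 brightens (on normalized input)
g_gamma = L * (f / L) ** gamma         # c chosen so [0, L] maps to [0, L]

c = L / np.log(1 + L)                  # scale the log output back to [0, 255]
g_log = c * np.log(1 + f)

E, m, eps = 4.0, 0.5, 1e-6             # contrast stretch around mid-level m
g_stretch = 1.0 / (1.0 + m / (f / L + eps)) ** E   # output in [0, 1]

for row in zip(f, negative, g_gamma, g_log, g_stretch):
    print(["%7.2f" % v for v in row])
```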

Smoothing filters

Smoothing filters are used for blurring and noise reduction. Blurring is used in preprocessing steps, specifically for the removal of small details from an image prior to object extraction, and for bridging small gaps in lines and curves.
Noise reduction can be accomplished by blurring with an average filter. By replacing the value of every pixel in an image by the average of the grey levels in the neighbourhood defined by the filter mask, this process results in an image with reduced abrupt transitions in pixel intensities. A major use of the averaging filter is the reduction of "irrelevant" detail in an image (pixel regions that are small with respect to the size of the filter mask).

A 3 × 3 averaging filter:
(1/9) × [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
It yields the standard average of the pixels under the mask. The constant 1/9 is a normalizing constant. For a mask of size m × n, what is the normalizing constant equal to?
The response of this mask at any point (x, y) of an image is given by the average of the pixels under the mask.

Given below is another example of a smoothing filter. What is the constant multiplier in front of the mask equal to?
(1/16) × [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
This mask is called a weighted average. The image pixels are multiplied by different coefficients, thus giving more importance to some pixels at the expense of others. The pixel at the centre of the mask is multiplied by a higher value than any other, thus giving this pixel more importance in the calculation of the average. The other pixels are inversely weighted as a function of distance from the centre of the mask.
The diagonal terms are further away from the centre than the orthogonal neighbours (by a factor of √2) and thus are weighted less. The basic strategy behind weighting the centre point the highest, and then reducing the value of the coefficients as a function of increasing distance from the origin, is simply an attempt to reduce blurring in the smoothing process.

Given below is the general form of a 3 × 3 smoothing filter:
W = [[w(−1,−1), w(−1,0), w(−1,1)], [w(0,−1), w(0,0), w(0,1)], [w(1,−1), w(1,0), w(1,1)]]
In general, a smoothing filter is a weighted averaging filter of size m × n (m and n are odd). The formula for filtering an M × N image with the weighted averaging filter is given by the expression
$g(x, y) = \frac{\sum_{s=-a}^{a} \sum_{t=-b}^{b} w(s, t)\, f(x + s, y + t)}{\sum_{s=-a}^{a} \sum_{t=-b}^{b} w(s, t)}$   (1)
The complete filtered image is obtained by applying equation (1) for x = 0, 1, 2, ..., M−1 and y = 0, 1, 2, ..., N−1.
What is the denominator equal to? Is it a constant?

Here is the result of smoothing an image using the averaging filter. Notice how filtering smoothes local sharp changes in the pixel intensities of the original image.
(Figure: an image before and after averaging-filter smoothing.)
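A minimal sketch applying both masks with scipy (boundary='symm' replicates edge pixels, one of the padding options discussed above; the test image is synthetic):

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
image = np.zeros((16, 16))
image[4:12, 4:12] = 255.0                        # a bright square
noisy = image + rng.normal(0, 20, image.shape)   # add noise

box = np.ones((3, 3)) / 9.0                      # averaging filter; 1/9 = 1/(m*n)
weighted = np.array([[1, 2, 1],
                     [2, 4, 2],
                     [1, 2, 1]]) / 16.0          # weighted average; 16 = sum of weights

smooth_box = convolve2d(noisy, box, mode="same", boundary="symm")
smooth_wtd = convolve2d(noisy, weighted, mode="same", boundary="symm")
print(noisy.std(), smooth_box.std(), smooth_wtd.std())  # smoothing reduces variation
```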

Final Exam Review

The final exam (30 pts) will focus on the material that was covered during the course sessions:
• You need to have knowledge of the underlying concepts and mathematics of the topics covered in the classroom.
• The final exam will NOT have any (python) coding questions
• You CAN use your course material during the exam
• This is NOT a group exam. Each student shall only use his/her own knowledge to answer the questions.
• You cannot communicate (in any form) with your classmates or other individuals to answer the questions.
• Failure to comply with GBC exam policies results in academic consequences.
