Slides Merged

The document outlines a course on Machine Learning, detailing its structure, motivation, and types of machine learning. It covers topics such as linear regression, support vector machines, and deep neural networks, along with practical exercises in industrial applications. The course is organized by professors from the University Erlangen-Nürnberg and includes a written exam based on understanding and implementation of the methods taught.


Organizational Information
Motivation
Machine Learning Types
Machine Learning Pipeline
Summary
Linear Regression - Motivation
Linear Regression - Overall Picture
Linear Regression - Model
Linear Regression - Optimization
Linear Regression - Basis Functions
Logistic Regression - Motivation
Logistic Regression - Framework
Overfitting and Underfitting
Problem Statement
Optimization
Kernel Trick
Hard and Soft Margin
Regression
Summary
Applications
Intuition
Mathematics
Summary
Perceptron
Multilayer Perceptron
Applications
Machine Learning for Engineers
Organizational Information

Course Organizers
Prof. Dr. Bjoern M. Eskofier
Machine Learning and Data Analytics Lab
University Erlangen-Nürnberg

Prof. Dr. Jörg Franke


Institute for Factory Automation and Production Systems
University Erlangen-Nürnberg

Prof. Dr. Nico Hanenkamp


Institute for Resource and Energy Efficient Production Systems
University Erlangen-Nürnberg

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 2


Course Details
5 ECTS
4 SWS
Written Exam
• Exam is in English. The questions are based on understanding the methods and
ability to implement them.
• Two exercises should be successfully completed.
• Exam date will be announced for all universities.

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 3


Course Structure
Lectures
1. Introduction to Machine Learning (today)
2. Linear Regression
3. Support Vector Machines
4. Dimensionality Reduction
5. Deep Neural Networks

Exercises
Two real-world industrial applications
• Exercise 1: Energy prediction using Support Vector Machines
• Exercise 2: Image classification using Convolutional Neural Networks

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 4


Lecture Material
Pattern Recognition and Machine Learning
by C. Bishop, 2007

Deep Learning
by A. Courville, I. Goodfellow, Y. Bengio, 2015

Machine Learning: A Probabilistic Perspective


by K. Murphy, 2012

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 5


Introduction to Machine Learning
Motivation

Four Industrial Revolutions
We are currently in the 4th Industrial Revolution.

1. Industrial Revolution (from ca. 1750): Steam machines allow industrialization.
2. Industrial Revolution (from ca. 1870): Mass production and division of labor with the help of electrical energy (keyword: conveyor belts).
3. Industrial Revolution (from ca. 1960): Electronics and IT enable automation-driven rationalization as well as varied series production.
4. Industrial Revolution (since 2013): Interlinked production. Machines communicate with each other and optimize the process; industrial production is interlinked with information and communication technology. In April 2013, the national platform Industrie 4.0 was founded at the Hannover Messe.

Source: Industrie 4.0 in Produktion, Automatisierung und Logistik, T. Bauernhansl, M. Hompel, B. Vogel-Heuser, 2014
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 2
Core of the Fourth Industrial Revolution is Digitalization
• 1960: the dominant factor was mechanical engineering; mechanical and plant engineering was dominated by mechanical engineering in the 1960s
• The share of mechanical engineering has been steadily decreasing with each decade until today
• During that time, the impact of computer science (AI and machine learning) has been increasing steadily, especially since the 1980s
• → Mechanical and plant engineering is composed of multiple fields and has become an interdisciplinary area, unlike in the 1960s

[Chart: percentage of Mechanical Engineering, Electrical Engineering, Computer Sciences and Systems Engineering in mechanical and plant engineering for 1960, 1980, 2000 and 2020; the share of mechanical engineering falls from about 95% to about 30%]
Source: Automatisierung 4.0, T. Schmertosch, M. Krabbes, 2018
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 3
What drives that progress in Computer Science?

Of course: algorithmic advances (NOT ONLY AI).
BUT: one significant force of progress IS AI.
• You can see that clearly in the number of publications and the applications in that area.

So now: What is Artificial Intelligence actually? And what is the difference to Machine Learning and Deep Learning?

[Chart: annual number of AI papers (total and for USA, Europe, China), 2000 to 2015]

Source: Left: Scopus, 2019


23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 4
Artificial Intelligence vs. Machine Learning
• So what is Artificial Intelligence? What is Machine Learning? What is
Deep Learning?
• AI: Algorithms mimic behavior of humans
• Machine Learning: Algorithms do that by learning from data (Features
are hand crafted!)
• Deep Learning: Algorithms do the learning from data fully automatically
(the algorithm finds good features itself)

Source: https://www.ibm.com/blogs
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 5
Driving factors for the Advancement of AI
There are three important reasons for these advancements!
1. Increase of the computational power
2. Increase in the amount of available data
3. Development of new algorithms

Sources:
Left Image: https://datafloq.com
Middle Image:https://europeansting.com
Right Image: https://medium.com
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 6
Increase of Computational Power
• Transistors are the building blocks of CPUs: small "blocks" in the computer doing "simple" calculations
• The number of transistors is correlated with computational power
• Moore's Law: "the number of transistors in a dense integrated circuit (IC) doubles about every two years."
• The number of transistors has been increasing, and hence the computational power
• In the chart (transistors per square millimeter by year, 1970 to 2020, logarithmic y-axis), the amount of transistors per mm² increases linearly on a logarithmic scale, i.e. exponentially
• That means exponential growth of computational power!

Source: https://medium.com
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 7
Computational Power of today’s supercomputers
• The increase in computational power is especially noticeable in supercomputers
• Supercomputers have millions of cores (2020)
• Leaders (2020) are Japan, USA and China
• Germany is in 13th place (best German system: SuperMUC-NG)

TOP500 Supercomputers (June 2020):
Rank | System | Specs | Site | Country | Cores | Rmax (Pflop/s) | Power
1 | Supercomputer Fugaku | Fujitsu A64FX (48C, 2.2GHz), Tofu Interconnect D | RIKEN R-CCS | Japan | 7,288,072 | 415.5 | 28.3
2 | Summit | IBM POWER9 (22C, 3.07GHz), NVIDIA Volta GV100 (80C), Dual Rail Mellanox EDR Infiniband | DOE/SC/ORNL | USA | 2,414,592 | 148.6 | 10.1
3 | Sierra | IBM POWER9 (22C, 3.1GHz), NVIDIA Tesla V100 (80C), Dual Rail Mellanox EDR Infiniband | DOE/NNSA/LLNL | USA | 1,572,480 | 94.6 | 7.4
4 | Sunway TaihuLight | Shenwei SW26010 (260C, 1.45GHz), Custom Interconnect | NSCC in Wuxi | China | 10,649,600 | 93 | 15.4
5 | Tianhe-2A | Intel Ivy Bridge (12C, 2.2GHz) | NSCC Guangzhou | China | 4,981,760 | 61.4 | 18.5
13 | SuperMUC-NG | ThinkSystem SD650, Xeon Platinum 8174 24C 3.1GHz, Intel Omni-Path | Leibniz RZ | Germany | 305,856 | 19 | -

Source: https://www.top500.org
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 8
Increase in the Amount of Available Data
• The other big driving factor for AI: availability of DATA
• The amount of created data increased from 2 zettabytes in 2010 to 47 zettabytes in 2020
• It comes from all the things around us (cars, IoT, aircraft, ...)
• According to General Electric (GE), each of its aircraft engines produces around 20 terabytes of data per hour

https://www.statista.com https://www.forbes.com

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 9


Development of New Algorithms
• 2006: Geoffrey Hinton et al. published a paper showing how to train a deep neural network; they branded this technique "Deep Learning"
• Since then, the performance of visual object recognition has been increasing steadily
• 2015: machine learning algorithms exceeded the human capability in object recognition!

Source: A Roadmap for Foundational Research on Artificial Intelligence in Medical Imaging, Curtis P. Langlotz et al., 2018
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 10
Industrial Examples – Real Environments
KUKA Bottleflip Challenge (video):
• Engineers from KUKA tackled the "Bottleflip Challenge": the robot flips a bottle in the air
• Fun fact: the robot trained itself in a single night
Source: https://www.youtube.com/watch?v=HkCTRcN8TB0&t

WAYMO driverless driving (video):
• Driving has traditionally been a human job; here it is carried out by machine learning algorithms
• Autonomous driving could progress thanks to artificial neural networks and deep learning
Source: https://www.youtube.com/watch?v=aaOB-ErYq6Y&t

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 11


Game Examples – Defined Rules
AlphaGo and Go (creators: DeepMind):
• In October 2015, AlphaGo won against the European champion, Mr. Fan Hui, 5 – 0
• Go is considered the most challenging classical game
• AlphaGo uses an "advanced search tree", a deep network and reinforcement learning

AlphaStar and StarCraft 2:
• StarCraft II is one of the most enduring and popular real-time strategy video games of all time and a widely popular e-sport
• AlphaStar is the first AI to reach the top league, without any game restrictions

Source: https://deepmind.com
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 12
Introduction to Machine Learning
Machine Learning Types

Machine Learning Categories
Machine Learning

• Supervised Learning: the algorithm learns based on feedback from a teacher. Example: classification of an image (cat vs. dog).
• Unsupervised Learning: the algorithm finds meaningful patterns in the data using a metric. Example: finding different buyer patterns.
• Reinforcement Learning: learning based on positive & negative reward. Example: automating the movement of a robot in the wild.
Source: Business Intelligence and its relationship with the Big Data, Data Analytics and Data Science, Armando Arroyo, 2017
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 2
Supervised Learning
The agent observes some example Input (Features) and Output (Label)
pairs and learns a function that maps input to output.

Key Aspects:
• Learning is explicit
• Learning using direct feedback
• Data with labeled output

→ Resolves classification and regression problems

Source: https://www.kdnuggets.com
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 4
Supervised Learning Problems

[Figure: left, a classification dataset with two classes (label y = Dog, label y = Cat) plotted over features x0 and x1; right, a regression dataset with label y plotted over feature x]

Source: https://www.kdnuggets.com
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 5
Regression
Regression is used to predict a continuous value.

Training is based on a set of input – output pairs (samples):

\mathcal{D} = \{ (\mathbf{x}_0, \mathbf{y}_0), (\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_n, \mathbf{y}_n) \}

Sample: (\mathbf{x}_i, \mathbf{y}_i)
\mathbf{x}_i \in \mathbb{R}^m is the feature vector of sample i
\mathbf{y}_i \in \mathbb{R} is the label value of sample i

[Figure: regression data, label y plotted over feature x]

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 6


Regression
Goal:
Find a relationship (function) that best expresses the mapping from input to output!

That means, we fit a regression model f to all samples:

f(\mathbf{x}_i) = \mathbf{y}_i, \; \forall (\mathbf{x}_i, \mathbf{y}_i) \in \mathcal{D}

In this case f is a linear regression model (the black line in the figure).

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 7
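To make the fitting step concrete, here is a minimal Python sketch (using scikit-learn and made-up toy data, not data from the lecture) of fitting a regression model f to a dataset D of input–output pairs:

    # Minimal sketch: fit a regression model f to samples (x_i, y_i).
    # The toy data below is generated for illustration only.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(50, 1))                # feature vectors x_i (here m = 1)
    y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, 50)    # noisy linear input-output relationship

    model = LinearRegression().fit(X, y)                # fit f to all samples in D
    print("weight:", model.coef_, "intercept:", model.intercept_)
    print("prediction for x = 4:", model.predict([[4.0]]))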


Example: Predicting Surface Quality
In this example, we consider a milling process. We take speed and feed
as input features and the value of surface roughness is predicted
Milling Process Input: Speed, Feed Output: Surface Quality

https://de.wikipedia.org/wiki/Zerspanen

Objective:
• Prediction of the surface quality based on the production parameters
Realization:
• Speed and feed data are gathered and the surface roughness is measured for some trials
• Using regression algorithms, the surface quality is predicted
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 8
Example: Prediction of Remaining Useful Life (RUL)
“Health Monitoring and Prediction (HMP)” system (developed by BISTel)
Objective:
• Machine components degrade over time
• This causes malfunctions in the production line
• Replace component before the malfunction!

Realization:
• Use data of malfunctioning systems
• Fit a regression model to the data and predict
the RUL
• Use the model on active systems and apply
necessary maintenance

Source: https://www.bistel.com/
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 9
Classification
Classification is used to predict the class of the input.

Important:
The output belongs to exactly one class.

Sample: \mathbf{s}_i = (x_i, y_i)
y_i \in L is the label of sample i
In this example: L = \{Cat, Dog\}

[Figure: two classes (label y = Dog, label y = Cat) plotted over features x0 and x1]
Source: https://scikit-learn.org/
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 10
Classification
Goal:
Find a way to divide the input into the output classes!

That means, we find a decision boundary f for all samples:

f(\mathbf{x}_i) = \mathbf{y}_i, \; \forall (\mathbf{x}_i, \mathbf{y}_i) \in \mathcal{D}

In this case f is a linear decision boundary (the black line in the figure).

Source: https://scikit-learn.org/
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 11
Example: Foreign object detection
Objective:
• Foreign object detection on a production part
• After a production step a chip can remain
on the piston rod
• Quality assurance: Only parts without chip
are allowed for the next production step
Realization:
• A camera system records 3,000 pictures
• All images are labeled by a human
• A machine learning classification algorithm was trained to differentiate between chip and no-chip situations
[Images: piston rod without chip, piston rod with chip]

Source: Implementation and potentials of a machine vision system in a series production using deep learning and low-cost hardware, Hubert
Würschinger et al., 2020
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 12
Classification algorithms
Below, you can see example datasets and the application of different classification algorithms (Nearest Neighbors, Linear SVM, RBF SVM, Gaussian Processes).
Different algorithms use different ways to classify data.

Therefore:
The algorithms perform differently on the datasets.

Example:
The linear SVM has bad results on the second dataset.
Reason: there is no linear way to distinguish the data.

(In the figure, the lower right value shows the classification accuracy.)
Source: https://scikit-learn.org/
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 13
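As a small illustration of the point above, the sketch below compares a linear SVM with an RBF SVM on data that cannot be separated linearly; it assumes scikit-learn's two-moons toy dataset, which is not one of the exact datasets shown in the figure:

    # Sketch: a linear SVM vs. an RBF SVM on a dataset with no linear class boundary.
    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for name, clf in [("Linear SVM", SVC(kernel="linear")),
                      ("RBF SVM", SVC(kernel="rbf", gamma=2.0))]:
        clf.fit(X_train, y_train)
        print(name, "test accuracy:", clf.score(X_test, y_test))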
Classification vs. Regression
Both are supervised learning problems, i.e. they require labeled datasets. The difference between them is the kind of output they predict and the problems they are used for.

Regression:
• The outputs are continuous or real values
• We try to fit a regression model which can predict the output as accurately as possible
• Regression algorithms can be used to solve regression problems such as weather prediction, house price prediction, stock market prediction, etc.

Classification:
• The output variable must be a discrete value (class)
• We try to find a decision boundary which can divide the dataset into different classes
• Classification algorithms can be used to solve classification problems such as hand-written digits (MNIST), speech recognition, identification of cancer cells, defective or non-defective solar cells, etc.

Source: https://scikit-learn.org/
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 14
Machine Learning Categories
Machine Learning

• Supervised Learning: the algorithm learns based on feedback from a teacher. Example: classification of an image (cat vs. dog).
• Unsupervised Learning: the algorithm finds meaningful patterns in the data using a metric. Example: finding different buyer patterns.
• Reinforcement Learning: learning based on positive & negative reward. Example: automating the movement of a robot in the wild.
Source: Business Intelligence and its relationship with the Big Data, Data Analytics and Data Science, Armando Arroyo, 2017
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 15
Unsupervised Learning
Unsupervised learning observes some example Input (Features) – No
Labels! - and finds patterns based on a metric

Key Aspects:
• Learning is implicit
• Learning using indirect feedback
• Methods are self-organizing

Resolves clustering and dimensionality reduction problems

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 16


Example for Clustering
Iris Flower data set
Contains 50 samples each of the flowers Iris setosa, Iris virginica and Iris versicolor
Measured parameters:
petal length, petal width, sepal length, sepal width

Iris setosa Iris versicolor Iris virginica


Source: Fisher's Iris data set
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 17
Clustering
Goal: Identify similar samples and assign them the same label

Mostly used for data analysis, data exploration, and/or data preprocessing

Data should be:
• Homogeneous within a cluster (intra-cluster distance)
• Heterogeneous between clusters (inter-cluster distance)

[Figure: petal width vs. petal length for the Iris data; in the second version of the slide the samples are colored by the two found clusters (Cluster 1, Cluster 2)]
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 19
Clustering vs. Classification
Clustering does not have labeled data and may not find all classes!

[Figure: petal width vs. petal length for the Iris data. Left (classification): samples labeled as Iris setosa, Iris versicolor and Iris virginica. Right (clustering): the same samples assigned to only two clusters (Cluster 1, Cluster 2).]
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 20
Clustering algorithms
Different clustering methods (e.g. K-Means, Affinity Propagation, Mean Shift, Spectral Clustering, Ward) produce different results,
e.g. some algorithms "find" more clusters than others.

Example:
K-Means performs "well" on the third dataset, but not on datasets one and two.
Reason:
K-Means can only identify "circular" clusters.

Source: https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 21
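A short sketch of clustering in Python, assuming scikit-learn's built-in Iris data and K-Means with two clusters (as in the figure above, where clustering does not recover all three species):

    # Sketch: K-Means clustering of the Iris petal measurements (no labels are used).
    from sklearn.datasets import load_iris
    from sklearn.cluster import KMeans

    X = load_iris().data[:, 2:4]                       # petal length and petal width
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    clusters = kmeans.fit_predict(X)                   # cluster index for every sample

    print("cluster centers:", kmeans.cluster_centers_)
    print("cluster sizes:", [(clusters == k).sum() for k in (0, 1)])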
Curse of Dimensionality
“As the number of features or dimensions grows, the amount of data we
need to generalize accurately grows exponentially.” – Charles Isbell

The intuition in lower dimensions does not hold in higher dimensions:


• Almost all samples are close to at least one boundary
• Distances (e.g. euclidean) between all samples are similar
• Features might be wrongly correlated with outputs
• Finding decision boundaries becomes more complex
→ Problems become much more difficult to solve!

Source: Chen L. (2009) Curse of Dimensionality. In: LIU L., ÖZSU M.T. (eds) Encyclopedia of Database Systems. Springer, Boston, MA.
https://doi.org/10.1007/978-0-387-39940-9_133
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 22
Example: Curse of Dimensionality
The production system has N sensors attached with either the input
set to “On” or “Off”

Question: How many samples do we need, to have all possible sensor


states in the dataset?

N = 1:   D = 2^1 = 2
N = 10:  D = 2^10 = 1024
N = 100: D = 2^100 ≈ 1.27 × 10^30

For N = 100, the number of required samples is larger than the number of stars in the observable universe!
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 24
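The exponential growth can be checked directly; a tiny Python sketch:

    # Sketch: number of distinct states for N binary ("On"/"Off") sensors.
    for n in (1, 10, 100):
        print(f"N = {n:3d}: D = 2^{n} = {2 ** n:.3e}")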
Dimensionality Reduction
The goal:
Transform the samples from a high- to a lower-dimensional representation!

Ideally:
Find a representation which solves your problem!

Typical approaches:
• Feature Selection
• Feature Extraction

Example of a high-dimensional dataset (9 sensor features):
        | S0  | S1   | S2   | S3   | S4  | S5 | S6  | S7 | S8
Sample0 | 0.2 | 0.1  | 11.1 | 2.2  | Off | 7  | 1.1 | 0  | 1.e-1
Sample1 | 1.2 | -0.1 | 3.1  | -0.1 | On  | 9  | 2.3 | -1 | 1.e-4
Sample2 | 2.7 | 1.1  | 0.1  | 0.1  | Off | 10 | 4.5 | -1 | 1.e-9
Sample3 | 3.1 | 0.1  | 1.1  | 0.2  | Off | 1  | 6.6 | -1 | 1.e-1

Lower-dimensional representation, obtained by applying a function f to the samples; ideally it carries identical information with respect to the problem:
        | T0   | T1   | T2 | T3
Sample0 | 11.3 | 0.1  | -1 | 7.8
Sample1 | 4.3  | -0.1 | -1 | 6.8
Sample2 | 2.8  | 1.1  | 1  | 7.1
Sample3 | 4.2  | 0.1  | 1  | 6.9

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 27
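A sketch of feature extraction with PCA, one common choice for the function f (the 9-column sensor table above contains a categorical On/Off column, so the sketch uses random numeric stand-in data instead):

    # Sketch: dimensionality reduction by feature extraction with PCA.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X_high = rng.normal(size=(100, 9))      # stand-in for 100 samples with 9 numeric sensor features

    pca = PCA(n_components=4)               # target: a 4-dimensional representation
    X_low = pca.fit_transform(X_high)       # the learned function f applied to all samples

    print(X_high.shape, "->", X_low.shape)
    print("explained variance ratio:", pca.explained_variance_ratio_)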


Machine Learning Categories
Machine Learning

• Supervised Learning: the algorithm learns based on feedback from a teacher. Example: classification of an image (cat vs. dog).
• Unsupervised Learning: the algorithm finds meaningful patterns in the data using a metric. Example: finding different buyer patterns.
• Reinforcement Learning: learning based on positive & negative reward. Example: automating the movement of a robot in the wild.
Source: Business Intelligence and its relationship with the Big Data, Data Analytics and Data Science, Armando Arroyo, 2017
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 28
Reinforcement Learning
Reinforcement learning observes some example Input (Features) – no labels! – and finds the optimal action, i.e. it maximizes its future reward

Key Aspects:
• Learning is implicit
• Learning using indirect feedback
based on trials and reward signals
• Actions are affecting future measurements (i.e. inputs)

Resolves control and decision problems


i.e. controlling agents in games or robots
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 29
Reinforcement Learning
Goal: Agents should take actions in an environment which maximize the
cumulative reward.

To achieve this RL uses reward and punishment signals based on the


previous actions to optimize the model.

Source: Reinforcement Learning: An Introduction, Richard S. Sutton, Andrew G. Barto, 2018


23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 30
Reinforcement Learning vs. (Un)supervised Learning
Reinforcement learning:
• No teacher/supervisor, only reward signals
• Delayed feedback, not instantaneous (credit assignment problem)
• Learning by interaction between environment and agent over time
• The agent's actions affect the environment: actions have consequences!
  → the data is not i.i.d., and re-recording of data is necessary
• Active learning process: the actions that the agent takes affect the subsequent data it receives

Comparison:
(Un)supervised Learning:
• The feedback is given by a supervisor or a metric
• Feedback is instantaneous
• Learning by using static data (no re-recording necessary)
• Predictions do not affect future measurements; the data is assumed independent and identically distributed (i.i.d.)

Reinforcement Learning:
• The feedback is given by a reward signal
• Feedback can be delayed (credit assignment problem)
• Training is based on trials, i.e. interaction between environment and agent (re-recording necessary)
• The predictions (actions) affect future measurements, i.e. the measurements are not necessarily i.i.d.

Source: Reinforcement Learning: An Introduction, Richard S. Sutton, Andrew G. Barto, 2018


23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 31
Example: Flexible handling of components
Google designed an experiment to handle objects using a robot
After training the RL algorithm: The robots succeeded in 96% of the grasp
attempts across 700 trial grasps on previously unseen objects
Training objects:
Training and data
acquisition:
• 7 industrial robots
• 10 GPU's
• Numerous CPU's
• 580,000 gripping
attempts

Source: https://ai.googleblog.com/2018/06/scalable-deep-reinforcement-learning.html
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 32
Introduction to Machine Learning
Machine Learning Pipeline

The Machine Learning Pipeline
A concept that provides guidance in a machine learning project
• Step-by-step process
• Each step has a goal and a method

There exist multiple pipelines and concepts,


however the idea behind all pipelines and steps are the same!

The 8-step machine learning pipeline by Artasanchez & Joshi :


Problem Definition → Data Ingestion → Data Preparation → Data Segregation → Model Training → Model Evaluation → Model Deployment → Performance Monitoring

Source: Artificial Intelligence with Python - Second Edition, by Alberto Artasanchez, Prateek Joshi
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 2
1. Problem Definition
Our aim with machine learning is to develop a solution for a problem. In order to develop a satisfying solution, we need to define the problem. This definition lays the foundation of our solution: it shows us what kind of data we need and what kind of algorithms we can use.
Examples:
• If we are trying to detect faulty equipment, we have a classification problem
• If we are trying to predict a continuous number, we have a regression problem

Prediction of the energy consumption of a milling machine – problem definition

• Want to predict the necessary energy to conduct a process step


• The prediction should be made on the basis of the milling parameters
• Therefore we want to get transparency
• Further this should allow the optimization of the parameters

Problem Definition → Data Ingestion → Data Preparation → Data Segregation → Model Training → Model Evaluation → Model Deployment → Performance Monitoring

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 3


2. Data Ingestion
• After defining the problem, we have to define which data we need to describe the problem
• The data should represent the problem and the information needed to successfully predict the target
value
• Once we have defined the data we need, we can gather it from databases or record it in trials

Prediction of the energy consumption of a milling machine – data ingestion


Data:
• Feed rate
• Speed
• Path
• Energy consumption

• Data can be collected with sensors
• Trials can be conducted on an existing machine

Problem Definition → Data Ingestion → Data Preparation → Data Segregation → Model Training → Model Evaluation → Model Deployment → Performance Monitoring

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 4


3. Data Preparation
We gathered the data and created our dataset. But it is highly unlikely that we can use our data as soon as
we gather it. We have to prepare it for our machine learning solution. For example:
• We can get rid of instances with missing values or outliers
• We can use dimensionality reduction for higher dimensional dataset,
• We can normalize the data to transform our data to a common scale.
The quality of a machine learning algorithm is highly dependent on the quality of data!

Prediction of the energy consumption of a milling machine – data preparation


• In our dataset we have outliers or missing values
• We have enough instances with no outliers and missing values
• Therefore, the easiest way to treat the instances with missing values and outliers is to delete them

Problem Definition → Data Ingestion → Data Preparation → Data Segregation → Model Training → Model Evaluation → Model Deployment → Performance Monitoring

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 5
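A small pandas sketch of the preparation step described above (the column names and value ranges are made up for illustration):

    # Sketch: delete instances with missing values and obvious outliers.
    import pandas as pd

    df = pd.DataFrame({
        "feed_rate": [200.0, 210.0, None, 195.0, 205.0],
        "speed":     [1000.0, 1010.0, 995.0, 1005.0, None],
        "energy":    [0.045, 0.047, 0.046, 0.900, 0.044],
    })

    df = df.dropna()                              # remove instances with missing values
    df = df[df["energy"].between(0.0, 0.1)]       # remove outliers via an assumed plausible range
    print(df)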


An Example Dataset and Terminology
Each row of the table is an instance; each column (Weather, Temp., Humidity, Wind) is an attribute; a cell holds the value of the instance for that attribute; the last column ("Tennis recommended?") is the label, i.e. the target value.

Day | Weather | Temp. | Humidity | Wind   | Tennis recommended?
1   | Sunny   | Hot   | High     | Weak   | No
2   | Sunny   | Hot   | High     | Strong | No
3   | Cloudy  | Hot   | High     | Weak   | Yes
4   | Rainy   | Mild  | High     | Weak   | Yes
…   | …       | …     | …        | …      | …

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 6


4. Data Segregation
Before we train a model we have to separate our data set:
• We have to separate the target variable
• Training Set: This set is used to fit our model
• Validation Set: The validation set is used to test our fit of the model and tune its hyperparameters
• Test Set: Test set is used to evaluate the performance of the final version of our model

Prediction of the energy consumption of a milling machine – data segregation


Attributes / independent variables:
• Feed rate
• Speed
• Path
Target variable:
• Energy consumption

• We use a training/test split of 80%/20%
• Because we do not perform hyperparameter optimization, we need no validation set

Problem Definition → Data Ingestion → Data Preparation → Data Segregation → Model Training → Model Evaluation → Model Deployment → Performance Monitoring

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 7
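A sketch of the 80%/20% split with scikit-learn (the file name and column names are assumptions for illustration):

    # Sketch: separate the target variable and split into training and test set (80%/20%).
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("milling_data.csv")             # assumed data file
    X = df[["feed_rate", "speed", "path"]]           # attributes / independent variables (assumed names)
    y = df["energy_consumption"]                     # target variable (assumed name)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    print(len(X_train), "training samples,", len(X_test), "test samples")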


5. Model Training
Next step is to select an algorithm to train our data. So the question is which algorithm to use:
• Labeled data → supervised learning algorithms
• Prediction of a continuous value → regression algorithm
• Prediction of a class → classification algorithm
• There are numerous algorithms with their strengths and weaknesses
• To find out which algorithm is the best for our data set we have to test them

Prediction of the energy consumption of a milling machine – model training


We have labeled data → supervised problem
We predict a continuous value, the energy consumption (kWh) → regression problem
We use the following regression algorithms:
→ Linear regression
→ Random forest regression
→ Support vector regression

Problem Definition → Data Ingestion → Data Preparation → Data Segregation → Model Training → Model Evaluation → Model Deployment → Performance Monitoring

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 8
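A sketch of training the three regression algorithms named above with scikit-learn (synthetic stand-in data is used so the snippet runs on its own):

    # Sketch: train linear regression, random forest regression and support vector regression.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X_train = rng.uniform(size=(80, 3))                    # stand-in for feed rate, speed, path
    y_train = X_train @ np.array([0.5, 1.5, 0.2])          # stand-in energy consumption

    models = {
        "Linear regression": LinearRegression(),
        "Random forest regression": RandomForestRegressor(random_state=0),
        "Support vector regression": SVR(),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name, "trained, training R^2 =", round(model.score(X_train, y_train), 3))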


6. Model Evaluation
Training and evaluation of the model are iterative Prediction of the energy consumption of a milling
processes: machine – model evaluation
1. First, we train our model with the training set
2. Then evaluate its performance with validation
set with evaluation metrics
3. Based on this information, we tune our
algorithm’s hyperparameters

This iterative process continues until we decide that we cannot improve our algorithm any more.

Then we use the test set to see the performance on


unknown data
Random Forest Regressor achieves the best results
→ We choose this model

Problem Definition → Data Ingestion → Data Preparation → Data Segregation → Model Training → Model Evaluation → Model Deployment → Performance Monitoring

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 9


Classification - Confusion Matrix and Accuracy
The confusion matrix gives us a matrix as output and describes the performance of the model.

There are 4 important terms:
True Positives: the cases in which we predicted YES and the actual output was also YES
True Negatives: the cases in which we predicted NO and the actual output was NO
False Positives: the cases in which we predicted YES and the actual output was NO
False Negatives: the cases in which we predicted NO and the actual output was YES

Confusion matrix (n = 165):
            | Predicted: NO | Predicted: YES
Actual: NO  | 50            | 10
Actual: YES | 5             | 100

Classification accuracy is what we usually mean when we use the term accuracy. It is the ratio of the number of correct predictions to the total number of input samples.
Accuracy = (50 + 100) / 165 = 0.909

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 10
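The accuracy from the confusion matrix above, computed explicitly (a minimal sketch):

    # Sketch: accuracy from the confusion matrix above (TN = 50, FP = 10, FN = 5, TP = 100).
    tn, fp, fn, tp = 50, 10, 5, 100
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    print(f"accuracy = {accuracy:.3f}")   # 150 / 165 = 0.909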


Regression – Error Metrics
Mean Absolute Error (MAE):
• Average absolute difference between the original and predicted values
• Measures how far predictions were from the actual output
• Does not give an idea about the direction of the error

\mathrm{MAE} = \frac{1}{N} \sum_{j=1}^{N} \left| y_j - \hat{y}_j \right|

Mean Squared Error (MSE):
• Similar to MAE
• It takes the average of the square of the difference between original and predicted value
• Larger errors become more pronounced, so that the model can focus on larger errors

\mathrm{MSE} = \frac{1}{N} \sum_{j=1}^{N} \left( y_j - \hat{y}_j \right)^2

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 11
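Both error metrics in a few lines of NumPy (the values are made up):

    # Sketch: MAE and MSE for a set of predictions.
    import numpy as np

    y_true = np.array([0.045, 0.050, 0.047, 0.060])    # made-up measured values
    y_pred = np.array([0.044, 0.052, 0.050, 0.055])    # made-up model predictions

    mae = np.mean(np.abs(y_true - y_pred))
    mse = np.mean((y_true - y_pred) ** 2)
    print("MAE:", mae, "MSE:", mse)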


7. Model Deployment
We have trained our model, evaluated it with evaluation metrics and tuned the hyperparameters with the
help of our validation set. After finalization of our model, we evaluated it with the test set. Finally we are ready to
deploy our model to our problem. Here we often have to consider the following issues:
• Real time requirements
• Robust hardware (sensors and processor)

Prediction of the energy consumption of a milling machine – model deployment


We are using the model to predict the energy consumption based on production parameters.
Example input: Axis = 2, Feed = 800, Path = 60 → predicted energy consumption = 0.046

Problem Definition → Data Ingestion → Data Preparation → Data Segregation → Model Training → Model Evaluation → Model Deployment → Performance Monitoring

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 12


8. Performance Monitoring
We must monitor the performance of our model and make sure it still produces satisfactory results.
Unsatisfactory results might be caused by different reasons. For example,
• Sensors can malfunction and provide wrong data
• The data can be out of the trained range of the model
Monitoring the performance can mean the monitoring of the input data regarding any changes as well as the
comparison of the prediction and the actual value after defined time periods

Prediction of the energy consumption of a milling machine – performance monitoring

• We measure the actual energy consumption of 10 work steps after each maintenance procedure
• We calculate the error of the predictions of the energy consumption
• Within the defined range: the model is working sufficiently well
• Outside the defined range: the model is working insufficiently → stop using it and perform a root cause analysis

Problem Definition → Data Ingestion → Data Preparation → Data Segregation → Model Training → Candidate Model Evaluation → Model Deployment → Performance Monitoring

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 13


Introduction to Machine Learning
Summary

Summary
In this chapter we talked about:
• The history of machine learning and recent trends
• The different types of machine learning, such as supervised learning, unsupervised learning and
reinforcement learning
• The steps involved in a common machine learning pipeline

After studying this chapter, you should be able to:


• Identify the type of machine learning needed to solve a given problem
• Understand the differences of regression and classification tasks
• Set up a generic project pipeline containing all relevant steps from problem definition to
performance monitoring

23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 2


Machine Learning for Engineers
Linear Regression - Motivation

Motivation
Linear regression is used in multiple
scenarios each and every day!

Use Cases:
• Trend Analysis in financial
markets, sales and business
• Computer vision problems, i.e. registration & localization problems
• etc.
[Figure: financial development of FXCM with a linear regression trend channel]

Source: https://www.tradingview.com/script/OZZpxf0m-Linear-Regression-Trend-Channel-with-Entries-Alerts/
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 9
When do we use it?
The problem has to be simple:
• Dataset is small
• Linear model is enough i.e. Trend Analysis
• Linear models are the basis for complex models
i.e. Deep Networks

→ Let’s have a look at regression


i.e. prediction of a real-valued output

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 10


Example: Growth Reference Dataset (WHO)
The dataset contains age (5y – 12y) and height information from people in the USA.

Age       | Height
4.1 years | 108 cm
5.2 years | 112 cm
5.6 years | 108 cm
5.7 years | 116 cm
6.2 years | 112 cm
6.3 years | 116 cm
6.6 years | 120 cm
6.7 years | 122 cm

Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 11
Example: Growth Reference Dataset (WHO)
(Same data as on the previous slide.)

We want to answer questions like:
• What is the height of my child when it is 14 years old?
• What is the height of my child when it is 30 years old?

Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 12
Example: 1.Visualize the data
[Scatter plot: height in cm (y-axis) vs. age in years (x-axis) for the data above]

Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 13
Example: 2.Fit a model
What "model", i.e. function, describes our data?
• Linear model
• Polynomial model
• Gaussian model

Answer: Linear model!

[Figure: the data (height in cm vs. age in years) with a fitted straight line]

Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 17
Example: 3.Answer your Questions
What is the height of my child when it is 14 years old?
→ 165 cm

What is the height of my child when it is 30 years old?
→ 270 cm

[Figure: the fitted line extrapolated to ages 0 – 35 years, height in cm vs. age in years]

Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 19
Example: Actual Answer
What is the height of my child,
when it is 14 years old?
(We estimated 165 cm)
→ ~165 cm

What is the height of my child,


when it is 30 years old?
(We estimated 270 cm)
→ ~175 cm

Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 20
Next in this Lecture:
• What is the Mathematical Framework?
• How do we fit a linear model?

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 21


Machine Learning for Engineers
Linear Regression - Mathematics

Overview
• Overall Picture
• The Linear Model
• Optimization
• Basis Function Expansion

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 24




Overall Picture
Linear Regression is a method to fit linear models to our data!

The linear model:

f(\mathbf{x}) = w_0 \cdot x_0 + w_1 \cdot x_1 = \mathbf{y}

In our example:

\mathbf{x} = \begin{pmatrix} x_0 \\ x_1 \end{pmatrix} = \begin{pmatrix} 1 \\ \text{Age in years} \end{pmatrix}, \quad \mathbf{y} = \text{Height}, \quad \mathbf{w} = \text{weights}

With concrete weights the model becomes:

f(\mathbf{x}) = 70 \cdot x_0 + 6.5 \cdot x_1 = \mathbf{y}

→ Finding a good w_0 and w_1 is called fitting!

[Figure: the WHO data (height in cm vs. age in years) with the fitted line]

Note: Bold variables are vectors and/or matrices, non-bold variables are scalars!
Note: x (input) is called the independent/predictor variable and y (output) is called the dependent/outcome variable
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 30
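A tiny sketch of the example model with the weights w0 = 70 and w1 = 6.5 from the slide (pure Python, no fitting yet):

    # Sketch: the example linear model f(x) = w0*x0 + w1*x1 with x0 = 1 and x1 = age in years.
    def f(age_in_years, w0=70.0, w1=6.5):
        """Predicted height in cm."""
        return w0 * 1.0 + w1 * age_in_years

    for age in (6, 10, 14):
        print(f"age {age:2d} years -> predicted height {f(age):.1f} cm")
    # Predictions beyond the observed age range (e.g. 14 or 30 years) are extrapolations;
    # the "Actual Answer" slide earlier shows why extrapolating to adults fails.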
Overview
• Overall Picture
• The Linear Model
• Optimization
• Basis Function Expansion

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 31


The Linear Model
The previous description of the linear model is not accurate!
Reason: Real systems produce noise! 1)

Linear model with noise:

f(\mathbf{x}) = w_0 \cdot x_0 + w_1 \cdot x_1 + \epsilon_i

The \epsilon_i, and the collection \boldsymbol{\epsilon} over all samples, is called the Residual Error!

[Figure: the data with the fitted line; \epsilon_i is the vertical deviation of sample i from the line]
1) In this case we assume that the noise 𝜖𝑖 is Gaussian distributed
Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p.19
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 34
The Residual Error 𝝐
The residual error \epsilon is not part of the input!
It is a random variable!

Typically, we assume \epsilon is Gaussian distributed (i.e. Gaussian noise): 1)

p(\epsilon) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{\epsilon - \mu}{\sigma} \right)^2}

But other distributions are also possible!

[Figure: probability density of a Gaussian distribution]

1) The short form is \epsilon \sim \mathcal{N}(\mu, \sigma^2)
Note: In our use-case \mu = 0 and \sigma is small. That means our samples deviate only slightly.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 35
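A short sketch of what this noise assumption means in practice: data generated from a linear model plus Gaussian residual error (the weights and sigma are made-up values in the spirit of the example):

    # Sketch: generate samples y = w0 + w1*x + eps with eps ~ N(0, sigma^2).
    import numpy as np

    rng = np.random.default_rng(0)
    w0, w1, sigma = 70.0, 6.5, 2.0                      # mu = 0 and a small sigma, as in the note above
    x = rng.uniform(5, 12, size=100)                    # ages in years
    eps = rng.normal(loc=0.0, scale=sigma, size=x.shape)
    y = w0 + w1 * x + eps                               # noisy heights

    print("empirical mean of eps:", round(eps.mean(), 3), " empirical std:", round(eps.std(), 3))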
The distribution of 𝑦
The noise is Gaussian:

p(\epsilon) = \mathcal{N}(\epsilon \mid 0, \sigma^2)

"Inserting" the linear function:

p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}, \sigma) = \mathcal{N}(\mathbf{y} \mid \mathbf{w}^T \mathbf{x}, \sigma^2)

This is the conditional probability density for the target variable \mathbf{y}!

[Figure: the fitted line with the Gaussian densities p(\mathbf{y}_0 \mid \mathbf{x}_0) and p(\mathbf{y}_1 \mid \mathbf{x}_1) drawn at two inputs x_0 and x_1]

Note for later use: p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}) = p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}, \sigma)


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 36
General Linear Model
A more general formulation:

f(\mathbf{x}) = \sum_{j=1}^{D} w_j x_j + \epsilon = \mathbf{y}

Vector notation:

f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + \epsilon = \mathbf{y}

Note: Bold variables are vectors and/or matrices, non-bold variables are scalars!
Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p.19
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 37
What parameters do we have to estimate?
The model is linear:

f(\mathbf{x}) = \sum_{j=1}^{D} w_j x_j + \epsilon = \mathbf{y}

The model parameters are:

\boldsymbol{\theta} = \{ w_0, \ldots, w_n, \sigma \} = \{ \mathbf{w}, \sigma \}

Note: In this case, we assume that the noise is Gaussian distributed, i.e. \epsilon \sim \mathcal{N}(\mu, \sigma^2)
Note: We assume \mu = 0: that means we only have to estimate \sigma! Homework: Why can we assume \mu = 0?
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 39
Overview
• Overall Picture
• The Linear Model
• Optimization
• Basis Function Expansion

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 40


What are optimal model parameters?
We define the optimal parameters as:

\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} \; p(\mathcal{D} \mid \boldsymbol{\theta})

That means:
• We want to find the optimal parameters 𝛉∗
• We search over all possible 𝛉
• We select the 𝛉 which most likely generated our training dataset 𝒟

Reminder: 𝜃 are the parameters, 𝒟 is the training dataset!


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 41
Intuitive Example
\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} \; p(\mathcal{D} \mid \boldsymbol{\theta})

Model: \mathcal{D} \sim \mathcal{N}(\mu, \sigma^2)
Parameters: \boldsymbol{\theta} = \{ \mu, \sigma \}

\boldsymbol{\theta} = \{9, 0.9\} : Bad
\boldsymbol{\theta} = \{4, 0.9\} : Better
\boldsymbol{\theta} = \{5, 1.5\} : Best

[Figure: the distribution of \mathcal{D} with the three candidate Gaussians overlaid]

Reminder: \boldsymbol{\theta} are the parameters, \mathcal{D} is the training dataset!
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 44
Maximum Likelihood Estimation
This process is called Maximum Likelihood Estimation (MLE)

\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} \; p(\mathcal{D} \mid \boldsymbol{\theta})

Important: \mathcal{L}(\boldsymbol{\theta}) = p(\mathcal{D} \mid \boldsymbol{\theta}) is called the Likelihood!

Note: It can also be written as p(\mathcal{D} \mid \boldsymbol{\theta}) = \mathcal{L}(\boldsymbol{\theta})


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 45
The Likelihood Function

We assume each sample is independent, identically distributed (i.i.d)!

\mathcal{L}(\boldsymbol{\theta}) = p(\mathcal{D} \mid \boldsymbol{\theta})
                 = p(\mathbf{s}_0, \mathbf{s}_1, \ldots, \mathbf{s}_n \mid \boldsymbol{\theta})
                 = p(\mathbf{s}_0 \mid \boldsymbol{\theta}) \cdot p(\mathbf{s}_1 \mid \boldsymbol{\theta}) \cdots p(\mathbf{s}_n \mid \boldsymbol{\theta})   (using i.i.d.)
                 = \prod_{i=1}^{N} p(\mathbf{s}_i \mid \boldsymbol{\theta})

→ This is the basis to judge how good our parameters are!

Note: In our case a sample 𝒔𝑖 is 𝐱 𝑖 and 𝐲𝑖 ; Just assume both are “tied” together i.e. like a vector
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 46
The Log-Likelihood
The product of small numbers in a computer is numerically unstable!

We therefore transform the likelihood into the quasi-equivalent log-likelihood: *

\mathcal{L}(\boldsymbol{\theta}) = \prod_{i=1}^{N} p(\mathbf{s}_i \mid \boldsymbol{\theta}) \quad \Leftrightarrow \quad \boldsymbol{\ell}(\boldsymbol{\theta}) = \sum_{i=1}^{N} \log p(\mathbf{s}_i \mid \boldsymbol{\theta})

* The logarithm "just" scales the problem. Likelihood and log-likelihood both have to be maximized!
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 47
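A numerical sketch that ties this to the intuitive example above: evaluate the log-likelihood of an artificially generated dataset for the three candidate parameter sets (SciPy is assumed to be available):

    # Sketch: compare candidate parameters theta = {mu, sigma} by their log-likelihood.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    data = rng.normal(loc=5.0, scale=1.5, size=200)     # training dataset D, generated for illustration

    for mu, sigma in [(9.0, 0.9), (4.0, 0.9), (5.0, 1.5)]:
        log_lik = norm.logpdf(data, loc=mu, scale=sigma).sum()   # sum of log p(s_i | theta)
        print(f"theta = ({mu}, {sigma}): log-likelihood = {log_lik:.1f}")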
The conditional distribution 𝑝(𝐬𝑖 |𝛉)
We know the conditional is the Gaussian distribution of \mathbf{y} (see "The distribution of y" above):

p(\mathbf{s}_i \mid \boldsymbol{\theta}) = p(\mathbf{y}_i \mid \mathbf{x}_i, \mathbf{w}, \sigma) = \mathcal{N}(\mathbf{y}_i \mid \mathbf{w}^T \mathbf{x}_i, \sigma^2)

We insert this into the log-likelihood:

\sum_{i=1}^{N} \log p(\mathbf{s}_i \mid \boldsymbol{\theta})
    = \sum_{i=1}^{N} \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2\sigma^2} \left( \mathbf{y}_i - \mathbf{w}^T \mathbf{x}_i \right)^2} \right]
    = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \left( \mathbf{y}_i - \mathbf{w}^T \mathbf{x}_i \right)^2 - \frac{N}{2} \log(2\pi\sigma^2)
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 51
Interpreting the Loss
We can “simplify” the current log-likelihood:

ℓ(𝛉) = −1/(2𝜎²) · ∑_{i=1}^{N} (yᵢ − 𝐰ᵀ𝐱ᵢ)² − N/2 · log(2𝜋𝜎²)

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 52


Interpreting the Loss
We can “simplify” the current log-likelihood:

ℓ(𝛉) = −1/(2𝜎²) · ∑_{i=1}^{N} (yᵢ − 𝐰ᵀ𝐱ᵢ)² − N/2 · log(2𝜋𝜎²)

(the factor 1/(2𝜎²) and the term N/2 · log(2𝜋𝜎²) are constants with respect to 𝐰)

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 53


Interpreting the Loss
We can “simplify” the current log-likelihood:

ℓ(𝛉) = −1/(2𝜎²) · ∑_{i=1}^{N} (yᵢ − 𝐰ᵀ𝐱ᵢ)² − N/2 · log(2𝜋𝜎²)

(the factor 1/(2𝜎²) and the term N/2 · log(2𝜋𝜎²) are constants with respect to 𝐰;
the sum ∑_{i=1}^{N} (yᵢ − 𝐰ᵀ𝐱ᵢ)² is the Residual Sum of Squares)

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 54


Interpreting the Loss
We can “simplify” the current log-likelihood:

ℓ(𝛉) = −1/(2𝜎²) · ∑_{i=1}^{N} (yᵢ − 𝐰ᵀ𝐱ᵢ)² − N/2 · log(2𝜋𝜎²)

(the factor 1/(2𝜎²) and the term N/2 · log(2𝜋𝜎²) are constants with respect to 𝐰;
the sum ∑_{i=1}^{N} (yᵢ − 𝐰ᵀ𝐱ᵢ)² is the Residual Sum of Squares)

→ We only have to minimize the Residual Sum of Squares!

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 55


The Residual Sum of Squares
We minimize the loss term:
L(𝛉) = ∑_{i=1}^{N} (yᵢ − 𝐰ᵀ𝐱ᵢ)²

For linear regression: the loss surface is shaped like a bowl with a unique minimum!

[Figure: loss L(𝛉) plotted over the parameters w₀ and w₁; the cross marks the optimal parameters, i.e. where the loss is minimal]
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 56
Optimization
Multiple ways to find the minimum:
• Analytical Solution
• Gradient Descent
• Newton's Method

… and of course many, many more methods! 1)

1) Examples: Particle Swarm Optimization, Genetic Algorithms, etc.


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 57
Analytical Solution
For the analytical solution, we want a "simpler" form: 1)

L(𝛉) = ∑_{i=1}^{N} (yᵢ − 𝐰ᵀ𝐱ᵢ)²

NLL(𝛉) = ½ (𝐲 − 𝐱𝐰)ᵀ(𝐲 − 𝐱𝐰)

1) This form uses the “Negative Log Likelihood”, which can also be derived from the likelihood
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 59
Analytical Solution
For the analytical solution, we want a "simpler" form: 1)

NLL(𝛉) = ½ (𝐲 − 𝐱𝐰)ᵀ(𝐲 − 𝐱𝐰)

We find the minimum conventionally by using the derivative:

NLL′(𝛉) = 𝐱ᵀ𝐱𝐰 − 𝐱ᵀ𝐲

Equating the gradient to zero:

𝐱ᵀ𝐱𝐰 = 𝐱ᵀ𝐲

1) This form uses the “Negative Log Likelihood”, which can also be derived from the likelihood
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 60
Analytical Solution
The solution, i.e. the minimum, is:

𝐰 = (𝐱ᵀ𝐱)⁻¹ 𝐱ᵀ𝐲

with   𝐱ᵀ𝐲 = ∑_{i=1}^{N} 𝐱ᵢ yᵢ    and    𝐱ᵀ𝐱 = ∑_{i=1}^{N} 𝐱ᵢ𝐱ᵢᵀ
(𝐱ᵢ𝐱ᵢᵀ is the D×D matrix with entries xᵢ,ⱼ · xᵢ,ₖ)

In practice this is too time consuming to compute!
Reason: The more training data, the longer the calculation!

Note: 𝑁 : Amount of Samples in the Training Dataset ; 𝐷 : Amount of Dimensions (length of the input vector)
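A minimal NumPy sketch of this closed-form solution (not part of the original slides); the synthetic data and variable names are assumptions. np.linalg.solve is used instead of an explicit matrix inverse.

```python
import numpy as np

rng = np.random.default_rng(42)
N, D = 200, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=N)

# Solve the normal equation X^T X w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to w_true
```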
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 61
What is the quality of the fit?
The goodness of fit should be evaluated to verify the learned model!

A typical measure for quality is : 𝑅2

It is a measure for the percentage of variability of the dependent


variable (𝐲) in the data explained by the model!

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 62


The 𝑅𝟐 – Measure
The 𝑅𝟐 – Measure is calculated as:

Let ȳ be the mean of the labels, then

• Total sum of squares: SS_tot = ∑ᵢ (yᵢ − ȳ)²
• Residual sum of squares: SS_res = ∑ᵢ (yᵢ − f(𝐱ᵢ))²
• Coefficient of determination: R² = 1 − SS_res / SS_tot

→ This is different from the Mean Squared Error (MSE)!
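A minimal NumPy sketch of the R² computation above (not part of the original slides); y_true and y_pred are assumed arrays of labels and model predictions.

```python
import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.9])
print(r_squared(y_true, y_pred))  # close to 1.0 for a good fit
```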

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 63


Linear Regression: Evaluation

[Three example fits:]
• Left: the model can completely (100%) explain the variations in y
• Middle: the model can partially (91.3%) explain the variations in y, but is still considered good
• Right: the model cannot explain any (0%) variation in y, because it simply predicts the average of y

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 64


Overview
• Overall Picture
• The Linear Model
• Optimization
• Basis Function Expansion

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 65


Can we fit the following dataset?
[Figure: House Prices in Springfield (USA) – house prices in Mio. € (1.1–1.6) plotted over the years 1900–2020]

Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 66
Polynomial Regression
The answer is – of course – yes!

Polynomial regression fits a nonlinear relationship between the input 𝐱 and the corresponding conditional mean of 𝐲.

Example of a polynomial model:

y = w₀ + w₁x + w₂x² + ⋯ + wₘxᵐ + 𝜖

[Figure: House Prices in Springfield (USA) – house prices in Mio. € over the years 1900–2020, with a polynomial fit]

Note: 𝑚 is called the degree of the polynomial


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 67
How can we use polynomials in linear regression?
Trick: Basis function expansion!

We can just apply a function before we predict our “linear line”:


f(𝐱) = ∑_{j=1}^{D} w_j 𝚽_j(𝐱) + 𝜖 = y

That means: We transform our Input-Space by using the function 𝚽!

→ The estimation from “normal” linear regression is the same!


Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p.19
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 68
Example: Polynomial Regression
The Basis Function for our example:

𝚽(x) = [1, x, x², x³, x⁴, x⁵, x⁶]ᵀ

The model parameters are: 1)
𝛉 = {w₀, w₁, w₂, w₃, w₄, w₅, w₆, 𝜎}

[Figure: House Prices in Springfield (USA) – house prices in Mio. € over the years 1900–2020, with the polynomial fit]

1) We assume that the noise is Gaussian distributed!


Note: The input here is NOT a vector! It is a scalar! However, the result is a 7-dim vector!
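A minimal NumPy sketch of basis function expansion followed by ordinary least squares (not part of the original slides); the synthetic sine data and the chosen degree are assumptions.

```python
import numpy as np

def polynomial_basis(x, degree):
    # maps a scalar array x of shape (N,) to features [1, x, x^2, ..., x^degree]
    return np.vander(x, degree + 1, increasing=True)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)  # synthetic data

Phi = polynomial_basis(x, degree=6)           # N x 7 design matrix
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # same normal equation as before
y_pred = Phi @ w
```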
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 69
Complexity of Polynomial Regression
The greater the degree of the polynomial, the more complex the relationships we can express!

[Figures: polynomial fits of degree m = 2 and degree m = 13]
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 70
Basis Functions
We can use any arbitrary function, with one condition:

The resulting vector has to have a constant "length",
i.e. the model stays linear with respect to the parameters!

Possible basis functions:


• 𝚽(𝐱) = [1, x₀, x₁, …, x_D]ᵀ
• 𝚽(x) = [1, x, x², …, xᵐ]ᵀ
• 𝚽(𝐱) = [1, x₀, x₁, x₀x₁, x₀², x₁², x₂²]ᵀ
• etc.
Note: That is the reason, why polynomial regression can be considered linear
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 71
Examples for Basis Functions
[Figures: regression surfaces for 𝚽(𝐱) = [1, x₀, x₁]ᵀ and 𝚽(𝐱) = [1, x₀, x₁, x₂, x₁², x₂²]ᵀ]

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 72


Thank you for listening!
Machine Learning for Engineers
Logistic Regression - Motivation

Bilder: TF / Malter
Motivation
Logistic regression is the application of
linear regression to classification!

Use Cases:
• Credit Scoring
• Medicine
• Text Processing
• etc.

→ Especially useful for “explainability”


Source: https://activewizards.com/blog/5-real-world-examples-of-logistic-regression-application
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 75
When do we use it?
The problem has to be simple:
• Dataset is small
• Linear model is enough
• Basis for complex models

→ Let’s have a look at classification


i.e. prediction of a categorical-valued output

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 76


Example: Iris Flower Dataset
Contains 50 samples each of the flowers Iris setosa, Iris virginica and Iris versicolor.

We want to answer questions like:
My flower has a Petal Width of 7 mm and a Petal Length of 15 mm.
Is my flower an Iris Setosa or an Iris Versicolor?

Label    Petal Width   Petal Length
Setosa   5.0 mm        9.2 mm
Versi.   9.2 mm        26.1 mm
Setosa   7.7 mm        18.9 mm
Versi.   9.1 mm        32.1 mm
Setosa   7.9 mm        15.5 mm
Setosa   5.7 mm        12 mm
Setosa   2.5 mm        13.5 mm
Versi.   15.0 mm       39 mm

Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 77
Example: Transform the Label
Label Petal Petal
Width Length
Setosa 5.0 mm 9.2 mm
Versi. 9.2 mm 26.1 mm
Setosa 7.7 mm 18.9 mm
Versi. 9.1 mm 32.1 mm
Setosa 7.9 mm 15.5 mm
Setosa 5.7 mm 12 mm
Setosa 2.5 mm 13.5 mm
Versi. 15.0 mm 39 mm

Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 78
Example: Transform the Label
Label Petal Petal
Width Length
0 5.0 mm 9.2 mm
1 9.2 mm 26.1 mm
0 7.7 mm 18.9 mm
1 9.1 mm 32.1 mm
0 7.9 mm 15.5 mm
0 5.7 mm 12 mm
0 2.5 mm 13.5 mm
1 15.0 mm 39 mm

Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 79
Example: 1.Visualize the data
Label    Petal Width   Petal Length
0        5.0 mm        9.2 mm
1        9.2 mm        26.1 mm
0        7.7 mm        18.9 mm
1        9.1 mm        32.1 mm
0        7.9 mm        15.5 mm
0        5.7 mm        12 mm
0        2.5 mm        13.5 mm
1        15.0 mm       39 mm

[Scatter plot: Petal Width in mm over Petal Length in mm]

Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 80
Example: 1.Visualize the data
Label    Petal Width   Petal Length
0        5.0 mm        9.2 mm
1        9.2 mm        26.1 mm
0        7.7 mm        18.9 mm
1        9.1 mm        32.1 mm
0        7.9 mm        15.5 mm
0        5.7 mm        12 mm
0        2.5 mm        13.5 mm
1        15.0 mm       39 mm

[Scatter plot: Petal Width in mm over Petal Length in mm; legend: Iris Setosa (0), Iris Versicolor (1)]

Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 81
Example: 2. Find a decision boundary
[Scatter plot: Petal Width in mm over Petal Length in mm, showing the Setosa and Versicolor points]

Note: Setosa Versicolor


Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 82
Example: 2. Find a decision boundary
What decision boundary, i.e. which function, separates the data?
• Linear boundary
• Polynomial boundary
• Gaussian boundary

[Scatter plot: Petal Width in mm over Petal Length in mm]

Note: Setosa Versicolor


Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 83
Example: 2. Find a decision boundary
What decision boundary, i.e. which function, separates the data?
• Linear boundary
• Polynomial boundary
• Gaussian boundary

Answer: Linear Decision Boundary!

[Scatter plot: Petal Width in mm over Petal Length in mm, with a linear decision boundary]

Note: Setosa Versicolor


Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 84
Example: 3. Answer your Question
[Scatter plot: Petal Width in mm over Petal Length in mm, with the linear decision boundary]

Note: Setosa Versicolor


Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 85
Example: 3. Answer your Question
My flower has a Petal Width of 7 mm and a Petal Length of 15 mm.

Is my flower an Iris Setosa or an Iris Versicolor?

[Scatter plot: Petal Width in mm over Petal Length in mm, with the decision boundary and the query point]

Note: Setosa Versicolor


Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 86
Example: 3. Answer your Question
My flower has a Petal Width of 7 mm and a Petal Length of 15 mm.

Is my flower an Iris Setosa or an Iris Versicolor?

→ Iris Setosa

[Scatter plot: Petal Width in mm over Petal Length in mm; the query point lies on the Setosa side of the decision boundary]

Note: Setosa Versicolor | Reason: The point is on the “left side” of the decision boundary
Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 87
Next in this Lecture:
• What is the Mathematical Framework?
• How do we classify using a linear model?

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 88


Thank you for listening!
Machine Learning for Engineers
Logistic Regression

Bilder: TF / Malter
Overview
• The Logistic Model
• Optimization

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 91


Logistic Regression
How do we describe the linear model as a decision boundary?

[Scatter plot: Petal Width in mm over Petal Length in mm, with the linear decision boundary]

Note: Setosa Versicolor


Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 93
Logistic Regression
How do we describe the linear model as a decision boundary?

Thumb rule:
The larger the distance of the input 𝐱 to the decision boundary, the more certain it belongs to either Setosa (left of the line) or Versicolor (right of the line)!

[Scatter plot: Petal Width in mm over Petal Length in mm; points close to the boundary are "uncertain", points far from it are "certain"]

Note: Setosa Versicolor


Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 94
From distance to “probability”
The linear model:

f(𝐱) = ∑_{j=1}^{D} w_j x_j

It calculates a signed distance 1) between the input and the linear model.

→ How do we get a "probability"?

[Scatter plot: Petal Width in mm over Petal Length in mm, with the decision boundary and the signed distance f(x) to it]

Note: Setosa Versicolor | 1) negative (when left of the line), positive (when right of the line)
Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 95
The sigmoid function
The sigmoid (logistic) function maps to the range [0, 1]!

That means the model is now:

𝜇(𝐱, 𝐰) = 1 / (1 + e^(−𝐰ᵀ𝐱))

→ "Only" the probability for the event Versicolor…

[Plot: the sigmoid output 𝜇(𝐱, 𝐰) over the input 𝐰ᵀ𝐱 (range −7.5 to 10), an S-shaped curve from 0 to 1]

Fun fact: The sigmoid function is sometimes lovingly called “squashing function”
Note: Here we already inserted the function f! Homework: What is the general form of the sigmoid function?
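A minimal NumPy sketch of the sigmoid applied to the signed distance 𝐰ᵀ𝐱 (not part of the original slides); the example weights and input are assumptions.

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.5])   # assumed example weights
x = np.array([7.0, 15.0])   # petal width / length of a query flower (mm)
print(sigmoid(w @ x))       # "probability" for class 1 (Versicolor)
```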
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 96
Bernoulli distribution
The Bernoulli distribution can model both events (a yes-or-no event):

p(y | 𝐱, 𝐰) = Ber(y | 𝜇(𝐱, 𝐰))
            = 𝜇(𝐱, 𝐰)ʸ · (1 − 𝜇(𝐱, 𝐰))^(1−y)

[Figures: bar charts of two Bernoulli distributions, e.g. a vote between Mr. A and Mrs. B and a fair coin (Heads/Tails)]

Problem:
How do we get the label, based on the calculated probability?

Note: We use the above distribution for the MLE estimation! Basically we replace “𝑝 𝐬𝑖 𝛉 ” in the log-likelihood with this!
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 97
The Decision Rule
Based on 𝜇(𝐱, 𝐰) we can decide which class is more "likely"!

The Decision Rule is:

y = 1, if 𝜇(𝐱, 𝐰) > 0.5
y = 0, if 𝜇(𝐱, 𝐰) ≤ 0.5

Question:
How do we find the optimal parameters?

[Plot: sigmoid output 𝜇(𝐱, 𝐰) over the input 𝐰ᵀ𝐱, with the 0.5 threshold]
Note: Setosa Versicolor | 1) negative (when left of the line), positive (when right of the line)
Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 98
Overview
• The Logistic Model
• Optimization

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 99


Constructing the Loss
Reminder: The Log-Likelihood
ℓ(𝛉) = ∑_{i=1}^{N} log[p(yᵢ | 𝐱ᵢ, 𝛉)]

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 100


Constructing the Loss
Reminder: The Log-Likelihood
ℓ(𝛉) = ∑_{i=1}^{N} log[p(yᵢ | 𝐱ᵢ, 𝛉)]

Bernoulli Distribution:
p(y | 𝐱, 𝐰) = 𝜇(𝐱, 𝐰)ʸ · (1 − 𝜇(𝐱, 𝐰))^(1−y)

Inserting the Bernoulli distribution: *
ℓ(𝛉) = ∑_{i=1}^{N} log[ 𝜇(𝐱ᵢ, 𝐰)^(yᵢ) · (1 − 𝜇(𝐱ᵢ, 𝐰))^(1−yᵢ) ]

* We just insert the Probability from Slide 99


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 101
Constructing the Loss
Reminder: The Log-Likelihood
ℓ(𝛉) = ∑_{i=1}^{N} log[p(yᵢ | 𝐱ᵢ, 𝛉)]

Bernoulli Distribution:
p(y | 𝐱, 𝐰) = 𝜇(𝐱, 𝐰)ʸ · (1 − 𝜇(𝐱, 𝐰))^(1−y)

Inserting the Bernoulli distribution: *
ℓ(𝛉) = ∑_{i=1}^{N} log[ 𝜇(𝐱ᵢ, 𝐰)^(yᵢ) · (1 − 𝜇(𝐱ᵢ, 𝐰))^(1−yᵢ) ]

     = ∑_{i=1}^{N} [ yᵢ log 𝜇(𝐱ᵢ, 𝐰) + (1 − yᵢ) log(1 − 𝜇(𝐱ᵢ, 𝐰)) ]

* We just insert the Probability from Slide 99


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 102
The Cross Entropy Loss
Instead of Maximizing the Log-Likelihood, we minimize the Negative Log-
Likelihood!

This Loss is called the Cross Entropy:


L(𝛉) = − ∑_{i=1}^{N} [ yᵢ log 𝜇(𝐱ᵢ, 𝐰) + (1 − yᵢ) log(1 − 𝜇(𝐱ᵢ, 𝐰)) ]

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 103


The Cross Entropy Loss
Instead of Maximizing the Log-Likelihood, we minimize the Negative Log-
Likelihood!

This Loss is called the Cross Entropy:


L(𝛉) = − ∑_{i=1}^{N} [ yᵢ log 𝜇(𝐱ᵢ, 𝐰) + (1 − yᵢ) log(1 − 𝜇(𝐱ᵢ, 𝐰)) ]

• Unique minimum
• No analytical solution possible!
→ Optimization with Gradient descent.
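A minimal NumPy sketch of this loss (not part of the original slides); X is an assumed N×D feature matrix, y a vector of 0/1 labels and w the weight vector.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, X, y, eps=1e-12):
    mu = sigmoid(X @ w)                 # predicted "probabilities"
    mu = np.clip(mu, eps, 1 - eps)      # avoid log(0)
    return -np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))
```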

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 104


Optimization Method: Gradient Descent
Observation:
The loss is like a
mountainous landscape!

Idea:
We find the minimum, by “walking
down” the slope of the mountain 𝐿(𝛉)
𝑤0
𝑤1

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 105


Optimization Method: Gradient Descent
Approach:
1. Start with random weights
2. Calculate: The direction of
steepest descend
3. Step in that direction
4. Repeat from step 2
𝐿(𝛉)
Result:
𝑤0
Some local minimum is reached! 𝑤1

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 106


Example: Gradient descent
Algorithm:
1 Repeat for each iteration 𝑖:
𝐿(𝜃𝑖 )
2 Calculate loss 𝐿(𝜃𝑖 )

Loss 𝐿(𝜃)
𝜃𝑖
Parameter 𝜃

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 107


Example: Gradient descent
Algorithm:
1 Repeat for each iteration 𝑖:
𝐿(𝜃𝑖 )
2 Calculate loss 𝐿(𝜃𝑖 )
3 Calculate gradient ∇𝐿(𝜃𝑖 )

Loss 𝐿(𝜃)
𝜃𝑖
Parameter 𝜃

Note: The gradient in matrix-vector form is: 𝐗 𝑇 (𝛍 − 𝐲)


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 108
Example: Gradient descent
Algorithm:
1 Repeat for each iteration 𝑖:
𝐿(𝜃𝑖 )
2 Calculate loss 𝐿(𝜃𝑖 )
3 Calculate gradient ∇𝐿(𝜃𝑖 )

Loss 𝐿(𝜃)
The Gradient for Logistic Regression:

∇L(𝛉) = ∑ᵢ (𝜇(𝐱ᵢ, 𝐰) − yᵢ) · 𝐱ᵢ
𝜃𝑖
Parameter 𝜃

Note: The gradient in matrix-vector form is: 𝐗 𝑇 (𝛍 − 𝐲)


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 109
Example: Gradient descent
Algorithm:
1 Repeat for each iteration 𝑖:
𝐿(𝜃𝑖 )
2 Calculate loss 𝐿(𝜃𝑖 )
3 Calculate gradient ∇𝐿(𝜃𝑖 )

Loss 𝐿(𝜃)
4 𝜃𝑖+1 = 𝜃𝑖 − 𝜂 ⋅ ∇𝐿(𝜃𝑖 )

𝜃𝑖 𝜃𝑖+1
Parameter 𝜃

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 110


Example: Gradient descent
Algorithm:
1 Repeat for each iteration 𝑖:
𝐿(𝜃𝑖 )
2 Calculate loss 𝐿(𝜃𝑖 )
3 Calculate gradient ∇𝐿(𝜃𝑖 )

Loss 𝐿(𝜃)
4 𝜃𝑖+1 = 𝜃𝑖 − 𝜂 ⋅ ∇𝐿(𝜃𝑖 )
𝜂 is called the learning rate
• If 𝜂 is too large
→ Overshooting the minimum
𝜃𝑖 𝜃𝑖+1
• If 𝜂 is too small Parameter 𝜃
→ Minimum is not reached
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 111


Example: Gradient descent
Algorithm:
1 Repeat for each iteration 𝑖:
𝐿(𝜃𝑖 )
2 Calculate loss 𝐿(𝜃𝑖 )
3 Calculate gradient ∇𝐿(𝜃𝑖 )

Loss 𝐿(𝜃)
4 𝜃𝑖+1 = 𝜃𝑖 − 𝜂 ⋅ ∇𝐿(𝜃𝑖 ) 𝐿(𝜃𝑖+1 )

5 𝑖 =𝑖+1

𝜃𝑖 𝜃𝑖+1
Parameter 𝜃

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 113


Example: Gradient descent
Algorithm:
1 Repeat for each iteration 𝑖:
𝐿(𝜃𝑖 )
2 Calculate loss 𝐿(𝜃𝑖 )
3 Calculate gradient ∇𝐿(𝜃𝑖 )

Loss 𝐿(𝜃)
4 𝜃𝑖+1 = 𝜃𝑖 − 𝜂 ⋅ ∇𝐿(𝜃𝑖 ) 𝐿(𝜃𝑖+1 )

5 𝑖 =𝑖+1
𝐿(𝜃𝑖+2 )

𝜃𝑖 𝜃𝑖+1 𝜃𝑖+2
Parameter 𝜃

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 114


Example: Gradient descent
Algorithm:
1 Repeat for each iteration 𝑖:
𝐿(𝜃𝑖 )
2 Calculate loss 𝐿(𝜃𝑖 )
3 Calculate gradient ∇𝐿(𝜃𝑖 )

Loss 𝐿(𝜃)
4 𝜃𝑖+1 = 𝜃𝑖 − 𝜂 ⋅ ∇𝐿(𝜃𝑖 ) 𝐿(𝜃𝑖+1 )

5 𝑖 =𝑖+1
𝐿(𝜃𝑖+2 )
𝐿(𝜃𝑖+3 )
Minimum

𝜃𝑖 𝜃𝑖+1 𝜃𝑖+2 𝜃𝑖+3


Parameter 𝜃

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 115


Example: Gradient descent
Algorithm:
1 Repeat for each iteration 𝑖:
𝐿(𝜃𝑖 )
2 Calculate loss 𝐿(𝜃𝑖 )
3 Calculate gradient ∇𝐿(𝜃𝑖 )

Loss 𝐿(𝜃)
4 𝜃𝑖+1 = 𝜃𝑖 − 𝜂 ⋅ ∇𝐿(𝜃𝑖 ) 𝐿(𝜃𝑖+1 )

5 𝑖 =𝑖+1
𝐿(𝜃𝑖+2 )
𝐿(𝜃𝑖+3 )
Minimum

Repeat the process until the loss is minimal! 𝜃𝑖 𝜃𝑖+1 𝜃𝑖+2 𝜃𝑖+3
Parameter 𝜃
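A minimal NumPy sketch of the full loop (not part of the original slides), using the gradient ∑ᵢ(𝜇(𝐱ᵢ, 𝐰) − yᵢ)·𝐱ᵢ shown earlier; the toy data, learning rate and iteration count are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iter=5000):
    w = np.zeros(X.shape[1])        # the slides start from random weights; zeros also work here
    for _ in range(n_iter):
        mu = sigmoid(X @ w)         # predicted "probabilities"
        grad = X.T @ (mu - y)       # gradient of the cross entropy loss
        w -= lr * grad              # step against the gradient direction
    return w

# Tiny assumed data set (two features, two well-separated classes), plus a bias feature.
X_raw = np.array([[0.5, 0.9], [0.7, 1.2], [1.5, 3.5], [1.7, 3.9]])
y = np.array([0, 0, 1, 1])
X = np.hstack([X_raw, np.ones((len(X_raw), 1))])   # append constant 1 as bias term

w = fit_logistic_regression(X, y)
print((sigmoid(X @ w) > 0.5).astype(int))          # predicted labels -> [0 0 1 1]
```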

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 116


General weaknesses of Gradient Descent

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 117


General weaknesses of Gradient Descent
Loss
• Multiple local minima are common

Minimum

Parameter
Minimum

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 118


General weaknesses of Gradient Descent
Loss
• Multiple local minima are common
• Which minimum is reached depends heavily on the random initialization

Minimum

Parameter
Minimum

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 119


General weaknesses of Gradient Descent
• Multiple local minima are common
• Which minimum is reached depends heavily on the random initialization
• Success depends on learning rate 𝜂

https://de.serlo.org/mathe/funktionen/wichtige-funktionstypen-ihre-eigenschaften/polynomfunktionen-beliebigen-grades/polynom
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 120
Thank you for listening!
Machine Learning for Engineers
Overfitting and Underfitting

Bilder: TF / Malter
Overfitting and Underfitting
Optimal function vs.
Estimated function
For 𝑀 = 0 and 𝑀 = 1 the function fails to
model the data, as the chosen model
is too simple (Underfitting)

Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p.19


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 123
Overfitting and Underfitting
Optimal function vs.
Estimated function
For 𝑀 = 0 and 𝑀 = 1 the function fails to
model the data, as the chosen model
is too simple (Underfitting)

For 𝑀 = 9 the function exactly models the


given training data, as the chosen model
Is too complex (Overfitting)

Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p.19


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 124
Overfitting and Underfitting
Optimal function vs.
Estimated function
For 𝑀 = 0 and 𝑀 = 1 the function fails to
model the data, as the chosen model
is too simple (Underfitting)

For 𝑀 = 9 the function exactly models the


given training data, as the chosen model
Is too complex (Overfitting)

For 𝑀 = 3 the function closely matches


the expected function
Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p.19
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 125
Overfitting and Underfitting
UNDERFITTING:
• Error on the training data is very high
• Testing error is high
• Examples are M = 0 and M = 1

OVERFITTING:
• Error on the training data is very low
• Testing error is high
• Example is M = 9

Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p.19


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 126
Complexity vs. Generalization Error
Plotting over all complexities typically reveals a "sweet spot" (i.e. an ideal complexity).

The prediction error is affected by two kinds of errors:
• Bias error
• Variance error

[Plot: prediction error over model complexity for the training set and the testing set; underfitting on the left, overfitting on the right, with the generalization error gap and the ideal complexity at the minimum testing loss]
Note: Complexity does not mean parameters! It means the mathematical complexity! i.e. the ability of the model to capture a relationship!
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 127
Complexity vs. Generalization Error
Bias:  High Bias Low Bias →
Error induced by simplifying

Prediction Error
assumptions of the model

Complexity

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 128


Complexity vs. Generalization Error
Bias:  High Bias Low Bias →
Error induced by simplifying  Low Variance High Variance →

Prediction Error
assumptions of the model

Variance:
Error induced by differing
training data i.e. how sensitive
is the model to “noise”

Complexity

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 129


Complexity vs. Generalization Error
Bias:  High Bias Low Bias →
Error induced by simplifying  Low Variance High Variance →

Prediction Error
assumptions of the model

Variance:
Error induced by differing
training data i.e. how sensitive
is the model to “noise”

Both errors oppose each other! Complexity


When reducing bias, you increase variance!
(and vice versa)
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 130
Complexity vs. Generalization Error
The sweet spot is located at  High Bias Low Bias →
the minimum of the test curve  Low Variance High Variance →

Prediction Error
It has both a low bias and a
low variance!

This spot is – usually – found


using hyperparameter search
Ideal Complexity
Complexity
(Minimum
Testing Loss)
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 131
Hyperparameter search
1)
Approach: Calculate the prediction error on the validation set and select
the hyperparameters with the smallest loss

Example for hyperparameters:


• The degree of a polynomial
• Learning rate 𝜂
• Different Basis Functions (see BFE)
• In Neural Networks: the number of neurons

→ There exist multiple methods to solve this!

1) We use this instead of the test set. This is important! You only touch the test set for the final evaluation!
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 132
Hyperparameter search methods
There exist multiple ways to find good hyperparameters…
… manual search (Trial-and-Error)
… random search
… grid search
… Bayesian methods
… etc.

→ In practice you typically use Trial-and-Error & Grid-Search

Note: Basically every known optimization technique can be used. Examples: Particle Swarm Optimization, Genetic Algorithms, Ant Colony Optimization…
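A minimal sketch of a grid search, assuming the scikit-learn library (GridSearchCV, SVC) and its bundled Iris toy dataset; the parameter grid values are arbitrary placeholders.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_train, y_train = load_iris(return_X_y=True)   # stands in for your training split

param_grid = {
    "C": [0.1, 1.0, 10.0],        # regularization strength
    "gamma": [0.01, 0.1, 1.0],    # RBF kernel length-scale parameter
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)  # 5-fold cross validation
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```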
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 133
Hyperparameter search on the data splits
Typically you split the data into a training and testing split

Hyperparameter search is forbidden on the testing split!

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 134


Hyperparameter search on the data splits
Typically you split the data into a training and testing split

Hyperparameter search is forbidden on the testing split!

Solution:
Splitting the training data further into a train split and validation split

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 135


Hyperparameter search on the data splits
Typically you split the data into a training and testing split

Hyperparameter search is forbidden on the testing split!

Solution:
Splitting the training data further into a train split and validation split

If the dataset is large “enough” →

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 136


Hyperparameter search on the data splits
Typically you split the data into a training and testing split

Hyperparameter search is forbidden on the testing split!

Solution:
Splitting the training data further into a train split and validation split

If the dataset is large “enough” →


If the dataset is small … →

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 137


Cross Validation
Problem of the Static Split:
The data1) is split into 80% train set and 20% validation set

If you are “unlucky”: The validation split is not representative!


i.e. your prediction error is a bad estimate for the
generalization performance!

→ Solution: Cross Validation!

1) We assume the test data is already split and put away. A typical split of all the data is 80% train – 10% validation - 10% testing
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 138
K-Fold Cross Validation
Approach: Training data (90%) Test 10%
Split the training data into 𝑘-folds
(in our example 𝑘 = 5)
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 139


K-Fold Cross Validation
Approach: Training data (90%) Test 10%
Split the training data into 𝑘-folds
(in our example 𝑘 = 5)
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

For each split 𝑖:


1. Train on all splits, Valid. Training data 𝑖=1
except the 𝑖-th Fold
2. Evaluate on the 𝑖-th Fold
Finally: Average over all results

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 140


K-Fold Cross Validation
Approach: Training data (90%) Test 10%
Split the training data into 𝑘-folds
(in our example 𝑘 = 5)
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

For each split 𝑖:


1. Train on all splits, Valid. Training data 𝑖=1
except the 𝑖-th Fold Valid. Training data 𝑖=2
2. Evaluate on the 𝑖-th Fold
Finally: Average over all results

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 141


K-Fold Cross Validation
Approach: Training data (90%) Test 10%
Split the training data into 𝑘-folds
(in our example 𝑘 = 5)
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

For each split 𝑖:


1. Train on all splits, Valid. Training data 𝑖 =1
except the 𝑖-th Fold Valid. Training data 𝑖 =2
2. Evaluate on the 𝑖-th Fold Training data Valid. Training data 𝑖 =3
Finally: Average over all results Training data Valid. 𝑖 =4
Training data Valid. 𝑖 =5

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 142


Cross Validation
K-Fold Cross validation is the typical form of cross validation

When k = N, i.e. each fold contains exactly one sample, the process is called Leave-One-Out Cross Validation (LOOCV)

There exist other variants as well, like:


• Grouped Cross Validation
• Nested Cross Validation
• Stratified Cross Validation

→ Each has its unique use-case!
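A minimal NumPy sketch of k-fold cross validation (not part of the original slides); train_and_score is an assumed user-supplied callback that trains a model on the training folds and returns its validation score.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    # shuffle the sample indices and split them into k folds
    indices = np.random.default_rng(seed).permutation(n_samples)
    return np.array_split(indices, k)

def cross_validate(train_and_score, X, y, k=5):
    folds = k_fold_indices(len(X), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]                                            # i-th fold for validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_score(X[train_idx], y[train_idx], X[val_idx], y[val_idx]))
    return np.mean(scores)  # average over all folds
```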

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 143


Thank you for listening!
Machine Learning for Engineers
Support Vector Machines – Problem Statement

Bilder: TF / Malter
Recap: Logistic Regression
Goal: Find a linear decision Classification
boundary separating the two classes
Label 𝒚 = Dog
Problem: Multiple linear decision

Feature 𝑥1
boundaries solve the problem with
equal accuracy
Question: Which one is the best?
Label 𝒚 = Cat D
A B C D
Feature 𝑥0
A B C

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 2


Intuition behind Support Vector Machines
A is closer to Dog at the top and to Classification
Cat at the bottom
Label 𝒚 = Dog
B is far away from both Dog and Cat

Feature 𝑥1
throughout the boundary
C always is closer to Dog than Cat
D almost touches Dog at the bottom
and Cat at the top Label 𝒚 = Cat D
Feature 𝑥0
A B C

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 3


Intuition behind Support Vector Machines
B is far away from both Dog and Cat Classification
throughout the boundary
Label 𝒚 = Dog
To put this in more mathematical

Feature 𝑥1
terms, we say it has the largest
margin 𝑚 𝑚
𝑚
Updated Goal: Find linear decision
boundary with the largest margin. Label 𝒚 = Cat
Feature 𝑥0
B

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 4


Mathematical Model for Decision Boundary
Mathematically, any hyperplane is
given by the equation
𝒘 ⋅ 𝒙 − b = 0

𝒘 is the normal vector
𝒙 is any arbitrary feature vector
b/‖𝒘‖ is the distance to the origin

[Figure: hyperplane in the feature space (x₀, x₁) with normal vector 𝒘 and distance b/‖𝒘‖ to the origin]

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 5


Mathematical Model for Margins
The decision boundary is given by
𝒘 ⋅ 𝒙 − b = 0

Furthermore, by definition, the margin boundaries are given by
𝒘 ⋅ 𝒙 − b = +1
𝒘 ⋅ 𝒙 − b = −1

2/‖𝒘‖ is thus the distance between the two margins → maximize it

[Figure: decision boundary with the two margin boundaries at distance 2/‖𝒘‖]
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 6
Mathematical Model for Classes
Two-class problem: y₁, …, yₙ = ±1

All training data 𝒙₁, …, 𝒙ₙ needs to be correctly classified outside the margin:
𝒘 ⋅ 𝒙ᵢ − b ≥ +1  if yᵢ = +1
𝒘 ⋅ 𝒙ᵢ − b ≤ −1  if yᵢ = −1

Due to the label choice, this can be simplified as
yᵢ(𝒘 ⋅ 𝒙ᵢ − b) ≥ 1

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 7


Optimization of Support Vector Machines
This leads us to a constrained optimization problem.

1. Maximizing the margin 2/‖𝒘‖, which is equal to minimizing ½‖𝒘‖²:

   min_{𝒘,b} ½‖𝒘‖²

2. Subject to no misclassification on the training data:

   yᵢ(𝒘 ⋅ 𝒙ᵢ − b) ≥ 1

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 8


Machine Learning for Engineers
Support Vector Machines – Optimization

Bilder: TF / Malter
Optimization of Support Vector Machines
In the previous section, we derived the constrained optimization problem
for Support Vector Machines (SVMs):
min_{𝒘,b} ½‖𝒘‖²

s.t. yᵢ(𝒘 ⋅ 𝒙ᵢ − b) ≥ 1

This is a quadratic programming problem, minimizing a quadratic function


subject to some inequality constraints.

⮩ We can thus introduce Lagrange multipliers

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 10


Introduction of Lagrange Multipliers
For each of the n inequality constraints, introduce a Lagrange multiplier 𝛼ᵢ: 1)

ℒ(𝒘, b, 𝜶) = ½‖𝒘‖² − ∑_{i=1}^{n} 𝛼ᵢ [yᵢ(𝒘 ⋅ 𝒙ᵢ − b) − 1]

The Lagrange multipliers need to be maximized, resulting in the following derived optimization problem:

min_{𝒘,b} max_{𝜶} ℒ(𝒘, b, 𝜶)

1) While not relevant for the exam, a more detailed derivation is available in C. Bishop, Pattern Recognition and Machine Learning. New York:
Springer, 2006, Appendix E.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 11
Primal versus Dual Formulation
This optimization problem is called the primal formulation, with solution 𝑝∗ :
p* = min_{𝒘,b} max_{𝜶} ℒ(𝒘, b, 𝜶)

Alternatively, there’s also the dual formulation, with solution 𝑑 ∗ :


d* = max_{𝜶} min_{𝒘,b} ℒ(𝒘, b, 𝜶)

Since Slater’s condition holds for this convex optimization problem, we can
guarantee that 𝑝∗ = 𝑑 ∗ and solve the dual problem instead.

⮩ We can thus first solve the minimization problem for 𝒘 and 𝑏

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 12


Solution using Partial Derivatives
We now want to minimize the following function for 𝒘 and 𝑏:
ℒ(𝒘, b, 𝜶) = ½‖𝒘‖² − ∑_{i=1}^{n} 𝛼ᵢ [yᵢ(𝒘 ⋅ 𝒙ᵢ − b) − 1]

At the minimum, we know that the derivatives ∂ℒ/∂𝒘 and ∂ℒ/∂b need to be zero:

∂ℒ/∂𝒘 = 𝒘 − ∑_{i=1}^{n} 𝛼ᵢ yᵢ 𝒙ᵢ = 0   ⇒   𝒘 = ∑_{i=1}^{n} 𝛼ᵢ yᵢ 𝒙ᵢ

∂ℒ/∂b = − ∑_{i=1}^{n} 𝛼ᵢ yᵢ = 0   ⇒   ∑_{i=1}^{n} 𝛼ᵢ yᵢ = 0

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 13


Solution using Partial Derivatives
We then eliminate 𝒘 and 𝑏 by inserting both equations into the function:
ℒ(𝒘, b, 𝜶) = ∑_{i=1}^{n} 𝛼ᵢ − ½ ∑_{i,j=1}^{n} yᵢ yⱼ 𝛼ᵢ 𝛼ⱼ (𝒙ᵢ ⋅ 𝒙ⱼ)

This directly leads us to the remaining maximization problem for 𝜶:

max_{𝜶} ℒ(𝒘, b, 𝜶)
s.t. 𝛼ᵢ ≥ 0,  ∑_{i=1}^{n} 𝛼ᵢ yᵢ = 0
This quadratic programming problem can be solved using sequential minimal optimization, which is not a topic of this lecture.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 14
The Karush–Kuhn–Tucker Conditions
The given problem fulfills the Karush-Kuhn-Tucker conditions1):
𝛼ᵢ ≥ 0
yᵢ(𝒘 ⋅ 𝒙ᵢ − b) − 1 ≥ 0
𝛼ᵢ [yᵢ(𝒘 ⋅ 𝒙ᵢ − b) − 1] = 0

Based on the last condition, we can derive that either 𝛼ᵢ = 0 or yᵢ(𝒘 ⋅ 𝒙ᵢ − b) = 1 for each of the training points.

💡 When 𝛼𝑖 is non-zero, the training point is on the margin and thus a so-
called “support vector”.
1) While not relevant for the exam, a more detailed derivation is available in C. Bishop, Pattern Recognition and Machine Learning. New York:
Springer, 2006, Appendix E.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 15
Optimization Summary
The dual formulation leads to a quadratic programming problem on 𝜶:1)
max_{𝜶} ℒ(𝒘, b, 𝜶) = ∑_{i=1}^{n} 𝛼ᵢ − ½ ∑_{i,j=1}^{n} yᵢ yⱼ 𝛼ᵢ 𝛼ⱼ (𝒙ᵢ ⋅ 𝒙ⱼ)

The solution 𝜶* can then be used to compute 𝒘 and make predictions on previously unseen data points 𝒙: 2)

y = sgn(𝒘 ⋅ 𝒙 − b) = sgn( ∑_{i=1}^{n} 𝛼ᵢ* yᵢ (𝒙ᵢ ⋅ 𝒙) )

1) This is simplified for the purposes of this summary. For the additional constraints that are also required refer to slide 14
2) Remember the equation for 𝒘 resulting from the derivative in slide 13
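A minimal sketch of fitting an SVM in practice, assuming the scikit-learn library; the toy points and the very large C (to approximate a hard margin) are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.5, 0.5], [3.0, 3.0], [3.5, 2.5]])
y = np.array([-1, -1, +1, +1])

clf = SVC(kernel="linear", C=1e6)   # large C approximates the hard margin
clf.fit(X, y)
print(clf.support_vectors_)          # the training points with non-zero alpha
print(clf.predict([[1.0, 1.0], [3.0, 2.0]]))
```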
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 16
Machine Learning for Engineers
Support Vector Machines – Non-Linearity and the Kernel Trick

Bilder: TF / Malter
Recap: Basis Functions
For linear regression we transform the input space using a polynomial basis function 𝚽:

𝚽(x) = [1, x, x², x³, x⁴, x⁵, x⁶]ᵀ

This can lead to
• Computational issues depending on the number of points
• Memory issues depending on the dimensionality of the output

[Figure: House Prices in Springfield (USA) – house prices in Mio. € over the years 1900–2020]
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 18
Introduction of Basis Functions
We’ve previously derived the optimization and prediction functions for SVM.
Next, we can also introduce a basis function 𝚽(𝒙) to allow for non-linearity.
ℒ(𝒘, b, 𝜶) = ∑_{i=1}^{n} 𝛼ᵢ − ½ ∑_{i,j=1}^{n} yᵢ yⱼ 𝛼ᵢ 𝛼ⱼ (𝚽(𝒙ᵢ) ⋅ 𝚽(𝒙ⱼ))

y = sgn(𝒘 ⋅ 𝒙 − b) = sgn( ∑_{i=1}^{n} 𝛼ᵢ* yᵢ (𝚽(𝒙ᵢ) ⋅ 𝚽(𝒙)) )

💡 Note how the basis function is applied at each of the dot products.

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 19


Solving Computational Issues with Sparsity
There are 𝑛2 summands for the optimization and 𝑛 for the prediction, that
is, both functions naively scale with the number of points 𝑛.
ℒ(𝒘, b, 𝜶) = ∑_{i=1}^{n} 𝛼ᵢ − ½ ∑_{i,j=1}^{n} yᵢ yⱼ 𝛼ᵢ 𝛼ⱼ (𝚽(𝒙ᵢ) ⋅ 𝚽(𝒙ⱼ))

y = sgn(𝒘 ⋅ 𝒙 − b) = sgn( ∑_{i=1}^{n} 𝛼ᵢ* yᵢ (𝚽(𝒙ᵢ) ⋅ 𝚽(𝒙)) )

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 20


Solving Computational Issues with Sparsity
However, from the Karush-Kuhn-Tucker conditions we know that 𝛼𝑖 is non-
zero only for the few support vectors.
ℒ(𝒘, b, 𝜶) = ∑_{i=1}^{n} 𝛼ᵢ − ½ ∑_{i,j=1}^{n} yᵢ yⱼ 𝛼ᵢ 𝛼ⱼ (𝚽(𝒙ᵢ) ⋅ 𝚽(𝒙ⱼ))

y = sgn(𝒘 ⋅ 𝒙 − b) = sgn( ∑_{i=1}^{n} 𝛼ᵢ* yᵢ (𝚽(𝒙ᵢ) ⋅ 𝚽(𝒙)) )

Most summands are thus zero. This property is called sparsity and greatly
simplifies the computations.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 21
Solving Memory Issues with Kernel Trick
Since the basis function 𝚽: ℝ𝑛 ↦ ℝ𝑚 is applied explicitly on each of the
points, the memory scales with the output dimensionality m.
ℒ(𝒘, b, 𝜶) = ∑_{i=1}^{n} 𝛼ᵢ − ½ ∑_{i,j=1}^{n} yᵢ yⱼ 𝛼ᵢ 𝛼ⱼ (𝚽(𝒙ᵢ) ⋅ 𝚽(𝒙ⱼ))

y = sgn(𝒘 ⋅ 𝒙 − b) = sgn( ∑_{i=1}^{n} 𝛼ᵢ* yᵢ (𝚽(𝒙ᵢ) ⋅ 𝚽(𝒙)) )

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 22


Solving Memory Issues with Kernel Trick
However, instead of explicitly computing 𝚽(𝒙𝑖 ) ⋅ 𝚽(𝒙𝑗 ), we can instead
replace it with a kernel function 𝐾(𝒙𝑖 , 𝒙𝑗 ).
ℒ(𝒘, b, 𝜶) = ∑_{i=1}^{n} 𝛼ᵢ − ½ ∑_{i,j=1}^{n} yᵢ yⱼ 𝛼ᵢ 𝛼ⱼ K(𝒙ᵢ, 𝒙ⱼ)

y = sgn(𝒘 ⋅ 𝒙 − b) = sgn( ∑_{i=1}^{n} 𝛼ᵢ* yᵢ K(𝒙ᵢ, 𝒙) )

The kernel function usually doesn’t require explicit computation of the basis
function. This method of replacement is called the kernel trick.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 23
The Linear Kernel
The kernel function is given by:
K(𝒙ᵢ, 𝒙ⱼ) = 1/(2𝜎²) · (𝒙ᵢ ⋅ 𝒙ⱼ)
• 𝜎 is a length-scale parameter

• Feature space mapping is basically


just the identity function

• Only useful for linearly separable classification

Image from https://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 24


The Radial Basis Function Kernel
The kernel function is given by:
K(𝒙ᵢ, 𝒙ⱼ) = exp( −‖𝒙ᵢ − 𝒙ⱼ‖² / (2𝜎²) )
• 𝜎 is a length-scale parameter

• Useful for non-linear classification


with clusters

• Probably the most popular kernel


Image from https://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 25


The Polynomial Kernel
The kernel function is given by:
K(𝒙ᵢ, 𝒙ⱼ) = ( 1/(2𝜎²) · (𝒙ᵢ ⋅ 𝒙ⱼ) + r )^d

• 𝜎 is a length-scale parameter

• 𝑟 is a free parameter for the trade-off


between lower- and higher-order

• d is the degree of the polynomial, a typical choice is d = 2

Image from https://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html
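A minimal NumPy sketch of the RBF and polynomial kernel functions above (not part of the original slides); the example vectors and parameter values are arbitrary.

```python
import numpy as np

def rbf_kernel(x_i, x_j, sigma=1.0):
    # K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
    return np.exp(-np.sum((x_i - x_j) ** 2) / (2 * sigma ** 2))

def polynomial_kernel(x_i, x_j, sigma=1.0, r=1.0, d=2):
    # K(x_i, x_j) = (x_i . x_j / (2 sigma^2) + r)^d
    return (np.dot(x_i, x_j) / (2 * sigma ** 2) + r) ** d

x1, x2 = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(rbf_kernel(x1, x2), polynomial_kernel(x1, x2))
```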

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 26


Machine Learning for Engineers
Support Vector Machines – Hard Margin and Soft Margin

Bilder: TF / Malter
Hard Margin and Soft Margin
Up until now we have always implicitly
𝑦 = +1
assumed classes are linearly separable
2
⮩ So-called hard margin 𝒘
𝒘

Feature 𝑥1
What happens when the classes 𝜉2
overlap and there is no solution? 𝜉1
𝑦 = −1
⮩ So-called soft margin 𝑏
𝒘
Introduce slack variables 𝜉1 , 𝜉2 that Feature 𝑥0
move points onto the margin

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 28


Soft Margin – Slack Variables
Slack variable 𝜉𝑖 is the distance
𝑦 = +1
required to correctly classify point 𝒙𝑖
2
If correctly classified and outside margin 𝒘
𝒘

Feature 𝑥1
• We know 𝑦𝑖 𝒘 ⋅ 𝒙𝑖 − 𝑏 ≥ 1 👍
𝜉1
• No correction, thus 𝜉𝑖 = 0 𝜉2
𝑦 = −1
𝑏
If incorrectly classified or inside margin
𝒘
• We know 𝑦𝑖 𝒘 ⋅ 𝒙𝑖 − 𝑏 < 1 👎 Feature 𝑥0
• Move by 𝜉𝑖 = 1 − 𝑦𝑖 (𝒘 ⋅ 𝒙𝑖 − 𝑏)

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 29


Updated Optimization Problem
We can update the original optimization problem with the slack variables 𝝃,
which leads us to the following constrained optimization problem:
min_{𝒘,b,𝝃} ½‖𝒘‖² + 𝜆 · (1/n) ∑_{i=1}^{n} 𝜉ᵢ

s.t. yᵢ(𝒘 ⋅ 𝒙ᵢ − b) ≥ 1 − 𝜉ᵢ

The optimization procedure and kernel trick can similarly be derived for
this, but we’ll spare you the details 😉.

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 30


Machine Learning for Engineers
Support Vector Machines – Regression

Bilder: TF / Malter
Intuition behind Support Vector Regression
The Support Vector Machine is a Regression
method for classification.
Let us now turn to regression. 𝜖

Label 𝑦
With training data 𝒙1 , 𝑦1 , … , (𝒙𝑛 , 𝑦𝑛 ),
we want to find a function of form:
𝜖
𝑦 =𝒘⋅𝒙+𝑏
For Support Vector Regression, we
Feature 𝑥
want to keep everything in a 𝜖-tube

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 32


Keeping Everything inside The Tube
Thus, for each point (𝒙𝑖 , 𝑦𝑖 ), the Regression
following conditions need to hold:
𝑦𝑖 ≤ 𝒘 ⋅ 𝒙𝑖 + 𝑏 + 𝜖 𝜖

Label 𝑦
𝑦𝑖 ≥ 𝒘 ⋅ 𝒙𝑖 + 𝑏 − 𝜖 𝜉+ 𝜉−
For outliers, we introduce slack 𝜖
variables 𝜉 + and 𝜉 − :
𝑦𝑖 ≤ 𝒘 ⋅ 𝒙𝑖 + 𝑏 + 𝜖 + 𝜉𝑖+
Feature 𝑥
𝑦𝑖 ≥ 𝒘 ⋅ 𝒙𝑖 + 𝑏 − 𝜖 − 𝜉𝑖−

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 33


Optimization of Support Vector Regression
This leads us to another constrained Regression
optimization problem.
1. Minimizing ½‖𝒘‖² and the slack variables 𝜉ᵢ⁺ and 𝜉ᵢ⁻:

   min_{𝒘,b} ½‖𝒘‖² + C ∑_{i=1}^{n} (𝜉ᵢ⁺ + 𝜉ᵢ⁻)

   C is a free parameter specifying the trade-off for outliers

[Figure: regression line with the 𝜖-tube and slack variables 𝜉⁺, 𝜉⁻ over feature x and label y]

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 34


Optimization of Support Vector Regression
This leads us to another constrained Regression
optimization problem.
𝜖
2. Subject to all yᵢ being inside the tube specified by 𝜖:
   yᵢ ≤ 𝒘 ⋅ 𝒙ᵢ + b + 𝜖 + 𝜉ᵢ⁺
   yᵢ ≥ 𝒘 ⋅ 𝒙ᵢ + b − 𝜖 − 𝜉ᵢ⁻
   𝜉ᵢ⁺ ≥ 0
   𝜉ᵢ⁻ ≥ 0
Feature 𝑥

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 35


Pros and Cons of Support Vector Regression
Advantages Regression
• Less susceptible to outliers when
𝜖
compared to linear regression

Label 𝑦
𝜉+ 𝜉−
Disadvantages

• Requires careful tuning of the free 𝜖


parameters 𝜖 and 𝐶

• Doesn’t scale to large data sets Feature 𝑥
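A minimal sketch of Support Vector Regression, assuming the scikit-learn library; the synthetic sine data and the chosen C and 𝜖 values are assumptions.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(40, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=40)

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)  # epsilon sets the tube width, C the outlier trade-off
svr.fit(X, y)
y_pred = svr.predict(X)
```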

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 36


Machine Learning for Engineers
Support Vector Machines – Summary

Bilder: TF / Malter
Summary of Support Vector Machines
Support Vector Machine Advantages

• Elegant to optimize, since they have one local and global optimum

• Efficiently scales to high-dimensional features due to the sparsity and


the kernel trick

Support Vector Machine Disadvantages

• Difficult to choose good combination of kernel function and other free


parameters, such as regularization coefficient 𝑪

• Does not scale to large amount of training data


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 38
References
1. D. Sontag. (2016). Introduction To Machine Learning: Support Vector Machines [Online].
Available: https://people.csail.mit.edu/dsontag/courses/ml16/
2. A. Zisserman. (2015). Lecture 2: The SVM classifier [Online]. Available:
https://www.robots.ox.ac.uk/~az/lectures/ml/
3. K. Murphy, Machine Learning: A Probabilistic Perspective. Cambridge, MA, USA: MIT Press,
2012.
4. B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press, 2018.
5. A. Kowalczyk, Support Vector Machines Succintly. Morrisville, NC, USA: Syncfusion, 2017.

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 39


Machine Learning for Engineers
Principal Component Analysis – Applications

Bilder: TF / Malter
Application: Image Compression
We can save data more efficiently
by applying PCA!

The MaD logo (266 × 266 × 3 px) can be compressed from 212 kB to
• 32 components : 8 kB
• 16 components : 4 kB
• 8 components : 2 kB

Can be used for other data as well!

[Figure: the original logo next to reconstructions from 32 components (97.27%), 16 components (93.62%) and 8 components (85.50%)]
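A minimal sketch of the compression idea, assuming the scikit-learn PCA implementation; the random array merely stands in for a real grayscale image.

```python
import numpy as np
from sklearn.decomposition import PCA

image = np.random.default_rng(0).random((266, 266))  # stand-in for a real image

pca = PCA(n_components=32)
compressed = pca.fit_transform(image)         # 266 x 32 instead of 266 x 266
reconstructed = pca.inverse_transform(compressed)
print(pca.explained_variance_ratio_.sum())    # fraction of variance kept
```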
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 17
Application: Facial Recognition
Eigenfaces (1991) are a fast and efficient
way to recognize faces in a gallery of facial
images.
General Idea:
1. Generate a set of eigenfaces (see right)
based on a gallery
2. Compute the weight for each eigenface
based on the query face image
3. Use the weights to classify (and identify)
the person
Turk, M. and A. Pentland. “Face recognition using eigenfaces.” Proceedings. 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (1991): 586-591.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 18
Application: Anomaly Detection
Detecting anomalous traffic in IP Networks is crucial for administration!
General Idea:
Transform the time series of IP packages (requests, etc.) using a PCA
→ The first k (typically 10) components represent
“normal” behavior
→ The components “after” k, represent
anomalies and/or noise
Packages mapped primarily to the latter component are classified as
anomalous!
Anukool Lakhina, Konstantina Papagiannaki, Mark Crovella, Christophe Diot, Eric D. Kolaczyk, and Nina Taft. 2004. Structural analysis of network traffic flows. SIGMETRICS Perform. Eval.
Rev. 32, 1 (June 2004), 61–72. DOI:https://doi.org/10.1145/1012888.1005697
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 19
Limitation: The Adidas Problem
Let the data points be distributed like the Adidas logo, with three implicit classes (teal, red, blue stripe).

Feature 2
The principal directions 𝑤1 and 𝑤2
found by PCA are given.
The first principal direction 𝑤1 does
not preserve the information to
classify the points!
Feature 1

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 20


Machine Learning for Engineers
Principal Component Analysis – Intuition

Bilder: TF / Malter
Intuition behind Principal Component Analysis
Consider the following data with Twin Heights
height of two twins ℎ1 and ℎ2 .
Now assume that

Twin height ℎ2
a) We can only plot one-dimensional
figures due to limitations1
b) We have complex models that
can only handle a single value1
Twin height ℎ1

1) These are obviously constructed limitations for didactical concerns. However, in real life you will also face limitations in visualization
(maximum of three dimensions) or model feasibility.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 2
Intuition behind Principal Component Analysis
How do we find a transformation from two values to one value?
In this specific case we could
a) Keep the average height ½ (h₁ + h₂) as one value
b) Discard the difference in height h₁ − h₂

[Plot: Twin Heights, twin height h₂ over twin height h₁]

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 3


Intuition behind Principal Component Analysis
In this specific case we could
a) Keep the average height ½ (h₁ + h₂) as one value
b) Discard the difference in height h₁ − h₂

💡 This corresponds to a change in basis or coordinate frame in the plot.
⮩ Rotation by matrix 𝑨

[Plot: Twin Heights in the rotated coordinates (average height, height difference)]

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 4


Principal Component Analysis as Rotation
The Principal Component Analysis (PCA) is essentially a change in basis
using a rotation matrix 𝑨.
[Figure: the twin-height data before the rotation (axes: twin height h₁, twin height h₂) and after the rotation (axes: average height, height difference)]

Rotation matrix:

𝑨 = 1/√2 · [ 1   1 ]
            [ 1  −1 ]
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 5
Principal Component Analysis: Preserving Distances
💡 Idea: Preserve the distances Twin Heights
between each pair in the lower
dimension

Height difference
• Points that are far apart in the
original space remain far apart
• Points that are close in the original
space remain close

Average height

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 6


Principal Component Analysis: Preserving Variance
The previous idea is closely coupled Twin Heights
with the “variance”.

Height difference
• Find those axes and dimensions

Low variance
with high variance
• Discard the axes and dimensions
with increasingly low variance
High variance
Goal: Algorithmically find directions
with high/low variance in data1 Average height

1) In this trivial case we could hand-engineer such a transformation. However, this is not generally the case.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 7
Machine Learning for Engineers
Principal Component Analysis – Mathematics

Bilder: TF / Malter
Description of Input Data
Let 𝑿 be a 𝑛 × 𝑝 matrix of data Twin Heights
points, where
• 𝑛 is number of points (here 15)

Twin height ℎ2
• 𝑝 is number of features (here 2)
𝑿 = [ 1.75  1.77 ]
    [ 1.67  1.68 ]
    [  ⋮     ⋮   ]
    [ 2.01  1.98 ]

15.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 9


Description of Input Data
Let 𝑪 = 𝑿𝑇 𝑿 be the 𝑝 × 𝑝 covariance Twin Heights
matrix of data points1, where
• 𝑿 are data points

Twin height ℎ2
• 𝑝 is number of features (here 2)
The covariance matrix generalizes
the variance 𝜎 2 of a Gaussian
distribution 𝒩(𝜇, 𝜎 2 ) to higher
dimensions. Twin height ℎ1

1) Technically the covariance matrix of the transposed data points 𝑿𝑇 after centering to zero mean.
15.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 10
Description of Desired Output Data
We’re interested in new axes that Twin Heights
rotate the system, here 𝑤1 and 𝑤2
These are columns of a matrix

𝑾 = [ 0.7   0.7 ]
    [ 0.7  −0.7 ]
     (w₁)  (w₂)

[Plot: the principal directions w₁ and w₂ with variances 𝜆₁ and 𝜆₂ drawn over the twin-height data]

We’re also interested in the


“variance” of each axis, 𝜆1 and 𝜆2 Twin height ℎ1

𝜆1 = 2, 𝜆2 = 1

15.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 11


Relationship to Eigenvectors and -values
We’re in luck as this corresponds to Twin Heights
the eigenvectors and –values of 𝑪:
𝑪 = 𝑾𝑳𝑾𝑇

Twin height ℎ2
𝜆2
𝑤2 𝜆1
• 𝑾 is 𝑝 × 𝑝 matrix, where each 𝑤1
column is an eigenvector
• 𝑳 is a 𝑝 × 𝑝 matrix, with all
eigenvalues on diagonal in
decreasing order Twin height ℎ1

15.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 12


Relationship to Singular Value Decomposition
Alternatively, one can also perform a Twin Heights
Singular Value Decomposition (SVD)
on the data points 𝑿:

Twin height ℎ2
𝜆2
𝑿= 𝑼𝑺𝑾𝑇 𝑤2 𝜆1
𝑤1
• 𝑼 is 𝑛 × 𝑛 unitary matrix
• 𝑺 is 𝑛 × 𝑝 matrix of singular values
𝜎1 , … , 𝜎𝑝 on diagonal
Twin height ℎ1
• 𝑾 is 𝑝 × 𝑝 matrix of singular
vectors 𝒘1 , … , 𝒘2

15.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 13


Relationship to Singular Value Decomposition
Let us insert 𝑿 = 𝑼𝑺𝑾𝑇 into the Twin Heights
definition of the covariance matrix 𝑪:
𝑪 = 𝑿ᵀ𝑿
  = (𝑼𝑺𝑾ᵀ)ᵀ (𝑼𝑺𝑾ᵀ)
  = 𝑾𝑺ᵀ𝑼ᵀ𝑼𝑺𝑾ᵀ
  = 𝑾𝑺²𝑾ᵀ
• Eigenvalues correspond to
squared singular values 𝜆𝑖 = 𝜎𝑖2 Twin height ℎ1
• Eigenvectors directly correspond
to singular vectors
15.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 14
Summary of Algorithm
1. Find principal directions 𝑤1 , … 𝑤𝑝 Twin Heights
and eigenvalues 𝜆1 , … , 𝜆𝑝

Twin height ℎ2
2. Project data points into new
coordinate frame using 𝑤2
𝑤1
𝑻 = 𝑿𝑾
3. Keep the 𝑞 most important
dimensions as determined by
𝜆1 , … 𝜆𝑞 (which are sorted by Twin height ℎ1
variance)
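A minimal NumPy sketch of these three steps (not part of the original slides), using the SVD route; the synthetic correlated data is an assumption.

```python
import numpy as np

def pca_transform(X, q):
    X_centered = X - X.mean(axis=0)                       # center to zero mean
    U, S, Wt = np.linalg.svd(X_centered, full_matrices=False)
    T = X_centered @ Wt.T                                 # project onto the principal directions
    eigenvalues = S ** 2                                  # variances (up to scaling), sorted descending
    return T[:, :q], Wt.T[:, :q], eigenvalues[:q]         # keep the q most important dimensions

X = np.random.default_rng(0).normal(size=(15, 2)) @ np.array([[2.0, 1.0], [1.0, 2.0]])
T_q, W_q, lam = pca_transform(X, q=1)
```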

15.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 15


Machine Learning for Engineers
Principal Component Analysis – Summary

Bilder: TF / Malter
Summary of Principal Component Analysis
• Rotate coordinate system such that all axes are sorted from most
variance to least variance
• Required axes 𝑾 determined using either
• Eigenvectors and –values of covariance matrix 𝑪 = 𝑾𝑳𝑾𝑇
• Singular Value Decomposition (SVD) of data points 𝑿 = 𝑼𝑺𝑾𝑇

• Subsequently drop axes with least variance


• Variance-based feature selection has limitations (see. Adidas problem)

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 22


References
1. P. Liang. (2009). Practical Machine Learning: Dimensionality Reduction [Online]. Available:
https://people.eecs.berkeley.edu/~jordan/courses/294-fall09/lectures/dimensionality/
2. K. Murphy, Machine Learning: A Probabilistic Perspective. Cambridge, MA, USA: MIT Press,
2012.

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 23


Machine Learning for Engineers
Deep Learning – Perceptron

Bilder: TF / Malter
The Human Brain
The human brain is our reference for
an intelligent agent, that
a) … contains different areas
specialized for some tasks (e.g.,
the visual cortex)
b) … consists of neurons as the
fundamental unit of “computation”

⮩ Let us have a closer look at the inner workings of a neuron 💡

Image from https://ucresearch.tumblr.com/post/138868208574


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 2
The Human Neuron – Signaling Mechanism
1. Excitatory
stimuli reach
the neuron
⮩ Input
2. Threshold is
reached
3. Neuron fires
and triggers
action potential
⮩ Output

Image from https://www.khanacademy.org/science/biology/human-biology/neuron-nervous-system/a/the-synapse


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 3
The Perceptron – Computational Model of a Neuron
1. Let’s start by adding some basic components (input and output), we’re
subsequently going to build the computational model step by step

Input 𝑥1

Input 𝑥2 Output 𝑦1

Input 𝑥3

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 4


The Perceptron – Computational Model of a Neuron
2. Next, let’s add some weights to select and deselect input channels as
not all are relevant to our model neuron

Input 𝑥1
⋅ 𝑤1
Input 𝑥2 Output 𝑦1
⋅ 𝑤2
Input 𝑥3
⋅ 𝑤3

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 5


The Perceptron – Computational Model of a Neuron
3. Then, let’s add all the excitatory stimuli to the resting potential to
determine the current potential of our model neuron

Input 𝑥1
⋅ 𝑤1
Input 𝑥2 Output 𝑦1
⋅ 𝑤2 +
Input 𝑥3
⋅ 𝑤3 𝑏

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 6


The Perceptron – Computational Model of a Neuron
4. Finally, let’s apply a threshold function 𝜎 (usually the sigmoid) to
determine whether to send an action potential in the output

Input 𝑥1
⋅ 𝑤1
Input 𝑥2 Output 𝑦1
⋅ 𝑤2 +
Input 𝑥3
⋅ 𝑤3   𝑏
𝜎(𝑥) = 1 / (1 + 𝑒^(−𝑥))

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 7


The Perceptron – Computational Model of a Neuron
5. We can now write a perceptron as a mathematical function mapping the inputs 𝑥1, 𝑥2, 𝑥3 to the output 𝑦1 using the channel weights 𝑤1, 𝑤2, 𝑤3 and the bias 𝑏:

𝑦1 = 𝜎(𝑤1 ⋅ 𝑥1 + 𝑤2 ⋅ 𝑥2 + 𝑤3 ⋅ 𝑥3 + 𝑏) = 𝜎(Σ𝑖 𝑤𝑖 ⋅ 𝑥𝑖 + 𝑏)
Input 𝑥1
⋅ 𝑤1
Input 𝑥2 Output 𝑦1
⋅ 𝑤2 +
Input 𝑥3
⋅ 𝑤3   𝑏
𝜎(𝑥) = 1 / (1 + 𝑒^(−𝑥))
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 8
The Perceptron – Signaling Mechanism
Given a perceptron with parameters
𝑤1 = 4, 𝑤2 = 7, 𝑤3 = 11, 𝑏 = −10 and equation
𝑦1 = 𝜎(𝑤1 ⋅ 𝑥1 + 𝑤2 ⋅ 𝑥2 + 𝑤3 ⋅ 𝑥3 + 𝑏)

Input 𝒙𝟏 = 𝟏, 𝒙𝟐 = 𝟎, 𝒙𝟑 = 𝟎 (“100”):
𝑦1 = 𝜎(4 ⋅ 1 + 7 ⋅ 0 + 11 ⋅ 0 − 10) = 𝜎(4 + 0 + 0 − 10) = 𝜎(−6) ≈ 0.00
⮩ Output 𝑦1 not activated

Input 𝒙𝟏 = 𝟏, 𝒙𝟐 = 𝟏, 𝒙𝟑 = 𝟏 (“111”):
𝑦1 = 𝜎(4 ⋅ 1 + 7 ⋅ 1 + 11 ⋅ 1 − 10) = 𝜎(4 + 7 + 11 − 10) = 𝜎(12) ≈ 1.00
⮩ Output 𝑦1 is activated
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 9
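This worked example can be checked with a few lines of Python (a sketch only; the helper names sigmoid and perceptron are illustrative, while the parameter values are the ones from the slide above):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def perceptron(x, w, b):
    # y1 = sigma(sum_i w_i * x_i + b)
    return sigmoid(np.dot(w, x) + b)

w = np.array([4.0, 7.0, 11.0])
b = -10.0

print(perceptron(np.array([1, 0, 0]), w, b))   # sigma(-6) ~ 0.0025 -> not activated
print(perceptron(np.array([1, 1, 1]), w, b))   # sigma(12) ~ 1.0    -> activated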
Generalizing the Threshold: Activation Function
• Sigmoid: 𝜎(𝑥) = 1 / (1 + 𝑒^(−𝑥))
• Hyperbolic Tangent: 𝜎(𝑥) = tanh(𝑥)
• Rectified Linear Unit: 𝜎(𝑥) = max(𝑥, 0)

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 10


Machine Learning for Engineers
Deep Learning – Multilayer Perceptron

Bilder: TF / Malter
The Perceptron – A Recap
In the last section we learned about the perceptron, a computational model
representing a neuron.
Inputs: 𝑥1, …, 𝑥𝑛
Output: 𝑦1
Computation: 𝑦1 = 𝜎(Σ_{𝑖=1}^{𝑛} 𝑤𝑖 ⋅ 𝑥𝑖 + 𝑏)

⮩ We can combine more than one perceptron for complex models 💡

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 12


Combining Multiple Perceptron – Individual Notation
Let’s now combine three perceptrons with different outputs 𝑦1, 𝑦2 and 𝑦3. The computations for each of these are:

𝑦1 = 𝜎(𝒘𝟏𝟏 𝑥1 + 𝒘𝟏𝟐 𝑥2 + 𝒘𝟏𝟑 𝑥3 + 𝒃𝟏)
𝑦2 = 𝜎(𝒘𝟐𝟏 𝑥1 + 𝒘𝟐𝟐 𝑥2 + 𝒘𝟐𝟑 𝑥3 + 𝒃𝟐)
𝑦3 = 𝜎(𝒘𝟑𝟏 𝑥1 + 𝒘𝟑𝟐 𝑥2 + 𝒘𝟑𝟑 𝑥3 + 𝒃𝟑)

Note how they all have their own parameters (weights 𝑤 and bias 𝑏).
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 13
Combining Multiple Perceptron – Matrix Notation
The outputs 𝒚, weights 𝑾, inputs 𝒙 and bias 𝒃 of each perceptron are matrices and vectors. We can thus rewrite the three computations1) as:

(𝑦1, 𝑦2, 𝑦3)𝑇 = 𝜎( [𝑤11 𝑤12 𝑤13; 𝑤21 𝑤22 𝑤23; 𝑤31 𝑤32 𝑤33] ⋅ (𝑥1, 𝑥2, 𝑥3)𝑇 + (𝑏1, 𝑏2, 𝑏3)𝑇 )

Or in a more simplified form:

𝒚 = 𝜎(𝑾 ⋅ 𝒙 + 𝒃)

1) This is a perfect opportunity to brush up your linear algebra skills. As an optional homework assignment, you can verify that the given matrix multiplication holds.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 14
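As a quick optional check of this matrix form, here is a small NumPy sketch that compares the three individual perceptron computations with 𝒚 = 𝜎(𝑾 ⋅ 𝒙 + 𝒃); all numerical values are made up for illustration:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.5, -1.0, 2.0])                  # inputs x1, x2, x3 (arbitrary values)
W = np.array([[0.1, 0.2, 0.3],                  # row i holds the weights w_i1, w_i2, w_i3
              [0.4, 0.5, 0.6],
              [0.7, 0.8, 0.9]])
b = np.array([0.1, -0.2, 0.3])

# individual notation: one equation per perceptron
y_individual = np.array([sigmoid(W[i, 0] * x[0] + W[i, 1] * x[1] + W[i, 2] * x[2] + b[i])
                         for i in range(3)])

# matrix notation: y = sigma(W x + b)
y_matrix = sigmoid(W @ x + b)

print(np.allclose(y_individual, y_matrix))      # True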
Combining Multiple Perceptron – Single Layer
Instead of depicting each computation as a node or circle, we can also
• Depict each value (input 𝑥 and output 𝑦) as a node or circle
• Depict each weighted connection as an arrow

This is the typical graphical representation1), with the underlying computation remaining 𝒚 = 𝜎(𝑾 ⋅ 𝒙 + 𝒃)

1) Note how the bias is implicitly assumed and not visualized.


| Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 15
Combining Multiple Perceptron – Multiple Layers
There’s no reason to limit ourselves to a single layer. We can chain multiple layers, with each output being the input of the next:

𝒚 = 𝜎(𝑾1 ⋅ 𝒙 + 𝒃1)
𝒛 = 𝜎(𝑾2 ⋅ 𝒚 + 𝒃2)

Combining these leads us to:

𝒛 = 𝜎(𝑾2 ⋅ 𝜎(𝑾1 ⋅ 𝒙 + 𝒃1) + 𝒃2)

| Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 16


Multilayer Perceptron – A Summary
• A multilayer perceptron is a neural network with multiple perceptrons
• It is oftentimes organized into layers
• Each layer has its own set of parameters (weights 𝑾𝑖 and bias 𝒃𝑖)
• The underlying computation is a matrix multiplication described by 𝒚𝑖+1 = 𝜎(𝑾𝑖 ⋅ 𝒚𝑖 + 𝒃𝑖)

| Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 17
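A minimal sketch of this two-layer forward pass in NumPy; the layer sizes and random parameter values are arbitrary placeholders, not values from the lecture:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(42)
x = np.array([1.2, 1.4, 1.3])             # inputs x1, x2, x3

# layer 1: 3 inputs -> 3 outputs y, layer 2: 3 inputs -> 2 outputs z
W1, b1 = rng.normal(size=(3, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)

y = sigmoid(W1 @ x + b1)                  # y = sigma(W1 x + b1)
z = sigmoid(W2 @ y + b2)                  # z = sigma(W2 y + b2)
print(z)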


Machine Learning for Engineers
Deep Learning – Loss Function

Bilder: TF / Malter
How do our models learn?
• Up until now, the parameters 𝑾𝑖 and 𝒃𝑖 of the multilayer perceptron
were assumed as given
• We now aim to learn the parameters 𝑾𝑖 and 𝒃𝑖 based on some
example input and output data
• Let us assume we have a dataset of 𝒙 and corresponding 𝒚
(𝒙1, 𝒚1) = ( (1.2, 1.4, 1.3)𝑇, (3.2, 0.2)𝑇 ) and (𝒙2, 𝒚2) = ( (0.2, 0.4, 0.3)𝑇, (0.2, 0.2)𝑇 ) and …

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 19


How do our models learn?
• We now want to ensure that the parameters 𝑾𝑖 and 𝒃𝑖 can correctly
predict the 𝒚𝑖 based on the 𝒙𝑖
• To do this, we apply 𝒙𝑖 on our model and compare the result with 𝒚𝑖

[Network diagram: the inputs 𝑥1 = 1.2, 𝑥2 = 1.4, 𝑥3 = 1.3 are passed through the layers (𝑾1, 𝒃1 and 𝑾2, 𝒃2); the predictions 𝑦̂1 = 2.9 and 𝑦̂2 = 0.2 are compared with the expected outputs 3.2 and 0.2]
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 20
The Loss Function
• We need a comparison metric between the predicted outputs 𝒚̂𝑖 and the expected outputs 𝒚𝑖
• This is called the loss function, which usually depends on the type of
problem the multilayer perceptron solves
• For regression, a common metric is the mean squared error
• For classification, a common metric is the cross entropy

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 21


Mean Squared Error
• For regression, each output 𝑦1 , … , 𝑦𝑛 is a real-valued target
• The difference 𝑦̂𝑖 − 𝑦𝑖 tells us something about how far off we are, with the mean square error computed as

𝐿(𝒚̂, 𝒚) = (1/𝑛) Σ_{𝑖=1}^{𝑛} (𝑦̂𝑖 − 𝑦𝑖)²

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 22
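A one-line NumPy version of the mean squared error (a sketch, not a reference implementation from the lecture; the example numbers are the prediction/target pair from the earlier slide):

import numpy as np

def mse(y_hat, y):
    # L(y_hat, y) = (1/n) * sum_i (y_hat_i - y_i)^2
    return np.mean((y_hat - y) ** 2)

print(mse(np.array([2.9, 0.2]), np.array([3.2, 0.2])))   # 0.045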


Cross Entropy – Class Probability and Softmax
• For classification, each output 𝑦1 , … , 𝑦𝑛 represents a class energy
• If the target class index is 𝑖, then only 𝑦𝑖 = 1 and all other 𝑦𝑗 = 0, 𝑖 ≠ 𝑗
(this is also called one-hot encoding)
• Example for class 1: 𝑦 = (1 0 0 0 0)𝑇
• Example for class 3: 𝑦 = (0 0 1 0 0)𝑇

• To ensure that we have probabilities, we apply a softmax transformation

𝑦̄𝑖 = exp(𝑦̂𝑖) / Σ_{𝑗=1}^{𝑛} exp(𝑦̂𝑗)

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 23


Cross Entropy – Negative Log Likelihood
• After the transformation, each 𝑦̄1, …, 𝑦̄𝑛 represents a class probability
• The negative log −log 𝑦̄𝑖 for the target class index 𝑖 tells us how far off we are, with the cross entropy calculated as

𝐿(𝒚̂, 𝒚) = −log 𝑦̄𝑖 = −log( exp(𝑦̂𝑖) / Σ_{𝑗=1}^{𝑛} exp(𝑦̂𝑗) )

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 24
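A small sketch of the softmax transformation and the resulting cross-entropy loss for a one-hot target; the class scores below are purely illustrative:

import numpy as np

def softmax(y_hat):
    e = np.exp(y_hat - np.max(y_hat))        # subtracting the maximum improves numerical stability
    return e / e.sum()

def cross_entropy(y_hat, target_index):
    # L = -log(softmax(y_hat)[target_index])
    return -np.log(softmax(y_hat)[target_index])

scores = np.array([2.0, 0.5, -1.0])          # class energies y_hat
print(softmax(scores))                       # class probabilities, sum to 1
print(cross_entropy(scores, target_index=0))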


How do our models learn? – A Summary
• How wrong is the current set of parameters 𝜽? Forward Pass
• How should we change the set of parameters by ∇𝜽? Backward Pass

[Diagram: Forward Pass → 𝜽: the inputs 𝑥1 = 1.2, 𝑥2 = 1.4, 𝑥3 = 1.3 are propagated through the network and the predictions 𝑦̂1 = 2.9, 𝑦̂2 = 0.2 are compared with the targets 3.2, 0.2; Backward Pass → ∇𝜽]
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 25
Machine Learning for Engineers
Deep Learning – Gradient Descent

Bilder: TF / Malter
Parameter Optimization
• At this stage, we can compute the gradient of the parameters ∇𝜽
• The gradient tells us how we need to change the current parameters 𝜽
in order to make fewer errors on the given data
• We can use this in an iterative algorithm called gradient descent, with
the central equation being
𝜽𝑖+1 = 𝜽𝑖 − 𝜇 ⋅ ∇𝜽𝑖
• The learning rate 𝜇 tells us how quickly we should change the current
parameters 𝜽𝑖
⮩ Let us have a closer look at an illustrative example 💡
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 27
Gradient Descent – An Example
• Let us assume that the error function is 𝑓(𝜃) = 𝜃²
• The gradient of the error function is the derivative 𝑓′(𝜃) = 2 ⋅ 𝜃
• We aim to find the minimum error at
the location 𝜃 = 0
• Our initial estimate for the location is
𝜃1 = 2
• The learning rate is 𝜇 = 0.25

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 28


Gradient Descent – An Example
• 1. Step: 𝜃1 = 2, ∇𝜃1 = 4
𝜃 2 = 𝜃1 − 𝜇∇𝜃1 = 2 − 0.25 ⋅ 4 = 1

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 29


Gradient Descent – An Example
• 1. Step: 𝜃1 = 2, ∇𝜃1 = 4
𝜃 2 = 𝜃1 − 𝜇∇𝜃1 = 2 − 0.25 ⋅ 4 = 1
• 2. Step: 𝜃 2 = 1, ∇𝜃 2 = 2
𝜃 3 = 𝜃 2 − 𝜇∇𝜃 2 = 1 − 0.25 ⋅ 2 = 0.5

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 30


Gradient Descent – An Example
• 1. Step: 𝜃1 = 2, ∇𝜃1 = 4
𝜃 2 = 𝜃1 − 𝜇∇𝜃1 = 2 − 0.25 ⋅ 4 = 1
• 2. Step: 𝜃 2 = 1, ∇𝜃 2 = 2
𝜃 3 = 𝜃 2 − 𝜇∇𝜃 2 = 1 − 0.25 ⋅ 2 = 0.5
• 3. Step: 𝜃 3 = 0.5, ∇𝜃 3 = 1
𝜃 4 = 𝜃 3 − 𝜇∇𝜃 3 = 0.5 − 0.25 ⋅ 1 = 0.25

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 31


Gradient Descent – An Example
• 1. Step: 𝜃1 = 2, ∇𝜃1 = 4
𝜃 2 = 𝜃1 − 𝜇∇𝜃1 = 2 − 0.25 ⋅ 4 = 1
• 2. Step: 𝜃 2 = 1, ∇𝜃 2 = 2
𝜃 3 = 𝜃 2 − 𝜇∇𝜃 2 = 1 − 0.25 ⋅ 2 = 0.5
• 3. Step: 𝜃 3 = 0.5, ∇𝜃 3 = 1
𝜃 4 = 𝜃 3 − 𝜇∇𝜃 3 = 0.5 − 0.25 ⋅ 1 = 0.25
• We incrementally approach the
expected target 𝜃 = 0

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 32
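The three steps above can be reproduced with a few lines of Python (a sketch of plain gradient descent on 𝑓(𝜃) = 𝜃², not of the full training procedure):

def grad(theta):
    return 2.0 * theta          # f(theta) = theta**2, so f'(theta) = 2*theta

theta, mu = 2.0, 0.25           # initial estimate and learning rate from the example
for step in range(3):
    theta = theta - mu * grad(theta)
    print(step + 1, theta)      # prints 1.0, 0.5, 0.25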


Gradient Descent – Limitations
• The learning rate 𝜇 can lead to slow
convergence if not properly configured
(see example for 𝜇 = 0.95)

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 33


Gradient Descent – Limitations
• The learning rate 𝜇 can lead to slow convergence if not properly configured (see example for 𝜇 = 0.95)
• The starting parameters 𝜃1 can lead to a solution stuck in a local minimum (see example for 𝑓(𝜃) = 0.5𝜃⁴ − 0.25𝜃² + 0.1𝜃)
• There’s no guarantee of finding an optimal solution (a global minimum)

[Plot: error function with a global and a local minimum]
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 34
Gradient Descent – A Summary
• Gradient Descent incrementally adjusts the parameters 𝜽 based on the
gradient ∇𝜽 of the parameters
1. For each iteration
2. Compute the error of the parameters 𝜽
3. Compute the gradient of the parameters ∇𝜽
4. Update parameters using 𝜽𝑖+1 = 𝜽𝑖 − 𝜇 ⋅ ∇𝜽𝑖

• There is no guarantee that the algorithm converges to the global optimum (e.g., due to learning rate 𝜇 or initial parameters 𝜽1)

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 35


Machine Learning for Engineers
Deep Learning – Learning Process

Bilder: TF / Malter
How do our models learn?
• We’ve just talked about the concept of gradient descent, but largely
avoided the details for the function 𝑓(𝜽)1)
• The goal is to minimize the error or loss function across the complete
dataset 𝒙1 , 𝒚1 , … , (𝒙𝑛 , 𝒚𝑛 )
𝑓(𝜽) = Σ_{𝑖=1}^{𝑛} 𝐿(𝒚̂𝑖, 𝒚𝑖) = Σ_{𝑖=1}^{𝑛} 𝐿(𝑔(𝒙𝑖, 𝜽), 𝒚𝑖)

• The function 𝑔(𝒙𝑖, 𝜽) encapsulates the computations of a multilayer perceptron with parameters 𝜽

1) In the previous section the function was 𝑓(𝜃) = 𝜃². For a multilayer perceptron, the function is quite obviously different.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 37
How do our models learn?
• We thus need to loop over the training data 𝒙1 , 𝒚1 , … , (𝒙𝑛 , 𝒚𝑛 ) to
compute sums over individual data points
• This results in a slight modification to the gradient descent algorithm
1. For each epoch – Loop over training data
2. For each batch – Loop over pieces of training data
3. Compute the error of the parameters 𝜽
4. Compute the gradient of the parameters ∇𝜽
5. Update parameters using 𝜽𝑖+1 = 𝜽𝑖 − 𝜇 ⋅ ∇𝜽𝑖

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 38


The Concept of Epochs
• Gradient descent requires multiple iterations to reach a minimum
• The optimization function 𝑓(𝜽) sums the loss functions 𝐿(𝒚̂𝑖, 𝒚𝑖) of each individual data point (𝒙𝑖, 𝒚𝑖) in the training data
• We need to go through the training data multiple times to find suitable
parameters 𝜽 for our problem
⮩ Each such iteration is called an epoch

⮩ Requires the computation of the full sum, which is expensive 💡

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 39


The Concept of Batches
• Instead of computing the full sum of one epoch in one go, we instead
break it apart into multiple smaller pieces
• These pieces are usually of a specified size (e.g., 64), which tells us
how many training data points to use for each piece
• One epoch then consists of processing all pieces
⮩ Each such piece of the training data is called a batch
Batch1) (30 samples) Full training data (120 samples)

1) Note how we need 4 batches to go through all of the training data in this specific example.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 40
The Learning Process – A Summary
• The parameters 𝜽 are optimized by iterating through the training data
• For practical reasons, this is split into epochs and batches

Batch (30 samples) Full training data (120 samples)


Epoch 1 1. iteration 2. iteration 3. iteration 4. iteration

Epoch 2 5. iteration 6. iteration 7. iteration 8. iteration

Epoch 3 9. iteration 10. iteration 11. iteration 12. iteration



13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 41
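A schematic training loop with epochs and batches; the linear model, its MSE gradient, and all numbers below are placeholders, so treat this as a sketch of the loop structure rather than an implementation of the multilayer perceptron training from the lecture:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))                    # full training data: 120 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=120)

theta = np.zeros(3)                              # parameters of a simple linear model
mu, batch_size, n_epochs = 0.05, 30, 3           # 4 batches per epoch, as in the example above

for epoch in range(n_epochs):                    # loop over the training data
    for start in range(0, len(X), batch_size):   # loop over pieces of the training data
        Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        grad = 2.0 / len(Xb) * Xb.T @ (Xb @ theta - yb)   # MSE gradient on this batch
        theta = theta - mu * grad                # theta_{i+1} = theta_i - mu * grad_theta
    print(epoch + 1, np.mean((X @ theta - y) ** 2))       # loss after each epoch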
Machine Learning for Engineers
Deep Learning – Convolution

Bilder: TF / Malter
The Visual Cortex
We’ve previously learned that the
brain has specialized regions.
• The visual cortex is in charge of
processing visual information
collected from the retinae
• However, cortical cells only respond to stimuli within a small receptive field

⮩ Let us look at the implications this has on neural networks 💡


Image from https://link.springer.com/referenceworkentry/10.1007/978-1-4614-7320-6_360-2
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 43
Multilayer Perceptron versus Cortical Neuron
• Multilayer perceptron’s neurons are connected to all previous neurons
• Cortical neurons are only connected to a small receptive field

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 44
The Convolution Operation
The output computation now only depends on a subset of inputs:
𝑦11 = 𝑤11 𝑥11 + 𝑤12 𝑥12 + 𝑤13 𝑥13 + 𝑤21 𝑥21 + 𝑤22 𝑥22 + 𝑤23 𝑥23
+ 𝑤31 𝑥31 + 𝑤32 𝑥32 + 𝑤33 𝑥33
𝑥11 𝑥12 𝑥13 𝑥14 𝑥15

𝑥21 𝑥22 𝑥23 𝑥24 𝑥25 𝑤11 𝑤12 𝑤13 𝑦11

𝑥31 𝑥32 𝑥33 𝑥34 𝑥35 ⋆ 𝑤21 𝑤22 𝑤23 =


𝑥41 𝑥42 𝑥43 𝑥44 𝑥45 𝑤31 𝑤32 𝑤33

𝑥51 𝑥52 𝑥53 𝑥54 𝑥55


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 45
The Convolution Operation
The output computation now only depends on a subset of inputs:
𝑦12 = 𝑤11 𝑥12 + 𝑤12 𝑥13 + 𝑤13 𝑥14 + 𝑤21 𝑥22 + 𝑤22 𝑥23 + 𝑤23 𝑥24
+ 𝑤31 𝑥32 + 𝑤32 𝑥33 + 𝑤33 𝑥34
𝑥11 𝑥12 𝑥13 𝑥14 𝑥15

𝑥21 𝑥22 𝑥23 𝑥24 𝑥25 𝑤11 𝑤12 𝑤13 𝑦11 𝑦12

𝑥31 𝑥32 𝑥33 𝑥34 𝑥35 ⋆ 𝑤21 𝑤22 𝑤23 =


𝑥41 𝑥42 𝑥43 𝑥44 𝑥45 𝑤31 𝑤32 𝑤33

𝑥51 𝑥52 𝑥53 𝑥54 𝑥55


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 46
The Convolution Operation
The output computation now only depends on a subset of inputs:
𝑦13 = 𝑤11 𝑥13 + 𝑤12 𝑥14 + 𝑤13 𝑥15 + 𝑤21 𝑥23 + 𝑤22 𝑥24 + 𝑤23 𝑥25
+ 𝑤31 𝑥33 + 𝑤32 𝑥34 + 𝑤33 𝑥35
𝑥11 𝑥12 𝑥13 𝑥14 𝑥15

𝑥21 𝑥22 𝑥23 𝑥24 𝑥25 𝑤11 𝑤12 𝑤13 𝑦11 𝑦12 𝑦13

𝑥31 𝑥32 𝑥33 𝑥34 𝑥35 ⋆ 𝑤21 𝑤22 𝑤23 =


𝑥41 𝑥42 𝑥43 𝑥44 𝑥45 𝑤31 𝑤32 𝑤33

𝑥51 𝑥52 𝑥53 𝑥54 𝑥55


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 47
The Convolution Operation
The output computation now only depends on a subset of inputs:
𝑦23 = 𝑤11 𝑥23 + 𝑤12 𝑥24 + 𝑤13 𝑥25 + 𝑤21 𝑥33 + 𝑤22 𝑥34 + 𝑤23 𝑥35
+ 𝑤31 𝑥43 + 𝑤32 𝑥44 + 𝑤33 𝑥45
𝑥11 𝑥12 𝑥13 𝑥14 𝑥15

𝑥21 𝑥22 𝑥23 𝑥24 𝑥25 𝑤11 𝑤12 𝑤13 𝑦11 𝑦12 𝑦13

𝑥31 𝑥32 𝑥33 𝑥34 𝑥35 ⋆ 𝑤21 𝑤22 𝑤23 = 𝑦23

𝑥41 𝑥42 𝑥43 𝑥44 𝑥45 𝑤31 𝑤32 𝑤33

𝑥51 𝑥52 𝑥53 𝑥54 𝑥55


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 48
The Convolution Operation
The output computation now only depends on a subset of inputs:
𝑦22 = 𝑤11 𝑥22 + 𝑤12 𝑥23 + 𝑤13 𝑥24 + 𝑤21 𝑥32 + 𝑤22 𝑥33 + 𝑤23 𝑥34
+ 𝑤31 𝑥42 + 𝑤32 𝑥43 + 𝑤33 𝑥44
𝑥11 𝑥12 𝑥13 𝑥14 𝑥15

𝑥21 𝑥22 𝑥23 𝑥24 𝑥25 𝑤11 𝑤12 𝑤13 𝑦11 𝑦12 𝑦13

𝑥31 𝑥32 𝑥33 𝑥34 𝑥35 ⋆ 𝑤21 𝑤22 𝑤23 = 𝑦22 𝑦23

𝑥41 𝑥42 𝑥43 𝑥44 𝑥45 𝑤31 𝑤32 𝑤33

𝑥51 𝑥52 𝑥53 𝑥54 𝑥55


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 49
The Convolution Operation
The output computation now only depends on a subset of inputs:
𝑦21 = 𝑤11 𝑥21 + 𝑤12 𝑥22 + 𝑤13 𝑥23 + 𝑤21 𝑥31 + 𝑤22 𝑥32 + 𝑤23 𝑥33
+ 𝑤31 𝑥41 + 𝑤32 𝑥42 + 𝑤33 𝑥43
𝑥11 𝑥12 𝑥13 𝑥14 𝑥15

𝑥21 𝑥22 𝑥23 𝑥24 𝑥25 𝑤11 𝑤12 𝑤13 𝑦11 𝑦12 𝑦13

𝑥31 𝑥32 𝑥33 𝑥34 𝑥35 ⋆ 𝑤21 𝑤22 𝑤23 = 𝑦21 𝑦22 𝑦23

𝑥41 𝑥42 𝑥43 𝑥44 𝑥45 𝑤31 𝑤32 𝑤33

𝑥51 𝑥52 𝑥53 𝑥54 𝑥55


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 50
The Convolution Operation
The output computation now only depends on a subset of inputs:
𝑦31 = 𝑤11 𝑥31 + 𝑤12 𝑥32 + 𝑤13 𝑥33 + 𝑤21 𝑥41 + 𝑤22 𝑥42 + 𝑤23 𝑥43
+ 𝑤31 𝑥51 + 𝑤32 𝑥52 + 𝑤33 𝑥53
𝑥11 𝑥12 𝑥13 𝑥14 𝑥15

𝑥21 𝑥22 𝑥23 𝑥24 𝑥25 𝑤11 𝑤12 𝑤13 𝑦11 𝑦12 𝑦13

𝑥31 𝑥32 𝑥33 𝑥34 𝑥35 ⋆ 𝑤21 𝑤22 𝑤23 = 𝑦21 𝑦22 𝑦23

𝑥41 𝑥42 𝑥43 𝑥44 𝑥45 𝑤31 𝑤32 𝑤33 𝑦31

𝑥51 𝑥52 𝑥53 𝑥54 𝑥55


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 51
The Convolution Operation
The output computation now only depends on a subset of inputs:
𝑦32 = 𝑤11 𝑥32 + 𝑤12 𝑥33 + 𝑤13 𝑥34 + 𝑤21 𝑥42 + 𝑤22 𝑥43 + 𝑤23 𝑥44
+ 𝑤31 𝑥52 + 𝑤32 𝑥53 + 𝑤33 𝑥54
𝑥11 𝑥12 𝑥13 𝑥14 𝑥15

𝑥21 𝑥22 𝑥23 𝑥24 𝑥25 𝑤11 𝑤12 𝑤13 𝑦11 𝑦12 𝑦13

𝑥31 𝑥32 𝑥33 𝑥34 𝑥35 ⋆ 𝑤21 𝑤22 𝑤23 = 𝑦21 𝑦22 𝑦23

𝑥41 𝑥42 𝑥43 𝑥44 𝑥45 𝑤31 𝑤32 𝑤33 𝑦31 𝑦32

𝑥51 𝑥52 𝑥53 𝑥54 𝑥55


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 52
The Convolution Operation
The output computation now only depends on a subset of inputs:
𝑦33 = 𝑤11 𝑥33 + 𝑤12 𝑥34 + 𝑤13 𝑥35 + 𝑤21 𝑥43 + 𝑤22 𝑥44 + 𝑤23 𝑥45
+ 𝑤31 𝑥53 + 𝑤32 𝑥54 + 𝑤33 𝑥55
𝑥11 𝑥12 𝑥13 𝑥14 𝑥15

𝑥21 𝑥22 𝑥23 𝑥24 𝑥25 𝑤11 𝑤12 𝑤13 𝑦11 𝑦12 𝑦13

𝑥31 𝑥32 𝑥33 𝑥34 𝑥35 ⋆ 𝑤21 𝑤22 𝑤23 = 𝑦21 𝑦22 𝑦23

𝑥41 𝑥42 𝑥43 𝑥44 𝑥45 𝑤31 𝑤32 𝑤33 𝑦31 𝑦32 𝑦33

𝑥51 𝑥52 𝑥53 𝑥54 𝑥55


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 53
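A direct NumPy translation of this sliding-window computation (technically a cross-correlation, which is what most deep learning frameworks implement under the name convolution); the input and kernel values below are arbitrary placeholders:

import numpy as np

def conv2d(x, w):
    # slide the kernel w over the input x with stride 1 and no padding
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            y[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)   # y_ij depends only on its receptive field
    return y

x = np.arange(25, dtype=float).reshape(5, 5)   # 5x5 input as on the slides
w = np.ones((3, 3)) / 9.0                      # 3x3 averaging kernel (placeholder weights)
print(conv2d(x, w).shape)                      # (3, 3) output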
The Convolutional Neural Network – Multiple Filters
Instead of a single convolution operation we can also apply multiple
convolution operations in parallel on the same input.

[Diagram: the same input 𝒙 is convolved with several kernels 𝒘; the different outputs 𝒚 are concatenated in the channel dimension.]

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 54


The Convolutional Neural Network – Multiple Layers
To incrementally process the input to relevant features, we again apply
multiple layers of convolutions.
[Diagram: Input Image → Convolution (64 kernels) → Convolution (128 kernels)]

After many layers the number of features quickly grows 💡
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 55
Machine Learning for Engineers
Deep Learning – Pooling

Bilder: TF / Malter
Summarizing Features via Pooling
• Deeper in the network the number of features quickly grows
• It makes sense to summarize these features deeper in the network
• From a practical perspective, this reduces the number of computations
• From a theoretical perspective, these higher-level (i.e., global scale)
features don’t require a high spatial resolution

• This process is called pooling and works like a convolution, with the
kernel replaced by a pooling operation

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 57


The Pooling Operation
The output computation again only depends on a subset of inputs, but the
stride (spacing between subsets) is commonly larger:
𝑦11 = max(𝑥11 , 𝑥12 , 𝑥21 , 𝑥22 )

𝑥11 𝑥12 𝑥13 𝑥14

𝑥21 𝑥22 𝑥23 𝑥24 𝑦11


max 𝒙
𝑥31 𝑥32 𝑥33 𝑥34

𝑥41 𝑥42 𝑥43 𝑥44

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 58


The Pooling Operation
The output computation again only depends on a subset of inputs, but the
stride (spacing between subsets) is commonly larger:
𝑦12 = max(𝑥13 , 𝑥14 , 𝑥23 , 𝑥24 )

𝑥11 𝑥12 𝑥13 𝑥14

𝑥21 𝑥22 𝑥23 𝑥24 𝑦11 𝑦12


max 𝒙
𝑥31 𝑥32 𝑥33 𝑥34

𝑥41 𝑥42 𝑥43 𝑥44

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 59


The Pooling Operation
The output computation again only depends on a subset of inputs, but the
stride (spacing between subsets) is commonly larger:
𝑦21 = max(𝑥31 , 𝑥32 , 𝑥41 , 𝑥42 )

𝑥11 𝑥12 𝑥13 𝑥14

𝑥21 𝑥22 𝑥23 𝑥24 𝑦11 𝑦12


max 𝒙
𝑥31 𝑥32 𝑥33 𝑥34 𝑦21

𝑥41 𝑥42 𝑥43 𝑥44

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 60


The Pooling Operation
The output computation again only depends on a subset of inputs, but the
stride (spacing between subsets) is commonly larger:
𝑦22 = max(𝑥33 , 𝑥34 , 𝑥43 , 𝑥44 )

𝑥11 𝑥12 𝑥13 𝑥14

𝑥21 𝑥22 𝑥23 𝑥24 𝑦11 𝑦12


max 𝒙
𝑥31 𝑥32 𝑥33 𝑥34 𝑦21 𝑦22

𝑥41 𝑥42 𝑥43 𝑥44

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 61


The Pooling Operation
There are multiple methods to summarize the features which are
commonly used in Convolutional Neural Networks (CNN):
• Maximum (example for 𝑛 = 4): 𝑦 = max(𝑥1, 𝑥2, 𝑥3, 𝑥4); general case: 𝑦 = max(𝑥1, …, 𝑥𝑛)
• Average (example for 𝑛 = 4): 𝑦 = (1/4)(𝑥1 + 𝑥2 + 𝑥3 + 𝑥4) = (1/4) Σ_{𝑖=1}^{4} 𝑥𝑖; general case: 𝑦 = (1/𝑛) Σ_{𝑖=1}^{𝑛} 𝑥𝑖
• L2-Norm (example for 𝑛 = 4): 𝑦 = √(𝑥1² + 𝑥2² + 𝑥3² + 𝑥4²) = √(Σ_{𝑖=1}^{4} 𝑥𝑖²); general case: 𝑦 = √(Σ_{𝑖=1}^{𝑛} 𝑥𝑖²)

⮩ There’s no general consensus what the “best” method is 💡

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 62
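A small sketch of 2×2 pooling with stride 2 on a 4×4 input, covering the three summary methods listed above (function and parameter names are illustrative):

import numpy as np

def pool2d(x, size=2, method="max"):
    # non-overlapping size x size windows, i.e. stride = size, as in the 4x4 example above
    oh, ow = x.shape[0] // size, x.shape[1] // size
    y = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            if method == "max":
                y[i, j] = window.max()
            elif method == "average":
                y[i, j] = window.mean()
            else:                                # "l2": square root of the sum of squares
                y[i, j] = np.sqrt(np.sum(window ** 2))
    return y

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(x, method="max"))                   # 2x2 output, e.g. y11 = max(x11, x12, x21, x22)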


The Convolutional Neural Network – Pooling
Adding pooling to the network reduced the number of features deeper in
the network using a (relatively) inexpensive method.

[Diagram: Input Image → Convolution 1 → Pooling 1 → Convolution 2 → Pooling 2]
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 63
Machine Learning for Engineers
Deep Learning – Applications

Bilder: TF / Malter
Full Image Classification
• Task: Plant leaf disease classification of 14 different species
• Input: Image of plant leaf
• Output: Plant and disease class
• Model: EfficientNet
• Ü. Atila, M. Uçar, K. Akyol, and E. Uçar, “Plant leaf disease classification using EfficientNet deep learning model”, 2020.

[Example images: grape with black measles, potato with late blight, healthy strawberry, healthy tomato]


13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 65
Full Image Regression
• Task: Pose estimation based on
human key points (eyes, joints, …)
• Input: Image of human
• Output: Coordinates of 17 key points
• Model: PoseNet
• G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson,
and K. Murphy, “PersonLab: Person Pose Estimation and
Instance Segmentation with a Bottom-Up, Part-Based,
Geometric Embedding Model”, 2018.

• Animated image taken from TensorFlow Lite documentation at https://www.tensorflow.org/lite/examples/pose_estimation/overview

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 66


Pixel-wise Image Classification (Segmentation)
• Task: Land cover classification in
satellite imagery
• Input: Satellite image
• Output: Land cover class per
individual pixel
• Model: Segnet, Unet
• Z. Han,Y. Dian, H. Xia, J. Zhou, Y. Jian, C. Yao, X. Wang,
and Y. Li, “Comparing Fully Deep Convolutional Neural
Networks for Land Cover Classification with High-Spatial-
Resolution Gaofen-2 Images”, 2020.

13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 67


Pixel-wise Image Regression
• Task: Relative depth
estimation from a single
image
• Input: Any image
• Output: Depth value per
individual pixel
• Model: MiDaS
• R. Ranftl, K. Lasinger, D. Hafner, K.
Schindler, and V. Koltun, “Towards Robust
Monocular Depth Estimation: Mixing
Datasets for Zero-shot Cross-dataset
Transfer”, 2019.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 68
Introduction to Machine Learning Based on the Use Case Energy Prediction
Table of contents:
1. Introduction

2. Overview and data loading

3. Visualization

4. Missing values and outliers

5. Splitting of the dataset

6. Linear Regression

7. Random Forest Regressor

8. Support Vector Regression

9. Comparison of the results

10. Deployment of the model

1. Introduction
Within this exercise we want to show the implementation of a supervised learning procedure with the necessary pre- and post-processing steps
using the use case of the energy prediction of a machining process.

Our goal is to perform a regression analysis using the data that we have to train different regression models to predict the target variable. In our
use case we want to predict the energy requirement to perform a milling process.

1.1 Motivation for energy prediction


1. Creation of transparency and implementation of energy planning
2. Adaptation and optimization of the process parameters according to the energy requirement
3. Possibility of load management
4. Detection of deviations due to the comparison of the prediction and the actual energy profile

Based on the planned process parameters, the energy required for the milling process is to be forecasted. As a basis for the development of a
regression model, tests were carried out on a milling machine to gain sufficient data for the training.

1.2 Structure of a milling machine


Using the Cartesian coordinate system, a machine can be controlled along each axis. Based on each axis, you typically get the following
movements from the perspective of an operator facing the machine:

X axis allows movement “left” and “right”


Y axis allows movement “forward” and “backward”
Z axis allows movement “up” and “down”
Based on these movements, the right tool and other process parameters (feed, etc.) we can perform the required milling process.

1.3 Deliverables
To complete this exercise successfully, you need to provide certain results. Throughout the notebook you will find questions you need to
answer, and coding tasks where you need to modify existing code or fill in blanks. The answers to the questions need to be added in the
prepared Your answer markdown fields. Coding tasks can be solved by modifying or inserting code in the cells below the task. If necessary, you
can add extra cells to the notebook, as long as the order of the existing cells remains unchanged. Once you are finished with the lab, you can
submit it through the procedure described in the forum. Once the labs are submitted, you will receive feedback regarding the questions. Thus,
the Feedback boxes need to be left empty.

Example:

Question: What do I do if I am stuck solving this lab?

Your answer: Have a look at the forum, maybe some of your peers already experienced similar issues. Otherwise start a new
discussion to get help!

Feedback: This is a great approach! Besides asking in the forum I'd also suggest asking your tutor.

Solution: The correct solution for the question. The solution will be provided after the review process.

1.4 Resources
If you are having issues while completing this lab, feel free to post your questions in the forum. Your peers as well as our teaching advisors will
screen the forum regularly and answer open questions. This way, the information is available for fellow students encountering the same issues.

Note: Here we also want to promote the work with online resources and thus the independent solution of problems. For some tasks you have to
"google" for the appropriate method yourself. For other tasks the already given code must be adapted or just copied.

2. Data loading and first overview


The given data is stored in a text file containing the following columns,

Axis
Feed [mm/min]
Path [mm]
Energy requirement - Target variable [kJ]

2.1 Loading the data

First we have to load the necessary libraries.


# importing the necessary libraries
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
import matplotlib.pyplot as plt
import seaborn as sns

# setting white grid background


sns.set_style('whitegrid')
#%matplotlib notebook
%matplotlib inline

The next step is to access the prepared data set.

There are different options to import a data set into Google Colab. You can either import/upload from Google Drive or from your own HDD. In
this Notebook the Google Drive folder is used.

For this purpose it is necessary to connect your Google Drive to this Notebook. Execute the following cell and follow the instructions.

from google.colab import drive


drive.mount('/content/drive')

Mounted at /content/drive

Following we can load the data.

# specification of the path to the input data (this path may vary for you depending on where you have your data file)
df = pd.read_csv(r'./drive/MyDrive/ML4Eng I/ML4Eng_I_Exercise_Pipeline_and_Regression/ML4Eng_I_dataset_energy_measurement.txt')
# if the link doesn't work, you'll need to adjust it depending on where you have stored the dataset in your Google Drive.

2.2 Overview of the data


After that we need an overview of the dataset.

# statistical analysis of the dataset


df.describe()

Axis Feed Path Energy_Requirement

count 224.000000 225.000000 225.000000 226.000000

mean 2.093750 1759.200000 1.644444 0.060063

std 1.447173 887.559998 43.726604 0.163246

min -5.000000 20.000000 -200.000000 -0.272149

25% 1.000000 1000.000000 -30.000000 0.012769

50% 2.000000 2000.000000 10.000000 0.038630

75% 3.000000 2500.000000 40.000000 0.063981

max 15.000000 3000.000000 150.000000 0.900000

Question: Discuss the results of the pandas describe() function.

Your Answer: TBD

Feedback: ...

Solution: The above code gives us a descriptive statistic of the pandas dataframe which summarizes the central tendency,
dispersion and shape of a dataset’s distribution, excluding null values. Now we try to learn more about the data that we have to
understand if we need to make any changes i.e. remove outliers, add missing values etc.

1. Count: The total number of entries in that particular column. Here we see that the different features have a different count
indicating some missing values in the dataset.
2. Mean: The arithmetic mean (or simply mean) of a list of numbers, is the sum of all of the numbers divided by the number of
numbers
3. Standard deviation: A measure of the dispersion or variation in a distribution or set of data
4. Minimum: The smallest value in that particular column. For the attribute "axis" a value of -5 is listed, indicating an outlier.
5. 25%: The values corresponding to 25% percentile of the dataset
6. 50%: The values corresponding to 50% percentile of the dataset. The 50 % percentile is the same as the median.
7. 75%: The values corresponding to 75% percentile of the dataset
8. Maximum: The largest value in that particular column. For the attribute "axis" a value of 15 is listed, indicating an outlier.
More information can be found here - https://pandas.pydata.org/pandas-
docs/stable/reference/api/pandas.DataFrame.describe.html

Task: Search and implement a (simple) method in order to show the first rows of the dataset. Try to use Google/ the
documentation to find an appropriate one.

Feedback: ...

# implementation of the method

#############################
# Please add your code here #
#############################

###
# Solution
###
df.head(5)
###
# END Solution
###

Axis Feed Path Energy_Requirement

4 1.0 500.0 10.0 0.009795

9 1.0 500.0 60.0 0.057808

10 1.0 1000.0 10.0 0.010401

11 1.0 1000.0 20.0 0.020560

12 1.0 1000.0 30.0 0.030982

3. Visualization
Following all attributes of the data set are plotted.

Question: Why is visualization of the input data important?

Your Answer: TBD

Feedback: ...

Solution:

1. Identify potential patterns in the data that can help us understand the data.
2. Clarify which factors influence our target variable the most.
3. Helps us fix the dataset in case there are outliers and missing values.
4. Help us to decide which models to use to successfully predict the target variable.

# plotting the target variable "Energy requirement"


%matplotlib inline
plt.figure(figsize=(10,7))
plt.hist(df.Energy_Requirement,bins=20, range = (df.Energy_Requirement.min(), df.Energy_Requirement.max()))
plt.xlabel('Energy requirement [kJ]')
plt.ylabel('Total quantity')
Text(0, 0.5, 'Total quantity')

# plotting the attribute "Path"


%matplotlib inline
plt.figure(figsize=(10,7))
plt.hist(df.Path,bins = 100, range = (df.Path.min(), df.Path.max()))
plt.xlabel('Path [mm]')
plt.ylabel('Total quantity')

Text(0, 0.5, 'Total quantity')

Task: Make some changes to the plot below. For example, you can adjust the number of bins, the axis labels or the color of the plot.

Feedback: ...

# plotting the attribute "Axis"


%matplotlib inline
plt.figure(figsize=(10,7))
plt.hist(df.Axis,bins =3, range = (df.Axis.min(), df.Axis.max()))
plt.xlabel('Axis')
plt.ylabel('Total quantity')
Text(0, 0.5, 'Total quantity')

Task: Visualize the last attribute "Feed" analogously to the previous ones. You can copy most of the code; nevertheless, you have to make some adjustments.

Feedback: ...

# plotting the attribue "Feed"

#############################
# Please add your code here #
#############################

###
# Solution
###
%matplotlib inline
plt.figure(figsize=(10,7))
plt.hist(df.Feed,bins =100, range = (df.Feed.min(), df.Feed.max()))
plt.xlabel('Feed [mm/min]')
plt.ylabel('Total quantity')
###
# END Solution
###

Text(0, 0.5, 'Total quantity')

4. Missing values and outliers


1. Missing values can be NaN (null values) or breaks in the dataset that do not seem reasonable.
2. Outliers are values in our dataset that stand out from rest of the values in our dataset, an outlier may lie in an abnormal distance from
other values in a distribution.

Missing values and outliers have to be detected and dealt with in order to prepare the data set for the following steps.

4.1 Handling missing values

# Before we deal with missing values we visualize the first 10 instances of the data set.
# Missing values can be recognized here as NaN.
df.head(10)
Axis Feed Path Energy_Requirement

0 -1.0 20.0 120.0 0.600000

1 -5.0 100.0 150.0 0.700000

2 10.0 150.0 130.0 0.800000

3 15.0 50.0 -200.0 0.900000

4 1.0 500.0 10.0 0.009795

5 6.0 NaN 20.0 0.019462

6 NaN 500.0 30.0 0.029309

7 -2.0 500.0 NaN 0.038570

8 NaN 500.0 50.0 0.048310

9 1.0 500.0 60.0 0.057808

Task: Use the dropna() function to remove all rows with missing values. Use the documentation if you need further information about this method.

Feedback: ...

# We drop all rows with missing values using the 'dropna function'

#############################
# Please add your code here #
#############################

###
# Solution
###
df = df.dropna(subset=['Axis', 'Feed', 'Path'])
###
# END Solution
###

# After the removal of missing values we visualize the first 10 instances of the data set again.
df.head(10)

Axis Feed Path Energy_Requirement

0 -1.0 20.0 120.0 0.600000

1 -5.0 100.0 150.0 0.700000

2 10.0 150.0 130.0 0.800000

3 15.0 50.0 -200.0 0.900000

4 1.0 500.0 10.0 0.009795

9 1.0 500.0 60.0 0.057808

10 1.0 1000.0 10.0 0.010401

11 1.0 1000.0 20.0 0.020560

12 1.0 1000.0 30.0 0.030982

13 1.0 1000.0 40.0 0.041358

We can see that the rows 5, 6, 7 and 8 have been dropped as they contained some missing values (NaN).

4.2 Handling outliers


Before we can deal with outliers, we have to identify them. There are different methods to detect outliers.

Since we have performed the tests for the independent variables (feed, axis and path) we know the range of these values.

1. Axis: 1 to 3
2. Path (distance): -60 to 60 [mm]
3. Feed rate: 500 to 3000 [mm/min]

All values outside these ranges are outliers, and the corresponding instances should therefore be deleted.
Question: Why is it important to remove outliers from our dataset?

Your Answer: TBD

Feedback: ...

Solution: Outliers can provide information to our regression model that is different from the information provided by the rest of
the dataset. By removing outliers the regression model will perform better as it only learns the essential information of the
dataset.

# Before removing the outliers we analyze our data set with the describe() method.
df.describe()

Axis Feed Path Energy_Requirement

count 222.000000 222.000000 222.000000 222.000000

mean 2.094595 1776.216216 1.216216 0.060535

std 1.402992 881.272788 43.841117 0.164672

min -5.000000 20.000000 -200.000000 -0.272149

25% 1.000000 1000.000000 -30.000000 0.012699

50% 2.000000 2000.000000 10.000000 0.039071

75% 3.000000 2500.000000 40.000000 0.065277

max 15.000000 3000.000000 150.000000 0.900000

Task: Complete the following code line to remove all outliers for the attribute "Feed".

Feedback: ...

# Feature values outside the ranges known to us are treated as outliers and removed.
# We only keep those rows whose feature values lie within the known ranges.
df = df.loc[(df.Axis >= 1) & (df.Axis <= 3) &
(df.Path >= -60) & (df.Path <= 60) &

#############################
# Please add your code here #
#############################

###
# Solution
###
(df.Feed >= 500) & (df.Feed <= 3000)
###
# END Solution
###
]

# After removing the outliers we analyze the data set again.


df.describe()

Axis Feed Path Energy_Requirement

count 218.000000 218.000000 218.000000 218.000000

mean 2.045872 1807.339450 0.321101 0.047884

std 0.818961 858.431366 39.072043 0.135879

min 1.000000 500.000000 -60.000000 -0.272149

25% 1.000000 1000.000000 -30.000000 0.012676

50% 2.000000 2000.000000 10.000000 0.038600

75% 3.000000 2500.000000 37.500000 0.061995

max 3.000000 3000.000000 60.000000 0.450567


Question: Please describe the changes within the data set.

Your Answer: TBD

Feedback: ...

Solution: Some of the changes that can be observed in the dataset are,

1. Count: We notice that all the features now have the same count or same number of entries meaning there are no missing
values.
2. Mean and standard deviation: We notice that mean and standard deviation values have changed in the dataset indicating
that outliers and missing value problems have been resolved.
3. Minimum and Maximum values: We notice that the minimum and maximum values for the features have changed and are
now within the defined range, again indicating that outliers have been removed.

5. Splitting of the data set


Before we continue with the training, we have to split our data set.

1. Training dataset: the training dataset is used to determine the models parameters based on the data it has seen. Here, the labels are
provided to the model so that it can learn about potential patterns in the data and thus adjust its parameters in such a way that it can
predict data points.

2. Test dataset: This dataset is used to test the performance of the model. It can be used to see if the model is able to perform well on data
which it has never seen before.

3. Target variable: Further we have to separate the target variable from the other attributes.

Splitting of the data can be done using Sklearn libraries.

More information on this can be found here - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

5.1 Separation of features and target variable

# We separate the features (axis, feed, path) and store them in 'X_multi'; the target variable (energy) is stored in 'Y_target'
X_multi = df.drop('Energy_Requirement', axis=1)

# energy requirement is our target variable


Y_target = df.Energy_Requirement


5.2 Splitting the data into training and test sets

# The given dataset is divided into training and test datasets.


import sklearn
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X_multi, Y_target, random_state=42)

# Checking the shapes of the datasets so that we don't wrongly fit the data
print(X_train.shape, Y_train.shape, X_test.shape, Y_test.shape)

(163, 3) (163,) (55, 3) (55,)

Question: What is the default setting of the train_test_split() method regarding the distribution of the training and test dataset? Use the documentation for this method! And what are the advantages/disadvantages of different distributions (big training dataset with small test dataset, or data sets of the same size)?

Your Answer: TBD


Feedback: ...

Solution:

1. Default setting: If neither test_size nor train_size is specified, train_test_split() uses test_size=0.25, i.e. 75% of the data is used to train the model and its performance is tested on the remaining 25%. In practice an 80/20 split is also very common (see the sketch below).
2. Big training dataset, small test dataset: By using a larger training set we ensure that the model captures the patterns in the data and hence it can perform well. But we do not have enough test data to properly assess the performance of the model.
3. Training and test dataset of the same size: In this case we risk that the model cannot learn the patterns in the data properly, as it is trained on only a part (half) of the dataset. Such a distribution is normally not used.
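As an illustration of how the split ratio can be set explicitly, here is a minimal sketch reusing the notebook's X_multi and Y_target; test_size=0.2 is just an example value and not part of the original solution:

from sklearn.model_selection import train_test_split

# explicit 80/20 split instead of the default test_size=0.25
X_train, X_test, Y_train, Y_test = train_test_split(
    X_multi, Y_target, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)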

6. Linear Regression
Now we can start to use machine learning algorithms to predict the required energy. For that we carry out the following steps:

1. We import a Linear Regression algorithm from the Sklearn library.


2. The fit() function of the Linear Regression model is used to train our model with the training dataset.
3. The predict() function is used to make predictions on a given dataset with the trained model.
4. Calculation of the losses to assess the performance.
5. Visualisation of the losses.

These steps are similar for the implementation of different regression algorithms.

6.1 Import the library

from sklearn.linear_model import LinearRegression

6.2 Training of the model

lreg = LinearRegression()

lreg.fit(X_train, Y_train)

LinearRegression()

6.3 Prediction

pred_train = lreg.predict(X_train) # prediction of the training data


pred_test = lreg.predict(X_test) # prediction of unseen data

6.4 Calculate the different losses (Mean Absolute Error, Mean Square Error)

Task: Print the calculated losses for the test data with the print() method and write a short explanatory text.

Feedback: ...

from sklearn.metrics import mean_squared_error


from sklearn.metrics import mean_absolute_error

# We calculate the error for the training and test datasets

# Training data
MSE_linear_Train_Data = mean_squared_error(Y_train, pred_train)
MAE_linear_Train_Data = mean_absolute_error(Y_train, pred_train)

print("The Mean Square Error on the training data is:", MSE_linear_Train_Data)


print("The Mean Absolute Error on the training data is:", MAE_linear_Train_Data)

# Test data / unseen data


MSE_linear_Test_Data = mean_squared_error(Y_test, pred_test)
MAE_linear_Test_Data = mean_absolute_error(Y_test, pred_test)
#############################
# Please add your code here #
#############################

###
# Solution
###
print("\n""The Mean Square Error on the test data is:", MSE_linear_Test_Data)
print("The Mean Absolute Error on the test data is:", MAE_linear_Test_Data)
###
# END Solution
###

The Mean Square Error on the training data is: 0.01167892485881179


The Mean Absolute Error on the training data is: 0.09121547964727232

The Mean Square Error on the test data is: 0.013044389500115314


The Mean Absolute Error on the test data is: 0.0905114994661013

Question: Discuss the used error metrics and the results!

Your Answer: TBD

Feedback: ...

Solution:

1. Mean Square Error (MSE): The Mean Square Error (MSE) loss function penalizes the model for making large errors by
squaring them. Squaring a large quantity makes it even larger, thus helping it to identify the errors but this same property
makes the MSE cost function sensitive to outliers.
2. Mean Absolute Error (MAE): The Mean Absolute Error (MAE) cost is less sensitive to outliers compared to MSE.
3. Results: MAE and MSE are usually higher for the test dataset than for the training dataset. However, in our case they are almost similar, which can be explained by the distribution of the data set. The error values alone are not yet sufficient to evaluate whether the model is good enough for the planned application: on the one hand, a target value must be determined by a domain expert; on the other hand, an analysis of the average percentage deviation would be helpful.

6.5 Residual plots for Linear Regression

1. A residual value is a measure of how much a regression line vertically misses a data point.
2. In a residual plot the residual values are on the vertical axis and the horizontal axis displays the independent variable.
3. Ideally, residual values should be equally and randomly distributed around the horizontal axis.

# We want the data points to be scattered around the horizontal line


%matplotlib inline
plt.figure(figsize=(10,7))
train = plt.scatter(pred_train, (pred_train-Y_train), c='b', alpha=0.5)
test = plt.scatter(pred_test, (pred_test-Y_test), c='r', alpha=0.5)
plt.hlines(y=0, xmin=-0.5, xmax=0.5)
plt.legend((train, test), ('Training', 'Test'),loc='lower left')
plt.title('Residual plot for Linear Regressor')
plt.xlabel("Energy_Requirement - Target variable")
plt.ylabel("Residual Value")
Text(0, 0.5, 'Residual Value')

7. Random Forest Regressor


This is another type of regression algorithm that averages the predictions of many decision trees to improve predictive accuracy and to control overfitting.

More on the parameters and information of this regressor can be found here -

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

7.1 Import the library

# Import Random Forest Regressor from Sklearn


from sklearn.ensemble import RandomForestRegressor
# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)

Meaning of some of the hyperparameters used above,

1. n_estimators : This is the number of trees you want to build before taking the maximum voting or averaging of predictions. A higher number of trees can increase the performance but makes your code slower.

2. random_state : This parameter makes a solution easy to replicate. A definite value of random_state will always produce same results if
given with same parameters and training data.

7.2 Training of the model

# Train the model on training data


rf.fit(X_train, Y_train)

RandomForestRegressor(n_estimators=1000, random_state=42)

7.3 Prediction

# Use the forest's predict method on the test data


rf_pred_train = rf.predict(X_train)
rf_pred_test = rf.predict(X_test)

7.4 Calculate the different losses (Mean Absolute Error, Mean Square Error)

Task: Complete the following code lines in order to calculate the results analogously to the Linear Regression!

Feedback: ...

from sklearn.metrics import mean_squared_error


from sklearn.metrics import mean_absolute_error

# We calculate the errors for the training and test datasets.


# We save the errors analogously to the Linear Regression ("MSE_rf_Test_Data", "MAE_rf_Test_Data").

# Training data
MSE_rf_Train_Data = mean_squared_error(Y_train, rf_pred_train)
MAE_rf_Train_Data = mean_absolute_error(Y_train, rf_pred_train)

print("Mean Square Error on the training data is:", MSE_rf_Train_Data)


print("Mean Absolute Error on the training data is:", MAE_rf_Train_Data)
# Test data / unseen data

#############################
# Please add your code here #
#############################

###
# Solution
###
MSE_rf_Test_Data = mean_squared_error(Y_test, rf_pred_test)
MAE_rf_Test_Data = mean_absolute_error(Y_test, rf_pred_test)
###
# END Solution
###

print("\n""Mean Square Error on the test data is:", MSE_rf_Test_Data)


print("Mean Absolute Error on the test data is:", MAE_rf_Test_Data)

Mean Square Error on the training data is: 6.31350011685033e-05


Mean Absolute Error on the training data is: 0.0026605765337424903

Mean Square Error on the test data is: 0.00017383678946808628


Mean Absolute Error on the test data is: 0.005907323690909195

7.5 Residual plots for Random Forest Regressor

Task: Add a title and the axis labels to the diagram.

Feedback: ...

# We want the data points to be scattered around the horizontal line


%matplotlib inline
plt.figure(figsize=(10,7))
train = plt.scatter(rf_pred_train, (rf_pred_train-Y_train), c='b', alpha=0.5)
test = plt.scatter(rf_pred_test, (rf_pred_test-Y_test), c='r', alpha=0.5)
plt.hlines(y=0, xmin=-0.5, xmax=0.5)
plt.legend((train, test), ('Training', 'Test'),loc='lower left')

#############################
# Please add your code here #
#############################

###
# Solution
###
plt.title('Residual plot for Random Forest Regressor')
plt.xlabel("Energy_Requirement - Target variable")
plt.ylabel("Residual value")
###
# END Solution
###
Text(0, 0.5, 'Residual value')

Here we see that the model performs well on both the training and the test dataset, as the blue and red points are fairly close to the horizontal line and are evenly distributed.

8. Support Vector Regression (SVR)


1. The SVR tries to find a function that best approximates the relationship between the features and the target variable, tolerating deviations within a small margin.
2. This type of regressor can be used for linear and non-linear regression problems i.e. the best fitting function can be non-linear as well. The
kernel functions transform the data into a higher dimensional feature space to make it possible to perform the linear separation.

More information on the parameters and kernels used in the SVR can be found here - https://scikit-
learn.org/stable/modules/generated/sklearn.svm.SVR.html

Task: Apply all the steps used for the Linear Regression model (6.1 to 6.5) to the given Support Vector Regressor. Use the "rbf" kernel. Hint: You just have to copy and partly adapt the existing code!

8.1 Import the library

Feedback: ...

# Import SVR from Sklearn

#############################
# Please add your code here #
#############################

###
# Solution
###
from sklearn.svm import SVR
###
# END Solution
###

8.2 Training of the model

Feedback: ...

# initialize the model and train it

#############################
# Please add your code here #
#############################

###
# Solution
###
svr_regressor = SVR(kernel='rbf')

svr_regressor.fit(X_train, Y_train)
###
# END Solution
###

SVR()

8.3 Prediction

Feedback: ...

# predict the values for the training and test dataset

#############################
# Please add your code here #
#############################
###
# Solution
###
svr_pred_train = svr_regressor.predict(X_train)
svr_pred_test = svr_regressor.predict(X_test)
###
# END Solution
###

8.4 Calculate the different losses (Mean Absolute Error, Mean Square Error)

Feedback: ...

# load the libraries and calculate the losses.


# save the errors according to the Linear Regression ("MSE_svr_Test_Data", "MAE_svr_Test_Data", etc.).

#############################
# Please add your code here #
#############################

###
# Solution
###
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

# We calculate the error for the training and test datasets

# Training data
MSE_svr_Train_Data = mean_squared_error(Y_train, svr_pred_train)
MAE_svr_Train_Data = mean_absolute_error(Y_train, svr_pred_train)

print("Mean Square Error on the training data is:", MSE_svr_Train_Data)


print("Mean Absolute Error on the training data is:", MAE_svr_Train_Data)

# Test data / unseen data


MSE_svr_Test_Data = mean_squared_error(Y_test, svr_pred_test)
MAE_svr_Test_Data = mean_absolute_error(Y_test, svr_pred_test)

print("\n""Mean Square Error on the test data is:", MSE_svr_Test_Data)


print("Mean Absolute Error on the test data is:", MAE_svr_Test_Data)

###
# END Solution
###

Mean Square Error on the training data is: 0.013595253971989574


Mean Absolute Error on the training data is: 0.08173011271299446

Mean Square Error on the test data is: 0.01713173248228542


Mean Absolute Error on the test data is: 0.0928492482136748

8.5 Residual plots for Support Vector Regressor

Feedback: ...

# visualize the residual plot

#############################
# Please add your code here #
#############################

###
# Solution
###
%matplotlib inline
plt.figure(figsize=(10,7))
train = plt.scatter(svr_pred_train, (svr_pred_train-Y_train), c='b', alpha=0.5)
test = plt.scatter(svr_pred_test, (svr_pred_test-Y_test), c='r', alpha=0.5)
plt.hlines(y=0, xmin=-0.5, xmax=0.5)
plt.legend((train, test), ('Training', 'Test'),loc='lower left')
plt.title('Residual plot for Support Vector Regressor')
plt.xlabel("Energy_Requirement - Target variable")
plt.ylabel("Residual value")
###
# END Solution
###

Text(0, 0.5, 'Residual value')

9. Comparison and results

# visualisation of the results


%matplotlib inline
plt.figure(figsize=(10,7))
plt.bar(['MSE_LR'],[MSE_linear_Test_Data], color=['#4DBEEE'], label="Mean Square Error on Linear Regressor")
plt.bar(['MSE_SVR'],[MSE_svr_Test_Data], color=['#A2142F'], label="Mean Square Error on SVR")
plt.bar(['MSE_RF'],[MSE_rf_Test_Data], color=['#0072BD'], label="Mean Square Error on Random Forest")

plt.bar(['MAE_LR'],[MAE_linear_Test_Data], color=['#D95319'], label="Mean Absolute Error on Linear Regressor")


plt.bar(['MAE_SVR'],[MAE_svr_Test_Data], color=['#EDB120'], label="Mean Absolute Error on SVR")
plt.bar(['MAE_RF'],[MAE_rf_Test_Data], color=['#77AC30'], label="Mean Absolute Error on Random Forest")

plt.xlabel('Error')
plt.ylabel('Error values')
plt.title('Performance of different regression models on the test data')
plt.legend(loc="upper left")
plt.show()

Question: Explain the results you obtained and choose the best model.
Your Answer: TBD

Feedback: ...

Solution: Here we see the performance of the different regression models on the test dataset. Some of the points that can be
observed are,

1. The Random Forest Regressor has almost zero error on the test dataset. This shows that the model has correctly learned to
predict future values.
2. By comparing our different models we can see that the Random Forest Regressor has the highest accuracy, followed by the Linear Regressor and the Support Vector Regressor.
3. The accuracy can also be assessed with residual plots; these plots show us how much the regression predictions miss the data points and help us get a better understanding of whether the model can be further improved.
4. It can be said that the Random Forest Regressor does a good job of learning the patterns of the training dataset and applying them to the test dataset to achieve a high accuracy. The remaining models do not achieve this accuracy as they may not be able to learn the patterns effectively enough.
Therefore we choose the Random Forest Regressor.

10. Deployment of the model


Now we want to predict the energy requirement for the following settings of production parameters with the best model.

1. Setting 1: axis = 2, feed = 800 [mm/min], distance = 60 [mm]


2. Setting 2: axis = 3, feed = 2000 [mm/min], distance = 40 [mm]
3. Setting 3: axis = 1, feed = 1200 [mm/min], distance = -20 [mm]

Task: Predict the energy requirement for the given production settings using your best model and the predict() method.

Feedback: ...

# Deployment of the best model for the production settings.

#############################
# Please add your code here #
#############################

###
# Solution
###
print("Predicted Energy Requirement for setting 1 is", rf.predict([[2, 800, 60]]),"kJ.")
print("Predicted Energy Requirement for setting 2 is", rf.predict([[3, 2000, 40]]),"kJ.")
print("Predicted Energy Requirement for setting 3 is", rf.predict([[1, 1200, -20]]),"kJ.")

# You can also save the data points in a variable first (avoid the name 'set', which shadows the Python built-in).
settings = [[2, 800, 60], [3, 2000, 40], [1, 1200, -20]]
print("Predicted Energy Requirement for all settings is", sum(rf.predict(settings)), "kJ.")

###
# END Solution
###

Predicted Energy Requirement for setting 1 is [0.05394835] kJ.
Predicted Energy Requirement for setting 2 is [0.28709159] kJ.
Predicted Energy Requirement for setting 3 is [0.02084422] kJ.
Predicted Energy Requirement for all settings is 0.3618841639999991 kJ.

 

Question: What is the difference between parameters and hyperparameters of a model? Name an example for both.
Your Answer: TBD

Feedback: ...

Solution: A model parameter is a configuration variable which is internal to the model and whose value can be estimated from data.
• They are the result of the training process.
• They are required by the model when making predictions.
• Example Random Forest: the threshold value chosen at each internal node of the decision tree(s).

A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated from data.
• They define the training process.
• They can often be set using heuristics / they are often tuned for a given predictive modeling problem.
• Example Random Forest: the number of decision trees.
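As a small illustration of the difference (a sketch only, using scikit-learn's RandomForestRegressor on made-up toy data):

from sklearn.ensemble import RandomForestRegressor

# Hyperparameter: chosen before training, it controls the training process.
rf_demo = RandomForestRegressor(n_estimators=10, random_state=0)

# Purely illustrative toy data (axis, feed, distance -> energy).
X_toy = [[2, 800, 60], [3, 2000, 40], [1, 1200, -20], [2, 1500, 30]]
y_toy = [0.05, 0.29, 0.02, 0.12]
rf_demo.fit(X_toy, y_toy)

# Parameters: learned from the data during fit(), e.g. the split thresholds
# inside the first decision tree of the forest.
print(rf_demo.estimators_[0].tree_.threshold)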
Exercise 3

Introduction
In the production of electrical drives, a high product quality is needed. As the industry of electric drive production is confronted by trends such
as electric mobility and continuing industrial automation, efficient and flexible processes are needed more than ever. With current quality
monitoring technology, accurate quality checking is not feasible.

Electrical motors mainly consist of the rotor, the stator and the surrounding housing. The production process can be separated into multiple
sub-processes, which can be seen below. The exact sequence of these steps however depends on the motor type. First, the individual
components are manufactured and assembled into subassemblies such as the rotor and the stator. Finally, all components (the housing, the
stator, the rotor as well as bearings and end shields) are assembled and the motor is checked in an end-of-line (EOL) test.

This final assembly is of great importance, as all parts need to be assembled in the correct way, to ensure smooth operation. Therefore, a
quality monitoring system is needed, raising alarm if assembly errors are detected. However, especially in lot-size one production, traditional
computer vision systems might reach their limits and cannot be used anymore.

Thus, in this lab we will build a smart quality monitoring system for the electric drives production. An already existing visual sensor captures images of the electric motor after assembly. These images show the part from the top as well as from the side perspective. The target is now to decide whether the motor is fully assembled or whether one of multiple defects is present. There is data from three different defects available: missing cover, missing screw and not screwed. Examples of these defects can be seen below. To achieve this, we will investigate two different machine learning models: Support Vector Machines (SVM) and Convolutional Neural Networks (CNN).

Further background information can be found in this paper: Mayr et al., Machine Learning in Electric Motor Production - Potentials, Challenges
and Exemplary Applications


Outline
This lab is structured into two main parts. In the first part, a subset of the problem will be analyzed step-by-step. Here, only images from the top view are used and only two of the three defects, missing cover and missing screw, are considered. Your task will be to follow along, fill in missing gaps, and answer questions throughout the notebook.

In the second part, you are tasked to expand the quality monitoring system to also detect the defect not screwed. Therefore, it might be helpful
to also consider images showing the parts in their side perspective. For this part, you are free to choose any of the tools and methods
introduced in the first part, and you can expand as you wish!

Deliverables
For completing this exercise successfully, you need to deliver certain results. Throughout the notebook you will find questions you need to
answer, and coding tasks where you need to modify existing code or fill in blanks. Answers to the questions need to be added in the prepared
Your answer here markdown fields. Coding tasks can be solved by modifying or inserting code in the cells below the task. If needed, you can
add extra cells to the notebook, as long as the order of the existing cells remains unchanged. Once you are finished with the lab, you can submit
it through the procedure described in the forum. Once the labs are submitted, you will receive feedback regarding the questions. Thus, the
Feedback boxes need to be left empty.

Example:

Question: What do I do if I am stuck solving this lab?

Your answer: Have a look at the forum, maybe some of your peers already experienced similar issues. Otherwise start a new discussion to get help!

Feedback: This is a great approach! Besides asking in the forum I'd also suggest asking your tutor.

Resources
If you are having issues while completing this lab, feel free to post your questions into the forum. Your peers as well as our teaching advisors
will screen the forum regularly and answer open questions. This way, the information is available for fellow students encountering the same
issues.

# Mount your GDrive - Authentication required - Please follow the instructions


from google.colab import drive
from google.colab.patches import cv2_imshow
import sys, os
drive.mount('/content/gdrive')

# Change the current working directory

# Depending on your directory, you may want to adjust this


dir_path = '/content/gdrive/My Drive/ML4Ing_I/Lab_Image_classification'
sys.path.append(dir_path)
os.chdir(dir_path)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).

# As in the previous exercises, we'll import commonly used libraries right at the start
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import cv2
import random
import tensorflow as tf
# The check.py script contains the quality gates you can use for selftesting throughout the lab
from scripts.check import *

1 Part one
To achieve the solution mentioned above, we will execute the following steps in this lab:

1. First, we will code the necessary functions for loading and preprocessing of the data. We will also set up some methods that help us
displaying our progress throughout the exercise
2. Second, we will do a short analysis of the existing dataset
3. Afterwards, we will start building our first image classification model using SVMs
4. Once we are familiar and comfortable with SVMs, we will switch to neural networks and try out CNNs
5. Finally, we will introduce data augmentation for improvement of our prediction results.

Section 1.1: Data preprocessing


The data should be located in the folder called data on the same level as this script. Within this folder, two subfolders can be found:

The folder top contains the top view of each motor


The folder side contains the side view of each motor

Each motor is uniquely identified by its filename.

# Loading one image in top view


path = "./data/top/L1_C_3.JPG"
img = cv2.imread(path)
img.shape

(1024, 1024, 3)

With the snippet above, we are able to load the image from the file into a numpy array, while getting its label from the folder path the image is in.
To check the type of a python object, you can use the command type(img) . It should return numpy.ndarray.

type(img)

numpy.ndarray

Next, we want to plot the image. This can be achieved by executing the following cell.

plt.title(path.split('/')[-1]) # Set the filename as image title


plt.imshow(img) # Display the image
plt.show()
By default, OpenCV loads images with the color channels in the order blue, green and red (BGR). However, matplotlib expects the channels in the order red, green and blue (RGB). Thus, the channels need to be converted using cv2.COLOR_BGR2RGB.

img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) # Convert image from bgr to rgb


plt.title(path.split('/')[-1]) # Set the filename as image title
plt.imshow(img) # Display the image
plt.show()

Function for loading multiple images

Now it's your turn. For the further analysis, we need to load all the available images from the given data folder. Besides the image, we also need to determine the class of the respective image. The information about the class is encoded in the filename of each image. You can use the helper function get_label_from_name(path) to parse the filename to the class.

Task: Please complete the following function load_features_labels(folder). The function should read the image for a given file,
and return two lists:

features containing all the images as numpy arrays


labels containing the classes of all images

import glob

def get_label_from_name(path):
if "_C_" in path:
return "Complete"
if "_MC_" in path:
return "Missing cover"
if "_MS_" in path:
return "Missing screw"
if "_NS_" in path:
return "Not screwed"
return "n/a" # TODO: Raise error

def load_features_labels(folder, size = (64,32), flatten = True, color = False, identifiers=['NS', 'MS', 'MC', 'C']):
features, labels = [], [] # Empty lists for storing the features and labels
# Iterate over all imagefiles in the given folder
for file in glob.glob(folder + "/*.JPG"):
if any(identifier in file for identifier in identifiers):
#############################
# Please add your code here #
#############################

###
# Solution
###

img = cv2.imread(file)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

features.append(img)
labels.append(get_label_from_name(file))

###
# END Solution
###

return features, labels # Return results

If everything works as expected, the function should load 117 features and labels. The execution may take a while.

features, labels = load_features_labels("./data/top")


print("Number of features:", len(features))
print("Number of labels:", len(labels))

# Check data import


quality_gate_111(features, labels)

Number of features: 117


Number of labels: 117
'Quality gate passed :)'

Image preprocessing
Before analyzing the images using machine learning, they need to be preprocessed. We will do preprocessing regarding three aspects:

Image size: As the raw images are available in rather high resolution, it might be beneficial to reduce the image resolution. OpenCV provides the function resize() which works great for that purpose.
Image color: In many use cases, the benefit of considering color information might not outweigh the increased complexity, thus it might be handy to convert the RGB image to grayscale. This can easily be done using the cvtColor function from OpenCV.
Image shape: Only some algorithms are capable of analyzing the 2.5D structure of image data. For the remaining algorithms, which expect the data to be a 1D vector, the image data needs to be flattened from 2.5D to 1D. This can be done using the numpy reshape functionality.

def image_preprocessing(img, size = (64,32), flatten = True, color = False):


img = cv2.resize(img, size)
if not color:
img = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
if flatten:
img = img.reshape(-1)
return img

Task: Please update your load_features_labels(...) function from above to do image-wise data preprocessing using the function image_preprocessing(...) . Note that the images shall have a size of 8x8 pixels and be flattened subsequent to the resizing. Therefore, be mindful of argument passing between the two functions!

def load_features_labels(folder, size = (64,32), flatten = True, color = False, identifiers=['NS', 'MS', 'MC', 'C']):
features, labels = [], [] # Empty lists for storing the features and labels
# Iterate over all imagefiles in the given folder
for file in glob.glob(folder + "/*.JPG"):
if any(identifier in file for identifier in identifiers):
#############################
# Please add your code here #
#############################

###
# Solution
###

img = cv2.imread(file)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img = image_preprocessing(img, size=size, flatten=flatten, color=color)

features.append(img)
labels.append(get_label_from_name(file))
###
# END Solution
###

return features, labels # Return results

################
# Quality gate #
################

features, labels = load_features_labels("./data/top", size=(8, 8), flatten=True, color=False)


quality_gate_112(features, labels)

'Quality gate passed :)'

Section 1.2: First data analysis


Before diving into machine learning, we'll have a look at the data. With the snippet below you can visualize a sample of the image data available in this lab. It can be observed that the class missing cover is rather distinct from the remaining classes, as the large black plastic cover is missing, exposing the copper wires. The defect missing screw is definitely harder to spot, as the screws are rather small objects and the color difference between the screw and the empty hole is rather subtle. Finally, the defect not screwed can only be seen because some of the screws are not in the shade of the respective hole, thus indicating they are not screwed in all the way.

from mpl_toolkits.axes_grid1 import ImageGrid

fig = plt.figure(figsize=(16, 12))


grid = ImageGrid(fig, 111, nrows_ncols=(3, 4), axes_pad=(0.1, 0.3))

features, labels = load_features_labels("./data/top", size=(1024, 1024), flatten=False, color=True)


classes = ['Complete', 'Missing cover', 'Missing screw', 'Not screwed']
for i, ax in enumerate(grid):
selectedClass = classes[i%4] # Select class
images = np.array(features)[np.array(labels)==selectedClass] # Preselect images based on class
image = images[i//4] # Select image
ax.imshow(image) # Plot image
ax.set_title(selectedClass) # Assign class as image title
plt.show()
First, let's investigate the distribution of the available images among the classes.

Task: Please create a plot showing the distribution of the different classes and discuss the distribution in the field below.

from collections import Counter


print(Counter(labels))

Counter({'Not screwed': 47, 'Missing screw': 42, 'Missing cover': 22, 'Complete': 6})

#############################
# Please add your code here #
#############################

###
# Solution
###

label_names = ["Complete", "Missing cover", "Missing screw", "Not screwed"]


values = [Counter(labels)[l] for l in label_names]

plt.pie(values, labels=label_names, startangle=90)


plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

###
# END Solution
###

Question: Please discuss the class distribution. Which issues and challenges might appear during model training?

Your answer: TBD

Feedback: ...

Section 1.3: Image classification using Support Vector Machines


In this section, we'll use Support Vector Machines (SVM) to try classifying the image dataset. For SVMs, it is necessary to have the data
formatted as 1D vector. Also, as mentioned in the description we are only going to consider the three classes complete, missing cover, and
missing screws.

features, labels = load_features_labels("./data/top", size=(16,16), color=True, flatten=True, identifiers=['MC', 'MS', 'C'])


features = np.asarray(features)
labels = np.asarray(labels)
print("Shape feature vector:", features.shape)
print("Shape label vector:", labels.shape)

Shape feature vector: (70, 768)


Shape label vector: (70,)

As we can see, we now load only the 70 images that belong to the three considered classes, and the pixel values of each image are simply reshaped to a 1D vector of length 16 · 16 · 3 = 768. Next, we need to separate our data into training and testing datasets. This can be achieved using the train_test_split() function from sklearn. You can find the documentation here: Link.

Task: Fill in the following code so that 70% of the data is used for training, and the remaining 30% for testing. Also, the datasets
should be stratified by the label vector.

from sklearn.model_selection import train_test_split


######################################
# Please complete the following line #
######################################
#X_train, X_test, y_train, y_test = train_test_split("""Your code goes here""", random_state=42)
###
# Solution
###
X_train, X_test, y_train, y_test = train_test_split(features, labels, train_size=0.7, stratify=labels, random_state=42)

################
# Quality gate #
################

quality_gate_13(X_train, X_test)

'Quality gate passed :)'

from sklearn.svm import SVC


from sklearn.utils.class_weight import compute_sample_weight
clf = SVC(kernel="rbf", gamma=0.01, C=0.0003) # Initialize the SVM
clf.fit(X_train, y_train, sample_weight=compute_sample_weight('balanced', y_train)) # Train the SVM
print("Score:", clf.score(X_test, y_test)) # Test the model

Score: 0.6190476190476191

import seaborn as sns


from sklearn.metrics import confusion_matrix, classification_report
cm = confusion_matrix(y_test, clf.predict(X_test))
ax=sns.heatmap(cm, annot=True)
plt.xlabel('Predicted')
plt.ylabel('Truth')
print(classification_report(y_test, clf.predict(X_test)))

precision recall f1-score support

Complete 0.00 0.00 0.00 2


Missing cover 0.00 0.00 0.00 6
Missing screw 0.62 1.00 0.76 13

accuracy 0.62 21
macro avg 0.21 0.33 0.25 21
weighted avg 0.38 0.62 0.47 21

/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
(The warning is repeated three times.)

Section 1.4: Image classification with artificial neural networks


In this section, we will train our first artificial neural network (ANN) for image classification. First, we will have a look at normal ANNs. These
consist of multiple dense layers which can analyze one-dimensional feature vectors. Thus, we need to reshape our 2.5-dimensional image data
to 1D using the flatten option we integrated into our preprocessing function.

a) Image classification using fully connected ANNs


Again, we need to load the data using the flatten=True flag to convert the 2.5D data to 1D.
features, labels = load_features_labels("./data/top", size=(128,128), color=True, flatten=True, identifiers=['MC', 'MS', 'C'])
features = np.asarray(features)
labels = np.asarray(labels)
print("Shape feature vector:", features.shape)
print("Shape label vector:", labels.shape)

Shape feature vector: (70, 49152)


Shape label vector: (70,)

Task: Fill in the following code so that 70% of the data is used for training, and the remaining 30% for testing. Also the datasets
should be stratified by the label vector.

######################################
# Please complete the following line #
######################################
#X_train, X_test, y_train, y_test = train_test_split("""Your code goes here""", random_state=42)

###
# Solution
###
X_train, X_test, y_train, y_test = train_test_split(features, labels, train_size=0.7, stratify=labels, random_state=42)

The labels need to be one hot encoded. In one hot encoding, categorical values are transformed into a binary representation.

OneHotEncoding

# The sklearn preprocessing library contains a variety of useful data preprocessing tools such as one hot encoding
from sklearn.preprocessing import OneHotEncoder
# Display the first label before encoding
print("Label of first sample before OneHot encoding:", y_train[0])
# Create the encoder object
enc = OneHotEncoder(sparse=False) # Generate Encoder
# With the fit_transform function, the encoder is fitted to the existing labels and transforms the dataset into its binary representation
y_train = enc.fit_transform(y_train.reshape(-1, 1))
# Display the first label after encoding
print("Label of first sample after OneHot encoding:", y_train[0])
# Data preprocessing should always be fitted on the training dataset, but applied to both the training and the testing dataset. Thus, the fitted encoder is only applied to the test data via transform() below.
y_test = enc.transform(y_test.reshape(-1, 1))

Label of first sample before OneHot encoding: Missing cover


Label of first sample after OneHot encoding: [0. 1. 0.]

Now, let's define a simple ANN with an input layer, two hidden layers and one output layer. In this lab we use the keras library to model the neural network.

A simple ANN with multiple sequential layers can be created using the Sequential() model. Afterwards, various layers can be added to the
model through the command model.add(LAYER) with LAYER defining the layer to be added. In the first layer, the shape of the input needs to be
specified using the parameter input_shape . This is only necessary in the first, but not in consecutive layers.

Please have a look at the keras documentation regarding the sequential model and the various layers. For now, especially the core layers Dense
and Activation are of interest.

from keras.models import Sequential


from keras.layers import Dense, Activation, Input, Dropout

model = Sequential()
model.add(Dense(32, input_shape = X_train[0].shape))
model.add(Activation("relu"))
model.add(Dense(16))
model.add(Activation("relu"))
model.add(Dense(y_train[0].shape[0]))
model.add(Activation("softmax"))

print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 32) 1572896
activation (Activation) (None, 32) 0

dense_1 (Dense) (None, 16) 528

activation_1 (Activation) (None, 16) 0

dense_2 (Dense) (None, 3) 51

activation_2 (Activation) (None, 3) 0

=================================================================
Total params: 1,573,475
Trainable params: 1,573,475
Non-trainable params: 0
_________________________________________________________________
None

Once the model is created, model.summary() displays the architecture of the model. You can see that the created model consists of three dense layers, each with an activation function. Also, the parameters of each layer are visible. Depending on the selected image size during preprocessing, the input vector might be rather large, hence the high number of parameters in the first dense layer.
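As a quick sanity check of that statement: the 128x128 RGB images are flattened to 128 · 128 · 3 = 49,152 inputs, so the first Dense layer with 32 units holds 49,152 · 32 weights + 32 biases = 1,572,896 parameters, which matches the summary above.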

Next, the model needs to be compiled using a loss function and an optimizer . The loss function defines how the loss is computed during
model training, while the optimizer defines how the weights need to be adjusted during backpropagation. You can find more information
regarding the available losses here and regarding the optimizers here.

model.compile(loss = 'categorical_crossentropy', optimizer = 'Adam', metrics = ['accuracy'])

Now, the model can be trained using the datasets defined before.

model.fit(X_train, y_train, epochs = 20, batch_size = 8, validation_split=0.2, verbose = 1)

Epoch 1/20
5/5 [==============================] - 1s 72ms/step - loss: 1412.3744 - accuracy: 0.2564 - val_loss: 646.9340 - val_accuracy: 0.6000
Epoch 2/20
5/5 [==============================] - 0s 17ms/step - loss: 333.3408 - accuracy: 0.5385 - val_loss: 44.7513 - val_accuracy: 0.6000
Epoch 3/20
5/5 [==============================] - 0s 17ms/step - loss: 130.8871 - accuracy: 0.5897 - val_loss: 8.7992 - val_accuracy: 0.7000
Epoch 4/20
5/5 [==============================] - 0s 18ms/step - loss: 46.9059 - accuracy: 0.6154 - val_loss: 107.1110 - val_accuracy: 0.4000
Epoch 5/20
5/5 [==============================] - 0s 16ms/step - loss: 53.4573 - accuracy: 0.6154 - val_loss: 76.1191 - val_accuracy: 0.4000
Epoch 6/20
5/5 [==============================] - 0s 22ms/step - loss: 37.7211 - accuracy: 0.5897 - val_loss: 42.5151 - val_accuracy: 0.6000
Epoch 7/20
5/5 [==============================] - 0s 16ms/step - loss: 15.9705 - accuracy: 0.7692 - val_loss: 12.5873 - val_accuracy: 0.7000
Epoch 8/20
5/5 [==============================] - 0s 17ms/step - loss: 7.3708 - accuracy: 0.8462 - val_loss: 24.3295 - val_accuracy: 0.7000
Epoch 9/20
5/5 [==============================] - 0s 17ms/step - loss: 7.7038 - accuracy: 0.9231 - val_loss: 63.5386 - val_accuracy: 0.5000
Epoch 10/20
5/5 [==============================] - 0s 18ms/step - loss: 2.9461 - accuracy: 0.9487 - val_loss: 55.2533 - val_accuracy: 0.6000
Epoch 11/20
5/5 [==============================] - 0s 17ms/step - loss: 1.6057 - accuracy: 0.9487 - val_loss: 14.6102 - val_accuracy: 0.8000
Epoch 12/20
5/5 [==============================] - 0s 17ms/step - loss: 0.0000e+00 - accuracy: 1.0000 - val_loss: 44.4014 - val_accuracy: 0.6000
Epoch 13/20
5/5 [==============================] - 0s 19ms/step - loss: 2.6602 - accuracy: 0.9231 - val_loss: 24.0462 - val_accuracy: 0.7000
Epoch 14/20
5/5 [==============================] - 0s 16ms/step - loss: 0.5815 - accuracy: 0.9744 - val_loss: 68.7479 - val_accuracy: 0.6000
Epoch 15/20
5/5 [==============================] - 0s 17ms/step - loss: 0.3020 - accuracy: 0.9744 - val_loss: 19.2934 - val_accuracy: 0.9000
Epoch 16/20
5/5 [==============================] - 0s 16ms/step - loss: 0.0000e+00 - accuracy: 1.0000 - val_loss: 18.7606 - val_accuracy: 0.7000
Epoch 17/20
5/5 [==============================] - 0s 17ms/step - loss: 0.0262 - accuracy: 0.9744 - val_loss: 23.5947 - val_accuracy: 0.8000
Epoch 18/20
5/5 [==============================] - 0s 16ms/step - loss: 0.0000e+00 - accuracy: 1.0000 - val_loss: 25.5570 - val_accuracy: 0.8000
Epoch 19/20
5/5 [==============================] - 0s 17ms/step - loss: 8.4530e-05 - accuracy: 1.0000 - val_loss: 26.7151 - val_accuracy: 0.8000
Epoch 20/20
5/5 [==============================] - 0s 19ms/step - loss: 0.3409 - accuracy: 0.9744 - val_loss: 23.9743 - val_accuracy: 0.8000
<keras.callbacks.History at 0x7fcbd09740d0>

You can use the following function to evaluate your model.


def evaluate_model(X_test, y_test, model):
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report

y_pred = model.predict(X_test)
res = np.zeros_like(y_pred)
for i in range(len(np.argmax(y_pred, axis=1))):
res[i, np.argmax(y_pred,axis=1)[i]]=1
y_pred = res
cm = confusion_matrix(enc.inverse_transform(y_test), enc.inverse_transform(y_pred))
ax=sns.heatmap(cm, annot=True)
plt.xlabel('Predicted')
plt.ylabel('Truth')
print(classification_report(enc.inverse_transform(y_test), enc.inverse_transform(y_pred), zero_division=0))

evaluate_model(X_test, y_test, model)

precision recall f1-score support

Complete 0.00 0.00 0.00 2


Missing cover 0.50 0.50 0.50 6
Missing screw 0.67 0.77 0.71 13

accuracy 0.62 21
macro avg 0.39 0.42 0.40 21
weighted avg 0.56 0.62 0.59 21

Question: What behavior did you observe while training the model? How can the results be explained?

Your answer: TBD

Feedback: ...

b) Image classification using CNNs


In this section, we are going to explore the usage of CNNs for the given task.

Architecture CNN

First, the data is loaded from file. As CNNs are capable of analyzing the multi-dimensional structure of images, and even excel at it, the images do not need to be reshaped into a one-dimensional vector. Thus, we have to set the flag flatten to False . You can see that the shape of the loaded images is now a four-dimensional array with (number of samples, width image, height image, color channels image) .

features, labels = load_features_labels("./data/top", size=(512,512), color=True, flatten=False, identifiers=['MC', 'MS', 'C'])


features = np.array(features) # Datatype conversion of feature vector from list to array
labels = np.array(labels) # Datatype conversion of label vector from list to array
print("Shape feature vector:", features.shape)
print("Shape label vector:", labels.shape)

Shape feature vector: (70, 512, 512, 3)


Shape label vector: (70,)

Task: Fill in the following code so that 70% of the data is used for training, and the remaining 30% for testing. Also, the datasets
should be stratified by the label vector. Furthermore, add OneHot Encoding for the labels as seen before.
#######################################
# Please complete the following lines #
#######################################

# def split_data(features, labels):


# return train_test_split("""Your code goes here""")

# def encode_data(y_train, y_test):


# """Your code goes here"""
# return y_train, y_test

###
# Solution
###

def split_data(features, labels):


return train_test_split(features, labels, train_size=0.7, stratify=labels, random_state=42)

def encode_labels(y_train, y_test, returnEncoder=False):


enc = OneHotEncoder(sparse=False) # Generate Encoder
y_train = enc.fit_transform(y_train.reshape(-1, 1)) # Fit and transform training data
y_test = enc.transform(y_test.reshape(-1, 1)) # Transform testing data
if returnEncoder:
return y_train, y_test, enc
else:
return y_train, y_test

################
# Quality gate #
################

X_train, X_test, y_train, y_test = split_data(features, labels)


y_train, y_test = encode_labels(y_train, y_test)
print("Label of first sample after OneHot encoding:", y_train[0])
quality_gate_141(y_train, y_test)

Label of first sample after OneHot encoding: [0. 1. 0.]


'Quality gate passed :)'

from keras.models import Sequential


from keras.layers import Conv2D, Dense, Flatten, Dropout, MaxPooling2D, GlobalMaxPooling2D

model = Sequential()
model.add(Conv2D(8, 5, input_shape = X_train[0].shape, activation = 'relu', padding="same"))
model.add(MaxPooling2D())
model.add(Conv2D(16, 3, activation = 'relu', padding="same"))
model.add(MaxPooling2D())
model.add(Conv2D(32, 3, activation = 'relu', padding="same"))
model.add(GlobalMaxPooling2D())
model.add(Dense(32, activation = 'relu'))
model.add(Dense(y_train[0].shape[0], activation = 'softmax'))

print(model.summary())

Model: "sequential_4"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_9 (Conv2D) (None, 512, 512, 8) 608

max_pooling2d_6 (MaxPooling (None, 256, 256, 8) 0


2D)

conv2d_10 (Conv2D) (None, 256, 256, 16) 1168

max_pooling2d_7 (MaxPooling (None, 128, 128, 16) 0


2D)

conv2d_11 (Conv2D) (None, 128, 128, 32) 4640

global_max_pooling2d_3 (Glo (None, 32) 0


balMaxPooling2D)

dense_9 (Dense) (None, 32) 1056


dense_10 (Dense) (None, 3) 99

=================================================================
Total params: 7,571
Trainable params: 7,571
Non-trainable params: 0
_________________________________________________________________
None

from tensorflow.keras.optimizers import Adam


optimizer=Adam(learning_rate=0.001)
model.compile(loss = 'categorical_crossentropy', optimizer = optimizer, metrics = ['accuracy'])

model.fit(np.array(X_train), np.array(y_train), epochs = 75, batch_size = 32, validation_split=0.1,


verbose = 1, sample_weight=compute_sample_weight('balanced', y_train))

2/2 [==============================] - 9s 3s/step - loss: 0.0334 - accuracy: 1.0000 - val_loss: 0.2650 - val_accuracy: 0.8000
Epoch 48/75 
2/2 [==============================] - 9s 3s/step - loss: 0.0313 - accuracy: 1.0000 - val_loss: 0.2527 - val_accuracy: 0.8000
Epoch 49/75
2/2 [==============================] - 9s 3s/step - loss: 0.0293 - accuracy: 1.0000 - val_loss: 0.2477 - val_accuracy: 0.8000
Epoch 50/75
2/2 [==============================] - 9s 3s/step - loss: 0.0281 - accuracy: 1.0000 - val_loss: 0.2456 - val_accuracy: 0.8000
Epoch 51/75
2/2 [==============================] - 9s 3s/step - loss: 0.0270 - accuracy: 1.0000 - val_loss: 0.2420 - val_accuracy: 0.8000
Epoch 52/75
2/2 [==============================] - 9s 3s/step - loss: 0.0263 - accuracy: 1.0000 - val_loss: 0.2421 - val_accuracy: 0.8000
Epoch 53/75
2/2 [==============================] - 9s 3s/step - loss: 0.0253 - accuracy: 1.0000 - val_loss: 0.2439 - val_accuracy: 0.8000
Epoch 54/75
2/2 [==============================] - 9s 3s/step - loss: 0.0242 - accuracy: 1.0000 - val_loss: 0.2491 - val_accuracy: 0.8000
Epoch 55/75
2/2 [==============================] - 9s 3s/step - loss: 0.0233 - accuracy: 1.0000 - val_loss: 0.2520 - val_accuracy: 0.8000
Epoch 56/75
2/2 [==============================] - 9s 3s/step - loss: 0.0229 - accuracy: 1.0000 - val_loss: 0.2481 - val_accuracy: 0.8000
Epoch 57/75
2/2 [==============================] - 9s 3s/step - loss: 0.0219 - accuracy: 1.0000 - val_loss: 0.2353 - val_accuracy: 0.8000
Epoch 58/75
2/2 [==============================] - 9s 3s/step - loss: 0.0216 - accuracy: 1.0000 - val_loss: 0.2275 - val_accuracy: 0.8000
Epoch 59/75
2/2 [==============================] - 9s 3s/step - loss: 0.0210 - accuracy: 1.0000 - val_loss: 0.2330 - val_accuracy: 0.8000
Epoch 60/75
2/2 [==============================] - 9s 3s/step - loss: 0.0200 - accuracy: 1.0000 - val_loss: 0.2492 - val_accuracy: 0.8000
Epoch 61/75
2/2 [==============================] - 9s 3s/step - loss: 0.0195 - accuracy: 1.0000 - val_loss: 0.2695 - val_accuracy: 0.8000
Epoch 62/75
2/2 [==============================] - 9s 3s/step - loss: 0.0192 - accuracy: 1.0000 - val_loss: 0.2789 - val_accuracy: 0.8000
Epoch 63/75
2/2 [==============================] - 9s 3s/step - loss: 0.0188 - accuracy: 1.0000 - val_loss: 0.2761 - val_accuracy: 0.8000
Epoch 64/75
2/2 [==============================] - 9s 3s/step - loss: 0.0177 - accuracy: 1.0000 - val_loss: 0.2606 - val_accuracy: 0.8000
Epoch 65/75
2/2 [==============================] - 9s 3s/step - loss: 0.0175 - accuracy: 1.0000 - val_loss: 0.2515 - val_accuracy: 0.8000
Epoch 66/75
2/2 [==============================] - 9s 3s/step - loss: 0.0172 - accuracy: 1.0000 - val_loss: 0.2552 - val_accuracy: 0.8000
Epoch 67/75
2/2 [==============================] - 9s 3s/step - loss: 0.0167 - accuracy: 1.0000 - val_loss: 0.2686 - val_accuracy: 0.8000
Epoch 68/75
2/2 [==============================] - 9s 3s/step - loss: 0.0158 - accuracy: 1.0000 - val_loss: 0.2855 - val_accuracy: 0.8000
Epoch 69/75
2/2 [==============================] - 9s 3s/step - loss: 0.0154 - accuracy: 1.0000 - val_loss: 0.3056 - val_accuracy: 0.8000
Epoch 70/75
2/2 [==============================] - 9s 3s/step - loss: 0.0159 - accuracy: 1.0000 - val_loss: 0.3183 - val_accuracy: 0.8000
Epoch 71/75
2/2 [==============================] - 9s 3s/step - loss: 0.0151 - accuracy: 1.0000 - val_loss: 0.3047 - val_accuracy: 0.8000
Epoch 72/75
2/2 [==============================] - 9s 3s/step - loss: 0.0144 - accuracy: 1.0000 - val_loss: 0.2869 - val_accuracy: 0.8000
Epoch 73/75
2/2 [==============================] - 9s 3s/step - loss: 0.0139 - accuracy: 1.0000 - val_loss: 0.2730 - val_accuracy: 0.8000
Epoch 74/75
2/2 [==============================] - 9s 3s/step - loss: 0.0138 - accuracy: 1.0000 - val_loss: 0.2662 - val_accuracy: 0.8000
Epoch 75/75
2/2 [==============================] - 9s 3s/step - loss: 0.0137 - accuracy: 1.0000 - val_loss: 0.2686 - val_accuracy: 0.8000
<keras.callbacks.History at 0x7fcbd015fcd0> 

Evaluate the trained CNN

evaluate_model(X_test, y_test, model)


precision recall f1-score support

Complete 0.00 0.00 0.00 2


Missing cover 0.50 0.33 0.40 6
Missing screw 0.71 0.92 0.80 13

accuracy 0.67 21
macro avg 0.40 0.42 0.40 21
weighted avg 0.58 0.67 0.61 21

Question:

How does the CNN perform compared to the ANN?


What could be reasons for the different performances?

Your answer: TBD

Feedback: ...

Task: With the above starter code, a first improvement in accuracy compared to the SVM and the ANN using only Dense layers
should be visible. However, the network could be further improved by adjusting the hyperparameters. Below you can find the full
snippet from data preprocessing to model training. Play around with the parameters and see whether you can find a model that
shows an even better performance!

Some ideas are:

Explore different sized images (smaller/larger)


How do black and white images compare to the rgb ones?
Adapt the architecture of the neural network:
Change the amount of Conv2D layers
Change the number of filters in each layer
Explore other activation functions
Change the learning rate of the optimizer or look at different optimizers all together
Train the model for more epochs

For comparability, please don't change the ratios for train/test and train/validation!

np.random.seed(28)

####################################################
# Please modify the following lines #
# ! Don't change training/test/validation ratios ! #
####################################################

# Data preprocessing
features, labels = load_features_labels("./data/top", size=(512,512), color=True, flatten=False, identifiers=['MC', 'MS', 'C'])
features = np.array(features) # Datatype conversion of feature vector from list to array
labels = np.array(labels) # Datatype conversion of label vector from list to array
X_train, X_test, y_train, y_test = split_data(features, labels) # Split features and labels into training and testing datasets
y_train, y_test = encode_labels(y_train, y_test) # Encode labels

# Model definition
model = Sequential()
model.add(Conv2D(4, 5, input_shape = X_train[0].shape, activation = 'relu', padding="same"))
model.add(MaxPooling2D())
model.add(Conv2D(8, 3, activation = 'relu', padding="same"))
model.add(MaxPooling2D())
model.add(Conv2D(8, 3, activation = 'relu', padding="same"))
model.add(GlobalMaxPooling2D())
model.add(Dense(8, activation = 'relu'))
model.add(Dense(y_train[0].shape[0], activation = 'softmax'))

# Model compilation
optimizer=Adam(learning_rate=0.0005)
model.compile(loss = 'categorical_crossentropy', optimizer = optimizer, metrics = ['accuracy'])

# Model training
model.fit(np.array(X_train), np.array(y_train), epochs = 50, batch_size = 2, validation_split=0.1,
verbose = 1, sample_weight=compute_sample_weight('balanced', y_train))

# Model evaluation
evaluate_model(X_test, y_test, model)
Epoch 1/50
22/22 [==============================] - 9s 388ms/step - loss: 26.1337 - accuracy: 0.2955 - val_loss: 11.7893 - val_accuracy: 0.2000
Epoch 2/50
22/22 [==============================] - 8s 383ms/step - loss: 11.7874 - accuracy: 0.1364 - val_loss: 8.1721 - val_accuracy: 0.2000
Epoch 3/50
22/22 [==============================] - 8s 383ms/step - loss: 7.6837 - accuracy: 0.2500 - val_loss: 7.0914 - val_accuracy: 0.2000
Epoch 4/50
22/22 [==============================] - 8s 385ms/step - loss: 5.6548 - accuracy: 0.3182 - val_loss: 5.4500 - val_accuracy: 0.4000
Epoch 5/50
22/22 [==============================] - 8s 381ms/step - loss: 4.5269 - accuracy: 0.3864 - val_loss: 4.4434 - val_accuracy: 0.4000
Epoch 6/50
22/22 [==============================] - 8s 381ms/step - loss: 3.9237 - accuracy: 0.3864 - val_loss: 3.7421 - val_accuracy: 0.4000
Epoch 7/50
22/22 [==============================] - 8s 382ms/step - loss: 3.2746 - accuracy: 0.4091 - val_loss: 3.1884 - val_accuracy: 0.4000
Epoch 8/50
22/22 [==============================] - 8s 382ms/step - loss: 2.8123 - accuracy: 0.4091 - val_loss: 2.7209 - val_accuracy: 0.4000
Epoch 9/50
22/22 [==============================] - 8s 384ms/step - loss: 2.4667 - accuracy: 0.4091 - val_loss: 2.4462 - val_accuracy: 0.4000
Epoch 10/50
22/22 [==============================] - 8s 384ms/step - loss: 2.1635 - accuracy: 0.4091 - val_loss: 2.2040 - val_accuracy: 0.4000
Epoch 11/50
22/22 [==============================] - 8s 379ms/step - loss: 1.9306 - accuracy: 0.4091 - val_loss: 2.0059 - val_accuracy: 0.4000
Epoch 12/50
22/22 [==============================] - 8s 383ms/step - loss: 1.7306 - accuracy: 0.4091 - val_loss: 1.8478 - val_accuracy: 0.4000
Epoch 13/50
22/22 [==============================] - 8s 381ms/step - loss: 1.5747 - accuracy: 0.4091 - val_loss: 1.7066 - val_accuracy: 0.4000
Epoch 14/50
22/22 [==============================] - 8s 380ms/step - loss: 1.3976 - accuracy: 0.4091 - val_loss: 1.5534 - val_accuracy: 0.4000
Epoch 15/50
22/22 [==============================] - 8s 378ms/step - loss: 1.2548 - accuracy: 0.4091 - val_loss: 1.3888 - val_accuracy: 0.4000
Epoch 16/50
22/22 [==============================] - 8s 382ms/step - loss: 1.1533 - accuracy: 0.4091 - val_loss: 1.3450 - val_accuracy: 0.2000
Epoch 17/50
22/22 [==============================] - 8s 380ms/step - loss: 1.0260 - accuracy: 0.4091 - val_loss: 1.1637 - val_accuracy: 0.4000
Epoch 18/50
22/22 [==============================] - 8s 382ms/step - loss: 0.9248 - accuracy: 0.4091 - val_loss: 1.1008 - val_accuracy: 0.2000
Epoch 19/50
22/22 [==============================] - 8s 383ms/step - loss: 0.8120 - accuracy: 0.4091 - val_loss: 0.9707 - val_accuracy: 0.4000
Epoch 20/50
22/22 [==============================] - 8s 381ms/step - loss: 0.7136 - accuracy: 0.4091 - val_loss: 0.8725 - val_accuracy: 0.4000
Epoch 21/50
22/22 [==============================] - 8s 379ms/step - loss: 0.6277 - accuracy: 0.4091 - val_loss: 0.7792 - val_accuracy: 0.4000
Epoch 22/50
22/22 [==============================] - 8s 379ms/step - loss: 0.5672 - accuracy: 0.4091 - val_loss: 0.7057 - val_accuracy: 0.4000
Epoch 23/50
22/22 [==============================] - 8s 381ms/step - loss: 0.5100 - accuracy: 0.4091 - val_loss: 0.6321 - val_accuracy: 0.4000
Epoch 24/50
22/22 [==============================] - 8s 382ms/step - loss: 0.4656 - accuracy: 0.4091 - val_loss: 0.6048 - val_accuracy: 0.4000
Epoch 25/50
22/22 [==============================] - 8s 379ms/step - loss: 0.4298 - accuracy: 0.4091 - val_loss: 0.5514 - val_accuracy: 0.4000
Epoch 26/50
22/22 [==============================] - 8s 380ms/step - loss: 0.3959 - accuracy: 0.4091 - val_loss: 0.4881 - val_accuracy: 0.6000
Epoch 27/50
22/22 [==============================] - 8s 377ms/step - loss: 0.3602 - accuracy: 0.4091 - val_loss: 0.4679 - val_accuracy: 0.6000
Epoch 28/50
22/22 [==============================] - 8s 381ms/step - loss: 0.3350 - accuracy: 0.4773 - val_loss: 0.4212 - val_accuracy: 0.6000
Epoch 29/50
22/22 [==============================] - 8s 381ms/step - loss: 0.3285 - accuracy: 0.5909 - val_loss: 0.4223 - val_accuracy: 0.6000
Epoch 30/50
22/22 [==============================] - 8s 385ms/step - loss: 0.3127 - accuracy: 0.5909 - val_loss: 0.3797 - val_accuracy: 0.6000
Epoch 31/50
22/22 [==============================] - 8s 383ms/step - loss: 0.2905 - accuracy: 0.7045 - val_loss: 0.3845 - val_accuracy: 0.6000
Epoch 32/50
22/22 [==============================] - 8s 379ms/step - loss: 0.2945 - accuracy: 0.7500 - val_loss: 0.3831 - val_accuracy: 0.6000
Epoch 33/50
22/22 [==============================] - 8s 378ms/step - loss: 0.2760 - accuracy: 0.7500 - val_loss: 0.3529 - val_accuracy: 0.6000
Epoch 34/50
22/22 [==============================] - 8s 381ms/step - loss: 0.2549 - accuracy: 0.7955 - val_loss: 0.3466 - val_accuracy: 0.6000
Epoch 35/50
22/22 [==============================] - 8s 386ms/step - loss: 0.2554 - accuracy: 0.7500 - val_loss: 0.3345 - val_accuracy: 0.6000
Epoch 36/50
22/22 [==============================] - 9s 400ms/step - loss: 0.2546 - accuracy: 0.7955 - val_loss: 0.3323 - val_accuracy: 0.6000
Epoch 37/50
22/22 [==============================] - 8s 385ms/step - loss: 0.2354 - accuracy: 0.8182 - val_loss: 0.3249 - val_accuracy: 0.6000
Epoch 38/50
22/22 [==============================] - 8s 381ms/step - loss: 0.2263 - accuracy: 0.8864 - val_loss: 0.3077 - val_accuracy: 0.6000
Epoch 39/50
22/22 [==============================] - 8s 382ms/step - loss: 0.2294 - accuracy: 0.9091 - val_loss: 0.3058 - val_accuracy: 0.6000
Epoch 40/50
22/22 [==============================] - 8s 380ms/step - loss: 0.2153 - accuracy: 0.8864 - val_loss: 0.3014 - val_accuracy: 0.6000
Epoch 41/50
22/22 [==============================] - 8s 381ms/step - loss: 0.2150 - accuracy: 0.9091 - val_loss: 0.3025 - val_accuracy: 0.6000
Epoch 42/50
22/22 [==============================] - 8s 380ms/step - loss: 0.2061 - accuracy: 0.8864 - val_loss: 0.3164 - val_accuracy: 0.6000
Epoch 43/50
22/22 [==============================] - 8s 379ms/step - loss: 0.1997 - accuracy: 0.9091 - val_loss: 0.2829 - val_accuracy: 0.6000
Epoch 44/50
22/22 [==============================] - 8s 384ms/step - loss: 0.1903 - accuracy: 0.9091 - val_loss: 0.2857 - val_accuracy: 0.6000
Epoch 45/50
22/22 [==============================] - 9s 388ms/step - loss: 0.1847 - accuracy: 0.9318 - val_loss: 0.2616 - val_accuracy: 0.6000
Epoch 46/50
22/22 [==============================] - 8s 387ms/step - loss: 0.1874 - accuracy: 0.9091 - val_loss: 0.2626 - val_accuracy: 0.6000
Epoch 47/50
22/22 [==============================] - 8s 381ms/step - loss: 0.1786 - accuracy: 0.9091 - val_loss: 0.2540 - val_accuracy: 0.6000
Epoch 48/50
22/22 [==============================] - 8s 383ms/step - loss: 0.1699 - accuracy: 0.9318 - val_loss: 0.2669 - val_accuracy: 0.6000
Epoch 49/50
22/22 [==============================] - 8s 379ms/step - loss: 0.1651 - accuracy: 0.9545 - val_loss: 0.2480 - val_accuracy: 0.6000
Epoch 50/50
22/22 [==============================] - 8s 381ms/step - loss: 0.1652 - accuracy: 0.9318 - val_loss: 0.2616 - val_accuracy: 0.6000

precision recall f1-score support

Complete 0.00 0.00 0.00 2
Missing cover 0.40 0.67 0.50 6
Missing screw 0.75 0.46 0.57 13

accuracy 0.48 21
macro avg 0.38 0.38 0.36 21
weighted avg 0.58 0.48 0.50 21

Question: Describe your approach optimizing the hyperparameters. Which behavior did you observe?

Section 1.5: Data augmentation

Data augmentation is a technique for artificially increasing the dataset without the need for additional data acquisition. The reason for this is that most machine learning models perform better the higher the available data volume is.

Data augmentation uses the principle of slight modifications to the original data to create new data, while using the labels of the existing image. As those modifications are rather small, the image as a whole is not changed by a lot and the to-be identified object, or in our case image class, can still be recognized. However, the training process can be increased significantly. One can think of many variations of these slight modifications of an image. Typical examples include:

Random flipping of the image horizontally or vertically
Random rotations
Random shifts
Blurring the image
Adding artificially created noise
Cropping
Changes in contrast
Elastic deformations

Below you can see some examples of the different augmentation strategies applied to our dataset.

Implementation in keras

Keras includes its own procedure for image augmentation using the ImageDataGenerator generator. This generator offers a variety of data augmentation strategies that are directly applied to the raw data during model training. Thus, the augmented data does not need to be stored to the disk.

For this exercise, we are going to use the ImageDataGenerator from keras. Please have a look at the documentation to get familiar: https://keras.io/api/preprocessing/image/#imagedatagenerator-class

# Data preprocessing
features, labels = load_features_labels("./data/top", size=(512,512), color=True, flatten=False, identifiers=['MC', 'MS', 'C'])
features = np.array(features) # Datatype conversion of feature vector from list to array
labels = np.array(labels) # Datatype conversion of label vector from list to array
X_train, X_test, y_train, y_test = split_data(features, labels) # Split features and labels into training and testing datasets
y_train, y_test = encode_labels(y_train, y_test) # Encode labels

from tensorflow.keras.preprocessing.image import ImageDataGenerator

### Create and show data augmentation


datagen = ImageDataGenerator(
featurewise_center=False,
featurewise_std_normalization=False,
rotation_range=10,
width_shift_range=0.2,
height_shift_range=0.2,
horizontal_flip=True,
vertical_flip=True)

random_index = random.randint(0, len(features) - 1) # Randomly select one image (randint is inclusive on both ends)


datagen.fit(features[[random_index]]) # Fit the image generator with the randomly selected image

# Display the random augmentations


from mpl_toolkits.axes_grid1 import ImageGrid

fig = plt.figure(figsize=(16, 12))


grid = ImageGrid(fig, 111, nrows_ncols=(3, 4), axes_pad=(0.1, 0.3))

grid[0].imshow(features[random_index])
grid[0].set_title("Original")
for i, ax in enumerate(grid[1:]):
image = datagen.flow(features[[random_index]]).next()[0].astype(int)
ax.imshow(image) # Plot image
plt.show()
### Run model training with given data generator
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, train_size=0.9, stratify=y_train, random_state=21)

datagen = ImageDataGenerator(
featurewise_center=False,
featurewise_std_normalization=False,
rotation_range=10,
width_shift_range=0.2,
height_shift_range=0.2,
horizontal_flip=True,
vertical_flip=True)

datagen.fit(np.array(X_train))

model = Sequential()
model.add(Conv2D(8, 5, input_shape = X_train[0].shape, activation = 'relu', padding="same"))
model.add(MaxPooling2D())
model.add(Conv2D(8, 5, activation = 'relu', padding="same"))
model.add(MaxPooling2D())
model.add(Conv2D(16, 5, activation = 'relu', padding="same"))
model.add(MaxPooling2D())
model.add(Conv2D(16, 3, activation = 'relu', padding="same"))
model.add(GlobalMaxPooling2D())
model.add(Dense(64, activation = 'relu'))
model.add(Dense(16, activation = 'relu'))
model.add(Dense(y_train[0].shape[0], activation = 'softmax'))

model.compile(loss = 'categorical_crossentropy', optimizer = 'Adam', metrics = ['accuracy'])

model.fit(datagen.flow(np.array(X_train), np.array(y_train), batch_size=8), validation_data=(X_validation, y_validation),


steps_per_epoch=len(X_train) / 8, epochs=50)


Epoch 34/50
5/5 [==============================] - 13s 2s/step - loss: 0.5471 - accuracy: 0.8182 - val_loss: 0.7551 - val_accuracy: 0.6000
Epoch 35/50 
5/5 [==============================] - 13s 3s/step - loss: 0.7288 - accuracy: 0.7045 - val_loss: 0.5190 - val_accuracy: 0.8000
Epoch 36/50
5/5 [==============================] - 13s 2s/step - loss: 0.5174 - accuracy: 0.7955 - val_loss: 0.4745 - val_accuracy: 0.8000
Epoch 37/50
5/5 [==============================] - 13s 2s/step - loss: 0.6051 - accuracy: 0.7727 - val_loss: 0.7809 - val_accuracy: 0.6000
Epoch 38/50
5/5 [==============================] - 13s 2s/step - loss: 0.6883 - accuracy: 0.7045 - val_loss: 0.5812 - val_accuracy: 0.8000
Epoch 39/50
5/5 [==============================] - 13s 2s/step - loss: 0.5558 - accuracy: 0.7500 - val_loss: 0.3944 - val_accuracy: 0.8000
Epoch 40/50
5/5 [==============================] - 13s 2s/step - loss: 0.6485 - accuracy: 0.7727 - val_loss: 0.4437 - val_accuracy: 0.8000
Epoch 41/50
5/5 [==============================] - 13s 2s/step - loss: 0.7244 - accuracy: 0.7500 - val_loss: 0.4760 - val_accuracy: 0.8000
Epoch 42/50
5/5 [==============================] - 13s 2s/step - loss: 0.7097 - accuracy: 0.8182 - val_loss: 0.7354 - val_accuracy: 0.8000
Epoch 43/50
5/5 [==============================] - 13s 2s/step - loss: 0.5350 - accuracy: 0.7955 - val_loss: 0.7730 - val_accuracy: 0.8000
Epoch 44/50
5/5 [==============================] - 13s 2s/step - loss: 0.6777 - accuracy: 0.6818 - val_loss: 0.6028 - val_accuracy: 0.8000
Epoch 45/50
5/5 [==============================] - 13s 3s/step - loss: 0.5212 - accuracy: 0.8182 - val_loss: 0.5707 - val_accuracy: 0.8000
Epoch 46/50
5/5 [==============================] - 13s 2s/step - loss: 0.6360 - accuracy: 0.7955 - val_loss: 0.5915 - val_accuracy: 0.8000
Epoch 47/50
5/5 [==============================] - 13s 2s/step - loss: 0.6162 - accuracy: 0.7727 - val_loss: 0.6657 - val_accuracy: 0.8000
Epoch 48/50
5/5 [==============================] - 13s 2s/step - loss: 0.7296 - accuracy: 0.7727 - val_loss: 0.6666 - val_accuracy: 0.8000
Epoch 49/50
5/5 [==============================] - 13s 2s/step - loss: 0.4567 - accuracy: 0.8636 - val_loss: 0.6394 - val_accuracy: 0.8000
Epoch 50/50
5/5 [==============================] - 13s 3s/step - loss: 0.5065 - accuracy: 0.7500 - val_loss: 0.4732 - val_accuracy: 0.8000
<keras.callbacks.History at 0x7fcbc4fef890>

evaluate_model(X_test, y_test, model)

WARNING:tensorflow:5 out of the last 5 calls to <function Model.make_predict_function.<locals>.predict_function at 0x7fcbc70fe560> trigg


precision recall f1-score support

Complete 0.00 0.00 0.00 2


Missing cover 1.00 0.67 0.80 6
Missing screw 0.81 1.00 0.90 13

accuracy 0.81 21
macro avg 0.60 0.56 0.57 21
weighted avg 0.79 0.81 0.78 21

Question: What behavior can be observed while training the model using data augmentation? Did it improve the results?

Task & Question: Experiment with the different data augmentation parameters. Are all of them similarly effective? One way to start is sketched below.

Your answer: TBD

Feedback: ...
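One possible way to start the experiment is sketched below; the two generators and their parameter values are arbitrary examples for comparison, not recommended settings. Everything used here (ImageDataGenerator, features, random, plt) is imported or defined in the cells above.

# Sketch: compare a gentle and a strong augmentation setting on the same randomly chosen image.
gentle = ImageDataGenerator(rotation_range=5, width_shift_range=0.05,
                            height_shift_range=0.05, horizontal_flip=True)
strong = ImageDataGenerator(rotation_range=45, width_shift_range=0.3,
                            height_shift_range=0.3, horizontal_flip=True, vertical_flip=True)

sample = features[[random.randint(0, len(features) - 1)]]  # one image as a (1, height, width, 3) batch
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for col in range(4):
    axes[0, col].imshow(gentle.flow(sample).next()[0].astype(int))  # top row: gentle augmentation
    axes[1, col].imshow(strong.flow(sample).next()[0].astype(int))  # bottom row: strong augmentation
axes[0, 0].set_ylabel('gentle')
axes[1, 0].set_ylabel('strong')
plt.show()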
Section 2: Expanding the project scope
So far, only the classes Complete, Missing cover, and Missing screw were investigated. Those defects are easy to observe in the top view. The
remaining defect Not screwed is hardly visible in the top view. Thus, information from the side view could be used to detect this defect.

Task: In this last part of the exercise, you are tasked to expand the current quality monitoring solution to also detect not fully
fastened screws. As mentioned above, it might be useful to investigate the side view images to achieve this purpose.

You can approach this problem using any of the above mentioned solutions such as SVMs, ANNs or CNNs.
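As a non-binding starting point (a sketch only, not the reference solution), one could load the side-view images for all four classes and reuse the helper functions from part one; everything used below is defined in the cells above.

# Sketch: load side-view images for all four classes and reuse the part-one pipeline.
features, labels = load_features_labels("./data/side", size=(512, 512), color=True,
                                        flatten=False, identifiers=['NS', 'MS', 'MC', 'C'])
features, labels = np.array(features), np.array(labels)
X_train, X_test, y_train, y_test = split_data(features, labels)
y_train, y_test, enc = encode_labels(y_train, y_test, returnEncoder=True)  # refit the encoder for four classes
# From here on, a CNN can be defined, compiled and trained as in Section 1.4 b)
# and evaluated with evaluate_model(X_test, y_test, model).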

# Your code here


# You can use as many cells as you want, but insert them before the final task at the bottom

Question: Please describe your approach for expanding the project scope briefly.

Your answer: TBD

Feedback: ...

Question: What was your final prediction results? What would you do to further improve the results?

Your answer: TBD

Feedback: ...

Question: Which challenges did you encounter while solving the problem? How did you solve those?

Your answer: TBD

Feedback: ...
