Slides Merged
Motivation 9
Machine Learning Types 22
Machine Learning Pipeline 55
Summary 69
Linear Regression - Motivation 72
Linear Regression - Overall Picture 86
Linear Regression - Model 94
Linear Regression - Optimization 103
Linear Regression - Basis Functions 128
Logistic Regression - Motivation 137
Logistic Regression - Framework 153
Overfitting and Underfitting 185
Problem Statement 208
Optimization 216
Kernel Trick 224
Hard and Soft Margin 234
Regression 238
Summary 244
Applications 247
Intuition 252
Mathematics 259
Summary 267
Perceptron 270
Multilayer Perceptron 280
Applications 333
Machine Learning for Engineers
Organizational Information
Bilder: TF / Malter
Course Organizers
Prof. Dr. Bjoern M. Eskofier
Machine Learning and Data Analytics Lab
University Erlangen-Nürnberg
Exercises
Two real-world industrial applications
• Exercise 1: Energy prediction using Support Vector Machines
• Exercise 2: Image classification using Convolutional Neural Networks
Deep Learning
by A. Courville, I. Goodfellow, Y. Bengio, 2015
Bilder: TF / Malter
Four Industrial Revolutions
Currently we are in the 4th Industrial Revolution:
• 1. Industrial Revolution (from ~1750): Steam machines allow industrialization as well as varied series production
• 2. Industrial Revolution: Division of labor (keyword: conveyor belts)
• 3. Industrial Revolution (from ~1960): Electronics, machines and automation with IT enable automation-driven rationalization
• 4. Industrial Revolution: Interlinked production; machines communicate with each other and optimize the process
Source: Industrie 4.0 in Produktion, Automatisierung und Logistik, T. Bauernhansl, M. Hompel, B. Vogel-Heuser, 2014
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 2
Core of the Fourth Industrial Revolution is Digitalization
• In 1960 the dominant factor was mechanical engineering: mechanical and plant engineering was dominated by it
• The share of mechanical engineering has been decreasing with each decade until today
• The impact of computer science (AI and machine learning) has been increasing steadily since the 1980s
→ Mechanical and plant engineering is composed of multiple fields and, unlike in the 1960s, has become an interdisciplinary area
[Figure: Percentage of mechanical engineering, electrical engineering, computer science and systems engineering in mechanical and plant engineering, 1960–2020]
Source: Automatisierung 4.0, T. Schmertosch, M. Krabbes, 2018
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 3
What drives that progress in Computer Science?
• Of course: algorithmic advances (NOT ONLY AI)
• BUT: one significant force of progress IS AI
• You can see that clearly from the number of publications and applications in that area
• So now: What is Artificial Intelligence actually? And what is the difference to Machine Learning and Deep Learning?
[Figure: Annual total number of AI papers (USA, Europe, China), 2000–2015]
Source: https://www.ibm.com/blogs
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 5
Driving factors for the Advancement of AI
There are three important reasons for these advancements!
1. Increase of the computational power
2. Increase in the amount of available data
3. Development of new algorithms
Sources:
Left Image: https://datafloq.com
Middle Image:https://europeansting.com
Right Image: https://medium.com
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 6
Increase of Computational Power
• Transistors are the building blocks of CPUs: small "blocks" in the computer doing "simple" calculations
• The number of transistors is correlated with computational power
• Moore's Law: "the number of transistors in a dense integrated circuit (IC) doubles about every two years."
• The figure (transistors per square millimeter by year, 1970–2020, logarithmic y-axis) shows that the number of transistors per mm² increases linearly on a logarithmic scale, i.e. exponentially
• This means exponential growth of computational power!
Source: https://medium.com
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 7
Computational Power of today's supercomputers
• The increase in computational power is especially noticeable in supercomputers
• Supercomputers have millions of cores (2020)
• Leaders (2020) are Japan, USA, China
• Germany is on the 13th place
TOP500 Supercomputers (June 2020):
Rank | System | Specs | Site | Country | Cores | Rmax (Pflop/s) | Power
1 | Supercomputer Fugaku | Fujitsu A64FX (48C, 2.2GHz), Tofu Interconnect D | RIKEN R-CCS | Japan | 7,288,072 | 415.5 | 28.3
2 | Summit | IBM POWER9 (22C, 3.07GHz), NVIDIA Volta GV100 (80C), Dual Rail Mellanox EDR Infiniband | DOE/SC/ORNL | USA | 2,414,592 | 148.6 | 10.1
3 | Sierra | IBM POWER9 (22C, 3.1GHz), NVIDIA Tesla V100 (80C), Dual Rail Mellanox EDR Infiniband | DOE/NNSA/LLNL | USA | 1,572,480 | 94.6 | 7.4
4 | Sunway TaihuLight | Shenwei SW26010 (260C, 1.45GHz), Custom Interconnect | NSCC in Wuxi | China | 10,649,600 | 93 | 15.4
5 | Tianhe-2A | Intel Ivy Bridge (12C, 2.2GHz) | NSCC Guangzhou | China | 4,981,760 | 61.4 | 18.5
Source: https://www.top500.org
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 8
Increase in the Amount of Available Data
• The other big driving factor for AI: availability of DATA
• The amount of created data increased from two zettabytes (2010) to 47 zettabytes (2020)
• It comes from all things around us (cars, IoT, aircraft, …)
• According to General Electric (GE), each of its aircraft engines produces around twenty terabytes of data per hour
https://www.statista.com https://www.forbes.com
Source: A Roadmap for Foundational Research on Artificial Intelligence in Medical Imaging, Curtis P. Langlotz et al., 2018
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 10
Industrial Examples – Real Environments
KUKA Bottleflip Challenge (video):
• Engineers from KUKA tackled the "Bottleflip Challenge": the robot flips a bottle in the air
• Fun fact: the robot trained itself in a single night
WAYMO driverless driving (video):
• Driving has traditionally been a human job
• Autonomous driving could progress thanks to artificial neural networks and deep learning: a traditionally human job is carried out by machine learning algorithms
Source: https://www.youtube.com/watch?v=HkCTRcN8TB0&t
Source: https://www.youtube.com/watch?v=aaOB-ErYq6Y&t
Source: https://deepmind.com
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 12
Introduction to Machine Learning
Machine Learning Types
Bilder: TF / Malter
Machine Learning Categories
Machine Learning
Key Aspects:
• Learning is explicit
• Learning using direct feedback
• Data with labeled output
Source: https://www.kdnuggets.com
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 4
Supervised Learning Problems
Classification Regression
Label 𝒚 = Dog
Feature 𝑥1
Label 𝑦
Label 𝒚 = Cat
Feature 𝑥0 Feature 𝑥
Source: https://www.kdnuggets.com
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 5
Regression
Regression is used to predict a continuous value.
Dataset: 𝒟 = {(𝐱₀, y₀), (𝐱₁, y₁), …, (𝐱ₙ, yₙ)}
Sample: (𝐱ᵢ, yᵢ)
𝐱ᵢ ∈ ℝᵐ is the feature vector of sample i
yᵢ ∈ ℝ is the label value of sample i
Goal: fit a model f to all samples: f(𝐱ᵢ) = yᵢ, ∀(𝐱ᵢ, yᵢ) ∈ 𝒟
https://de.wikipedia.org/wiki/Zerspanen
Objective: Realization:
• Prediction of the surface quality • Speed and Feed data is gathered and the
based on the production parameters surface roughness measured for some trials
• By using regression algorithms surface
quality is predicted
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 8
Example: Prediction of Remaining Useful Life (RUL)
“Health Monitoring and Prediction (HMP)” system (developed by BISTel)
Objective:
• Machine components degrade over time
• This causes malfunctions in the production line
• Replace component before the malfunction!
Realization:
• Use data of malfunctioning systems
• Fit a regression model to the data and predict
the RUL
• Use the model on active systems and apply the necessary maintenance
Source: https://www.bistel.com/
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 9
Classification
Classification is used to predict
the class of the input Classification
Label 𝒚 = Dog
Important:
Feature 𝑥1
The output belongs to only one class
Sample : 𝒔𝐢 = (𝑥𝑖 , 𝑦𝑖 )
𝑦𝑖 ∈ 𝐿 is the label of sample 𝑖 Label 𝒚 = Cat
Feature 𝑥0
In this example: 𝐿 = {𝐶𝑎𝑡, 𝐷𝑜𝑔}
Source: https://scikit-learn.org/
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 10
Classification
Goal:
Find a way to divide the input into the Classification
output classes!
Label 𝒚 = Dog
Feature 𝑥1
That means, we find a decision
boundary 𝑓 for all samples:
𝑓 𝒙𝒊 = 𝒚𝒊 , ∀(𝒙𝒊 , 𝒚𝒊 ) ∈ 𝒟
Label 𝒚 = Cat
In this case 𝑓 is a linear decision
Feature 𝑥0
boundary! (Black Line)
Source: https://scikit-learn.org/
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 11
Example: Foreign object detection
Objective:
• Foreign object detection on a production part
• After a production step a chip can remain
on the piston rod
• Quality assurance: Only parts without chip
are allowed for the next production step
Realization:
• A camera system records 3,000 pictures
• All images are labeled by a human
Piston rod Piston rod with chip
• A machine learning classification algorithm without chip
was trained to differentiate between chip
or no chip situations
Source: Implementation and potentials of a machine vision system in a series production using deep learning and low-cost hardware, Hubert
Würschinger et al., 2020
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 12
Classification algorithms
Below, you can see example datasets and the application of different algorithms (Input Data, Nearest Neighbors, Linear SVM, RBF SVM, Gaussian Processes).
Different algorithms use different ways to classify data.
Therefore: the algorithms perform differently on the datasets.
Example: the Linear SVM has bad results on the second dataset.
Reason: there is no linear way to distinguish the data.
The lower right value shows the classification accuracy.
Source: https://scikit-learn.org/
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 13
Classification vs. Regression
Both are supervised learning problems, i.e. they require labeled datasets. The difference is how they are used for different machine learning problems.
Regression:
• The outputs are continuous or real values
• We try to fit a regression model, which can predict the output as accurately as possible
• Regression algorithms can be used to solve regression problems such as weather prediction, house price prediction, stock market prediction, etc.
Classification:
• The output variable must be a discrete value (class)
• We try to find a decision boundary, which can divide the dataset into different classes
• Classification algorithms can be used to solve classification problems such as hand-written digits (MNIST), speech recognition, identification of cancer cells, defective or non-defective solar cells, etc.
Source: https://scikit-learn.org/
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 14
Machine Learning Categories
Machine Learning
Key Aspects:
• Learning is implicit
• Learning using indirect feedback
• Methods are self-organizing
[Figure: Iris dataset, petal length vs. petal width. Left: labeled species (Iris setosa, Iris versicolor, Iris virginica). Right: clusters found without labels (Cluster 1, Cluster 2)]
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 20
Clustering algorithms
Columns: K-Means, Affinity Propagation, Mean Shift, Spectral Clustering, Ward.
Different clustering methods produce different results, e.g. some algorithms "find" more clusters than others.
Example: K-Means performs "well" on the third dataset, but not on datasets one and two.
Reason: K-Means can only identify "circular" clusters.
Source: https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 21
Curse of Dimensionality
“As the number of features or dimensions grows, the amount of data we
need to generalize accurately grows exponentially.” – Charles Isbell
Source: Chen L. (2009) Curse of Dimensionality. In: LIU L., ÖZSU M.T. (eds) Encyclopedia of Database Systems. Springer, Boston, MA.
https://doi.org/10.1007/978-0-387-39940-9_133
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 22
Example: Curse of Dimensionality
A production system has N sensors attached, each reading either "On" or "Off". The number of possible input states D is:
N = 1 : D = 2¹ = 2
N = 10 : D = 2¹⁰ = 1024
N = 100 : D = 2¹⁰⁰ ≈ 1.27 × 10³⁰
For N = 100, the number of possible states is already far too large to ever cover with training data!
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 24
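The combinatorial growth can be checked directly in a few lines of Python; the snippet below is only an illustration of the arithmetic, not part of the original exercise material.

```python
# Number of distinct on/off states for N binary sensors grows as 2**N.
for n in (1, 10, 100):
    d = 2 ** n
    print(f"N = {n:3d} sensors -> D = 2**{n} = {d} (~{float(d):.2e})")
```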
Dimensionality Reduction
The goal:
S0 S1 S2 S3 S4 S5 S6 S7 S8
Transform the samples from a high to a Sample0 0.2 0.1 11.1 2.2 Off 7 1.1 0 1.e-1
lower dimensional representation! Sample1 1.2 -0.1 3.1 -0.1 On 9 2.3 -1 1.e-4
Sample2 2.7 1.1 0.1 0.1 Off 10 4.5 -1 1.e-9
Sample3 3.1 0.1 1.1 0.2 Off 1 6.6 -1 1.e-1
Ideally:
Find a representation, which solves
your problem!
T0 T1 T2 T3
Key Aspects:
• Learning is implicit
• Learning using indirect feedback
based on trials and reward signals
• Actions are affecting future measurements (i.e. inputs)
Source: https://ai.googleblog.com/2018/06/scalable-deep-reinforcement-learning.html
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 32
Introduction to Machine Learning
Machine Learning Pipeline
Bilder: TF / Malter
The Machine Learning Pipeline
A concept that provides guidance in a machine learning project
• Step-by-step process
• Each step has a goal and a method
Source: Artificial Intelligence with Python - Second Edition, by Alberto Artasanchez, Prateek Joshi
23.03.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 2
1. Problem Definition
Our aim with machine learning is to develop a solution for a problem. In order to develop a satisfying
solution, we need to define the problem. This definition of the problem lays the foundation to our solution. It
shows us what kind of data we need, what kind of algorithms we can use.
Examples:
• If we are trying to find a faulty equipment, we have a classification problem
• If we are trying to predict a continuous number, we have a regression problem
Tennis
Day Weather Temp. Humidity Wind
recommended?
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Cloudy Hot High Weak Yes
4 Rainy Mild High Weak Yes
… … … … … …
We measure the actual energy consumption of 10 work steps after each maintenance procedure and calculate the error of the predicted energy consumption. Within the defined range: the model is working sufficiently. Outside the defined range: the model is working insufficiently → stop using it and perform a root cause analysis.
Pipeline: Problem Definition → Data Ingestion → Data Preparation → Data Segregation → Model Training → Candidate Model Evaluation → Model Deployment → Performance Monitoring
Bilder: TF / Malter
Summary
In this chapter we talked about:
• The history of machine learning and recent trends
• The different types of machine learning, such as supervised learning, unsupervised learning and reinforcement learning
• The steps involved in a common machine learning pipeline
Bilder: TF / Malter
Motivation
Linear regression is used in multiple
scenarios each and every day!
Use Cases:
• Trend Analysis in financial
markets, sales and business
• Computer vision problems, i.e. registration & localization problems
• etc.
[Figure: financial development in FXCM with a linear regression trend channel]
Source: https://www.tradingview.com/script/OZZpxf0m-Linear-Regression-Trend-Channel-with-Entries-Alerts/
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 9
When do we use it?
The problem has to be simple:
• Dataset is small
• Linear model is enough i.e. Trend Analysis
• Linear models are the basis for complex models
i.e. Deep Networks
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 11
Example: Growth Reference Dataset (WHO)
Age Height The dataset contains Age (5y – 12y) and Height
4.1 years 108 cm information from people in the USA
5.2 years 112 cm
5.6 years 108 cm We want to answer Questions like:
5.7 years 116 cm What is the height of my child,
6.2 years 112 cm when it is 14 years old?
6.3 years 116 cm
6.6 years 120 cm
What is the height of my child,
6.7 years 122 cm
when it is 30 years old?
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 12
Example: 1.Visualize the data
Age Height
145
4.1 years 108 cm
5.2 years 112 cm 140
Height in (cm)
5.6 years 108 cm 130
5.7 years 116 cm 120
6.2 years 112 cm 115
6.3 years 116 cm 110
6.6 years 120 cm
6.7 years 122 cm 5 6 7 8 9 10 11 12
Age in years
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 13
Example: 2.Fit a model
145
140
Height in (cm)
130
120
115
110
5 6 7 8 9 10 11 12
Age in years
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 15
Example: 2.Fit a model
What “model” i.e. function
describes our data? 145
Height in (cm)
130
• Polynomial model
• Gaussian model 120
115
110
5 6 7 8 9 10 11 12
Age in years
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 16
Example: 3.Answer your Questions
What is the height of my child,
when it is 14 years old? 270
→ 165 cm 230
Height in (cm)
190
150
110
70
0 5 10 15 20 25 30 35
Age in years
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 18
Example: 3.Answer your Questions
What is the height of my child,
when it is 14 years old? 270
→ 165 cm 230
Height in (cm)
190
→ 270 cm 70
0 5 10 15 20 25 30 35
Age in years
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 19
Example: Actual Answer
What is the height of my child,
when it is 14 years old?
(We estimated 165 cm)
→ ~165 cm
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 20
Next in this Lecture:
• What is the Mathematical Framework?
• How do we fit a linear model?
Bilder: TF / Malter
Overview
• Overall Picture
• The Linear Model
• Optimization
• Basis Function Expansion
145
140
Height in (cm)
130
120
115
110
5 6 7 8 9 10 11 12
Age in years
Note: Bold variables are vectors and/or matrices, non-bold variables are scalars!
Note: x (input) is called the independent/predictor variable and y (output) is called dependent/outcome variable
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 26
Overall Picture
Linear Regression is a method to fit linear models to our data!
145
The linear model:
140
f 𝐱 = w0 ⋅ 𝑥0 + w1 ⋅ 𝑥1 = 𝐲
Height in (cm)
130
120
115
110
5 6 7 8 9 10 11 12
Age in years
Note: Bold variables are vectors and/or matrices, non-bold variables are scalars!
Note: x (input) is called the independent/predictor variable and y (output) is called dependent/outcome variable
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 27
Overall Picture
Linear Regression is a method to fit linear models to our data!
145
The linear model:
140
f 𝐱 = w0 ⋅ 𝑥0 + w1 ⋅ 𝑥1 = 𝐲
Height in (cm)
130
𝐱 = (x₀, x₁)ᵀ = (1, Age in years)ᵀ
y = Height
𝐰 = Weights
5 6 7 8 9 10 11 12
Age in years
Note: Bold variables are vectors and/or matrices, non-bold variables are scalars!
Note: x (input) is called the independent/predictor variable and y (output) is called dependent/outcome variable
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 28
Overall Picture
Linear Regression is a method to fit linear models to our data!
145
The linear model (in our example):
140
f 𝐱 = 70 ⋅ 𝑥0 + 6.5 ⋅ 𝑥1 = 𝐲
Height in (cm)
130
120
115
110
5 6 7 8 9 10 11 12
Age in years
Note: Bold variables are vectors and/or matrices, non-bold variables are scalars!
Note: x (input) is called the independent/predictor variable and y (output) is called dependent/outcome variable
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 29
Overall Picture
Linear Regression is a method to fit linear models to our data!
145
The linear model (in our example):
140
f 𝐱 = 70 ⋅ 𝑥0 + 6.5 ⋅ 𝑥1 = 𝐲
Height in (cm)
130
→ Finding a good 𝑤0 and 𝑤1 is 120
called fitting! 115
110
5 6 7 8 9 10 11 12
Age in years
Note: Bold variables are vectors and/or matrices, non-bold variables are scalars!
Note: x (input) is called the independent/predictor variable and y (output) is called dependent/outcome variable
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 30
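A minimal sketch of this fitting step in Python/NumPy is shown below. It uses only the eight example rows from the table earlier in this section, so the fitted weights will not exactly match the slide's illustrative values w₀ = 70 and w₁ = 6.5; the variable names are assumptions made for this sketch.

```python
import numpy as np

# The eight (age, height) rows from the example table; not the full WHO dataset.
age = np.array([4.1, 5.2, 5.6, 5.7, 6.2, 6.3, 6.6, 6.7])
height = np.array([108, 112, 108, 116, 112, 116, 120, 122])

X = np.column_stack([np.ones_like(age), age])    # x0 = 1 (intercept), x1 = age
w, *_ = np.linalg.lstsq(X, height, rcond=None)   # least-squares fit of w0, w1

print(f"fitted model: height ≈ {w[0]:.1f} + {w[1]:.1f} * age")
print(f"extrapolation to age 14: {w[0] + w[1] * 14:.1f} cm")
```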
Overview
• Overall Picture
• The Linear Model
• Optimization
• Basis Function Expansion
Height in (cm)
130
120 𝜖𝑖
115
110
5 6 7 8 9 10 11 12
Age in years
Note: In this case we assume that the noise 𝜖𝑖 is Gaussian distributed
Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p.19
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 32
The Linear Model
The previous description of the linear model is not accurate!
Reason: Real Systems produce noise!
145
140
Linear model with noise:
Height in (cm)
130
f 𝐱 = w0 ⋅ 𝑥0 + w1 ⋅ 𝑥1 + 𝜖𝑖
120 𝜖𝑖
115
110
5 6 7 8 9 10 11 12
Age in years
1) In this case we assume that the noise 𝜖𝑖 is Gaussian distributed
Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p.19
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 33
The Linear Model
The previous description of the linear model is not accurate!
Reason: Real Systems produce noise!
145
140
Linear model with noise:
Height in (cm)
130
f 𝐱 = w0 ⋅ 𝑥0 + w1 ⋅ 𝑥1 + 𝜖𝑖
120 𝜖𝑖
1) 115
The 𝜖𝑖 and the summation 𝝐 of
all samples is called Residual Error! 110
5 6 7 8 9 10 11 12
Age in years
1) In this case we assume that the noise 𝜖𝑖 is Gaussian distributed
Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p.19
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 34
The Residual Error 𝝐
The residual error 𝜖 is not part of the input!
It is a random variable! Gaussian Distribution
0.5
0.4
Typically, we assume 𝜖 is Gaussian distributed (i.e. Gaussian noise)¹⁾:
p(𝜖) = (1 / (σ√(2π))) · exp(−(1/2) · ((𝜖 − μ) / σ)²)
But other distributions are also possible!
1) The short form is 𝜖 ~ 𝒩(𝜇, 𝜎 2 )
Note: In our use-case 𝜇 = 0 and 𝜎 is small. That means our samples only deviate "slightly".
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 35
The distribution of 𝑦
The noise is Gaussian:
𝑦1
𝑝 𝜖 =𝒩 𝜖 0, 𝜎 2 ) 145
140 𝑝(𝐲1 |𝐱1 )
Height in (cm)
“Inserting” the linear function: 130
𝑝 𝐲 𝐱, 𝐰, 𝜎) = 𝒩 𝐲 𝐰 T 𝐱, 𝜎 2 ) 120
𝑦0
115 𝑝(𝐲0 |𝐱 0 )
110
This is the conditional probability 𝑥0 𝑥1
density for the target variable 𝐲!
5 6 7 8 9 10 11 12
Age in years
f(𝐱) = Σ_{j=1}^{D} wⱼ xⱼ + 𝜖 = y
Vector notation:
f(𝐱) = 𝐰ᵀ𝐱 + 𝜖 = y
Note: Bold variables are vectors and/or matrices, non-bold variables are scalars!
Source: Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, p.19
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 37
What parameters do we have to estimate?
The model is linear:
f(𝐱) = Σ_{j=1}^{D} wⱼ xⱼ + 𝜖 = y
Note: In this case, we assume that the noise is Gaussian distributed! i.e. 𝜖 ~ 𝒩(𝜇, 𝜎 2 )
Note: We assume 𝜇 = 0 : That means we only have to estimate 𝜎! Homework: Why can we assume 𝜇 = 0?
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 38
Overview
• Overall Picture
• The Linear Model
• Optimization
• Basis Function Expansion
𝛉∗ = argmax 𝑝(𝒟|𝛉)
𝛉
That means:
• We want to find the optimal parameters 𝛉∗
• We search over all possible 𝛉
• We select the 𝛉 which most likely generated our training dataset 𝒟
0.4
Para.: 𝛉 = {𝜇, 𝜎}
0.3
𝛉 = {9, 0.9} : Bad
0.2
0.1
0 1 2 3 4 5 6 7 8 9 10
Event
Reminder: 𝜃 are the parameters, 𝒟 is the training dataset!
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 42
Intuitive Example
𝛉∗ = argmax 𝑝(𝒟|𝛉)
0.5 𝛉
Model: 𝒟 ~ 𝒩(𝜇, 𝜎 2 )
Distribution of 𝒟
0.4
Para.: 𝛉 = {𝜇, 𝜎}
0.3
𝛉 = {9, 0.9} : Bad
0.2
𝛉 = {4, 0.9} : Better
0.1
0 1 2 3 4 5 6 7 8 9 10
Event
Reminder: 𝜃 are the parameters, 𝒟 is the training dataset!
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 43
Intuitive Example
𝛉∗ = argmax 𝑝(𝒟|𝛉)
0.5 𝛉
Model: 𝒟 ~ 𝒩(𝜇, 𝜎 2 )
Distribution of 𝒟
0.4
Para.: 𝛉 = {𝜇, 𝜎}
0.3
𝛉 = {9, 0.9} : Bad
0.2
𝛉 = {4, 0.9} : Better
0.1 𝛉 = {5, 1.5} : Best
0 1 2 3 4 5 6 7 8 9 10
Event
Reminder: 𝜃 are the parameters, 𝒟 is the training dataset!
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 44
Maximum Likelihood Estimation
This process is called Maximum Likelihood Estimation (MLE)
𝛉∗ = argmax 𝑝(𝒟|𝛉)
𝛉
ℒ(𝛉) = p(𝒟 | 𝛉)
     = p(𝐬₀, 𝐬₁, …, 𝐬ₙ | 𝛉)
     = p(𝐬₀ | 𝛉) · p(𝐬₁ | 𝛉) · … · p(𝐬ₙ | 𝛉)   (assuming the samples are independent)
     = ∏_{i=1}^{N} p(𝐬ᵢ | 𝛉)
Note: In our case a sample 𝒔𝑖 is 𝐱 𝑖 and 𝐲𝑖 ; Just assume both are “tied” together i.e. like a vector
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 46
The Log-Likelihood
The product of small numbers in a computer is unstable!
* The logarithm „just“ scales the problem. Likelihood and Log-Likelihood both have to be maximized!
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 47
The conditional distribution p(𝐬ᵢ | 𝛉)
We know the conditional is the Gaussian distribution of 𝐲 (Slide 35):
p(𝐬ᵢ | 𝛉) = p(yᵢ | 𝐱ᵢ, 𝐰, σ) = 𝒩(yᵢ | 𝐰ᵀ𝐱ᵢ, σ²)
Inserting this into the log-likelihood ℓ(𝛉) = Σ_{i=1}^{N} log[p(𝐬ᵢ | 𝛉)] gives:
ℓ(𝛉) = −(1 / (2σ²)) · Σ_{i=1}^{N} (yᵢ − 𝐰ᵀ𝐱ᵢ)² − (N / 2) · log(2πσ²)
The factor 1/(2σ²) and the second term are constant with respect to 𝐰, so maximizing ℓ(𝛉) amounts to minimizing the sum of squared errors Σᵢ (yᵢ − 𝐰ᵀ𝐱ᵢ)².
The log-likelihood can be maximized analytically (closed-form solution) or numerically¹⁾.
1) … and of course many, many more methods!
Analytical Solution
For the analytical solution, we want a "simpler" form¹⁾:
NLL(𝛉) = (1/2) (𝐲 − 𝐱𝐰)ᵀ(𝐲 − 𝐱𝐰)
1) This form uses the "Negative Log Likelihood", which can also be derived from the likelihood
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 60
Analytical Solution
The solution, i.e. the minimum, is:
𝐰 = (𝐱ᵀ𝐱)⁻¹ 𝐱ᵀ𝐲
with 𝐱ᵀ𝐲 = Σ_{i=1}^{N} 𝐱ᵢ yᵢ and 𝐱ᵀ𝐱 = Σ_{i=1}^{N} 𝐱ᵢ𝐱ᵢᵀ (a D × D matrix of pairwise feature products).
In practice this can be too time-consuming to compute: the more training data, the longer the calculation!
Note: N: number of samples in the training dataset; D: number of dimensions (length of the input vector)
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 61
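A minimal sketch of the closed-form solution in NumPy, using randomly generated data (an assumption for illustration). Solving the normal equations with np.linalg.solve is used instead of an explicit matrix inverse, which is numerically preferable but otherwise equivalent to 𝐰 = (𝐱ᵀ𝐱)⁻¹𝐱ᵀ𝐲.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(5, 12, 50)])   # bias + one feature
w_true = np.array([70.0, 6.5])
y = X @ w_true + rng.normal(0, 2.0, 50)                      # Gaussian noise eps

w_hat = np.linalg.solve(X.T @ X, X.T @ y)                    # solves (X^T X) w = X^T y
print("estimated weights:", w_hat)
```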
What is the quality of the fit?
The goodness of fit should be evaluated to verify the learned model!
The model can completely (100%) explain the variations in y. The model can partially (91.3%) explain the variations in y, but is still considered good. The model cannot explain any (0%) of the variation in y, because it only predicts the average of y.
Source: https://www.who.int/toolkits/growth-reference-data-for-5to19-years/indicators/height-for-age
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 66
Polynomial Regression
Can we also fit models to non-linear data, e.g. house prices in Springfield (USA)? The answer is – of course – yes!
f(𝐱) = Σ_{j=1}^{D} wⱼ Φⱼ(𝒙) + 𝜖 = y
Degree m = 2 vs. degree m = 13
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 70
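The sketch below illustrates basis function expansion with polynomial features of degree m on made-up one-dimensional data; the helper name poly_features is an assumption for this example.

```python
import numpy as np

def poly_features(x, m):
    """Phi(x) = [1, x, x**2, ..., x**m] for a 1-D input array x."""
    return np.column_stack([x ** j for j in range(m + 1)])

x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + np.random.default_rng(1).normal(0, 0.1, x.size)

for m in (2, 13):
    Phi = poly_features(x, m)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # same linear least squares as before
    mse = np.mean((Phi @ w - y) ** 2)
    print(f"degree {m}: {len(w)} weights, training MSE = {mse:.4f}")
```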
Basis Functions
We can use any arbitrary function, with one condition:
Bilder: TF / Malter
Motivation
Logistic regression is the application of
linear regression to classification!
Use Cases:
• Credit Scoring
• Medicine
• Text Processing
• etc.
Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 77
Example: Transform the Label
Label Petal Petal
Width Length
Setosa 5.0 mm 9.2 mm
Versi. 9.2 mm 26.1 mm
Setosa 7.7 mm 18.9 mm
Versi. 9.1 mm 32.1 mm
Setosa 7.9 mm 15.5 mm
Setosa 5.7 mm 12 mm
Setosa 2.5 mm 13.5 mm
Versi. 15.0 mm 39 mm
Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 78
Example: Transform the Label
Label Petal Petal
Width Length
0 5.0 mm 9.2 mm
1 9.2 mm 26.1 mm
0 7.7 mm 18.9 mm
1 9.1 mm 32.1 mm
0 7.9 mm 15.5 mm
0 5.7 mm 12 mm
0 2.5 mm 13.5 mm
1 15.0 mm 39 mm
Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 79
Example: 1.Visualize the data
Label Petal Petal
Width Length 15.0
Petal Width in mm
0 5.0 mm 9.2 mm 12.5
1 9.2 mm 26.1 mm 10.0
0 7.7 mm 18.9 mm 7.5
1 9.1 mm 32.1 mm
5.0
0 7.9 mm 15.5 mm
2.5
0 5.7 mm 12 mm
0 2.5 mm 13.5 mm 10 14 18 23 27 31 36 40
1 15.0 mm 39 mm Petal Length in mm
Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 80
Example: 1.Visualize the data
Label Petal Petal
Width Length 15.0
Petal Width in mm
0 5.0 mm 9.2 mm 12.5
1 9.2 mm 26.1 mm 10.0
0 7.7 mm 18.9 mm 7.5
1 9.1 mm 32.1 mm
5.0
0 7.9 mm 15.5 mm Iris Setosa (0)
2.5 Iris Versicolor (1)
0 5.7 mm 12 mm
0 2.5 mm 13.5 mm 10 14 18 23 27 31 36 40
1 15.0 mm 39 mm Petal Length in mm
Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 81
Example: 2. Find a decision boundary
15.0
Petal Width in mm
12.5
10.0
7.5
5.0
2.5
10 14 18 23 27 31 36 40
Petal Length in mm
Petal Width in mm
• Linear boundary 12.5
10.0
• Polynomial boundary
7.5
• Gaussian boundary
5.0
2.5
10 14 18 23 27 31 36 40
Petal Length in mm
Petal Width in mm
• Linear boundary 12.5
10.0
• Polynomial boundary
7.5
• Gaussian boundary
5.0
2.5
Petal Width in mm
12.5
10.0
7.5
5.0
2.5
10 14 18 23 27 31 36 40
Petal Length in mm
Petal Width in mm
15mm. 12.5
10.0
10 14 18 23 27 31 36 40
Petal Length in mm
Petal Width in mm
15mm. 12.5
10.0
Note: Setosa Versicolor | Reason: The point is on the “left side” of the decision boundary
Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 87
Next in this Lecture:
• What is the Mathematical Framework?
• How do we classify using a linear model?
Bilder: TF / Malter
Overview
• The Logistic Model
• Optimization
Petal Width in mm
12.5
10.0
7.5
5.0
2.5
10 14 18 23 27 31 36 40
Petal Length in mm
Thumb rule: the larger the distance of the input 𝐱 to the decision boundary, the more "certain" the classification (and the smaller the distance, the more "uncertain").
f(𝐱) = Σ_{j=1}^{D} wⱼ xⱼ
f(𝐱)¹⁾ calculates a signed distance between the input and the linear model.
Note: Setosa Versicolor | 1) negative (when left of the line), positive (when right of the line)
Source: Fisher, R.. “THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS.” Annals of Human Genetics 7 (1936): 179-188.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 95
The sigmoid function
The sigmoid (logit, logistic) function maps to the range [0, 1]!
That means the model is now:
𝜇(𝐱, 𝐰) = 1 / (1 + e^(−𝐰ᵀ𝐱))
Fun fact: The sigmoid function is sometimes lovingly called “squashing function”
Note: Here we already inserted the function f! Homework: What is the general form of the sigmoid function?
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 96
Bernoulli distribution
The Bernoulli distribution can model
both events (yes-or-no event): 1.0
p(y | 𝐱, 𝐰) = Ber(y | 𝜇(𝐱, 𝐰)) = 𝜇(𝐱, 𝐰)ʸ · (1 − 𝜇(𝐱, 𝐰))^(1−y)
(Examples of yes-or-no events: a vote for Mr. A or Mrs. B, heads or tails of a fair coin.)
Problem: How do we get the label, based on the calculated probability?
Note: We use the above distribution for the MLE estimation! Basically we replace “𝑝 𝐬𝑖 𝛉 ” in the log-likelihood with this!
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 97
The Decision Rule
Based on 𝜇(𝐱, 𝐰) we can decide which class is more "likely"!
The decision rule is:
y = 1, if 𝜇(𝐱, 𝐰) > 0.5
y = 0, if 𝜇(𝐱, 𝐰) ≤ 0.5
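A minimal sketch of the logistic model and this decision rule in NumPy; the weight vector below is a made-up assumption, not the one fitted to the iris data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w):
    """mu(x, w) = sigmoid(w^T x); label 1 if mu > 0.5, else 0."""
    mu = sigmoid(X @ w)
    return (mu > 0.5).astype(int), mu

X = np.array([[1.0, 9.2, 5.0],     # bias, petal length, petal width (mm)
              [1.0, 26.1, 9.2]])
w = np.array([-10.0, 0.3, 0.5])    # assumed weights for illustration
labels, mu = predict(X, w)
print("mu:", mu, "labels:", labels)
```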
ℓ(𝛉) = Σ_{i=1}^{N} log[p(yᵢ | 𝐱ᵢ, 𝛉)]
     = Σ_{i=1}^{N} [ yᵢ log 𝜇(𝐱ᵢ, 𝐰) + (1 − yᵢ) log(1 − 𝜇(𝐱ᵢ, 𝐰)) ]
• Unique minimum
• No analytical solution possible!
→ Optimization with gradient descent.
Idea: we find the minimum by "walking down" the slope of the mountain L(𝛉) over the weights (w₀, w₁).
The gradient for logistic regression:
∇L(𝜃ᵢ) = Σᵢ (𝜇(𝐱ᵢ, 𝐰) − yᵢ) 𝐱ᵢ
Update step: 𝜃ᵢ₊₁ = 𝜃ᵢ − 𝜂 · ∇L(𝜃ᵢ)
𝜂 is called the learning rate:
• If 𝜂 is too large → overshooting the minimum
• If 𝜂 is too small → the minimum is not reached
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 111
Example: Gradient descent
Algorithm:
1. Repeat for each iteration i:
2.   Calculate the loss L(𝜃ᵢ)
3.   Calculate the gradient ∇L(𝜃ᵢ)
4.   Update 𝜃ᵢ₊₁ = 𝜃ᵢ − 𝜂 · ∇L(𝜃ᵢ)
5.   i = i + 1
Repeat the process until the loss is minimal!
https://de.serlo.org/mathe/funktionen/wichtige-funktionstypen-ihre-eigenschaften/polynomfunktionen-beliebigen-grades/polynom
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 120
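A minimal sketch of gradient descent for logistic regression using the gradient from the slides, ∇L(𝛉) = Σᵢ (𝜇(𝐱ᵢ, 𝐰) − yᵢ)𝐱ᵢ. The six petal measurements are made-up values in the spirit of the earlier example, and the features are standardized so that a fixed learning rate works.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up petal (length, width) measurements in mm; label 0 = Setosa, 1 = Versicolor
features = np.array([[9.2, 5.0], [26.1, 9.2], [18.9, 7.7],
                     [32.1, 9.1], [15.5, 7.9], [39.0, 15.0]])
y = np.array([0, 1, 0, 1, 0, 1], dtype=float)

# standardize the features, then prepend the bias column x0 = 1
Z = (features - features.mean(axis=0)) / features.std(axis=0)
X = np.column_stack([np.ones(len(Z)), Z])

w, eta = np.zeros(3), 0.1
for _ in range(2000):
    mu = sigmoid(X @ w)
    w -= eta * (X.T @ (mu - y))      # grad L = sum_i (mu(x_i, w) - y_i) x_i

print("weights:", w)
print("predicted labels:", (sigmoid(X @ w) > 0.5).astype(int))
```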
Thank you for listening!
Machine Learning for Engineers
Overfitting and Underfitting
Bilder: TF / Malter
Overfitting and Underfitting
Optimal function vs. estimated function:
For M = 0 and M = 1 the function fails to model the data, as the chosen model is too simple (underfitting).
The prediction error is composed of a bias error and a variance error; it is smallest at an ideal complexity (i.e. the complexity with minimum testing loss).
Note: Complexity does not mean parameters! It means the mathematical complexity! i.e. the ability of the model to capture a relationship!
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 127
Complexity vs. Generalization Error
Bias: error induced by the simplifying assumptions of the model (high bias at low complexity, low bias at high complexity).
Variance: error induced by differing training data, i.e. how sensitive the model is to "noise".
The ideal model has both a low bias and a low variance!
1) We use this instead of the test set. This is important! You only touch the test set for the final evaluation!
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 132
Hyperparameter search methods
There exist multiple ways to find good hyperparameters…
… manual search (Trial-and-Error)
… random search
… grid search
… Bayesian methods
… etc.
Note: Basically every known optimization technique can be used. Examples: Particle Swarm Optimization, Genetic Algorithms, Ant Colony Optimization…
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 133
Hyperparameter search on the data splits
Typically you split the data into a training and a testing split; the hyperparameters must not be tuned on the test split¹⁾.
Solution: split the training data further into a train split and a validation split.
1) We assume the test data is already split and put away. A typical split of all the data is 80% train – 10% validation - 10% testing
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 138
K-Fold Cross Validation
Approach: Training data (90%) Test 10%
Split the training data into 𝑘-folds
(in our example 𝑘 = 5)
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5
When k equals the number of training samples, i.e. each fold contains a single sample, the process is called leave-one-out cross-validation (LOOCV).
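A minimal sketch of k-fold cross-validation for choosing a hyperparameter, assuming scikit-learn is available; the dataset and the candidate values of C are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

for C in (0.1, 1.0, 10.0):                       # candidate hyperparameter values
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(SVC(C=C), X, y, cv=cv)
    print(f"C = {C:4}: mean validation accuracy = {scores.mean():.3f}")
```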
Bilder: TF / Malter
Recap: Logistic Regression
Goal: Find a linear decision Classification
boundary separating the two classes
Label 𝒚 = Dog
Problem: Multiple linear decision
Feature 𝑥1
boundaries solve the problem with
equal accuracy
Question: Which one is the best?
Label 𝒚 = Cat D
A B C D
Feature 𝑥0
A B C
Feature 𝑥1
throughout the boundary
C always is closer to Dog than Cat
D almost touches Dog at the bottom
and Cat at the top Label 𝒚 = Cat D
Feature 𝑥0
A B C
Feature 𝑥1
terms, we say it has the largest
margin 𝑚 𝑚
𝑚
Updated Goal: Find linear decision
boundary with the largest margin. Label 𝒚 = Cat
Feature 𝑥0
B
The decision boundary is defined by 𝒘 ⋅ 𝒙 − b = 0, where 𝒘 is the normal vector.
Furthermore, by definition, the margin boundaries are given by
𝒘 ⋅ 𝒙 − b = +1
𝒘 ⋅ 𝒙 − b = −1
2/‖𝒘‖ is thus the distance between the two margins → maximize it
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 6
Mathematical Model for Classes
Two-class problem 𝑦1 , … , 𝑦𝑛 = ±1
𝑦 = +1
All training data 𝒙1 , … , 𝒙𝑛 needs to be 2
correctly classified outside margin 𝒘
𝒘
Feature 𝑥1
𝒘 ⋅ 𝒙𝑖 − 𝑏 ≥ 1 if 𝑦𝑖 = +1
𝒘 ⋅ 𝒙𝑖 − 𝑏 ≤ −1 if 𝑦𝑖 = −1 𝑦 = −1
𝑏
Due to the label choice, this can be 𝒘
simplified as Feature 𝑥0
𝑦𝑖 𝒘 ⋅ 𝒙𝑖 − 𝑏 ≥ 1
1. Maximizing the margin 2/‖𝒘‖ is equal to minimizing (1/2)‖𝒘‖²:
min_{𝒘,b} (1/2)‖𝒘‖²
2. Subject to no misclassification on the training data:
yᵢ(𝒘 ⋅ 𝒙ᵢ − b) ≥ 1
Bilder: TF / Malter
Optimization of Support Vector Machines
In the previous section, we derived the constrained optimization problem
for Support Vector Machines (SVMs):
1 2
min 𝒘
𝒘,𝑏 2
s. t. 𝑦𝑖 𝒘 ⋅ 𝒙𝑖 − 𝑏 ≥ 1
1) While not relevant for the exam, a more detailed derivation is available in C. Bishop, Pattern Recognition and Machine Learning. New York:
Springer, 2006, Appendix E.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 11
Primal versus Dual Formulation
This optimization problem is called the primal formulation, with solution 𝑝∗ :
𝑝∗ = min max ℒ 𝒘, 𝑏, 𝜶
𝒘,𝑏 𝜶
Since Slater’s condition holds for this convex optimization problem, we can
guarantee that 𝑝∗ = 𝑑 ∗ and solve the dual problem instead.
s.t. 𝛼ᵢ ≥ 0, Σ_{i=1}^{n} 𝛼ᵢ yᵢ = 0
This quadratic programming problem can be solved using sequential
minimal optimization, but which is not a topic of this lecture.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 14
The Karush–Kuhn–Tucker Conditions
The given problem fulfills the Karush-Kuhn-Tucker conditions1):
𝛼𝑖 ≥ 0
𝑦𝑖 𝑤 ⋅ 𝑥 − 𝑏 − 1 ≥ 0
𝛼𝑖 𝑦𝑖 𝑤 ⋅ 𝑥 − 𝑏 − 1 = 0
💡 When 𝛼𝑖 is non-zero, the training point is on the margin and thus a so-
called “support vector”.
1) While not relevant for the exam, a more detailed derivation is available in C. Bishop, Pattern Recognition and Machine Learning. New York:
Springer, 2006, Appendix E.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 15
Optimization Summary
The dual formulation leads to a quadratic programming problem on 𝜶:1)
max_𝜶 ℒ(𝒘, b, 𝜶) = Σ_{i=1}^{n} 𝛼ᵢ − (1/2) Σ_{i,j=1}^{n} yᵢ yⱼ 𝛼ᵢ 𝛼ⱼ (𝒙ᵢ ⋅ 𝒙ⱼ)
1) This is simplified for the purposes of this summary. For the additional constraints that are also required refer to slide 14
2) Remember the equation for 𝒘 resulting from the derivative in slide 13
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 16
Machine Learning for Engineers
Support Vector Machines – Non-Linearity and the Kernel Trick
Bilder: TF / Malter
Recap: Basis Functions
For linear regression (e.g. house prices in Springfield, USA) we transform the input with a basis function, e.g.
Φ(x) = (1, x, x², x³, x⁴, x⁵, x⁶)ᵀ
💡 Note how the basis function is applied at each of the dot products.
Most summands are thus zero. This property is called sparsity and greatly
simplifies the computations.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 21
Solving Memory Issues with Kernel Trick
Since the basis function 𝚽: ℝ𝑛 ↦ ℝ𝑚 is applied explicitly on each of the
points, the memory scales with the output dimensionality m.
ℒ(𝒘, b, 𝜶) = Σ_{i=1}^{n} 𝛼ᵢ − (1/2) Σ_{i,j=1}^{n} yᵢ yⱼ 𝛼ᵢ 𝛼ⱼ (𝚽(𝒙ᵢ) ⋅ 𝚽(𝒙ⱼ))
Instead, we can replace the inner product 𝚽(𝒙ᵢ) ⋅ 𝚽(𝒙ⱼ) by a kernel function K(𝒙ᵢ, 𝒙ⱼ).
The kernel function usually doesn’t require explicit computation of the basis
function. This method of replacement is called the kernel trick.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 23
The Linear Kernel
The kernel function is given by:
1
𝐾 𝒙𝑖 , 𝒙𝑗 = 2 𝒙𝑖 ⋅ 𝒙𝑗
2𝜎
• 𝜎 is a length-scale parameter
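A minimal sketch of the kernel trick in practice, assuming scikit-learn is available: the same SVM classifier is trained with different kernel functions on a dataset that is not linearly separable. The dataset and split are made up for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)   # not linearly separable
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)            # only the kernel changes
    print(f"{kernel:6s} kernel: test accuracy = {clf.score(X_test, y_test):.3f}")
```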
Bilder: TF / Malter
Hard Margin and Soft Margin
Up until now we have always implicitly
𝑦 = +1
assumed classes are linearly separable
2
⮩ So-called hard margin 𝒘
𝒘
Feature 𝑥1
What happens when the classes 𝜉2
overlap and there is no solution? 𝜉1
𝑦 = −1
⮩ So-called soft margin 𝑏
𝒘
Introduce slack variables 𝜉1 , 𝜉2 that Feature 𝑥0
move points onto the margin
Feature 𝑥1
• We know 𝑦𝑖 𝒘 ⋅ 𝒙𝑖 − 𝑏 ≥ 1 👍
𝜉1
• No correction, thus 𝜉𝑖 = 0 𝜉2
𝑦 = −1
𝑏
If incorrectly classified or inside margin
𝒘
• We know 𝑦𝑖 𝒘 ⋅ 𝒙𝑖 − 𝑏 < 1 👎 Feature 𝑥0
• Move by 𝜉𝑖 = 1 − 𝑦𝑖 (𝒘 ⋅ 𝒙𝑖 − 𝑏)
s.t. yᵢ(𝒘 ⋅ 𝒙ᵢ − b) ≥ 1 − ξᵢ
The optimization procedure and kernel trick can similarly be derived for
this, but we’ll spare you the details 😉.
Bilder: TF / Malter
Intuition behind Support Vector Regression
The Support Vector Machine is a Regression
method for classification.
Let us now turn to regression. 𝜖
Label 𝑦
With training data 𝒙1 , 𝑦1 , … , (𝒙𝑛 , 𝑦𝑛 ),
we want to find a function of form:
𝜖
𝑦 =𝒘⋅𝒙+𝑏
For Support Vector Regression, we want to keep everything in an ε-tube:
yᵢ ≤ 𝒘 ⋅ 𝒙ᵢ + b + ε
yᵢ ≥ 𝒘 ⋅ 𝒙ᵢ + b − ε
For outliers, we introduce slack variables ξ⁺ and ξ⁻:
yᵢ ≤ 𝒘 ⋅ 𝒙ᵢ + b + ε + ξᵢ⁺
yᵢ ≥ 𝒘 ⋅ 𝒙ᵢ + b − ε − ξᵢ⁻
We minimize the margin objective together with the slack variables ξᵢ⁺ and ξᵢ⁻:
min_{𝒘,b} (1/2)‖𝒘‖² + C Σ_{i=1}^{n} (ξᵢ⁺ + ξᵢ⁻)
C is a free parameter specifying the trade-off for outliers.
Subject to all points lying within the tube specified by ε (up to the slack variables):
yᵢ ≤ 𝒘 ⋅ 𝒙ᵢ + b + ε + ξᵢ⁺
yᵢ ≥ 𝒘 ⋅ 𝒙ᵢ + b − ε − ξᵢ⁻
ξᵢ⁺ ≥ 0
ξᵢ⁻ ≥ 0
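A minimal sketch of ε-tube Support Vector Regression, assuming scikit-learn is available; the sine-shaped data and the chosen values of C and ε are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.1, 80)

model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(x, y)   # epsilon = tube width
print("number of support vectors:", len(model.support_))
print("prediction at x = 2.5:", model.predict([[2.5]]))
```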
Disadvantages
Bilder: TF / Malter
Summary of Support Vector Machines
Support Vector Machine Advantages
• Elegant to optimize, since they have one local and global optimum
Bilder: TF / Malter
Application: Image Compression
We can save data more efficiently
by applying PCA!
Can be used for other data as well! 16 Components (93.62%) 8 Components (85.50%)
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 17
Application: Facial Recognition
Eigenfaces (1991) are a fast and efficient
way to recognize faces in a gallery of facial
images.
General Idea:
1. Generate a set of eigenfaces (see right)
based on a gallery
2. Compute the weight for each eigenface
based on the query face image
3. Use the weights to classify (and identify)
the person
Turk, M. and A. Pentland. “Face recognition using eigenfaces.” Proceedings. 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (1991): 586-591.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 18
Application: Anomaly Detection
Detecting anomalous traffic in IP Networks is crucial for administration!
General Idea:
Transform the time series of IP packages (requests, etc.) using a PCA
→ The first k (typically 10) components represent
“normal” behavior
→ The components “after” k, represent
anomalies and/or noise
Packages mapped primarily to the latter component are classified as
anomalous!
Anukool Lakhina, Konstantina Papagiannaki, Mark Crovella, Christophe Diot, Eric D. Kolaczyk, and Nina Taft. 2004. Structural analysis of network traffic flows. SIGMETRICS Perform. Eval.
Rev. 32, 1 (June 2004), 61–72. DOI:https://doi.org/10.1145/1012888.1005697
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 19
Limitation: The Adidas Problem
Let the data points be distributed like 𝑤2
the Adidas logo, with three implicit 𝑤1
classes (teal, red, blue stripe).
Feature 2
The principal directions 𝑤1 and 𝑤2
found by PCA are given.
The first principal direction 𝑤1 does
not preserve the information to
classify the points!
Feature 1
Bilder: TF / Malter
Intuition behind Principal Component Analysis
Consider the following data with Twin Heights
height of two twins ℎ1 and ℎ2 .
Now assume that
Twin height ℎ2
a) We can only plot one-dimensional
figures due to limitations1
b) We have complex models that
can only handle a single value1
Twin height ℎ1
1) These are obviously constructed limitations for didactical concerns. However, in real life you will also face limitations in visualization
(maximum of three dimensions) or model feasibility.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 2
Intuition behind Principal Component Analysis
How do we find a transformation from Twin Heights
two values to one value?
In this specific case we could
Twin height ℎ2
1
a) Keep the average height (ℎ1 +
2
ℎ2 ) as one value
b) Discard the difference in height
ℎ1 − ℎ2
Twin height ℎ1
Height difference
ℎ2 ) as one value
b) Discard the difference in height
ℎ1 − ℎ2
💡 This corresponds to a change in
basis or coordinate frame in the plot.
Average height
⮩ Rotation by matrix 𝑨
Height difference
Twin height ℎ2
Rotation
1 1 1
𝑨=
2 1 −1
Twin height ℎ1 Average height
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 5
Principal Component Analysis: Preserving Distances
💡 Idea: Preserve the distances Twin Heights
between each pair in the lower
dimension
Height difference
• Points that are far apart in the
original space remain far apart
• Points that are close in the original
space remain close
Average height
Height difference
• Find those axes and dimensions
Low variance
with high variance
• Discard the axes and dimensions
with increasingly low variance
High variance
Goal: Algorithmically find directions
with high/low variance in data1 Average height
1) In this trivial case we could hand-engineer such a transformation. However, this is not generally the case.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 7
Machine Learning for Engineers
Principal Component Analysis – Mathematics
Bilder: TF / Malter
Description of Input Data
Let 𝑿 be a 𝑛 × 𝑝 matrix of data Twin Heights
points, where
• 𝑛 is number of points (here 15)
Twin height ℎ2
• 𝑝 is number of features (here 2)
𝑿 = [1.75 1.77; 1.67 1.68; ⋮ ⋮; 2.01 1.98]
Twin height ℎ2
• 𝑝 is number of features (here 2)
The covariance matrix generalizes
the variance 𝜎 2 of a Gaussian
distribution 𝒩(𝜇, 𝜎 2 ) to higher
dimensions. Twin height ℎ1
1) Technically the covariance matrix of the transposed data points 𝑿𝑇 after centering to zero mean.
15.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 10
Description of Desired Output Data
We’re interested in new axes that Twin Heights
rotate the system, here 𝑤1 and 𝑤2
These are columns of a matrix
𝑾 = [0.7 0.7; 0.7 −0.7] with columns 𝑤₁, 𝑤₂, and eigenvalues 𝜆₁ = 2, 𝜆₂ = 1
Twin height ℎ2
𝜆2
𝑤2 𝜆1
• 𝑾 is 𝑝 × 𝑝 matrix, where each 𝑤1
column is an eigenvector
• 𝑳 is a 𝑝 × 𝑝 matrix, with all
eigenvalues on diagonal in
decreasing order Twin height ℎ1
Twin height ℎ2
𝜆2
𝑿= 𝑼𝑺𝑾𝑇 𝑤2 𝜆1
𝑤1
• 𝑼 is 𝑛 × 𝑛 unitary matrix
• 𝑺 is 𝑛 × 𝑝 matrix of singular values
𝜎1 , … , 𝜎𝑝 on diagonal
Twin height ℎ1
• 𝑾 is 𝑝 × 𝑝 matrix of singular
vectors 𝒘1 , … , 𝒘2
𝑿ᵀ𝑿 = (𝑼𝑺𝑾ᵀ)ᵀ (𝑼𝑺𝑾ᵀ) = 𝑾𝑺ᵀ𝑼ᵀ𝑼𝑺𝑾ᵀ = 𝑾𝑺²𝑾ᵀ
• Eigenvalues correspond to
squared singular values 𝜆𝑖 = 𝜎𝑖2 Twin height ℎ1
• Eigenvectors directly correspond
to singular vectors
15.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 14
Summary of Algorithm
1. Find principal directions 𝑤1 , … 𝑤𝑝 Twin Heights
and eigenvalues 𝜆1 , … , 𝜆𝑝
Twin height ℎ2
2. Project data points into new
coordinate frame using 𝑤2
𝑤1
𝑻 = 𝑿𝑾
3. Keep the 𝑞 most important
dimensions as determined by
𝜆1 , … 𝜆𝑞 (which are sorted by Twin height ℎ1
variance)
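A minimal sketch of these three steps using the SVD in NumPy; the twin-height data is randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
h1 = rng.normal(1.75, 0.1, 15)
X = np.column_stack([h1, h1 + rng.normal(0, 0.02, 15)])   # two strongly correlated heights

Xc = X - X.mean(axis=0)                   # center the data
U, S, Wt = np.linalg.svd(Xc, full_matrices=False)
eigenvalues = S ** 2 / (len(X) - 1)       # eigenvalues of the covariance matrix

T = Xc @ Wt.T                             # project into the new coordinate frame
T_reduced = T[:, :1]                      # keep the q = 1 most important dimension
print("principal directions W:\n", Wt.T)
print("variance along each direction:", eigenvalues)
```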
Bilder: TF / Malter
Summary of Principal Component Analysis
• Rotate coordinate system such that all axes are sorted from most
variance to least variance
• Required axes 𝑾 determined using either
• Eigenvectors and –values of covariance matrix 𝑪 = 𝑾𝑳𝑾𝑇
• Singular Value Decomposition (SVD) of data points 𝑿 = 𝑼𝑺𝑾𝑇
Bilder: TF / Malter
The Human Brain
The human brain is our reference for
an intelligent agent, that
a) … contains different areas
specialized for some tasks (e.g.,
the visual cortex)
b) … consists of neurons as the
fundamental unit of “computation”
Input 𝑥1
Input 𝑥2 Output 𝑦1
Input 𝑥3
Input 𝑥1
⋅ 𝑤1
Input 𝑥2 Output 𝑦1
⋅ 𝑤2
Input 𝑥3
⋅ 𝑤3
Input 𝑥1
⋅ 𝑤1
Input 𝑥2 Output 𝑦1
⋅ 𝑤2 +
Input 𝑥3
⋅ 𝑤3 𝑏
Input 𝑥1
⋅ 𝑤1
Input 𝑥2 Output 𝑦1
⋅ 𝑤2 +
Input 𝑥3
⋅ 𝑤3 𝑏 1
𝜎 𝑥 =
1 + 𝑒 −𝑥
y₁ = σ(w₁·x₁ + w₂·x₂ + w₃·x₃ + b) = σ(Σᵢ wᵢ·xᵢ + b)
Input 𝑥1
⋅ 𝑤1
Input 𝑥2 Output 𝑦1
⋅ 𝑤2 +
Input 𝑥3
⋅ 𝑤3 𝑏 1
𝜎 𝑥 =
1 + 𝑒 −𝑥
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 8
The Perceptron – Signaling Mechanism
Given a perceptron with parameters
𝑤1 = 4, 𝑤2 = 7, 𝑤3 = 11, 𝑏 = −10 and equation
𝑦1 = 𝜎 𝑤1 ⋅ 𝑥1 + 𝑤2 ⋅ 𝑥2 + 𝑤3 ⋅ 𝑥3 + 𝑏
σ(x) = 1 / (1 + e⁻ˣ)     σ(x) = tanh(x)     σ(x) = max(x, 0)
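A minimal sketch of this perceptron in NumPy, evaluated with the three activation functions shown above; the input vector is a made-up assumption.

```python
import numpy as np

w, b = np.array([4.0, 7.0, 11.0]), -10.0   # parameters from the slide
x = np.array([1.0, 0.0, 0.5])              # assumed example input

z = w @ x + b                              # weighted sum plus bias
for name, act in [("sigmoid", lambda z: 1 / (1 + np.exp(-z))),
                  ("tanh", np.tanh),
                  ("relu", lambda z: np.maximum(z, 0))]:
    print(f"{name:7s}: y1 = {act(z):.4f}")
```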
Bilder: TF / Malter
The Perceptron – A Recap
In the last section we learned about the perceptron, a computational model
representing a neuron.
Inputs: x₁, …, xₙ
Output: y₁
Computation: y₁ = σ(Σ_{i=1}^{n} wᵢ·xᵢ + b)
Combining several perceptrons into a layer gives the vectorized form 𝒚 = σ(𝑾 ⋅ 𝒙 + 𝒃).
Bilder: TF / Malter
How do our models learn?
• Up until now, the parameters 𝑾𝑖 and 𝒃𝑖 of the multilayer perceptron
were assumed as given
• We now aim to learn the parameters 𝑾𝑖 and 𝒃𝑖 based on some
example input and output data
• Let us assume we have a dataset of 𝒙 and corresponding 𝒚
1.2 0.2
3.2 0.2
𝒙1 , 𝒚1 = (1.4, ) and 𝒙2 , 𝒚2 = (0.4, ) and …
0.2 0.2
1.3 0.3
1.2 = 𝑥1
?
𝑦1 = 2.9 ֞ 3.2
1.4 = 𝑥2
?
𝑦2 = 0.2 ֞ 0.2
1.3 = 𝑥3
𝑾1 , 𝒃1 𝑾2 , 𝒃2
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 20
The Loss Function
• ෝ𝑖 and the
We need a comparison metric between the predicted outputs 𝒚
expected outputs 𝒚𝑖
• This is called the loss function, which usually depends on the type of
problem the multilayer perceptron solves
• For regression, a common metric is the mean squared error
• For classification, a common metric is the cross entropy
∇𝜽 ← Backward Pass
1.2 = 𝑥1
?
𝑦1 = 2.9 ֞ 3.2
1.4 = 𝑥2
?
𝑦2 = 0.2 ֞ 0.2
1.3 = 𝑥3
Forward Pass → 𝜽
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 25
Machine Learning for Engineers
Deep Learning – Gradient Descent
Bilder: TF / Malter
Parameter Optimization
• At this stage, we can compute the gradient of the parameters ∇𝜽
• The gradient tells us how we need to change the current parameters 𝜽
in order to make fewer errors on the given data
• We can use this in an iterative algorithm called gradient descent, with
the central equation being
𝜽𝑖+1 = 𝜽𝑖 − 𝜇 ⋅ ∇𝜽𝑖
• The learning rate 𝜇 tells us how quickly we should change the current
parameters 𝜽𝑖
⮩ Let us have a closer look at an illustrative example 💡
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 27
Gradient Descent – An Example
• Let us assume that the error function is
𝑓 𝜃 = 𝜃2
• The gradient of the error function is the
derivative 𝑓 ′ 𝜃 = 2 ⋅ 𝜃
• We aim to find the minimum error at
the location 𝜃 = 0
• Our initial estimate for the location is
𝜃1 = 2
• The learning rate is 𝜇 = 0.25
Global Local
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 34
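A minimal sketch of this worked example in Python: the error function f(θ) = θ², its derivative 2θ, the start value θ = 2 and the learning rate μ = 0.25.

```python
theta, mu = 2.0, 0.25
for i in range(1, 6):
    grad = 2 * theta                 # f'(theta) for f(theta) = theta**2
    theta = theta - mu * grad        # gradient descent update
    print(f"iteration {i}: theta = {theta:.4f}, f(theta) = {theta ** 2:.4f}")
```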
Gradient Descent – A Summary
• Gradient Descent incrementally adjusts the parameters 𝜽 based on the
gradient ∇𝜽 of the parameters
1. For each iteration
2. Compute the error of the parameters 𝜽
3. Compute the gradient of the parameters ∇𝜽
4. Update parameters using 𝜽𝑖+1 = 𝜽𝑖 − 𝜇 ⋅ ∇𝜽𝑖
Bilder: TF / Malter
How do our models learn?
• We’ve just talked about the concept of gradient descent, but largely
avoided the details for the function 𝑓(𝜽)1)
• The goal is to minimize the error or loss function across the complete
dataset 𝒙1 , 𝒚1 , … , (𝒙𝑛 , 𝒚𝑛 )
$$f(\boldsymbol{\theta}) = \sum_{i=1}^{n} L(\hat{\boldsymbol{y}}_i, \boldsymbol{y}_i) = \sum_{i=1}^{n} L(g(\boldsymbol{x}_i, \boldsymbol{\theta}), \boldsymbol{y}_i)$$
1) In the previous section the function was $f(\theta) = \theta^2$. For the multilayer perceptron, the function is of course different.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 37
How do our models learn?
• We thus need to loop over the training data 𝒙1 , 𝒚1 , … , (𝒙𝑛 , 𝒚𝑛 ) to
compute sums over individual data points
• This results in a slight modification to the gradient descent algorithm
1. For each epoch – Loop over training data
2. For each batch – Loop over pieces of training data
3. Compute the error of the parameters 𝜽
4. Compute the gradient of the parameters ∇𝜽
5. Update parameters using $\boldsymbol{\theta}_{i+1} = \boldsymbol{\theta}_i - \mu \cdot \nabla\boldsymbol{\theta}_i$
1) Note how we need 4 batches to go through all of the training data in this specific example.
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 40
The Learning Process – A Summary
• The parameters 𝜽 are optimized by iterating through the training data
• For practical reasons, this is split into epochs and batches
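As a rough illustration of the loop structure described above (not the actual multilayer-perceptron training code), here is a schematic sketch; compute_loss_and_gradient is a hypothetical helper that stands in for the forward and backward pass:
def train(theta, X, Y, compute_loss_and_gradient, epochs=10, batch_size=32, mu=0.01):
    # Gradient descent organized into epochs (full passes) and batches (pieces of the data)
    n = len(X)
    for epoch in range(epochs):                  # loop over the training data
        for start in range(0, n, batch_size):    # loop over pieces of the training data
            x_batch = X[start:start + batch_size]
            y_batch = Y[start:start + batch_size]
            # compute the error and the gradient of the parameters on this batch
            loss, grad_theta = compute_loss_and_gradient(theta, x_batch, y_batch)
            # update the parameters
            theta = theta - mu * grad_theta
    return theta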
Bilder: TF / Malter
The Visual Cortex
We’ve previously learned that the
brain has specialized regions.
• The visual cortex is in charge of
processing visual information
collected from the retinae
• However, each cortical cell only responds to stimuli within a small receptive field
[Figure: a standard neuron compared with a cortical neuron]
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 44
The Convolution Operation
The output computation now only depends on a subset of inputs:
$$y_{11} = w_{11}x_{11} + w_{12}x_{12} + w_{13}x_{13} + w_{21}x_{21} + w_{22}x_{22} + w_{23}x_{23} + w_{31}x_{31} + w_{32}x_{32} + w_{33}x_{33}$$
[Figure: the $3\times 3$ kernel $\boldsymbol{w}$ slides over the input $\boldsymbol{x}$; each output value $y_{ij}$ is the weighted sum over the corresponding $3\times 3$ patch of the input, $\boldsymbol{x} \star \boldsymbol{w} = \boldsymbol{y}$]
When several kernels are applied, the different outputs are concatenated in the channel dimension.
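A minimal NumPy sketch of this sliding-window computation for a single channel (valid padding, stride 1); the example input and kernel are made up for illustration:
import numpy as np

def conv2d(x, w):
    # Slide the kernel w over the input x and compute one weighted sum per position
    kh, kw = w.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            y[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return y

x = np.arange(25, dtype=float).reshape(5, 5)   # example 5x5 input
w = np.ones((3, 3)) / 9.0                      # example 3x3 averaging kernel
print(conv2d(x, w))                            # 3x3 output, as in the figure above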
Bilder: TF / Malter
Summarizing Features via Pooling
• Deeper in the network the number of features quickly grows
• It makes sense to summarize these features deeper in the network
• From a practical perspective, this reduces the number of computations
• From a theoretical perspective, these higher-level (i.e., global scale)
features don’t require a high spatial resolution
• This process is called pooling and works like a convolution, with the
kernel replaced by a pooling operation
[Figure: Input Image → Convolution 1 → Pooling 1 → Convolution 2 → Pooling 2]
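A minimal sketch of 2×2 max pooling, one common choice of pooling operation; the example feature map is made up for illustration:
import numpy as np

def max_pool2d(x, pool=2):
    # Summarize each non-overlapping pool x pool block by its maximum value
    h, w = x.shape[0] // pool, x.shape[1] // pool
    return x[:h * pool, :w * pool].reshape(h, pool, w, pool).max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 0],
                        [4, 6, 1, 2],
                        [0, 2, 5, 7],
                        [1, 3, 8, 4]], dtype=float)
print(max_pool2d(feature_map))  # [[6., 2.], [3., 8.]]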
13.10.2021 | Prof. Bjoern Eskofier | MaD | Introduction to Machine Learning 63
Machine Learning for Engineers
Deep Learning – Applications
Bilder: TF / Malter
Full Image Classification
• Task: Plant leaf disease classification
of 14 different species
• Input: Image of plant leaf
• Output: Plant and disease class
[Example images: grape with black measles; potato with late blight]
• Model: EfficientNet
• Ü. Atila, M. Uçar, K. Akyol, and E. Uçar, “Plant leaf disease classification using EfficientNet deep learning model”, 2020.
1. Introduction
In this exercise we implement a supervised learning procedure, including the necessary pre- and post-processing steps, for the use case of energy prediction in a machining process.
Our goal is to use the available data to train different regression models that predict the target variable. In our use case we want to predict the energy required to perform a milling process.
Based on the planned process parameters, the energy required for the milling process is to be forecast. As a basis for the development of a regression model, tests were carried out on a milling machine to obtain sufficient training data.
1.3 Deliverables
To complete this exercise successfully, you need to provide certain results. Throughout the notebook you will find questions you need to
answer, and coding tasks where you need to modify existing code or fill in blanks. The answers to the questions need to be added in the
prepared Your answer markdown fields. Coding tasks can be solved by modifying or inserting code in the cells below the task. If necessary, you
can add extra cells to the notebook, as long as the order of the existing cells remains unchanged. Once you are finished with the lab, you can
submit it through the procedure described in the forum. Once the labs are submitted, you will receive feedback regarding the questions. Thus,
the Feedback boxes need to be left empty.
Example:
Your answer: Have a look at the forum, maybe some of your peers already experienced similar issues. Otherwise start a new
discussion to get help!
Feedback: This is a great approach! Besides asking in the forum I'd also suggest asking your tutor.
Solution: The correct solution for the question. The solution will be provided after the review process.
1.4 Resources
If you are having issues while completing this lab, feel free to post your questions in the forum. Your peers as well as our teaching advisors will
screen the forum regularly and answer open questions. This way, the information is available for fellow students encountering the same issues.
Note: Here we also want to promote the work with online resources and thus the independent solution of problems. For some tasks you have to
"google" for the appropriate method yourself. For other tasks the already given code must be adapted or just copied.
Axis
Feed [mm/min]
Path [mm]
Energy requirement - Target variable [kJ]
There are different options to import a data set into Google Colab. You can either import/upload from Google Drive or from your own HDD. In
this Notebook the Google Drive folder is used.
For this purpose it is necessary to connect your Google Drive to this Notebook. Execute the following cell and follow the instructions.
Mounted at /content/drive
# specification of the path to the input data (this path may vary for you depending on where you have your data file)
df = pd.read_csv(r'./drive/MyDrive/ML4Eng I/ML4Eng_I_Exercise_Pipeline_and_Regression/ML4Eng_I_dataset_energy_measurement.txt')
# if the link doesn't work, you'll need to adjust it depending on where you have stored the dataset in your Google Drive.
Feedback: ...
Solution: The above code gives us descriptive statistics of the pandas dataframe, which summarize the central tendency,
dispersion and shape of a dataset’s distribution, excluding null values. Now we try to learn more about the data in order to
understand whether we need to make any changes, e.g., remove outliers, fill in missing values, etc.
1. Count: The total number of entries in that particular column. Here we see that the different features have a different count
indicating some missing values in the dataset.
2. Mean: The arithmetic mean (or simply mean) of a list of numbers is the sum of all the numbers divided by the number of
numbers.
3. Standard deviation: A measure of the dispersion or variation in a distribution or set of data
4. Minimum: The smallest value in that particular column. For the attribute "Axis" a value of -5 is listed, indicating
an outlier.
5. 25%: The values corresponding to 25% percentile of the dataset
6. 50%: The values corresponding to 50% percentile of the dataset. The 50 % percentile is the same as the median.
7. 75%: The values corresponding to 75% percentile of the dataset
8. Maximum: The largest value in that particular column. For the attribute "Axis" a value of 15 is listed, indicating
an outlier.
More information can be found here - https://pandas.pydata.org/pandas-
docs/stable/reference/api/pandas.DataFrame.describe.html
Task: Search and implement a (simple) method in order to show the first rows of the dataset. Try to use Google/ the
documentation to find an appropriate one.
Feedback: ...
#############################
# Please add your code here #
#############################
###
# Solution
###
df.head(5)
###
# END Solution
###
3. Visualization
In the following, all attributes of the data set are plotted.
Feedback: ...
Solution:
1. Identify potential patterns in the data that can help us understand the data.
2. Clarify which factors influence our target variable the most.
3. Helps us fix the dataset in case there are outliers and missing values.
4. Help us to decide which models to use to successfully predict the target variable.
Task: Make some changes to the plot below. For example, you can adjust the number of bins, the axis labels or the color of
the plot.
Feedback: ...
Task: Visualize the last attribute "Feed" analogously to the previous ones. You can copy most of the code; nevertheless, you
have to make some adjustments.
Feedback: ...
#############################
# Please add your code here #
#############################
###
# Solution
###
%matplotlib inline
plt.figure(figsize=(10,7))
plt.hist(df.Feed,bins =100, range = (df.Feed.min(), df.Feed.max()))
plt.xlabel('Feed [mm/min]')
plt.ylabel('Total quantity')
###
# END Solution
###
Missing values and outliers have to be detected and dealt with in order to prepare the data set for the following steps.
# Before we deal with missing values we visualize the first 10 instances of the data set.
# Missing values can be recognized here as NaN.
df.head(10)
Axis Feed Path Energy_Requirement
Task: Use the dropna() function to remove all rows with missing values. Use the documentation if you need further
information about this method.
Feedback: ...
# We drop all rows with missing values using the 'dropna function'
#############################
# Please add your code here #
#############################
###
# Solution
###
df = df.dropna(subset=['Axis', 'Feed', 'Path'])
###
# END Solution
###
# After the removal of missing values we visualize the first 10 instances of the data set again.
df.head(10)
We can see that the rows 5, 6, 7 and 8 have been dropped as they contained some missing values (NaN).
Since we have performed the tests for the independent variables (feed, axis and path) we know the range of these values.
1. Axis: 1 to 3
2. Path: -60 to 60 [mm]
3. Feed: 500 to 3000 [mm/min]
All values outside these ranges are outliers; the corresponding instances should therefore be deleted.
Question: Why is it important to remove outliers from our dataset?
Feedback: ...
Solution: Outliers can provide information to our regression model that is different from the information provided by the rest of
the dataset. By removing outliers the regression model will perform better as it only learns the essential information of the
dataset.
# Before removing the outliers we analyze our data set with the describe() method.
df.describe()
Task: Complete the following code line to remove all outliers for the attribute "Feed".
Feedback: ...
# Values of features outside the range of the known to us are treated as 'Outliers' and removed
# We only include those values of the feature that lie in the particular ranges of the feature
df = df.loc[(df.Axis >= 1) & (df.Axis <= 3) &
(df.Path >= -60) & (df.Path <= 60) &
#############################
# Please add your code here #
#############################
###
# Solution
###
(df.Feed >= 500) & (df.Feed <= 3000)
###
# END Solution
###
]
Feedback: ...
Solution: Some of the changes that can be observed in the dataset are,
1. Count: We notice that all the features now have the same count or same number of entries meaning there are no missing
values.
2. Mean and standard deviation: We notice that mean and standard deviation values have changed in the dataset indicating
that outliers and missing value problems have been resolved.
3. Minimum and Maximum values: We notice that the minimum and maximum values for the features have changed and are
now within the defined range, again indicating that outliers have been removed.
1. Training dataset: the training dataset is used to determine the model's parameters based on the data it has seen. Here, the labels are
provided to the model so that it can learn potential patterns in the data and thus adjust its parameters in such a way that it can
predict new data points.
2. Test dataset: This dataset is used to test the performance of the model. It can be used to see if the model is able to perform well on data
which it has never seen before.
3. Target variable: Further we have to separate the target variable from the other attributes.
# We separate the features (axis, feed, distance) and store it in 'X_multi' then we store our target varible (energy) in 'Y_target' to t
X_multi = df.drop('Energy_Requirement', axis=1)
Y_target = df['Energy_Requirement']
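The split itself happens in a cell that is not shown here; a minimal sketch of how it could look (the exact split parameters used in the notebook are an assumption):
from sklearn.model_selection import train_test_split

# Split features and target into training and test sets (an 80/20 split is assumed here)
X_train, X_test, Y_train, Y_test = train_test_split(X_multi, Y_target, test_size=0.2, random_state=42)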
# Checking the shapes of the datasets so that we dont wrongly fit the data
print(X_train.shape, Y_train.shape, X_test.shape, Y_test.shape)
Question: What is the default setting of the train_test_split() method regarding the distribution of the training and test
dataset? Use the documentation for this method! And what are the advantages/disadvantages of different distributions (big training
dataset and small test dataset, or equally sized datasets)?
Solution:
1. Default setting: If neither train_size nor test_size is specified, train_test_split() uses test_size=0.25, i.e., a 75/25 split. In
practice an 80/20 ratio is common, meaning 80% of the data is used to train the model and its performance is tested on the remaining 20%.
2. Big training dataset, small test dataset: By using a larger training set we ensure that the model captures the patterns in the
data and hence it can perform well. But we do not have enough test data to properly assess the performance of the model.
3. Same size of training and test dataset: In this case we risk that the model cannot learn the patterns in the data properly,
as it is trained on only half of the dataset. Such a distribution is normally not used.
6. Linear Regression
Now we can start to use machine learning algorithms to predict the required energy. For that we carry out the following steps:
These steps are similar for the implementation of different regression algorithms.
from sklearn.linear_model import LinearRegression  # assumed import; may already be available from an earlier cell

lreg = LinearRegression()
lreg.fit(X_train, Y_train)
LinearRegression()
6.3 Prediction
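The prediction cell is not shown in this excerpt; a minimal sketch of this step, producing the pred_train and pred_test variables used below:
# Predict the energy requirement for the training and the test data
pred_train = lreg.predict(X_train)
pred_test = lreg.predict(X_test)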
6.4 Calculate the different losses (Mean Absolute Error, Mean Square Error)
Task: Print the calculated losses for the test data with the print() method and write a short explanatory text.
Feedback: ...
# Training data
MSE_linear_Train_Data = mean_squared_error(Y_train, pred_train)
MAE_linear_Train_Data = mean_absolute_error(Y_train, pred_train)
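The test-data errors printed in the solution below are presumably computed analogously to the training-data errors above; a minimal sketch:
from sklearn.metrics import mean_squared_error, mean_absolute_error  # assumed imports; may already be available

# Test data
MSE_linear_Test_Data = mean_squared_error(Y_test, pred_test)
MAE_linear_Test_Data = mean_absolute_error(Y_test, pred_test)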
###
# Solution
###
print("\n""The Mean Square Error on the test data is:", MSE_linear_Test_Data)
print("The Mean Absolute Error on the test data is:", MAE_linear_Test_Data)
###
# END Solution
###
Feedback: ...
Solution:
1. Mean Square Error (MSE): The Mean Square Error (MSE) loss function penalizes the model for making large errors by
squaring them. Squaring a large quantity makes it even larger, thus helping it to identify the errors but this same property
makes the MSE cost function sensitive to outliers.
2. Mean Absolute Error (MAE): The Mean Absolute Error (MAE) cost is less sensitive to outliers compared to MSE.
3. Results: MAE and MSE are usually higher for the test dataset than for the training dataset. However, in our case they are
almost similar. The reason can be the distribution of the data set. This is not yet sufficient to evaluate whether the model is good
enough for the planned application. On the one hand, an assessment or a target value must be determined by a domain
expert. On the other hand, an analysis of the average percentage deviation would be helpful.
1. A residual value is a measure of how much a regression line vertically misses a data point.
2. In a residual plot the residual values are on the vertical axis and the horizontal axis displays the independent variable.
3. Ideally, residual values should be equally and randomly distributed around the horizontal axis.
More on the parameters and information of this regressor can be found here -
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
1. n_estimators : This is the number of trees you want to build before taking the maximum voting or average of predictions. A higher number
of trees can increase the performance but makes training slower.
2. random_state : This parameter makes a solution easy to replicate. A fixed value of random_state will always produce the same results
given the same parameters and training data.
RandomForestRegressor(n_estimators=1000, random_state=42)
7.3 Prediction
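The fitting and prediction cells are not shown in this excerpt; a minimal sketch producing the rf, rf_pred_train and rf_pred_test variables used below (the constructor parameters match the output shown above):
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(X_train, Y_train)
rf_pred_train = rf.predict(X_train)
rf_pred_test = rf.predict(X_test)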
7.4 Calculate the different losses (Mean Absolute Error, Mean Square Error)
Task: Complete the following code lines in order to calculate the results analogously to the Linear Regression!
Feedback: ...
# Training data
MSE_rf_Train_Data = mean_squared_error(Y_train, rf_pred_train)
MAE_rf_Train_Data = mean_absolute_error(Y_train, rf_pred_train)
#############################
# Please add your code here #
#############################
###
# Solution
###
MSE_rf_Test_Data = mean_squared_error(Y_test, rf_pred_test)
MAE_rf_Test_Data = mean_absolute_error(Y_test, rf_pred_test)
###
# END Solution
###
Feedback: ...
#############################
# Please add your code here #
#############################
###
# Solution
###
plt.title('Residual plot for Random Forest Regressor')
plt.xlabel("Energy_Requirement - Target variable")
plt.ylabel("Residual value")
###
# END Solution
###
Text(0, 0.5, 'Residual value')
Here we see that the model performs well on both the training and the test dataset, as the blue and red points lie fairly close to
the horizontal line and are evenly distributed.
More information on the parameters and kernels used in the SVR can be found here - https://scikit-
learn.org/stable/modules/generated/sklearn.svm.SVR.html
Task: Implement all the steps used for the Linear Regression model (6.1 to 6.5) to the given Support Vector Regressor. Use the
"rbf" kernel. Hint: You just have to copy and partly adapt the existing code!
Feedback: ...
#############################
# Please add your code here #
#############################
###
# Solution
###
from sklearn.svm import SVR
###
# END Solution
###
Feedback: ...
#############################
# Please add your code here #
#############################
###
# Solution
###
svr_regressor = SVR(kernel='rbf')
svr_regressor.fit(X_train, Y_train)
###
# END Solution
###
SVR()
8.3 Prediction
Feedback: ...
#############################
# Please add your code here #
#############################
###
# Solution
###
svr_pred_train = svr_regressor.predict(X_train)
svr_pred_test = svr_regressor.predict(X_test)
###
# END Solution
###
8.4 Calculate the different losses (Mean Absolute Error, Mean Square Error)
Feedback: ...
#############################
# Please add your code here #
#############################
###
# Solution
###
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
# Training data
MSE_svr_Train_Data = mean_squared_error(Y_train, svr_pred_train)
MAE_svr_Train_Data = mean_absolute_error(Y_train, svr_pred_train)
###
# END Solution
###
Feedback: ...
#############################
# Please add your code here #
#############################
###
# Solution
###
%matplotlib inline
plt.figure(figsize=(10,7))
train = plt.scatter(svr_pred_train, (svr_pred_train - Y_train), c='b', alpha=0.5)
test = plt.scatter(svr_pred_test, (svr_pred_test - Y_test), c='r', alpha=0.5)
plt.hlines(y=0, xmin=-0.5, xmax=0.5)
plt.legend((train, test), ('Training', 'Test'),loc='lower left')
plt.title('Residual plot for Support Vector Regressor')
plt.xlabel("Energy_Requirement - Target variable")
plt.ylabel("Residual value")
###
# END Solution
###
plt.xlabel('Error')
plt.ylabel('Error values')
plt.title('Performance of different regression models on the test data')
plt.legend(loc="upper left")
plt.show()
Question: Explain the results you obtained and choose the best model.
Your Answer: TBD
Feedback: ...
Solution: Here we see the performance of the different regression models on the test dataset. Some of the points that can be
observed are,
1. The Random Forest Regressor has almost zero error on the test dataset. This shows that the model has correctly learned to
predict future values.
2. By comparing our different models we can see that the Random Forest Regressor has the highest accuracy, followed by the Linear
Regressor and the Support Vector Regressor.
3. The accuracy can also be seen by the use of Residual Plots, these plots show us how much the regression line misses the
data points and helps us get a better understanding of whether the model can be further improved.
4. It can be said that the Random Forest Regressor does a good job of learning the patterns of the training dataset and applying
them to the test dataset to achieve a high accuracy. The remaining models do not achieve this accuracy as they may not be
able to learn the patterns effectively enough.
Therefore we choose the Random Forest Regressor.
Task: Predict the energy requirement for the given production settings using your best model and the predict() method.
Feedback: ...
#############################
# Please add your code here #
#############################
###
# Solution
###
print("Predicted Energy Requirement for setting 1 is", rf.predict([[2, 800, 60]]),"kJ.")
print("Predicted Energy Requirement for setting 2 is", rf.predict([[3, 2000, 40]]),"kJ.")
print("Predicted Energy Requirement for setting 3 is", rf.predict([[1, 1200, -20]]),"kJ.")
#You can also save the data points within a variable first.
settings = [[2, 800, 60], [3, 2000, 40], [1, 1200, -20]]  # renamed to avoid shadowing the built-in 'set'
print("Predicted Energy Requirement for all settings is", sum(rf.predict(settings)), "kJ.")
###
# END Solution
###
/usr/local/lib/python3.7/dist-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but RandomForestRegre
"X does not have valid feature names, but"
/usr/local/lib/python3.7/dist-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but RandomForestRegre
"X does not have valid feature names, but"
/usr/local/lib/python3.7/dist-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but RandomForestRegre
"X does not have valid feature names, but"
Predicted Energy Requirement for setting 1 is [0.05394835] kJ.
Predicted Energy Requirement for setting 2 is [0.28709159] kJ.
Predicted Energy Requirement for setting 3 is [0.02084422] kJ.
Predicted Energy Requirement for all settings is 0.3618841639999991 kJ.
/usr/local/lib/python3.7/dist-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but RandomForestRegre
"X does not have valid feature names, but"
Question: What is the difference between parameters and hyperparameters of a model? Name an example for both.
Your Answer: TBD
Feedback: ...
Solution: A model parameter is a configuration variable which is internal to the model and whose value can be estimated from data.
• They are the result of the training process.
• They are required by the model when making predictions.
• Example Random Forest: the threshold value chosen at each internal node of the decision tree(s)
A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated from data.
• They define the training process.
• They can often be set using heuristics / They are often tuned for a given predictive modeling problem.
• Example Random Forest: the number of decision trees
Exercise 3
Introduction
In the production of electrical drives, high product quality is required. As the electric drive production industry is confronted with trends such
as electric mobility and continuing industrial automation, efficient and flexible processes are needed more than ever. With current quality
monitoring technology, accurate quality checking is not feasible.
Electrical motors mainly consist of the rotor, the stator and the surrounding housing. The production process can be separated into multiple
sub-processes, which can be seen below. The exact sequence of these steps however depends on the motor type. First, the individual
components are manufactured and assembled into subassemblies such as the rotor and the stator. Finally, all components (the housing, the
stator, the rotor as well as bearings and end shields) are assembled and the motor is checked in an end-of-line (EOL) test.
This final assembly is of great importance, as all parts need to be assembled in the correct way, to ensure smooth operation. Therefore, a
quality monitoring system is needed, raising alarm if assembly errors are detected. However, especially in lot-size one production, traditional
computer vision systems might reach their limits and cannot be used anymore.
Thus, in this lab we will build a smart quality monitoring system for the electric drives production. An already existing visual sensor captures
images of the electric motor after assembly. These images show the part from the top, as well from the side perspective. It is now the target to
decide whether the motor is fully assembled, or whether one of multiple defects is present. There is data from three different defects available:
missing cover, missing screw and not screwed. Examples of these defects can be seen below. To achieve this, we will investigate two different
machine learning models: Support Vector Machines (SVM) and Convolutional Neural Networks (CNN).
Further background information can be found in this paper: Mayr et al., Machine Learning in Electric Motor Production - Potentials, Challenges
and Exemplary Applications
Outline
This lab is structured into two main parts. In the first part, a subset of the problem will be analyzed step by step. Here, only images from the top
view are used and only two of the three defects, missing cover and missing screw, are considered. Your task will be to follow along,
fill in missing gaps, and answer problems throughout the notebook.
In the second part, you are tasked to expand the quality monitoring system to also detect the defect not screwed. Therefore, it might be helpful
to also consider images showing the parts in their side perspective. For this part, you are free to choose any of the tools and methods
introduced in the first part, and you can expand as you wish!
Deliverables
For completing this exercise successfully, you need to deliver certain results. Throughout the notebook you will find questions you need to
answer, and coding tasks where you need to modify existing code or fill in blanks. Answers to the questions need to be added in the prepared
Your answer here markdown fields. Coding tasks can be solved by modifying or inserting code in the cells below the task. If needed, you can
add extra cells to the notebook, as long as the order of the existing cells remains unchanged. Once you are finished with the lab, you can submit
it through the procedure described in the forum. Once the labs are submitted, you will receive feedback regarding the questions. Thus, the
Feedback boxes need to be left empty.
Example:
Your answer: Have a look at the forum, maybe some of your peers already experienced similar issues. Otherwise start a new
discussion to get help!
Feedback: This is a great approach! Besides asking in the forum I'd also suggest asking your tutor.
Resources
If you are having issues while completing this lab, feel free to post your questions into the forum. Your peers as well as our teaching advisors
will screen the forum regularly and answer open questions. This way, the information is available for fellow students encountering the same
issues.
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
# As in the previous exercises, we'll import commonly used libraries right at the start
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import cv2
import random
import tensorflow as tf
# The check.py script contains the quality gates you can use for selftesting throughout the lab
from scripts.check import *
1 Part one
To achieve the solution mentioned above, we will execute the following steps in this lab:
1. First, we will code the necessary functions for loading and preprocessing the data. We will also set up some methods that help us
display our progress throughout the exercise
2. Second, we will do a short analysis of the existing dataset
3. Afterwards, we will start building our first image classification model using SVMs
4. Once we are familiar and comfortable with SVMs, we will switch to neural networks and try out CNNs
5. Finally, we will introduce data augmentation for improvement of our prediction results.
(1024, 1024, 3)
With the snippet above, we are able to load the image from the file into a numpy array, while getting its label from the folder path the image is in.
To check the type of a python object, you can use the command type(img) . It should return numpy.ndarray.
type(img)
numpy.ndarray
Next, we want to plot the image. This can be achieved by executing the following cell.
Now it's your turn. For the further analysis, we need to load all the available images from the given data folder. Besides the image, we
also need to find the class of the respective image. The information about the class is encoded in the title of each image. You can use the helper
function get_label_from_name(path) to parse the filename to the class.
Task: Please complete the following function load_features_labels(folder). The function should read the image for a given file,
and return two lists:
import glob
def get_label_from_name(path):
    if "_C_" in path:
        return "Complete"
    if "_MC_" in path:
        return "Missing cover"
    if "_MS_" in path:
        return "Missing screw"
    if "_NS_" in path:
        return "Not screwed"
    return "n/a"  # TODO: Raise error
def load_features_labels(folder, size = (64,32), flatten = True, color = False, identifiers=['NS', 'MS', 'MC', 'C']):
    features, labels = [], []  # Empty lists for storing the features and labels
    # Iterate over all imagefiles in the given folder
    for file in glob.glob(folder + "/*.JPG"):
        if any(identifier in file for identifier in identifiers):
            #############################
            # Please add your code here #
            #############################
            ###
            # Solution
            ###
            img = cv2.imread(file)
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            features.append(img)
            labels.append(get_label_from_name(file))
            ###
            # END Solution
            ###
    return features, labels
If everything works as expected, the function should load 117 features and labels. The execution may take a while.
Image preprocessing
Before analyzing the images using machine learning, they need to be preprocessed. We will do preprocessing regarding three aspects:
Image size: As the raw images are available in rather high resolution, it might be beneficial to reduce the image resolution. Opencv
provides the function resize() which works great for that purpose
Image color: In many use cases, the benefit of considering color information might not outweigh the increased complexity, thus it might be
handy to convert the RGB image to grayscale. This can easily be done using the cvtColor function from opencv.
Image shape: Only some algorithms are capable of analyzing the 2.5D structure of image data. For the remaining algorithms, which
expect the data to be 1D vector, the image data needs to be flattened from 2.5D to 1D. This can be done using the numpy reshape
functionality.
Task: Please update your load_features_labels(...) function from above to do image-wise data preprocessing using the
function image_preprocessing(...) . Note that the images shall have a size of 8x8 pixels and be flattened subsequent to the
resizing. Therefore, be mindful of argument passing between the two functions!
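The image_preprocessing(...) helper itself is not listed in this excerpt; a minimal sketch of what it presumably does, based on the three preprocessing aspects above (the exact signature and defaults are assumptions):
def image_preprocessing(img, size=(64, 32), flatten=True, color=False):
    # Reduce the image resolution
    img = cv2.resize(img, size)
    # Optionally convert the RGB image to grayscale
    if not color:
        img = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    # Optionally flatten the image into a 1D feature vector
    if flatten:
        img = img.reshape(-1)
    return img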
def load_features_labels(folder, size = (64,32), flatten = True, color = False, identifiers=['NS', 'MS', 'MC', 'C']):
    features, labels = [], []  # Empty lists for storing the features and labels
    # Iterate over all imagefiles in the given folder
    for file in glob.glob(folder + "/*.JPG"):
        if any(identifier in file for identifier in identifiers):
            #############################
            # Please add your code here #
            #############################
            ###
            # Solution
            ###
            img = cv2.imread(file)
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            img = image_preprocessing(img, size=size, flatten=flatten, color=color)
            features.append(img)
            labels.append(get_label_from_name(file))
            ###
            # END Solution
            ###
    return features, labels
################
# Quality gate #
################
Task: Please create a plot showing the distribution of the different classes and discuss the distribution in the field below.
Counter({'Not screwed': 47, 'Missing screw': 42, 'Missing cover': 22, 'Complete': 6})
#############################
# Please add your code here #
#############################
###
# Solution
###
###
# END Solution
###
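One possible solution, assuming the label list returned by load_features_labels(...) is available as labels and matplotlib has been imported as plt:
from collections import Counter
import matplotlib.pyplot as plt

counts = Counter(labels)  # e.g. {'Not screwed': 47, 'Missing screw': 42, 'Missing cover': 22, 'Complete': 6}
plt.figure(figsize=(8, 5))
plt.bar(list(counts.keys()), list(counts.values()))
plt.xlabel('Class')
plt.ylabel('Number of images')
plt.title('Class distribution of the dataset')
plt.show()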
Question: Please discuss the class distribution. Which issues and challenges might appear during model training?
Feedback: ...
As we can see, we still load our 117 images, but the pixel values are now simply reshaped to 1D. Next, we need to separate our data into training
and testing datasets. This can be achieved using the train_test_split() function from sklearn. You can find the documentation here: Link.
Task: Fill in the following code so that 70% of the data is used for training, and the remaining 30% for testing. Also, the datasets
should be stratified by the label vector.
################
# Quality gate #
################
quality_gate_13(X_train, X_test)
Score: 0.6190476190476191
accuracy 0.62 21
macro avg 0.21 0.33 0.25 21
weighted avg 0.38 0.62 0.47 21
Task: Fill in the following code so that 70% of the data is used for training, and the remaining 30% for testing. Also the datasets
should be stratified by the label vector.
######################################
# Please complete the following line #
######################################
#X_train, X_test, y_train, y_test = train_test_split("""Your code goes here""", random_state=42)
###
# Solution
###
X_train, X_test, y_train, y_test = train_test_split(features, labels, train_size=0.7, stratify=labels, random_state=42)
The labels need to be one hot encoded. In one hot encoding, categorical values are transformed into a binary representation.
OneHotEncoding
# The sklearn preprocessing library contains a variety of useful data preprocessing tools such as one hot encoding
from sklearn.preprocessing import OneHotEncoder
# Display the first label before encoding
print("Label of first sample before OneHot encoding:", y_train[0])
# Create the encoder object
enc = OneHotEncoder(sparse=False) # Generate Encoder
# With the fit_transform function, the encoder is fitted to the existing labels and transforms the dataset into its binary representation
y_train = enc.fit_transform(y_train.reshape(-1, 1))
# Display the first label after encoding
print("Label of first sample after OneHot encoding:", y_train[0])
# Data preprocessing should always be fitted on the training dataset, but applied to both, the training and the testing dataset. Thus the fit
y_test = enc.transform(y_test.reshape(-1, 1))
Now, let's define a simple ANN with an input layer, two hidden layers and one output layer. In this lab we use the keras library to model the neural
network.
A simple ANN with multiple sequential layers can be created using the Sequential() model. Afterwards, various layers can be added to the
model through the command model.add(LAYER) with LAYER defining the layer to be added. In the first layer, the shape of the input needs to be
specified using the parameter input_shape . This is only necessary in the first, but not in consecutive layers.
Please have a look at the keras documentation regarding the sequential model and the various layers. For now, especially the core layers Dense
and Activation are of interest.
# Imports added for completeness; these may already be available from an earlier cell
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(32, input_shape = X_train[0].shape))
model.add(Activation("relu"))
model.add(Dense(16))
model.add(Activation("relu"))
model.add(Dense(y_train[0].shape[0]))
model.add(Activation("softmax"))
print(model.summary())
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 32) 1572896
activation (Activation) (None, 32) 0
=================================================================
Total params: 1,573,475
Trainable params: 1,573,475
Non-trainable params: 0
_________________________________________________________________
None
Once the model is created, model.summary() displays the architecture of the model. You can see that the created model consists of three
dense layers, each with an activation function. Also, the parameters for each layer are visible. Depending on the selected image size during
preprocessing, the input vector might be rather large, hence the high number of parameters in the first dense layer.
Next, the model needs to be compiled using a loss function and an optimizer . The loss function defines how the loss is computed during
model training, while the optimizer defines how the weights need to be adjusted during backpropagation. You can find more information
regarding the available losses here and regarding the optimizers here.
Now, the model can be trained using the datasets defined before.
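The compile and fit cells are not shown in this excerpt; a minimal sketch with plausible settings (the loss, optimizer, batch size and validation split are assumptions; only the 20 epochs are visible in the output below):
# Compile the model with a classification loss and an optimizer
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train for 20 epochs; a small validation split produces the val_loss/val_accuracy values below
model.fit(np.array(X_train), np.array(y_train), epochs=20, batch_size=8,
          validation_split=0.2, verbose=1)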
Epoch 1/20
5/5 [==============================] - 1s 72ms/step - loss: 1412.3744 - accuracy: 0.2564 - val_loss: 646.9340 - val_accuracy: 0.6000
Epoch 2/20
5/5 [==============================] - 0s 17ms/step - loss: 333.3408 - accuracy: 0.5385 - val_loss: 44.7513 - val_accuracy: 0.6000
Epoch 3/20
5/5 [==============================] - 0s 17ms/step - loss: 130.8871 - accuracy: 0.5897 - val_loss: 8.7992 - val_accuracy: 0.7000
Epoch 4/20
5/5 [==============================] - 0s 18ms/step - loss: 46.9059 - accuracy: 0.6154 - val_loss: 107.1110 - val_accuracy: 0.4000
Epoch 5/20
5/5 [==============================] - 0s 16ms/step - loss: 53.4573 - accuracy: 0.6154 - val_loss: 76.1191 - val_accuracy: 0.4000
Epoch 6/20
5/5 [==============================] - 0s 22ms/step - loss: 37.7211 - accuracy: 0.5897 - val_loss: 42.5151 - val_accuracy: 0.6000
Epoch 7/20
5/5 [==============================] - 0s 16ms/step - loss: 15.9705 - accuracy: 0.7692 - val_loss: 12.5873 - val_accuracy: 0.7000
Epoch 8/20
5/5 [==============================] - 0s 17ms/step - loss: 7.3708 - accuracy: 0.8462 - val_loss: 24.3295 - val_accuracy: 0.7000
Epoch 9/20
5/5 [==============================] - 0s 17ms/step - loss: 7.7038 - accuracy: 0.9231 - val_loss: 63.5386 - val_accuracy: 0.5000
Epoch 10/20
5/5 [==============================] - 0s 18ms/step - loss: 2.9461 - accuracy: 0.9487 - val_loss: 55.2533 - val_accuracy: 0.6000
Epoch 11/20
5/5 [==============================] - 0s 17ms/step - loss: 1.6057 - accuracy: 0.9487 - val_loss: 14.6102 - val_accuracy: 0.8000
Epoch 12/20
5/5 [==============================] - 0s 17ms/step - loss: 0.0000e+00 - accuracy: 1.0000 - val_loss: 44.4014 - val_accuracy: 0.6000
Epoch 13/20
5/5 [==============================] - 0s 19ms/step - loss: 2.6602 - accuracy: 0.9231 - val_loss: 24.0462 - val_accuracy: 0.7000
Epoch 14/20
5/5 [==============================] - 0s 16ms/step - loss: 0.5815 - accuracy: 0.9744 - val_loss: 68.7479 - val_accuracy: 0.6000
Epoch 15/20
5/5 [==============================] - 0s 17ms/step - loss: 0.3020 - accuracy: 0.9744 - val_loss: 19.2934 - val_accuracy: 0.9000
Epoch 16/20
5/5 [==============================] - 0s 16ms/step - loss: 0.0000e+00 - accuracy: 1.0000 - val_loss: 18.7606 - val_accuracy: 0.7000
Epoch 17/20
5/5 [==============================] - 0s 17ms/step - loss: 0.0262 - accuracy: 0.9744 - val_loss: 23.5947 - val_accuracy: 0.8000
Epoch 18/20
5/5 [==============================] - 0s 16ms/step - loss: 0.0000e+00 - accuracy: 1.0000 - val_loss: 25.5570 - val_accuracy: 0.8000
Epoch 19/20
5/5 [==============================] - 0s 17ms/step - loss: 8.4530e-05 - accuracy: 1.0000 - val_loss: 26.7151 - val_accuracy: 0.8000
Epoch 20/20
5/5 [==============================] - 0s 19ms/step - loss: 0.3409 - accuracy: 0.9744 - val_loss: 23.9743 - val_accuracy: 0.8000
<keras.callbacks.History at 0x7fcbd09740d0>
from sklearn.metrics import confusion_matrix, classification_report  # assumed imports; may already be available
import seaborn as sns

# Predict class probabilities and convert the predictions to a one-hot representation
y_pred = model.predict(X_test)
res = np.zeros_like(y_pred)
for i in range(len(np.argmax(y_pred, axis=1))):
    res[i, np.argmax(y_pred, axis=1)[i]] = 1
y_pred = res
cm = confusion_matrix(enc.inverse_transform(y_test), enc.inverse_transform(y_pred))
ax=sns.heatmap(cm, annot=True)
plt.xlabel('Predicted')
plt.ylabel('Truth')
print(classification_report(enc.inverse_transform(y_test), enc.inverse_transform(y_pred), zero_division=0))
accuracy 0.62 21
macro avg 0.39 0.42 0.40 21
weighted avg 0.56 0.62 0.59 21
Question: What behavior did you observe while training the model? How can the results be explained?
Feedback: ...
Architecture CNN
First, the data is loaded from file. As CNNs are capable of, and even excel at, analyzing the multi-dimensional structure of images, the images do
not need to be reshaped into a one-dimensional vector. Thus, we have to set the flag flatten to False . You can see that the shape of the loaded
images is now a four-dimensional array with (number of samples, width image, height image, color channels image) .
Task: Fill in the following code so that 70% of the data is used for training, and the remaining 30% for testing. Also, the datasets
should be stratified by the label vector. Furthermore, add OneHot Encoding for the labels as seen before.
#######################################
# Please complete the following lines #
#######################################
###
# Solution
###
################
# Quality gate #
################
from tensorflow.keras.layers import Conv2D, MaxPooling2D, GlobalMaxPooling2D  # assumed imports; may already be available

model = Sequential()
model.add(Conv2D(8, 5, input_shape = X_train[0].shape, activation = 'relu', padding="same"))
model.add(MaxPooling2D())
model.add(Conv2D(16, 3, activation = 'relu', padding="same"))
model.add(MaxPooling2D())
model.add(Conv2D(32, 3, activation = 'relu', padding="same"))
model.add(GlobalMaxPooling2D())
model.add(Dense(32, activation = 'relu'))
model.add(Dense(y_train[0].shape[0], activation = 'softmax'))
print(model.summary())
Model: "sequential_4"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_9 (Conv2D) (None, 512, 512, 8) 608
=================================================================
Total params: 7,571
Trainable params: 7,571
Non-trainable params: 0
_________________________________________________________________
None
2/2 [==============================] - 9s 3s/step - loss: 0.0334 - accuracy: 1.0000 - val_loss: 0.2650 - val_accuracy: 0.8000
Epoch 48/75
2/2 [==============================] - 9s 3s/step - loss: 0.0313 - accuracy: 1.0000 - val_loss: 0.2527 - val_accuracy: 0.8000
Epoch 49/75
2/2 [==============================] - 9s 3s/step - loss: 0.0293 - accuracy: 1.0000 - val_loss: 0.2477 - val_accuracy: 0.8000
Epoch 50/75
2/2 [==============================] - 9s 3s/step - loss: 0.0281 - accuracy: 1.0000 - val_loss: 0.2456 - val_accuracy: 0.8000
Epoch 51/75
2/2 [==============================] - 9s 3s/step - loss: 0.0270 - accuracy: 1.0000 - val_loss: 0.2420 - val_accuracy: 0.8000
Epoch 52/75
2/2 [==============================] - 9s 3s/step - loss: 0.0263 - accuracy: 1.0000 - val_loss: 0.2421 - val_accuracy: 0.8000
Epoch 53/75
2/2 [==============================] - 9s 3s/step - loss: 0.0253 - accuracy: 1.0000 - val_loss: 0.2439 - val_accuracy: 0.8000
Epoch 54/75
2/2 [==============================] - 9s 3s/step - loss: 0.0242 - accuracy: 1.0000 - val_loss: 0.2491 - val_accuracy: 0.8000
Epoch 55/75
2/2 [==============================] - 9s 3s/step - loss: 0.0233 - accuracy: 1.0000 - val_loss: 0.2520 - val_accuracy: 0.8000
Epoch 56/75
2/2 [==============================] - 9s 3s/step - loss: 0.0229 - accuracy: 1.0000 - val_loss: 0.2481 - val_accuracy: 0.8000
Epoch 57/75
2/2 [==============================] - 9s 3s/step - loss: 0.0219 - accuracy: 1.0000 - val_loss: 0.2353 - val_accuracy: 0.8000
Epoch 58/75
2/2 [==============================] - 9s 3s/step - loss: 0.0216 - accuracy: 1.0000 - val_loss: 0.2275 - val_accuracy: 0.8000
Epoch 59/75
2/2 [==============================] - 9s 3s/step - loss: 0.0210 - accuracy: 1.0000 - val_loss: 0.2330 - val_accuracy: 0.8000
Epoch 60/75
2/2 [==============================] - 9s 3s/step - loss: 0.0200 - accuracy: 1.0000 - val_loss: 0.2492 - val_accuracy: 0.8000
Epoch 61/75
2/2 [==============================] - 9s 3s/step - loss: 0.0195 - accuracy: 1.0000 - val_loss: 0.2695 - val_accuracy: 0.8000
Epoch 62/75
2/2 [==============================] - 9s 3s/step - loss: 0.0192 - accuracy: 1.0000 - val_loss: 0.2789 - val_accuracy: 0.8000
Epoch 63/75
2/2 [==============================] - 9s 3s/step - loss: 0.0188 - accuracy: 1.0000 - val_loss: 0.2761 - val_accuracy: 0.8000
Epoch 64/75
2/2 [==============================] - 9s 3s/step - loss: 0.0177 - accuracy: 1.0000 - val_loss: 0.2606 - val_accuracy: 0.8000
Epoch 65/75
2/2 [==============================] - 9s 3s/step - loss: 0.0175 - accuracy: 1.0000 - val_loss: 0.2515 - val_accuracy: 0.8000
Epoch 66/75
2/2 [==============================] - 9s 3s/step - loss: 0.0172 - accuracy: 1.0000 - val_loss: 0.2552 - val_accuracy: 0.8000
Epoch 67/75
2/2 [==============================] - 9s 3s/step - loss: 0.0167 - accuracy: 1.0000 - val_loss: 0.2686 - val_accuracy: 0.8000
Epoch 68/75
2/2 [==============================] - 9s 3s/step - loss: 0.0158 - accuracy: 1.0000 - val_loss: 0.2855 - val_accuracy: 0.8000
Epoch 69/75
2/2 [==============================] - 9s 3s/step - loss: 0.0154 - accuracy: 1.0000 - val_loss: 0.3056 - val_accuracy: 0.8000
Epoch 70/75
2/2 [==============================] - 9s 3s/step - loss: 0.0159 - accuracy: 1.0000 - val_loss: 0.3183 - val_accuracy: 0.8000
Epoch 71/75
2/2 [==============================] - 9s 3s/step - loss: 0.0151 - accuracy: 1.0000 - val_loss: 0.3047 - val_accuracy: 0.8000
Epoch 72/75
2/2 [==============================] - 9s 3s/step - loss: 0.0144 - accuracy: 1.0000 - val_loss: 0.2869 - val_accuracy: 0.8000
Epoch 73/75
2/2 [==============================] - 9s 3s/step - loss: 0.0139 - accuracy: 1.0000 - val_loss: 0.2730 - val_accuracy: 0.8000
Epoch 74/75
2/2 [==============================] - 9s 3s/step - loss: 0.0138 - accuracy: 1.0000 - val_loss: 0.2662 - val_accuracy: 0.8000
Epoch 75/75
2/2 [==============================] - 9s 3s/step - loss: 0.0137 - accuracy: 1.0000 - val_loss: 0.2686 - val_accuracy: 0.8000
<keras.callbacks.History at 0x7fcbd015fcd0>
accuracy 0.67 21
macro avg 0.40 0.42 0.40 21
weighted avg 0.58 0.67 0.61 21
Question:
Feedback: ...
Task: With the above starter code, a first improvement in accuracy compared to the SVM and the ANN using only Dense layers
should be visible. However, the network could be further improved by adjusting the hyperparameters. Below you can find the full
snippet from data preprocessing to model training. Play around with the parameters and see whether you can find a model that
shows an even better performance!
For comparability, please don't change the ratios for train/test and train/validation!
np.random.seed(28)
####################################################
# Please modify the following lines #
# ! Don't change training/test/validation ratios ! #
####################################################
# Data preprocessing
features, labels = load_features_labels("./data/top", size=(512,512), color=True, flatten=False, identifiers=['MC', 'MS', 'C'])
features = np.array(features) # Datatype conversion of feature vector from list to array
labels = np.array(labels) # Datatype conversion of label vector from list to array
X_train, X_test, y_train, y_test = split_data(features, labels) # Split features and labels into training and testing datasets
y_train, y_test = encode_labels(y_train, y_test) # Encode labels
# Model definition
model = Sequential()
model.add(Conv2D(4, 5, input_shape = X_train[0].shape, activation = 'relu', padding="same"))
model.add(MaxPooling2D())
model.add(Conv2D(8, 3, activation = 'relu', padding="same"))
model.add(MaxPooling2D())
model.add(Conv2D(8, 3, activation = 'relu', padding="same"))
model.add(GlobalMaxPooling2D())
model.add(Dense(8, activation = 'relu'))
model.add(Dense(y_train[0].shape[0], activation = 'softmax'))
# Model compilation
optimizer=Adam(learning_rate=0.0005)
model.compile(loss = 'categorical_crossentropy', optimizer = optimizer, metrics = ['accuracy'])
# Model training
model.fit(np.array(X_train), np.array(y_train), epochs = 50, batch_size = 2, validation_split=0.1,
verbose = 1, sample_weight=compute_sample_weight('balanced', y_train))
# Model evaluation
evaluate_model(X_test, y_test, model)
Epoch 1/50
22/22 [==============================] - 9s 388ms/step - loss: 26.1337 - accuracy: 0.2955 - val_loss: 11.7893 - val_accuracy: 0.2000
Epoch 2/50
22/22 [==============================] - 8s 383ms/step - loss: 11.7874 - accuracy: 0.1364 - val_loss: 8.1721 - val_accuracy: 0.2000
Epoch 3/50
22/22 [==============================] - 8s 383ms/step - loss: 7.6837 - accuracy: 0.2500 - val_loss: 7.0914 - val_accuracy: 0.2000
Epoch 4/50
22/22 [==============================] - 8s 385ms/step - loss: 5.6548 - accuracy: 0.3182 - val_loss: 5.4500 - val_accuracy: 0.4000
Epoch 5/50
22/22 [==============================] - 8s 381ms/step - loss: 4.5269 - accuracy: 0.3864 - val_loss: 4.4434 - val_accuracy: 0.4000
Epoch 6/50
22/22 [==============================] - 8s 381ms/step - loss: 3.9237 - accuracy: 0.3864 - val_loss: 3.7421 - val_accuracy: 0.4000
Epoch 7/50
22/22 [==============================] - 8s 382ms/step - loss: 3.2746 - accuracy: 0.4091 - val_loss: 3.1884 - val_accuracy: 0.4000
Epoch 8/50
22/22 [==============================] - 8s 382ms/step - loss: 2.8123 - accuracy: 0.4091 - val_loss: 2.7209 - val_accuracy: 0.4000
Epoch 9/50
22/22 [==============================] - 8s 384ms/step - loss: 2.4667 - accuracy: 0.4091 - val_loss: 2.4462 - val_accuracy: 0.4000
Epoch 10/50
22/22 [==============================] - 8s 384ms/step - loss: 2.1635 - accuracy: 0.4091 - val_loss: 2.2040 - val_accuracy: 0.4000
Epoch 11/50
22/22 [==============================] - 8s 379ms/step - loss: 1.9306 - accuracy: 0.4091 - val_loss: 2.0059 - val_accuracy: 0.4000
Epoch 12/50
22/22 [==============================] - 8s 383ms/step - loss: 1.7306 - accuracy: 0.4091 - val_loss: 1.8478 - val_accuracy: 0.4000
Epoch 13/50
22/22 [==============================] - 8s 381ms/step - loss: 1.5747 - accuracy: 0.4091 - val_loss: 1.7066 - val_accuracy: 0.4000
Epoch 14/50
22/22 [==============================] - 8s 380ms/step - loss: 1.3976 - accuracy: 0.4091 - val_loss: 1.5534 - val_accuracy: 0.4000
Epoch 15/50
22/22 [==============================] - 8s 378ms/step - loss: 1.2548 - accuracy: 0.4091 - val_loss: 1.3888 - val_accuracy: 0.4000
Epoch 16/50
22/22 [==============================] - 8s 382ms/step - loss: 1.1533 - accuracy: 0.4091 - val_loss: 1.3450 - val_accuracy: 0.2000
Epoch 17/50
22/22 [==============================] - 8s 380ms/step - loss: 1.0260 - accuracy: 0.4091 - val_loss: 1.1637 - val_accuracy: 0.4000
Epoch 18/50
22/22 [==============================] - 8s 382ms/step - loss: 0.9248 - accuracy: 0.4091 - val_loss: 1.1008 - val_accuracy: 0.2000
Epoch 19/50
22/22 [==============================] - 8s 383ms/step - loss: 0.8120 - accuracy: 0.4091 - val_loss: 0.9707 - val_accuracy: 0.4000
Epoch 20/50
22/22 [==============================] - 8s 381ms/step - loss: 0.7136 - accuracy: 0.4091 - val_loss: 0.8725 - val_accuracy: 0.4000
Epoch 21/50
22/22 [==============================] - 8s 379ms/step - loss: 0.6277 - accuracy: 0.4091 - val_loss: 0.7792 - val_accuracy: 0.4000
Epoch 22/50
22/22 [==============================] - 8s 379ms/step - loss: 0.5672 - accuracy: 0.4091 - val_loss: 0.7057 - val_accuracy: 0.4000
Epoch 23/50
22/22 [==============================] - 8s 381ms/step - loss: 0.5100 - accuracy: 0.4091 - val_loss: 0.6321 - val_accuracy: 0.4000
Epoch 24/50
22/22 [==============================] - 8s 382ms/step - loss: 0.4656 - accuracy: 0.4091 - val_loss: 0.6048 - val_accuracy: 0.4000
Epoch 25/50
22/22 [==============================] - 8s 379ms/step - loss: 0.4298 - accuracy: 0.4091 - val_loss: 0.5514 - val_accuracy: 0.4000
Epoch 26/50
22/22 [==============================] - 8s 380ms/step - loss: 0.3959 - accuracy: 0.4091 - val_loss: 0.4881 - val_accuracy: 0.6000
Epoch 27/50
22/22 [==============================] - 8s 377ms/step - loss: 0.3602 - accuracy: 0.4091 - val_loss: 0.4679 - val_accuracy: 0.6000
Epoch 28/50
22/22 [==============================] - 8s 381ms/step - loss: 0.3350 - accuracy: 0.4773 - val_loss: 0.4212 - val_accuracy: 0.6000
Epoch 29/50
22/22 [==============================] - 8s 381ms/step - loss: 0.3285 - accuracy: 0.5909 - val_loss: 0.4223 - val_accuracy: 0.6000
Epoch 30/50
22/22 [==============================] - 8s 385ms/step - loss: 0.3127 - accuracy: 0.5909 - val_loss: 0.3797 - val_accuracy: 0.6000
Epoch 31/50
22/22 [==============================] - 8s 383ms/step - loss: 0.2905 - accuracy: 0.7045 - val_loss: 0.3845 - val_accuracy: 0.6000
Epoch 32/50
22/22 [==============================] - 8s 379ms/step - loss: 0.2945 - accuracy: 0.7500 - val_loss: 0.3831 - val_accuracy: 0.6000
Epoch 33/50
22/22 [==============================] - 8s 378ms/step - loss: 0.2760 - accuracy: 0.7500 - val_loss: 0.3529 - val_accuracy: 0.6000
Epoch 34/50
22/22 [==============================] - 8s 381ms/step - loss: 0.2549 - accuracy: 0.7955 - val_loss: 0.3466 - val_accuracy: 0.6000
Epoch 35/50
22/22 [==============================] - 8s 386ms/step - loss: 0.2554 - accuracy: 0.7500 - val_loss: 0.3345 - val_accuracy: 0.6000
Epoch 36/50
22/22 [==============================] - 9s 400ms/step - loss: 0.2546 - accuracy: 0.7955 - val_loss: 0.3323 - val_accuracy: 0.6000
Epoch 37/50
22/22 [==============================] - 8s 385ms/step - loss: 0.2354 - accuracy: 0.8182 - val_loss: 0.3249 - val_accuracy: 0.6000
Epoch 38/50
22/22 [==============================] - 8s 381ms/step - loss: 0.2263 - accuracy: 0.8864 - val_loss: 0.3077 - val_accuracy: 0.6000
Epoch 39/50
22/22 [==============================] - 8s 382ms/step - loss: 0.2294 - accuracy: 0.9091 - val_loss: 0.3058 - val_accuracy: 0.6000
Epoch 40/50
22/22 [==============================] - 8s 380ms/step - loss: 0.2153 - accuracy: 0.8864 - val_loss: 0.3014 - val_accuracy: 0.6000
Epoch 41/50
22/22 [==============================] - 8s 381ms/step - loss: 0.2150 - accuracy: 0.9091 - val_loss: 0.3025 - val_accuracy: 0.6000
Epoch 42/50
22/22 [==============================] - 8s 380ms/step - loss: 0.2061 - accuracy: 0.8864 - val_loss: 0.3164 - val_accuracy: 0.6000
Epoch 43/50
22/22 [==============================] - 8s 379ms/step - loss: 0.1997 - accuracy: 0.9091 - val_loss: 0.2829 - val_accuracy: 0.6000
Epoch 44/50
22/22 [==============================] - 8s 384ms/step - loss: 0.1903 - accuracy: 0.9091 - val_loss: 0.2857 - val_accuracy: 0.6000
Epoch 45/50
22/22 [==============================] - 9s 388ms/step - loss: 0.1847 - accuracy: 0.9318 - val_loss: 0.2616 - val_accuracy: 0.6000
Epoch 46/50
22/22 [==============================] - 8s 387ms/step - loss: 0.1874 - accuracy: 0.9091 - val_loss: 0.2626 - val_accuracy: 0.6000
Epoch 47/50
22/22 [==============================] - 8s 381ms/step - loss: 0.1786 - accuracy: 0.9091 - val_loss: 0.2540 - val_accuracy: 0.6000
Epoch 48/50
22/22 [==============================] - 8s 383ms/step - loss: 0.1699 - accuracy: 0.9318 - val_loss: 0.2669 - val_accuracy: 0.6000
Epoch 49/50
22/22 [==============================] - 8s 379ms/step - loss: 0.1651 - accuracy: 0.9545 - val_loss: 0.2480 - val_accuracy: 0.6000
Epoch 50/50
22/22 [==============================] - 8s 381ms/step - loss: 0.1652 - accuracy: 0.9318 - val_loss: 0.2616 - val_accuracy: 0.6000

               precision    recall  f1-score   support

     Complete       0.00      0.00      0.00         2
Missing cover       0.40      0.67      0.50         6
Missing screw       0.75      0.46      0.57        13

     accuracy                           0.48        21
    macro avg       0.38      0.38      0.36        21
 weighted avg       0.58      0.48      0.50        21

Question: Describe your approach to optimizing the hyperparameters. Which behavior did you observe?

Section 1.5: Data augmentation

Data augmentation is a technique for artificially increasing the dataset without the need for additional data acquisition. The reason for this is that most machine learning models perform better the larger the available data volume is.

Data augmentation uses the principle of applying slight modifications to the original data to create new samples while reusing the labels of the existing image. As these modifications are rather small, the image as a whole does not change by much and the object to be identified, or in our case the image class, can still be recognized. However, the amount of training data can be increased significantly. One can think of many variations of such slight modifications of an image; typical examples include the following (see the short sketch after this list):

Random flipping of the image horizontally or vertically
Random rotations
Random shifts
Blurring the image
Adding artificially created noise
Cropping
Changes in contrast
Elastic deformations
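Purely as an illustration of the transformations listed above (the exercise itself relies on Keras' ImageDataGenerator, introduced in the next section), a few of them could be applied to a single image with TensorFlow's tf.image utilities; all parameter values below are arbitrary example choices:

import tensorflow as tf

def augment_once(image):
    # Sketch only: `image` is assumed to be a float32 tensor of shape
    # (height, width, channels) with values scaled to [0, 1].
    image = tf.image.random_flip_left_right(image)           # random horizontal flip
    image = tf.image.random_flip_up_down(image)              # random vertical flip
    image = tf.image.rot90(image, k=1)                       # rotation by 90 degrees
    image = tf.image.random_contrast(image, 0.8, 1.2)        # change in contrast
    noise = tf.random.normal(tf.shape(image), stddev=0.02)   # artificially created noise
    return tf.clip_by_value(image + noise, 0.0, 1.0)

Keras bundles such operations behind a single interface, which is why the exercise uses the ImageDataGenerator instead of hand-written transforms.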
Below you can see some examples of the different augmentation strategies applied to our dataset.

Implementation in keras

Keras includes its own procedure for image augmentation using the ImageDataGenerator class. This generator offers a variety of data augmentation strategies that are applied directly to the raw data during model training. Thus, the augmented data does not need to be stored to disk.

For this exercise, we are going to use the ImageDataGenerator from keras. Please have a look at the documentation to get familiar:
https://keras.io/api/preprocessing/image/#imagedatagenerator-class
# Data preprocessing
features, labels = load_features_labels("./data/top", size=(512,512), color=True, flatten=False, identifiers=['MC', 'MS', 'C'])
features = np.array(features) # Datatype conversion of feature vector from list to array
labels = np.array(labels) # Datatype conversion of label vector from list to array
X_train, X_test, y_train, y_test = split_data(features, labels) # Split features and labels into training and testing datasets
y_train, y_test = encode_labels(y_train, y_test) # Encode labels
# NOTE: this plotting cell assumes that an ImageDataGenerator (`datagen`), an image
# grid (`grid`, e.g. mpl_toolkits.axes_grid1.ImageGrid) and a randomly chosen sample
# index (`random_index`) have been defined beforehand.
grid[0].imshow(features[random_index])  # Show the original image in the first cell
grid[0].set_title("Original")
for i, ax in enumerate(grid[1:]):
    image = datagen.flow(features[[random_index]]).next()[0].astype(int)  # Draw one augmented sample
    ax.imshow(image)  # Plot the augmented image
plt.show()
### Run model training with given data generator
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, train_size=0.9, stratify=y_train, random_state=21)
datagen = ImageDataGenerator(
    featurewise_center=False,
    featurewise_std_normalization=False,
    rotation_range=10,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True,
    vertical_flip=True)
datagen.fit(np.array(X_train))
model = Sequential()
model.add(Conv2D(8, 5, input_shape = X_train[0].shape, activation = 'relu', padding="same"))
model.add(MaxPooling2D())
model.add(Conv2D(8, 5, activation = 'relu', padding="same"))
model.add(MaxPooling2D())
model.add(Conv2D(16, 5, activation = 'relu', padding="same"))
model.add(MaxPooling2D())
model.add(Conv2D(16, 3, activation = 'relu', padding="same"))
model.add(GlobalMaxPooling2D())
model.add(Dense(64, activation = 'relu'))
model.add(Dense(16, activation = 'relu'))
model.add(Dense(y_train[0].shape[0], activation = 'softmax'))
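The training log below was produced by fitting this model on batches drawn from the generator; the corresponding compile and fit calls are not shown in the excerpt above. A minimal sketch of how datagen.flow is typically passed to model.fit could look as follows (optimizer, loss and batch size are assumptions, not values taken from the exercise):

model.compile(optimizer='adam',                 # assumed optimizer
              loss='categorical_crossentropy',  # assumed loss, matching the softmax output layer
              metrics=['accuracy'])
history = model.fit(
    datagen.flow(np.array(X_train), np.array(y_train), batch_size=10),  # augmented batches generated on the fly
    validation_data=(np.array(X_validation), np.array(y_validation)),
    epochs=50)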
Epoch 34/50
5/5 [==============================] - 13s 2s/step - loss: 0.5471 - accuracy: 0.8182 - val_loss: 0.7551 - val_accuracy: 0.6000
Epoch 35/50
5/5 [==============================] - 13s 3s/step - loss: 0.7288 - accuracy: 0.7045 - val_loss: 0.5190 - val_accuracy: 0.8000
Epoch 36/50
5/5 [==============================] - 13s 2s/step - loss: 0.5174 - accuracy: 0.7955 - val_loss: 0.4745 - val_accuracy: 0.8000
Epoch 37/50
5/5 [==============================] - 13s 2s/step - loss: 0.6051 - accuracy: 0.7727 - val_loss: 0.7809 - val_accuracy: 0.6000
Epoch 38/50
5/5 [==============================] - 13s 2s/step - loss: 0.6883 - accuracy: 0.7045 - val_loss: 0.5812 - val_accuracy: 0.8000
Epoch 39/50
5/5 [==============================] - 13s 2s/step - loss: 0.5558 - accuracy: 0.7500 - val_loss: 0.3944 - val_accuracy: 0.8000
Epoch 40/50
5/5 [==============================] - 13s 2s/step - loss: 0.6485 - accuracy: 0.7727 - val_loss: 0.4437 - val_accuracy: 0.8000
Epoch 41/50
5/5 [==============================] - 13s 2s/step - loss: 0.7244 - accuracy: 0.7500 - val_loss: 0.4760 - val_accuracy: 0.8000
Epoch 42/50
5/5 [==============================] - 13s 2s/step - loss: 0.7097 - accuracy: 0.8182 - val_loss: 0.7354 - val_accuracy: 0.8000
Epoch 43/50
5/5 [==============================] - 13s 2s/step - loss: 0.5350 - accuracy: 0.7955 - val_loss: 0.7730 - val_accuracy: 0.8000
Epoch 44/50
5/5 [==============================] - 13s 2s/step - loss: 0.6777 - accuracy: 0.6818 - val_loss: 0.6028 - val_accuracy: 0.8000
Epoch 45/50
5/5 [==============================] - 13s 3s/step - loss: 0.5212 - accuracy: 0.8182 - val_loss: 0.5707 - val_accuracy: 0.8000
Epoch 46/50
5/5 [==============================] - 13s 2s/step - loss: 0.6360 - accuracy: 0.7955 - val_loss: 0.5915 - val_accuracy: 0.8000
Epoch 47/50
5/5 [==============================] - 13s 2s/step - loss: 0.6162 - accuracy: 0.7727 - val_loss: 0.6657 - val_accuracy: 0.8000
Epoch 48/50
5/5 [==============================] - 13s 2s/step - loss: 0.7296 - accuracy: 0.7727 - val_loss: 0.6666 - val_accuracy: 0.8000
Epoch 49/50
5/5 [==============================] - 13s 2s/step - loss: 0.4567 - accuracy: 0.8636 - val_loss: 0.6394 - val_accuracy: 0.8000
Epoch 50/50
5/5 [==============================] - 13s 3s/step - loss: 0.5065 - accuracy: 0.7500 - val_loss: 0.4732 - val_accuracy: 0.8000
<keras.callbacks.History at 0x7fcbc4fef890>
               precision    recall  f1-score   support

     accuracy                           0.81        21
    macro avg       0.60      0.56      0.57        21
 weighted avg       0.79      0.81      0.78        21
Question: What behavior can be observed while training the model with data augmentation? Did the results improve?
Task & Question: Experiment with the different data augmentation parameters. Are all of them similarly effective?
Feedback: ...
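For the experimentation task above, one possible way to structure the comparison is to re-train a fresh copy of the CNN for each ImageDataGenerator configuration and record the best validation accuracy; the parameter combinations, batch size and loss below are example choices, not prescribed by the exercise:

from keras.models import clone_model

results = {}
configs = {
    'rotation_only': {'rotation_range': 10},
    'rotation_and_flips': {'rotation_range': 10, 'horizontal_flip': True, 'vertical_flip': True},
    'rotation_and_shifts': {'rotation_range': 10, 'width_shift_range': 0.2, 'height_shift_range': 0.2},
}
for name, params in configs.items():
    gen = ImageDataGenerator(**params)
    gen.fit(np.array(X_train))
    candidate = clone_model(model)  # fresh, randomly initialized copy of the CNN defined above
    candidate.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    history = candidate.fit(gen.flow(np.array(X_train), np.array(y_train), batch_size=10),
                            validation_data=(np.array(X_validation), np.array(y_validation)),
                            epochs=50, verbose=0)
    results[name] = max(history.history['val_accuracy'])  # best validation accuracy per setting
print(results)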
Section 2: Expanding the project scope
So far, only the classes Complete, Missing cover, and Missing screw have been investigated. These defects are easy to observe in the top view, whereas the remaining defect, Not screwed, is hardly visible there. Information from the side view could therefore be used to detect this defect.
Task: In this last part of the exercise, you are asked to expand the current quality monitoring solution to also detect screws that are not fully fastened. As mentioned above, it might be useful to investigate the side view images for this purpose.
You can approach this problem using any of the previously mentioned solutions, such as SVMs, ANNs, or CNNs.
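One possible starting point is to reuse the preprocessing pipeline from above on the side-view images; note that the directory name "./data/side" and the class identifier 'NS' for Not screwed are assumptions that should be checked against the provided data loader:

# Sketch (assumed path and identifiers): load side-view images with the provided helpers
features_side, labels_side = load_features_labels("./data/side", size=(512, 512), color=True,
                                                  flatten=False, identifiers=['NS', 'C'])
features_side = np.array(features_side)  # Datatype conversion of feature vector from list to array
labels_side = np.array(labels_side)      # Datatype conversion of label vector from list to array
X_train_s, X_test_s, y_train_s, y_test_s = split_data(features_side, labels_side)
y_train_s, y_test_s = encode_labels(y_train_s, y_test_s)
# From here on, any of the approaches above (SVM, ANN or CNN) can be trained on the side view data.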
Question: Please describe your approach for expanding the project scope briefly.
Feedback: ...
Question: What were your final prediction results? What would you do to further improve them?
Feedback: ...
Question: Which challenges did you encounter while solving the problem? How did you solve those?
Feedback: ...