Machine Learning
Algorithms
KNIME AG
Structure of the Course
Session Topic
§ Tom Mitchell: Machine Learning. McGraw Hill, 1997.
[Figure: example model applications — a recommendation model (IF … THEN rules), a model flagging suspicious transactions (Trx 1, Trx 2, Trx 3, …), and a model trained on spectral time series up to 31 August 2007 (training set), where only some spectral time series show the breakdown]
Model parameters
𝑦 = 𝑓( 𝜷, 𝑿 ) with 𝜷 = [𝛽1, 𝛽2, … , 𝛽𝑚]
§ The model learns / is trained during the learning / training phase to produce
the right answer y (a.k.a., label)
[Figure: input features X = (x1, x2, …) — e.g., Age, Money, Temperature, Speed, Number of taxis, …]
§ Training phase: the algorithm trains a model using the data in the training set
§ Testing phase: a metric measures how well the model is performing on data in
a new dataset (the test set)
Data Science: Process Overview
[Figure: the original data set is partitioned into a training set (used to train the model) and a test set (used to apply and score the model)]
https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
Data Preparation: Data Manipulation, Data Blending, Missing Values Handling, Feature Generation, Dimensionality Reduction, Feature Selection, Outlier Removal, Normalization, Partitioning, …
Model Training: Model Training, Bag of Models, Model Selection, Ensemble Models, Own Ensemble Model, External Models, Import Existing Models, Model Factory, …
Model Optimization: Parameter Tuning, Parameter Optimization, Regularization, Model Size, No. of Iterations, …
Model Testing: Performance Measures, Accuracy, ROC Curve, Cross-Validation, …
Deployment: Files & DBs, Dashboards, REST API, SQL Code Export, Reporting, …
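A minimal Python/scikit-learn sketch of the partition → train → apply → score loop shown above. The course builds this as a KNIME workflow; the dataset and model choices here are only illustrative assumptions.

```python
# Partition -> train -> apply -> score, sketched with scikit-learn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)                  # original data set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)   # train model
y_pred = model.predict(X_test)                                      # apply model
print("accuracy on the test set:", accuracy_score(y_test, y_pred))  # score model
```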
Example: class distributions in two nodes
§ Impure node: Entropy(p) = −(p₁ log₂ p₁ + p₂ log₂ p₂) ≈ 0.995
§ Pure node: p₁ = 1, p₂ = 0 → Entropy(p) = 0

Split criterion (information gain):
Gain = Entropy_before − Entropy_after
Entropy_after = w₁ · Entropy₁ + w₂ · Entropy₂, with wᵢ the fraction of samples falling into child i
Gain = Entropy_before − w₁ · Entropy₁ − w₂ · Entropy₂
Split criterion (Gini index):
Gini_split = Σⱼ wⱼ · Giniⱼ, with wⱼ the fraction of samples falling into child j
Gini_split = w₁ · Gini₁ + w₂ · Gini₂
Gini₁ = Gini(p₁, p₂) of child 1, Gini₂ = Gini(p₁, p₂) of child 2
Next splitting feature: feature with lowest Gini_split
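A small sketch of both split criteria. The class counts are made up for illustration; only the formulas come from the slides.

```python
import numpy as np

def entropy(counts):
    """Entropy of a class-count vector: -sum(p_i * log2 p_i)."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

def gini(counts):
    """Gini index of a class-count vector: 1 - sum(p_i^2)."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def split_scores(parent, children):
    """Information gain and weighted Gini index of a candidate split."""
    n = sum(sum(c) for c in children)
    weights = [sum(c) / n for c in children]
    entropy_after = sum(w * entropy(c) for w, c in zip(weights, children))
    gain = entropy(parent) - entropy_after
    gini_split = sum(w * gini(c) for w, c in zip(weights, children))
    return gain, gini_split

# Hypothetical node with 7 vs. 6 samples, split into two pure children
gain, gini_split = split_scores(parent=[7, 6], children=[[7, 0], [0, 6]])
print(f"information gain = {gain:.3f}, Gini_split = {gini_split:.3f}")
```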
[Figure: splitting a numeric feature x at a threshold (x < t vs. x ≥ t); example values shown: 1.2, 1.7, 2, 2.3, 3.4, 3.6, 4.9, 7.4, 8, 9.2, 12.6]
[Figure: decision trees over the features temp and wind, with splits such as temp ≥/< 10, temp ≥/< 25, temp ≥/< 22, temp ≥/< 26 and wind ≥/< 6, shown next to the corresponding partitioning of the temp–wind plane]
[Figure: three fits of the same data — overfitting: the model memorizes the training set rather than finding the underlying patterns; good fit: the model captures the underlying correlations in the training set; underfitting: the model overlooks the patterns in the training set]

Overfitting
§ Model that fits the training data too well, including details and noise
§ Negative impact on the model's ability to generalize

Underfitting
§ A model that can neither model the training data nor generalize to new data
Techniques:
• Reduced Error Pruning
• Minimum description length
[Figure: Minimum Description Length pruning — two examples comparing a larger tree (Tree 1, splits on wind and temp) with a pruned tree (Tree 2). Example 1: many misclassified samples in Tree 1 ⇒ DL(Tree 1) > DL(Tree 2) ⇒ select Tree 2. Example 2: DL(Tree 1) < DL(Tree 2) ⇒ select Tree 1]
§ Definition:
§ Accuracy = number of correctly classified samples / total number of samples
§ Downsides:
§ Only considers the performance in general and not for the different classes
§ Therefore, not informative when the class distribution is unbalanced
Example: Accuracy = 0.96 vs. Accuracy = 0.93
Arbitrarily define one class value as POSITIVE and the remaining class as
NEGATIVE
                      Predicted class positive    Predicted class negative
True class positive   TRUE POSITIVE (TP)          FALSE NEGATIVE (FN)
True class negative   FALSE POSITIVE (FP)         TRUE NEGATIVE (TN)

§ TRUE POSITIVE (TP): actual and predicted class is positive
§ TRUE NEGATIVE (TN): actual and predicted class is negative
§ FALSE NEGATIVE (FN): actual class is positive and predicted class is negative
§ FALSE POSITIVE (FP): actual class is negative and predicted class is positive
Use these four statistics to calculate other evaluation metrics, such as overall
accuracy, true positive rate, and false positive rate
ROC Curve
§ The ROC Curve shows the false positive rate and true positive rate for
different threshold values
§ False positive rate (FPR)
§ negative events incorrectly classified as positive
§ True positive rate (TPR)
§ positive events correctly classified as positive
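A sketch computing the four confusion-matrix counts, accuracy, TPR, FPR, and an ROC curve with scikit-learn. The labels and scores below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve, auc

# Made-up ground truth and predicted probabilities for the POSITIVE class
y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.3, 0.6, 0.8, 0.1, 0.5, 0.65])

# Confusion matrix at one fixed threshold (0.5)
y_pred = (y_score >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy:", (tp + tn) / (tp + tn + fp + fn))
print("TPR:", tp / (tp + fn), "FPR:", fp / (fp + tn))

# ROC curve: TPR vs. FPR over all threshold values
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", auc(fpr, tpr))
```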
Cohen's kappa: κ = (p₀ − pₑ) / (1 − pₑ), where p₀ is the observed (overall) accuracy and pₑ the agreement expected by chance.

Model 1:
pₑ₁ = (19/100) × (20/100)
pₑ₂ = (81/100) × (80/100)
pₑ = pₑ₁ + pₑ₂ = 0.686
p₀ = 89/100 = 0.89 (overall accuracy)
κ = (p₀ − pₑ) / (1 − pₑ) = 0.204 / 0.314 ≈ 0.65

Model 2:
pₑ₁ = (11/100) × (20/100)
pₑ₂ = (89/100) × (80/100)
pₑ = pₑ₁ + pₑ₂ = 0.734
p₀ = 81/100 = 0.81 (overall accuracy)
κ = (p₀ − pₑ) / (1 − pₑ) = 0.076 / 0.266 ≈ 0.29

κ = 1: perfect model performance
κ = 0: the model performance is equal to a random classifier
§ Dataset: Sales data of individual residential properties in Ames, Iowa from 2006
to 2010.
§ One of the columns is the overall condition ranking, with values between 1 and
10.
§ Goal: train a binary classification model, which can predict whether the overall
condition is high or low.
You can download the training workflows from the KNIME Hub:
https://hub.knime.com/knime/spaces/Education/latest/Courses/
1. Right click on LOCAL and select Import KNIME Workflow…
3. Click on Finish
Regression Analysis
Applications
§ Forecasting
§ Quantitative Analysis
Methods
§ Linear
§ Polynomial
§ Regression Trees
§ Partial Least Squares
Simple Linear Regression
§ Minimize the sum of squared errors:
∑ᵢ₌₁ⁿ eᵢ² = ∑ᵢ₌₁ⁿ (yᵢ − ∑ⱼ aⱼ xⱼ,ᵢ)² = (y − Xa)ᵀ (y − Xa)
§ Solution:
â = (XᵀX)⁻¹ Xᵀ y
§ Computational issues:
§ XᵀX must have full rank, and thus be invertible
(Problems arise if linear dependencies between input features exist)
§ Solution may be unstable if input features are almost linearly dependent
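A numpy sketch of the closed-form solution â = (XᵀX)⁻¹Xᵀy on made-up data; the least-squares routine is shown as the numerically safer alternative when XᵀX is nearly singular.

```python
import numpy as np

# Toy data: y = 2 + 3*x plus noise (made up for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
X = np.column_stack([np.ones_like(x), x])      # design matrix with intercept column
y = 2 + 3 * x + rng.normal(0, 0.5, size=50)

# Closed-form normal-equation solution
a_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print("coefficients:", a_hat)

# lstsq avoids forming X^T X explicitly and is more stable
a_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print("lstsq coefficients:", a_lstsq)
```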
§ Positive:
§ Strong mathematical foundation
§ Simple to calculate and to understand
(For moderate number of dimensions)
§ High predictive accuracy
(In many applications)
§ Negative:
§ Many dependencies are non-linear
(Can be generalized)
§ Model is global and cannot adapt well to locally different data distributions
But: Locally weighted regression, CART
Mean signed difference: (1/n) ∑ᵢ₌₁ⁿ (yᵢ − f(xᵢ)) — only informative about the direction of the error
Mean absolute percentage error (MAPE): (1/n) ∑ᵢ₌₁ⁿ |yᵢ − f(xᵢ)| / |yᵢ| — requires non-zero target column values
MAE
§ Easy to interpret – mean average absolute error
§ All errors are equally weighted
§ Generally smaller than RMSE

RMSE
§ Cannot be directly interpreted as the average error
§ Larger errors are weighted more
§ Ideal when large deviations need to be avoided
Example: actual values = [2, 4, 5, 8] — compare MAE and RMSE for the corresponding predictions
Example: actual values = [2, 4, 5, 8] — compare R-squared and RMSE for the corresponding predictions
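A worked sketch of the metrics on the actual values from the slide; the predicted values here are made up, since the slide's predictions are not part of the text.

```python
import numpy as np

y_true = np.array([2, 4, 5, 8])           # actual values from the slide
y_pred = np.array([2.5, 4.5, 4.0, 9.0])   # hypothetical predictions (illustration only)

err  = y_true - y_pred
mae  = np.mean(np.abs(err))
rmse = np.sqrt(np.mean(err ** 2))
mape = np.mean(np.abs(err) / np.abs(y_true))          # needs non-zero targets
r2   = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  MAPE={mape:.3f}  R^2={r2:.3f}")
```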
Regression Tree: Goal
[Figure: building a regression tree — the first split asks x ≤ 93.5? (Y/N); the splitting process is repeated within each segment, e.g., x ≤ 70.5?, yielding segment predictions such as C₁ = 33.9, C₂ = 26.4, C₃ = 17.8]
§ Extensions:
§ Fuzzy trees (better interpolation)
§ Local models for each leaf (linear, quadratic)
[Figure: random forest — bootstrap samples X1, X2, … are drawn with replacement from the training set, one tree is built per sample, and the rows left out of each bootstrap sample provide the out-of-bag predictions y1OOB, y2OOB, … from the trees P1, P2, …, Pn]
[Figure: gradient boosted trees — each regression tree with depth 1 is fit to the residual errors from the previous model]
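A minimal sketch of the boosting idea in the figure: depth-1 regression trees repeatedly fit the residuals of the current ensemble. Data and hyperparameters are made up; this is not the KNIME node's implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (made-up data)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

# Each depth-1 tree fits the residuals of the previous model,
# and is added with a small learning rate.
learning_rate, n_trees = 0.1, 100
prediction = np.full_like(y, y.mean())
trees = []
for _ in range(n_trees):
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("training RMSE:", np.sqrt(np.mean((y - prediction) ** 2)))
```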
Functional relationship between features and…
§ … target value y: y = f(x₁, …, xₙ, β₀, …, βₙ), e.g., y = β₀ + β₁x₁ + ⋯ + βₙxₙ
§ … class probability P(y = class i): P(y = cᵢ) = f(x₁, …, xₙ, β₀, …, βₙ)
§ Idea: Train a function, which gives us the probability for each class (0 and 1)
based on the input features
§ Recap on probabilities
§ Probabilities are always between 0 and 1
§ The probabilities of all classes sum up to 1
P(y = 1) = p₁ ⇒ P(y = 0) = 1 − p₁
P(y = 1) = f(x₁, x₂; β₀, β₁, β₂) := 1 / (1 + e^(−(β₀ + β₁x₁ + β₂x₂)))

§ Model:
π = P(y = 1) = 1 / (1 + exp(−Xβ))
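A direct sketch of the two-feature model above; the coefficient values are made up to show how the probability responds to each feature.

```python
import numpy as np

def predict_proba(x1, x2, beta0, beta1, beta2):
    """P(y = 1) = 1 / (1 + exp(-(beta0 + beta1*x1 + beta2*x2)))."""
    z = beta0 + beta1 * x1 + beta2 * x2
    return 1.0 / (1.0 + np.exp(-z))

# Made-up coefficients: the probability rises with x1 and falls with x2
print(predict_proba(x1=2.0, x2=1.0, beta0=-1.0, beta1=1.5, beta2=-0.5))
```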
L(β; y, X) = ∏ᵢ₌₁ⁿ P(y = yᵢ) = ∏ᵢ₌₁ⁿ πᵢ^yᵢ (1 − πᵢ)^(1−yᵢ)

P(y = yᵢ) = πᵢ if yᵢ = 1, and 1 − πᵢ if yᵢ = 0, i.e., P(y = yᵢ) = πᵢ^yᵢ (1 − πᵢ)^(1−yᵢ)
Remember: πᵢ = P(y = 1); u⁰ = 1 and u¹ = u for u ∈ ℝ

max_β L(β; y, X) = max_β ∏ᵢ₌₁ⁿ πᵢ^yᵢ (1 − πᵢ)^(1−yᵢ)

max_β LL(β; y, X) = max_β ∑ᵢ₌₁ⁿ [ yᵢ ln πᵢ + (1 − yᵢ) ln(1 − πᵢ) ]
§ To find the coefficients of our model we want to find 𝜷 so that the value of the
function 𝐿𝐿 𝜷; 𝒚, 𝑿 is maximal
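A sketch of maximizing LL(β; y, X) by plain gradient ascent on made-up data; the gradient of the log-likelihood, Xᵀ(y − π), is a standard result and is used here as an assumption about the training procedure (the KNIME node may use a different optimizer).

```python
import numpy as np

# Made-up binary-classification data with two features plus an intercept column
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(300), rng.normal(size=(300, 2))])
true_beta = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=300) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)

# Gradient ascent on LL(beta) = sum_i [y_i ln(pi_i) + (1 - y_i) ln(1 - pi_i)];
# the gradient with respect to beta is X^T (y - pi).
beta, step = np.zeros(3), 0.01
for _ in range(5000):
    pi = 1 / (1 + np.exp(-X @ beta))
    beta += step * X.T @ (y - pi)

print("estimated coefficients:", beta)
```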
[Figure: gradient descent steps Δs along the loss surface toward the optimal β̂]

§ Fixed: Δsₖ = Δs₀
§ Annealing: Δsₖ = Δs₀ / (1 + k/α), with iteration number k and decay rate α
§ Line Search: Learning rate strategy that tries to find the optimal learning rate
§ L₂ regularization (Gaussian prior on the coefficients):
l(β̂; y, X) := −LL(β̂; y, X) + (λ/2) ‖β̂‖₂²
§ L₁ regularization (coefficients are Laplace distributed):
l(β̂; y, X) := −LL(β̂; y, X) + λ ‖β̂‖₁
§ p-value < α: the input feature has a significant impact on the dependent variable.
§ Regression Exercises:
§ Goal: Predicting the house price
§ 01_Linear_Regression
§ 02_Regression_Tree
§ Classification Exercises:
§ Goal: Predicting the house condition (high /low)
§ 03_Radom_Forest (with optional exercise to build a parameter
optimization loop)
§ 04_Logistic_Regression
Artificial Neurons and Networks
Biological vs. Artificial
[Figure: an artificial neuron — inputs x₁, x₂ with weights w₁, w₂ and a bias b = w₀ feed a weighted sum and an activation σ: y = f(x₁w₁ + x₂w₂ + b); in general y = f(∑ⱼ xⱼwⱼ)]
[Figure: a fully connected, feed-forward network — inputs x₁, x₂ are connected via weights Wᵢ,ⱼ to hidden units with outputs o₁, o₂, …; the network output is y = f(W_out 𝒐), where f( ) is the activation function]
§ Sigmoid: f(a) = 1 / (1 + e^(−2ha))
§ Tanh: f(a) = (e^(2ha) − 1) / (e^(2ha) + 1)
§ ReLU: f(a) = max(0, ha)
(h: steepness parameter)
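A small sketch of the three activation functions as reconstructed above; treating h as a steepness parameter is an assumption of this sketch.

```python
import numpy as np

def sigmoid(a, h=1.0):
    return 1.0 / (1.0 + np.exp(-2 * h * a))

def tanh_act(a, h=1.0):
    return (np.exp(2 * h * a) - 1) / (np.exp(2 * h * a) + 1)

def relu(a, h=1.0):
    return np.maximum(0.0, h * a)

a = np.linspace(-3, 3, 7)
print(sigmoid(a), tanh_act(a), relu(a), sep="\n")
```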
[Figure sequence: the XOR problem — the XOR truth table over two binary inputs x and y (output 1 only when exactly one input is 1) cannot be separated by the single linear decision boundary of one neuron; combining two linear boundaries such as 1 + x − y ≷ 0 and 2 − x − y ≷ 0 in a hidden layer with weights ±1 and suitable thresholds, and feeding both into an output neuron, solves XOR]
Gradient descent:
Δw_ji = −η ∂E/∂w_ji

∂E/∂w_ji = ∂(½ (t_j − y_j)²)/∂w_ji = ∂(½ (t_j − y_j)²)/∂y_j · ∂y_j/∂w_ji = −(t_j − y_j) ∂y_j/∂w_ji = −(t_j − y_j) g′(h_j) x_i

Δw_ji = −η ∂E/∂w_ji = η (t_j − y_j) g′(h_j) x_i = −η δ_j^out x_i, with δ_j^out = −(t_j − y_j) g′(h_j)
For the hidden-layer weights w_ij^hidden, apply the chain rule through the output layer:

Δw_ij^hidden = −η ∂E/∂w_ij^hidden = −η ∑_{x∈D} ∑_{k=1}^{K} (f(a_k^out(x)) − y_k(x)) · f′(a_k^out) · w_jk^out · f′(a_j^hidden) · x_i

= −η ∑_{x∈D} ∑_{k=1}^{K} δ_k^out w_jk^out f′(a_j^hidden) x_i = ∑_{x∈D} −η · δ_j^hidden · x_i

with δ_j^hidden = f′(a_j^hidden) ∑_k δ_k^out w_jk^out

Do you understand now why the sigmoid is a commonly used activation function?
[Figure: inputs x₁, x₂ → hidden outputs o₁, o₂, … → output y; δ^hidden and δ^out propagate backwards]

1. Forward pass:
𝒐 = f(W_in 𝒙)
y = f(W_out 𝒐)

2. Backward pass:
δ_j = ∂E/∂o_j · ∂o_j/∂net_j = (o_j − t_j) o_j (1 − o_j) for an output neuron
δ_j = (∑_{k∈succ(j)} w_jk δ_k) o_j (1 − o_j) for a hidden neuron

Δw_ij = −η o_i δ_j
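A minimal sketch of one forward/backward pass for a 2-3-1 network following the update rules above; the weights, sample, and omission of bias terms are assumptions for brevity.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One training step (made-up weights and sample)
rng = np.random.default_rng(0)
W_in, W_out = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
x, t, eta = np.array([0.5, -1.0]), np.array([1.0]), 0.1

# 1. Forward pass
o = sigmoid(W_in @ x)          # hidden activations
y = sigmoid(W_out @ o)         # network output

# 2. Backward pass
delta_out = (y - t) * y * (1 - y)                    # output-layer delta
delta_hidden = (W_out.T @ delta_out) * o * (1 - o)   # hidden-layer delta

# Weight updates: delta_w_ij = -eta * o_i * delta_j
W_out -= eta * np.outer(delta_out, o)
W_in  -= eta * np.outer(delta_hidden, x)
print("output after one update:", sigmoid(W_out @ sigmoid(W_in @ x)))
```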
§ Weight Decay:
§ try to keep weights small
§ Momentum:
§ increase weight updates as long as they have the same sign
§ Resilient Backpropagation:
§ estimate optimum for weight based on assumption that error surface is a polynomial.
§ Possible Solution:
§ Local activity of neurons in hidden layer: Local Basis Function Networks
§ Recurrent Neural Network (RNN) are a family of neural networks used for
processing of sequential data
§ RNNs are used for all sorts of tasks:
§ Language modeling / Text generation
§ Text classification
§ Neural machine translation
§ Image captioning
§ Speech to text
§ Numerical time series data, e.g. sensor data
Example: "Mag ich Schokolade?" ⇒ "Do I like chocolate?"
(word by word: Ich → I, mag → like, Schokolade → chocolate)
§ Problems:
§ Each time step is completely independent
§ For translations we need context
§ More general: we need a network that remembers inputs from the past
§ Solution: Recurrent neural networks
[Figure: a feed-forward network mapping input x to output y, one time step at a time]
[Figure: a recurrent network — the weights W₂ₓ and W₃ᵧ are reused at every time step and the hidden state feeds back into the network; unrolled over time, the inputs Ich, mag, Schokolade produce the outputs I, like, chocolate (y₀, y₁, y₂, y₃)]

Many to Many
[Figure: an unrolled chain of identical cells A processing the sequence "I like to go sailing"]
[Figure: two approaches to recommendations — association rules (IF Antecedent THEN Consequent) and collaborative filtering]
N shopping baskets:
{A, B, F, H}, {A, B, C}, {B, C, H}, {D, E, F}, {D, E}, {A, B}, {A, C}, {H, F}, …
Search for frequent itemsets, e.g., {A, B, F, H}
Build candidate rules from the frequent itemset:
{A, B, F} → H
{A, B, H} → F
{A, F, H} → B
§ Item set support s = freq(A, B, F, H) / N — how often these items are found together
§ Rule confidence c = freq(A, B, F, H) / freq(A, B, F) — how often the antecedent is found together with the consequent
§ Rule lift = support({A, B, F} ⇒ H) / (support(A, B, F) × support(H)) — how often antecedent and consequent happen together compared with random chance

The rules with support, confidence, and lift above a threshold → the most reliable ones
Two phases:
1. Find all frequent itemsets (FI) ← most of the complexity
§ Select itemsets with a minimum support S_min (user parameter):
FI = {(X, Y), X, Y ⊂ I | s(X, Y) ≥ S_min}
2. Build strong association rules
§ Select rules with a minimum confidence C_min (user parameter):
Rules: X ⇒ Y, X, Y ⊂ FI, c(X ⇒ Y) ≥ C_min
§ Support: how often are they found together across all shopping baskets?
§ Confidence: how often are they found together across all shopping baskets containing the antecedents?

TID  Transactions
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Support: s(milk, diaper, beer) = P(milk, diaper, beer) / |T| = 2/5 = 0.4
Confidence: c = P(milk, diaper, beer) / P(milk, diaper) = 2/3 = 0.67
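A short sketch computing support, confidence, and lift for the rule {Milk, Diaper} → Beer on the five baskets above (the lift value is an extra computation, not taken from the slide).

```python
# The five baskets from the table above
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of baskets containing all items of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"Milk", "Diaper"}, {"Beer"}
s = support(antecedent | consequent)
c = s / support(antecedent)
lift = c / support(consequent)
print(f"support={s:.2f} confidence={c:.2f} lift={lift:.2f}")  # 0.40, 0.67, 1.11
```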
Collaborative Filtering
Collaborative filtering systems have many forms, but many common systems can
be reduced to two steps:
1. Look for users who share the same rating patterns with the active user (the
user whom the recommendation is for)
2. Use the ratings from those like-minded users found in step 1 to calculate a
prediction for the active user
3. Implemented in Spark
https://www.knime.com/blog/movie-recommendations-with-spark-collaborative-filtering
Pearson correlation:
simil(u, u′) = ∑_{i∈I_uu′} (r_{u,i} − r̄_u)(r_{u′,i} − r̄_{u′}) / √( ∑_{i∈I_uu′} (r_{u,i} − r̄_u)² · ∑_{i∈I_uu′} (r_{u′,i} − r̄_{u′})² )
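A sketch of the user-to-user similarity above; the rating dictionaries and movie names are made up, and only items rated by both users enter the correlation.

```python
import numpy as np

def pearson_similarity(ratings_u, ratings_v):
    """Pearson correlation over the items both users have rated."""
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return 0.0
    r_u = np.array([ratings_u[i] for i in common], dtype=float)
    r_v = np.array([ratings_v[i] for i in common], dtype=float)
    du, dv = r_u - r_u.mean(), r_v - r_v.mean()
    denom = np.sqrt((du ** 2).sum() * (dv ** 2).sum())
    return float((du * dv).sum() / denom) if denom else 0.0

u = {"MovieA": 5, "MovieB": 3, "MovieC": 4}
v = {"MovieA": 4, "MovieB": 2, "MovieC": 5, "MovieD": 1}
print(pearson_similarity(u, v))
```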
§ Neural Network
§ Goal: Train an MLP to solve our
classification problem (rank: high/low)
§ 01_Simple_Neural_Network
Definition:
Given a data set D with |D| = n, determine a clustering C of D with
C = {C₁, C₂, ⋯, Cₖ}
where Cᵢ ⊆ D and ⋃_{1≤i≤k} Cᵢ = D
that best fits the given data set D.
Clustering Methods:
1. partitioning
2. hierarchical (linkage based)
3. density-based
[Figure: k-means iterations — cluster assignment alternates with the calculation of new centroids until the assignments no longer change]
§ Advantages:
§ Relatively efficient
§ Simple implementation
§ Weaknesses:
§ Often terminates at a local optimum
§ Applicable only when mean is defined (what about categorical data?)
§ Need to specify k, the number of clusters, in advance
§ Unable to handle noisy data and outliers
§ Not suitable to discover clusters with non-convex shapes
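A minimal k-means sketch with scikit-learn on made-up blob data; the normalization step reflects the fact that k-means is distance based, and k = 3 is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Made-up 2-D data with three blobs
rng = np.random.default_rng(7)
data = np.vstack([rng.normal(loc=c, scale=0.4, size=(50, 2)) for c in ([1, 1], [5, 5], [8, 1])])

# Normalize first, then cluster with k = 3
scaled = MinMaxScaler().fit_transform(data)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)
print("centroids:", km.cluster_centers_)
print("first labels:", km.labels_[:10])
```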
Within-Cluster Variation vs. Between-Cluster Variation
[Figure: a bad clustering and a good clustering of the same points, with the cluster centroids marked]

§ Within-Cluster Variation:
TD² = ∑ᵢ₌₁ᵏ ∑_{p∈Cᵢ} dist(p, μ_{Cᵢ})²
§ Between-Cluster Variation:
BC² = ∑ⱼ₌₁ᵏ ∑ᵢ₌₁ᵏ dist(μ_{Cⱼ}, μ_{Cᵢ})²
s(x) = (b(x) − a(x)) / max{a(x), b(x)}
where a(x) is the average distance of x to the objects in its own cluster and b(x) the average distance to the objects in the nearest other cluster.

Good clustering: a(x) ≪ b(x) ⇒ s(x) ≈ 1
…not so good: a(x) ≈ b(x) ⇒ s(x) = (b(x) − a(x)) / max{a(x), b(x)} ≈ 0
…bad clustering: a(x) ≫ b(x) ⇒ s(x) ≈ −1

§ Silhouette coefficient s_C for a clustering C is the average silhouette over all objects x ∈ C:
s_C = (1/n) ∑_{x∈C} s(x)
Method
§ For 𝑘=2, 3, ⋯, 𝑛−1, determine one clustering each
§ Choose 𝑘 resulting in the highest clustering quality
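A sketch of this method using scikit-learn's silhouette score on made-up data; only a small range of k is tried here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(60, 2)) for c in ([0, 0], [4, 4], [8, 0])])

# Try several k and keep the one with the highest average silhouette
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    scores[k] = silhouette_score(data, labels)
best_k = max(scores, key=scores.get)
print(scores, "-> best k:", best_k)
```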
§ Lp-Metric (Minkowski distance):
dist(x, y) = ( ∑ᵢ₌₁ᵈ |xᵢ − yᵢ|ᵖ )^(1/p)
§ Euclidean distance (p = 2):
dist(x, y) = √( ∑ᵢ₌₁ᵈ (xᵢ − yᵢ)² )
§ Manhattan distance (p = 1):
dist(x, y) = ∑ᵢ₌₁ᵈ |xᵢ − yᵢ|
§ Maximum distance (p = ∞):
dist(x, y) = max_{1≤i≤d} |xᵢ − yᵢ|
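A tiny sketch of the four distances on a made-up pair of vectors.

```python
import numpy as np

def minkowski(x, y, p):
    """Lp / Minkowski distance: (sum |x_i - y_i|^p)^(1/p)."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print("Euclidean :", minkowski(x, y, 2))        # sqrt(9 + 4 + 0)
print("Manhattan :", minkowski(x, y, 1))        # 3 + 2 + 0
print("Maximum   :", np.max(np.abs(x - y)))     # limit p -> infinity
```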
Goal
§ Construction of a hierarchy of clusters (dendrogram)
by merging/separating clusters with minimum/maximum distance
Dendrogram:
§ A tree representing the hierarchy of clusters, with the following properties:
§ Root: single cluster with the whole data set.
§ Leaves: clusters containing a single object.
§ Branches: merges / separations between larger clusters and smaller clusters / objects
§ The vertical axis shows the distance at which clusters are merged.
§ Example dendrogram
[Figure: a dendrogram over nine objects; the height of each merge corresponds to the distance between the merged clusters]
1. Form initial clusters consisting of a single object, and compute the distance
between each pair of clusters.
2. Merge the two clusters having minimum distance.
3. Calculate the distance between the new cluster and all other clusters.
4. If there is only one cluster containing all objects:
Stop, otherwise go to step 2.
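The four steps above are what agglomerative clustering routines implement; a sketch with SciPy on made-up points (the linkage method choice is an assumption):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 2-D points forming two groups; linkage starts from singleton clusters
# and repeatedly merges the two closest clusters until one cluster is left.
rng = np.random.default_rng(3)
points = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(3, 0.3, (5, 2))])

Z = linkage(points, method="average")             # "single", "complete", "average", ...
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
print(labels)
```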
§ Average linkage distance between clusters C₁ and C₂:
Dist_avg(C₁, C₂) = (1 / (|C₁| · |C₂|)) ∑_{p∈C₁} ∑_{q∈C₂} dist(p, q)
§ Merge Step:
§ union of two subsets of data points
§ construct the mean point of the two clusters
- Sensitive to noise (Single-Link)
  (a "line" of objects can connect two clusters)
- Inefficient
  → runtime complexity at least O(n²) for n objects
§ Single Linkage:
§ Prefers well-separated clusters
§ Complete Linkage:
§ Prefers small, compact clusters
§ Average Linkage:
§ Prefers small, well-separated clusters…
Clusters are built by joining core and density-reachable points to one another.
Core point vs. border point vs. noise:
§ t = core point
§ s = border point
§ n = noise point
Note: t is not density-reachable from s, because s is not a core point
§ For each point, DBSCAN determines the ε-environment and checks whether it contains more than MinPts data points ⇒ core point
§ Iteratively increases the cluster by adding density-reachable points
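A DBSCAN sketch with scikit-learn on made-up data; eps corresponds to the ε-environment and min_samples to MinPts, and the chosen values are only illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up data: two dense blobs plus a few scattered noise points
rng = np.random.default_rng(5)
blobs = np.vstack([rng.normal([0, 0], 0.2, (40, 2)), rng.normal([3, 3], 0.2, (40, 2))])
noise = rng.uniform(-2, 5, (10, 2))
data = np.vstack([blobs, noise])

db = DBSCAN(eps=0.5, min_samples=5).fit(data)
print("cluster labels (noise = -1):", np.unique(db.labels_))
```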
Clustering:
§ A density-based clustering 𝐶 of a dataset D w.r.t. 𝜀 and MinPts is the set of all
density-based clusters 𝐶! w.r.t. 𝜀 and MinPts in D.
§ The set 𝑁𝑜𝑖𝑠𝑒𝐶𝐿 („noise“) is defined as the set of all objects in D which do not
belong to any of the clusters.
Property:
§ Let Cᵢ be a density-based cluster and p ∈ Cᵢ be a core object.
Example:
§ Lengths in cm (100–200) and weights in kilograms (30–150) both fall on approximately the same scale
§ What about lengths in m (1–2) and weights in grams (30,000–150,000)?
→ The weight values in grams dominate over the length values for the similarity of records!
Goal of normalization:
§ Transformation of attributes to make record ranges comparable
§ min-max normalization
y = (x − x_min) / (x_max − x_min) · (y_max − y_min) + y_min
§ z-score normalization
y = (x − mean(x)) / stddev(x)
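A direct sketch of both normalizations on made-up length values.

```python
import numpy as np

x = np.array([100.0, 150.0, 175.0, 200.0])   # e.g., lengths in cm (made-up values)

# min-max normalization into the target range [y_min, y_max]
y_min, y_max = 0.0, 1.0
minmax = (x - x.min()) / (x.max() - x.min()) * (y_max - y_min) + y_min

# z-score normalization
zscore = (x - x.mean()) / x.std()

print(minmax, zscore, sep="\n")
```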
§ Missing Completely At Random (MCAR): reason does not depend on its value
or lack of value.
There may be no particular reason why some people told you their weights and others
didn’t.
§ An outlier could be, for example, rare behavior, system defect, measurement
error, or reaction to an unexpected event
§ Knowledge-based
§ Statistics-based
§ Distance from the median
§ Position in the distribution tails
§ Distance to the closest cluster center
§ Error produced by an autoencoder
§ Number of random splits to isolate a data point
from other data
https://www.knime.com/blog/four-techniques-for-outlier-detection
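A sketch of two of the listed techniques — a statistics-based rule on the distribution tails (IQR) and an isolation forest — on made-up values with two injected outliers; thresholds and contamination are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(11)
values = np.concatenate([rng.normal(50, 5, 200), [120.0, -30.0]])   # two injected outliers

# Statistics-based: flag points far outside the interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Isolation forest: points isolated by few random splits get label -1
iso = IsolationForest(contamination=0.01, random_state=0).fit(values.reshape(-1, 1))
iso_outliers = values[iso.predict(values.reshape(-1, 1)) == -1]

print("IQR outliers:", iqr_outliers)
print("Isolation forest outliers:", iso_outliers)
```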
§ Measure based
§ Ratio of missing values
§ Low variance
§ High Correlation
§ Transformation based
§ Principal Component Analysis (PCA)
§ Linear Discriminant Analysis (LDA)
§ t-SNE
§ Machine Learning based
§ Random Forest of shallow trees
§ Neural auto-encoder
Principal Component Analysis (PCA)
§ PC₁ describes most of the variability in the data, PC₂ adds the next big contribution, and so on. In the end, the last PCs do not bring much more information to describe the data.
§ Thus, to describe the data we could use only the top m < n components (i.e., PC₁, PC₂, ⋯, PCₘ) with little - if any - loss of information
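A PCA sketch with scikit-learn on made-up numeric data; keeping enough components for 95% explained variance is an assumption about the cut-off.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
data = rng.normal(size=(200, 5))
data[:, 3] = data[:, 0] * 2 + rng.normal(0, 0.05, 200)   # a nearly redundant column

# Normalize, then keep enough components to explain 95% of the variance
scaled = StandardScaler().fit_transform(data)
pca = PCA(n_components=0.95).fit(scaled)
reduced = pca.transform(scaled)
print("kept components:", pca.n_components_, "explained:", pca.explained_variance_ratio_)
```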
Dimensionality Reduction
§ Caveats:
§ Results of PCA are quite difficult to interpret
§ Normalization required
§ Only effective on numeric columns
§ PCA : unsupervised
§ LDA : supervised
§ LD₁ describes best the class separation in the data, LD₂ adds the next big contribution, and so on. In the end, the last LDs do not bring much more information to separate the classes.
§ Thus, for our classification problem we could use only the top m < n discriminants (i.e., LD₁, LD₂, ⋯, LDₘ) with little - if any - loss of information
§ That is, it compresses the input vector (dimension n) into a smaller vector space
on layer “code” (dimension m<n) and then it reconstructs the original vector onto
the output layer.
§ If the network was trained well, the reconstruction operation happens with
minimal loss of information.
https://thenewstack.io/3-new-techniques-for-data-dimensionality-reduction-in-machine-learning/
§ Both methods are used for reducing the number of features in a dataset.
However:
§ Feature selection is simply selecting and excluding given features without
changing them.
§ Dimensionality reduction might transform the features into a lower dimension.
§ Feature selection is often a somewhat more aggressive and more
computationally expensive process.
§ Backward Feature Elimination
§ Forward Feature Construction
1. First, train n separate models on one single input feature and keep the feature
that produces the best accuracy.
2. Then, train 𝑛 − 1 separate models on 2 input features, the selected one and
one more. At the end keep the additional feature that produces the best
accuracy.
3. And so on … Continue until an acceptable error rate is reached.
https://thenewstack.io/3-new-techniques-for-data-dimensionality-reduction-in-machine-learning/
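A sketch of forward feature construction following the three steps above; the dataset, the model, and the stopping rule (three features) are assumptions of this sketch, not the course exercise.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

selected, remaining = [], list(range(X.shape[1]))
for _ in range(3):                       # stop after 3 features for brevity
    scores = {}
    for f in remaining:                  # try adding each remaining feature
        cols = selected + [f]
        model = LogisticRegression(max_iter=5000)
        scores[f] = cross_val_score(model, X[:, cols], y, cv=5).mean()
    best = max(scores, key=scores.get)   # keep the feature with the best accuracy
    selected.append(best)
    remaining.remove(best)
    print("selected features:", selected, "accuracy:", round(scores[best], 3))
```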
§ Coordinate Transformations
Remember PCA and LDA?
Polar coordinates , …
Thank you!
§ Clustering
§ Goal: Cluster location data from California
§ 01_Clustering
§ Data Preparation
§ 02_Missing_Value_Handling
§ 03_Outlier_Detection
§ 04_Dimensionality_Reduction
§ 05_Feature_Selection
https://www.knime.com/sites/default/files/110519_KNIME_Machine_Learning_Cheat%20Sheet.pdf