2021 01 Slides l4 ML

This document outlines an introduction to machine learning algorithms course. The course consists of 5 sessions covering topics like decision trees, regression models, neural networks, clustering, and data preparation. Each session includes a discussion of exercises, a lecture, and the introduction of new exercises. The course materials recommend several textbooks on intelligent data analysis, machine learning, and data mining. The document also provides examples of potential use cases for machine learning such as churn prediction, customer segmentation, risk assessment, demand prediction, and fraud detection. It defines key concepts in data science including learning algorithms, algorithm training, and the three main types of learning: supervised, unsupervised, and semi-supervised learning.


[L4-ML] Introduction to

Machine Learning
Algorithms
KNIME AG

1
Structure of the Course
Session Topic

Session 1 Introduction & Decision Tree Algorithm

Session 2 Regression Models, Ensemble Models, & Logistic Regression

Session 3 Neural Networks & Recommendation Engines

Session 4 Clustering & Data Preparation

Session 5 Last Exercise and Q&A

§ Structure of each session


§ Discussion of past exercises (10 minutes)
§ Lecture (60 minutes)
§ Introduction of next exercises (5 minutes)

© 2021 KNIME AG. All rights reserved. 3


Material
§ Michael Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn:
Guide to Intelligent Data Analysis
Springer, 2010.

§ Tom Mitchell:
Machine Learning
McGraw Hill, 1997.

§ David Hand, Heikki Mannila, Padhraic Smyth:


Principles of Data Mining
MIT Press, 2001.

§ Michael Berthold, David Hand (eds):


Intelligent Data Analysis, An Introduction
(2nd Edition) Springer Verlag, 2003.

© 2021 KNIME AG. All rights reserved. 5


What is Data Science?

[Wikipedia quoting Dhar 13, Leek 13]

Data science is a multi-disciplinary field that uses scientific methods, processes,


algorithms and systems to extract knowledge and insights from structured and
unstructured data.

[Fayyad, Piatetsky-Shapiro & Smyth 96]

Knowledge discovery in databases (KDD) is the process of


(semi-)automatic extraction of knowledge from databases which is valid,
previously unknown, and potentially useful.

© 2021 KNIME AG. All rights reserved. 6


Some Clarity about Words

§ (semi)-automatic: no manual analysis, though some user interaction required


§ valid: in the statistical sense
§ previously unknown: not explicit, no „common sense knowledge“
§ potentially useful: for a given application
§ structured data: numbers
§ unstructured data: everything else (images, texts, networks, chem. compounds,
…)
Data Mining → Data Science

[Diagram: data science spans machine learning, data preparation, big data, and structured & unstructured data]

© 2021 KNIME AG. All rights reserved. 7


Use Case Collection
Churn Prediction

CRM system: data about your customer (demographics, behavior, revenues, …) → Model

Use cases: churn prediction (highlighted here), upselling likelihood, product propensity / NBO, campaign management, customer segmentation, …

© 2021 KNIME AG. All rights reserved. 9


Customer Segmentation

CRM system: data about your customer (demographics, behavior, revenues, …) → Model

Use cases: churn prediction, upselling likelihood, product propensity / NBO, campaign management, customer segmentation (highlighted here), …

© 2021 KNIME AG. All rights reserved. 10


Risk Assessment

Customer history → Model → Risk prognosis

[Figure: for each customer, a history of documents (Oct 2015 – Feb 2019) is fed to the model, which outputs a risk class: high risk, low risk, very high risk, very low risk, medium risk, …]

© 2021 KNIME AG. All rights reserved. 11


Demand Prediction

§ How many taxis do I need in NYC on Wednesday at noon?

Model

© 2021 KNIME AG. All rights reserved. 12


Recommendation Engines / Market Basket Analysis

Recommendation

[Figure: IF {items in the basket} THEN {recommended item}]

Model

© 2021 KNIME AG. All rights reserved. 13


Fraud Detection

Suspicious Transaction

Transactions
§ Trx 1
§ Trx 2
§ Trx 3
§ Trx 4
§ Trx 5
§ Trx 6
§ …
Model

© 2021 KNIME AG. All rights reserved. 14


Sentiment Analysis

© 2021 KNIME AG. All rights reserved. 15


Anomaly Detection

Predicting mechanical failure as late as possible but before it happens

[Figure: spectral time series A1-SV3, frequency bands [0, 100] Hz and [500, 600] Hz; training set recorded until 31 August 2007; breaking point on July 21, 2008]

Predictive maintenance: only some spectral time series show the breakdown.

© 2021 KNIME AG. All rights reserved. 16


Basic Concepts in Data Science
What is a Learning Algorithm?

𝑿 = (x₁, x₂, …, xₙ): input features / input attributes / independent variables
y: class / label / target / output feature (attribute) / dependent variable

Model: y = f(𝜷, 𝑿), with model parameters 𝜷 = [β₁, β₂, …, β_m]

A learning algorithm adjusts (learns) the model parameters 𝜷 over a number of iterations to maximize a likelihood function or minimize an error function on y.

© 2021 KNIME AG. All rights reserved. 19


Algorithm Training / Learning

§ The model learns / is trained during the learning / training phase to produce
the right answer y (a.k.a., label)

§ That is why it is called machine learning :-)

§ Many different algorithms for three ways of learning:


§ Supervised
§ Unsupervised
§ Semi-supervised

© 2021 KNIME AG. All rights reserved. 20


Supervised Learning

§ 𝑿 = (𝑥1, 𝑥2) and 𝑦 = {𝑦𝑒𝑙𝑙𝑜𝑤, 𝑔𝑟𝑎𝑦}


§ A training set with many examples of (𝑿, 𝑦)
§ The model learns on the examples of the training set to produce the right value
of y for an input vector 𝑿
[Figure: labeled points in the (x₁, x₂) plane]

𝑿 → model → y
Example inputs 𝑿: age, money, temperature, speed, number of taxis, …
Example targets y: sunny vs. cloudy, healthy vs. sick, churn vs. remain, increase vs. decrease, …

© 2021 KNIME AG. All rights reserved. 21


Unsupervised Learning

§ 𝑿 = (x₁, x₂), no labels y
§ A training set with many examples of 𝑿
§ The model learns to group the examples 𝑿 of the training set based on similarity (closeness) or probability

[Figure: unlabeled points in the (x₁, x₂) plane grouped by the model]
Example inputs 𝑿: age, money, temperature, speed, number of taxis, …

© 2021 KNIME AG. All rights reserved. 22


Semi-Supervised Learning

§ 𝑿 = (x₁, x₂) and y = {yellow, gray}
§ A training set with many unlabeled examples 𝑿 and some labeled samples (𝑿, y)
§ The model labels the data in the training set using a modified unsupervised learning procedure

[Figure: mostly unlabeled points in the (x₁, x₂) plane, a few of them labeled]
Example inputs 𝑿: age, money, temperature, speed, number of taxis, …

© 2021 KNIME AG. All rights reserved. 23


Supervised Learning: Classification vs. Numerical Predictions

§ 𝑿 = (𝑥1, 𝑥2) and 𝑦 = {𝑙𝑎𝑏𝑒𝑙 1, … , 𝑙𝑎𝑏𝑒𝑙 𝑛} or 𝑦 ∈ ℝ


§ A training set with many examples of (𝑿, 𝑦)
§ The model learns on the examples of the training set to produce the right value
of 𝑦 for an input vector 𝑿

Classification Numerical Predictions


y = {yellow, gray} y = temperature
y = {churn, no churn} y = number of visitors
y = {increase, unchanged, decrease} y = number of kW
y = {blonde, gray, brown, red, black} y = price
y = {job 1, job 2, ... , job n} y = number of hours

© 2021 KNIME AG. All rights reserved. 24


Training vs. Testing: Partitioning

§ Training phase: the algorithm trains a model using the data in the training set
§ Testing phase: a metric measures how well the model is performing on data in
a new dataset (the test set)

Training Set Evaluation Set* Test Set

* sometimes
© 2021 KNIME AG. All rights reserved. 25
Data Science: Process Overview

Partition data → train and apply model → evaluate performance

[Workflow: Original Data Set → Partitioning → Training Set → Train Model; Test Set → Apply Model → Score Model]

© 2021 KNIME AG. All rights reserved. 26


The CRISP-DM Cycle

https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining

© 2021 KNIME AG. All rights reserved. 27


A Classic Data Science Project

Data Preparation → Model Training → Model Optimization → Model Testing → Deployment

It always starts with some data …

§ Data Preparation: data manipulation, data blending, missing values handling, feature generation, dimensionality reduction, feature selection, outlier removal, normalization, partitioning, …
§ Model Training: model training, bag of models, model selection, ensemble models, own ensemble model, external models, import existing models, model factory, …
§ Model Optimization: parameter tuning, parameter optimization, regularization, model size, no. of iterations, …
§ Model Testing: performance measures, accuracy, ROC curve, cross-validation, …
§ Deployment: files & DBs, dashboards, REST API, SQL code export, reporting, …

© 2021 KNIME AG. All rights reserved. 28


Decision Tree Algorithm
Goal: A Decision Tree
Outlook Wind Temp Storage Sailing
sunny 3 30 no yes
sunny 3 25 no no
rain 12 15 no yes
overcast 15 2 yes no
rain 16 25 no yes
sunny 14 18 no yes
rain 3 5 yes no
sunny 9 20 no yes
overcast 14 5 yes no
sunny 1 7 yes no
rain 4 25 no no
rain 14 24 no yes
sunny 11 20 no yes
sunny 2 18 no no
overcast 8 22 no yes
overcast 13 24 no yes

© 2021 KNIME AG. All rights reserved. 30


How can we Train a Decision Tree with KNIME Analytics Platform

© 2021 KNIME AG. All rights reserved. 31


Goal: A Decision Tree
Outlook Wind Temp Storage Sailing Option 1
sunny 3 30 yes yes
sunny 3 25 yes no
rain 12 15 yes yes
overcast 15 2 no no
rain 16 25 yes yes
sunny 14 18 yes yes
Option 2
rain 3 5 no no
sunny 9 20 yes yes
overcast 14 5 no no
sunny 1 7 no no
rain 4 25 yes no
rain 14 24 yes yes
sunny 11 20 yes yes
sunny 2 18 yes no
overcast 8 22 yes yes
overcast 13 24 yes yes

How can we measure which is the best feature for a split?

© 2021 KNIME AG. All rights reserved. 32


Possible Split Criterion: Gain Ratio

Based on entropy = measure for information / uncertainty

Entropy(p) = −∑_{i=1}^{n} pᵢ · log₂ pᵢ   for p ∈ ℚⁿ

Example (mixed node): p₁ = 9/16, p₂ = 7/16
Entropy(p) = −(9/16 · log₂(9/16) + 7/16 · log₂(7/16)) ≈ 0.99

Example (pure node): p₁ = 16/16 = 1, p₂ = 0/16 = 0
Entropy(p) = −(1 · log₂ 1 + 0) = 0

© 2021 KNIME AG. All rights reserved. 33
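The course works code-free in KNIME, but outside the workflow the entropy formula above is easy to check. A minimal Python sketch (the 9-yes / 7-no counts come from the 16-row sailing table; everything else is illustrative):

```python
# Minimal sketch: entropy of a class distribution from absolute class counts.
import math

def entropy(counts):
    """Entropy of a class distribution given absolute counts per class."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]   # 0 * log(0) is treated as 0
    return -sum(p * math.log2(p) for p in probs)

print(entropy([9, 7]))     # mixed node, ~0.99
print(entropy([16, 0]))    # pure node, 0.0
```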


Possible Split Criterion: Gain Ratio

Split criterion: information gain

Gain = Entropy_before − Entropy_after

Entropy_before = Entropy(9/16, 7/16)
Entropy_after = w₁ · Entropy₁ + w₂ · Entropy₂, i.e.
Gain = Entropy_before − w₁ · Entropy₁ − w₂ · Entropy₂

where Entropy₁ and Entropy₂ are the entropies of the class distributions in the two child nodes, weighted by the fraction of rows reaching each child (here w₁ = 9/16, w₂ = 7/16).

§ Next splitting feature: feature with the highest Gain
§ Problem: favors features with many different values
§ Solution: Gain Ratio

GainRatio = Gain / SplitInfo = (Entropy_before − ∑_{i=1}^{k} wᵢ · Entropyᵢ) / (−∑_{i=1}^{k} wᵢ · log₂ wᵢ)

© 2021 KNIME AG. All rights reserved. 34


Possible Split Criterion: Gini Index

Gini index is based on Gini impurity:

Gini(p) = 1 − ∑_{i=1}^{n} pᵢ²   for p ∈ ℚⁿ

Example: p₁ = 7/16, p₂ = 9/16
Gini(p) = 1 − (7/16)² − (9/16)²

Split criterion:
Gini_split = ∑_{i=1}^{k} wᵢ · Giniᵢ = 9/16 · Gini₁ + 7/16 · Gini₂

where Gini₁ and Gini₂ are the Gini impurities of the two child nodes, with weights w₁ = 9/16 and w₂ = 7/16.

Next splitting feature: feature with the lowest Gini_split

© 2021 KNIME AG. All rights reserved. 35
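To make the three split criteria concrete, here is a minimal Python sketch (not KNIME's implementation). The candidate split with child counts [[6, 3], [3, 4]] is a hypothetical example, not one of the splits shown on the slides:

```python
# Minimal sketch of the split criteria: information gain, gain ratio, Gini split.
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def split_scores(parent_counts, child_counts):
    n = sum(parent_counts)
    weights = [sum(c) / n for c in child_counts]          # fraction of rows per child
    gain = entropy(parent_counts) - sum(w * entropy(c) for w, c in zip(weights, child_counts))
    split_info = -sum(w * math.log2(w) for w in weights if w > 0)
    gain_ratio = gain / split_info if split_info > 0 else 0.0
    gini_split = sum(w * gini(c) for w, c in zip(weights, child_counts))
    return gain, gain_ratio, gini_split

# hypothetical split of the 16 rows: child 1 = 6 yes / 3 no, child 2 = 3 yes / 4 no
print(split_scores([9, 7], [[6, 3], [3, 4]]))
```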


What happens for numerical Input Features?

Subset for each value? – NO


Solution: Binary splits

[Figure: numeric values such as 1.2, 1.7, 2, 2.3, 3.4, 3.6, 4.9, 7.4, 8, 9.2, 12.6 split by a threshold into x < t and x ≥ t]

© 2021 KNIME AG. All rights reserved. 36


The Deeper the Better?!
[Figure: a deep decision tree splitting repeatedly on wind and temp (wind ≥ 4, temp ≥ 10, temp ≥ 25, temp ≥ 22, wind ≥ 6, temp ≥ 26, …) and the resulting finely partitioned temp–wind feature space]

© 2021 KNIME AG. All rights reserved. 37


Overfitting vs Underfitting

Underfitted: the model overlooks underlying patterns in the training set
Generalized: the model captures the correlations in the training set
Overfitted: the model memorizes the training set rather than finding underlying patterns

© 2021 KNIME AG. All rights reserved. 38


Overfitting vs Underfitting

Overfitting: a model that fits the training data too well, including details and noise; negative impact on the model’s ability to generalize
Underfitting: a model that can neither model the training data nor generalize to new data

Overfitted Generalized Underfitted

© 2021 KNIME AG. All rights reserved. 39


Controlling the Tree Depth

Goal: Tree that generalizes to new data and doesn’t overfit

Pruning — idea: cut branches that seem to result from overfitting. Techniques: Reduced Error Pruning, Minimum Description Length
Early stopping — idea: define a minimum size for the tree leaves

© 2021 KNIME AG. All rights reserved. 40


Pruning - Minimum Description Length Pruning (MDL)

Definition: Description length = #bits(tree) + #bits(misclassified samples)

[Figure: Tree 1 splits on wind only; Tree 2 additionally splits on temp]

Example 1: many misclassified samples in Tree 1
=> DL(Tree 1) > DL(Tree 2) => select Tree 2

Example 2: only 1 misclassified sample in Tree 1
=> DL(Tree 1) < DL(Tree 2) => select Tree 1

© 2021 KNIME AG. All rights reserved. 41


Applying the Model – What are the Outputs?

© 2021 KNIME AG. All rights reserved. 42


No True Child Strategy

Outlook Wind Temp Storage Sailing

Training rows:
sunny 3 30 yes yes
sunny 3 25 yes no
rain 12 15 yes yes
rain 16 25 yes yes
sunny 14 18 yes yes
rain 3 5 no no
sunny 9 20 yes yes
sunny 1 7 no no
rain 4 25 yes no
rain 14 24 yes yes

Testing rows:
sunny 11 20 yes yes
sunny 2 18 yes no
overcast 8 22 yes yes
overcast 13 24 yes yes

Training: the tree splits on outlook with branches for sunny and rain only.
What happens with outlook = overcast, which appears only in testing?

© 2021 KNIME AG. All rights reserved. 43


Evaluation of Classification
Models
Evaluation Metrics

§ Why evaluation metrics?


§ Quantify the power of a model
§ Compare model configurations and/or models, and select the best performing one
§ Obtain the expected performance of the model for new data
§ Different model evaluation techniques are available for
§ Classification/regression models
§ Imbalanced/balanced target class distributions

© 2021 KNIME AG. All rights reserved. 45


Overall Accuracy

§ Definition:

# 𝑪𝒐𝒓𝒓𝒆𝒄𝒕 𝒄𝒍𝒂𝒔𝒔𝒊𝒇𝒊𝒄𝒂𝒕𝒊𝒐𝒏𝒔 (𝒕𝒆𝒔𝒕 𝒔𝒆𝒕)


𝑶𝒗𝒆𝒓𝒂𝒍𝒍 𝒂𝒄𝒄𝒖𝒓𝒂𝒄𝒚 =
# 𝑨𝒍𝒍 𝒆𝒗𝒆𝒏𝒕𝒔 (𝒕𝒆𝒔𝒕 𝒔𝒆𝒕)

§ The proportion of correct classifications

§ Downsides:
§ Only considers the performance in general and not for the different classes
§ Therefore, not informative when the class distribution is unbalanced

© 2021 KNIME AG. All rights reserved. 46


Confusion Matrix for Sailing Example

Sailing yes / no — model 1:
                      Predicted class: yes   Predicted class: no
True class: yes               22                      3
True class: no                12                     328
Accuracy = (22 + 328) / 365 = 0.96

Sailing yes / no — model 2:
                      Predicted class: yes   Predicted class: no
True class: yes                0                     25
True class: no                 0                    340
Accuracy = (0 + 340) / 365 = 0.93

§ Rows – true class values


§ Columns – predicted class values
§ Numbers on main diagonal – correctly classified samples
§ Numbers off the main diagonal – misclassified samples

© 2021 KNIME AG. All rights reserved. 47


Confusion matrix

Arbitrarily define one class value as POSITIVE and the remaining class as
NEGATIVE
                      Predicted class positive   Predicted class negative
True class positive    TRUE POSITIVE (TP)          FALSE NEGATIVE (FN)
True class negative    FALSE POSITIVE (FP)         TRUE NEGATIVE (TN)

TRUE POSITIVE (TP): actual and predicted class is positive
TRUE NEGATIVE (TN): actual and predicted class is negative
FALSE NEGATIVE (FN): actual class is positive and predicted negative
FALSE POSITIVE (FP): actual class is negative and predicted positive

Use these four statistics to calculate other evaluation metrics, such as overall
accuracy, true positive rate, and false positive rate
© 2021 KNIME AG. All rights reserved. 48
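In KNIME the Scorer node produces these statistics; as a minimal Python sketch of the same bookkeeping (the class labels and predictions below are made up, and "yes" is arbitrarily chosen as the positive class):

```python
# Minimal sketch: TP/TN/FP/FN and derived metrics from actual vs. predicted labels.
def confusion_stats(actual, predicted, positive="yes"):
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    accuracy = (tp + tn) / len(actual)
    tpr = tp / (tp + fn) if tp + fn else 0.0     # true positive rate
    fpr = fp / (fp + tn) if fp + tn else 0.0     # false positive rate
    return tp, tn, fp, fn, accuracy, tpr, fpr

actual    = ["yes", "yes", "no", "no", "no", "yes"]
predicted = ["yes", "no",  "no", "yes", "no", "yes"]
print(confusion_stats(actual, predicted))
```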
ROC Curve

§ The ROC Curve shows the false positive rate and true positive rate for
different threshold values
§ False positive rate (FPR)
§ negative events incorrectly classified as positive
§ True positive rate (TPR)
§ positive events correctly classified as positive

                      Predicted class positive   Predicted class negative
True class positive    True Positive (TP)          False Negative (FN)
True class negative    False Positive (FP)         True Negative (TN)

TPR = TP / (TP + FN)          FPR = FP / (FP + TN)

[Figure: ROC curve with the optimal threshold marked]

© 2021 KNIME AG. All rights reserved. 49


Cohen‘s Kappa (κ) vs. Overall accuracy

Confusion matrix A:                  Confusion matrix B (switch TP and FP):
            Positive  Negative                   Positive  Negative
Positive       14         6          Positive        6        14
Negative        5        75          Negative        5        75

Matrix A:
p_e1 = 19/100 × 20/100,  p_e2 = 81/100 × 80/100
p_e = p_e1 + p_e2 = 0.686
p_0 = 89/100 = 0.89   (overall accuracy)
κ = (p_0 − p_e) / (1 − p_e) = 0.204 / 0.314 ≈ 0.65

Matrix B:
p_e1 = 11/100 × 20/100,  p_e2 = 89/100 × 80/100
p_e = p_e1 + p_e2 = 0.734
p_0 = 81/100 = 0.81   (overall accuracy)
κ = (p_0 − p_e) / (1 − p_e) = 0.076 / 0.266 ≈ 0.29

κ = 1: perfect model performance
κ = 0: the model performance is equal to a random classifier

© 2021 KNIME AG. All rights reserved. 50
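A minimal Python sketch of the κ computation above (the 2×2 matrices are the ones from the slide, rows = actual class, columns = predicted class):

```python
# Minimal sketch: Cohen's kappa from a confusion matrix.
def cohens_kappa(cm):
    n = sum(sum(row) for row in cm)
    p0 = sum(cm[i][i] for i in range(len(cm))) / n            # observed agreement
    pe = sum((sum(cm[i]) / n) * (sum(row[i] for row in cm) / n)
             for i in range(len(cm)))                          # chance agreement
    return (p0 - pe) / (1 - pe)

print(round(cohens_kappa([[14, 6], [5, 75]]), 2))   # ~0.65
print(round(cohens_kappa([[6, 14], [5, 75]]), 2))   # ~0.29
```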


Exercise: 01_Training_a_Decision_Tree_Model

§ Dataset: Sales data of individual residential properties in Ames, Iowa from 2006
to 2010.
§ One of the columns is the overall condition ranking, with values between 1 and
10.
§ Goal: train a binary classification model, which can predict whether the overall
condition is high or low.

You can download the training workflows from the KNIME Hub:
https://hub.knime.com/knime/spaces/Education/latest/Courses/

© 2021 KNIME AG. All rights reserved. 51


Exercise Session 1

§ Import the course material to KNIME Analytics Platform

1. Right click on LOCAL and select Import KNIME Workflow…
2. Click on Browse and select the downloaded .knar file
3. Click on Finish

© 2021 KNIME AG. All rights reserved. 52


Exercise: Decision_Tree

© 2021 KNIME AG. All rights reserved. 53


Session II: Regression
Models, Ensemble Models
& Logistic Regression
Regression Problems

55
Regression Analysis

§ Prediction of numerical target values

§ Commonality with models for classification


§ First, construct a model
§ Second, use model to predict unknown value
§ Major method for prediction is regression in all its flavors
§ Simple and multiple regression
§ Linear and non-linear regression

§ Difference from classification


§ Classification aims at predicting categorical class label
§ Regression models aim at predicting values from continuous-valued functions

© 2021 KNIME AG. All rights reserved. 56


Regression
Predict numeric outcomes on existing data (supervised)

Applications
§ Forecasting
§ Quantitative Analysis

Methods
§ Linear
§ Polynomial
§ Regression Trees
§ Partial Least Squares

© 2021 KNIME AG. All rights reserved. 57


Linear Regression Algorithm
Linear Regression

Predicts the values of the target variable y


based on a linear combination of
the values of the input feature(s) xj
Two input features: ŷ = a₀ + a₁x₁ + a₂x₂
p input features: ŷ = a₀ + a₁x₁ + a₂x₂ + ⋯ + a_p·x_p

§ Simple regression: one input feature → regression line
§ Multiple regression: several input features → regression hyper-plane
§ Residuals: differences between observed and predicted values (errors)
§ Residuals: differences between observed and predicted values (errors)
Use the residuals to measure the model fit

© 2021 KNIME AG. All rights reserved. 59


Simple Linear Regression

Optimization goal: minimize sum of squared residuals

∑_{i=1}^{n} eᵢ² = ∑_{i=1}^{n} (yᵢ − ŷᵢ)²

[Figure: scatter plot with the fitted line ŷ = a₀ + a₁x and the residual eᵢ between an observed point yᵢ and the line]
© 2021 KNIME AG. All rights reserved. 60
Simple Linear Regression

§ Think of a straight line ŷ = f(x) = a + bx
§ Find a and b to model all observations (xᵢ, yᵢ) as closely as possible
§ → SSE F(a, b) = ∑_{i=1}^{n} (f(xᵢ) − yᵢ)² = ∑_{i=1}^{n} (a + bxᵢ − yᵢ)² should be minimal
§ That is:

∂F/∂a = ∑_{i=1}^{n} 2(a + bxᵢ − yᵢ) = 0
∂F/∂b = ∑_{i=1}^{n} 2(a + bxᵢ − yᵢ)·xᵢ = 0

§ → A unique solution exists for a and b

© 2021 KNIME AG. All rights reserved. 61


Linear Regression

§ Optimization goal: minimize the squared residuals

∑_{i=1}^{n} eᵢ² = ∑_{i=1}^{n} (yᵢ − ∑_{j=0}^{p} aⱼ xⱼ,ᵢ)² = (y − Xa)ᵀ(y − Xa)

§ Solution:
â = (XᵀX)⁻¹ Xᵀ y

§ Computational issues:
§ XᵀX must have full rank, and thus be invertible
(problems arise if linear dependencies between input features exist)
§ The solution may be unstable if input features are almost linearly dependent

© 2021 KNIME AG. All rights reserved. 62
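A minimal NumPy sketch of the closed-form solution above (the data points are made up; solving the normal equations is used instead of explicitly inverting XᵀX):

```python
# Minimal sketch: a_hat = (X^T X)^{-1} X^T y, with a column of ones for the intercept a0.
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])        # one input feature
y = np.array([2.1, 3.9, 6.2, 7.8])

X1 = np.hstack([np.ones((X.shape[0], 1)), X])     # prepend intercept column
a_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)      # solve the normal equations
print(a_hat)                                       # [intercept, slope]
```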


Linear Regression: Summary

§ Positive:
§ Strong mathematical foundation
§ Simple to calculate and to understand
(For moderate number of dimensions)
§ High predictive accuracy
(In many applications)

§ Negative:
§ Many dependencies are non-linear
(Can be generalized)
§ Model is global and cannot adapt well to locally different data distributions
But: Locally weighted regression, CART

© 2021 KNIME AG. All rights reserved. 63


Polynomial Regression

Predicts the values of the target variable y


based on a polynomial combination of degree d of
the values of the input feature(s) xj
ỹ = a₀ + ∑_{j=1}^{p} a_{j,1} xⱼ + ∑_{j=1}^{p} a_{j,2} xⱼ² + ⋯ + ∑_{j=1}^{p} a_{j,d} xⱼᵈ

§ Simple regression: one input feature → regression curve
§ Multiple regression: several input features → regression hypersurface
§ Residuals: differences between observed and predicted values (errors)
Use the residuals to measure the model fit

© 2021 KNIME AG. All rights reserved. 64


Evaluation of Regression Models
Numeric Errors: Formulas
Error metric — formula — notes

§ R-squared: 1 − ∑_{i=1}^{n}(yᵢ − f(xᵢ))² / ∑_{i=1}^{n}(yᵢ − ȳ)² — universal range: the closer to 1 the better
§ Mean absolute error (MAE): (1/n) ∑_{i=1}^{n} |yᵢ − f(xᵢ)| — equal weights to all distances; same unit as the target column
§ Mean squared error (MSE): (1/n) ∑_{i=1}^{n} (yᵢ − f(xᵢ))² — common loss function
§ Root mean squared error (RMSE): √[(1/n) ∑_{i=1}^{n} (yᵢ − f(xᵢ))²] — weights big differences more; same unit as the target column
§ Mean signed difference: (1/n) ∑_{i=1}^{n} (yᵢ − f(xᵢ)) — only informative about the direction of the error
§ Mean absolute percentage error (MAPE): (1/n) ∑_{i=1}^{n} |yᵢ − f(xᵢ)| / |yᵢ| — requires non-zero target column values

© 2021 KNIME AG. All rights reserved. 67
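A minimal NumPy sketch of the metrics in the table (the example values are the ones used on the MAE vs. RMSE slide that follows):

```python
# Minimal sketch of the numeric error metrics above.
import numpy as np

def regression_metrics(y, y_pred):
    y, y_pred = np.asarray(y, float), np.asarray(y_pred, float)
    err = y - y_pred
    mae  = np.mean(np.abs(err))
    mse  = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    msd  = np.mean(err)                                  # mean signed difference
    mape = np.mean(np.abs(err) / np.abs(y))              # requires non-zero targets
    r2   = 1 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MSD": msd, "MAPE": mape, "R2": r2}

print(regression_metrics([2, 4, 5, 8], [4, 6, 8, 10]))
```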


MAE (Mean Absolute Error) vs. RMSE (Root Mean Squared Error)

MAE RMSE

Easy to interpret – mean average absolute error Cannot be directly interpreted as the average error

All errors are equally weighted Larger errors are weighted more

Generally smaller than RMSE Ideal when large deviations need to be avoided

Example:
Actual values = [2,4,5,8], MAE RMSE

Case 1: Predicted Values = [4, 6, 8, 10] Case 1 2.25 2.29

Case 2: Predicted Values = [4, 6, 8, 14] Case 2 3.25 3.64

© 2021 KNIME AG. All rights reserved. 68


R-squared vs. RMSE

R-squared RMSE

Relative measure: Absolute measure:


Proportion of variability explained by the model How much deviation at each point
Range: Same scale as the target
0 (no variability explained) to
1 (all variability explained)

Example:
Actual values = [2,4,5,8], R-sq RMSE

Case 1: Predicted Values = [3, 4, 5, 6] Case 1 0.96 1.12

Case 2: Predicted Values = [3, 3, 7, 7] Case 2 0.65 1.32

© 2021 KNIME AG. All rights reserved. 70


Numeric Scorer

§ Similar to scorer node, but for nodes with numeric predictions


§ Compare dependent variable values to predicted values to evaluate model
quality.
§ Report R2, RMSE, MAPE, etc.

© 2021 KNIME AG. All rights reserved. 75


Regression Tree

76
Regression Tree: Goal

We want to model the target variable with piecewise lines
→ no knowledge of the functional form required

[Figure: y over x approximated by piecewise constant segments]

© 2021 KNIME AG. All rights reserved. 77


Regression Tree: Initial Split
Local mean of the observations in segment m:  c_m = (1/n_m) ∑_{i∈m} yᵢ
Squared sum of errors of segment m:  E_m = ∑_{i∈m} (yᵢ − c_m)²

The optimal boundary s should minimize the total squared sum over all segments:  ∑_m E_m

[Figure: scatter plot of y over x with the candidate split point s]

© 2021 KNIME AG. All rights reserved. 78


Regression Tree: Initial Split

x ≤ 93.5?
Y: C₁ = 28.9    N: C₂ = 17.8

[Figure: the data with the split point s = 93.5]

© 2021 KNIME AG. All rights reserved. 79


Regression Tree: Growing the Tree

Repeat the splitting process within each segment:

x ≤ 93.5?
  Y → x ≤ 70.5?  (Y: C = 33.9,  N: C = 26.4)
  N → C = 17.8

[Figure: the data with both split points]

© 2021 KNIME AG. All rights reserved. 80


Regression Tree: Final Model

© 2021 KNIME AG. All rights reserved. 81


Regression Tree: Algorithm

Start with a single node containing all points.


1. Calculate cᵢ and Eᵢ.
2. If all points have the same value for feature xⱼ, stop.
3. Otherwise, find the best binary split that reduces E_{j,s} as much as possible.
§ If E_{j,s} does not reduce enough → stop
§ If a node contains less than the minimum node size → stop
§ Otherwise, take that split, creating two new nodes.
§ In each new node, go back to step 1.

© 2021 KNIME AG. All rights reserved. 84
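A minimal sketch of step 3 for a single numeric feature (the function name `best_split` and the toy data are hypothetical, for illustration only):

```python
# Minimal sketch: scan candidate thresholds of one feature and pick the binary
# split that minimizes the summed squared error of the two resulting segments.
import numpy as np

def sse(y):
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    best = (None, np.inf)
    for s in np.unique(x)[:-1]:                  # candidate thresholds
        left, right = y[x <= s], y[x > s]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (s, total)
    return best                                   # (threshold, total squared error)

x = [60, 65, 70, 80, 90, 95, 100]
y = [34, 33, 35, 26, 27, 18, 17]
print(best_split(x, y))
```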


Regression Trees: Summary

§ Differences to decision trees:


§ Splitting criterion: minimizing intra-subset variation (error)
§ Pruning criterion: based on numeric error measure
§ Leaf node predicts average target values of training instances reaching that node

§ Can approximate piecewise constant functions


§ Easy to interpret

© 2021 KNIME AG. All rights reserved. 85


Regression Trees: Pros & Cons

§ Finding of (local) regression values (average)


§ Problems:
§ No interpolation across borders
§ Heuristic algorithm: unstable and not optimal.

§ Extensions:
§ Fuzzy trees (better interpolation)
§ Local models for each leaf (linear, quadratic)

© 2021 KNIME AG. All rights reserved. 86


Ensemble Models
Tree Ensemble Models

§ General idea: take advantage of the “wisdom of the crowd”
§ Ensemble models: combining predictions from many predictors, e.g. decision trees
§ Leads to a more accurate and robust model
§ Model is difficult to interpret: there are multiple trees in the model
§ Typically for classification, the individual models vote and the majority wins; for regression, the individual predictions are averaged

[Figure: an input X passed to many trees with predictions P1, P2, …, Pn combined into y]
© 2021 KNIME AG. All rights reserved. 88


Bagging - Idea

§ One option is ”bagging” (Bootstrap AGGregatING)


§ For each tree / model a training set is generated by sampling uniformly with replacement from the standard training set

[Figure: three bootstrap samples of the training set, each used to build one tree]
© 2021 KNIME AG. All rights reserved. 89


Example for Bagging

Full training set:
RowID   x₁  x₂  y
Row_1   2   6   Class 1
Row_2   4   1   Class 2
Row_3   9   3   Class 2
Row_4   2   7   Class 1
Row_5   8   1   Class 2
Row_6   2   6   Class 1
Row_7   5   2   Class 2

Sampled training set (drawn with replacement):
RowID   x₁  x₂  y
Row_3   9   3   Class 2
Row_6   2   6   Class 1
Row_1   2   6   Class 1
Row_3   9   3   Class 2
Row_5   8   1   Class 2
Row_6   2   6   Class 1
Row_1   2   6   Class 1

© 2021 KNIME AG. All rights reserved. 91


An Extra Benefit of Bagging: Out of Bag Estimation

§ Able to evaluate the model using the training data


§ Apply trees to samples that haven’t been used for training

[Figure: out-of-bag samples X1, X2, … are passed only to the trees that did not see them during training, producing predictions y1_OOB, y2_OOB, …]

© 2021 KNIME AG. All rights reserved. 92


Random Forest

§ Bag of decision trees, with an extra element of randomization
§ Each node in the decision tree only “sees” a subset of the input features, typically √N features to pick from
§ Random forests tend to be very robust w.r.t. overfitting

[Figure: one tree built from a bootstrap sample, with a random feature subset considered at each node]
© 2021 KNIME AG. All rights reserved. 93


Boosting - Idea

§ Starts with a single tree built from the data


§ Fits a tree to residual errors from the previous model to refine the model
sequentially

[Figure: build a tree, compute the residual errors from the previous model, fit the next tree to those residuals, and so on]

© 2021 KNIME AG. All rights reserved. 94


Boosting - Idea

§ Gradient boosting method
§ A shallow tree (depth 4 or less) is built at each step
§ to fit the residual errors from the previous step,
§ resulting in a tree h_m(x)
§ The resulting tree is added to the latest model to update it:
F_m(x) = F_{m−1}(x) + γ_m · h_m(x)
§ where F_{m−1}(x) is the model from the previous step
§ The weight γ_m is chosen to minimize the loss function
§ Loss function: quantifies the difference between model predictions and data
© 2021 KNIME AG. All rights reserved. 95
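As a minimal Python sketch of this idea for regression with squared loss (scikit-learn is assumed to be available; the data, the fixed learning rate 0.5 and the depth-1 trees are illustrative choices, not KNIME defaults):

```python
# Minimal sketch of gradient boosting for regression:
# each shallow tree is fit to the residuals of the current model.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1], [2], [3], [4], [5], [6]], dtype=float)
y = np.array([1.2, 1.9, 3.1, 3.9, 6.2, 5.8])

prediction = np.full_like(y, y.mean())              # F_0: constant model
learning_rate, trees = 0.5, []
for _ in range(20):
    residuals = y - prediction                      # negative gradient for squared loss
    tree = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    trees.append(tree)
    prediction += learning_rate * tree.predict(X)   # F_m = F_{m-1} + gamma * h_m

print(np.round(prediction, 2))
```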


Gradient Boosting Example – Regression

Regression tree
with depth 1

© 2021 KNIME AG. All rights reserved. 96


Gradient Boosted Trees

§ Can be used for classification and regression


§ Large number of iterations – prone to overfitting
§ ~100 iterations are sufficient
§ Can introduce randomness in choice of data subsets (“stochastic gradient
boosting”) and choice of input features

© 2021 KNIME AG. All rights reserved. 97


Ensemble Tree Nodes in KNIME Analytics Platform

Classification Problems Regression Problems

© 2021 KNIME AG. All rights reserved. 98


Parameter Optimization

© 2021 KNIME AG. All rights reserved. 99


Logistic Regression
What is a Logistic Regression (algorithm)?

§ Another algorithm to train a classification model

I know already the decision


tree algorithm and tree
ensembles. Why do I need
another one?

© 2021 KNIME AG. All rights reserved. 101


Why Shouldn’t we Always use the Decision Tree?

© 2021 KNIME AG. All rights reserved. 102


Decision Boundary of a Logistic Regression

© 2021 KNIME AG. All rights reserved. 103


Linear Regression vs. Logistic Regression

Target variable y — Linear: numeric, y ∈ (−∞, ∞) or [a, b]; Logistic: nominal, y ∈ {0, 1, 2, 3} or {red, white}

Functional relationship between the features and …
— Linear: … the target value y:  y = f(x₁, …, x_p, β₀, …, β_p) = β₀ + β₁x₁ + ⋯ + β_p x_p
— Logistic: … the class probability P(y = class i):  P(y = cᵢ) = f(x₁, …, x_p, β₀, …, β_p)

Goal: find the regression coefficients β₀, …, β_p
© 2021 KNIME AG. All rights reserved. 104


Let’s find out how Binary Logistic Regression works!

§ Idea: Train a function, which gives us the probability for each class (0 and 1)
based on the input features

§ Recap on probabilities
§ Probabilities are always between 0 and 1
§ The probability of all classes sum up to 1

P(y = 1) = p₁  ⇒  P(y = 0) = 1 − p₁

⇒ It is sufficient to model the probability for one class

© 2021 KNIME AG. All rights reserved. 106


Let’s Find Out How Binary Logistic Regression Works!

P(y = 1) = f(x₁, x₂; β₀, β₁, β₂) := 1 / (1 + e^{−(β₀ + β₁x₁ + β₂x₂)})

[Figures: the feature space and the probability function given x₁ = 2]

© 2021 KNIME AG. All rights reserved. 107


More General: Binary Logistic Regression

§ Model:
π = P(y = 1) = 1 / (1 + exp(−z)),  with z = β₀ + β₁x₁ + ⋯ + β_p x_p = Xβ

§ Goal: find the regression coefficients β = (β₀, …, β_p)
§ Notation:
§ yᵢ is the class value for sample i
§ x₁, …, x_p is the set of input features, X = (1, x₁, …, x_p)
§ The training data set has m observations (yᵢ, x₁ᵢ, …, x_pᵢ)

© 2021 KNIME AG. All rights reserved. 110


How can we Find the Best Coefficients β?

Maximize the product of the probabilities ⇒ likelihood function

L(β; y, X) = ∏_{i=1}^{m} P(y = yᵢ) = ∏_{i=1}^{m} πᵢ^{yᵢ} (1 − πᵢ)^{1−yᵢ}

Why does it make sense to maximize this function?

P(y = yᵢ) = πᵢ if yᵢ = 1, and 1 − πᵢ if yᵢ = 0
          = πᵢ^{yᵢ} (1 − πᵢ)^{1−yᵢ}

Remember: πᵢ = P(y = 1), and u⁰ = 1, u¹ = u for u ∈ ℝ

© 2021 KNIME AG. All rights reserved. 111


Max Likelihood and Log Likelihood Functions

§ Maximize the likelihood function L(β; y, X)

max_β L(β; y, X) = max_β ∏_{i=1}^{m} πᵢ^{yᵢ} (1 − πᵢ)^{1−yᵢ}

§ Equivalent to maximizing the log-likelihood function LL(β; y, X)

max_β LL(β; y, X) = max_β ∑_{i=1}^{m} [ yᵢ ln πᵢ + (1 − yᵢ) ln(1 − πᵢ) ]

© 2021 KNIME AG. All rights reserved. 112


How can we Find these Coefficients?

§ To find the coefficients of our model we want to find 𝜷 so that the value of the
function 𝐿𝐿 𝜷; 𝒚, 𝑿 is maximal

§ KNIME Analytics Platform provides two algorithms


§ Iteratively re-weighted least squares
§ Uses the idea of the Newton method
§ Stochastic average gradient descent

© 2021 KNIME AG. All rights reserved. 113


Idea: Gradient Descent Method

max 𝐿𝐿(𝜷; 𝑿, 𝒚) ⟺ min −𝐿𝐿(𝜷; 𝑿, 𝒚)

Optimal β̂

© 2021 KNIME AG. All rights reserved. 116


Idea: Gradient Descent Method
max_β LL(β; X, y) ⟺ min_β −LL(β; X, y)

§ Example: min f(β) := −LL(β)
§ Start from an arbitrary point
§ Move towards the minimum with step size Δs
§ If f(β) is strictly convex
⇒ only one global minimum exists
§ Z-normalization of the input data leads to better convergence

[Figure: gradient descent steps on a convex curve towards the optimal β̂]

© 2021 KNIME AG. All rights reserved. 117
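KNIME runs this optimization inside the Logistic Regression Learner node; as a minimal Python sketch of the same idea (made-up data, a fixed step size, and the fact that the gradient of −LL for this model is Xᵀ(π − y)):

```python
# Minimal sketch: minimize -LL(beta) for binary logistic regression by gradient descent.
import numpy as np

X_raw = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([0, 0, 1, 1, 1])

X = np.hstack([np.ones((len(y), 1)), X_raw])      # add intercept column
beta = np.zeros(X.shape[1])
step = 0.1                                        # fixed learning rate (delta s)

for _ in range(1000):
    pi = 1.0 / (1.0 + np.exp(-X @ beta))          # P(y = 1) for each row
    gradient = X.T @ (pi - y)                     # gradient of -LL
    beta -= step * gradient

print(beta)                                       # fitted coefficients
```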


Learning Rate / Step Length Δ𝑠

[Figure: Δs too small (slow convergence), Δs too large (overshooting / divergence), Δs just right]

© 2021 KNIME AG. All rights reserved. 118


Learning Rate Δ𝑠

§ Fixed: Δs_k = Δs₀
§ Annealing: Δs_k = Δs₀ / (1 + k/α), with iteration number k and decay rate α
§ Line Search: Learning rate strategy that tries to find the optimal learning rate

© 2021 KNIME AG. All rights reserved. 119


Is there a way to handle Overfitting as well? (optional)

§ To avoid overfitting: add regularization by penalizing large weights

§ L₂ regularization = coefficients are Gauss distributed with σ = 1/λ
l(β̂; y, X) := −LL(β̂; y, X) + (λ/2)·‖β̂‖₂²

§ L₁ regularization = coefficients are Laplace distributed with σ = 2/λ
l(β̂; y, X) := −LL(β̂; y, X) + λ·‖β̂‖₁

⇒ The smaller σ, the smaller the coefficients βᵢ

© 2021 KNIME AG. All rights reserved. 122


Impact of Regularization

© 2021 KNIME AG. All rights reserved. 123


Interpretation of the Coefficients

§ Interpretation of the sign


§ 𝛽) > 0 : Bigger 𝑥) lead to higher probability
§ 𝛽) < 0 : Bigger 𝑥) lead to smaller probability

© 2021 KNIME AG. All rights reserved. 124


Interpretation of the p Value

§ p-value < α: the input feature has a significant impact on the dependent variable.

© 2021 KNIME AG. All rights reserved. 125


Summary Logistic Regression

§ Logistic regression is used for classification problems


§ The regression coefficients are calculated by maximizing the likelihood function,
which has no closed form solution, hence iterative methods are used.
§ Regularization can be used to avoid overfitting.
§ The p-value shows us whether an independent variable is significant

© 2021 KNIME AG. All rights reserved. 129


Exercises

§ Regression Exercises:
§ Goal: Predicting the house price
§ 01_Linear_Regression
§ 02_Regression_Tree
§ Classification Exercises:
§ Goal: Predicting the house condition (high /low)
§ 03_Radom_Forest (with optional exercise to build a parameter
optimization loop)
§ 04_Logistic_Regression

© 2021 KNIME AG. All rights reserved. 130


Session 3: Neural Networks and
Recommendation Engines

131
Artificial Neurons and Networks
Biological vs. Artificial

Biological Neuron Biological Neural Networks

Artificial Neuron (Perceptron) Artificial Neural Networks


(Multilayer Perceptron, MLP)

Perceptron: y = f(x₁w₁ + x₂w₂ + b)
MLP: each neuron computes y = f(∑ᵢ xᵢwᵢ), with the bias written as b = w₀

[Figure: a single perceptron with inputs x₁, x₂, weights w₁, w₂ and bias b, and a multilayer perceptron with inputs x₁ … xₙ]

© 2021 KNIME AG. All rights reserved. 133


Architecture / Topology

Input Layer → Hidden Layer → Output Layer (fully connected, feed forward)

Forward pass:
o = f(W²x)
y = f(W³o)

[Figure: network with inputs x₁, x₂, hidden activations o₁², o₂², o₃², weight matrices W² and W³, and output y]
© 2021 KNIME AG. All rights reserved. 134


Same with Matrix Notations

Input Layer → Hidden Layer → Output Layer

Forward pass (matrix notation):
o = f(W²x)
y = f(W³o)

f( ) = activation function

© 2021 KNIME AG. All rights reserved. 135


Neural Architectures

completely feedforward recurrent


connected (directed, a-cyclic) (feedback connections)

example: example: example:


§ Associative § auto associative § recurrent neural
neural network neural network network (for time
§ Hopfield § Multi Layer Perceptron series recognition)

© 2021 KNIME AG. All rights reserved. 136


Frequently used activation functions

Sigmoid: f(a) = 1 / (1 + e^{−ha})
Tanh: f(a) = (e^{2ha} − 1) / (e^{2ha} + 1)
Rectified Linear Unit (ReLU): f(a) = max(0, ha)

© 2021 KNIME AG. All rights reserved. 137


What can a single Perceptron do?

[Figure: a single perceptron y = σ(x₁w₁ + x₂w₂ + b) applied to 2D binary input patterns — which of the shown class patterns can it separate?]
© 2021 KNIME AG. All rights reserved. 138


What can a 3-neuron MLP do?

[Figure: a 2D class pattern that is not linearly separable — what can a 3-neuron MLP do with it?]
© 2021 KNIME AG. All rights reserved. 139


MLP: Example

[Figure sequence (slides 140–146): a small MLP with two hidden neurons and one output neuron classifying points in the (x, y) plane. Each hidden neuron defines one linear decision boundary (in the example, boundaries of the form 1 + x/2 − y ≷ 0 and 2 − x − y ≷ 0); the output neuron combines the two half-planes, so the region enclosed by both boundaries is classified differently from the rest of the plane.]

© 2021 KNIME AG. All rights reserved. 140–146


Back-Propagation
Training of a Feed Forward Neural Network - MLP

§ Teach (ensemble of) neuron(s) a desired input-output behavior.


§ Show examples from the training set repeatedly
§ Networks adjusts parameters to fit underlying function
§ topology
§ weights
§ internal functional parameters

© 2021 KNIME AG. All rights reserved. 148


Training of a Feed Forward Neural Network - MLP

Input Layer → Hidden Layer → Output Layer

Forward pass:
o = f(W²x)
y = f(W³o)

Error between target t and network output y:
E = (1/2) ∑_j (t_j − y_j)²

Gradient descent weight update:
Δw_{ji} = −η · ∂E/∂w_{ji}
© 2021 KNIME AG. All rights reserved. 149


... Some Calculations for the Output Layer ....

∂E/∂w_{ji} = ∂/∂w_{ji} [½ ∑_j (t_j − y_j)²] = −(t_j − y_j) · ∂y_j/∂w_{ji}

with y_j = g(h_j) and h_j = ∑_i w_{ji} x_i:

∂E/∂w_{ji} = −(t_j − y_j) · g′(h_j) · ∂h_j/∂w_{ji} = −(t_j − y_j) · g′(h_j) · x_i

Δw_{ji} = −η · ∂E/∂w_{ji} = −η · δ_j^{out} · x_i,   with δ_j^{out} := −(t_j − y_j) · g′(h_j)

© 2021 KNIME AG. All rights reserved. 150


… some Calculations for the Hidden Layer …
Applying the chain rule one layer further back, the error signal of the output neurons is propagated to the hidden neurons:

δ_j^{hidden} = f′(a_j^{hidden}) · ∑_k δ_k^{out} · w_{jk}^{out}

Δw_{ij}^{hidden} = −η · δ_j^{hidden} · x_i

Do you understand now why the sigmoid is a commonly used activation function?
(Its derivative is simply f′(a) = f(a)·(1 − f(a)), so the derivatives above are cheap to compute.)

© 2021 KNIME AG. All rights reserved. 151


Step 1. Forward Pass

Input Hidden Output


Layer Layer Layer

[Figure: the MLP with weight matrices W² and W³]

1. Forward pass:
o = f(W²x)
y = f(W³o)

© 2021 KNIME AG. All rights reserved. 152


Step 2. Backward Pass

[Figure: the error signal δ flows backwards from the output (δ_out) through the hidden layer (δ_hidden)]

2. Backward pass:
δ_j = ∂E/∂o_j · ∂o_j/∂net_j =
  (o_j − t_j) · o_j (1 − o_j)             for an output neuron
  [∑_{k∈K} w_{jk} δ_k] · o_j (1 − o_j)     for a hidden neuron

Δw_{ij} = −η · o_i · δ_j

© 2021 KNIME AG. All rights reserved. 153
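A minimal NumPy sketch of one forward and one backward pass for a 2-3-1 sigmoid MLP, following the delta rules above (weights, input and target are made-up values, and the batch contains a single pattern):

```python
# Minimal sketch: one forward pass and one backward pass for a tiny MLP.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
W2 = rng.normal(size=(3, 2))       # hidden-layer weights
W3 = rng.normal(size=(1, 3))       # output-layer weights
x = np.array([0.5, -1.0])
t = np.array([1.0])                # target
eta = 0.5                          # learning rate

# forward pass
o = sigmoid(W2 @ x)                # hidden activations
y = sigmoid(W3 @ o)                # network output

# backward pass (sigmoid derivative: y * (1 - y))
delta_out = (y - t) * y * (1 - y)
delta_hidden = (W3.T @ delta_out) * o * (1 - o)
W3 -= eta * np.outer(delta_out, o)       # delta w = -eta * o_i * delta_j
W2 -= eta * np.outer(delta_hidden, x)

print(y[0])                        # output before the weight update
```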


Learning Rate η

η too small η too large η just right

© 2021 KNIME AG. All rights reserved. 154


Training: Batch vs. Online

§ Batch Training: Weight update after all patterns


§ correct
§ computationally expensive and slow
§ works with reasonably large learning rates (fewer updates!)
§ Online Training: Weight update after each pattern
§ Approximation
§ can (in theory) run into oscillations
§ faster (fewer epochs!)
§ smaller learning rates necessary

© 2021 KNIME AG. All rights reserved. 155


Back-Propagation: Optimizations

§ Weight Decay:
§ try to keep weights small
§ Momentum:
§ increase weight updates as long as they have the same sign
§ Resilient Backpropagation:
§ estimate optimum for weight based on assumption that error surface is a polynomial.

© 2021 KNIME AG. All rights reserved. 156


Overfitting

§ MLP describe potentially very complex relationships


§ Danger of fitting training data too well: Overfitting
§ Modeling of training data instead of underlying concept
§ Modeling of artifacts or outliers

© 2021 KNIME AG. All rights reserved. 157


Knowledge Extraction and MLPs

§ MLPs are powerful but black boxes


§ Rule extraction only possible in some cases
§ VI-Analysis (interval propagation)
§ extraction of decision trees
§ Problems:
§ Global influence of each neuron
§ Interpretation of hidden layer(s) complicated

§ Possible Solution:
§ Local activity of neurons in hidden layer: Local Basis Function Networks

© 2021 KNIME AG. All rights reserved. 158


Deep Learning
Recurrent Neural Networks
What are Recurrent Neural Networks?

§ Recurrent Neural Network (RNN) are a family of neural networks used for
processing of sequential data
§ RNNs are used for all sorts of tasks:
§ Language modeling / Text generation
§ Text classification
§ Neural machine translation
§ Image captioning
§ Speech to text
§ Numerical time series data, e.g. sensor data

© 2021 KNIME AG. All rights reserved. 161


Why do we need RNNs for Sequential Data?

§ Goal: Translation network from German to English


“Ich mag Schokolade” => “I like chocolate”

§ One option: use a feed forward network to translate word by word

[Figure: a feed forward network mapping one input word x to one output word y]

§ But what happens with this question?


Input x Output y

Ich I
“Mag ich Schokolade?”
mag like
=> “Do I like chocolate?”
Schokolade chocolate

© 2021 KNIME AG. All rights reserved. 162


Why do we need RNNs for Sequential Data?

§ Problems:
§ Each time step is completely independent
§ For translations we need context
§ More generally: we need a network that remembers inputs from the past
§ Solution: recurrent neural networks

[Figure: the word-by-word feed forward network from the previous slide]

Input x Output y

Ich I
mag like
Schokolade chocolate

© 2021 KNIME AG. All rights reserved. 163


What are RNNs?

Image Source: Christopher Olah, https://colah.github.io/posts/2015-08-Understanding-LSTMs/

© 2021 KNIME AG. All rights reserved. 164


From Feed Forward to Recurrent Neural Networks

[Figure: a feed forward network with weights W²x and W³y collapsed into a single recurrent unit — the hidden state is fed back into itself, so input x and the previous state together produce output y]

© 2021 KNIME AG. All rights reserved. 165


From Feed Forward to Recurrent Neural Networks

[Figure: the recurrent network unrolled over time — inputs x₀, x₁, x₂, x₃ are processed by the same weights W²x and W³y, producing outputs y₀, y₁, y₂, y₃, with the hidden state passed from one time step to the next]

© 2021 KNIME AG. All rights reserved. 166


Simple RNN unit

Image Source: Christopher Olah, https://colah.github.io/posts/2015-08-Understanding-LSTMs/

© 2021 KNIME AG. All rights reserved. 167


Limitations of Simple Layer Structures

The “memory” of simple RNNs is sometimes too limited to be useful


§ “Cars drive on the ” (road)
§ “I love the beach – my favorite sound is the crashing of the “
(cars? glass? waves?)

© 2021 KNIME AG. All rights reserved. 168


LSTM = Long Short Term Memory Unit

§ Special type of unit with three gates


§ Forget gate
§ Input gate
§ Output gate

Image Source: Christopher Olah, https://colah.github.io/posts/2015-08-Understanding-LSTMs/

© 2021 KNIME AG. All rights reserved. 169


Different Network-Structures and Applications

Many to Many

[Figures: a language model predicting the next word of a sentence (“I like sailing” shifted by one position, ending with <eos>), and an encoder–decoder network for neural machine translation between “Ich gehe gerne segeln” and “I like sailing <eos>”, where the decoder input starts with <sos>]

Language model — Neural machine translation

© 2021 KNIME AG. All rights reserved. 170


Different Network-Structures and Applications

Many to one — One to many

[Figures: language / text classification — a sequence such as “I like to go sailing” is read by the network and mapped to a single label (“English”); image captioning — a single image is mapped to a word sequence (“Couple sailing on a lake”)]

Language classification, text classification — Image captioning

© 2021 KNIME AG. All rights reserved. 171


Neural Network: Code-free

© 2021 KNIME AG. All rights reserved. 172


Convolutional Neural Networks
(CNN)
Convolutional Neural Networks (CNN)

§ Used when data has spatial relationships,


e.g. images
§ Instead of connecting every neuron to the
new layer a sliding window is used
§ Some convolutions may detect edges or
corners, while others may detect cats,
dogs, or street signs inside an image

Image from: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53

© 2021 KNIME AG. All rights reserved. 174


Convolutional Neural Networks

© 2021 KNIME AG. All rights reserved. 175


Building CNNs with KNIME

© 2021 KNIME AG. All rights reserved. 176


Recommendation Engines
Recommendation Engines and Market Basket Analysis

From the analysis of many shopping baskets … (A-priori algorithm) → Recommendation

[Figure: IF {items in the basket} THEN {recommended item}]

© 2021 KNIME AG. All rights reserved. 178


Recommendation Engines or Market Basket Analysis
From the analysis of the reactions
of many people to the same item ... Recommendation

Collaborative Filtering

IF A has the same opinion as B on


an item,
THEN A is more likely to have B's
opinion on another item than that of
a randomly chosen person

© 2021 KNIME AG. All rights reserved. 179


A-priori Algorithm: the Association Rule

IF + THEN

Antecedent Consequent

© 2021 KNIME AG. All rights reserved. 180


Building the Association Rule

N shopping baskets

{A, B, F, H}
Search for {A, B, C}
frequent itemsets {B, C, H}
{D, E , F}
{D, E}
{A, B}
{A, C}
{H, F}

© 2021 KNIME AG. All rights reserved. 181


From “Frequent Itemsets“ to “Rules“

{A, B, F} è H

{A, B, H} è F

{A, B, F, H}
{A, F, H} è B

Which rules shall I choose?


{B, F, H} è A

© 2021 KNIME AG. All rights reserved. 182


Support, Confidence, and Lift

For the rule {A, B, F} ⇒ H:

§ Item set support  s = freq(A, B, F, H) / N — how often these items are found together
§ Rule confidence  c = freq(A, B, F, H) / freq(A, B, F) — how often the antecedent occurs together with the consequent
§ Rule lift = support({A, B, F} ⇒ H) / (support({A, B, F}) × support(H)) — how often antecedent and consequent happen together compared with random chance

The rules with support, confidence and lift above a threshold → most reliable ones

© 2021 KNIME AG. All rights reserved. 183


Association Rule Mining (ARM): Two Phases

Discover all frequent and strong association rules
X ⇒ Y → “if X then Y”
with sufficient support and confidence

Two phases:
1. Find all frequent itemsets (FI) ← most of the complexity
§ Select itemsets with a minimum support (user parameter):
FI = { {X, Y}, X, Y ⊂ I | s(X, Y) ≥ S_min }
2. Build strong association rules
§ Select rules with a minimum confidence (user parameter):
Rules: X ⇒ Y, X, Y ⊂ FI, c(X ⇒ Y) ≥ C_min

© 2021 KNIME AG. All rights reserved. 184


A-Priori Algorithm: Example

§ Let‘s consider milk, diaper, and beer: 𝑚𝑖𝑙𝑘, 𝑑𝑖𝑎𝑝𝑒𝑟 ⇒ 𝑏𝑒𝑒𝑟

§ How often are they found together across all shopping baskets?
§ How often are they found together across all shopping baskets containing the
antecedents?
TID  Transactions
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

support:  s(milk, diaper, beer) = P(milk, diaper, beer) / |T| = 2/5 = 0.4
confidence:  c = P(milk, diaper, beer) / P(milk, diaper) = 2/3 = 0.67

© 2021 KNIME AG. All rights reserved. 185


A-priori algorithm: an example

§ Let‘s consider milk, diaper, and beer: {milk, diaper} ⇒ beer
§ How often are they found together across all shopping baskets?
§ How often are they found together across all shopping baskets containing the antecedents?

TID  Transactions
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

s(milk, diaper) = P(milk, diaper) / |T| = 3/5 = 0.6
s(beer) = P(beer) / |T| = 3/5 = 0.6

Rule lift = s(milk, diaper, beer) / (s(milk, diaper) × s(beer)) = 0.4 / (0.6 × 0.6) = 1.11

© 2021 KNIME AG. All rights reserved. 186
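A minimal Python sketch reproducing the numbers above for the rule {milk, diaper} ⇒ beer (the transaction list is the one from the slide; this is not the full A-priori search, only the three measures):

```python
# Minimal sketch: support, confidence and lift for one association rule.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "coke"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "coke"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"milk", "diaper"}, {"beer"}
s = support(antecedent | consequent)
c = s / support(antecedent)
lift = s / (support(antecedent) * support(consequent))
print(round(s, 2), round(c, 2), round(lift, 2))   # 0.4 0.67 1.11
```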


Association Rule Mining: Is it Useful?

§ David J. Hand (2004):


„Association Rule Mining is likely the field with the highest ratio of number of
published papers per reported application.“

§ KNIME Blog post:


https://www.knime.com/knime-applications/market-basket-analysis-and-recommendation-engines

© 2021 KNIME AG. All rights reserved. 187


Recommendation Engines or Market Basket Analysis
From the analysis of the reactions Recommendation
of many people to the same item ...

Collaborative Filtering

IF A has the same opinion as B on


an item,
THEN A is more likely to have B's
opinion on another item than that of
a randomly chosen person

© 2021 KNIME AG. All rights reserved. 188


Collaborative Filtering (CF)

Collaborative filtering systems have many forms, but many common systems can
be reduced to two steps:

1. Look for users who share the same rating patterns with the active user (the user whom the recommendation is for)
2. Use the ratings from those like-minded users found in step 1 to calculate a prediction for the active user

Implemented in Spark:

https://www.knime.com/blog/movie-recommendations-with-spark-collaborative-filtering

© 2021 KNIME AG. All rights reserved. 189


Collaborative Filtering: Memory based approach

§ User u to give recommendations to
§ U = set of top N users most similar to user u
§ Rating of user u on item i calculated as the average of the ratings of all similar users in U:

r_{u,i} = (1/N) ∑_{u′∈U} r_{u′,i}     or weighted     r_{u,i} = (1/κ) ∑_{u′∈U} simil(u, u′) · r_{u′,i}

Pearson correlation (over the set I_xy of items rated by both users x and y):

simil(u, u′) = ∑_{i∈I_xy} (r_{u,i} − r̄_u)(r_{u′,i} − r̄_{u′}) / ( √(∑_{i∈I_xy}(r_{u,i} − r̄_u)²) · √(∑_{i∈I_xy}(r_{u′,i} − r̄_{u′})²) )

© 2021 KNIME AG. All rights reserved. 190
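A minimal Python sketch of the memory-based approach above, with entirely hypothetical ratings (Pearson similarity on co-rated items, then a similarity-weighted rating prediction):

```python
# Minimal sketch: Pearson similarity between users and a weighted rating prediction.
import numpy as np

ratings = {                       # user -> {item: rating}
    "u": {"a": 5, "b": 3, "c": 4},
    "v": {"a": 4, "b": 2, "c": 5, "d": 4},
    "w": {"a": 2, "b": 5, "c": 1, "d": 2},
}

def pearson(r1, r2):
    common = sorted(set(r1) & set(r2))            # items rated by both users
    x = np.array([r1[i] for i in common], float)
    y = np.array([r2[i] for i in common], float)
    num = np.sum((x - x.mean()) * (y - y.mean()))
    den = np.sqrt(np.sum((x - x.mean())**2)) * np.sqrt(np.sum((y - y.mean())**2))
    return num / den if den else 0.0

def predict(user, item, neighbors):
    sims = [(pearson(ratings[user], ratings[n]), ratings[n][item]) for n in neighbors]
    norm = sum(abs(s) for s, _ in sims)
    return sum(s * r for s, r in sims) / norm if norm else 0.0

print(pearson(ratings["u"], ratings["v"]), pearson(ratings["u"], ratings["w"]))
print(predict("u", "d", ["v"]))   # prediction using only the most similar user
```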


Exercises:

§ Neural Network
§ Goal: Train an MLP to solve our
classification problem (rank: high/low)
§ 01_Simple_Neural_Network

§ Market Basket Analysis


§ 02_Build_Association_Rules_for_MarketBasketAnalysis
§ 03_Apply_Association_Rules_for_MarketBasketAnalysis

© 2021 KNIME AG. All rights reserved. 191


Session 4: Clustering & Data
Preparation
Unsupervised Learning:
Clustering
Goal of Cluster Analysis
Discover hidden structures in unlabeled data (unsupervised)

Clustering identifies a finite set of groups (clusters) C₁, C₂, ⋯, C_k in the dataset such that:
§ Objects within the same cluster Cᵢ shall be as similar as possible
§ Objects of different clusters Cᵢ, Cⱼ (i ≠ j) shall be as dissimilar as possible

© 2021 KNIME AG. All rights reserved. 194


Cluster Properties
§ Clusters may have different sizes, shapes, densities
§ Clusters may form a hierarchy
§ Clusters may be overlapping or disjoint

© 2021 KNIME AG. All rights reserved. 195


Clustering Applications

§ Find “natural” clusters and describe them → data understanding
§ Find useful and suitable groups → data class identification
§ Find representatives for homogenous groups → data reduction
§ Find unusual data objects → outlier detection
§ Find random perturbations of the data → noise detection

Methods: k-means, hierarchical, DBScan
Examples: customer segmentation, molecule search, anomaly detection

© 2021 KNIME AG. All rights reserved. 196


Clustering as Optimization Problem

Definition:
Given a data set D with |D| = n, determine a clustering C of D with
C = {C₁, C₂, ⋯, C_k},  where Cᵢ ⊆ D and ∪_{1≤i≤k} Cᵢ = D,
that best fits the given data set D.

Clustering methods (inside the space / cover the whole space):
1. partitioning
2. hierarchical (linkage based)
3. density-based
© 2021 KNIME AG. All rights reserved. 197


Clustering: Partitioning
k-Means
Partitioning
Goal:
A (disjoint) partitioning into k clusters with minimal costs
§ Local optimization method:
§ choose k initial cluster representatives
§ optimize these representatives iteratively
§ assign each object to its most similar cluster representative
§ Types of cluster representatives:
§ Mean of a cluster (construction of central points)
§ Median of a cluster (selection of representative points)
§ Probability density function of a cluster (expectation maximization)

© 2021 KNIME AG. All rights reserved. 199


k-Means-Algorithm

Given k, the k-Means algorithm is implemented in four steps:


1. Partition objects into 𝑘 non-empty subsets, calculate their centroids (i.e.,
mean point, of the cluster)
2. Assign each object to the cluster with the nearest centroid Euclidean distance
3. Compute the centroids from the current partition
4. Go back to Step 2, repeat until the updated centroids stop moving significantly

© 2021 KNIME AG. All rights reserved. 200
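KNIME provides this as the k-Means node; as a minimal NumPy sketch of the four steps (toy 2D data, Euclidean distance, and a simple stopping test on the centroids):

```python
# Minimal sketch of the k-means steps: initialize, assign, recompute, repeat.
import numpy as np

def kmeans(X, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]     # step 1: initial centroids
    for _ in range(iterations):
        # step 2: assign each object to the nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # step 3: recompute the centroids from the current partition
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                 # step 4: stop when stable
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 1], [1.5, 2], [1, 1.5], [8, 8], [8.5, 9], [9, 8]], dtype=float)
print(kmeans(X, k=2))
```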


k-Means Algorithm

[Figure: k-means iterations on a 2D data set — initial centroids, cluster assignment, calculation of new centroids, re-assignment, repeated until the centroids stop moving]

© 2021 KNIME AG. All rights reserved. 201


Comments of the k-Means Method

§ Advantages:
§ Relatively efficient
§ Simple implementation
§ Weaknesses:
§ Often terminates at a local optimum
§ Applicable only when mean is defined (what about categorical data?)
§ Need to specify k, the number of clusters, in advance
§ Unable to handle noisy data and outliers
§ Not suitable to discover clusters with non-convex shapes

© 2021 KNIME AG. All rights reserved. 202


Outliers: k-Means vs k-Medoids

Problem with K-Means


An object with an extremely large value can substantially distort the
distribution of the data.
One solution: K-Medoids
Instead of taking the mean value of the objects in a cluster as a reference
point, medoids can be used, which are the most centrally located objects
in a cluster.
[Figure: the same data clustered around a mean (distorted by an outlier) vs. around a medoid]

© 2021 KNIME AG. All rights reserved. 203


Clustering: Quality Measures
Silhouette
Optimal Clustering: Example

Within-Cluster Variation vs. Between-Cluster Variation

[Figure: the same data points with a bad clustering (overlapping clusters, centroids close together) and a good clustering (compact clusters, centroids far apart)]

© 2021 KNIME AG. All rights reserved. 205


Cluster Quality Measures

Centroid μ_C: mean vector of all objects in cluster C

§ Within-cluster variation:
TD² = ∑_{i=1}^{k} ∑_{p∈Cᵢ} dist(p, μ_{Cᵢ})²

§ Between-cluster variation:
BC² = ∑_{j=1}^{k} ∑_{i=1}^{k} dist(μ_{Cⱼ}, μ_{Cᵢ})²

§ Clustering quality (one possible measure):
CQ = BC² / TD²

© 2021 KNIME AG. All rights reserved. 206


Silhouette-Coefficient for object 𝑥

Silhouette-Coefficient [Kaufman & Rousseeuw 1990] measures the quality of


clustering

§ 𝑎(𝑥): distance of object 𝑥 to its cluster representative


§ 𝑏(𝑥): distance of object 𝑥 to the representative of the „second-best“ cluster
§ Silhouette 𝑠(𝑥) of 𝑥

𝑏 𝑥 − 𝑎(𝑥)
𝑠 𝑥 =
max{𝑎 𝑥 , 𝑏(𝑥)}

© 2021 KNIME AG. All rights reserved. 207


Silhouette-Coefficient

Good clustering…
Cluster 1
Cluster 2
𝑎(𝑥)

𝑏(𝑥)

𝑎 𝑥 ≪ 𝑏(𝑥)

𝑏(𝑥) − 𝑎(𝑥) 𝑏(𝑥)


𝑠(𝑥) = ≈ =1
max{ 𝑎(𝑥), 𝑏(𝑥)} 𝑏(𝑥)

© 2021 KNIME AG. All rights reserved. 208


Silhouette-Coefficient

…not so good…
Cluster 1 Cluster 2

𝑎(𝑥)

𝑏(𝑥)

𝑎(𝑥) ≈ 𝑏(𝑥)

𝑏(𝑥) − 𝑎(𝑥) 0
𝑠(𝑥) = ≈ =0
max{ 𝑎(𝑥), 𝑏(𝑥)} 𝑏(𝑥)

© 2021 KNIME AG. All rights reserved. 209


Silhouette-Coefficient

…bad clustering.
Cluster 1

𝑎(𝑥) Cluster 2

𝑏(𝑥)
𝑎(𝑥) ≫ 𝑏(𝑥)

𝑏(𝑥) − 𝑎(𝑥) −𝑎(𝑥)


𝑠(𝑥) = ≈ = −1
max{ 𝑎(𝑥), 𝑏(𝑥)} 𝑎(𝑥)

© 2021 KNIME AG. All rights reserved. 210


Silhouette-Coefficient for Clustering C

§ The silhouette coefficient s_C for a clustering C is the average silhouette over all objects x ∈ C:

s_C = (1/n) ∑_{x∈C} s(x)

§ Interpretation of silhouette coefficient:


§ 𝑠X > 0.7 : strong cluster structure,
§ 𝑠X > 0.5 : reasonable cluster structure,
§ ...

© 2021 KNIME AG. All rights reserved. 211
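A minimal Python sketch of the silhouette coefficient as defined above, using the cluster centroids as representatives for a(x) and b(x) (the toy data and labels are made up):

```python
# Minimal sketch: silhouette coefficient with cluster centroids as representatives.
import numpy as np

def silhouette(X, labels):
    X = np.asarray(X, float)
    clusters = np.unique(labels)
    centroids = {c: X[labels == c].mean(axis=0) for c in clusters}
    scores = []
    for x, c in zip(X, labels):
        a = np.linalg.norm(x - centroids[c])                                     # own cluster
        b = min(np.linalg.norm(x - centroids[o]) for o in clusters if o != c)    # second best
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

X = np.array([[1, 1], [1.5, 2], [1, 1.5], [8, 8], [8.5, 9], [9, 8]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(silhouette(X, labels))       # close to 1 for a strong cluster structure
```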


Choice of Parameter 𝑘

Method
§ For 𝑘=2, 3, ⋯, 𝑛−1, determine one clustering each
§ Choose 𝑘 resulting in the highest clustering quality

Measure of clustering quality


§ Uncorrelated with 𝑘
§ for k-means and k-medoid:
𝑇𝐷 % and 𝑇𝐷 decrease monotonically with increasing 𝑘

© 2021 KNIME AG. All rights reserved. 212


Summary: Clustering by Partitioning

§ Scheme always similar:


§ Find (random) starting clusters
§ Iteratively improve cluster positions
(compute new mean, swap medoids, compute new distribution parameters,…)
§ Important:
§ Number of clusters k
§ Initial cluster position influences (heavily):
§ quality of results
§ speed of convergence
§ Problems for iterative clustering methods:
§ Clusters of varied size, density and shape

© 2021 KNIME AG. All rights reserved. 213


Clustering: Distance Functions
Distance Functions for Numeric Attributes
For two objects x = (x₁, x₂, ⋯, xₙ) and y = (y₁, y₂, ⋯, yₙ):

§ Lp-Metric (Minkowski-Distance)

  dist(x, y) = ( Σ_{i=1..n} |xᵢ − yᵢ|^p )^(1/p)

§ Euclidean Distance (p = 2)

  dist(x, y) = √( Σ_{i=1..n} (xᵢ − yᵢ)² )

§ Manhattan-Distance (p = 1)

  dist(x, y) = Σ_{i=1..n} |xᵢ − yᵢ|

§ Maximum-Distance (p = ∞)

  dist(x, y) = max_{1≤i≤n} |xᵢ − yᵢ|
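The same four distances written out in NumPy (a sketch; the two example vectors are made up, and SciPy's scipy.spatial.distance module offers equivalent functions):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

def minkowski(x, y, p):
    """Lp metric."""
    return (np.abs(x - y) ** p).sum() ** (1.0 / p)

print(minkowski(x, y, 2))    # Euclidean distance (p = 2)
print(minkowski(x, y, 1))    # Manhattan distance (p = 1)
print(np.abs(x - y).max())   # maximum distance (p = infinity)
```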

© 2021 KNIME AG. All rights reserved. 215


Influence of Distance Function / Similarity

§ Clustering vehicles: The distance function


§ red Ferrari affects the shape of the
§ green Porsche clusters
§ red Bobby car

§ Distance Function based on maximum speed


(numeric distance function):
§ Cluster 1: Ferrari & Porsche
§ Cluster 2: Bobby car

§ Distance Function based on color


(nominal attributes):
§ Cluster 1: Ferrari and Bobby car
§ Cluster 2: Porsche

© 2021 KNIME AG. All rights reserved. 216


Clustering: Linkage
Hierarchical Clustering
Linkage Hierarchies: Basics

Goal
§ Construction of a hierarchy of clusters (dendrogram)
by merging/separating clusters with minimum/maximum distance

Dendrogram:
§ A tree representing hierarchy of clusters,
with the following properties:

§ Root: single cluster with the whole data set.
§ Leaves: clusters containing a single object.
§ Branches: merges / separations between larger
clusters and smaller clusters / objects

© 2021 KNIME AG. All rights reserved. 218


Linkage Hierarchies: Basics

§ Example dendrogram

(Illustration: a small data set and its dendrogram; the height at which two clusters are merged corresponds to the distance between the clusters.)

§ Types of hierarchical methods


§ Bottom-up construction of dendrogram (agglomerative)
§ Top-down construction of dendrogram (divisive)

© 2021 KNIME AG. All rights reserved. 219


Agglomerative vs. Divisive Hierarchical Clustering

AGlomerative NESting (AGNES): bottom-up, one merge per step. For example, A and B are merged into {A,B}, D and E into {D,E}, then {C,D,E}, and finally {A,B,C,D,E}.

DIvisive ANAlysis (DIANA): top-down. The same hierarchy is built by splitting {A,B,C,D,E} in the reverse order.

© 2021 KNIME AG. All rights reserved. 220


Base Algorithm

1. Form initial clusters consisting of a single object, and compute the distance
between each pair of clusters.
2. Merge the two clusters having minimum distance.
3. Calculate the distance between the new cluster and all other clusters.
4. If there is only one cluster containing all objects:
Stop, otherwise go to step 2.
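The same bottom-up procedure is available in SciPy; a minimal sketch, assuming a small toy data set and a horizontal cut into two clusters (the method argument can be "single", "complete", or "average"):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])

Z = linkage(X, method="single")                   # each row of Z is one merge step with its merge distance
labels = fcluster(Z, t=2, criterion="maxclust")   # horizontal cut yielding 2 clusters
print(labels)                                     # e.g. [1 1 1 2 2 2]
```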

© 2021 KNIME AG. All rights reserved. 221


Single Linkage

§ Distance between clusters (nodes):

  Dist(Cᵢ, Cⱼ) = min_{p ∈ Cᵢ, q ∈ Cⱼ} dist(p, q)

Distance of the closest two points, one from each cluster

§ Merge Step: Union of two subsets of data points



© 2021 KNIME AG. All rights reserved. 222


Complete Linkage

§ Distance between clusters (nodes):

  Dist(Cᵢ, Cⱼ) = max_{p ∈ Cᵢ, q ∈ Cⱼ} dist(p, q)

Distance of the farthest two points, one from each cluster

§ Merge Step: Union of two subsets of data points



© 2021 KNIME AG. All rights reserved. 223


Average Linkage / Centroid Method

§ Distance between clusters (nodes):

  Dist_avg(C₁, C₂) = 1 / (|C₁| ⋅ |C₂|) ⋅ Σ_{p ∈ C₁} Σ_{q ∈ C₂} dist(p, q)

Average distance of all possible pairs of points between C₁ and C₂

  Dist_mean(C₁, C₂) = dist( mean(C₁), mean(C₂) )

Distance between two centroids

§ Merge Step:
§ union of two subsets of data points
§ construct the mean point of the two clusters

© 2021 KNIME AG. All rights reserved. 224


Comments on Single Linkage and Variants

+ Finds not only a „flat“ clustering, but a hierarchy of clusters


(dendrogram)
+ A single clustering can be obtained from the dendrogram
(e.g., by performing a horizontal cut)

- Decisions (merges/splits) cannot be undone

- Sensitive to noise (Single-Link)
  (a "line" of objects can connect two clusters)
- Inefficient
  → runtime complexity at least O(n²) for n objects

© 2021 KNIME AG. All rights reserved. 225


Linkage Based Clustering

§ Single Linkage:
§ Prefers well-separated clusters
§ Complete Linkage:
§ Prefers small, compact clusters
§ Average Linkage:
§ Prefers small, well-separated clusters…

© 2021 KNIME AG. All rights reserved. 226


Clustering: Density
DBSCAN
Clustering: DBSCAN

DBSCAN - a density-based clustering algorithm - defines five types of points in a


dataset.
§ Core Points are points that have at least a minimum number of neighbors
(MinPts) within a specified distance (ε).
§ Border Points are points that are within ε of a core point, but have fewer than
MinPts neighbors.
§ Noise Points are neither core points nor border points.
§ Directly Density Reachable Points are within ε of a core point.
§ Density Reachable Points are reachable via a chain of Directly Density
Reachable points.

Clusters are built by joining core and density-reachable points to one another.
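A small scikit-learn sketch of these definitions, where eps plays the role of ε and min_samples the role of MinPts (toy data assumed; note that scikit-learn counts a point itself among its neighbors):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1.2, 1.1], [0.9, 1.3],
              [5, 5], [5.1, 5.2], [4.9, 5.1],
              [9, 9]])                         # the last point is isolated

db = DBSCAN(eps=0.5, min_samples=3).fit(X)     # eps = ε, min_samples = MinPts
print(db.labels_)                  # cluster ids; -1 marks noise points (here the point at [9, 9])
print(db.core_sample_indices_)     # indices of the core points
```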

© 2021 KNIME AG. All rights reserved. 228


Example with MinPts = 3

Core Point vs. Border Point vs. Noise Point
§ t = core point
§ s = border point
§ n = noise point

Directly Density Reachable vs. Density Reachable
§ z is directly density reachable from t
§ s is not directly density reachable from t, but density reachable via z

Note: But t is not density reachable from s, because s is not a Core point

© 2021 KNIME AG. All rights reserved. 229


DBSCAN [Density Based Spatial Clustering of Applications with Noise]

§ For each point, DBSCAN determines the ε-neighborhood and checks whether it
contains at least MinPts data points → core point
§ Iteratively grows the cluster by adding density-reachable points

© 2021 KNIME AG. All rights reserved. 230


Summary: DBSCAN

Clustering:
§ A density-based clustering C of a dataset D w.r.t. ε and MinPts is the set of all
density-based clusters Cᵢ w.r.t. ε and MinPts in D.
§ The set Noise_C ("noise") is defined as the set of all objects in D which do not
belong to any of the clusters.
Property:
§ Let Cᵢ be a density-based cluster and p ∈ Cᵢ be a core object. Then:

  Cᵢ = { o ∈ D | o is density-reachable from p w.r.t. ε and MinPts }

© 2021 KNIME AG. All rights reserved. 231


DBSCAN [Density Based Spatial Clustering of Applications with Noise]

§ DBSCAN uses (spatial) index structures for determining the ε-neighborhood:
→ computational complexity O(n log n) instead of O(n²)
§ Arbitrary shape clusters found by DBSCAN
§ Parameters: 𝜀 and 𝑀𝑖𝑛𝑃𝑡𝑠

© 2021 KNIME AG. All rights reserved. 232


Data Preparation
Motivation
§ Real world data is "dirty"
→ contains missing values, noise, outliers, inconsistencies

§ Comes from different information sources
→ different attribute names, values expressed differently, related tuples

§ Different value ranges and hierarchies
→ one attribute range may overpower another

§ Huge amount of data
→ makes analysis difficult and time consuming

© 2021 KNIME AG. All rights reserved. 234


Data Preparation

§ Data Cleaning & Standardization (domain dependent)


§ Aggregations (often domain dependent)
§ Normalization
§ Dimensionality Reduction
§ Outlier Detection
§ Missing Value Imputation
§ Feature Selection
§ Feature Engineering
§ Sampling
§ Integration of multiple Data Sources

© 2021 KNIME AG. All rights reserved. 236


Data Preparation: Normalization
Normalization: Motivation

Example:
§ Lengths in cm (100 – 200) and weights in kilogram (30 – 150) fall both in
approximately the same scale
§ What about lengths in m (1-2) and weights in gram (30,000 – 150,000)?
→ The weight values in gram dominate over the length values for the similarity of
records!

Goal of normalization:
§ Transformation of attributes to make record ranges comparable

© 2021 KNIME AG. All rights reserved. 238


Normalization: Techniques

§ min-max normalization

  y = (x − x_min) / (x_max − x_min) ⋅ (y_max − y_min) + y_min

§ z-score normalization

  y = (x − mean(x)) / stddev(x)

§ normalization by decimal scaling

  y = x / 10^j   where j is the smallest integer such that max(|y|) < 1

Here [y_min, y_max] is [0, 1].
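The first two techniques written out in pandas (column names and values are made up; in KNIME the Normalizer node covers these options):

```python
import pandas as pd

df = pd.DataFrame({"length_m": [1.0, 1.5, 2.0],
                   "weight_g": [30000, 90000, 150000]})

minmax = (df - df.min()) / (df.max() - df.min())   # min-max normalization to [0, 1]
zscore = (df - df.mean()) / df.std()               # z-score normalization

print(minmax)
print(zscore)
```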

© 2021 KNIME AG. All rights reserved. 239


PMML

§ Predictive Model Mark-up Language (PMML): a standard XML-based
interchange format for predictive models.
§ Interchange. PMML provides a way to describe and exchange predictive
models produced by machine learning algorithms
§ Standard. In theory, a PMML model exported from KNIME can be read by
PMML compatible functions in other tools
§ It does not work that well for modern / ensemble algorithms, such as random
forest or deep learning. In these cases, other formats have been explored.

© 2021 KNIME AG. All rights reserved. 240


Data Preparation: Missing Value
Imputation
Missing Value Imputation: Motivation

Data is not always available


§ E.g., many tuples have no recorded value for several attributes, such as weight
in a people database
Missing data may be due to
§ Equipment malfunctioning
§ Inconsistency with other recorded data and thus deleted
§ Data not entered (manually)
§ Data not considered important at the time of collection
§ Data format / contents of database changes

© 2021 KNIME AG. All rights reserved. 242


Missing Values: Types

Types of missing values:


Example: Suppose you are modeling weight Y as a function of sex X

§ Missing Completely At Random (MCAR): reason does not depend on its value
or lack of value.
There may be no particular reason why some people told you their weights and others
didn’t.

§ Missing At Random (MAR): the probability that Y is missing depends only on


the value of X.
One sex X may be less likely to disclose its weight Y.
§ Not Missing At Random (NMAR): the probability that Y is missing depends on
the unobserved value of Y itself.
Heavy (or light) people may be less likely to disclose their weight.

© 2021 KNIME AG. All rights reserved. 243


Missing Values Imputation

How to handle missing values?


§ Ignore the record
§ Remove the record
§ Fill in missing value as:
§ Fixed value: e.g., “unknown”, -9999, etc.
§ Attribute mean / median / max. / min.
§ Attribute most frequent value
§ Next / previous /avg interpolation / moving avg value (in time series)
§ A predicted value based on the other attributes (inference-based such as Bayesian, Decision Tree,
...)
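A few of the strategies above in pandas (toy data; in KNIME the Missing Value node covers comparable options):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"weight": [70.0, np.nan, 82.0, np.nan, 65.0],
                   "sex":    ["m", "f", None, "f", "m"]})

df["weight_mean"] = df["weight"].fillna(df["weight"].mean())   # attribute mean
df["weight_prev"] = df["weight"].ffill()                       # previous value (time series)
df["weight_lin"]  = df["weight"].interpolate()                 # linear interpolation
df["sex_fixed"]   = df["sex"].fillna("unknown")                # fixed value
df_dropped = df.dropna(subset=["weight"])                      # remove the record
print(df)
```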

© 2021 KNIME AG. All rights reserved. 244


Data Preparation:
Outlier Detection
Outlier Detection

§ An outlier could be, for example, rare behavior, system defect, measurement
error, or reaction to an unexpected event

© 2021 KNIME AG. All rights reserved. 246


Outlier Detection: Motivation

§ Why is finding outliers important?


§ Summarize data by statistics that represent the majority of the data
§ Train a model that generalizes to new data
§ Finding the outliers can also be the focus of the analysis and not only data cleaning

© 2021 KNIME AG. All rights reserved. 247


Outlier Detection Techniques

§ Knowledge-based
§ Statistics-based
§ Distance from the median
§ Position in the distribution tails
§ Distance to the closest cluster center
§ Error produced by an autoencoder
§ Number of random splits to isolate a data point
from other data
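Two of the statistics-based ideas above as a quick sketch (the series and the thresholds are made up):

```python
import numpy as np
import pandas as pd

x = pd.Series([10, 11, 9, 12, 10, 11, 95])   # the value 95 looks suspicious

# distance from the median / quartiles, scaled by the interquartile range
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
iqr_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

# position in the distribution tails via the z-score
z = (x - x.mean()) / x.std()
z_outliers = x[np.abs(z) > 2]

print(iqr_outliers.values, z_outliers.values)   # both flag 95
```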

© 2021 KNIME AG. All rights reserved. 248


Material

https://www.knime.com/blog/four-techniques-for-outlier-detection

© 2021 KNIME AG. All rights reserved. 249


Data Preparation:
Dimensionality Reduction
Is there such a thing as “too much data”?

“Too much data”:


§ Consumes storage space
§ Eats up processing time
§ Is difficult to visualize
§ Inhibits ML algorithm performance
§ Beware of the model: Garbage in à Garbage out

© 2021 KNIME AG. All rights reserved. 251


Dimensionality Reduction Techniques

§ Measure based
§ Ratio of missing values
§ Low variance
§ High Correlation
§ Transformation based
§ Principal Component Analysis (PCA)
§ Linear Discriminant Analysis (LDA)
§ t-SNE
§ Machine Learning based
§ Random Forest of shallow trees
§ Neural auto-encoder

© 2021 KNIME AG. All rights reserved. 252


Missing Values Ratio

IF (% missing value > threshold ) THEN remove column

© 2021 KNIME AG. All rights reserved. 253


Low Variance

Note: requires min-max normalization, and only works for numeric columns

§ If column has constant value (variance = 0), it contains no useful information


§ In general: IF (variance < threshold ) THEN remove column

© 2021 KNIME AG. All rights reserved. 254


High Correlation

§ Two highly correlated input variables probably carry similar information


§ IF ( corr(var1, var2) > threshold ) => remove var1

Note: requires min-max-normalization of numeric columns
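A pandas sketch combining the three measure-based filters from the preceding slides (the thresholds and the toy table are assumptions):

```python
import numpy as np
import pandas as pd

def filter_columns(df, max_missing=0.4, min_variance=0.01, max_corr=0.95):
    df = df.loc[:, df.isna().mean() <= max_missing]        # missing values ratio
    value_range = df.max() - df.min()
    norm = (df - df.min()) / value_range.replace(0, 1)     # min-max normalization
    df = df.loc[:, norm.var() >= min_variance]             # low variance
    corr = df.corr().abs()                                 # high correlation: remove var1
    to_drop = {c1 for i, c1 in enumerate(corr.columns)
               for c2 in corr.columns[i + 1:] if corr.loc[c1, c2] > max_corr}
    return df.drop(columns=to_drop)

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                   "b": [2.0, 4.0, 6.0, 8.0],             # perfectly correlated with "a"
                   "c": [5.0, 5.0, 5.0, 5.0],             # constant, zero variance
                   "d": [1.0, np.nan, np.nan, np.nan]})   # 75 % missing
print(filter_columns(df).columns.tolist())                # only "b" survives
```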

© 2021 KNIME AG. All rights reserved. 255


Principal Component Analysis (PCA)
§ PCA is a statistical procedure that orthogonally transforms the
original n coordinates of a data set into a new set of n coordinates,
called principal components.
  PC₁, PC₂, ⋯, PCₙ = PCA(X₁, X₂, ⋯, Xₙ)

§ The first principal component PC₁ follows the direction (eigenvector)
of the largest possible variance (largest eigenvalue of the
covariance matrix) in the data.
§ Each succeeding component PCₖ follows the direction of the next
largest possible variance under the constraint that it is orthogonal
to (i.e., uncorrelated with) the preceding components PC₁, PC₂, ⋯, PCₖ₋₁.

If you're still curious, there's LOTS of different ways to think about PCA:
https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
© 2021 KNIME AG. All rights reserved. 256
Principal Component Analysis (PCA)

§ PC₁ describes most of the variability in the data, PC₂ adds the next big
contribution, and so on. In the end, the last PCs do not bring much more
information to describe the data.

§ Thus, to describe the data we could use only the top m < n components (i.e.,
PC₁, PC₂, ⋯, PCₘ) with little - if any - loss of information

Dimensionality Reduction
§ Caveats:
§ Results of PCA are quite difficult to interpret
§ Normalization required
§ Only effective on numeric columns

© 2021 KNIME AG. All rights reserved. 257


Linear Discriminant Analysis (LDA)

§ LDA is a statistical procedure that orthogonally transforms the original n


coordinates of a data set into a new set of n coordinates, called linear
discriminants.
  LD₁, LD₂, ⋯, LDₙ = LDA(X₁, X₂, ⋯, Xₙ)
§ Here, however, discriminants (components)
maximize the separation between classes

§ PCA : unsupervised
§ LDA : supervised

© 2021 KNIME AG. All rights reserved. 258


Linear Discriminant Analysis (LDA)

§ LD₁ describes best the class separation in the data, LD₂ adds the next big
contribution, and so on. In the end, the last LDs do not bring much more
information to separate the classes.

§ Thus, for our classification problem we could use only the top m < n
discriminants (i.e., LD₁, LD₂, ⋯, LDₘ) with little - if any - loss of information

Dimensionality Reduction

§ Caveats:


§ Results of LDA are quite difficult to interpret
§ Normalization required
§ Only effective on numeric columns
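The supervised counterpart in scikit-learn (again on Iris; note that scikit-learn allows at most n_classes − 1 discriminants):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_norm = StandardScaler().fit_transform(X)

lda = LinearDiscriminantAnalysis(n_components=2)   # at most n_classes - 1 = 2 for Iris
X_reduced = lda.fit_transform(X_norm, y)           # supervised: class labels y are required

print(X_reduced.shape)                             # (150, 2)
```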

© 2021 KNIME AG. All rights reserved. 259


Ensembles of Shallow Decision Trees

§ Often used for classification, but can be used for


feature selection too

§ Generate a large number (we used 2000) of trees


that are very shallow (2 levels, 3 sampled features)

§ Calculate the statistics of candidates and selected


features. The more often a feature is selected in
such trees, the more likely it contains predictive
information

§ Compare the same statistics with a forest of trees


trained on a random dataset.
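A simplified scikit-learn sketch of this idea. Instead of counting how often each feature is selected, it compares the feature importances of a shallow forest on the real target with the same forest on a shuffled (random) target, which is a related but not identical heuristic; data and parameters are assumptions:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

# many very shallow trees, each split choosing among only a few candidate features
forest = RandomForestClassifier(n_estimators=2000, max_depth=2, max_features=3, random_state=0)
real_importance = forest.fit(X, y).feature_importances_

# the same forest on a shuffled target acts as the "random dataset" baseline
baseline_importance = forest.fit(X, rng.permutation(y)).feature_importances_

informative = np.where(real_importance > baseline_importance.max())[0]
print(informative)   # indices of features that clearly beat the random baseline
```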

© 2021 KNIME AG. All rights reserved. 260


Autoencoder

§ Feed-forward neural network architecture with encoder / decoder structure.
The network is trained to reproduce the input vector onto the output layer.

§ That is, it compresses the input vector (dimension n) into a smaller vector space
on layer “code” (dimension m<n) and then it reconstructs the original vector onto
the output layer.

§ If the network was trained well, the reconstruction operation happens with
minimal loss of information.
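A minimal Keras sketch of such an encoder / decoder network (layer sizes, epochs, and the random toy data are arbitrary choices; KNIME exposes the same idea through its deep learning integrations):

```python
import numpy as np
from tensorflow import keras

n, m = 20, 3                                    # input dimension n, "code" dimension m < n
X = np.random.rand(1000, n).astype("float32")   # toy data

inputs  = keras.Input(shape=(n,))
code    = keras.layers.Dense(m, activation="relu", name="code")(inputs)   # encoder
outputs = keras.layers.Dense(n, activation="linear")(code)                # decoder

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)   # input vector is also the target

encoder = keras.Model(inputs, code)         # reduced representation: n -> m dimensions
X_reduced = encoder.predict(X, verbose=0)
print(X_reduced.shape)                      # (1000, 3)
```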

© 2021 KNIME AG. All rights reserved. 261


Material

https://thenewstack.io/3-new-techniques-for-data-dimensionality-reduction-in-machine-learning/

© 2021 KNIME AG. All rights reserved. 262


Data Preparation:
Feature Selection
Feature Selection vs. Dimensionality Reduction

§ Both methods are used for reducing the number of features in a dataset.
However:
§ Feature selection is simply selecting and excluding given features without
changing them.
§ Dimensionality reduction might transform the features into a lower dimension.
§ Feature selection is often a somewhat more aggressive and more
computationally expensive process.
§ Backward Feature Elimination
§ Forward Feature Construction

© 2021 KNIME AG. All rights reserved. 264


Backward Feature Elimination (greedy top-down)

1. First train one model on n input features


2. Then train n separate models each on 𝑛 − 1 input features and remove the
feature whose removal produced the least disturbance
3. Then train 𝑛 − 1 separate models each on 𝑛 − 2 input features and remove
the feature whose removal produced the least disturbance
4. And so on. Continue until desired maximum error rate on training data is
reached.

© 2021 KNIME AG. All rights reserved. 265


Backward Feature Elimination

© 2021 KNIME AG. All rights reserved. 266


Forward Feature Construction (greedy bottom-up)

1. First, train n separate models on one single input feature and keep the feature
that produces the best accuracy.
2. Then, train 𝑛 − 1 separate models on 2 input features, the selected one and
one more. At the end keep the additional feature that produces the best
accuracy.
3. And so on … Continue until an acceptable error rate is reached.
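Both greedy searches exist in scikit-learn as SequentialFeatureSelector. It scores candidate subsets by cross-validation rather than by training error, but the add / remove logic is the same greedy idea (the data set, the base model, and the target of 5 features are assumptions; the backward search can be slow):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

forward  = SequentialFeatureSelector(model, n_features_to_select=5, direction="forward").fit(X, y)
backward = SequentialFeatureSelector(model, n_features_to_select=5, direction="backward").fit(X, y)

print(forward.get_support(indices=True))    # features kept by forward feature construction
print(backward.get_support(indices=True))   # features kept by backward feature elimination
```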

© 2021 KNIME AG. All rights reserved. 267


Material

https://thenewstack.io/3-new-techniques-for-data-dimensionality-reduction-in-machine-learning/

© 2021 KNIME AG. All rights reserved. 268


Data Preparation:
Feature Engineering
Feature Engineering: Motivation

Sometimes transforming the original data allows for better discrimination


by ML algorithms.

© 2021 KNIME AG. All rights reserved. 270


Feature Engineering: Techniques

§ Coordinate Transformations
Remember PCA and LDA?
Polar coordinates , …

§ Distances to cluster centres, after data clustering


§ Simple math transformations on single columns
(eˣ, x², x³, tanh(x), log(x), …)
§ Combining together multiple columns in math functions
(f(x₁, x₂, … xₙ), x₂ − x₁, …)
§ The whole process is domain dependent

© 2021 KNIME AG. All rights reserved. 271


Feature Engineering in Time Series Analysis

§ Second order differences: 𝑦 = 𝑥(𝑡) – 𝑥(𝑡 − 1) & 𝑦‘(𝑡) = 𝑦(𝑡) – 𝑦(𝑡 − 1)


§ Logarithm: log(𝑦‘(𝑡))
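In pandas, these transformations are one-liners (the series is made up; the logarithm requires positive differences):

```python
import numpy as np
import pandas as pd

x = pd.Series([100, 110, 125, 150, 190])

y  = x.diff()        # y(t)  = x(t) - x(t-1)
y2 = y.diff()        # y'(t) = y(t) - y(t-1), second order differences
log_y2 = np.log(y2)  # log(y'(t))

print(pd.DataFrame({"x": x, "y": y, "y2": y2, "log_y2": log_y2}))
```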

© 2021 KNIME AG. All rights reserved. 272


Confirmation of Attendance and Survey

§ If you would like to get a “Confirmation of


Attendance” please click on the link below*

Confirmation of Attendance and Survey

§ The link also takes you to our course


feedback survey. Filling it in is optional but
highly appreciated!

Thank you!

*Please send your request within the next 3 days

© 2020 KNIME AG. All rights reserved. 273


Exercises

§ Clustering
§ Goal: Cluster location data from California
§ 01_Clustering
§ Data Preparation
§ 02_Missing_Value_Handling
§ 03_Outlier_Detection
§ 04_Dimensionality_Reduction
§ 05_Feature_Selection

© 2021 KNIME AG. All rights reserved. 274


Machine Learning Cheat Sheet

https://www.knime.com/sites/default/files/110519_KNIME_Machine_Learning_Cheat%20Sheet.pdf

© 2021 KNIME AG. All rights reserved. 275


Thank You!

277
