[L4-ML] Introduction to Machine Learning Algorithms (2024-01)
These slides are a derivative of KNIME Course Material of KNIME AG used under CC BY 4.0.
What is Data Science? Some Clarity about Words

Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. [Wikipedia quoting Dhar 13, Leek 13]

Knowledge discovery in databases (KDD) is the process of (semi-)automatic extraction of knowledge from databases which is valid, previously unknown, and potentially useful. [Fayyad, Piatetsky-Shapiro & Smyth 96]

§ (semi-)automatic: no manual analysis, though some user interaction required
§ valid: in the statistical sense
§ previously unknown: not explicit, no "common sense knowledge"
§ potentially useful: for a given application
§ structured data: numbers
§ unstructured data: everything else (images, texts, networks, chem. compounds, …)

[Figure: data science / data mining combines machine learning and data preparation on structured and unstructured data.]

[L4-ML] Introduction to Machine Learning Algorithms – Session 1

Session 1 – Introduction & Decision Tree Algorithm

In this session, the following subjects will be examined:
§ Use Case Collection
§ Basic Concepts in Data Science
§ Decision Tree Algorithm
§ Evaluation of Classification Models

Use Case Collection

Churn Prediction
§ CRM system – data about your customers: demographics, behavior, revenues, …
§ A model trained on these data supports use cases such as:
  § Churn prediction
  § Upselling likelihood
  § Product propensity / NBO
  § Campaign management
  § Customer segmentation

Customer Segmentation
§ CRM system – data about your customers (demographics, behavior, revenues, …) feed a model that assigns each customer to a segment.

Risk Assessment
§ Customer history → model → risk prognosis
[Figure: customer history timelines (events from 2015 to 2019) feed a model that assigns each customer a risk level: very low, low, medium, high, or very high risk.]

Demand Prediction
§ How many taxis do I need in NYC on Wednesday at noon?

Recommendation Engines / Market Basket Analysis
§ IF {items already in the basket} THEN {recommendation}

Fraud Detection
§ Transactions (Trx 1, Trx 2, Trx 3, …) → model → suspicious transactions

Sentiment Analysis

Anomaly Detection / Predictive Maintenance
§ Predicting mechanical failure as late as possible, but before it happens
§ Only some spectral time series show the breakdown
[Figure: spectral time series A1-SV3 [0, 100] Hz and A1-SV3 [500, 600] Hz, from 31 August 2007 to the breaking point on July 21, 2008; the training set and the deployment via REST are marked.]

Basic Concepts in Data Science

What is a Learning Algorithm?
§ X = (x₁, x₂, …, xₙ): input features / input attributes / independent variables
§ y: class / label / target / output feature or attribute / dependent variable
§ Model and model parameters: y = f(β, X) with β = [β₁, β₂, …, βₘ]
§ A learning algorithm adjusts (learns) the model parameters β throughout a number of iterations to maximize/minimize a likelihood/error function on y.

Algorithm Training / Learning
§ The model learns / is trained during the learning / training phase to produce the right answer y (a.k.a. the label)
§ That is why it is called machine learning
§ Many different algorithms exist for three ways of learning:
  § Supervised
  § Unsupervised
  § Semi-supervised

Supervised Learning
§ X = (x₁, x₂) and y = {yellow, gray}
§ A training set with many examples of (X, y)
§ The model learns on the examples of the training set to produce the right value of y for an input vector X
§ Example inputs X: age, money, temperature, speed, number of taxis, …; example tasks: sunny vs. cloudy, healthy vs. sick, churn vs. remain, increase vs. decrease

Supervised Learning: Classification vs. Regression
§ X = (x₁, x₂) and y = {label 1, …, label n} or y ∈ ℝ
§ A training set with many examples of (X, y)
§ The model learns on the examples of the training set to produce the right value of y for an input vector X

Classification                              Numerical predictions (regression)
y = {yellow, gray}                          y = temperature
y = {churn, no churn}                       y = number of visitors
y = {increase, unchanged, decrease}         y = number of kW
y = {blonde, gray, brown, red, black}       y = price
y = {job 1, job 2, …, job n}                y = number of hours

Process Overview for Supervised Learning
§ Partition data → train and apply the model → evaluate performance
§ The original data set is partitioned into a training set and a test set:
  § Training set → Train Model
  § Test set → Apply Model → Score Model

Unsupervised Learning
§ X = (x₁, x₂), no target labels y
§ A training set with many examples of X (age, money, temperature, speed, number of taxis, …)
§ The model learns to group the examples X of the training set based on similarity (closeness) or on a probability model
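The partition / train / apply / score sequence above can be sketched in a few lines of code. This is only an illustration with scikit-learn and a toy dataset; in the course itself the same steps are carried out with KNIME nodes.

```python
# Minimal sketch of the supervised-learning process: partition the data,
# train a model on the training set, apply it to the test set, score it.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Partition data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # Train Model
y_pred = model.predict(X_test)                           # Apply Model
print("accuracy:", accuracy_score(y_test, y_pred))       # Score Model
```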

Semi-Supervised Learning
§ X = (x₁, x₂) and y = {yellow, gray}
§ A training set with many examples of X and only some labeled samples (X, y)
§ The model labels the remaining data in the training set using a modified unsupervised learning procedure

The CRISP-DM Cycle
[Figure: the CRISP-DM cycle – business understanding, data understanding, data preparation, modeling, evaluation, deployment.]
https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining

The Data Science Life Cycle / KNIME Software for the Entire Data Science Life Cycle
[Figure: the data science life cycle, from model creation to the production process – Blend & Transform, Model & Visualize, Optimize & Capture, Validate & Deploy, Consume & Interact, Monitor & Update.]

Decision Tree Algorithm

Goal: A Decision Tree

Outlook    Wind   Temp   Storage (Winter)   Sailing
sunny      3      30     no                 yes
sunny      3      25     no                 no
rain       12     15     no                 yes
overcast   15     2      yes                no
rain       16     25     no                 yes
sunny      14     18     no                 yes
rain       3      5      yes                no
sunny      9      20     no                 yes
overcast   14     5      yes                no
sunny      1      7      yes                no
rain       4      25     no                 no
rain       14     24     no                 yes
sunny      11     20     no                 yes
sunny      2      18     no                 no
overcast   8      22     no                 yes
overcast   13     24     no                 yes

How can we Train a Decision Tree with KNIME Analytics Platform
[Screenshot: training a decision tree in KNIME Analytics Platform.]

Goal: A Decision Tree

Outlook    Wind   Temp   Storage   Sailing
sunny      3      30     yes       yes
sunny      3      25     yes       no
rain       12     15     yes       yes
overcast   15     2      no        no
rain       16     25     yes       yes
sunny      14     18     yes       yes
rain       3      5      no        no
sunny      9      20     yes       yes
overcast   14     5      no        no
sunny      1      7      no        no
rain       4      25     yes       no
rain       14     24     yes       yes
sunny      11     20     yes       yes
sunny      2      18     yes       no
overcast   8      22     yes       yes
overcast   13     24     yes       yes

[Figure: two candidate first splits of this data set, Option 1 and Option 2.]
How can we measure which is the best feature for a split?

Possible Split Criterion: Gain Ratio
§ Based on entropy = a measure for information / uncertainty
  Entropy(p) = − Σᵢ₌₁ⁿ pᵢ · log₂(pᵢ)  for p ∈ ℚⁿ
§ Example before the split (13 samples, 7 vs. 6):
  p₁ = 7/13, p₂ = 6/13
  Entropy(p) = −( 7/13 · log₂(7/13) + 6/13 · log₂(6/13) ) = 0.995
§ Example of a pure node:
  p₁ = 13/13 = 1, p₂ = 0/13 = 0
  Entropy(p) = −( 13/13 · log₂(13/13) + 0/13 · log₂(0/13) ) = 0

Possible Split Criterion: Gain Ratio
§ Split criterion: Gain = Entropy_before − Entropy_after
§ For the example split: Gain = Entropy_before − (6/13)·Entropy₁ − (7/13)·Entropy₂
  with Entropy₁ = Entropy(5/6, 1/6), Entropy₂ = Entropy(2/7, 5/7), w₁ = 6/13, w₂ = 7/13
§ Next splitting feature: the feature with the highest Gain
§ Problem: Gain favors features with many different values
§ Solution: Gain Ratio
  GainRatio = Gain / SplitEntropy = ( Entropy_before − Σₖ wₖ · Entropyₖ ) / ( − Σₖ wₖ · log₂(wₖ) )

Possible Split Criterion: Gini Index
§ The Gini index is based on Gini impurity:
  Gini(p) = 1 − Σᵢ₌₁ⁿ pᵢ²  for p ∈ ℚⁿ
§ Example (13 samples, 7 vs. 6): p₁ = 7/13, p₂ = 6/13
  Gini(p) = 1 − (7/13)² − (6/13)²
§ Split criterion: Gini_index = Σᵢ wᵢ · Giniᵢ
§ For the example split: Gini_index = (6/13)·Gini₁ + (7/13)·Gini₂
  with Gini₁ = Gini(5/6, 1/6), Gini₂ = Gini(2/7, 5/7), w₁ = 6/13, w₂ = 7/13
§ Next splitting feature: the feature with the lowest Gini_index

What happens for numerical Input Features?
§ A subset for each value? – NO
§ Solution: binary splits, e.g. x < 3.4 vs. x ≥ 3.4
[Figure: numeric values (1.2, 1.7, 2, 2.3, 3.6, 4.9, 7.4, 8, 9.2, 12.6) separated by the binary split threshold.]
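The split criteria above are straightforward to compute. Below is a minimal sketch in plain Python (not the KNIME implementation) of entropy, Gini impurity, and the information gain of a candidate split; the toy counts mirror the 7-vs-6 example from the slides.

```python
# Entropy, Gini impurity, and information gain for a candidate split.
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(p) = -sum(p_i * log2(p_i)) over the class frequencies."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini(p) = 1 - sum(p_i^2) over the class frequencies."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent_labels, child_partitions):
    """Gain = Entropy(parent) - sum(w_k * Entropy(child_k)), w_k = |child_k| / |parent|."""
    n = len(parent_labels)
    weighted = sum(len(part) / n * entropy(part) for part in child_partitions)
    return entropy(parent_labels) - weighted

# Toy example: a node with 7 "yes" and 6 "no", split into children of size 6 and 7.
parent = ["yes"] * 7 + ["no"] * 6
children = [["yes"] * 5 + ["no"] * 1, ["yes"] * 2 + ["no"] * 5]
print(round(entropy(parent), 3))             # ~0.996
print(round(gini(parent), 3))                # ~0.497
print(round(information_gain(parent, children), 3))
```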

The Deeper the Better?!
[Figure: a deep decision tree that keeps splitting on wind and temp (wind ≥ 4, temp ≥ 10, temp ≥ 25, temp ≥ 22, wind ≥ 6, temp ≥ 26, …), carving the training data into ever smaller regions.]

Overfitting vs Underfitting
§ Underfitted: the model overlooks underlying patterns in the training set
§ Generalized: the model captures the correlations in the training set
§ Overfitted: the model memorizes the training set rather than finding the underlying patterns

Overfitting vs Underfitting
§ Underfitting: a model that can neither model the training data nor generalize to new data
§ Overfitting: a model that fits the training data too well, including details and noise
  § Negative impact on the model's ability to generalize

Controlling the Tree Depth
Goal: a tree that generalizes to new data and doesn't overfit
§ Pruning
  § Idea: cut branches that seem to result from overfitting
  § Techniques: Reduced Error Pruning, Minimum Description Length
§ Early stopping
  § Idea: define a minimum size for the tree leaves

Pruning – Minimum Description Length (MDL)
Definition: description length = #bits(tree) + #bits(misclassified samples)
§ Example 1: many misclassified samples in tree 1 → DL(Tree 1) > DL(Tree 2) → select Tree 2
§ Example 2: only 1 misclassified sample in tree 1 → DL(Tree 1) < DL(Tree 2) → select Tree 1
[Figure: two candidate trees splitting on wind and temp, with the class counts at their leaves.]
Note: if removing a branch changes the performance only slightly, the smaller tree is preferred.

Applying the Model – What are the Outputs?
[Screenshot: applying the trained model and inspecting its outputs in KNIME Analytics Platform.]

No True Child Strategy
§ Training: the training data contain only outlook = sunny and outlook = rain, so the tree splits on outlook with the branches sunny and rain only.
§ Testing: what happens with outlook = overcast?

Outlook    Wind   Temp   Storage   Sailing
(training)
sunny      3      30     yes       yes
sunny      3      25     yes       no
rain       12     15     yes       yes
rain       16     25     yes       yes
sunny      14     18     yes       yes
rain       3      5      no        no
sunny      9      20     yes       yes
sunny      1      7      no        no
rain       4      25     yes       no
rain       14     24     yes       yes
(testing)
sunny      11     20     yes       yes
sunny      2      18     yes       no
overcast   8      22     yes       yes
overcast   13     24     yes       yes

Evaluation of Classification Models

Evaluation Metrics
§ Why evaluation metrics?
  § Quantify the power of a model
  § Compare model configurations and/or models, and select the best performing one
  § Obtain the expected performance of the model on new data
§ Different model evaluation techniques are available for
  § Classification / regression models
  § Imbalanced / balanced target class distributions

Overall Accuracy
§ Definition:
  Overall accuracy = # correct classifications (test set) / # all events (test set)
§ The proportion of correct classifications
§ Downsides:
  § Only considers the performance in general and not for the different classes
  § Therefore, not informative when the class distribution is unbalanced

Confusion Matrix for Sailing Example
§ Rows – true class values; columns – predicted class values
§ Numbers on the main diagonal – correctly classified samples
§ Numbers off the main diagonal – misclassified samples

Sailing yes/no     Predicted: yes   Predicted: no
True class: yes    22               3
True class: no     12               328
Accuracy = 350/365 = 0.96

Sailing yes/no     Predicted: yes   Predicted: no
True class: yes    0                25
True class: no     0                340
Accuracy = 340/365 = 0.93

Confusion Matrix
Arbitrarily define one class value as POSITIVE and the remaining class as NEGATIVE.
§ TRUE POSITIVE (TP): actual and predicted class is positive
§ TRUE NEGATIVE (TN): actual and predicted class is negative
§ FALSE NEGATIVE (FN): actual class is positive and predicted negative
§ FALSE POSITIVE (FP): actual class is negative and predicted positive

                       Predicted positive   Predicted negative
True class positive    TRUE POSITIVE        FALSE NEGATIVE
True class negative    FALSE POSITIVE       TRUE NEGATIVE

Use these four statistics to calculate other evaluation metrics, such as overall accuracy, true positive rate, and false positive rate.

ROC Curve
§ The ROC curve shows the false positive rate and the true positive rate for different threshold values
§ False positive rate (FPR): negative events incorrectly classified as positive
  FPR = FP / (FP + TN)
§ True positive rate (TPR): positive events correctly classified as positive
  TPR = TP / (TP + FN)
[Figure: ROC curve with the optimal threshold marked.]

Cohen's Kappa (κ) vs. Overall Accuracy
Example: two confusion matrices on the same 100 samples, where the two counts in the positive row are switched.

           Positive   Negative                   Positive   Negative
Positive   14         6          switched →      6          14
Negative   5          75                         5          75

§ Left matrix: p_e1 = (19/100)·(20/100), p_e2 = (81/100)·(80/100), p_e = p_e1 + p_e2 = 0.686
  Overall accuracy p_0 = 89/100 = 0.89
  κ = (p_0 − p_e) / (1 − p_e) = 0.204 / 0.314 ≈ 0.65
§ Right matrix: p_e1 = (11/100)·(20/100), p_e2 = (89/100)·(80/100), p_e = 0.734
  Overall accuracy p_0 = 81/100 = 0.81
  κ = (p_0 − p_e) / (1 − p_e) = 0.076 / 0.266 ≈ 0.29
§ κ = 1: perfect model performance
§ κ = 0: the model performance is equal to a random classifier
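All of these metrics follow directly from the four confusion-matrix counts. The sketch below (plain Python, not a KNIME Scorer node) reproduces the kappa example above, reading the matrices with the true classes in the rows.

```python
# Metrics derived from a 2x2 confusion matrix (TP, FN, FP, TN).
def confusion_metrics(tp, fn, fp, tn):
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    tpr = tp / (tp + fn)            # true positive rate
    fpr = fp / (fp + tn)            # false positive rate
    # Cohen's kappa: agreement beyond what is expected by chance
    p0 = accuracy
    pe = ((tp + fn) / total) * ((tp + fp) / total) \
       + ((fp + tn) / total) * ((fn + tn) / total)
    kappa = (p0 - pe) / (1 - pe)
    return accuracy, tpr, fpr, kappa

print(confusion_metrics(tp=14, fn=6, fp=5, tn=75))   # accuracy 0.89, kappa ~0.65
print(confusion_metrics(tp=6, fn=14, fp=5, tn=75))   # accuracy 0.81, kappa ~0.29
```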

Course exercises
§ All exercises are available here: [link]

Session 1 exercises

Download material
§ Download the course material from the KNIME Community Hub here: [link]
§ Note: you must be logged into the KNIME Hub to see the download icon for all course material.

Import material
§ Import the course material into KNIME Analytics Platform:
  1. Right click on the three dots and select Import workflow…
  2. Click on the file of interest
  3. Click Open

Exercise: Decision_Tree_exercise
§ Dataset: sales data of individual residential properties in Ames, Iowa from 2006 to 2010.
§ One of the columns is the overall condition ranking, with values between 1 and 10.
§ Goal: train a binary classification model which can predict whether the overall condition is high or low.
§ You can download the training workflows from the KNIME Community Hub at this link.

[L4-ML] Introduction to Machine Learning Algorithms – Session 2

Session 2 – Regression Models, Ensemble Models & Logistic Regression

In this session, the following subjects will be examined:
§ Regression Problems
§ Linear Regression Algorithm
§ Evaluation of Regression Models
§ Regression Tree
§ Ensemble Models
§ Logistic Regression

Regression Problems

Regression Overview
§ Goal: explain how the target attribute depends on the descriptive attributes
  § Target attribute → response variable
  § Descriptive attribute(s) → regressor variable(s)
§ Model the relationship as a parameterized function class f
  § Estimate the parameters that describe the relationship
  § The function must be simple enough for interpolation and extrapolation purposes
§ Example: line (black) vs. polynomial (blue) of degree 7

Regression
Predict numeric outcomes from existing data (supervised)

Applications
§ Forecasting
§ Quantitative analysis

Methods
§ Linear
§ Polynomial
§ Regression trees
§ Partial least squares

Linear Regression Algorithm

Regression Line
§ Given a data set with two continuous attributes, x and y
§ There is an approximate linear dependency between x and y:
  y ≈ a + bx   (a: intercept, b: slope)
§ We find a regression line (i.e., determine the parameters a and b) such that the line fits the data as well as possible
§ Examples:
  § Trend estimation (e.g., oil price over time)
  § Epidemiology (e.g., cigarette smoking vs. lifespan)
  § Finance (e.g., return on investment vs. return on all risky assets)
  § Economics (e.g., spending vs. available income)
[Figure: a fitted regression line with intercept a and slope b.]

Linear Regression
Predicts the values of the target variable y based on a linear combination of the values of the input feature(s) xⱼ
§ Two input features: ŷ = a₀ + a₁x₁ + a₂x₂
§ p input features: ŷ = a₀ + a₁x₁ + a₂x₂ + ⋯ + a_p x_p
§ Simple regression: one input feature → regression line
§ Multiple regression: several input features → regression hyperplane
§ Residuals: differences between observed and predicted values (errors); use the residuals to measure the model fit

Simple Linear Regression
§ Optimization goal: minimize the sum of squared residuals
  Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
[Figure: scatter plot with the fitted line ŷ = a₀ + a₁x and a residual eᵢ = yᵢ − ŷᵢ.]

Simple Linear Regression
§ Think of a straight line ŷ = f(x) = a + bx
§ Find a and b to model all observations (xᵢ, yᵢ) as closely as possible
  → the SSE F(a, b) = Σᵢ₌₁ⁿ (f(xᵢ) − yᵢ)² = Σᵢ₌₁ⁿ (a + bxᵢ − yᵢ)² should be minimal
§ That is:
  ∂F/∂a = Σᵢ₌₁ⁿ 2(a + bxᵢ − yᵢ) = 0
  ∂F/∂b = Σᵢ₌₁ⁿ 2(a + bxᵢ − yᵢ)·xᵢ = 0
  → a unique solution exists for a and b

Linear Regression
§ Optimization goal: minimize the squared residuals
  Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ ( yᵢ − Σⱼ₌₀ᵖ aⱼ xⱼ,ᵢ )² = (y − aX)ᵀ (y − aX)
§ Solution: â = (XᵀX)⁻¹ Xᵀ y
§ Computational issues:
  § XᵀX must have full rank, and thus be invertible (problems arise if linear dependencies between input features exist)
  § The solution may be unstable if the input features are almost linearly dependent
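As a small illustration of the closed-form solution â = (XᵀX)⁻¹ Xᵀ y, the sketch below fits a line to synthetic data with NumPy; the data and values are made up for the example.

```python
# Ordinary least squares via the normal equations, with an intercept column.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=50)     # y ≈ a + b*x + noise

X = np.column_stack([np.ones_like(x), x])              # design matrix [1, x]
a_hat = np.linalg.solve(X.T @ X, X.T @ y)              # solve (X^T X) a = X^T y
print("intercept, slope:", a_hat)

residuals = y - X @ a_hat
print("SSE:", float(residuals @ residuals))
```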

Linear Regression: Summary
§ Positive:
  § Strong mathematical foundation
  § Simple to calculate and to understand (for a moderate number of dimensions)
  § High predictive accuracy (in many applications)
§ Negative:
  § Many dependencies are non-linear (can be generalized)
  § The model is global and cannot adapt well to locally different data distributions (but: locally weighted regression, CART)

Polynomial Regression
Predicts the values of the target variable y based on a polynomial combination of degree d of the values of the input feature(s) xⱼ
  ỹ = a₀ + Σⱼ₌₁ᵖ aⱼ,₁ xⱼ + Σⱼ₌₁ᵖ aⱼ,₂ xⱼ² + ⋯ + Σⱼ₌₁ᵖ aⱼ,d xⱼᵈ
§ Simple regression: one input feature → regression curve
§ Multiple regression: several input features → regression hypersurface
§ Residuals: differences between observed and predicted values (errors); use the residuals to measure the model fit

Evaluation of Regression Models

Numeric Errors: Formulas
§ R-squared: 1 − Σᵢ₌₁ⁿ (yᵢ − f(xᵢ))² / Σᵢ₌₁ⁿ (yᵢ − ȳ)²
  Universal range; the closer to 1 the better.
§ Mean absolute error (MAE): (1/n) Σᵢ₌₁ⁿ |yᵢ − f(xᵢ)|
  Equal weights to all distances; same unit as the target column.
§ Mean squared error (MSE): (1/n) Σᵢ₌₁ⁿ (yᵢ − f(xᵢ))²
  Common loss function.
§ Root mean squared error (RMSE): sqrt( (1/n) Σᵢ₌₁ⁿ (yᵢ − f(xᵢ))² )
  Weights big differences more; same unit as the target column.
§ Mean signed difference: (1/n) Σᵢ₌₁ⁿ (yᵢ − f(xᵢ))
  Only informative about the direction of the error.
§ Mean absolute percentage error (MAPE): (1/n) Σᵢ₌₁ⁿ |yᵢ − f(xᵢ)| / |yᵢ|
  Requires non-zero target column values.

MAE (Mean Absolute Error) vs. RMSE (Root Mean Squared Error)
MAE:
§ Easy to interpret – the mean absolute error
§ All errors are equally weighted
§ Generally smaller than RMSE
RMSE:
§ Cannot be directly interpreted as the average error
§ Larger errors are weighted more
§ Ideal when large deviations need to be avoided
Example: actual values = [2, 4, 5, 8]
§ Case 1: predicted values = [4, 6, 8, 10] → MAE 2.25, RMSE 2.29
§ Case 2: predicted values = [4, 6, 8, 14] → MAE 3.25, RMSE 3.64

R-squared vs. RMSE
R-squared:
§ Relative measure: the proportion of variability explained by the model
§ Range: usually between 0 and 1; 0 = no variability explained, 1 = all variability explained
RMSE:
§ Absolute measure: how much deviation at each point
§ Same scale as the target
Example: actual values = [2, 4, 5, 8]
§ Case 1: predicted values = [3, 4, 5, 6] → R-sq 0.96, RMSE 1.12
§ Case 2: predicted values = [3, 3, 7, 7] → R-sq 0.65, RMSE 1.32
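A minimal sketch of the metrics from the formula table, checked against the MAE/RMSE toy example above (plain NumPy, not the KNIME Numeric Scorer node).

```python
# MAE, RMSE and R-squared for a vector of actual and predicted values.
import numpy as np

def metrics(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    err = y - y_hat
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    r2 = 1 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)
    return {"MAE": round(mae, 2), "RMSE": round(rmse, 2), "R2": round(r2, 2)}

actual = [2, 4, 5, 8]
print(metrics(actual, [4, 6, 8, 10]))   # MAE 2.25, RMSE ~2.29
print(metrics(actual, [4, 6, 8, 14]))   # MAE 3.25, RMSE ~3.64
```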

Numeric Scorer
§ Similar to the Scorer node, but for nodes with numeric predictions
§ Compares the dependent variable values to the predicted values to evaluate model quality
§ Reports R², RMSE, MAPE, etc.

Regression Tree

Regression Tree: Goal
§ We want to model the target variable y with piecewise lines
  → no knowledge of the functional form is required

Regression Tree: Initial Split
§ Local mean for the observations in segment m: c_m = (1/n) Σᵢ yᵢ
§ Sum of squared errors: E_m = Σᵢ (yᵢ − c_m)²
§ The optimal boundary s should minimize the total squared sum Σ_m E_m over all segments m
[Figure: the data split at a boundary s on x, with a local mean per segment.]

Regression Tree: Initial Split / Growing the Tree
[Figure: the first split x ≤ 93.5 yields the segment means c₁ = 28.9 and c₂ = 17.8; repeating the splitting process within each segment adds x ≤ 70.5 with c₁ = 33.9 and c₂ = 26.4.]

Regression Tree: Final Model
[Figure: the final piecewise-constant prediction of y over x.]

Regression Tree: Algorithm
Start with a single node containing all points.
1. Calculate cᵢ and Eᵢ.
2. If all points have the same value for feature xⱼ, stop.
3. Otherwise, find the best binary split that reduces E_{j,s} as much as possible.
   § If E_{j,s} doesn't reduce enough → stop
   § If a node would contain less than the minimum node size → stop
   § Otherwise, take that split, creating two new nodes.
   § In each new node, go back to step 1.
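Step 3 of the algorithm, finding the best binary split for a numeric feature, can be sketched as follows; the data values below are made up for the example.

```python
# Find the threshold that minimizes the total sum of squared errors
# of the two segments (the regression-tree split criterion).
import numpy as np

def sse(values):
    return float(np.sum((values - values.mean()) ** 2)) if len(values) else 0.0

def best_split(x, y):
    order = np.argsort(x)
    x, y = np.asarray(x, float)[order], np.asarray(y, float)[order]
    best = (None, np.inf)
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue                      # no boundary between equal values
        s = (x[i] + x[i - 1]) / 2         # candidate threshold
        total = sse(y[x <= s]) + sse(y[x > s])
        if total < best[1]:
            best = (s, total)
    return best                            # (threshold, total squared error)

x = np.array([60, 65, 72, 80, 90, 95, 100, 110])
y = np.array([35, 33, 30, 28, 27, 18, 17, 16])
print(best_split(x, y))
```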

Regression Trees: Summary
§ Differences to decision trees:
  § Splitting criterion: minimizing the intra-subset variation (error)
  § Pruning criterion: based on a numeric error measure
  § The leaf node predicts the average target value of the training instances reaching that node
§ Can approximate piecewise constant functions
§ Easy to interpret

Regression Trees: Pros & Cons
§ Finds (local) regression values (averages)
§ Problems:
  § No interpolation across borders
  § Heuristic algorithm: unstable and not optimal
§ Extensions:
  § Fuzzy trees (better interpolation)
  § Local models for each leaf (linear, quadratic)

Ensemble Models

Tree Ensemble Models
§ General idea: take advantage of the "wisdom of the crowd"
§ Ensemble models combine the predictions from many predictors, e.g. decision trees
§ This leads to a more accurate and robust model
§ The model is difficult to interpret
§ There are multiple trees in the model
§ Typically, for classification the individual models vote and the majority wins; for regression the individual predictions are averaged
[Figure: n trees P1 … Pn each produce a prediction for X; the predictions are combined into y.]

Bagging – Idea
§ One option is "bagging" (Bootstrap AGGregatING)
§ For each tree / model, a training set is generated by sampling uniformly with replacement from the standard training set
§ A tree is built on each sampled training set

Example for Bagging
Full training set                    Sampled training set
RowID   x1   x2   y                  RowID   x1   x2   y
Row_1   2    6    Class 1            Row_3   9    3    Class 2
Row_2   4    1    Class 2            Row_6   2    6    Class 1
Row_3   9    3    Class 2            Row_1   2    6    Class 1
Row_4   2    7    Class 1            Row_3   9    3    Class 2
Row_5   8    1    Class 2            Row_5   8    1    Class 2
Row_6   2    6    Class 1            Row_6   2    6    Class 1
Row_7   5    2    Class 2            Row_1   2    6    Class 1
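Bootstrap sampling itself is a one-liner. The sketch below draws a few bagged training sets (with replacement) and also reports the rows left out of each sample, which the out-of-bag estimation on the next slide relies on.

```python
# Bootstrap sampling for bagging: each model gets a training set drawn
# uniformly with replacement from the original rows.
import random

rows = ["Row_1", "Row_2", "Row_3", "Row_4", "Row_5", "Row_6", "Row_7"]

def bootstrap_sample(data, rng):
    return [rng.choice(data) for _ in range(len(data))]

rng = random.Random(42)
for i in range(3):                                   # three bagged training sets
    sample = bootstrap_sample(rows, rng)
    out_of_bag = sorted(set(rows) - set(sample))     # rows never drawn
    print(f"model {i}: sample={sample} out-of-bag={out_of_bag}")
```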

An Extra Benefit of Bagging: Out-of-Bag Estimation
§ The model can be evaluated using the training data
§ Each tree is applied to the samples that haven't been used for training it
[Figure: each tree is scored on its out-of-bag samples X1, X2, …, yielding out-of-bag predictions y₁ᴼᴼᴮ, y₂ᴼᴼᴮ, …]

Random Forest
§ A bag of decision trees, with an extra element of randomization
§ Each node in the decision tree only "sees" a subset of the input features, typically √N to pick from
§ Random forests tend to be very robust w.r.t. overfitting

Boosting – Idea
§ Starts with a single tree built from the data
§ Fits a tree to the residual errors from the previous model to refine the model sequentially

Boosting – Idea
§ Gradient boosting method
§ A shallow tree (depth 4 or less) is built at each step
  § To fit the residual errors from the previous step
  § Resulting in a tree h_m(x)
§ The resulting tree is added to the latest model to update it:
  F_m(x) = F_{m−1}(x) + γ_m · h_m(x)
  § Where F_{m−1}(x) is the model from the previous step
  § The weight γ_m is chosen to minimize the loss function
  § Loss function: quantifies the difference between model predictions and data

Gradient Boosting Example – Regression
[Figure: a regression tree with depth 1 is fitted to the residuals at each iteration.]

Gradient Boosted Trees
§ Can be used for classification and regression
§ Large number of iterations – prone to overfitting
  § ~100 iterations are sufficient
§ Can introduce randomness in the choice of data subsets ("stochastic gradient boosting") and the choice of input features
15
21/06/24

Ensemble Tree Nodes in KNIME Analytics Platform Parameter Optimization


Classification Problems Regression Problems

These slides are a derivative of KNIME Course These slides are a derivative of KNIME Course
Material of KNIME AG used under CC BY 4.0 98 Material of KNIME AG used under CC BY 4.0 99

98 99

What is a Logistic Regression (algorithm)?


§ Another algorithm to train a classification model

I know already the


decision tree algorithm
and tree ensembles.
Why do I need another
Logistic Regression one?

These slides are a derivative of KNIME Course


Material of KNIME AG used under CC BY 4.0 101

100 101

Why Shouldn't we Always use the Decision Tree?
[Figure: two example data sets with points marked "?" to be classified.]

Decision Boundary of a Logistic Regression
[Figure: the decision boundary a logistic regression model draws for the same data.]

Linear Regression vs. Logistic Regression

                          Linear Regression                        Logistic Regression
Target variable y         numeric, y ∈ (−∞, ∞) / [a, b]            nominal, y ∈ {0, 1, 2, 3} / {red, white}
Functional relationship   … the target value y:                    … the class probability P(y = class i):
between the features      y = f(x₁, …, xₙ, β₀, …, βₙ)              P(y = cᵢ) = f(x₁, …, xₙ, β₀, …, βₙ)
and …                     y = β₀ + β₁x₁ + ⋯ + βₙxₙ

Goal: find the regression coefficients β₀, …, βₙ

Let's Find Out How Binary Logistic Regression Works!
§ Idea: train a function which gives us the probability for each class (0 and 1) based on the input features
§ Recap on probabilities:
  § Probabilities are always between 0 and 1
  § The probabilities of all classes sum up to 1
  § P(y = 1) = p₁  ⇒  P(y = 0) = 1 − p₁
  → It is sufficient to model the probability for one class

Let's Find Out How Binary Logistic Regression Works!
§ Model:
  P(y = 1) = f(x₁, x₂; β₀, β₁, β₂) := 1 / (1 + e^−(β₀ + β₁x₁ + β₂x₂))
[Figures: the feature space and the probability function given x₁ = 2.]

More General: Binary Logistic Regression
§ Model: π = P(y = 1) = 1 / (1 + exp(−z)), with z = β₀ + β₁x₁ + ⋯ + βₙxₙ = Xβ
§ Goal: find the regression coefficients β = (β₀, …, βₙ)
§ Notation:
  § yᵢ is the class value for sample i
  § x₁, …, xₙ is the set of input features, X = (1, x₁, …, xₙ)
  § The training data set has m observations (yᵢ, x₁ᵢ, …, xₙᵢ)

How can we Find the Best Coefficients β?
§ Maximize the product of the probabilities → the likelihood function
  L(β; y, X) = ∏ᵢ₌₁ᵐ P(y = yᵢ) = ∏ᵢ₌₁ᵐ πᵢ^yᵢ · (1 − πᵢ)^(1 − yᵢ)
§ Why does it make sense to maximize this function?
  P(y = yᵢ) = πᵢ if yᵢ = 1, and 1 − πᵢ if yᵢ = 0, with πᵢ = P(y = 1)
  (remember: u⁰ = 1 and u¹ = u for u ∈ ℝ, so both cases are covered by πᵢ^yᵢ · (1 − πᵢ)^(1 − yᵢ))

Max Likelihood and Log Likelihood Functions
§ Maximize the likelihood function L(β; y, X):
  max_β L(β; y, X) = max_β ∏ᵢ₌₁ᵐ πᵢ^yᵢ · (1 − πᵢ)^(1 − yᵢ)
§ This is equivalent to maximizing the log likelihood function LL(β; y, X):
  max_β LL(β; y, X) = max_β Σᵢ₌₁ᵐ [ yᵢ ln(πᵢ) + (1 − yᵢ) ln(1 − πᵢ) ]

How can we Find these Coefficients?
§ To find the coefficients of our model we want to find β so that the value of the function LL(β; y, X) is maximal
§ KNIME Analytics Platform provides two algorithms:
  § Iteratively re-weighted least squares (uses the idea of the Newton method)
  § Stochastic average gradient descent
§ Z-normalization of the input data leads to better convergence

Idea: Gradient Descent Method
§ max LL(β; X, y) ⟺ min −LL(β; X, y)
§ Example: min f(β) := −LL(β)
  § Start from an arbitrary point
  § Move towards the minimum with step size Δs
  § If f(β) is strictly convex → only one global minimum exists
[Figure: gradient descent steps toward the optimal β.]
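The sketch below fits a binary logistic regression by plain gradient descent on the negative log likelihood. It is only meant to illustrate the idea; KNIME's solvers (iteratively re-weighted least squares, stochastic average gradient descent) are more sophisticated, and the data here are synthetic.

```python
# Binary logistic regression fitted by gradient descent on -LL(beta).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_beta = np.array([0.5, 2.0, -1.0])                  # intercept, beta1, beta2
Xd = np.column_stack([np.ones(len(X)), X])              # add the intercept column
y = (rng.uniform(size=200) < sigmoid(Xd @ true_beta)).astype(float)

beta = np.zeros(3)
step = 0.1                                              # learning rate (delta s)
for _ in range(2000):
    pi = sigmoid(Xd @ beta)                             # current probabilities
    gradient = Xd.T @ (pi - y) / len(y)                 # gradient of -LL / m
    beta -= step * gradient                             # move towards the minimum
print("estimated beta:", beta.round(2))
```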

Learning Rate / Step Length Δs
[Figure: Δs too small (slow convergence), Δs too large (oscillation around the minimum), just right.]

Learning Rate Δs
§ Fixed: Δs_k = Δs₀
§ Annealing: the learning rate decays with the iteration number k and a decay rate α, e.g. Δs_k = Δs₀ / (1 + α·k)
§ Line search: a learning rate strategy that tries to find the optimal learning rate

Is there a way to handle Overfitting as well? (optional)
§ To avoid overfitting: add regularization by penalizing large weights
§ L2 regularization = the coefficients are Gauss distributed with σ = 1/λ:
  l(β; y, X) := −LL(β; y, X) + (λ/2) · ‖β‖₂²
§ L1 regularization = the coefficients are Laplace distributed with σ = 2/λ:
  l(β; y, X) := −LL(β; y, X) + λ · ‖β‖₁
⇒ The smaller σ, the smaller the coefficients β

Impact of Regularization
[Figure: the coefficient magnitudes shrink as the regularization strength increases.]

Interpretation of the Coefficients
§ Interpretation of the sign:
  § βᵢ > 0: a bigger xᵢ leads to a higher probability
  § βᵢ < 0: a bigger xᵢ leads to a smaller probability

Interpretation of the p Value
§ p-value < α: the input feature has a significant impact on the dependent variable

Summary: Logistic Regression
§ Logistic regression is used for classification problems
§ The regression coefficients are calculated by maximizing the likelihood function, which has no closed-form solution, hence iterative methods are used
§ Regularization can be used to avoid overfitting
§ The p-value shows us whether an independent variable is significant

Session 2 exercises

Exercises
§ Regression exercises (goal: predicting the house price):
  § 01_Linear_Regression_exercise
  § 02_Regression_Tree_exercise
§ Classification exercises (goal: predicting the house condition, high / low):
  § 03_Random_Forest_exercise (with an optional exercise to build a parameter optimization loop)
  § 04_Logistic_Regression_exercise

[L4-ML] Introduction to Machine Learning Algorithms – Session 3

Session 3 – Neural Networks and Recommendation Engines

In this session, the following subjects will be examined:
§ Artificial Neurons and Networks
§ Deep Learning
§ Recurrent Neural Networks
§ Convolutional Neural Networks (CNN)
§ Recommendation Engines

Artificial Neurons and Networks

Biological vs. Artificial
§ Biological neuron → artificial neuron (perceptron):
  y = f( Σᵢ₌₀ⁿ xᵢ wᵢ ) = f(x₁w₁ + x₂w₂ + b), with bias b = w₀
§ Biological neural networks → artificial neural networks (multilayer perceptron, MLP)

Architecture / Topology
§ Input layer, hidden layer, output layer; fully connected, feed forward
§ Forward pass:
  o = f(W² x)
  y = f(W³ o)
[Figure: a network with inputs x₁ … x₄, hidden units o₁², o₂², o₃², and output y, with weights W² and W³.]

Same with Matrix Notations
§ Forward pass:
  o = f(W² x)
  y = f(W³ o)
  where f( ) is the activation function
[Figure: the same network written with the weight matrices W² and W³.]

Frequently used activation functions
§ Sigmoid: f(a) = 1 / (1 + e^(−a))
§ Tanh: f(a) = (e^(2a) − 1) / (e^(2a) + 1)
§ Rectified Linear Unit (ReLU): f(a) = max(0, a)
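The forward pass o = f(W²x), y = f(W³o) is just two matrix-vector products wrapped in activation functions. The sketch below uses an arbitrary 2-3-1 network with tanh hidden units and a sigmoid output; the weights are random illustration values, not the ones from the certification example that follows.

```python
# Forward pass of a tiny 2-3-1 multilayer perceptron.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
W2, b2 = rng.normal(size=(3, 2)), rng.normal(size=3)   # hidden layer: 3 neurons, 2 inputs
W3, b3 = rng.normal(size=3), rng.normal()              # output layer: 1 neuron

def forward(x):
    o = np.tanh(W2 @ x + b2)        # hidden activations, f = tanh
    return sigmoid(W3 @ o + b3)     # output probability, f = sigmoid

print(forward(np.array([0.57, 0.8])))   # e.g. normalized (minutes attended, workflows built)
```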

Example: Passing the KNIME L1-Certification
§ Input features: x₁ = minutes attended, x₂ = workflows built
§ Output: ŷ = probability that a person passed; ŷ ≥ 0.5 ⇒ Passed, ŷ < 0.5 ⇒ Failed
[Figure: scatter plot of workflows built vs. minutes attended, with "passed certification" and "didn't pass certification" as classes.]
[Figure: a network with the two inputs, one hidden layer (activation f₁ = tanh) and one output neuron (activation f₂ = sigmoid); initial weights e.g. −0.566, −0.044, 2.275, −1.41, −0.608, −1.431, −0.513, −1.298, 1.0733.]

Example: Passing the KNIME L1-Certification
§ New sample: x₁ = minutes attended = 170, x₂ = workflows built = 8
§ The inputs are normalized (170 → 0.567, 8 → 0.8) and passed through the network; the output is ŷ = 0.013
§ This prediction looks wrong. Why does the network predict such a low probability?
§ Because the network has not been trained yet: the current weights were selected randomly.

Training a Neural Network = Finding Good Weights
§ For the new sample the untrained network predicts ŷ = 0.013, but the true value is y = 1
§ Loss function: binary cross entropy
  ℒ = −( y · log(ŷ) + (1 − y) · log(1 − ŷ) )
§ Cost over the training set: J(W) = Σ ℒ( ŷ(x₁, x₂, W), y )

Learning Rule from Gradient Descent
§ Adjust the weight for the next step by the adjustment term Δw(t):
  w(t + 1) = w(t) + η · Δw(t)
[Figure: the loss as a function of a weight w; the weight in the current step t is updated toward the optimal solution based on the gradient.]

Idea Behind Gradient Descent
§ The gradient collects the partial derivatives of the cost with respect to the weights:
  ∇_W J(x, W) = ( ∂J/∂w₁, ∂J/∂w₂, … )

Backpropagation
§ An efficient way to calculate the gradient during optimization
§ Forward pass: x →(w₁)→ z →(w₂)→ ŷ → J(ŷ, y; w₁, w₂)
§ Backward pass (chain rule):
  ∂J/∂w₂ = ∂J/∂ŷ · ∂ŷ/∂w₂
  ∂J/∂w₁ = ∂J/∂ŷ · ∂ŷ/∂z · ∂z/∂w₁
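To make the chain rule concrete, the sketch below assumes the simplest possible chain (linear nodes z = w₁x and ŷ = w₂z with a squared loss, which the slide leaves unspecified) and checks the backpropagated gradient against a numerical estimate.

```python
# Backpropagation on a two-weight chain, verified numerically.
def forward(x, w1, w2):
    z = w1 * x
    y_hat = w2 * z
    return z, y_hat

def gradients(x, y, w1, w2):
    z, y_hat = forward(x, w1, w2)
    dJ_dyhat = 2 * (y_hat - y)          # J = (y_hat - y)^2
    dJ_dw2 = dJ_dyhat * z               # dy_hat/dw2 = z
    dJ_dw1 = dJ_dyhat * w2 * x          # dy_hat/dz = w2, dz/dw1 = x
    return dJ_dw1, dJ_dw2

x, y, w1, w2 = 1.5, 2.0, 0.4, 0.7
print(gradients(x, y, w1, w2))

# Numerical check of dJ/dw1 via central differences
eps = 1e-6
J = lambda a, b: (forward(x, a, b)[1] - y) ** 2
print((J(w1 + eps, w2) - J(w1 - eps, w2)) / (2 * eps))
```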

Example: Passing the KNIME L1 Certification
[Figure: the same network after training, with updated weights; for the normalized sample (170 minutes → 0.567, 8 workflows → 0.8) the prediction is now ŷ = 0.967.]

Loss Landscape of a Real Neural Network
§ A good choice for the step size η, aka the learning rate, is important:
  W ← W − η · ∇_W J(x, W)
§ Many different optimizers with adaptive learning rates are available: Adam, Adadelta, Adagrad, etc.
§ Other important settings:
  § Batch size, aka the number of samples used for one update
  § Number of epochs, aka how often each sample is used
Source: https://www.cs.umd.edu/~tomg/projects/landscapes/

Optimizers in Keras
§ SGD with momentum – uses the previous gradient to accelerate convergence. Strengths: reduces oscillation near maxima. Weaknesses: constant learning rate.
§ NAG (Nesterov accelerated gradient) – uses the current gradient to predict the gradient. Strengths: increased responsiveness. Weaknesses: additional hyperparameter. When to use: RNNs.
§ Adagrad – updates by cumulating the sum of squared gradients from the past. Strengths: different learning parameters for different features. Weaknesses: computationally expensive, shrinking learning rate. When to use: sparse data (e.g. text).
§ Adadelta – modified Adagrad with a decaying average of squared gradients from the past. Strengths: learning rate not dramatically shrinking like Adagrad. Weaknesses: computationally expensive. When to use: sparse data (e.g. text).
§ RMSProp – modified Adagrad where squared gradients are added very slowly. Strengths: learning rate not dramatically shrinking like Adagrad.
§ Adam (Adaptive Moment Estimation) – RMSProp plus a decaying average of gradients from the past. Strengths: fast convergence. Weaknesses: computationally expensive.

Which Activation Functions? Which Loss Functions?
§ The choice depends on the problem you are working on.
[Table: recommended (✓) and usable (Δ) activation functions for the hidden layers (Sigmoid, ReLU, Tanh) and the output layer (Sigmoid, Softmax, Linear, ReLU, Tanh), and loss functions (Categorical CE, Binary CE, Hinge, MSLE, MAE, MSE), per problem type: binary classification (0 vs 1), binary classification (−1 vs 1), multi-class classification, regression, regression (wide range), regression (possible outliers).]

Codeless Deep Learning with KNIME Analytics Platform
§ A simple option for feed forward neural networks with the sigmoid activation function
[Screenshots: the corresponding KNIME nodes and workflows.]

Deep Learning

Recurrent Neural Networks

What are Recurrent Neural Networks?
§ Recurrent neural networks (RNNs) are a family of neural networks used for processing sequential data
§ RNNs are used for all sorts of tasks:
  § Language modeling / text generation
  § Text classification
  § Neural machine translation
  § Image captioning
  § Speech to text
  § Numerical time series data, e.g. sensor data

Why do we need RNNs for Sequential Data?
§ Goal: a translation network from German to English
  "Ich mag Schokolade" => "I like chocolate"
§ One option: use a feed forward network to translate word by word
  (Ich → I, mag → like, Schokolade → chocolate)
§ But what happens with this question?
  "Mag ich Schokolade?" => "Do I like chocolate?"

Why do we need RNNs for Sequential Data?
§ Problems:
  § Each time step is completely independent
  § For translations we need context
  § More generally: we need a network that remembers inputs from the past
§ Solution: recurrent neural networks

What are RNNs?
[Figure: an RNN cell with a feedback loop, unrolled over time.]
Image source: Christopher Olah, https://colah.github.io/posts/2015-08-Understanding-LSTMs/

From Feed Forward to Recurrent Neural Networks
[Figures: a feed forward network x → hidden layer → y, and the same network unrolled over the time steps x_{t−1}, x_t, x_{t+1}, sharing the weight matrices W² and W³ and passing the hidden state from one step to the next.]

Simple RNN unit
[Figure: a simple RNN cell containing a single layer.]
Image source: Christopher Olah, https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Limitations of Simple Layer Structures
The "memory" of simple RNNs is sometimes too limited to be useful:
§ "Cars drive on the ___" (road)
§ "I love the beach – my favorite sound is the crashing of the ___" (cars? glass? waves?)

LSTM = Long Short Term Memory Unit
§ A special type of unit with three gates:
  § Forget gate
  § Input gate
  § Output gate
[Figure: an LSTM cell.]
Image source: Christopher Olah, https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Different Network Structures and Applications
§ Many to many:
  § Language model ("I like sailing" → the next word at each step)
  § Neural machine translation ("I like sailing" ↔ "Ich gehe gerne segeln", with <sos> / <eos> tokens)

Different Network Structures and Applications
§ Many to one:
  § Language classification ("I like to go sailing" → English)
  § Text classification
§ One to many:
  § Image captioning (image → "Couple sailing on a lake")

Neural Network: Code-free
[Screenshot: building the network code-free in KNIME Analytics Platform.]

Convolutional Neural Networks (CNN)

Convolutional Neural Network (CNN)
§ A CNN is a neural network with at least one convolutional layer.
§ CNNs are commonly used when the data has spatial relationships, e.g. images.
§ A CNN learns a hierarchy of features, using multiple convolution layers that detect different features.
Images from: http://introtodeeplearning.com/slides/6S191_MIT_DeepLearning_L3.pdf

Convolutional Neural Networks
§ Instead of connecting every neuron to the new layer, a sliding window is used, which applies a filter to different parts of the image.
§ Some convolutions may detect edges or corners, while others may detect cats, dogs, or street signs inside an image.
Image from: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
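The sliding-window idea can be shown with a single handwritten convolution. The sketch below applies one 3×3 filter (a simple vertical-edge detector) to a toy grayscale image; real CNNs learn many such filters.

```python
# A single 2D convolution: slide a 3x3 kernel over the image and
# sum the element-wise products at each position.
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.zeros((6, 6))
image[:, 3:] = 1.0                        # left half dark, right half bright
edge_kernel = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]])
print(conv2d(image, edge_kernel))         # strong response along the vertical edge
```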

Building CNNs with KNIME
[Screenshot: a KNIME workflow for building a CNN.]

Recommendation Engines

Recommendation Engines and Market Basket Analysis
§ A-priori algorithm: from the analysis of many shopping baskets, derive rules of the form IF {items in the basket} THEN {recommendation}.
§ Collaborative filtering: from the analysis of the reactions of many people to the same item.
  IF A has the same opinion as B on an item, THEN A is more likely to have B's opinion on another item than that of a randomly chosen person.

A-priori Algorithm: the Association Rule
IF {antecedent} THEN {consequent}

Building the Association Rule
§ Start from N shopping baskets, e.g.:
  {A, B, F, H}, {A, B, C}, {B, C, H}, {D, E, F}, {D, E}, {A, B}, {A, C}, {H, F}
§ Search for frequent item sets

From "Frequent Itemsets" to "Rules"
§ From the frequent itemset {A, B, F, H} we can build the candidate rules:
  {A, B, F} => H,  {A, B, H} => F,  {A, F, H} => B,  {B, F, H} => A
§ Which rules shall I choose? The rules with support, confidence and lift above a threshold → the most reliable ones

Support, Confidence, and Lift
For a rule {A, B, F} => H:
§ Item set support s = freq(A, B, F, H) / N
  – how often these items are found together
§ Rule confidence c = freq(A, B, F, H) / freq(A, B, F)
  – how often the antecedent occurs together with the consequent
§ Rule lift = support({A, B, F} ⇒ H) / ( support(A, B, F) × support(H) )
  – how often antecedent and consequent happen together compared with random chance

Association Rule Mining (ARM): Two Phases
Discover all frequent and strong association rules X ⇒ Y ("if X then Y") with sufficient support and confidence.
Two phases:
1. Find all frequent itemsets (FI) ← most of the complexity
   § Select itemsets with a minimum support: FI = { (X, Y), X, Y ⊂ I | s(X, Y) ≥ S_min }
2. Build strong association rules
   § Select rules with a minimum confidence: Rules: X ⇒ Y, X, Y ⊂ FI, c(X ⇒ Y) ≥ C_min
(S_min and C_min are user parameters.)

A-Priori Algorithm: Example
§ Let's consider milk, diaper, and beer: {milk, diaper} ⇒ beer
§ How often are they found together across all shopping baskets?
  support s(milk, diaper, beer) = P(milk, diaper, beer) / T = 2/5 = 0.4
§ How often are they found together across all shopping baskets containing the antecedents?
  confidence c = P(milk, diaper, beer) / P(milk, diaper) = 2/3 = 0.67

TID   Transactions
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

A-priori Algorithm: an Example
§ Let's consider milk, diaper, and beer: {milk, diaper} ⇒ beer
§ s(milk, diaper) = P(milk, diaper) / T = 3/5 = 0.6
§ s(beer) = P(beer) / T = 3/5 = 0.6
§ Rule lift = s(milk, diaper, beer) / ( s(milk, diaper) × s(beer) ) = 0.4 / (0.6 × 0.6) = 1.11

Association Rule Mining: Is it Useful?
§ David J. Hand (2004): "Association Rule Mining is likely the field with the highest ratio of number of published papers per reported application."
§ KNIME Blog post:
  https://www.knime.com/knime-applications/market-basket-analysis-and-recommendation-engines
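The same support, confidence, and lift values can be recomputed directly from the five example transactions; a minimal sketch in plain Python (not the KNIME implementation):

```python
# Support, confidence and lift of the rule {milk, diaper} => {beer}
# on the slide's five example transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "coke"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "coke"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"milk", "diaper"}, {"beer"}
s = support(antecedent | consequent)                      # 0.4
confidence = s / support(antecedent)                       # ~0.67
lift = s / (support(antecedent) * support(consequent))     # ~1.11
print(s, round(confidence, 2), round(lift, 2))
```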

Recommendation Engines or Market Basket Analysis
§ Collaborative filtering: from the analysis of the reactions of many people to the same item.
  IF A has the same opinion as B on an item, THEN A is more likely to have B's opinion on another item than that of a randomly chosen person.

Collaborative Filtering (CF)
Collaborative filtering systems have many forms, but many common systems can be reduced to two steps:
1. Look for users who share the same rating patterns with the active user (the user whom the recommendation is for)
2. Use the ratings from those like-minded users found in step 1 to calculate a prediction for the active user
§ Implemented in Spark:
  https://www.knime.com/blog/movie-recommendations-with-spark-collaborative-filtering

Exercises:
§ Neural Network
§ Goal: Train an MLP to solve our classification problem (rank: high/low)
§ 01_Simple_Neural_Network_exercise

§ Market Basket Analysis


§ 02_Build_Association_Rules_for_MarketBasketAnalysis_exercise
Session 3 exercises § 03_Apply_Association_Rules_for_MarketBasketAnalysis_exercise


[L4-ML] Introduction to Machine Learning Algorithms
Session 4

Session 4 – Clustering & Data Preparation

In this session, the following subjects will be examined:
§ Unsupervised Learning: Clustering
§ Clustering: Partitioning (k-Means)
§ Clustering: Distance Functions
§ Clustering: Quality Measures (Silhouette)
§ Clustering: Linkage (Hierarchical Clustering)
§ Clustering: Density (DBSCAN)
§ Data Preparation
§ Data Preparation: Normalization
§ Data Preparation: Missing Value Imputation
§ Data Preparation: Outlier Detection
§ Data Preparation: Dimensionality Reduction
§ Data Preparation: Feature Selection
§ Data Preparation: Feature Engineering


Unsupervised Learning: Clustering

Goal of Clustering Analysis

Discover hidden structures in unlabeled data (unsupervised)

Clustering identifies a finite set of groups (clusters) C_1, C_2, ..., C_k in the dataset such that:
§ Objects within the same cluster C_i shall be as similar as possible
§ Objects of different clusters C_i, C_j (i ≠ j) shall be as dissimilar as possible

(Figure: example clusters C1 to C6)

Clustering Applications

Methods
§ k-Means
§ Hierarchical
§ DBSCAN

Examples
§ Customer segmentation
§ Molecule search
§ Anomaly detection

§ Find "natural" clusters and describe them → Data understanding
§ Find useful and suitable groups → Data Class Identification
§ Find representatives for homogeneous groups → Data Reduction
§ Find unusual data objects → Outlier Detection
§ Find random perturbations of the data → Noise Detection

Cluster Properties

§ Clusters may have different sizes, shapes, densities
§ Clusters may form a hierarchy
§ Clusters may be overlapping or disjoint


Types of Clustering Approaches

§ Linkage based, e.g. Hierarchical Clustering
§ Density based Clustering, e.g. DBSCAN
§ Clustering by Partitioning, e.g. k-Means

§ No clustering method works universally well


Clustering: Partitioning
k-Means

Partitioning

Goal:
A (disjoint) partitioning into k clusters with minimal costs

§ Local optimization method:
§ choose k initial cluster representatives
§ optimize these representatives iteratively
§ assign each object to its most similar cluster representative
§ Types of cluster representatives:
§ Mean of a cluster (construction of central points)
§ Median of a cluster (selection of representative points)
§ Probability density function of a cluster (expectation maximization)

k-Means Algorithm

Given k, the k-Means algorithm is implemented in four steps:
1. Randomly choose k objects as the initial centroids
2. Assign each object to the cluster with the nearest centroid
3. Re-compute the centroids as the centers of the newly formed clusters
4. Go back to Step 2, repeat until the updated centroids stop moving significantly

(Figure: centroids randomly chosen → cluster assignment → calculation of new centroids → changes in cluster assignment → repeat until the cluster centers stop moving)


Comments on the k-Means Method

§ Advantages:
§ Relatively efficient
§ Simple implementation
§ Weaknesses:
§ Often terminates at a local optimum
§ Applicable only when a mean is defined (what about categorical data?)
§ Need to specify k, the number of clusters, in advance
§ Unable to handle noisy data and outliers
§ Not suitable to discover clusters with non-convex shapes

Outliers: k-Means vs k-Medoids

Problem with k-Means
An object with an extremely large value can substantially distort the distribution of the data.

One solution: k-Medoids
Instead of taking the mean value of the objects in a cluster as a reference point, medoids can be used, which are the most centrally located objects in a cluster.


Clustering: Distance Functions

Distance Functions for Numeric Attributes

For two objects x = (x_1, x_2, ..., x_d) and y = (y_1, y_2, ..., y_d):

§ Lp-Metric (Minkowski-Distance)
dist(x, y) = (Σ_{i=1..d} |x_i − y_i|^p)^(1/p)

§ Euclidean Distance (p = 2)
dist(x, y) = √(Σ_{i=1..d} (x_i − y_i)²)

§ Manhattan-Distance (p = 1)
dist(x, y) = Σ_{i=1..d} |x_i − y_i|

§ Maximum-Distance (p = ∞)
dist(x, y) = max_{1≤i≤d} |x_i − y_i|
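Written out directly from the formulas, for instance in NumPy (illustrative only):

import numpy as np

def minkowski(x, y, p):
    # L_p metric: p = 2 gives the Euclidean, p = 1 the Manhattan distance
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

def maximum_distance(x, y):
    # limit case p -> infinity
    return float(np.max(np.abs(x - y)))

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(minkowski(x, y, 2))      # Euclidean: sqrt(1 + 4 + 9) ≈ 3.74
print(minkowski(x, y, 1))      # Manhattan: 1 + 2 + 3 = 6
print(maximum_distance(x, y))  # Maximum: 3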

Influence of Distance Function / Similarity

The distance function affects the shape of the clusters.

§ Clustering vehicles:
§ red Ferrari
§ green Porsche
§ red Bobby car

§ Distance function based on maximum speed (numeric distance function):
§ Cluster 1: Ferrari & Porsche
§ Cluster 2: Bobby car

§ Distance function based on color (nominal attributes):
§ Cluster 1: Ferrari and Bobby car
§ Cluster 2: Porsche

Clustering: Quality Measures
Silhouette


Optimal Clustering: Example

(Figure: a bad clustering and a good clustering of the same points, with their centroids, illustrating within-cluster and between-cluster variation)

Cluster Quality Measures

Centroid μ_C: mean vector of all objects in cluster C

§ Within-Cluster Variation:
TD² = Σ_{i=1..k} Σ_{p ∈ C_i} dist(p, μ_{C_i})²

§ Between-Cluster Variation:
BC² = Σ_{j=1..k} Σ_{i=1..k} dist(μ_{C_j}, μ_{C_i})²

§ Clustering Quality (one possible measure):
CQ = BC² / TD²

Silhouette-Coefficient for object x

Silhouette-Coefficient [Kaufman & Rousseeuw 1990] measures the quality of a clustering
§ a(x): distance of object x to its cluster representative
§ b(x): distance of object x to the representative of the "second-best" cluster
§ Silhouette s(x) of x:
s(x) = (b(x) − a(x)) / max{a(x), b(x)}

Silhouette-Coefficient

Good clustering… (Figure: object x is close to Cluster 1 and far from Cluster 2)
a(x) ≪ b(x)
s(x) = (b(x) − a(x)) / max{a(x), b(x)} ≈ b(x)/b(x) = 1


Silhouette-Coefficient

…not so good… (Figure: object x roughly equally far from Cluster 1 and Cluster 2)
a(x) ≈ b(x)
s(x) = (b(x) − a(x)) / max{a(x), b(x)} ≈ 0/b(x) = 0

Silhouette-Coefficient

…bad clustering. (Figure: object x assigned to Cluster 1 but much closer to Cluster 2)
a(x) ≫ b(x)
s(x) = (b(x) − a(x)) / max{a(x), b(x)} ≈ −a(x)/a(x) = −1

Silhouette-Coefficient for Clustering C

§ Silhouette coefficient s_C for clustering C is the average silhouette over all objects x ∈ C:
s_C = (1/n) Σ_{x ∈ C} s(x)

§ Interpretation of silhouette coefficient:
§ s_C > 0.7 : strong cluster structure,
§ s_C > 0.5 : reasonable cluster structure,
§ ...

Silhouette Coefficient over a Range of k

§ Silhouette Coefficient node in KNIME Analytics Platform
§ Loop over various values of k
§ Optimized k-Means component
§ Loop over various values of k
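Outside of KNIME the same loop can be sketched with scikit-learn, assuming it is installed; silhouette_score returns the average s(x) over all objects of a clustering.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 2))
               for c in ((0, 0), (5, 5), (0, 5))])

# loop over various values of k and compare the silhouette coefficients
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
# the k with the highest average silhouette is a reasonable choice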


Silhouette Coefficient over k

(Figure: silhouette coefficient plotted for different values of k, with a peak at k=6)

Summary: Clustering by Partitioning

§ Scheme always similar:
§ Find (random) starting clusters
§ Iteratively update centroid positions (compute new mean, swap medoids, compute new distribution parameters, …)
§ Important:
§ Number of clusters k
§ Initial cluster position influences (heavily):
§ quality of results
§ speed of convergence
§ Problems for iterative clustering methods:
§ Clusters of varied size, density and shape


Clustering: Linkage
Hierarchical Clustering

Linkage Hierarchies: Basics

Goal
§ Construction of a hierarchy of clusters (dendrogram) by merging/separating clusters with minimum/maximum distance

Dendrogram:
§ A tree representing the hierarchy of clusters, with the following properties:
§ Root: single cluster with the whole data set.
§ Leaves: clusters containing a single object.
§ Branches: merges / separations between larger clusters and smaller clusters / objects
(Figure: dendrogram with distance on the vertical axis)


Linkage Hierarchies: Basics

§ Example dendrogram
(Figure: nine objects merged step by step; the vertical axis shows the distance between clusters)

§ Types of hierarchical methods
§ Bottom-up construction of dendrogram (agglomerative)
§ Top-down construction of dendrogram (divisive)

Base Algorithm

1. Form initial clusters consisting of a single object, and compute the distance between each pair of clusters.
2. Merge the two clusters having minimum distance.
3. Calculate the distance between the new cluster and all other clusters.
4. If there is only one cluster containing all objects: Stop, otherwise go to step 2.
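A naive Python sketch of this base algorithm with single linkage, for illustration only (the runtime is far from optimal; efficient implementations exist, e.g. in SciPy):

import numpy as np

def agglomerative_single_linkage(X):
    # Step 1: one initial cluster per object
    clusters = [[i] for i in range(len(X))]
    merges = []
    # Step 4: stop once a single cluster contains all objects
    while len(clusters) > 1:
        # Step 2: find the pair of clusters with minimum (single-link) distance
        best_a, best_b, best_d = None, None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best_a, best_b, best_d = a, b, d
        # Step 3: merge them; distances to the new cluster are recomputed
        # implicitly in the next pass over the remaining clusters
        merges.append((clusters[best_a], clusters[best_b], best_d))
        clusters[best_a] = clusters[best_a] + clusters[best_b]
        del clusters[best_b]
    return merges  # the merge history corresponds to the dendrogram

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
for left, right, dist in agglomerative_single_linkage(X):
    print(left, right, round(dist, 2))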


Single Linkage

§ Distance between clusters (nodes):
Dist(C_i, C_j) = min_{p ∈ C_i, q ∈ C_j} dist(p, q)
Distance of the closest two points, one from each cluster
§ Merge Step: Union of two subsets of data points

Complete Linkage

§ Distance between clusters (nodes):
Dist(C_i, C_j) = max_{p ∈ C_i, q ∈ C_j} dist(p, q)
Distance of the farthest two points, one from each cluster
§ Merge Step: Union of two subsets of data points


Average Linkage / Centroid Method

§ Distance between clusters (nodes):
Dist_avg(C_1, C_2) = (1 / (|C_1| · |C_2|)) Σ_{p ∈ C_1} Σ_{q ∈ C_2} dist(p, q)
Average distance of all possible pairs of points between C_1 and C_2
Dist_mean(C_1, C_2) = dist(mean(C_1), mean(C_2))
Distance between the two centroids
§ Merge Step:
§ union of two subsets of data points
§ construct the mean point of the two clusters

Comments on Single Linkage and Variants

+ Finds not only a "flat" clustering, but a hierarchy of clusters (dendrogram)
+ A single clustering can be obtained from the dendrogram (e.g., by performing a horizontal cut)
- Decisions (merges/splits) cannot be undone
- Sensitive to noise (Single-Link): a "line" of objects can connect two clusters
- Inefficient → runtime complexity at least O(n²) for n objects


Linkage Based Clustering

§ Single Linkage:
§ Prefers well-separated clusters
§ Complete Linkage:
§ Prefers small, compact clusters
§ Average Linkage:
§ Prefers small, well-separated clusters…

Clustering: Density
DBSCAN


Clustering: DBSCAN

DBSCAN - a density-based clustering algorithm - defines five types of points in a dataset.
§ Core Points are points that have at least a minimum number of neighbors (MinPts) within a specified distance (ε).
§ Border Points are points that are within ε of a core point, but have less than MinPts neighbors.
§ Noise Points are neither core points nor border points.
§ Directly Density Reachable Points are within ε of a core point.
§ Density Reachable Points are reachable with a chain of Directly Density Reachable points.

Clusters are built by joining core and density-reachable points to one another.

Example with MinPts = 3

Core Point vs. Border Point vs. Noise
§ t = Core point
§ s = Border point
§ n = Noise point

Directly Density Reachable vs. Density Reachable
§ z is directly density reachable from t
§ s is not directly density reachable from t, but density reachable via z
Note: But t is not density reachable from s, because s is not a Core point

DBSCAN [Density Based Spatial Clustering of Applications with Noise]

§ For each point, DBSCAN determines the ε-environment and checks whether it contains more than MinPts data points ⇒ core point
§ Iteratively increases the cluster by adding density-reachable points


Summary: DBSCAN

Clustering:
§ A density-based clustering C of a dataset D w.r.t. ε and MinPts is the set of all density-based clusters C_i w.r.t. ε and MinPts in D.
§ The set Noise_CL ("noise") is defined as the set of all objects in D which do not belong to any of the clusters.
Property:
§ Let C_i be a density-based cluster and p ∈ C_i be a core object.
C_i = {o ∈ D | o density-reachable from p w.r.t. ε and MinPts}.

DBSCAN [Density Based Spatial Clustering of Applications with Noise]

§ DBSCAN uses (spatial) index structures for determining the ε-environment:
→ computational complexity O(n log n) instead of O(n²)
§ Arbitrary shape clusters found by DBSCAN
§ Parameters: ε and MinPts
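For a quick experiment outside of KNIME, scikit-learn's DBSCAN can be used (assuming it is installed); ε and MinPts correspond to the eps and min_samples parameters. The data below is synthetic.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# two dense blobs plus a few scattered points that should end up as noise
X = np.vstack([
    rng.normal((0, 0), 0.3, size=(50, 2)),
    rng.normal((4, 4), 0.3, size=(50, 2)),
    rng.uniform(-2, 6, size=(5, 2)),
])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(sorted(set(labels)))  # cluster ids; -1 marks the noise points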


Data Preparation

Motivation
§ Real world data is "dirty"
→ Contains missing values, noise, outliers, inconsistencies
§ Comes from different information sources
→ Different attribute names, values expressed differently, related tuples
§ Different value ranges and hierarchies
→ One attribute range may overpower another
§ Huge amount of data
→ Makes analysis difficult and time consuming


Data Preparation
§ Data Cleaning & Standardization (domain dependent)
§ Aggregations (often domain dependent)
§ Normalization
§ Dimensionality Reduction
§ Outlier Detection
§ Missing Value Imputation
§ Feature Selection
§ Feature Engineering
§ Sampling
§ Integration of multiple Data Sources

Data Preparation: Normalization


Normalization: Motivation

Example:
§ Lengths in cm (100 – 200) and weights in kilogram (30 – 150) fall both in approximately the same scale
§ What about lengths in m (1 – 2) and weights in gram (30000 – 150000)?
→ The weight values in gram dominate over the length values for the similarity of records!

Goal of normalization:
§ Transformation of attributes to make record ranges comparable

Normalization: Techniques

§ min-max normalization
y = (x − x_min) / (x_max − x_min) · (y_max − y_min) + y_min

§ z-score normalization
y = (x − mean(x)) / stddev(x)

§ normalization by decimal scaling
y = x / 10^j, where j is the smallest integer such that max(|y|) < 1
Here [y_min, y_max] is [−1, 1]
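The three formulas above are one-liners in NumPy; an illustrative sketch on made-up length values:

import numpy as np

x = np.array([100.0, 150.0, 170.0, 200.0])   # e.g. lengths in cm

# min-max normalization into a target range [y_min, y_max]
y_min, y_max = 0.0, 1.0
minmax = (x - x.min()) / (x.max() - x.min()) * (y_max - y_min) + y_min

# z-score normalization: zero mean, unit standard deviation
zscore = (x - x.mean()) / x.std()

# decimal scaling: divide by the smallest power of 10 that pushes all |values| below 1
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
decimal = x / 10 ** j

print(minmax, zscore, decimal, sep="\n")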

PMML

§ Predictive Model Markup Language (PMML) is a standard XML-based interchange format for predictive models.
§ Interchange: PMML provides a way to describe and exchange predictive models produced by machine learning algorithms
§ Standard: In theory, a PMML model exported from KNIME can be read by PMML-compatible functions in other tools
§ It does not work that well for modern / ensemble algorithms, such as random forest or deep learning. In those cases, other formats have been experimented with.

Data Preparation: Missing Value Imputation


Missing Value Imputation: Motivation

Data is not always available
§ E.g., many tuples have no recorded value for several attributes, such as weight in a people database

Missing data may be due to
§ Equipment malfunctioning
§ Inconsistency with other recorded data and thus deleted
§ Data not entered (manually)
§ Data not considered important at the time of collection
§ Data format / contents of database changes

Missing Values: Types

Types of missing values:
Example: Suppose you are modeling weight Y as a function of sex X
§ Missing Completely At Random (MCAR): the reason does not depend on its value or lack of value.
There may be no particular reason why some people told you their weights and others didn't.
§ Missing At Random (MAR): the probability that Y is missing depends only on the value of X.
One sex X may be less likely to disclose its weight Y.
§ Not Missing At Random (NMAR): the probability that Y is missing depends on the unobserved value of Y itself.
Heavy (or light) people may be less likely to disclose their weight.


Missing Values Imputation

How to handle missing values?
§ Ignore the record
§ Remove the record
§ Fill in missing value as:
§ Fixed value: e.g., "unknown", -9999, etc.
§ Attribute mean / median / max. / min.
§ Attribute most frequent value
§ Next / previous / average interpolation / moving average value (in time series)
§ A predicted value based on the other attributes (inference-based such as Bayesian, Decision Tree, ...)
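A few of the listed strategies expressed with pandas on a made-up table (illustrative only; KNIME's Missing Value node offers equivalent options without coding):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "weight": [70.0, np.nan, 85.0, 60.0, np.nan],
    "city":   ["Zurich", "Berlin", None, "Berlin", "Berlin"],
})

df["weight_fixed"] = df["weight"].fillna(-9999)                # fixed value
df["weight_mean"]  = df["weight"].fillna(df["weight"].mean())  # attribute mean
df["weight_prev"]  = df["weight"].ffill()                      # previous value (time series)
df["city_mode"]    = df["city"].fillna(df["city"].mode()[0])   # most frequent value
print(df)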


Data Preparation: Outlier Detection

Outlier Detection

§ An outlier could be, for example, rare behavior, a system defect, a measurement error, or a reaction to an unexpected event

Outlier Detection: Motivation

§ Why is finding outliers important?
§ Summarize data by statistics that represent the majority of the data
§ Train a model that generalizes to new data
§ Finding the outliers can also be the focus of the analysis and not only data cleaning


Outlier Detection Techniques

§ Knowledge-based
§ Statistics-based
§ Distance from the median
§ Position in the distribution tails
§ Distance to the closest cluster center
§ Error produced by an autoencoder
§ Number of random splits to isolate a data point from other data

Material
https://www.knime.com/blog/four-techniques-for-outlier-detection
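The statistics-based checks in the list can be sketched with NumPy; the other techniques (knowledge-based, clustering, autoencoder, isolation by random splits) need more machinery. The data and thresholds below are made up.

import numpy as np

x = np.array([12.0, 13.1, 11.8, 12.5, 12.9, 45.0, 12.2])   # 45.0 is the suspect value

# z-score rule: points far from the mean in units of standard deviation
z = (x - x.mean()) / x.std()
print(np.abs(z) > 2)

# IQR rule (distance from the median/quartiles), as used in box plots
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr))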

Is there such a thing as “too much data”?


“Too much data”:
§ Consumes storage space
§ Eats up processing time
§ Is difficult to visualize
§ Inhibits ML algorithm performance
§ Beware of the model: Garbage in → Garbage out

Data Preparation: Dimensionality Reduction


Dimensionality Reduction Techniques

§ Measure based
§ Ratio of missing values
§ Low variance
§ High Correlation
§ Transformation based
§ Principal Component Analysis (PCA)
§ Linear Discriminant Analysis (LDA)
§ t-SNE
§ Machine Learning based
§ Random Forest of shallow trees
§ Neural auto-encoder

Missing Values Ratio

IF (% missing values > threshold) THEN remove column



Low Variance

§ If a column has a constant value (variance = 0), it contains no useful information
§ In general: IF (variance < threshold) THEN remove column
Note: requires min-max normalization, and only works for numeric columns

High Correlation

§ Two highly correlated input variables probably carry similar information
§ IF (corr(var1, var2) > threshold) => remove var1
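Both filters sketched with pandas on a small synthetic table; the 0.01 and 0.9 thresholds are arbitrary user parameters chosen for the example.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=100),  # almost perfectly correlated with a
    "c": np.full(100, 3.14),                        # constant column
    "d": rng.normal(size=100),
})

# low variance filter on min-max normalized columns (guard against constant columns)
col_range = (df.max() - df.min()).replace(0, 1)
norm = (df - df.min()) / col_range
low_var = [col for col in norm.columns if norm[col].var() < 0.01]

# high correlation filter: flag one column of every highly correlated pair
corr = df.corr().abs()
high_corr = [c2 for i, c1 in enumerate(corr.columns)
             for c2 in corr.columns[i + 1:] if corr.loc[c1, c2] > 0.9]

print(low_var, high_corr)   # expected: ['c'] and ['b']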


Principal Component Analysis (PCA)

§ PCA is a statistical procedure that orthogonally transforms the original n coordinates of a data set into a new set of n coordinates, called principal components.
PC_1, PC_2, ... PC_n = PCA(X_1, X_2, ... X_n)
§ The first principal component PC_1 follows the direction (eigenvector) of the largest possible variance (largest eigenvalue of the covariance matrix) in the data.
§ Each succeeding component PC_k follows the direction of the next largest possible variance under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components PC_1, PC_2, ... PC_{k-1}.
(Image from Wikipedia: data in the x1/x2 plane with the PC1 and PC2 directions)
If you're still curious, there are LOTS of different ways to think about PCA:
https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues

Principal Component Analysis (PCA): Dimensionality Reduction

§ PC_1 describes most of the variability in the data, PC_2 adds the next big contribution, and so on. In the end, the last PCs do not bring much more information to describe the data.
§ Thus, to describe the data we could use only the top m < n components (i.e., PC_1, PC_2, ... PC_m) with little - if any - loss of information
§ Caveats:
§ Results of PCA are quite difficult to interpret
§ Normalization required
§ Only effective on numeric columns
Material of KNIME AG used under CC BY 4.0 244 Material of KNIME AG used under CC BY 4.0 245

244 245

Linear Discriminant Analysis (LDA)

§ LDA is a statistical procedure that orthogonally transforms the original n coordinates of a data set into a new set of k−1 coordinates, called linear discriminants, where k is the number of classes in the class variable
LD_1, LD_2, ... LD_{k-1} = LDA(X_1, X_2, ... X_n)
§ Here, however, discriminants (components) maximize the separation between classes
§ PCA: unsupervised
§ LDA: supervised

Linear Discriminant Analysis (LDA): Dimensionality Reduction

§ LD_1 describes best the class separation in the data, LD_2 adds the next big contribution, and so on. In the end, the last LDs do not bring much more information to separate the classes.
§ Thus, for our classification problem we could use only the top m < k−1 discriminants (i.e., LD_1, LD_2, ... LD_m) with little - if any - loss of information
§ Caveats:
§ Results of LDA are quite difficult to interpret
§ Normalization required
§ Only effective on numeric columns


Ensembles of Shallow Decision Trees

§ Often used for classification, but can be used for feature selection too
§ Generate a large number (we used 2000) of trees that are very shallow (2 levels, 3 sampled features)
§ Calculate the statistics of candidate and selected features. The more often a feature is selected in such trees, the more likely it contains predictive information
§ Compare the same statistics with a forest of trees trained on a random dataset.

Autoencoder

§ Feed-Forward Neural Network architecture with encoder / decoder structure. The network is trained to reproduce the input vector onto the output layer. (Image: Wikipedia)
§ That is, it compresses the input vector (dimension n) into a smaller vector space on the layer "code" (dimension m < n) and then it reconstructs the original vector onto the output layer.
§ If the network was trained well, the reconstruction operation happens with minimal loss of information.


Data Preparation: Feature Selection

Feature Selection vs. Dimensionality Reduction

§ Both methods are used for reducing the number of features in a dataset. However:
§ Feature selection is simply selecting and excluding given features without changing them.
§ Dimensionality reduction might transform the features into a lower dimension.
§ Feature selection is often a somewhat more aggressive and more computationally expensive process.
§ Backward Feature Elimination
§ Forward Feature Construction


Backward Feature Elimination (greedy top-down)


1. First train one model on n input features
2. Then train n separate models each on 𝑛 − 1 input features and remove the feature
whose removal produced the least disturbance
3. Then train 𝑛 − 1 separate models each on 𝑛 − 2 input features and remove the feature
whose removal produced the least disturbance
4. And so on. Continue until desired maximum error rate on training data is reached.


Forward Feature Construction (greedy bottom-up)


1. First, train n separate models on one single input feature and keep the feature that
produces the best accuracy.
2. Then, train 𝑛 − 1 separate models on 2 input features, the selected one and one more.
At the end keep the additional feature that produces the best accuracy.
3. And so on … Continue until an acceptable error rate is reached.
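A greedy sketch of this procedure in Python, with scikit-learn's decision tree and cross-validation standing in for the "train a model and measure accuracy" step; the dataset, the model, and the stop-after-two-features criterion are arbitrary choices for the example.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

selected, remaining = [], list(range(X.shape[1]))
while remaining and len(selected) < 2:            # stop criterion for the example
    # train one model per candidate feature added to the already selected ones
    scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)            # keep the feature with best accuracy
    selected.append(best)
    remaining.remove(best)
    print(selected, round(scores[best], 3))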

Material: https://thenewstack.io/3-new-techniques-for-data-dimensionality-reduction-in-machine-learning/

Feature Engineering: Motivation


Sometimes transforming the original data allows for better discrimination
by ML algorithms.

Data Preparation:
Feature Engineering


Feature Engineering: Techniques

§ Coordinate Transformations
Remember PCA and LDA? Polar coordinates, …
§ Distances to cluster centres, after data clustering
§ Simple math transformations on single columns
(e^x, x², x³, tanh(x), log(x), …)
§ Combining together multiple columns in math functions
(f(x_1, x_2, … x_n), x_2 − x_1, …)
§ The whole process is domain dependent

Feature Engineering in Time Series Analysis

§ Second order differences: y(t) = x(t) − x(t−1) and y'(t) = y(t) − y(t−1)
§ Logarithm: log(y'(t))


Session 4 exercises

Exercises
§ Clustering
§ Goal: Cluster location data from California
§ 01_Clustering_exercise
§ Data Preparation
§ 02_Missing_Value_Handling_exercise
§ 03_Outlier_Detection_exercise
§ 04_Dimensionality_Reduction_exercise
§ 05_Feature_Selection_exercise


Confirmation of Attendance and Survey

§ If you would like to get a "Confirmation of Attendance" please click on the link below*
Confirmation of Attendance and Survey
§ The link also takes you to our course feedback survey. Filling it in is optional but highly appreciated!
Thank you!
*Please send your request within the next 3 days

KNIME Learning Paths

§ Courses from level L1 to level L4
§ Various professional profiles
§ Self-paced courses
§ Videos and exercises at your own pace and for free
§ Instructor led courses
§ Scheduled sessions and guided exercises in paid courses

Find the next course at:
knime.com/knime-courses

KNIME Certification Exams

§ From level L1 to level L4
§ Online examination
§ Measure your expertise with KNIME software and data skills
§ Get and share your badge!

Get certified at:
knime.com/certification-program

KNIME Press - Cheat Sheets

§ Many cheat sheets available for:
§ Beginners
§ Spreadsheet Users
§ Machine Learning
§ Data Wrangling
§ Orchestration
§ and many more…

All cheat sheets are available for download at:
knime.com/cheat-sheets
KNIME Press - Books

§ Use Case Collections: Collections of classic and more innovative use cases around specific topics
§ Transition Booklets: Offer an easy onboarding into KNIME from other tools
§ Textbooks: Well structured schoolbooks with plenty of examples
§ Technical Collections: Collections of specific technical topics to keep you up to date

All KNIME Press books are available for download at:
knime.com/knimepress

Just KNIME It! Challenges

§ The best way to keep on learning
§ Weekly challenges to test your knowledge
§ Easy, medium and hard challenges for any level
§ Discuss the solution with the community
§ Post your solution and climb the Leaderboard

Find the challenges at:
knime.com/just-knime-it
KNIME Community Journal

§ Daily content on
§ data stories
§ data science theory
§ getting started with KNIME
§ ...and more
§ For the community by the community
§ Share your data story with the community
§ Contributions are always welcome!

Low Code for Data Science community journal on Medium
medium.com/low-code-for-advanced-data-science

What do you think about us?

§ We provide most of our software and learning material for free
§ Do you like our stuff? Let everybody know!

Write a review on G2:
www.g2.com/products/knime-analytics-platform/take_survey
Stay Connected with KNIME

§ Follow us on social media
§ Blog: knime.com/blog
§ Forum: forum.knime.com
§ KNIME Hub: hub.knime.com
§ KNIME Self-Paced Courses: knime.com/knime-self-paced-courses
§ Email: education@knime.com
§ Medium Journal: medium.com/low-code-for-advanced-data-science

Thank you