Date: 28-11-2023, Venue: Saveetha School of Engineering
1
Adversarial Machine Learning
Dr. Saravanan.M.S
Innovation Ambassador,
IIC of Saveetha Institute of Medical and Technical Sciences,
Professor, Institute of CSE, SSE, SIMATS.
Outline
3
Machine Learning Basics
[Figure: the supervised learning workflow — labeled data is used in a training phase to obtain a learned model, which is then used in a prediction phase on new data]
4
Machine Learning Types
Machine Learning Basics
[Figure: examples of machine learning task types — classification (separating class A from class B), regression, and clustering]
5
Supervised Learning
Machine Learning Basics
6
Unsupervised Learning
Machine Learning Basics
7
Nearest Neighbor Classifier
Machine Learning Basics
• Nearest Neighbor – for each test data point, assign the class label of the nearest training data point
Adopt a distance function to find the nearest neighbor
o Calculate the distance to each data point in the training set, and assign the class of the nearest data point (minimum distance)
It does not require learning a set of weights
[Figure: a test example in feature space is assigned the label of its nearest neighbor among the training examples from class 1 and class 2]
8
Nearest Neighbor Classifier
Machine Learning Basics
Distance function: e.g., the ℓ1 norm (Manhattan distance), d(p, q) = Σi |pi − qi| (see the sketch below)
[Figure: a 2-D feature space (x1, x2) with training examples from two classes; each test point is assigned to the class of its nearest neighbor]
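As a rough illustration of the nearest-neighbor rule (not taken from the slides), the following NumPy sketch classifies test points by the label of the closest training point under an ℓ1 or ℓ2 distance; the toy data is made up.

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, X_test, ord=1):
    """Assign each test point the label of its nearest training point.
    ord=1 gives the Manhattan (L1) distance, ord=2 the Euclidean (L2) distance."""
    y_pred = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, ord=ord, axis=1)  # distance to every training point
        y_pred.append(y_train[np.argmin(dists)])              # label of the closest one
    return np.array(y_pred)

# Toy example: two classes in a 2-D feature space
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9]])
y_train = np.array([0, 0, 1, 1])
print(nearest_neighbor_predict(X_train, y_train, np.array([[1.1, 0.9], [3.9, 4.0]])))  # -> [0 1]
```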
• Linear classifier
Find a linear function f of the inputs xi that separates the classes
Use pairs of inputs and labels to find the weights matrix W and the bias vector b
o The weights and biases are the parameters of the function f
Several methods have been used to find the optimal set of parameters of a linear classifier
o A common choice is the Perceptron algorithm, where the parameters are updated until a minimal error is reached (single layer, does not use backpropagation); see the sketch below
The linear classifier is a simple approach, but it is a building block of more advanced classification algorithms, such as SVMs and neural networks
o Earlier multi-layer neural networks were referred to as multi-layer perceptrons (MLPs)
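A minimal sketch of the classical Perceptron update rule mentioned above; the toy linearly separable data and the learning-rate value are made up for illustration.

```python
import numpy as np

def perceptron_train(X, y, epochs=20, lr=1.0):
    """Single-layer perceptron: learn w and b so that sign(w·x + b) matches labels y in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:  # misclassified point -> update the parameters
                w += lr * yi * xi
                b += lr * yi
    return w, b

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])  # toy data
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
print(np.sign(X @ w + b))  # -> [ 1.  1. -1. -1.]
```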
11
Linear Classifier
Machine Learning Basics
13
Linear vs Non-linear Techniques
14
Linear vs Non-linear Techniques
15
Non-linear Techniques
Linear vs Non-linear Techniques
• Non-linear classification
Features are obtained as non-linear functions of the inputs
It results in non-linear decision boundaries
Can deal with non-linearly separable data
[Slide shows example equations for the inputs, the non-linear features of the inputs, and the outputs]
16
Non-linear Support Vector Machines
Linear vs Non-linear Techniques
• Non-linear SVM
The original input space is mapped to a higher-dimensional feature space where the training set is linearly separable
Define a non-linear kernel function to calculate a non-linear decision boundary in the original feature space (see the sketch below)
Φ : x ↦ φ(x)
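As a hedged sketch of the kernel idea, scikit-learn's SVC with an RBF kernel separates XOR-like data that is not linearly separable in the original 2-D space; the data and the gamma/C values are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like data: not linearly separable in the original 2-D input space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# The RBF kernel implicitly maps x -> phi(x) into a higher-dimensional space
# where the two classes become separable
clf = SVC(kernel="rbf", gamma=2.0, C=10.0)
clf.fit(X, y)
print(clf.predict(X))  # expected: [0 1 1 0]
```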
17
Binary vs Multi-class Classification
18
Binary vs Multi-class Classification
19
Computer Vision Tasks
Machine Learning Basics
Picture from: Fei-Fei Li, Andrej Karpathy, Justin Johnson – Understanding and Visualizing CNNs
20
Thank you
Picture from: Fei-Fei Li, Andrej Karpathy, Justin Johnson – Understanding and Visualizing CNNs
21
No-Free-Lunch Theorem
Machine Learning Basics
22
ML vs. Deep Learning
Introduction to Deep Learning
23
ML vs. Deep Learning
Introduction to Deep Learning
25
Why is DL Useful?
Introduction to Deep Learning
26
Representational Power
Introduction to Deep Learning
27
Introduction to Neural Networks
[Figure: a 16 × 16 handwritten-digit image (256 pixels; ink → 1, no ink → 0) is fed into the network as inputs x1, …, x256; each output y1, …, y10 represents the confidence of a digit, e.g., y1 = 0.1 ("is 1"), y2 = 0.7 ("is 2"), y10 = 0.2 ("is 0"), so the image is classified as "2"]
28
Introduction to Neural Networks
[Figure: the classifier is a function f : ℝ²⁵⁶ → ℝ¹⁰ that maps the pixel inputs x1, …, x256 to the outputs y1, …, y10, so the machine outputs "2"]
The function is represented by a neural network
29
Elements of Neural Networks
Introduction to Neural Networks
z = a1·w1 + a2·w2 + ⋯ + aK·wK + b
a = σ(z)
[Figure: a single neuron — the inputs a1, …, aK are weighted by w1, …, wK, the bias b is added to give z, and the activation function σ produces the output a]
30
Elements of Neural Networks
Introduction to Neural Networks
Hidden layer: h = σ(W1 x + b1)
o W1 are the weights, b1 are the biases, and σ is the activation function
31
Elements of Neural Networks
Introduction to Neural Networks
32
Elements of Neural Networks
Introduction to Neural Networks
[Figure: a fully connected network from inputs x1, …, xN to outputs y1, …, yM; for one neuron with inputs 1 and −1, weights 1 and −2, and bias 1, the weighted sum is (1 · 1) + (−1) · (−2) + 1 = 4]
34
Elements of Neural Networks
Introduction to Neural Networks
f : ℝ² → ℝ², f([1, −1]) = [0.62, 0.83]
35
Matrix Operation
Introduction to Neural Networks
a = σ(W x + b), with W = [ 1 −2 ; −1 1 ], x = [ 1 ; −1 ], b = [ 1 ; 0 ]
σ( [ 1 −2 ; −1 1 ] [ 1 ; −1 ] + [ 1 ; 0 ] ) = σ( [ 4 ; −2 ] ) = [ 0.98 ; 0.12 ]
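A quick NumPy check of the numbers above, assuming σ is the logistic sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.array([[1.0, -2.0],
              [-1.0, 1.0]])
x = np.array([1.0, -1.0])
b = np.array([1.0, 0.0])

a = sigmoid(W @ x + b)     # W x + b = [4, -2]
print(np.round(a, 2))      # -> [0.98 0.12]
```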
36
Matrix Operation
Introduction to Neural Networks
[Figure: the first layer maps the input vector x = (x1, …, xN) to the first hidden activation vector a1]
a1 = σ(W1 x + b1)
37
Matrix Operation
Introduction to Neural Networks
[Figure: a network with L layers and parameters W1, b1, W2, b2, …, WL, bL maps x = (x1, …, xN) to y = (y1, …, yM)]
a1 = σ(W1 x + b1)
a2 = σ(W2 a1 + b2)
…
y = σ(WL aL−1 + bL)
38
Matrix Operation
Introduction to Neural Networks
[Figure: the same network viewed as a single function from the input x to the output y]
y = f(x) = σ( WL ⋯ σ( W2 σ( W1 x + b1 ) + b2 ) ⋯ + bL )
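The composed function above can be written as a short loop; this is only a sketch with made-up layer sizes and a sigmoid activation at every layer, as on the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """y = f(x) = sigma(WL ... sigma(W2 sigma(W1 x + b1) + b2) ... + bL)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # one layer: a <- sigma(W a + b)
    return a

# Made-up shapes: 4 inputs -> 3 hidden units -> 2 outputs
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 4)), rng.standard_normal((2, 3))]
biases = [np.zeros(3), np.zeros(2)]
print(forward(rng.standard_normal(4), weights, biases))
```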
39
Softmax Layer
Introduction to Neural Networks
[Figure: an output layer that applies the activation to each logit independently — for z = (3, 1, −3), y = (σ(3), σ(1), σ(−3)) = (0.95, 0.73, 0.05)]
40
Softmax Layer
Introduction to Neural Networks
[Figure: the softmax layer exponentiates each logit and normalizes by the sum]
yi = e^zi / Σj e^zj, j = 1, …, 3
e.g., for z = (3, 1, −3): e^3 ≈ 20, e^1 ≈ 2.7, e^−3 ≈ 0.05, so y ≈ (0.88, 0.12, ≈ 0)
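A small sketch that reproduces the softmax numbers above:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

print(np.round(softmax(np.array([3.0, 1.0, -3.0])), 2))  # -> [0.88 0.12 0.  ]
```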
41
Activation Functions
Introduction to Neural Networks
f(x): ℝⁿ → [0, 1], applied element-wise (e.g., the sigmoid function)
[Figure: plot of the activation function f(x) vs x]
43
Activation: Tanh
Introduction to Neural Networks
f(x): ℝⁿ → [−1, 1], applied element-wise
[Figure: plot of the tanh activation f(x) vs x]
44
Activation: ReLU
Introduction to Neural Networks
45
Activation: Leaky ReLU
Introduction to Neural Networks
46
Activation: Linear Function
Introduction to Neural Networks
47
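The activation functions from the slides above (sigmoid, tanh, ReLU, leaky ReLU, linear) in one NumPy sketch; the leaky-ReLU slope of 0.01 is a common default, not a value from the slides.

```python
import numpy as np

def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))   # R -> (0, 1)
def tanh(x):               return np.tanh(x)                 # R -> (-1, 1)
def relu(x):               return np.maximum(0.0, x)         # max(0, x)
def leaky_relu(x, a=0.01): return np.where(x > 0, x, a * x)  # small slope for x < 0
def linear(x):             return x                          # identity

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu, linear):
    print(f.__name__, np.round(f(x), 3))
```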
Training NNs
Training Neural Networks
[Figure: the digit-recognition network — inputs x1, …, x256 from a 16 × 16 image, hidden layers, a softmax output layer, and outputs y1, …, y10 giving the confidence of each digit, e.g., y1 = 0.1 ("is 1"), y2 = 0.7 ("is 2"), y10 = 0.2 ("is 0")]
48
Training NNs
Training Neural Networks
• To train a NN, set the parameters such that, for the images in the training subset, the output element corresponding to the correct class has the maximum value
50
Training NNs
Training Neural Networks
[Figure: for an input image with true label "1", the network outputs ŷ = (0.2, 0.3, …, 0.5) while the target is (1, 0, …, 0); the cost ℒ(θ) measures the difference between the predicted output and the target]
51
Training NNs
Training Neural Networks
• For a training set of images, calculate the total loss over all images: ℒ(θ) = Σn ℒn(θ)
• Find the optimal parameters θ* that minimize the total loss
[Figure: each training input xn is passed through the NN to produce a prediction ŷn, which is compared with the label yn to give a per-example loss ℒn(θ); the losses of the N examples are summed]
52
Loss Functions
Training Neural Networks
• Classification tasks
Cross-entropy loss function:
ℒ(θ) = −(1/N) Σ_{i=1..N} Σ_{k=1..K} [ y_k^(i) log ŷ_k^(i) + (1 − y_k^(i)) log(1 − ŷ_k^(i)) ]
y_k^(i) are the ground-truth class labels and ŷ_k^(i) are the model-predicted class labels (see the sketch below)
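A direct NumPy sketch of the cross-entropy formula above; the predictions and one-hot labels are made up.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy averaged over N examples; y_true, y_pred have shape (N, K)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    per_example = -np.sum(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred), axis=1)
    return per_example.mean()

y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)   # one-hot labels
y_pred = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]])    # model predictions
print(cross_entropy(y_true, y_pred))
```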
53
Loss Functions
Training Neural Networks
• Regression tasks
Output layer: linear (identity) or sigmoid activation
Loss functions:
Mean Squared Error: ℒ(θ) = (1/n) Σ_{i=1..n} ( y^(i) − ŷ^(i) )²
Mean Absolute Error: ℒ(θ) = (1/n) Σ_{i=1..n} | y^(i) − ŷ^(i) |
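And the two regression losses above, with made-up targets and predictions:

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.8, 3.5])
print(mse(y_true, y_pred), mae(y_true, y_pred))  # approximately 0.1 and 0.267
```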
54
Training NNs
Training Neural Networks
[Figure: the loss ℒ(θ) plotted against a single parameter θi, with the gradient ∂ℒ/∂θi giving the slope]
55
Gradient Descent Algorithm
Training Neural Networks
Parameter update: θ ← θ − α ∇ℒ(θ), where α is the learning rate
[Figure: the loss surface over the parameters, with gradient-descent steps moving toward a minimum]
56
Gradient Descent Algorithm
Training Neural Networks
1. Start from an initial set of parameters θ⁰ = (w1, w2)
2. Compute the gradient at θ⁰, ∇ℒ(θ⁰)
3. Multiply by the learning rate and update the parameters: θ¹ = θ⁰ − α ∇ℒ(θ⁰)
4. Go to step 2, repeat
∇ℒ(θ⁰) = [ ∂ℒ(θ⁰)/∂w1 , ∂ℒ(θ⁰)/∂w2 ]
[Figure: contours of ℒ over (w1, w2); starting from θ⁰, steps in the direction −∇ℒ(θ⁰) move toward the minimum θ*]
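A sketch of the loop above on a made-up convex loss ℒ(w1, w2) = (w1 − 3)² + (w2 + 1)², whose minimum is known to be (3, −1):

```python
import numpy as np

def grad(theta):
    # Gradient of the made-up loss L(w1, w2) = (w1 - 3)^2 + (w2 + 1)^2
    return np.array([2 * (theta[0] - 3.0), 2 * (theta[1] + 1.0)])

theta = np.array([0.0, 0.0])   # step 1: initial parameters theta^0
lr = 0.1                       # learning rate
for _ in range(100):           # steps 2-4: compute gradient, update, repeat
    theta = theta - lr * grad(theta)
print(np.round(theta, 3))      # -> close to [ 3. -1.]
```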
57
Gradient Descent Algorithm
Training Neural Networks
• Example (contd.)
4. Go to step 2, repeat
[Figure: successive gradient-descent updates from θ⁰ shown on the (w1, w2) loss contours]
58
Gradient Descent Algorithm
Training Neural Networks
• For most tasks, the loss surface is highly complex (and non-convex)
• Random initialization in NNs results in different initial parameters every time the NN is trained
Gradient descent may reach different minima at every run
Therefore, the NN may produce different predicted outputs
• In addition, currently we don't have algorithms that guarantee reaching a global minimum for an arbitrary loss function
[Figure: a non-convex loss surface ℒ over (w1, w2) with multiple local minima]
60
Backpropagation
Training Neural Networks
62
Stochastic Gradient Descent
Training Neural Networks
63
Problems with Gradient Descent
Training Neural Networks
• Besides the local minima problem, the GD algorithm can be very slow at plateaus, and it can get stuck at saddle points
[Figure: a 1-D cost curve over θ with a plateau where ∇ℒ(θ) ≈ 0, a saddle point where ∇ℒ(θ) = 0, and a local minimum where ∇ℒ(θ) = 0]
64
Gradient Descent with Momentum
Training Neural Networks
Movement = Negative of Gradient + Momentum
[Figure: on the cost curve over θ, the real movement combines the negative gradient with the momentum term, which keeps the parameters moving past a point where the gradient = 0]
65
Gradient Descent with Momentum
Training Neural Networks
This term is analogous to the momentum of a heavy ball rolling down the hill
• The parameter is referred to as the coefficient of momentum
A typical value of the parameter is 0.9
• This method updates the parameters in the direction of the weighted average of the past gradients (see the sketch below)
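A common formulation of the momentum update (a sketch, not necessarily the exact form on the slides), applied to the same made-up quadratic loss:

```python
import numpy as np

def grad(theta):
    return np.array([2 * (theta[0] - 3.0), 2 * (theta[1] + 1.0)])  # made-up loss

theta = np.array([0.0, 0.0])
v = np.zeros_like(theta)       # running (weighted) average of past gradients
lr, beta = 0.05, 0.9           # beta = 0.9 is the typical momentum coefficient
for _ in range(200):
    v = beta * v - lr * grad(theta)   # momentum term + negative gradient
    theta = theta + v
print(np.round(theta, 3))      # -> approximately [ 3. -1.]
```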
66
Nesterov Accelerated Momentum
Training Neural Networks
[Figure: comparison of the update trajectories of GD with momentum and GD with Nesterov momentum]
68
Learning Rate
Training Neural Networks
• Learning rate
The gradient tells us the direction in which the loss has the steepest rate
of increase, but it does not tell us how far along the opposite direction we
should step
Choosing the learning rate (also called the step size) is one of the most
important hyper-parameter settings for NN training
[Figure: loss curves when the learning rate is too small (slow decrease) and when it is too large (the updates overshoot)]
69
Learning Rate
Training Neural Networks
71
Vanishing Gradient Problem
Training Neural Networks
• In some cases, during training, the gradients can become either very small (vanishing gradients) or very large (exploding gradients)
They result in very small or very large updates of the parameters
Solutions: change the learning rate, ReLU activations, regularization, LSTM units in RNNs
[Figure: a deep fully connected network from x1, …, xN to y1, …, yM]
72
Generalization
• Underfitting
The model is too “simple” to represent all the relevant class characteristics
E.g., a model with too few parameters
Produces high error on the training set and high error on the validation set
• Overfitting
The model is too “complex” and fits irrelevant characteristics (noise) in the data
E.g., a model with too many parameters
Produces low error on the training set but high error on the validation set
73
Overfitting
Generalization
• Overfitting – a model with high capacity fits the noise in the data instead of the underlying relationship
• Weight decay
A regularization term that penalizes large weights is added to the loss function: total loss = data loss + regularization loss
For every weight in the network, we add the regularization term to the loss value
o During the gradient descent parameter update, every weight is decayed linearly toward zero
The weight decay coefficient determines how dominant the regularization is during the gradient computation (see the sketch below)
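A sketch of ℓ2 weight decay: the regularization term λ‖w‖² adds 2λw to the gradient, so each update shrinks the weights toward zero; the data loss and the value of λ below are made up.

```python
import numpy as np

def data_loss_grad(w):
    return 2 * (w - np.array([1.0, -2.0]))   # gradient of a made-up data loss

w = np.zeros(2)
lr, lam = 0.1, 0.05                          # lam: weight decay coefficient
for _ in range(200):
    g = data_loss_grad(w) + 2 * lam * w      # data gradient + regularization gradient
    w = w - lr * g                           # each step decays w toward zero
print(np.round(w, 3))                        # slightly shrunk relative to [1, -2]
```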
75
Regularization: Weight Decay
Regularization
76
Regularization: Weight Decay
Regularization
• Weight decay
The regularization term is based on a norm of the weights, typically the ℓ2 norm
77
Regularization: Dropout
Regularization
• Dropout
Randomly drop units (along with their connections) during training
Each unit is retained with a fixed probability p, independent of the other units
The hyper-parameter p needs to be chosen (tuned); see the sketch below
o Often, between 20% and 50% of the units are dropped
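A sketch of (inverted) dropout: each unit is kept with probability p and the surviving activations are rescaled by 1/p so their expected value is unchanged; the rescaling convention is a common implementation choice, not something stated on the slides.

```python
import numpy as np

def dropout(a, p=0.8, training=True, rng=np.random.default_rng(0)):
    """Keep each activation with probability p; zero out the rest during training."""
    if not training:
        return a
    mask = rng.random(a.shape) < p   # 1 with probability p, 0 otherwise
    return a * mask / p              # rescale so the expectation stays the same

a = np.ones(10)                      # made-up layer activations
print(dropout(a, p=0.8))             # roughly 20% of the units are zeroed out
```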
78
Regularization: Dropout
Regularization
……
79
Regularization: Early Stopping
Regularization
• Early stopping
During model training, use a validation set
o E.g., a validation/train split of about 25% / 75%
Stop when the validation accuracy (or loss) has not improved for n epochs (see the sketch below)
o The parameter n is called the patience
[Figure: training and validation error curves over epochs — training stops when the validation error stops improving]
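A sketch of the patience rule with a made-up sequence of validation losses standing in for a real training loop:

```python
# Early stopping with patience: stop when the validation loss has not
# improved for `patience` consecutive epochs.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.48, 0.49, 0.50, 0.51, 0.52, 0.53]  # made up

patience = 3
best_loss, best_epoch, epochs_without_improvement = float("inf"), -1, 0
for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_loss:
        best_loss, best_epoch = val_loss, epoch
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"stop at epoch {epoch}, best epoch was {best_epoch}")
            break
# -> stop at epoch 7, best epoch was 4
```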
80
Batch Normalization
Regularization
81
Hyper-parameter Tuning
82
Hyper-parameter Tuning
• Grid search
Check all values in a range with a step value
• Random search
Randomly sample values for each hyper-parameter (see the sketch below)
Often preferred to grid search
• Bayesian hyper-parameter optimization
Is an active area of research
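A sketch of random search over a small hyper-parameter space; the ranges and the score function are made up (in practice the score would come from training and validating the model).

```python
import random

random.seed(0)
search_space = {
    "learning_rate": lambda: 10 ** random.uniform(-4, -1),
    "batch_size":    lambda: random.choice([16, 32, 64, 128]),
    "dropout_p":     lambda: random.uniform(0.2, 0.5),
}

def score(config):
    # Made-up stand-in for "train the model and return the validation score"
    return -abs(config["learning_rate"] - 0.01) - 0.001 * abs(config["batch_size"] - 64)

best_score, best_config = float("-inf"), None
for _ in range(20):
    config = {name: sample() for name, sample in search_space.items()}
    s = score(config)
    if s > best_score:
        best_score, best_config = s, config
print(best_config)
```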
83
k-Fold Cross-Validation
84
k-Fold Cross-Validation
86
Deep vs Shallow Networks
[Figure: a shallow NN and a deep NN mapping the same inputs x1, x2, …, xN to the output]
87
Convolutional Neural Networks (CNNs)
Convolutional Neural Networks
[Figure: a 3 × 3 convolutional filter is scanned over the input matrix]
• When the convolutional filters are scanned over the image, they
capture useful features
E.g., edge detection by convolutions
Filter (a 3 × 3 Laplacian-style edge detector): [ 0 1 0 ; 1 −4 1 ; 0 1 0 ]
[Figure: the grid of pixel values of the input image and of the convolved (edge-detected) output image]
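A sketch of edge detection by convolution using the Laplacian-style filter above (SciPy's convolve2d); the 8 × 8 test image is made up.

```python
import numpy as np
from scipy.signal import convolve2d

laplacian = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]])

image = np.zeros((8, 8))      # made-up binary image with a bright square
image[2:6, 2:6] = 1.0

edges = convolve2d(image, laplacian, mode="same")
print(edges)                  # non-zero responses only along the edges of the square
```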
89
Convolutional Neural Networks (CNNs)
Convolutional Neural Networks
[Figure: Filter 1 (weights w1–w4) convolved with the input image produces the Layer 1 feature map; Filter 2 (weights w5–w8) convolved with that map produces the Layer 2 feature map]
90
Convolutional Neural Networks (CNNs)
Convolutional Neural Networks
91
Convolutional Neural Networks (CNNs)
Convolutional Neural Networks
[Figure: a deep CNN for scene classification — convolutional layers with 64, 64, 128, 128, 256, 256, 512, 512, … channels interleaved with max-pooling layers, followed by a fully connected layer that outputs classes such as Bedroom, Kitchen, Bathroom, Outdoor]
92
Residual CNNs
Convolutional Neural Networks
93
Recurrent Neural Networks (RNNs)
Recurrent Neural Networks
• Recurrent NNs are used for modeling sequential data and data with varying input and output lengths
E.g., videos, text, speech, DNA sequences, human skeletal data
• RNNs introduce recurrent connections between the neurons
This allows processing sequential data one element at a time by selectively passing information across the sequence
Memory of the previous inputs is stored in the model's internal state and affects the model predictions
Can capture correlations in sequential data
• RNNs use backpropagation-through-time for training
• RNNs are more sensitive to the vanishing gradient problem than CNNs
94
Recurrent Neural Networks (RNNs)
Recurrent Neural Networks
• RNNs use the same set of weights across all time steps
A sequence of hidden states is learned, which represents the memory of the network
The hidden state at step t, ht, is calculated from the previous hidden state ht−1 and the input at the current step xt, i.e., ht = f(wh ht−1 + wx xt)
The function f is a nonlinear activation function, e.g., ReLU or tanh
• RNN shown unrolled over time (see the sketch below)
[Figure: the input sequence x1, x2, x3 is processed with shared weights wx, wh, wy, producing the hidden states h0 → h1 → h2 → h3 and the output]
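A minimal single-unit RNN forward pass matching the unrolled picture above (scalar weights shared across time steps); the weight values and the input sequence are made up.

```python
import numpy as np

w_h, w_x, w_y = 0.5, 1.0, 2.0         # shared weights across all time steps
h = 0.0                               # initial hidden state h0
for x_t in [1.0, -0.5, 0.25]:         # input sequence x1, x2, x3
    h = np.tanh(w_h * h + w_x * x_t)  # h_t = f(w_h * h_{t-1} + w_x * x_t)
    print(round(h, 3))
y = w_y * h                           # read the output from the last hidden state
print("output:", round(y, 3))
```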
95
Recurrent Neural Networks (RNNs)
Recurrent Neural Networks
• RNNs can have one or many inputs and one or many outputs
Image captioning (one input image, a sequence of output words): e.g., "A person riding a motorbike on dirt road"
Machine translation (an input word sequence, an output word sequence): e.g., "Happy Diwali" → "शुभ दीपावली"
96
Bidirectional RNNs
Recurrent Neural Networks
Forward hidden states: h⃗_t = σ( W⃗^(hh) h⃗_{t−1} + W⃗^(hx) x_t )
Backward hidden states: h⃖_t = σ( W⃖^(hh) h⃖_{t+1} + W⃖^(hx) x_t )
Output: y_t = f( [ h⃗_t ; h⃖_t ] )
97
LSTM Networks
Recurrent Neural Networks
98
LSTM Networks
Recurrent Neural Networks
• LSTM cell
Input gate, output gate, forget gate, memory cell
LSTM can learn long-term correlations within data sequences
99
References
100