Chapter 02
Introduction to
deep learning
Woo Youn Kim
Professor of Chemistry &
Graduate School of Data Science, KAIST
CEO, HITS
Contents
• Linear regression
• Linear classification
• Concept of deep learning
• Multi-layer perceptron & Non-linearity
• Forward propagation for prediction
• Backward propagation for training
LAIDD: Chapter 3 (linear regression and classification) of the lecture series on “QSAR (Quantitative Structure-Activity Relationship)”
Statistical modeling
• Statistical modeling aims to obtain a hypothesis from training samples (observations) {𝐗, 𝐘}, where the samples are assumed to be independent and identically distributed (i.i.d.) random variables drawn from the joint distribution 𝑝(𝐗, 𝐘).
• Statistical inference is the procedure of inferring the output 𝒚∗ for a new input 𝒙∗ using the obtained hypothesis.
Learning algorithms
• Reinforcement learning: learning a decision policy from experience by interacting with an environment
• Supervised learning: learning a hypothesis that best explains the input-label relationship
• Unsupervised learning: learning features, also called representations or latent variables
• Semi-supervised learning: obtaining a hypothesis using both labeled and unlabeled data
• Self-supervised learning: learning representations from unlabeled data by solving tasks whose labels are generated from the data itself
Supervised learning
[Figure: molecular structure (X) as input data → feature extraction (L) → neural network → property (Y) as output data: solubility, toxicity, drug efficacy, etc.]
Y = f(X), where f is a neural network.
• Regression: predicting values
• Classification: predicting active or inactive
Linear regression
Regression
Let us start with a few examples of supervised learning problems.
X (input) → Y (target), related by a hypothesis learned from observations:
• Molecular structure → Energy
• Molecular structure → Solubility
• Molecular structure → Biological activity
When the target variable we are trying to predict is continuous, we apply a regression method.
Regression (linear hypothesis)
𝑊: parameters or weights
𝑏 : bias
The figure illustrates how linear regression works by fitting different lines (hypotheses) to a given dataset. It visualizes the hypothesis H(x) = Wx + b, where W is the weight (slope) and b is the bias (intercept). The goal is to find the best-fitting line, i.e., the one that minimizes the error between the predicted values and the actual data points.
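As a minimal sketch of this idea (the toy data below are my own, not from the lecture), here is a least-squares fit of H(x) = Wx + b:

```python
import numpy as np

# Fit H(x) = W*x + b to toy data via the closed-form least-squares solution.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])   # roughly y = 2x

# Design matrix with a column of ones so b is learned together with W.
A = np.stack([x, np.ones_like(x)], axis=1)
(W, b), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"W = {W:.3f}, b = {b:.3f}")    # close to W = 2, b = 0
predict = lambda x_new: W * x_new + b
print(predict(5.0))                    # ~10
```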
Cost function
How well does the line fit our (training) data, numerically? We measure the fit with the mean squared error, cost(W, b) = (1/m) Σᵢ (H(x⁽ⁱ⁾) − y⁽ⁱ⁾)².
Cost minimization
Simplified hypothesis
To simplify the calculation, we ignore the bias term, so the hypothesis becomes H(x) = Wx and the cost becomes cost(W) = (1/m) Σᵢ (Wx⁽ⁱ⁾ − y⁽ⁱ⁾)².
What does the cost look like?
For the training set {(1, 1), (2, 2), (3, 3)}:
• W = 1: cost(W) = (1/3)[(1·1 − 1)² + (1·2 − 2)² + (1·3 − 3)²] = 0
• W = 0: cost(W) = (1/3)[(0·1 − 1)² + (0·2 − 2)² + (0·3 − 3)²] = 14/3 ≈ 4.67
...
The graph shows the cost function cost(W) for different values of W, the weight of the linear regression model.
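A short script reproducing the slide's computation (same three training points):

```python
import numpy as np

# Evaluate cost(W) = (1/m) * sum((W*x - y)^2) on the toy data from the slide.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def cost(W):
    return np.mean((W * x - y) ** 2)

for W in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"W = {W:3.1f}  cost = {cost(W):.3f}")
# W = 1.0 gives cost 0; the curve is a parabola in W.
```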
Gradient descent algorithm
• Minimize the cost function using the gradient descent algorithm.
• Start with an initial guess:
  - start at (0, 0) (or any other value);
  - keep changing W and b a little bit to try to reduce cost(W, b).
• Each time you change the parameters, step in the gradient direction that reduces cost(W, b) the most.
• α: the update rate (or learning rate), which sets the step size ΔW
  - the larger, the faster
  - but too large a value may cause ill-convergence
• Repeat until you converge to a local minimum (i.e., the gradient = 0).
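A minimal sketch of the update loop, assuming the same toy data and the squared-error cost defined earlier:

```python
import numpy as np

# Gradient descent for cost(W) = (1/m) * sum((W*x - y)^2).
# d cost / dW = (2/m) * sum((W*x - y) * x)
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

W = 0.0          # initial guess
alpha = 0.1      # learning rate
for step in range(20):
    grad = 2.0 * np.mean((W * x - y) * x)
    W -= alpha * grad

print(W)  # converges toward 1.0
```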
Least Mean Squares (LMS)
The resulting update equation, θⱼ ← θⱼ + α (y⁽ⁱ⁾ − hθ(x⁽ⁱ⁾)) xⱼ⁽ⁱ⁾, is called the LMS update rule (LMS stands for “least mean squares”).
Convex function
Note that this optimization problem has only one global optimum and no other local optima; cost(W, b) is a convex function. Thus, gradient descent always converges to the global minimum unless the learning rate α is too large.
Gradient descent algorithm
• Batch gradient descent
  Repeat until convergence {
      θⱼ ← θⱼ + α Σᵢ₌₁..ₘ (y⁽ⁱ⁾ − hθ(x⁽ⁱ⁾)) xⱼ⁽ⁱ⁾   (for every j)
  }
  It scans through the entire training set before taking a single step ➔ computationally expensive: suppose you have a million samples!
• Stochastic gradient descent (also incremental gradient descent)
  Loop {
      for i = 1 to m {
          θⱼ ← θⱼ + α (y⁽ⁱ⁾ − hθ(x⁽ⁱ⁾)) xⱼ⁽ⁱ⁾   (for every j)
      }
  }
  It starts making progress right away and continues to make progress with each example it looks at ➔ much faster convergence than batch gradient descent.
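A side-by-side sketch of the two variants on synthetic data (the data, learning rate, and epoch count are my own choices):

```python
import numpy as np

# Contrast batch GD and SGD on linear regression with toy data.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(100, 1))
y = 3.0 * x[:, 0] + rng.normal(0, 0.05, size=100)
X = np.hstack([x, np.ones((100, 1))])       # append bias column
theta_batch = np.zeros(2)
theta_sgd = np.zeros(2)
alpha = 0.1

for epoch in range(50):
    # Batch: one update per full pass over the data.
    grad = X.T @ (X @ theta_batch - y) / len(y)
    theta_batch -= alpha * grad
    # Stochastic: one update per training example.
    for i in rng.permutation(len(y)):
        err = X[i] @ theta_sgd - y[i]
        theta_sgd -= alpha * err * X[i]

print(theta_batch, theta_sgd)
# SGD gets close to [3.0, 0.0] quickly; batch GD moves more slowly per epoch.
```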
Gaussian noise
Suppose that the target variables and the inputs are related via
    y⁽ⁱ⁾ = θᵀx⁽ⁱ⁾ + ε⁽ⁱ⁾,
where ε⁽ⁱ⁾ is an error term that captures either unmodeled effects or random noise.
If the ε⁽ⁱ⁾ are independently and identically distributed according to a Gaussian distribution (also called a normal distribution) with mean zero and some variance σ², then
    p(ε⁽ⁱ⁾) = (1/√(2πσ²)) exp(−(ε⁽ⁱ⁾)² / (2σ²)).
Thus,
    p(y⁽ⁱ⁾ | x⁽ⁱ⁾; θ) = (1/√(2πσ²)) exp(−(y⁽ⁱ⁾ − θᵀx⁽ⁱ⁾)² / (2σ²)).
The notation “p(y⁽ⁱ⁾ | x⁽ⁱ⁾; θ)” indicates that this is the distribution of y⁽ⁱ⁾ given x⁽ⁱ⁾ and parameterized by θ.
Likelihood
Given a data set X = {x⁽ⁱ⁾} and an ML model with parameters θ, what is the distribution of the targets Y = {y⁽ⁱ⁾}?
The probability of the data is given by p(Y | X; θ). When this same quantity is viewed as a function of θ for a fixed data set, it is instead called the likelihood function:
    L(θ) = p(Y | X; θ) = ∏ᵢ p(y⁽ⁱ⁾ | x⁽ⁱ⁾; θ).
What is a reasonable way of choosing our best guess of the parameters θ, i.e., of training the machine learning model?
Maximum likelihood
The principle of maximum likelihood says that we should choose θ so as to make the data as highly probable as possible. That is, we should choose θ to maximize L(θ).
Maximum likelihood
Instead of maximizing L(θ), we can also maximize any strictly increasing function of L(θ); for example, the log likelihood ℓ(θ) = log L(θ).
Maximizing ℓ(θ) gives the same answer as minimizing
    (1/2) Σᵢ (y⁽ⁱ⁾ − θᵀx⁽ⁱ⁾)²,
which is the original LMS cost function. Note that the final θ has no dependence on σ.
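Filling in the algebra the slide leaves to the figure (standard derivation, using the Gaussian model above):

```latex
\begin{align*}
\ell(\theta) &= \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2}\right) \\
&= m \log \frac{1}{\sqrt{2\pi\sigma^2}}
  \;-\; \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{m} \bigl(y^{(i)} - \theta^\top x^{(i)}\bigr)^2 .
\end{align*}
% The first term and the factor 1/sigma^2 do not affect the argmax over theta,
% so maximizing l(theta) = minimizing (1/2) sum_i (y^(i) - theta^T x^(i))^2.
```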
Classification
Classification
Classification problem: the values y we now want to predict take on only a small number of discrete
values.
Binary classification problem: y can take on only two values, 0 and 1.
0 or “−”: the negative class
1 or “+”: the positive class
Decision boundary
For a 2D input space, the decision boundary θᵀx = 0 is one-dimensional; thus, the decision boundary is a straight line.
Problem of linear regression
Let us try to predict y from z (= θᵀx) using the linear regression algorithm. It performs very poorly! It simply does not make sense for hθ(x) to take values larger than 1 or smaller than 0 when we know that y ∈ {0, 1}.
Logistic function
We change hθ(x) to
    hθ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx)),
where g(z) = 1/(1 + e⁻ᶻ) is called the logistic function or the sigmoid function.
✓ It equals 0.5 when a data point lies on the decision boundary (z = 0).
✓ It goes to 1 as z → ∞ and to 0 as z → −∞.
✓ It is always bounded between 0 and 1.
Logistic function
A useful property of the sigmoid function is its derivative:
    g′(z) = g(z)(1 − g(z)).
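A quick numerical check of this property (my own sanity-check script):

```python
import numpy as np

# Verify g'(z) = g(z) * (1 - g(z)) against a central finite difference.
g = lambda z: 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4, 4, 9)
analytic = g(z) * (1 - g(z))
numeric = (g(z + 1e-6) - g(z - 1e-6)) / 2e-6
print(np.max(np.abs(analytic - numeric)))   # ~1e-10
# Note: the maximum of g'(z) is 0.25, at z = 0; this matters later
# for the vanishing gradient problem.
```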
Maximum likelihood
Assuming that
    P(y = 1 | x; θ) = hθ(x)   and   P(y = 0 | x; θ) = 1 − hθ(x),
or, more compactly,
    p(y | x; θ) = (hθ(x))ʸ (1 − hθ(x))^(1−y).
Maximum likelihood
For m training examples, the likelihood of the parameters is
    L(θ) = ∏ᵢ (hθ(x⁽ⁱ⁾))^(y⁽ⁱ⁾) (1 − hθ(x⁽ⁱ⁾))^(1−y⁽ⁱ⁾),
with the corresponding log likelihood for the maximization
    ℓ(θ) = Σᵢ [ y⁽ⁱ⁾ ln hθ(x⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) ln(1 − hθ(x⁽ⁱ⁾)) ].
We maximize ℓ(θ) via gradient ascent.
Cost function
The log likelihood ➔ one has to maximize it.
Cost function = −(log likelihood) ➔ one has to minimize it.
y⁽ⁱ⁾: true value in the training data ➔ true probability of being in class 1
hθ(x⁽ⁱ⁾): predicted value by the classifier ➔ predicted probability of being in class 1
1 − y⁽ⁱ⁾: true probability of being in class 2
1 − hθ(x⁽ⁱ⁾): predicted probability of being in class 2
Cost function
Two important properties:
1. Non-negative, because 0 ≤ prob ≤ 1 implies ln(prob) ≤ 0.
2. For all training data (m examples), if hθ(x⁽ⁱ⁾) is close to y⁽ⁱ⁾, the cost goes to zero:
   if y⁽ⁱ⁾ = 0 and hθ(x⁽ⁱ⁾) → 0: the 1st term is 0 and ln 1 = 0 in the 2nd term, so Cost ➔ 0;
   if y⁽ⁱ⁾ = 1 and hθ(x⁽ⁱ⁾) → 1: ln 1 = 0 in the 1st term and the 2nd term is 0, so Cost ➔ 0.
Indeed, it can play the role of a cost function.
Maximum likelihood
Let us take the derivative to derive the stochastic gradient ascent rule:
    ∂ℓ(θ)/∂θⱼ = (y − hθ(x)) xⱼ,
so the stochastic gradient ascent rule is
    θⱼ ← θⱼ + α (y⁽ⁱ⁾ − hθ(x⁽ⁱ⁾)) xⱼ⁽ⁱ⁾.
Maximum likelihood
The update is driven by the difference between the true and predicted values: the derivative becomes zero as the difference becomes zero. In other words, the updates stop once the classifier perfectly predicts the class of the input data!
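A minimal sketch of logistic regression trained with this rule, on toy 2D data of my own making:

```python
import numpy as np

# Logistic regression via the stochastic gradient ascent rule above.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, (50, 2)), rng.normal(+1, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
X = np.hstack([X, np.ones((100, 1))])     # append 1 for the bias term

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
theta = np.zeros(3)
alpha = 0.1

for epoch in range(20):
    for i in rng.permutation(len(y)):
        h = sigmoid(X[i] @ theta)
        theta += alpha * (y[i] - h) * X[i]   # theta_j += alpha * (y - h) * x_j

acc = np.mean((sigmoid(X @ theta) > 0.5) == y)
print(f"training accuracy: {acc:.2f}")
```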
Cross entropy
Gibbs entropy formula: S = −k_B Σᵢ pᵢ ln pᵢ
vs.
entropy in information theory: H(p) = −Σᵢ pᵢ log pᵢ.
Our cost has the form H(p, q) = −Σᵢ pᵢ log qᵢ, with p the true and q the predicted distribution; it is therefore also called the cross entropy.
Softmax classification
Multinomial classification
We need three decision boundaries, one per class; for a 2D input space, these are three lines.
Softmax function
The softmax function,
    p(y = k | x) = exp(zₖ) / Σⱼ exp(zⱼ)   with zₖ = θₖᵀx,
gives the probability of having the k-th label. If k = 2, as in binary classification (y = 0 or 1),
    p(y = 1 | x) = e^(z₁) / (e^(z₁) + e^(z₂)) = 1 / (1 + e^(−(z₁ − z₂))),
which is the same as the logistic function for binary classification!
It is the generalization of the logistic function to multinomial problems.
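A small sketch of the softmax function; the max-subtraction line is a standard numerical-stability trick, not something the slide discusses:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # does not change the result; avoids overflow
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p, p.sum())                # probabilities summing to 1

# With two classes, softmax reduces to the logistic function of z1 - z2:
z2 = np.array([1.5, 0.5])
print(softmax(z2)[0], 1 / (1 + np.exp(-(z2[0] - z2[1]))))  # numerically identical
```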
Softmax classification
The hypothesis function of softmax regression outputs the probability of being in each class, e.g., dog, cat, or rabbit.
One has to determine the parameters {θ} with a given training set ➔ cost function.
One-hot encoding
One-hot label vectors for the three classes:
Dog = (1, 0, 0)
Cat = (0, 1, 0)
Rabbit = (0, 0, 1)
• Multiple classes can be represented with a vector whose number of elements equals the number of classes.
• This is called one-hot encoding.
• In this example, we have three classes and thus a vector with 3 elements.
• We take the maximum value and set it to one, resulting in only a single non-zero element (in this example, the first element).
• Therefore, the input data is labeled as class 1.
Cost function
As the generalization of logistic regression, we use the cross entropy as the cost function:
    L = −Σₖ yₖ ln(ŷₖ),
where y is the one-hot label and ŷ is the softmax output. The element-wise product y ⊙ ln ŷ, summed over the classes, leaves only the term of the true class.
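A sketch tying the last two slides together: one-hot encoding plus the cross-entropy cost (class names from the example; the predicted probabilities are made up):

```python
import numpy as np

classes = ["dog", "cat", "rabbit"]

def one_hot(index, num_classes=3):
    v = np.zeros(num_classes)
    v[index] = 1.0
    return v

y_true = one_hot(0)                    # dog -> [1, 0, 0]
y_pred = np.array([0.7, 0.2, 0.1])     # softmax output of the classifier

# Cross entropy: -sum_k y_k * ln(p_k); only the true class contributes.
loss = -np.sum(y_true * np.log(y_pred))
print(loss)                             # -ln(0.7) ~ 0.357
```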
Concept of deep learning
Concept of machine learning
Machine learning targets “easy-for-a-human, difficult-for-a-machine” tasks, often referred to as pattern recognition.
• These tasks typically involve recognizing patterns, such as distinguishing cats from dogs in photos.
• The figure shows labeled pictures of a cat and a dog, illustrating how machines learn to recognize patterns by being trained with labeled data (photos with categories).
Why deep learning?
Classical ML (or shallow learning): hand-crafted features by experts
Deep learning: automated feature extraction by AI
Neuron & synapse
Electrical and chemical signal transmission between neurons:
• Digital signal: on-off, generated once the summed input crosses a threshold
• Analog signal transmission from input to output, shaped like the logistic (sigmoid) function
Neural network
Hebbian principle (1940s): a neuroscientific theory claiming that an increase in synaptic efficacy arises from a presynaptic cell’s repeated and persistent stimulation of a postsynaptic cell; “cells that fire together wire together.”
Complexity of the human brain:
• ~10¹¹ neurons
• each neuron has ~7,000 synaptic connections
• 3 years old: ~10¹⁵ synapses
• adult: 10¹⁴ to 5×10¹⁴ synapses
➔ complex functions
• One of the key functions of a neural network is its ability to learn.
• A neural network is a complex adaptive system; it can change its internal structure based on the information flowing through it by adjusting the amplitudes (weights) of signals.
Perception
“easy-for-a-human, difficult-for-a-machine” tasks, often referred to as pattern recognition.
Unique signal pattern
Perceptron
Invented in 1957 by Frank Rosenblatt at the Cornell Aeronautical Laboratory, a perceptron is the
simplest neural network possible: a computational model of a single neuron.
Principle of perceptron
inputs → processor → single output, with w·x = Σⱼ wⱼxⱼ:
    output = 0 if f(w·x) < threshold
             1 if f(w·x) ≥ threshold
Step 1: receive inputs; Step 2: weight inputs; Step 3: sum inputs; Step 4: generate output.
Input 0: x₁ = 12, so x₁ × w₁ = 12 × 0.5 = 6
Input 1: x₂ = 4, so x₂ × w₂ = 4 × (−1) = −4
Sum = 6 + (−4) = 2; Output = sign(Sum) = sign(2) = +1
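The four steps above, as a minimal sketch in code (same numbers as the worked example):

```python
import numpy as np

def perceptron(x, w, threshold=0.0):
    s = np.dot(w, x)                    # steps 2-3: weight and sum the inputs
    return 1 if s >= threshold else 0   # step 4: thresholded output

x = np.array([12.0, 4.0])
w = np.array([0.5, -1.0])
print(perceptron(x, w))                 # dot product = 2 >= 0, so output 1
```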
Bias
Move the threshold inside the perceptron and make it a learnable parameter, the bias b (= −threshold):
    output = 0 if f(w·x) < threshold        ➔        output = 0 if f(w·x) + b < 0
             1 if f(w·x) ≥ threshold                           1 if f(w·x) + b ≥ 0
The bias is a measure of how easy it is to get the perceptron to output 1, i.e., to fire.
Linear classification
By varying the weights and the bias, we can get different models of decision-making.
Learning ➔ determining appropriate weights and bias to get an optimal output for a given training and test data set.
Logic gate
✓ AND Gate
x1 x2 y
0 0 0
1 0 0
0 1 0
1 1 1
output = 0 if w·x + b < 0;  1 if w·x + b ≥ 0
For example, w = (1, 1) with b = −1.5 realizes the AND gate.
Logic gate
✓ Logic gate - OR Gate
x1 x2 y
0 0 0
1 0 1
0 1 1
1 1 1
output = 0 if w·x + b < 0;  1 if w·x + b ≥ 0
The OR line has the same slope as the AND line but a different bias value; for example, w = (1, 1) with b = −0.5.
Logic gate
✓ Logic gate- XOR Gate
x1 x2 y
0 0 0
1 0 1
0 1 1
1 1 0
[Figure: the four XOR points in the (x₁, x₂) plane; no single line separates y = 0 from y = 1.]
output = 0 if w·x + b < 0;  1 if w·x + b ≥ 0
Separating the XOR classes requires a non-linear expression in w and x.
Drawback of a single perceptron
The single-perceptron approach has one major drawback: it can only learn linearly separable functions. Take XOR, a relatively simple function, and notice that it cannot be classified by a linear separator.
But what if we build the model with more perceptrons? If one perceptron can solve OR and another can solve NAND (NOT AND), then feeding their outputs into an AND perceptron solves XOR (see the sketch below).
➔ Many perceptrons to go beyond linearity
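The sketch referenced above; the specific weights are one standard choice, not taken from the slides:

```python
import numpy as np

# XOR(x1, x2) = AND( OR(x1, x2), NAND(x1, x2) )
step = lambda s: 1 if s >= 0 else 0

def gate(x, w, b):
    return step(np.dot(w, x) + b)

def xor(x1, x2):
    h_or   = gate([x1, x2], w=[1, 1],   b=-0.5)    # OR
    h_nand = gate([x1, x2], w=[-1, -1], b=1.5)     # NAND (NOT AND)
    return gate([h_or, h_nand], w=[1, 1], b=-1.5)  # AND

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor(x1, x2))   # prints 0, 1, 1, 0
```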
Multilayer perceptron
& non-linearity
Multilayer perceptron (MLP)
MLP, a.k.a. a feed-forward network:
    input x = {x₁, x₂}, layer weights w⁽¹⁾, w⁽²⁾ and biases b⁽¹⁾, b⁽²⁾
    hidden units: hⱼ = f(wⱼ⁽¹⁾ · x + b⁽¹⁾)
    output: y = f(w⁽²⁾ · h + b⁽²⁾)
Marvin Minsky (1969), founder of the MIT AI lab: there was no way to train such a neural network ➔ the first winter of AI.
Nonlinearity
The nonlinear activation function can be expanded as
    σ(u) ≈ σ₀ + σ₁·u + σ₂·u²/2 + …
The nonlinearity comes from both the activation functions and the stacking of many perceptrons.
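A small demonstration of why the activation matters: without f, stacked layers collapse into a single linear map (the weights below are arbitrary random values of my own):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)

x = rng.normal(size=2)
linear_two_layer = W2 @ (W1 @ x + b1) + b2
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)       # a single equivalent layer
print(np.allclose(linear_two_layer, collapsed))   # True: still linear

# With a sigmoid in between, the composition can no longer be written
# as one linear layer, which is where the nonlinearity comes from.
f = lambda z: 1 / (1 + np.exp(-z))
y = W2 @ f(W1 @ x + b1) + b2
```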
Universal approximation
Universal Approximation Theorem
• Why do neural networks work so well?
• The answer: MLPs can approximate any continuous non-linear function, a result known as the universal approximation theorem. https://en.wikipedia.org/wiki/Universal_approximation_theorem
Universal approximation
One-hidden-layer approximations of nonlinear functions: x², sin x, |x|, and the step function H(x).
These are good bases for spanning a nonlinear function space.
(Recap) Supervised learning: Y = f(X), with a neural network f mapping the structure X to a property Y (solubility, toxicity, drug efficacy, etc.). Regression predicts values; classification predicts active or inactive.
Why deeper?
Theorem. For any given multivariate polynomial and any tolerance ε > 0, there exists a neural
network of fixed finite size N (independent of ε) that approximates the polynomial to accuracy
better than ε. Furthermore, N is bounded by the complexity of the polynomial, scaling as the
number of multiplications required times a factor that is typically slightly larger than 4.
https://arxiv.org/pdf/1608.08225v4.pdf
Why deeper?
• ILSVRC (ImageNet Large Scale Visual Recognition Challenge) Winners
Forward propagation
for prediction
Prediction
Feed-forward MLP (or DNN): the input x propagates forward through the network to produce the prediction y.
Example: forward propagation
[Four worked slides: the activations of a small network are computed layer by layer, from the input through the hidden units to the output.]
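Since the worked example lives in the figures, here is a comparable sketch with made-up numbers:

```python
import numpy as np

# Forward propagation through a 2-layer MLP (weights and input are my own).
sigmoid = lambda z: 1 / (1 + np.exp(-z))

x = np.array([1.0, 0.5])

# Layer 1: two inputs -> two hidden units.
W1 = np.array([[0.1, 0.3],
               [0.2, 0.4]])
b1 = np.array([0.1, 0.2])
h = sigmoid(W1 @ x + b1)     # h_j = f(sum_k W1[j,k] * x[k] + b1[j])

# Layer 2: two hidden units -> one output.
W2 = np.array([[0.5, -0.5]])
b2 = np.array([0.0])
y = sigmoid(W2 @ h + b2)

print(h, y)
```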
Back propagation
for training
Loss functions
Loss function = Cost function = Objective function = Error function
Regression:
    L2 loss (MSE or LMS) = (1/(2m)) Σᵢ (yᵢ − h(xᵢ))²
Classification:
    Cross entropy = −Σᵢ yᵢ ln h(xᵢ)
    For binary classification, L = −[ yᵢ ln h(xᵢ) + (1 − yᵢ) ln(1 − h(xᵢ)) ]
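Minimal sketches of the three losses above (vectorized over a batch; my own implementations):

```python
import numpy as np

def l2_loss(y, h):
    return np.sum((y - h) ** 2) / (2 * len(y))    # (1/2m) * sum (y_i - h_i)^2

def cross_entropy(y_onehot, h):                    # multiclass, h = softmax output
    return -np.sum(y_onehot * np.log(h))

def binary_cross_entropy(y, h):                    # y in {0,1}, h = sigmoid output
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

print(l2_loss(np.array([1.0, 2.0]), np.array([0.9, 2.2])))
print(binary_cross_entropy(np.array([1, 0]), np.array([0.8, 0.1])))
```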
How to train?
Layer by layer, the network computes hⱼ⁽ⁱ⁺¹⁾ = f(Σₖ wⱼₖ⁽ⁱ⁾ hₖ⁽ⁱ⁾ + b⁽ⁱ⁾), mapping the input x to the output y.
One has to determine {w, b} so as to minimize the error E(y_pred, y_true).
Minsky (1969): no one on earth had found a viable way to train MLPs well enough to learn even such simple functions (the first winter of neural networks, until 1986).
Stochastic gradient descent
Back propagation:
• 1974, 1982: Paul Werbos
• 1986: rediscovered by Hinton and coworkers
Back propagation
[Seven worked slides: the output error is propagated backward layer by layer via the chain rule, yielding ∂E/∂w and ∂E/∂b for every layer.]
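Since the worked slides are figures, here is a comparable minimal sketch: back propagation for a one-hidden-layer sigmoid MLP with L2 loss (network size, data, and learning rate are my own choices):

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)

x, y_true = np.array([1.0, 0.5]), np.array([1.0])
alpha = 0.5

for step in range(100):
    # Forward pass.
    z1 = W1 @ x + b1;  h = sigmoid(z1)
    z2 = W2 @ h + b2;  y = sigmoid(z2)
    # Backward pass (chain rule).
    d2 = (y - y_true) * y * (1 - y)      # dL/dz2 for L = (1/2)(y - t)^2
    d1 = (W2.T @ d2) * h * (1 - h)       # dL/dz1
    # Gradient descent update.
    W2 -= alpha * np.outer(d2, h);  b2 -= alpha * d2
    W1 -= alpha * np.outer(d1, x);  b1 -= alpha * d1

print(y)  # moves toward 1.0
```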
Vanishing gradient
[Figure: the sigmoid s(x) and tanh(x); their derivatives vanish for large |x|.]
Successive multiplication of the derivatives of the activation function ➔ the update value goes to zero.
Vanishing gradient
The second winter of neural networks (research was kept alive with support from CIFAR, the Canadian Institute For Advanced Research).
The vanishing gradient problem occurs during the training of deep neural networks, particularly very deep ones or those using activation functions like the sigmoid or tanh. The gradients used to update the weights during backpropagation become extremely small as they propagate backward through the layers. As a result, the early layers receive negligible updates, making it difficult for the network to learn effectively and slowing or even halting training. The problem can be mitigated by using activation functions like ReLU and its variants, or by techniques such as batch normalization.
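A tiny numerical illustration of the mechanism (0.25 is the maximum of the sigmoid's derivative, noted earlier):

```python
# A product of one derivative factor per layer decays geometrically,
# even in the best case where every factor takes its maximum value 0.25.
max_deriv = 0.25
for depth in (2, 5, 10, 20):
    print(depth, max_deriv ** depth)
# 2 -> 0.0625, 5 -> ~9.8e-04, 10 -> ~9.5e-07, 20 -> ~9.1e-13:
# gradients in the early layers of a 20-layer sigmoid net all but vanish.
```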
Various activation functions
ReLU = Rectified Linear Unit [Nair and Hinton, ICML (2010)] ➔ made it possible to go deeper.
Variants of ReLU address its problem of giving no learning signal for negative inputs (zero derivative there), a.k.a. the dying ReLU:
• Leaky ReLU: a small non-zero slope for negative inputs
• ELU: softer than Leaky ReLU (good for noise handling) but more computationally costly
• Maxout: a piecewise linear function (the maximum over several linear units, each with its own parameters)
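Sketches of the activation functions named above (the α values are common defaults, not taken from the slides):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)     # small slope keeps gradients alive

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))  # smooth for x < 0

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), leaky_relu(x), elu(x), sep="\n")
```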
New terms
• Supervised learning
• Unsupervised learning
• Reinforcement learning
• Hypothesis
• Least Mean Squares (LMS)
• Maximum Likelihood Estimate (MLE)
• Regression
• Gradient descent
• Classification
• Decision boundary
• Logistic or sigmoid function
• Cost function
• Softmax classification
• Cross entropy
• One-hot encoding
New terms
• Neuron & synapse
• Perceptron
• Activation function: sigmoid, tanh, ReLU, etc.
• Multilayer perceptron (MLP) = Deep neural network (DNN) = Artificial
neural network (ANN)
• Nonlinearity
• Universal approximation
• Loss function = Cost function = Error function
• Feed forward
• Back propagation