Essential Concept in Artificial Neural Networks

Essential Concepts in
ARTIFICIAL NEURAL NETWORKS
September 11, 2024
Prepared by: Huy Tran Minh, Khang Vo Hoang Nhat
Ho Chi Minh City University of Technology

Contents
1 Introduction 3
2 Architectural Framework of Artificial Neural Networks 3

2.1 Elemental Neuron Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Fully Connected Network Architecture . . . . . . . . . . . . . . . . . . . . . . 3
3 Mathematical Formulation 4
3.1 Matrix Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.2 Expanded Matrix Representation . . . . . . . . . . . . . . . . . . . . . . . . . 4
4 Computational Example 5
4.1 Weight Matrices, Input Vector and Bias Vectors . . . . . . . . . . . . . . . . 5
4.2 Calculation of Activation Values . . . . . . . . . . . . . . . . . . . . . . . . . 6
4.2.1 Hidden Layer Activations . . . . . . . . . . . . . . . . . . . . . . . . . 6
4.2.2 Output Layer Activation . . . . . . . . . . . . . . . . . . . . . . . . . . 6
5 Activation Functions in Neural Networks 7

5.1 Sigmoid Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
5.1.1 Properties and Characteristics . . . . . . . . . . . . . . . . . . . . . . 7
5.1.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
5.1.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
5.2 Hyperbolic Tangent (Tanh) Function . . . . . . . . . . . . . . . . . . . . . . . 8
5.2.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
5.2.3 Advantages over Sigmoid . . . . . . . . . . . . . . . . . . . . . . . . . 9
5.3 Rectified Linear Unit (ReLU) Function . . . . . . . . . . . . . . . . . . . . . . 9
5.3.2 Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5.3.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
5.4 Softmax Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
5.4.2 Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
5.4.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
6 Neural Network Learning Algorithms 12

6.1 Forward Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
6.2 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
6.3 Weight Update Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
6.3.1 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
6.3.2 Gradient Descent with Momentum . . . . . . . . . . . . . . . . . . . . 15
6.3.3 Adam (Adaptive Moment Estimation) . . . . . . . . . . . . . . . . . . 15
6.4 Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
6.4.1 Cross-Entropy Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
6.4.2 Mean Squared Error (MSE) . . . . . . . . . . . . . . . . . . . . . . . . 16
7 Neural Network Classification: A Concrete Example 16

7.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1
7.2 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
7.3 Forward Propagation Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 18
7.4 Numerical Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
7.4.1 Step 1: Hidden Layer Input . . . . . . . . . . . . . . . . . . . . . . . . 19
7.4.2 Step 2: Hidden Layer Activation . . . . . . . . . . . . . . . . . . . . . 19
7.4.3 Step 3: Output Layer Input . . . . . . . . . . . . . . . . . . . . . . . . 19
7.4.4 Step 4: Output Layer Activation . . . . . . . . . . . . . . . . . . . . . 20
7.5 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
8 Multi-Layer Perceptron (MLP) 20

8.1 Architectural Overview of MLP . . . . . . . . . . . . . . . . . . . . . . . . . . 20
8.2 Mathematical Formulation of MLP . . . . . . . . . . . . . . . . . . . . . . . . 21
8.2.1 Forward Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
8.2.2 Backward Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
9 Artificial Neural Networks: An Overview 22

9.1 Applications of Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . 22
9.2 Artificial Neural Networks and Multilayer Perceptrons . . . . . . . . . . . . . 24
9.2.1 Strengths and Capabilities of Artificial Neural Networks . . . . . . . . 24
9.2.2 Challenges and Constraints in ANN Implementation . . . . . . . . . . 24
9.2.3 Theoretical and Practical Advantages of Multilayer Perceptrons . . . . 24
9.2.4 Critical Limitations in MLP Architecture and Training . . . . . . . . 24
9.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2
1 Introduction
Artificial Neural Networks (ANNs) are sophisticated computational models inspired by the
intricate structure and function of biological neural networks. These models are designed
to emulate the information processing capabilities of the human nervous system. The fun-
damental unit of an ANN is the artificial neuron, analogous to a biological neuron, which is
capable of receiving, processing, and transmitting information within the network architec-
ture.
2 Architectural Framework of Artificial Neural Networks
2.1 Elemental Neuron Structure

The artificial neuron, the core building block of an ANN, comprises three essential com-
ponents: input vectors, a weighted summation function, and an activation function. Each
neuron receives a set of inputs, each associated with a corresponding weight. These weighted
inputs are aggregated and subsequently processed through an activation function to deter-
mine the neuron’s output.
x1
w1
P P
w2 w i xi + b f ( wi xi + b)
x2 y
P
f
w3
x3
Figure 1: Schematic representation of an elementary artificial neuron
2.2 Fully Connected Network Architecture

A fully connected neural network, also referred to as a densely connected network, exhibits
a multi-layer architecture. In this configuration, each neuron in one layer is interconnected
with every neuron in the subsequent layer. The network typically consists of an input layer
for data ingestion, one or more hidden layers for intermediate processing, and an output
layer for final results.
3
x1 h1
Input Layer Hidden Layer Output Layer
x2 h2 y
x3 h3
Figure 2: Architectural diagram of a fully connected neural network
3 Mathematical Formulation
3.1 Matrix Representation

In the context of neural networks, inputs and weights are frequently represented in matrix
form. For a fully connected neuron layer, the input vector x and weight matrix W undergo
matrix multiplication to produce the layer’s output.
The fundamental equation governing this operation is:
z = Wx + b (1)
Where:
• x represents the input vector (dimension: n × 1)
• W denotes the weight matrix (dimension: m × n)
• b signifies the bias vector (dimension: m × 1)

• z is the resultant output vector (dimension: m × 1)
3.2 Expanded Matrix Representation

The matrix structure in neural networks can be expressed in expanded form as follows:
      
z1 w11 w12 · · · w1n x1 b1
 z2   w21 w22 · · · w2n   x2   b2 
 ..  =  .. ..   ..  +  ..  (2)
      
.. ..
 .   . . . .  .   . 
zm wm1 wm2 · · · wmn xn bm
4
4 Computational Example
Consider a neural network with 4 input neurons, 3 hidden neurons, and 1 output neuron:
Input
layer Hidden
layer
I1
H1
Output
I2
layer
H2
I3 O
H3
I4
Figure 3: Schematic of the example neural network
4.1 Weight Matrices, Input Vector and Bias Vectors
H1 H2 H3 O
I1 1 3 5 -
I2 2 1 0 -
I3 1 4 5 -
I4 2 0 3 -
H1 - - - 1
H2 - - - 2
H3 - - - 3
Table 1: Weight matrices for hidden and output layers
x1 x2 x3 x4
Input 1 2 3 -1
Table 2: Input vector
5
b1 b2 b3
Hidden Layer Bias 0.1 0.2 0.3
Table 3: Hidden layer bias vector
b (Output Layer Bias)

0.5
Table 4: Output layer bias
4.2 Calculation of Activation Values

4.2.1 Hidden Layer Activations
zH1 = (1 · 1) + (2 · 2) + (1 · 3) + (2 · (−1)) + bH1

= 1 + 4 + 3 − 2 + 0.1
= 6.1 (3)
zH2 = (3 · 1) + (1 · 2) + (4 · 3) + (0 · (−1)) + bH2

= 3 + 2 + 12 + 0 + 0.2
= 17.2 (4)
zH3 = (5 · 1) + (0 · 2) + (5 · 3) + (3 · (−1)) + bH3

= 5 + 0 + 15 − 3 + 0.3
= 17.3 (5)
4.2.2 Output Layer Activation
zO = (1 · 6.1) + (2 · 17.2) + (3 · 17.3) + bO

= 6.1 + 34.4 + 51.9 + 0.5
= 92.9 (6)
This computational example demonstrates the forward propagation process through a

simple neural network, illustrating how information flows from the input layer through the
hidden layer to the output layer. The calculations show the step-by-step application of
weights and biases to transform the input data into the final output.
6
5 Activation Functions in Neural Networks
Activation functions play a crucial role in determining the output of neurons in artificial
neural networks. This section examines four fundamental activation functions: the Sigmoid,
Hyperbolic Tangent (Tanh),The Rectified Linear Unit (ReLU) and Softmax Function func-
tions, elucidating their mathematical properties, applications, and significance in various
machine learning paradigms.
5.1 Sigmoid Function

The Sigmoid function, denoted as σ(x), is a nonlinear activation function that maps input
values to the interval (0, 1). It is defined mathematically as:
1
σ(x) = (7)
1 + e−x
5.1.1 Properties and Characteristics

• Range: (0, 1)
• Continuous and differentiable

• S-shaped curve
• Saturates for large positive or negative inputs
5.1.2 Applications
1. Binary Classification: The Sigmoid function is extensively utilized in logistic re-
gression models for binary classification tasks. Its output can be interpreted as a
probability, making it particularly suitable for problems requiring probabilistic pre-
dictions.
2. Output Layer Activation: In multi-layer perceptrons, the Sigmoid function is often
employed in the output layer for binary classification tasks or when the target variable
is bounded between 0 and 1.
3. Gradient-Based Learning: The Sigmoid function’s differentiability facilitates gradient-
based optimization techniques in neural network training.
5.1.3 Limitations
• Vanishing Gradient Problem: For inputs with large absolute values, the gradient
of the Sigmoid function approaches zero, potentially impeding learning in deep neural
networks.
• Non-Zero Centered: The Sigmoid function’s output is not centered around zero,
which can introduce difficulties in subsequent layers of deep networks.
7
σ(x)
0.8
0.6
0.4
0.2
x
−6 −4 −2 2 4 6
1
Figure 4: Sigmoid function σ(x) = 1+e−x
5.2 Hyperbolic Tangent (Tanh) Function

The Hyperbolic Tangent function, denoted as tanh(x), is a scaled and shifted variant of the
Sigmoid function. It is defined as:
ex − e−x
tanh(x) = (8)
ex + e−x

• Range: (−1, 1)
• Continuous and differentiable
• S-shaped curve
• Zero-centered
5.2.2 Applications
1. Hidden Layer Activation: The Tanh function is frequently employed as an activa-
tion function in hidden layers of neural networks, particularly in architectures such as
Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks.
2. Feature Scaling: Due to its zero-centered nature, the Tanh function can effectively
normalize input features, facilitating faster convergence during training.
3. Signal Processing: In signal processing applications, the Tanh function is used for
its ability to handle bipolar signals effectively.
8
5.2.3 Advantages over Sigmoid
• Zero-Centered Output: The Tanh function’s output is centered around zero, which
can help mitigate certain optimization issues in deep networks.
• Steeper Gradient: The Tanh function has a steeper gradient compared to the Sig-
moid function, which can lead to faster learning in some scenarios.
1
tanh(x)
0.5
x
−6 −4 −2 2 4 6
−0.5
−1
ex −e−x
Figure 5: Tanh function tanh(x) = ex +e−x
5.3 Rectified Linear Unit (ReLU) Function

The Rectified Linear Unit (ReLU) function is a piecewise linear function that outputs the
input for positive values and zero otherwise. It is defined as:
ReLU(x) = max(0, x) (9)

• Non-linear activation function
• Range: [0, ∞)
• Non-differentiable at x = 0
• Sparse activation: Promotes sparsity in the network
5.3.2 Advantages
• Mitigation of Vanishing Gradient: ReLU effectively addresses the vanishing gra-
dient problem in deep neural networks, facilitating the training of deeper architectures.
• Computational Efficiency: The simplicity of the ReLU function allows for faster
computation compared to sigmoid or tanh functions.
• Non-saturation: Unlike sigmoid and tanh, ReLU does not saturate for positive
inputs, allowing for continued learning.
9
5.3.3 Applications
1. Convolutional Neural Networks (CNNs): ReLU is extensively employed in
CNNs for computer vision tasks such as image classification, object detection, and
semantic segmentation.
2. Deep Feedforward Networks: ReLU is commonly used in hidden layers of deep

feedforward networks across various domains, including natural language processing
and speech recognition.
3. Generative Models: ReLU is utilized in generative adversarial networks (GANs)
and variational autoencoders (VAEs) for tasks such as image generation and style
transfer.
6
ReLU(x)
x
−2 −1 1 2 3 4 5 6
Figure 6: ReLU function: ReLU(x) = max(0, x)
5.4 Softmax Function

The Softmax function, also known as the normalized exponential function, transforms a
vector of K real numbers into a probability distribution of K possible outcomes. It is
defined as:
exi
Softmax(xi ) = PK for i = 1, . . . , K (10)
j=1 exj
where xi represents the i-th element of the input vector, and K is the number of classes
in the multi-class classification problem.
10
• Outputs a probability distribution (sum of outputs equals 1)
• Range: (0, 1) for each output
• Differentiable
• Preserves relative order of inputs
5.4.2 Advantages
• Probabilistic Interpretation: Softmax provides a clear probabilistic interpretation
of the model’s predictions, which is crucial in many classification tasks.
• Differentiability: The Softmax function is differentiable, making it suitable for use

with gradient-based optimization methods.
• Handling of Multi-class Problems: Softmax naturally extends to multi-class clas-
sification problems, unlike binary classification-oriented functions like sigmoid.
5.4.3 Applications
1. Multi-class Classification: Softmax is widely used in the output layer of neural
networks for multi-class classification tasks, such as image classification or document
categorization.
2. Natural Language Processing: In tasks like named entity recognition, part-of-
speech tagging, and sentiment analysis, Softmax is employed to classify words or
sentences into multiple categories.
3. Recommendation Systems: Softmax can be used in recommendation systems to

model user preferences across multiple items or categories.
11
1
Softmax probability 0.5
0 2
0
−2
0 −2 x2
2
x1
Figure 7: Softmax function for three classes: x1 , x2 , and x3 = 0. The z-axis represents the
probability of class 1.
6 Neural Network Learning Algorithms
This section presents a comprehensive overview of the fundamental learning algorithms

employed in neural networks, encompassing forward propagation, backpropagation, weight
update mechanisms, and loss functions.
6.1 Forward Propagation

Forward propagation is the process by which input data traverses through the neural net-
work, generating activations in each layer until the final output is produced. The mathe-
matical formulation of this process is as follows:
1. Hidden Layer Input Computation:
z(l) = W(l) a(l−1) + b(l) (11)
where z(l) is the input to layer l, W(l) is the weight matrix, a(l−1) is the activation
from the previous layer, and b(l) is the bias vector.
2. Activation Function Application:
a(l) = f (z(l) ) (12)
where f is a non-linear activation function, commonly ReLU for hidden layers.
12
3. Output Layer Computation: For the output layer, a different activation function
may be applied, such as softmax for multi-class classification:
(L)
(L) ez
ŷ = softmax(z ) = P (L) (13)
zj
je
where L denotes the output layer.
Input Layer
W(1) , b(1)
Hidden Layer(s)
W(L) , b(L)
z(L)
Output Layer Activation f
ŷ
Figure 8: Schematic representation of forward propagation in a neural network
6.2 Backpropagation
Backpropagation is a gradient-based learning algorithm that computes the gradient of the
loss function with respect to the network parameters. This process enables efficient weight
updates to minimize the prediction error.
The backpropagation algorithm proceeds as follows:
1. Compute the error at the output layer:
δ (L) = ∇a L ⊙ f ′ (z(L) ) (14)
where L is the loss function, ⊙ denotes element-wise multiplication, and f ′ is the
derivative of the activation function.
2. Propagate the error backwards through the network:
δ (l) = ((W(l+1) )T δ (l+1) ) ⊙ f ′ (z(l) ) (15)
3. Compute the gradients for weights and biases:

∇W(l) L = δ (l) (a(l−1) )T (16)
∇b(l) L = δ (l) (17)
13
X: Input matrix
Input n: Batch size, d: Feature di-
Data X mension
R: Set of real numbers
X ∈ Rn×d
Forward fθ : Neural network

Propagation θ: Network parameters
fθ (X)
Loss Com- L: Loss function

y: True labels
putation ŷ: Predicted outputs
L(y, ŷ)
Compute gradients via chain

Backpropagation rule:
∂L
∂θ
= ∂L
∂ ŷ
· ∂∂θ
ŷ
∇θ L
Parameter Update rule (e.g., SGD):

θ ← θ − η∇θ L
Update η: Learning rate
No Tmax : Max iterations

Convergence
ϵ: Loss threshold
Criteria Met? δ: Gradient norm threshold
• t > Tmax
• L<ϵ
• ∥∇θ L∥2 < δ
Yes
Terminate
Figure 9: The backpropagation algorithm in neural networks
14
6.3 Weight Update Mechanisms
The optimization of neural network parameters is typically achieved through iterative weight
update procedures. We present three prominent optimization algorithms: Gradient Descent
(GD), Gradient Descent with Momentum, and Adam.
6.3.1 Gradient Descent

Gradient Descent is a first-order iterative optimization algorithm that updates the weights
in the direction of steepest descent of the loss function. The weight update rule is given by:
W(l) ← W(l) − η∇W(l) L (18)

where η is the learning rate, and ∇W(l) L is the gradient of the loss function with respect
to the weights in layer l.
6.3.2 Gradient Descent with Momentum

Gradient Descent with Momentum incorporates a momentum term to accelerate convergence
and mitigate oscillations. The update rules are as follows:
v(l) ← βv(l) + (1 − β)∇W(l) L (19)

(l) (l) (l)
W ←W − ηv (20)
where v(l) is the velocity vector for layer l, and β is the momentum coefficient.
6.3.3 Adam (Adaptive Moment Estimation)

Adam is an adaptive learning rate optimization algorithm that combines the advantages of
RMSprop and Momentum. The update rules for Adam are:
m(l) ← β1 m(l) + (1 − β1 )∇W(l) L (21)

(l) (l) 2
v ← β2 v + (1 − β2 )(∇W(l) L) (22)
m(l)
W(l) ← W(l) − η √ (23)
v(l) + ϵ
where m(l) and v(l) are the first and second moment vectors respectively, β1 and β2 are
decay rates for the moment estimates, and ϵ is a small constant for numerical stability.
6.4 Loss Functions

Loss functions quantify the discrepancy between predicted and true values, guiding the
optimization process. We present two commonly used loss functions: Cross-Entropy for
classification tasks and Mean Squared Error (MSE) for regression tasks.
15
6.4.1 Cross-Entropy Loss
For binary classification, the cross-entropy loss is defined as:
N
1 X
L(y, ŷ) = − [yi log(ŷi ) + (1 − yi ) log(1 − ŷi )] (24)
N i=1
where N is the number of samples, yi is the true label, and ŷi is the predicted probability
for the i-th sample.
2.5
yi = 1
yi = 0
2
Cross-Entropy Loss
1.5
0.5
0
0 0.2 0.4 0.6 0.8 1
ŷ (Predicted probability)
Figure 10: Cross-Entropy Loss for binary classification
6.4.2 Mean Squared Error (MSE)

For regression tasks, the MSE loss is defined as:
N
1 X
L(y, ŷ) = (yi − ŷi )2 (25)
N i=1
where yi is the true value and ŷi is the predicted value for the i-th sample.
The choice of loss function is crucial and depends on the specific task and desired prop-
erties of the model output.
7 Neural Network Classification: A Concrete Example
This section presents a detailed analysis of a simple artificial neural network (ANN) designed
for binary classification. We elucidate the network’s structure and functionality through a
step-by-step computational example.
16
1
yi = 1
0.8
MSE Loss
0.6
0.4
0.2
0
−1 −0.5 0 0.5 1 1.5 2
ŷ (Predicted value)
Figure 11: Mean Squared Error (MSE) for regression tasks
7.1 Problem Formulation

Consider a binary classification problem with input data x = (x1 , x2 ) ∈ R2 and output
y ∈ {0, 1}. We construct a feedforward neural network with one hidden layer to model the
relationship between inputs and outputs.
17
7.2 Network Architecture
The proposed neural network comprises:
• An input layer with two neurons, corresponding to features x1 and x2 .

• A hidden layer with three neurons, employing the Rectified Linear Unit (ReLU) acti-
vation function.
• An output layer with one neuron, utilizing the sigmoid activation function to produce
a probability estimate for class membership.
Figure 12 illustrates the network architecture.
Input Hidden Output
x1 h1
h2 y
x2 h3
Figure 12: Neural network architecture for binary classification
7.3 Forward Propagation Algorithm

Let W(1) ∈ R3×2 and b(1) ∈ R3 denote the weight matrix and bias vector for the hidden
layer, respectively. Similarly, let W(2) ∈ R1×3 and b(2) ∈ R represent the weight vector and
bias for the output layer.
The forward propagation algorithm proceeds as follows:
1. Compute the input to the hidden layer:
z(1) = W(1) x + b(1)
2. Apply the ReLU activation function:
a(1) = ReLU(z(1) ) = max(0, z(1) )
3. Compute the input to the output layer:
z (2) = W(2) a(1) + b(2)
4. Apply the sigmoid activation function:

1
ŷ = σ(z (2) ) =
1 + e−z(2)
18
7.4 Numerical Example
To illustrate the forward propagation process, we provide a concrete example with the
following network parameters:
   
0.2 −0.3 0.1
W(1) =  0.4 0.1  , b(1) = −0.1
−0.5 0.2 0.2
W(2) = 0.5 b(2) = 0.1

−0.2 0.3 ,
Consider an input vector x = [0.6, −0.4]T . We now proceed through the forward propa-
gation steps:
7.4.1 Step 1: Hidden Layer Input
z(1) = W(1) x + b(1)

   
0.2 −0.3 0.1
0.6
=  0.4 0.1  + −0.1
−0.4
−0.5 0.2 0.2
 
0.34
=  0.20 
−0.18
7.4.2 Step 2: Hidden Layer Activation
a(1) = ReLU(z(1) )
= max(0, z(1) )
 
0.34
= 0.20
0
7.4.3 Step 3: Output Layer Input
z (2) = W(2) a(1) + b(2)

 
0.34
= 0.5 −0.2 0.3 0.20 + 0.1
0
= 0.23
19
7.4.4 Step 4: Output Layer Activation
ŷ = σ(z (2) )
1
=
1 + e−0.23
≈ 0.557
7.5 Interpretation
The output value of 0.557 represents the network’s estimated probability that the input
x = [0.6, −0.4]T belongs to class 1. This example demonstrates how a simple neural network
processes input data to produce a classification probability through forward propagation.
8 Multi-Layer Perceptron (MLP)
The Multi-Layer Perceptron (MLP) is a fully connected feedforward neural network compris-
ing multiple layers of artificial neurons. MLPs are widely employed in supervised learning
tasks, particularly in classification and regression problems.
8.1 Architectural Overview of MLP

Figure 13 illustrates the typical structure of an MLP:
Input Layer Hidden Layer Output Layer
x1 h1
x2 h2 y
x3 h3
Figure 13: Architectural representation of a Multi-Layer Perceptron (MLP)
The MLP architecture consists of three primary components:
1. Input Layer: Neurons in this layer correspond to the features of the input data.
2. Hidden Layers: One or more layers that perform non-linear transformations on the
input data.
3. Output Layer: Produces the final output of the network, often utilizing activation
functions tailored to the specific task (e.g., softmax for multi-class classification).
20
8.2 Mathematical Formulation of MLP
8.2.1 Forward Propagation
The forward propagation process in an MLP can be mathematically described as follows:
1. Input to Hidden Layer: For a hidden layer l, the pre-activation output is computed
as:
z(l) = W(l) a(l−1) + b(l) (26)
where W(l) ∈ Rnl ×nl−1 is the weight matrix, a(l−1) ∈ Rnl−1 is the output of the
previous layer, and b(l) ∈ Rnl is the bias vector.
2. Activation Function: The output of the hidden layer is obtained by applying an

activation function f :
a(l) = f (z(l) ) (27)
3. Output Layer: The final layer computes:
z(L) = W(L) a(L−1) + b(L) (28)
The network output ŷ is then obtained by applying an appropriate activation function

(e.g., softmax for classification tasks).
8.2.2 Backward Propagation

The backward propagation algorithm computes the gradients of the loss function with re-
spect to the network parameters:
1. Output Layer Error: Compute the error at the output layer:
δ (L) = ∇a L ⊙ f ′ (z(L) ) (29)
where L is the loss function, and ⊙ denotes element-wise multiplication.

2. Hidden Layer Error: Propagate the error backwards:
δ (l) = ((W(l+1) )T δ (l+1) ) ⊙ f ′ (z(l) ) (30)
3. Parameter Updates: Update weights and biases:

∂L
= δ (l) (a(l−1) )T (31)
∂W(l)
∂L
= δ (l) (32)
∂b(l)
21
9 Artificial Neural Networks: An Overview
The human brain serves as the primary inspiration for neural network architecture. Human
brain cells, also known as neurons, form a complex, highly interconnected network that
transmits electrical signals to facilitate information processing. Analogously, an artificial
neural network (ANN) is composed of artificial neurons that collaborate to solve problems.
Artificial neurons are software modules, referred to as nodes, while artificial neural networks
are software programs or algorithms that fundamentally utilize computer systems to perform
computational operations. These networks mimic the biological neural structure in their
design and function.
The key similarities between biological and artificial neural networks include:
• Interconnectivity: Both systems feature a high degree of interconnection between

individual units (neurons or nodes).
• Signal Transmission: In biological systems, neurons transmit electrical signals; in

ANNs, nodes pass weighted values through activation functions.
• Information Processing: Both networks process information through the collective
action of their constituent units.
• Adaptive Learning: Biological neural networks modify synaptic connections based

on experience; ANNs adjust weights and biases during training.
• Parallel Processing: Both systems can process multiple inputs simultaneously, al-
lowing for efficient computation of complex problems.
This biomimetic approach to computing has led to significant advancements in machine

learning and artificial intelligence, enabling the development of systems capable of pattern
recognition, decision-making, and complex problem-solving across various domains.
9.1 Applications of Artificial Neural Networks

The versatility of ANNs has led to their widespread adoption in numerous fields:
• Computer Vision
– Object Detection and Recognition: Convolutional Neural Networks (CNNs) like

YOLO (You Only Look Once) and R-CNN (Region-based CNN) enable real-time
object detection in images and video streams.
– Image Segmentation: U-Net and Mask R-CNN architectures provide pixel-level
classification, crucial in medical imaging and autonomous driving systems.
– Generative Models: Generative Adversarial Networks (GANs) can create, modify,
and enhance images, with applications in art, design, and data augmentation.
22
• Speech Recognition and Synthesis
– Automatic Speech Recognition (ASR): Deep Neural Networks, particularly Long
Short-Term Memory (LSTM) networks, have significantly improved the accuracy
of speech-to-text systems.
– Text-to-Speech (TTS): WaveNet and Tacotron architectures enable the generation
of highly natural and expressive synthetic speech.
• Time Series Prediction
– Financial Forecasting: LSTM networks and Temporal Convolutional Networks
(TCNs) are employed for stock price prediction and risk assessment.
– Weather Prediction: Ensemble methods combining CNNs and RNNs have im-
proved the accuracy of short-term and long-term weather forecasts.
– Energy Consumption Forecasting: ANNs help in predicting and optimizing energy
usage in smart grid systems.
• Recommendation Systems
– Collaborative Filtering: Matrix Factorization techniques using Neural Networks

have improved the accuracy of user-item interaction predictions.
– Content-Based Filtering: Deep learning models can extract complex features
from user profiles and item descriptions for more accurate recommendations.
• Gaming and Artificial Intelligence

– Game Playing Agents: Deep Reinforcement Learning, as demonstrated in Al-
phaGo and OpenAI Five, has produced AI capable of superhuman performance
in complex games.
– Procedural Content Generation: GANs and other generative models can create
game assets, levels, and even entire game worlds.
• Robotics
– Motion Control : Deep Reinforcement Learning enables robots to learn complex
motor skills and adapt to new environments.
– Computer Vision for Robotics: CNNs provide visual perception capabilities for
object recognition, grasping, and navigation.
– Human-Robot Interaction: Multimodal neural networks facilitate natural lan-
guage understanding and generation for more intuitive human-robot communi-
cation.
• Healthcare and Disease Diagnosis

– Medical Imaging Analysis: CNNs assist in detecting anomalies in X-rays, MRIs,
and CT scans, enhancing early disease detection.
– Drug Discovery: Recurrent Neural Networks and Graph Neural Networks acceler-
ate the process of identifying potential new drugs and predicting their properties.
23
– Personalized Medicine: ANNs analyze genetic data and patient histories to rec-
ommend tailored treatment plans.
• Financial Modeling and Market Prediction
– Algorithmic Trading: RNNs and LSTM networks analyze market trends and
execute high-frequency trading strategies.
– Credit Scoring: Multilayer Perceptrons (MLPs) assess credit risk by analyzing
various financial and personal data points.
– Fraud Detection: Anomaly detection using autoencoders helps identify unusual
patterns indicative of fraudulent activities.
9.2 Artificial Neural Networks and Multilayer Perceptrons

9.2.1 Strengths and Capabilities of Artificial Neural Networks
• Non-linear Function Approximation: ANNs excel at modeling complex, non-
linear relationships in high-dimensional data.
• Adaptive Learning: Through iterative training, ANNs can improve their perfor-
mance on specific tasks.
• Generalization Capability: Well-trained ANNs can effectively generalize to unseen
data.
9.2.2 Challenges and Constraints in ANN Implementation

• Data Dependency: ANNs typically require large datasets for effective training.
• Computational Complexity: Training deep ANNs can be computationally inten-
sive and time-consuming.
• Interpretability Challenges: The decision-making process in complex ANNs can

be difficult to interpret and explain.
9.2.3 Theoretical and Practical Advantages of Multilayer Perceptrons

• Universal Function Approximation: MLPs with sufficient hidden units can ap-
proximate any continuous function on compact subsets of Rn .
• Hierarchical Feature Learning: Deep MLPs can learn hierarchical representations

of input data.
9.2.4 Critical Limitations in MLP Architecture and Training

• Vanishing Gradient Problem: Deep MLPs can suffer from vanishing gradients
during training, impeding learning in early layers.
• Hyperparameter Sensitivity: MLP performance is sensitive to architectural choices
and hyperparameter settings.
24
9.3 Conclusion
Artificial Neural Networks, particularly Multi-Layer Perceptrons, have emerged as powerful
tools in the machine learning landscape. Their ability to model complex, non-linear re-
lationships has led to breakthrough performance in various domains. However, challenges
remain in terms of interpretability, data requirements, and computational complexity. Fu-
ture research directions may focus on addressing these limitations while further enhancing
the capabilities of these versatile models.
25
References
[1] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT press.
[2] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.
[3] Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural net-
works, 61, 85-117.
[4] Vaswani, A., et al. (2017). Attention is all you need. In Advances in neural information
processing systems (pp. 5998-6008).
26

Essential Concept in Artificial Neural Networks

Uploaded by

Copyright:

Available Formats

Essential Concept in Artificial Neural Networks

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Essential Concept in Artificial Neural Networks

Uploaded by

Copyright:

Available Formats

Essential Concepts in

ARTIFICIAL NEURAL NETWORKS

September 11, 2024

Prepared by: Huy Tran Minh, Khang Vo Hoang Nhat

Ho Chi Minh City University of Technology

2 Architectural Framework of Artificial Neural Networks 3

5 Activation Functions in Neural Networks 7

6 Neural Network Learning Algorithms 12

7 Neural Network Classification: A Concrete Example 16

8 Multi-Layer Perceptron (MLP) 20

9 Artificial Neural Networks: An Overview 22

2 Architectural Framework of Artificial Neural Networks

2.1 Elemental Neuron Structure

Figure 1: Schematic representation of an elementary artificial neuron

2.2 Fully Connected Network Architecture

Input Layer Hidden Layer Output Layer

Figure 2: Architectural diagram of a fully connected neural network

3.1 Matrix Representation

• b signifies the bias vector (dimension: m × 1)

3.2 Expanded Matrix Representation

Figure 3: Schematic of the example neural network

4.1 Weight Matrices, Input Vector and Bias Vectors

Table 1: Weight matrices for hidden and output layers

Table 2: Input vector

Table 3: Hidden layer bias vector

b (Output Layer Bias)

Table 4: Output layer bias

4.2 Calculation of Activation Values

zH1 = (1 · 1) + (2 · 2) + (1 · 3) + (2 · (−1)) + bH1

zH2 = (3 · 1) + (1 · 2) + (4 · 3) + (0 · (−1)) + bH2

zH3 = (5 · 1) + (0 · 2) + (5 · 3) + (3 · (−1)) + bH3

4.2.2 Output Layer Activation

zO = (1 · 6.1) + (2 · 17.2) + (3 · 17.3) + bO

This computational example demonstrates the forward propagation process through a

5.1 Sigmoid Function

5.1.1 Properties and Characteristics

• Continuous and differentiable

5.2 Hyperbolic Tangent (Tanh) Function

5.2.1 Properties and Characteristics

5.3 Rectified Linear Unit (ReLU) Function

ReLU(x) = max(0, x) (9)

5.3.1 Properties and Characteristics

2. Deep Feedforward Networks: ReLU is commonly used in hidden layers of deep

Figure 6: ReLU function: ReLU(x) = max(0, x)

5.4 Softmax Function

• Differentiability: The Softmax function is differentiable, making it suitable for use

3. Recommendation Systems: Softmax can be used in recommendation systems to

Softmax probability 0.5

6 Neural Network Learning Algorithms

This section presents a comprehensive overview of the fundamental learning algorithms

6.1 Forward Propagation

1. Hidden Layer Input Computation:

z(l) = W(l) a(l−1) + b(l) (11)

a(l) = f (z(l) ) (12)

where f is a non-linear activation function, commonly ReLU for hidden layers.

where L denotes the output layer.

Figure 8: Schematic representation of forward propagation in a neural network

3. Compute the gradients for weights and biases:

Forward fθ : Neural network