Essential Concept in Artificial Neural Networks
1 Introduction
Artificial Neural Networks (ANNs) are sophisticated computational models inspired by the intricate structure and function of biological neural networks. These models are designed to emulate the information processing capabilities of the human nervous system. The fundamental unit of an ANN is the artificial neuron, analogous to a biological neuron, which is capable of receiving, processing, and transmitting information within the network architecture.
[Figure: an artificial neuron. Inputs x1, x2, x3 are multiplied by weights w1, w2, w3 and summed with a bias b to give Σ wi xi + b, which is passed through an activation function f to produce the output y = f(Σ wi xi + b).]
[Figure: a feedforward network with inputs x1, x2, x3, hidden neurons h1, h2, h3, and a single output y.]
3 Mathematical Formulation
z = Wx + b (1)
Where:
• x represents the input vector (dimension: n × 1)
• W denotes the weight matrix (dimension: m × n)
• b denotes the bias vector (dimension: m × 1)
• z is the resulting pre-activation vector (dimension: m × 1)
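As a minimal illustration of Equation (1), the following NumPy sketch computes the pre-activation vector for a single layer; the shapes and values are borrowed from the numerical example in Section 7.4 and are not meant as a definitive implementation.

```python
import numpy as np

# Example dimensions: m = 3 neurons in the layer, n = 2 inputs
W = np.array([[0.2, -0.3],
              [0.4,  0.1],
              [-0.5, 0.2]])      # weight matrix, shape (m, n)
b = np.array([0.1, -0.1, 0.2])   # bias vector, shape (m,)
x = np.array([0.6, -0.4])        # input vector, shape (n,)

# Equation (1): z = Wx + b
z = W @ x + b
print(z)                         # pre-activation vector, shape (m,)
```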
4 Computational Example
Consider a neural network with 4 input neurons, 3 hidden neurons, and 1 output neuron:
[Figure: the example network, with input layer neurons I1–I4, hidden layer neurons H1–H3, and a single output layer neuron O.]
The connection weights are listed below (rows: source neurons, columns: target neurons; "-" marks pairs that are not connected):

        H1   H2   H3    O
  I1     1    3    5    -
  I2     2    1    0    -
  I3     1    4    5    -
  I4     2    0    3    -
  H1     -    -    -    1
  H2     -    -    -    2
  H3     -    -    -    3
The input vector is x = [1, 2, 3, −1]^T (i.e., x1 = 1, x2 = 2, x3 = 3, x4 = −1).
The hidden layer biases are b = [0.1, 0.2, 0.3]^T (i.e., b1 = 0.1, b2 = 0.2, b3 = 0.3).
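The following sketch assembles these numbers into matrices and evaluates z = Wx + b for the hidden layer. Since this excerpt does not specify which activation function the example uses, the output neuron is computed here with identity activations and no output bias, purely for illustration.

```python
import numpy as np

# Input-to-hidden weights: rows = hidden neurons H1-H3, columns = inputs I1-I4
W1 = np.array([[1, 2, 1, 2],    # weights into H1
               [3, 1, 4, 0],    # weights into H2
               [5, 0, 5, 3]])   # weights into H3
b1 = np.array([0.1, 0.2, 0.3])  # hidden layer biases

# Hidden-to-output weights (H1, H2, H3 -> O)
W2 = np.array([1, 2, 3])

x = np.array([1, 2, 3, -1])     # input vector

z_hidden = W1 @ x + b1          # hidden layer pre-activations
print(z_hidden)                 # [ 6.1 17.2 17.3]

# Assumption: identity activation and no output bias (not specified in this excerpt)
a_hidden = z_hidden
y = W2 @ a_hidden
print(y)                        # 92.4
```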
5 Activation Functions in Neural Networks
Activation functions play a crucial role in determining the output of neurons in artificial neural networks. This section examines four fundamental activation functions: the Sigmoid, Hyperbolic Tangent (Tanh), Rectified Linear Unit (ReLU), and Softmax functions, elucidating their mathematical properties, applications, and significance in various machine learning paradigms.
5.1.2 Applications
1. Binary Classification: The Sigmoid function is extensively utilized in logistic regression models for binary classification tasks. Its output can be interpreted as a probability, making it particularly suitable for problems requiring probabilistic predictions.
2. Output Layer Activation: In multi-layer perceptrons, the Sigmoid function is often
employed in the output layer for binary classification tasks or when the target variable
is bounded between 0 and 1.
3. Gradient-Based Learning: The Sigmoid function’s differentiability facilitates gradient-based optimization techniques in neural network training.
5.1.3 Limitations
• Vanishing Gradient Problem: For inputs with large absolute values, the gradient
of the Sigmoid function approaches zero, potentially impeding learning in deep neural
networks.
• Non-Zero Centered: The Sigmoid function’s output is not centered around zero,
which can introduce difficulties in subsequent layers of deep networks.
Figure 4: Sigmoid function σ(x) = 1 / (1 + e^{−x})
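As a brief illustration of the vanishing-gradient limitation noted above, the sketch below evaluates the sigmoid and its derivative σ'(x) = σ(x)(1 − σ(x)) at a few points; the sample inputs are arbitrary.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(xs))       # outputs lie in (0, 1) and are not zero-centered
print(sigmoid_grad(xs))  # gradients near 0 for large |x|: the vanishing gradient problem
```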
tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})    (8)
• Zero-centered
5.2.2 Applications
1. Hidden Layer Activation: The Tanh function is frequently employed as an activation function in hidden layers of neural networks, particularly in architectures such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks.
2. Feature Scaling: Due to its zero-centered nature, the Tanh function can effectively
normalize input features, facilitating faster convergence during training.
3. Signal Processing: In signal processing applications, the Tanh function is used for
its ability to handle bipolar signals effectively.
5.2.3 Advantages over Sigmoid
• Zero-Centered Output: The Tanh function’s output is centered around zero, which
can help mitigate certain optimization issues in deep networks.
• Steeper Gradient: The Tanh function has a steeper gradient compared to the Sigmoid function, which can lead to faster learning in some scenarios.
Figure 5: Tanh function tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})
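A small sketch of the two advantages listed above: tanh outputs are centered around zero, and its gradient 1 − tanh²(x) is steeper at the origin than the sigmoid’s; the sample inputs are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])

# Zero-centered output: tanh values are symmetric around 0, sigmoid values are not.
print(np.tanh(xs))   # range (-1, 1)
print(sigmoid(xs))   # range (0, 1)

# Gradients at the origin: tanh'(0) = 1, sigmoid'(0) = 0.25, so tanh is steeper.
print(1.0 - np.tanh(0.0) ** 2)              # 1.0
print(sigmoid(0.0) * (1.0 - sigmoid(0.0)))  # 0.25
```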
5.3.2 Advantages
• Mitigation of Vanishing Gradient: ReLU effectively addresses the vanishing gra-
dient problem in deep neural networks, facilitating the training of deeper architectures.
• Computational Efficiency: The simplicity of the ReLU function allows for faster
computation compared to sigmoid or tanh functions.
• Non-saturation: Unlike sigmoid and tanh, ReLU does not saturate for positive
inputs, allowing for continued learning.
5.3.3 Applications
1. Convolutional Neural Networks (CNNs): ReLU is extensively employed in
CNNs for computer vision tasks such as image classification, object detection, and
semantic segmentation.
[Figure: the ReLU function, ReLU(x) = max(0, x).]
5.4.1 Properties and Characteristics
• Outputs a probability distribution (sum of outputs equals 1)
• Range: (0, 1) for each output
• Differentiable
• Preserves relative order of inputs
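The properties listed above can be checked directly with a short sketch; the implementation below subtracts the maximum before exponentiating, a common trick for numerical stability, and the input vector is arbitrary.

```python
import numpy as np

def softmax(z):
    """Softmax: exp(z_i) / sum_j exp(z_j), shifted by max(z) for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)

print(p)                             # each entry lies in (0, 1)
print(p.sum())                       # outputs sum to 1 (a probability distribution)
print(np.argsort(z), np.argsort(p))  # relative order of inputs is preserved
```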
5.4.2 Advantages
• Probabilistic Interpretation: Softmax provides a clear probabilistic interpretation
of the model’s predictions, which is crucial in many classification tasks.
5.4.3 Applications
1. Multi-class Classification: Softmax is widely used in the output layer of neural
networks for multi-class classification tasks, such as image classification or document
categorization.
2. Natural Language Processing: In tasks like named entity recognition, part-of-
speech tagging, and sentiment analysis, Softmax is employed to classify words or
sentences into multiple categories.
Figure 7: Softmax function for three classes x1, x2, and x3 = 0. The z-axis represents the probability of class 1.
z^(l) = W^(l) a^(l−1) + b^(l)

where z^(l) is the input to layer l, W^(l) is the weight matrix, a^(l−1) is the activation from the previous layer, and b^(l) is the bias vector.

2. Activation Function Application: The output of layer l is obtained by applying the activation function element-wise, a^(l) = f(z^(l)).
3. Output Layer Computation: For the output layer, a different activation function
may be applied, such as softmax for multi-class classification:
ŷ = softmax(z^(L)) = e^{z^(L)} / Σ_j e^{z_j^(L)}    (13)
[Diagram: forward propagation. The input layer feeds the hidden layer(s) through the parameters W^(1), b^(1), …, W^(L), b^(L); applying the output activation f to z^(L) yields ŷ.]
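To make the forward pass concrete, here is a minimal NumPy sketch of the procedure described above: each hidden layer applies z^(l) = W^(l) a^(l−1) + b^(l) followed by ReLU, and the output layer applies softmax as in Equation (13). The layer sizes and random parameters are illustrative assumptions, not values from the text.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def forward(x, weights, biases):
    """Forward propagation: ReLU hidden layers, softmax output layer."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)              # hidden layer: z = Wa + b, then ReLU
    z_out = weights[-1] @ a + biases[-1]
    return softmax(z_out)                # output layer: Equation (13)

# Illustrative 4-3-2 network with random parameters (assumed, not from the text)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
biases = [np.zeros(3), np.zeros(2)]

x = np.array([1.0, 2.0, 3.0, -1.0])
print(forward(x, weights, biases))       # class probabilities summing to 1
```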
6.2 Backpropagation
Backpropagation is a gradient-based learning algorithm that computes the gradient of the
loss function with respect to the network parameters. This process enables efficient weight
updates to minimize the prediction error.
The backpropagation algorithm proceeds as follows:
1. Compute the error at the output layer:
δ^(L) = ∇_a L ⊙ f'(z^(L))    (14)
where L is the loss function, ⊙ denotes element-wise multiplication, and f ′ is the
derivative of the activation function.
2. Propagate the error backwards through the network:
δ^(l) = ((W^(l+1))^T δ^(l+1)) ⊙ f'(z^(l))    (15)
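A compact sketch of Equations (14)–(15) for a two-layer network with sigmoid activations and a squared-error loss gradient ∇_a L = (a − y); the shapes and parameter values are illustrative assumptions, and the weight gradients δ^(l) (a^(l−1))^T are included to show how the deltas are used.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Illustrative 2-3-1 network (assumed shapes and values, not from the text)
W1, b1 = np.array([[0.2, -0.3], [0.4, 0.1], [-0.5, 0.2]]), np.array([0.1, -0.1, 0.2])
W2, b2 = np.array([[0.5, 0.3, -0.2]]), np.array([0.05])
x, y = np.array([0.6, -0.4]), np.array([1.0])

# Forward pass (sigmoid activations throughout, for differentiability)
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# Equation (14): output-layer error, with dL/da = (a2 - y) for squared error
delta2 = (a2 - y) * sigmoid_prime(z2)

# Equation (15): propagate the error backwards through the network
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)

# Gradients of the loss with respect to the weights and biases
grad_W2 = np.outer(delta2, a1); grad_b2 = delta2
grad_W1 = np.outer(delta1, x);  grad_b1 = delta1
print(grad_W1, grad_b1, grad_W2, grad_b2, sep="\n")
```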
[Diagram: training loop. The input data X ∈ R^{n×d} (n: batch size, d: feature dimension, R: set of real numbers) is passed through the model f_θ(X); the loss L(y, ŷ) and gradient ∇_θ L are computed, and the loop terminates when t > T_max, L < ϵ, or ∥∇_θ L∥_2 < δ.]
6.3 Weight Update Mechanisms
The optimization of neural network parameters is typically achieved through iterative weight
update procedures. We present three prominent optimization algorithms: Gradient Descent
(GD), Gradient Descent with Momentum, and Adam.
where v(l) is the velocity vector for layer l, and β is the momentum coefficient.
where m(l) and v(l) are the first and second moment vectors respectively, β1 and β2 are
decay rates for the moment estimates, and ϵ is a small constant for numerical stability.
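The update rules themselves are not reproduced in this excerpt, so the sketch below shows standard forms of the three optimizers named above (plain gradient descent, gradient descent with momentum, and Adam) for a single parameter tensor; the hyperparameter values are conventional defaults, not taken from the text.

```python
import numpy as np

def gd_step(w, grad, lr=0.01):
    """Plain gradient descent: w <- w - lr * grad."""
    return w - lr * grad

def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """Gradient descent with momentum: the velocity v accumulates past gradients."""
    v = beta * v + grad
    return w - lr * v, v

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: first moment m, second moment v, with bias correction at step t."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy usage on a single weight matrix with an arbitrary gradient
w = np.zeros((2, 2)); grad = np.ones((2, 2))
w_gd = gd_step(w, grad)
w_mom, vel = momentum_step(w, np.zeros_like(w), grad)
w_adam, m, v = adam_step(w, np.zeros_like(w), np.zeros_like(w), grad, t=1)
print(w_gd, w_mom, w_adam, sep="\n")
```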
6.4.1 Cross-Entropy Loss
For binary classification, the cross-entropy loss is defined as:
L(y, ŷ) = −(1/N) Σ_{i=1}^{N} [y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i)]    (24)
where N is the number of samples, yi is the true label, and ŷi is the predicted probability
for the i-th sample.
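Equation (24) translates directly into a few lines of NumPy; the clipping of predictions away from exactly 0 and 1 is an added numerical-stability precaution, and the sample labels and predictions are arbitrary.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Equation (24): mean binary cross-entropy over N samples."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1, 1])                 # true labels
y_hat = np.array([0.9, 0.2, 0.6, 0.557])   # predicted probabilities
print(binary_cross_entropy(y, y_hat))
```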
[Plot: cross-entropy loss as a function of the predicted probability ŷ, shown for y_i = 1 and y_i = 0.]
This section presents a detailed analysis of a simple artificial neural network (ANN) designed
for binary classification. We elucidate the network’s structure and functionality through a
step-by-step computational example.
[Plot: MSE loss as a function of the predicted value ŷ, shown for y_i = 1.]
7.2 Network Architecture
The proposed neural network comprises:
• An input layer with two neurons, corresponding to the two input features.
• A hidden layer with three neurons, each applying the ReLU activation function.
• An output layer with one neuron, utilizing the sigmoid activation function to produce a probability estimate for class membership.
[Figure: the 2–3–1 network, with inputs x1, x2, hidden neurons h1, h2, h3, and output y.]
7.4 Numerical Example
To illustrate the forward propagation process, we provide a concrete example with the
following network parameters:
W^(1) = [  0.2  −0.3
           0.4   0.1
          −0.5   0.2 ],    b^(1) = [ 0.1, −0.1, 0.2 ]^T

Consider an input vector x = [0.6, −0.4]^T. We now proceed through the forward propagation steps:
7.4.2 Step 2: Hidden Layer Activation

a^(1) = ReLU(z^(1)) = max(0, z^(1)) = [0.34, 0.20, 0]^T

7.4.3 Step 3: Output Layer Input

z^(2) = 0.23
7.4.4 Step 4: Output Layer Activation
ŷ = σ(z^(2)) = 1 / (1 + e^{−0.23}) ≈ 0.557
7.5 Interpretation
The output value of 0.557 represents the network’s estimated probability that the input
x = [0.6, −0.4]T belongs to class 1. This example demonstrates how a simple neural network
processes input data to produce a classification probability through forward propagation.
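For completeness, the following sketch runs the same 2–3–1 forward pass in code. W^(1), b^(1), and x are taken from the example above; the output-layer parameters are not listed in this excerpt, so W^(2) and b^(2) below are hypothetical placeholders, and the printed result will match the text's 0.557 only if the original values are substituted.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Parameters from the example above
W1 = np.array([[0.2, -0.3],
               [0.4,  0.1],
               [-0.5, 0.2]])
b1 = np.array([0.1, -0.1, 0.2])

# Hypothetical output-layer parameters (not listed in this excerpt)
W2 = np.array([[0.5, 0.3, -0.2]])
b2 = np.array([0.05])

x = np.array([0.6, -0.4])

z1 = W1 @ x + b1          # Step 1: hidden layer input
a1 = relu(z1)             # Step 2: hidden layer activation
z2 = W2 @ a1 + b2         # Step 3: output layer input
y_hat = sigmoid(z2)       # Step 4: output layer activation
print(z1, a1, z2, y_hat, sep="\n")
```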
The Multi-Layer Perceptron (MLP) is a fully connected feedforward neural network comprising multiple layers of artificial neurons. MLPs are widely employed in supervised learning tasks, particularly in classification and regression problems.
[Figure: an MLP with inputs x1, x2, x3, hidden neurons h1, h2, h3, and output y.]
1. Input Layer: Neurons in this layer correspond to the features of the input data.
2. Hidden Layers: One or more layers that perform non-linear transformations on the
input data.
3. Output Layer: Produces the final output of the network, often utilizing activation
functions tailored to the specific task (e.g., softmax for multi-class classification).
8.2 Mathematical Formulation of MLP
8.2.1 Forward Propagation
The forward propagation process in an MLP can be mathematically described as follows:
1. Input to Hidden Layer: For a hidden layer l, the pre-activation output is computed
as:
z^(l) = W^(l) a^(l−1) + b^(l)    (26)

where W^(l) ∈ R^{n_l × n_{l−1}} is the weight matrix, a^(l−1) ∈ R^{n_{l−1}} is the output of the previous layer, and b^(l) ∈ R^{n_l} is the bias vector.
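A brief sketch of the dimension bookkeeping in Equation (26): for each layer, W^(l) must have shape (n_l, n_{l−1}) and b^(l) shape (n_l,). The layer sizes and ReLU hidden activation used here are illustrative assumptions.

```python
import numpy as np

# Illustrative layer sizes: n_0 = 4 inputs, n_1 = 3 hidden units, n_2 = 1 output
layer_sizes = [4, 3, 1]
rng = np.random.default_rng(0)

# W^(l) has shape (n_l, n_{l-1}); b^(l) has shape (n_l,)
params = [(rng.normal(size=(n_l, n_prev)), np.zeros(n_l))
          for n_prev, n_l in zip(layer_sizes[:-1], layer_sizes[1:])]

a = np.array([1.0, 2.0, 3.0, -1.0])    # a^(0) = x, shape (n_0,)
for W, b in params:
    z = W @ a + b                      # Equation (26)
    a = np.maximum(0.0, z)             # ReLU hidden activation (illustrative choice)
    print(W.shape, b.shape, z.shape)
```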
9 Artificial Neural Networks: An Overview
The human brain serves as the primary inspiration for neural network architecture. Human
brain cells, also known as neurons, form a complex, highly interconnected network that
transmits electrical signals to facilitate information processing. Analogously, an artificial
neural network (ANN) is composed of artificial neurons that collaborate to solve problems.
Artificial neurons are software modules, referred to as nodes, and artificial neural networks are software programs or algorithms that ultimately rely on computer systems to carry out their computations. These networks mimic the biological neural structure in their design and function.
The key similarities between biological and artificial neural networks include:
• Speech Recognition and Synthesis
– Automatic Speech Recognition (ASR): Deep Neural Networks, particularly Long
Short-Term Memory (LSTM) networks, have significantly improved the accuracy
of speech-to-text systems.
– Text-to-Speech (TTS): WaveNet and Tacotron architectures enable the generation
of highly natural and expressive synthetic speech.
• Time Series Prediction
– Financial Forecasting: LSTM networks and Temporal Convolutional Networks
(TCNs) are employed for stock price prediction and risk assessment.
– Weather Prediction: Ensemble methods combining CNNs and RNNs have improved the accuracy of short-term and long-term weather forecasts.
– Energy Consumption Forecasting: ANNs help in predicting and optimizing energy
usage in smart grid systems.
• Recommendation Systems
– Personalized Medicine: ANNs analyze genetic data and patient histories to recommend tailored treatment plans.
• Financial Modeling and Market Prediction
– Algorithmic Trading: RNNs and LSTM networks analyze market trends and
execute high-frequency trading strategies.
– Credit Scoring: Multilayer Perceptrons (MLPs) assess credit risk by analyzing
various financial and personal data points.
– Fraud Detection: Anomaly detection using autoencoders helps identify unusual
patterns indicative of fraudulent activities.
9.3 Conclusion
Artificial Neural Networks, particularly Multi-Layer Perceptrons, have emerged as powerful tools in the machine learning landscape. Their ability to model complex, non-linear relationships has led to breakthrough performance in various domains. However, challenges remain in terms of interpretability, data requirements, and computational complexity. Future research directions may focus on addressing these limitations while further enhancing the capabilities of these versatile models.