
Computer Vision and

Pattern Recognition
CS 4243
S1-Y2024/25

1
Lesson 6 - Part 1
Computer Vision and Deep Learning

100 billion CPUs, 500 trillion connections, OMG!!!

2
ARTIFICIAL NEURAL NETWORKS,
HISTORY

HTTPS://MEDIUM.COM/ANALYTICS-VIDHYA/BRIEF-HISTORY-OF-NEURAL-NETWORKS-44C2BF72EEC

3
LET'S DIVE INTO IT

Use Jupyter Notebook / Anaconda / Python

You need the TensorFlow package too

Program: ann2.ipynb

This is a function-estimation example

ANN2.IPYNB
4
ANN EXAMPLE: A SIMPLE ADDER

o = i1·w1 + i2·w2

• w1 and w2 are selected randomly, e.g. 0.3 and −0.7.
• Then we will try to find the best w1 and w2 to have the output equal (or close enough) to the O values of the training data samples.
• Iteratively.
• Learning by samples/examples.

Training data:
i1  i2  O
1   2   3
1   3   4
1   1   2
3   3   6
7   1   8
8   1   9

5
ANN EXAMPLE: A SIMPLE ADDER

o = i1·w1 + i2·w2

Training our adder:
E,S   i1,i2   w1    w2    Or    error   Δw1   Δw2
1,1   1,2     0.3  −0.7  −1.1   4.1     +     +
1,2   1,3     0.4  −0.6  −1.4   5.4     +     +
1,3   1,1     0.5  −0.5   0     2       +     +
1,4   3,3     0.6  −0.4   0.6   5.4     +     +
…     …
N,M   1,1    ~1    ~1    ~2    ~0       0     0

Real targets (Ot):
i1  i2  Ot
1   2   3
1   3   4
1   1   2
3   3   6
7   1   8
8   1   9

6
HOW IT WORKS?
Delta Rule

Or = Σ_j i_j w_j
w_j ← w_j + Δw_j
Δw_j = η (Ot − Or) i_j
η = learning rate
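A minimal Python sketch of this delta-rule training applied to the adder data above; this is illustrative, not the code in ann2.ipynb, and the learning rate and number of epochs are assumed values:

```python
import numpy as np

# Training pairs from the adder table: inputs (i1, i2) and target outputs Ot
samples = np.array([[1, 2], [1, 3], [1, 1], [3, 3], [7, 1], [8, 1]], dtype=float)
targets = np.array([3, 4, 2, 6, 8, 9], dtype=float)

w = np.array([0.3, -0.7])   # randomly chosen initial weights, as on the slide
eta = 0.01                  # learning rate (an assumed value)

for epoch in range(500):
    for x, Ot in zip(samples, targets):
        Or = x @ w                      # Or = sum_j i_j * w_j
        w += eta * (Ot - Or) * x        # delta rule: dw_j = eta * (Ot - Or) * i_j

print(w)   # both weights approach ~1, so the network behaves as an adder
```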

7
ANN EXAMPLE: A SIMPLE
SUBTRACTOR

o = i1·w1 + i2·w2

• w1 and w2 are selected randomly, e.g. −0.4 and 0.6.
• Learning by examples and iterations once again.
• New training samples mean new functionality.

Training data:
i1  i2  O
4   1   3
5   1   4
3   2   1
3   3   0
7   0   7
8   2   6

8
ANN EXAMPLE: A SIMPLE
SUBTRACTOR
o = i1·w1 + i2·w2

Training our subtractor:
E,S   i1,i2   w1     w2    Or    error   Δw1   Δw2
1,1   4,1    −0.4    0.6  −1     4       +     +
1,2   5,1    −0.3    0.7  −0.8   4.8     +     +
1,3   3,2    −0.2    0.8   1     0       0     0
1,4   3,3    −0.2    0.8   1.8  −1.8     −     −
…     …
N,M   8,2    ~1     ~−1   ~6    ~0       0     0

Training targets:
i1  i2  O
4   1   3
5   1   4
3   2   1
3   3   0
7   0   7
8   2   6

9
HOW IT WORKS?

Inverse matrix scheme


In matrix format: O = I W
⇒ W = I⁻¹ O
We most likely need to compute the pseudo-inverse.
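A small numpy sketch of this inverse-matrix view, assuming the six adder training pairs from the earlier slide; since I is not square, the pseudo-inverse (or least squares) gives the solution:

```python
import numpy as np

# Stack the training inputs into I (one row per sample) and the targets into O
I = np.array([[1, 2], [1, 3], [1, 1], [3, 3], [7, 1], [8, 1]], dtype=float)
O = np.array([3, 4, 2, 6, 8, 9], dtype=float)

# O = I W  =>  W = pinv(I) O   (I is not square, so we need the pseudo-inverse)
W = np.linalg.pinv(I) @ O
print(W)          # ~[1, 1] for the adder data

# Equivalent least-squares formulation
W_ls, *_ = np.linalg.lstsq(I, O, rcond=None)
```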

10
PERCEPTRON
Two implementations of a single-neuron Perceptron, able to do any linearly separable 2-input, 1-output logical mapping.

AND example, with w1 = w2 = 1, th = 1.5:
i1  i2  O = i1 · i2 (AND)
0   0   0
0   1   0
1   0   0
1   1   1

11
PERCEPTRON
We may try a Perceptron network to materialize a multi-input / multi-output function.
An example is a 1-out-of-c maximum classifier.
In many cases, it is more accurate than having a single output, in particular when there are many classes.

out_j = sign( Σ_{i=1..n} in_i w_ij − θ_j )
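A small numpy sketch of this layer, with made-up weights and thresholds for illustration (not from the course notebooks); the 1-out-of-c decision picks the output with the largest activation:

```python
import numpy as np

def perceptron_outputs(x, W, theta):
    # out_j = sign(sum_i x_i w_ij - theta_j)
    return np.sign(x @ W - theta)

def predict_class(x, W, theta):
    # 1-out-of-c maximum classifier: pick the strongest output
    return int(np.argmax(x @ W - theta))

# Toy example: 2 inputs, 3 output neurons (assumed values)
W = np.array([[ 0.8, -0.3, 0.1],
              [-0.2,  0.9, 0.4]])
theta = np.array([0.5, 0.5, 0.5])
x = np.array([1.0, 0.0])
print(perceptron_outputs(x, W, theta), predict_class(x, W, theta))
```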

12
PERCEPTRON

NN_EXAM4_PERCEPTRON.IPYNB , PERCEPTRON.XLSX
13
PRACTICE: TRAIN IT AS AN ‘OR’

o = sign(x) = 1 if x ≥ 0, 0 if x < 0

What are w1, w2, and b?

i1  i2  O = i1 ∨ i2 (OR)
0   0   0
0   1   1
1   0   1
1   1   1

14
PRACTICE: TRAIN IT AS AN ‘XOR’

o = sign(x) = 1 if x ≥ 0, 0 if x < 0

What are w1, w2, and b?

i1  i2  O = i1 xor i2
0   0   0
0   1   1
1   0   1
1   1   0

15
PERCEPTRON/ SINGLE NEURON
DISADVANTAGES: XOR

o = sign(x) = 1 if x ≥ 0, 0 if x < 0

To deal with such problems, a Perceptron needs a multi-neuron hidden layer with a non-linear activation function.

i1  i2  O = i1 xor i2
0   0   0
0   1   1
1   0   1
1   1   0

16
MULTI LAYERED PERCEPTRON AND
XOR
• There is no way to train a single-layered perceptron to carry out non-linear tasks.
• We need a few things to guarantee non-linear behavior:
  • At least 1 hidden layer of neurons between the input and output layers
  • With at least 2 neurons
  • With a non-linear activation function
• Next we need a good training algorithm, e.g. the Error Back-Propagation algorithm.
• Then we can have a fantastic non-linear system … (see the sketch below).

[Figure: the XOR problem]
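A hedged Keras sketch of the smallest such network for XOR: one hidden layer with 2 neurons and a non-linear activation, trained by error back-propagation. Layer sizes, optimizer, and epoch count are illustrative choices, not prescribed by the slides:

```python
import numpy as np
import tensorflow as tf

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([[0], [1], [1], [0]], dtype=np.float32)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(2, activation='tanh', input_shape=(2,)),  # hidden layer: >= 2 neurons, non-linear
    tf.keras.layers.Dense(1, activation='sigmoid')                  # output layer
])
model.compile(optimizer=tf.keras.optimizers.Adam(0.1), loss='mse')
model.fit(X, y, epochs=500, verbose=0)   # back-propagation does the training
print(model.predict(X).round())          # should approach [0, 1, 1, 0]
# Note: with only 2 hidden neurons, training can occasionally get stuck in a
# local minimum; re-running (re-initializing the weights) usually fixes it.
```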

17
MULTI LAYERED PERCEPTRON AND
XOR

A multi-layered perceptron: multi-line borders, or using partial lines to estimate functions.

i1  i2  O = i1 xor i2
0   0   0
0   1   1
1   0   1
1   1   0

18
MULTILAYERED PERCEPTRON

Input, hidden, and output layers

• In a classification problem, an MLP, after proper training, can classify the samples using a boundary made of line pieces/segments.
• The number of line pieces depends on the number of hidden neurons, training samples, training epochs, and the training algorithm.

19
UNDERFITTING / OVERFITTING

[Figure: underfitting / good fit / overfitting]

• Underfitting: both the training and testing errors are high.
• Overfitting: the training error is low but the testing error is high.

20
HOW TO EMPLOY AN ANN

Select your training and test data → Configure the structure of your ANN → Determine your training algorithm → Train your network → Test and evaluate your ANN

21
ACTIVATION FUNCTIONS
1. Sigmoid Function (Logistic)
2. Hyperbolic Tangent
3. Step Function
4. ReLU (Rectified Linear Unit):  f(x) = x if x ≥ 0, 0 if x < 0
5. Piecewise Linear Function
6. Sign Function:  f(x) = −1 if x < 0, 0 if x = 0, 1 if x > 0

1, 2: invertible, differentiable; 1, 4: popular; 4, 5: partially differentiable.
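A short numpy sketch of these activations, following the definitions above (the piecewise-linear one is omitted since its breakpoints are not specified on the slide):

```python
import numpy as np

def sigmoid(x):            # 1: logistic function, invertible and differentiable
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):               # 2: hyperbolic tangent, invertible and differentiable
    return np.tanh(x)

def step(x):               # 3: step function, 1 for x >= 0 else 0
    return np.where(x >= 0, 1.0, 0.0)

def relu(x):               # 4: rectified linear unit, x for x >= 0 else 0
    return np.maximum(x, 0.0)

def sign(x):               # 6: sign function, -1 / 0 / +1
    return np.sign(x)

x = np.linspace(-3, 3, 7)
print(relu(x), sigmoid(x))
```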

22
ACTIVATION FUNCTIONS

23
TRAINING
• Vanishing Gradient Problem
• Not Zero-Centered
• Error/Loss/Cost Function

[Diagram labels:]
• Optimization: complicated and heavy
• Training and testing data: non-overlapped, complete, reflexive
• Weights and other parameters of the net
• Training algorithms are almost the most challenging part of neural computing.

24
ANN, HOW TO BUILD
Formulating neural network solutions for particular problems is a
multi-stage process:
1. Understand and specify your problem, inputs and outputs
2. Take the simplest form of network that might be able to solve your problem
3. Try to find appropriate connection weights, i.e. training, and other parameters
4. Evaluate training and test errors and measure under-/over-fitting
5. If the network doesn’t perform well enough, go back to stage 3
6. If the network still doesn’t perform well enough, go back to stage 2 and try
harder.
7. If the network still doesn’t perform well enough, go back to stage 1 and try
harder.
8. Problem solved – move on to next problem.

25
ANN TRAINING

26
ANN TRAINING

27
ERROR BACKPROPAGATION
ALGORITHM

An algorithm for supervised learning of artificial neural networks using gradient descent (itself a general optimization method).

WWW.TOWARDSDATASCIENCE.COM

28
ERROR BACKPROPAGATION
ALGORITHM
1. Forward pass/propagation: set the inputs, compute the outputs.
2. Evaluate the error signal for each layer.
3. Use the error signal to compute the error gradients.
4. Update the layer parameters using the error gradients with an optimization algorithm such as GD.

The last 3 steps form the back-propagation stages. (A minimal sketch follows.)
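A compact numpy sketch of those four stages for a one-hidden-layer network with sigmoid activations and squared error; variable names and the plain-GD update are my illustrative choices, not the course implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, t, W1, b1, W2, b2, eta=0.5):
    # 1. Forward pass: set the inputs, compute the outputs
    h = sigmoid(x @ W1 + b1)
    y = sigmoid(h @ W2 + b2)

    # 2. Evaluate the error signal for each layer (output first, then hidden)
    delta_out = (y - t) * y * (1 - y)             # derivative of squared error w.r.t. pre-activation
    delta_hid = (delta_out @ W2.T) * h * (1 - h)  # error signal propagated back to the hidden layer

    # 3. Use the error signals to compute the error gradients
    gW2, gb2 = np.outer(h, delta_out), delta_out
    gW1, gb1 = np.outer(x, delta_hid), delta_hid

    # 4. Update the layer parameters with an optimizer (plain gradient descent here)
    W2 -= eta * gW2; b2 -= eta * gb2
    W1 -= eta * gW1; b1 -= eta * gb1
    return W1, b1, W2, b2
```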

29
ERROR BACKPROPAGATION
ALGORITHM

Your Comments:

30
ERROR BACKPROPAGATION
ALGORITHM

Your Comments:

31
ERROR BACKPROPAGATION
ALGORITHM

Your Comments:

32
ADAM TRAINING ALGORITHM

• Introduced in 2015
• Adaptive Moment Estimation Algorithm
• Advantages:
  • Straightforward to implement.
  • Computationally efficient.
  • Little memory requirements.
  • Invariant to diagonal rescaling of the gradients.
  • Well suited for problems that are large in terms of data and/or parameters.
  • Appropriate for non-stationary objectives.
  • Appropriate for problems with very noisy and/or sparse gradients.
  • Hyper-parameters have intuitive interpretation and typically require little tuning.

D. KINGMA, J. BA, “ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION”, ICLR, 2015.

33
ADAM TRAINING ALGORITHM

• SGD maintains a single learning rate (termed α) for all weight updates, and the learning rate does not change during training.
• In ADAM, a learning rate is maintained for each network weight (parameter) and separately adapted as learning unfolds.
• ADAM computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients.
• ADAM combines ideas from the Adaptive Gradient Algorithm (AdaGrad) and Root Mean Square Propagation (RMSP). (See the sketch below.)
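A hedged numpy sketch of one Adam update for a parameter vector, following the update rule of the cited paper with its default hyper-parameters; this is illustrative, not the TensorFlow implementation:

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the time step, starting at 1."""
    # First and second moment estimates of the gradient (kept per parameter)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized moment estimates
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Each parameter gets its own effective step size: alpha / (sqrt(v_hat) + eps)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```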

34
ADAM TRAINING ALGORITHM
[Diagram: ADAM combines the Adaptive Gradient Algorithm (AdaGrad) and Root Mean Square Propagation (RMSP)]

35
ADAM TRAINING ALGORITHM
• Adam is effective and popular in the field of deep learning because it achieves good results fast.
• Comparison of Adam to other optimization algorithms training a multilayer perceptron (figure taken from “Adam: A Method for Stochastic Optimization”, 2015).

WWW.TOWARDSDATASCIENCE.COM

36
PRACTICAL CONSIDERATIONS

37
PRACTICAL CONSIDERATIONS

1. Sometimes, normalization, etc.

2. Randomly

3. Start with a larger η, then make it smaller in later epochs.

4. Online vs. batch training, and the batch size.

5. Recently, ReLU is more popular; formerly, Sigmoid was.

38
PRACTICAL CONSIDERATIONS

6. Try a momentum term; also try to avoid 0 and 1 as your outputs.

7. Learning rate and momentum.

8. Use the test set every now and then, or monitor the training error curve.

39
LEARNING RATE AND
MOMENTUM

40
LEARNING WITH MOMENTUM

• We simply add a momentum term, which is the weight change of the previous step times a momentum parameter α.
• If α is zero, then we have the standard online training algorithm used before.
• As we increase α towards 1, each step includes increasing contributions from previous training patterns.
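Written out (this is the standard form; a later slide adds a random term to this same rule):

```latex
\Delta w_t \;=\; -\,\eta\,\frac{\partial E}{\partial w} \;+\; \alpha\,\Delta w_{t-1}
```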

41
ANN PARAMETERS

42
ANN PARAMETERS

These 4 factors together increase/decrease the ANN's overfitting probability.

[Diagram labels: Number of training epochs · Number of hidden neurons · Training algorithm · Constraints · α · Degree of freedom]

43
ANN Parameters
• A rule in system science:
• To keep your system general (no overfitting), the
number of constraints should be at least k=4 times
bigger than the system’s degree of freedom.
• Question: What is the degree of freedom in an
MLP? What is the number of constraints?

44
Overfitting: System Eng Viewpoint

• There are 2 factors which determine the system's generality: the Degree of Freedom (Fo) and the Number of Constraints (#C).
• To avoid loss of generality in a system, #C must be k times more than Fo, k = 4, 5, …, 10:
  #C ≥ k·Fo
• In a neural network, #C is the number of training samples, and Fo is the number of amendable parameters (mostly weights and biases). (A small check of this rule is sketched below.)
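A small Python sketch of that check: count an MLP's weights and biases as Fo, take the number of training samples as #C, and test #C ≥ k·Fo. The layer sizes and sample count below are assumed example values:

```python
def degrees_of_freedom(layer_sizes):
    """Weights + biases of a fully connected MLP, e.g. [2, 4, 1]."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

layers = [2, 4, 1]               # example MLP: 2 inputs, 4 hidden neurons, 1 output
Fo = degrees_of_freedom(layers)  # 2*4+4 + 4*1+1 = 17 amendable parameters
num_samples = 100                # #C = number of training samples (assumed)
k = 4
print(Fo, num_samples >= k * Fo) # generality rule of thumb: #C >= k * Fo
```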

45
TRAINING ALGORITHM

• Evolutionary Learning
• Error Back-Propagation
• ADAM
• Limited-Memory BFGS
• Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm
• Conjugate Gradient and Scaled Conjugate Gradient algorithms

The best? Sadly, no rule of thumb!

46
HOW MANY HIDDEN NEURONS?
It depends in a complex way on many factors, including:

• The numbers of input and output units
• The number of training patterns
• The amount of noise in the training data
• The complexity of the function or classification to be learned
• The type of hidden unit activation function
• The training algorithm

47
DIFFERENT LEARNING RATES FOR
DIFFERENT LAYERS?

It is often quicker to just use the same rate η for all the weights and thresholds, rather than spending time trying to work out appropriate differences. A very powerful approach is to use evolutionary strategies to determine good learning rates.

48
PREVENTING UNDER-FITTING AND
OVER-FITTING
To prevent under-fitting we need to make sure that:
1. The network has enough hidden units to represent the required mappings.
2. We train the network for long enough so that the sum-squared-error cost function is sufficiently minimized.

49
PREVENTING UNDER-FITTING AND
OVER-FITTING

To prevent over-fitting we can:


1. Stop the training early – before it has had time to learn the training
data too well.
2. Restrict the number of adjustable parameters the network has, e.g.
by reducing the number of hidden units, or by forcing connections
to share the same weight values.
3. Add some form of regularization term to the error function to
encourage smoother network mappings.
4. Add noise to the training patterns to smear out the data points.

50
INTRODUCTION TO
Deep Learning
51
DEEP LEARNING

• Deeper is better than fatter !!!


• Deep learning is a class of machine
learning algorithms that uses
multiple layers to progressively
extract higher-level features from
the raw input.
• Most modern deep learning models are based on artificial neural networks, specifically Convolutional Neural Networks (CNNs).

WWW.WIKIPEDIA.COM

52
A POINT TO THINK ABOUT

53
IN THE REAL WORLD: DRIVERLESS
CARS

WWW.YOUTUBE.COM , MOBILE GEEKS


54
WHAT IS IT ALL ABOUT?
• We need an AI to classify objects/ entities/
samples for us.
• We know the class of each object.
• We extract the features of each object
• Feature: a number, quantity, or tag, which
represents (an aspect of) the object.
• Now, we need to train our AI-Classifier to show it how to classify.
• From now on, let's focus on convolutional neural networks (CNNs).

55
SO, WHAT IS DEEP LEARNING?

• We get closer to the biological


brain’s structure.
• We get closer to the mental
system’s process and
functionality.
• Therefore, we get closer to its
performance too.
• It is important, in particular, when
  • the data is semi-structured or unstructured, or
  • the problem at hand is hard to solve.

56
SO, WHAT IS DEEP LEARNING?

HTTPS://THENEWSTACK.IO/DEMYSTIFYING-DEEP-LEARNING-AND-ARTIFICIAL-INTELLIGENCE/

57
COMMENTS

• The training dataset configures and sets the parameters of the hidden feature-extractor layers.
• A “from details to blocks to objects” scenario is followed across the layers of a deep convolutional network, from input towards output.

58
FILTERING AND FILTER RESPONSES
• We have developed a test in Octave to make convolution, image features, and filter responses clearer.
• You can ask it to filter a given image with two (horizontal and vertical) edge-detection filters, and show you the results and the filter responses, i.e. the power of the filtered images.
• It uses convolution to do the filtering:

  b[m, n] = h[m, n] ∗ x[m, n] = Σ_{i=−∞..∞} Σ_{j=−∞..∞} h[i, j] · x[m − i, n − j]

  Power(a) = (1 / MN) Σ_{i=1..M} Σ_{j=1..N} a²(i, j)

IMAGE4.IPYNB , CNN_FILTERS1.IPYNB , CNN_FILTERS2.IPYNB
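A hedged Python sketch of the same idea (not the actual Octave/notebook code, and assuming scipy is available): convolve an image with horizontal and vertical edge filters and compare the power of the two filter responses.

```python
import numpy as np
from scipy.signal import convolve2d

# Horizontal / vertical edge-detection kernels (Sobel-like, chosen for illustration)
h_horiz = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)
h_vert = h_horiz.T

def power(a):
    # Power(a) = (1 / MN) * sum of a^2 over all pixels
    return np.mean(a.astype(float) ** 2)

img = np.zeros((64, 64)); img[:, 32:] = 1.0      # toy image with one vertical edge

resp_h = convolve2d(img, h_horiz, mode='same')   # filtering done by convolution
resp_v = convolve2d(img, h_vert, mode='same')
print(power(resp_h), power(resp_v))  # the vertical-edge filter responds much more strongly
```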


59
TRAINING

Good for deep


learning

60
OVERFITTING
• Training error is very low, while testing error is rather high.
• Overfitting or overtraining has happened.
• The ANN has lost its generalization ability.

61
OVERFITTING

Top: Function estimation without overfitting; reds are training samples and blues are testing results; the underlying sinusoidal function is revealed clearly. 4 hidden neurons, 200 training epochs.
Bottom: Overfitting; the fit passes through all training samples but shows no sign of the sinusoidal shape. 100 hidden neurons, 10000 training epochs.

62
OVERFITTING
How to avoid overfitting? When the model capacity increases, the model
gradually changes from underfitting to overfitting.

63
REGULARIZATION
• In deep learning, we wish to minimize the following loss/cost/error function:

  L(w1, b1, …, wn, bn) = (1/m) Σ_{i=1..m} E(ŷ^(i), y^(i))

• L can be any loss, E is any difference function.
• For L2 regularization, we add a component that will penalize large weights:

  L(w1, b1, …, wn, bn) = (1/m) Σ_{i=1..m} E(ŷ^(i), y^(i)) + (λ/2m) Σ_{i=1..n} ‖wi‖²_F

• λ is the regularization coefficient/parameter.
• Usually, L1 is the absolute-value norm, while L2 is the squared norm.
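A hedged Keras sketch of adding an L2 penalty to layer weights; note that Keras adds λ·Σw² directly to the loss (it does not divide by 2m), so λ here absorbs that scaling. The layer sizes and λ value are assumed:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

lam = 0.01   # regularization coefficient lambda (an assumed value)

model = tf.keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(20,),
                 kernel_regularizer=regularizers.l2(lam)),   # penalize large weights
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(lam)),
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy')
# The penalty lam * sum(w^2) is added to the loss for each regularized layer,
# driving large weights down and giving a smoother fitted function.
```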

64
REGULARIZATION

• The higher the λ, the higher the penalty rate for larger weights.
• Large weights will be driven down in order to minimize the cost function.
• Output of each neuron before applying the activation function: z = Wᵀx + b
• By reducing the values in the weight matrix, z will also be reduced, which in turn decreases the effect of the activation function. Therefore, a less complex function will be fit to the data, effectively reducing overfitting.

HTTPS://TOWARDSDATASCIENCE.COM

65
DROPOUT

Consider a rate θ, 0 < θ < 100.

Randomly select and eliminate θ% of your ANN's nodes after each training epoch and evaluation.

Does it really work? Yes, it does.

HTTPS://TOWARDSDATASCIENCE.COM

66
Dropout

• Using Dropout, the neural network cannot rely on any particular input node.
• So, the neural network will be reluctant to give high weights to certain features, because they might disappear.
• Consequently, the weights are spread across all features, making them smaller.
• This effectively shrinks the model and regularizes it.
• We usually apply dropout on hidden layers.

https://towardsdatascience.com
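A minimal Keras sketch of dropout applied to hidden layers only; the rate of 0.2 (20% of units dropped) and the layer sizes are assumed values. Keras drops units per training step and disables dropout automatically at test time:

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=(20,)),
    layers.Dropout(0.2),                  # randomly zero 20% of hidden activations during training
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),                  # dropout on hidden layers only, not on the output
    layers.Dense(1, activation='sigmoid')
])
```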
67
Learning
• Δw = Γ + A + B = −η ∂E/∂w + α Δw_{n−1} + β r
• Γ = gradient descent term
• A = momentum term
• B = random term, r = random Gaussian
• η, α, β = coefficients

68
Batch Normalization
• Batch normalization is a technique for training deep neural networks that normalizes the contributions to a layer for every mini-batch. This has the effect of stabilizing the learning process and drastically decreasing the number of training epochs required to train deep neural networks.
• Any modification of the weights changes many things inside your network.
• Batch normalization provides an elegant way of reparametrizing almost any deep network. The reparametrization significantly reduces the problem of coordinating updates across many layers.

69
Batch Normalization
• For any mini-batch of samples during training, for any input or hidden layer H_m, normalize the output over the mini-batch using

  A(H_m) ← ( A(H_m) − μ[A(H_m)] ) / σ[A(H_m)]

  before sending it to the next layer H_{m+1}.
• A is the activation function.
• So, considering a complete mini-batch, we normalize the output of a layer before forward-passing it to the next layer.
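A small numpy sketch of exactly that per-mini-batch normalization (the learnable scale/shift parameters γ and β used in practice are omitted, since the slide's formula does not include them):

```python
import numpy as np

def batch_normalize(activations, eps=1e-5):
    """Normalize a mini-batch of layer activations, shape (batch_size, n_units)."""
    mu = activations.mean(axis=0)       # mean over the mini-batch, per unit
    sigma = activations.std(axis=0)     # standard deviation over the mini-batch, per unit
    return (activations - mu) / (sigma + eps)   # normalized A(H_m), sent to the next layer

A_Hm = np.random.randn(32, 10) * 3.0 + 5.0   # toy mini-batch of activations
A_norm = batch_normalize(A_Hm)
print(A_norm.mean(axis=0).round(3), A_norm.std(axis=0).round(3))  # ~0 and ~1 per unit
```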

70
YOUR COMPUTER, NO GPU

71
Fine Tuning and Transfer Learning

• Fine-tuning, in general, means making small adjustments to a process to achieve the desired output or performance. Fine-tuning in deep learning involves using the weights of a previously trained deep network as the starting point for training another, similar deep network.
• Transfer learning: a neural network model is first trained on a problem similar to the problem that is being solved. One or more layers from the trained model are then used in a new model trained on the problem of interest.

72
FINE TUNING AND TRANSFER
LEARNING

WWW.MATHWORKS.COM

73
Fine Tuning and Transfer Learning
• In transfer learning for deep neural networks, it is common to transfer the earlier layers rather than the last layers of the network. Transfer learning involves taking a pre-trained model (often trained on a large dataset for a related task) and fine-tuning it for a specific task or dataset of interest.
• The reason for reusing the earlier layers is the idea that the early layers of a deep neural network capture general features and patterns that are often transferable across different tasks and domains. These layers learn low-level features like edges, textures, and basic shapes, which tend to be similar in many different types of data.
• In deep-learning fine-tuning, the extent to which the parameters of the early layers versus the last layers are changed can vary depending on several factors, including the specific problem, the architecture of the neural network, and the amount of available training data. There is no fixed rule, and it depends on how you configure the fine-tuning process. However, in many transfer-learning scenarios, the early layers of a pre-trained model tend to undergo less change than the later layers. (A sketch follows.)
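A hedged Keras sketch of the usual recipe: load a pre-trained backbone, freeze its earlier layers, and train a new head. The choice of backbone (MobileNetV2 here), the number of frozen layers, the head size, and the 5-class output are all illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Pre-trained backbone (ImageNet weights), without its original classification head
base = tf.keras.applications.MobileNetV2(weights='imagenet', include_top=False,
                                          input_shape=(224, 224, 3), pooling='avg')

# Freeze the earlier layers: they hold general features (edges, textures, shapes)
for layer in base.layers[:-20]:
    layer.trainable = False          # early layers change little (or not at all)

# New head for the task of interest
model = tf.keras.Sequential([
    base,
    layers.Dense(128, activation='relu'),
    layers.Dense(5, activation='softmax')   # e.g. 5 target classes (assumed)
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),   # small learning rate for fine-tuning
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])
```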

74
Deep Neural Network Models

• Convolutional Neural Networks
• Generative Adversarial Networks
• Recurrent Neural Networks / Long Short-Term Memory

75
LONG SHORT-TERM MEMORY,
APPLICATIONS
Robot control, protein homology detection, time series anomaly detection, time series prediction, sign language translation, business process management, speech recognition, action recognition, prediction in medical care, drug design, rhythm learning, semantic parsing, short-term traffic forecast, OCR, music composition, grammar learning, object segmentation, airport management.

76
LONG SHORT-TERM MEMORY,
CASE STUDIES

77
Applications: DL for CV

• Image Classification: Deep Convolutional Neural Networks (CNN)
• Object Detection: Regional CNN (RCNN)
• Tracking: Discriminative tracker (DSST); Siamese trackers and segmentation

78
DL for Image Classification: Cat
or Dog?
• Python program, using TensorFlow and Keras
• Open the Kaggle cats & dogs image data set (the dataset has been cleaned beforehand)
• Build the deep model
• Configure the training and validation sets
• Show the images
• Train the deep model
• Validate it

Deep5.ipynb 79
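The notebook itself is not reproduced here; the sketch below is a plausible minimal version of such a binary cat/dog classifier in Keras. The directory layout, image size, and layer sizes are assumptions, not taken from Deep5.ipynb:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Assumed directory layout: data/train/cats, data/train/dogs, data/val/...
train_ds = tf.keras.utils.image_dataset_from_directory(
    'data/train', image_size=(150, 150), batch_size=32, label_mode='binary')
val_ds = tf.keras.utils.image_dataset_from_directory(
    'data/val', image_size=(150, 150), batch_size=32, label_mode='binary')

model = tf.keras.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=(150, 150, 3)),
    layers.Conv2D(32, 3, activation='relu'), layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation='relu'), layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation='relu'), layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(1, activation='sigmoid')          # cat vs. dog
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(train_ds, validation_data=val_ds, epochs=5)   # 5 epochs, as in the results slide
```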
CAT OR DOG EXAMPLE

DEEP5.IPYNB

80
CAT OR DOG EXAMPLE

DEEP5.IPYNB

81
CAT OR DOG EXAMPLE

Results after 5 training epochs:

DEEP5.IPYNB

82
AlexNet Structure

Schematic of the ZFNet architecture. This schematic is very similar to that for AlexNet.
Notice that AlexNet contains 7 hidden layers whereas ZFNet contains 8 hidden layers (these
figures count S layers as parts of the corresponding C layers). Also, note that ZFNet is
implemented using only a single GPU and its architecture is unsplit.
83
VGGNet Architecture

84
VGGNet Architecture

• Note that the convolution layers all have unit stride and that their input fields are limited to a maximum size of 3 × 3; the subsampling layers all have 2 × 2 input fields and 2 × 2 strides. (A sketch of one such block follows.)
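A tiny Keras sketch of one VGG-style block matching that configuration (3 × 3 convolutions with stride 1, then 2 × 2 max pooling with stride 2); the filter counts are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers

def vgg_block(x, filters, n_convs=2):
    # 3x3 convolutions, unit stride, 'same' padding, followed by 2x2 / stride-2 max pooling
    for _ in range(n_convs):
        x = layers.Conv2D(filters, kernel_size=3, strides=1, padding='same',
                          activation='relu')(x)
    return layers.MaxPooling2D(pool_size=2, strides=2)(x)

inputs = tf.keras.Input(shape=(224, 224, 3))
x = vgg_block(inputs, 64)
x = vgg_block(x, 128)
model = tf.keras.Model(inputs, x)   # stack more blocks and a classifier head for a full VGGNet
```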

85
SqueezeNet
SqueezeNet is another popular deep-learning model designed specifically for image classification tasks, with a focus on reducing the model's size and computational requirements. It was developed by researchers from UC Berkeley and DeepScale in 2016.

Key features of SqueezeNet: compact model size; efficient parameter reduction.

SqueezeNet's architecture is designed to deliver a lightweight yet powerful model by focusing on efficient parameter use. This makes it an excellent choice for scenarios where memory, power, or processing capacity is a limiting factor.

86
SqueezeNet

Initial Convolution Layer


• 7x7 Convolution

Fire Modules
• Squeeze Layer (1x1 Convolution)
• Expand Layer (1x1 and 3x3 Convolutions)
Max Pooling Layers

Final Convolution Layer


• 1x1 Convolution

Global Average Pooling Layer

Softmax Output Layer
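A hedged Keras sketch of the fire module at the heart of this structure: a 1x1 "squeeze" convolution followed by parallel 1x1 and 3x3 "expand" convolutions whose outputs are concatenated. The filter counts match SqueezeNet's first fire module, but the snippet only sketches the opening of the network:

```python
import tensorflow as tf
from tensorflow.keras import layers

def fire_module(x, squeeze_filters, expand_filters):
    # Squeeze layer: 1x1 convolutions reduce the number of channels
    s = layers.Conv2D(squeeze_filters, 1, activation='relu', padding='same')(x)
    # Expand layer: parallel 1x1 and 3x3 convolutions, concatenated along channels
    e1 = layers.Conv2D(expand_filters, 1, activation='relu', padding='same')(s)
    e3 = layers.Conv2D(expand_filters, 3, activation='relu', padding='same')(s)
    return layers.Concatenate()([e1, e3])

inputs = tf.keras.Input(shape=(224, 224, 3))
x = layers.Conv2D(96, 7, strides=2, activation='relu', padding='same')(inputs)  # initial 7x7 conv
x = layers.MaxPooling2D(3, strides=2)(x)
x = fire_module(x, squeeze_filters=16, expand_filters=64)
model = tf.keras.Model(inputs, x)   # further fire modules, pooling, a 1x1 conv, GAP and softmax follow
```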

87
SqueezeNet

88
INCEPTION V.3

The Inception v3 model is a popular deep learning model designed for image classification tasks. It was developed by researchers at Google and is part of the Inception family of models, which focus on improving the efficiency and accuracy of convolutional neural networks (CNNs).

Inception v3 is widely used in real-world image classification problems and transfer learning. Due to its pretrained weights on large datasets like ImageNet, it can serve as a powerful feature extractor for other custom image datasets (a short sketch follows).

89
Inception v.3 architecture:
• Improved version of the earlier Inception models
• Multiple “Inception modules”: parallel convolutional layers with different filter sizes (e.g., 1x1, 3x3, 5x5)
• Factorized convolutions

Improvement features:
• Factorized Convolutions
• Auxiliary Classifiers
• Label Smoothing

90
Inception v.3 layer structure:
• Initial Convolution and Pooling Layers (Conv2D layers)
• Factorized Convolutions (1x1 and 3x3 convolutions)
• Inception Modules
• Auxiliary Classifier
• Reduction Modules
• Global Average Pooling Layer
• Dense Layer and Softmax Output

91
Inception v.3

92
Inception v.3

93
IMPORTANT POINTS

• Error Backpropagation, how it works
• Regularization and Dropout
• Momentum
• Training Algorithms
• Underfitting and Overfitting
• Convolutional Deep Neural Networks for image classification
• Batch Normalization
• Perceptron and MLP features
• Perceptron and MLP training

94
References
1. Main:
• E. R. Davies and M. Turk (eds.), Advanced Methods and Deep Learning in Computer Vision, 1st ed., Elsevier, 2021 (Ch. 1, 2, and 9)

95
References
2. Auxiliary:
• Yann LeCun, Yoshua Bengio & Geoffrey Hinton, “Deep learning”, Nature, vol. 521, pp. 436–444, 2015
• Jojo Moolayil, Learn Keras for Deep Neural Networks, 2019
• https://keras.io/examples/vision/
• Samira Pouyanfar et al., “A Survey on Deep Learning: Algorithms, Techniques, and Applications”, ACM Comput. Surv. 51, 5, September 2018, https://doi.org/10.1145/3234150
• Simon S. Haykin, Neural Networks and Learning Machines, 3rd ed.
• Ian Goodfellow et al., Deep Learning, 2016

96
THAT’S IT …

Thank You!

Any Questions?

97
