Stochastic Gradient Descent Optimization
Stochastic Gradient Descent (SGD)
Stochastically sample “mini-batches” from the dataset D
A mini-batch B_j can contain as few as a single sample
● Much faster than Gradient Descent
● Results are often better
● Also suitable for datasets that change over time
● Variance of gradients increases when batch size decreases
SGD is often better
Gradient Descent (GD) vs
Stochastic Gradient Descent (SGD)
Gradient Descent ==> complete gradients
● Complete gradients optimally fit the (arbitrary) data we
have, not the distribution that generated them
● Implicitly treats the training samples as the “absolute
representatives” of the input distribution
● Suitable for traditional optimization problems: “find the optimal
route”
● Assumption: test data will be no different from the training
data
For ML we cannot make this assumption ==> test data are
always different
Stochastic Gradient Descent
● SGD is preferred over Gradient Descent
● (Slightly) noisy gradients act as regularization
● Stochastic gradients computed on sampled training data are
roughly representative of the full gradient
● The model does not overfit to the particular training samples
● Training is orders of magnitude faster; on real datasets full
Gradient Descent is not even feasible
● How many samples per mini-batch? A hyper-parameter, set by trial
& error; usually between 32 and 256 samples (see the sketch below)
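For concreteness, a minimal sketch of a mini-batch SGD loop in NumPy; the linear model, the synthetic data, the learning rate of 0.1, and the batch size of 64 are illustrative assumptions, not values taken from the slides.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                        # toy dataset D (illustrative)
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=1000)

w = np.zeros(20)                                       # model parameters
lr, batch_size = 0.1, 64                               # hyper-parameters (trial & error)

for epoch in range(10):
    idx = rng.permutation(len(X))                      # reshuffle the data every epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]              # mini-batch B_j
        grad = 2.0 * X[b].T @ (X[b] @ w - y[b]) / len(b)   # gradient computed on B_j only
        w -= lr * grad                                 # noisy but cheap update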
SGD for dynamically changing datasets
Assume 1 million new images are uploaded to Instagram per
week and we want to build a “cool picture” classifier
● Should “cool pictures” from the previous year have just as
much influence?
● No, the learning machine should track these changes
With GD these changes go undetected: results are averaged
over the far more numerous “past” samples, so the past
“over-dominates”
A properly implemented SGD can track these changes much
better and give better models
Data Preprocessing
Shuffling
● Applicable only with SGD
● Choose samples with maximum information content
● Mini-batches should contain examples from different classes
● Prefer samples likely to generate larger errors
– Otherwise gradients will be small, leading to slower learning
– Check the errors from previous rounds and prefer “hard examples”
– Don’t overdo it though; beware of outliers
● In practice, split your dataset into mini-batches
● Make each mini-batch as class-diverse and informative as possible (see the sketch below)
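As a hedged illustration of class-diverse mini-batches, one possible construction in NumPy; the helper name class_diverse_batches and the round-robin strategy are assumptions for illustration, not the slides' prescription.

import numpy as np

def class_diverse_batches(X, y, batch_size, rng):
    # Hypothetical helper: round-robin over shuffled per-class index lists
    # so that every mini-batch mixes examples from different classes.
    per_class = [list(rng.permutation(np.flatnonzero(y == c))) for c in np.unique(y)]
    order = []
    while any(per_class):                 # loop until every class list is exhausted
        for indices in per_class:
            if indices:
                order.append(indices.pop())
    for start in range(0, len(order), batch_size):
        batch = np.array(order[start:start + batch_size])
        yield X[batch], y[batch]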
Early Stopping
Dropout
Activation Functions
Sigmoid-Like Activation Fn
Rectified Linear Unit (ReLU)
● Very popular in computer vision
and speech recognition
● Much faster computation of
activations and gradients
● No vanishing or exploding
gradient problems; only comparison,
addition, and multiplication are needed
● Some claim biological
plausibility
● No saturation for positive inputs (see the sketch below)
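A small NumPy sketch, added for illustration, comparing the ReLU gradient with the sigmoid gradient to show why ReLU does not saturate for positive inputs.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)                 # only a comparison and a max

def relu_grad(x):
    return (x > 0).astype(float)              # 1 for x > 0, else 0: no vanishing for active units

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, 0.0, 5.0])
print(relu_grad(x))                           # [0. 0. 1.]
print(sigmoid(x) * (1.0 - sigmoid(x)))        # ~[0.007 0.25 0.007]: saturates for large |x|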
Rectified Linear Unit (ReLU)
ReLU convergence rate (w.r.t. Tanh)
Other ReLUs
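The “Other ReLUs” slide refers to a figure that did not survive extraction; as a hedged reconstruction, two common variants sketched in NumPy (the slope 0.01 and alpha 1.0 are customary defaults, not values from the slides).

import numpy as np

def leaky_relu(x, slope=0.01):
    # A small negative slope keeps a gradient flowing for x < 0.
    return np.where(x > 0, x, slope * x)

def elu(x, alpha=1.0):
    # Smooth negative saturation toward -alpha instead of a hard zero.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))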
Learning Rate
Learning rate schedules
Learning rate in practice
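A hedged sketch of two common learning rate schedules; the decay constants are illustrative choices, not values from the slides.

import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    # Multiply the learning rate by `drop` every `every` epochs.
    return lr0 * (drop ** (epoch // every))

def exp_decay(lr0, epoch, k=0.05):
    # Smooth exponential decay of the learning rate.
    return lr0 * math.exp(-k * epoch)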
Better Optimization
Momentum
Adagrad (Adaptive Gradient)
RMSprop
(Root Mean Square)
Adam
(Adaptive Moment Estimation)
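The optimizer slides reference figures that were lost in extraction; as a hedged summary, the standard update rules for Momentum, Adagrad, RMSprop, and Adam, sketched in NumPy (hyper-parameter values are common defaults, not taken from the slides).

import numpy as np

def momentum_step(w, g, v, lr=0.01, mu=0.9):
    v = mu * v - lr * g                          # accumulate a velocity
    return w + v, v

def adagrad_step(w, g, cache, lr=0.01, eps=1e-8):
    cache = cache + g**2                         # per-parameter sum of squared gradients
    return w - lr * g / (np.sqrt(cache) + eps), cache

def rmsprop_step(w, g, cache, lr=0.001, decay=0.9, eps=1e-8):
    cache = decay * cache + (1 - decay) * g**2   # moving average instead of a running sum
    return w - lr * g / (np.sqrt(cache) + eps), cache

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g                    # first moment estimate
    v = b2 * v + (1 - b2) * g**2                 # second moment estimate
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)   # bias correction (t starts at 1)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v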
Convolutional Neural Network (CNN)
With TensorFlow
Summation of Matrix + vector
● [X] has shape 7*3, [W] has shape 3*4, and the bias b has shape 1*4
● [X] * [W] has shape 7*4
● [X] * [W] + b also has shape 7*4: the 1*4 bias b is repeated (broadcast) across all 7 rows, as sketched below
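The same computation sketched in the TF 1.x-style API used throughout these slides; the concrete values are arbitrary, only the shapes matter.

import numpy as np
import tensorflow as tf

X = tf.constant(np.ones((7, 3), dtype=np.float32))   # 7*3
W = tf.constant(np.ones((3, 4), dtype=np.float32))   # 3*4
b = tf.constant([1.0, 2.0, 3.0, 4.0])                # 1*4 bias

y = tf.matmul(X, W) + b   # 7*4: b is repeated (broadcast) across the 7 rows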
tf.zeros
tf.zeros(shape, dtype=tf.float32, name=None)
Args:
● shape: Either a list of integers, or a 1-D Tensor of type int32.
● dtype: The type of an element in the resulting Tensor.
● name: A name for the operation (optional).
● tf.zeros([3, 4], tf.int32) ==> [ [0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0] ]
● tf.zeros(4) ==> [0, 0, 0, 0]
tf.nn.conv2d
tf.nn.conv2d (input, filter, strides, padding,
use_cudnn_on_gpu=None, data_format=None, name=None)
Computes a 2-D convolution given 4-D input and filter tensors.
Given an input tensor of shape [batch, in_height, in_width,
in_channels] and a filter / kernel tensor of shape [filter_height,
filter_width, in_channels, out_channels], this op performs the
following:
● Flattens the filter to a 2-D matrix with shape [filter_height *
filter_width * in_channels, output_channels].
● Extracts image patches from the input tensor to form a
virtual tensor of shape [batch, out_height, out_width,
filter_height * filter_width * in_channels].
tf.nn.conv2d
tf.nn.conv2d (input, filter, strides, padding, use_cudnn_on_gpu=None,
data_format=None, name=None)
Args:
● input: A Tensor. Must be one of the following types: half, float32, float64.
● filter: A Tensor. Must have the same type as input.
● strides: A list of ints. 1-D of length 4. The stride of the sliding window for each
dimension of input. Must be in the same order as the dimension specified with
format.
● padding: A string from: "SAME", "VALID". The type of padding algorithm to use.
● use_cudnn_on_gpu: An optional bool. Defaults to True.
● data_format: An optional string from: "NHWC", "NCHW". Defaults to "NHWC".
Specify the data format of the input and output data. With the default format
"NHWC", the data is stored in the order of: [batch, in_height, in_width,
in_channels]. Alternatively, the format could be "NCHW", the data storage order
of: [batch, in_channels, in_height, in_width].
tf.nn.max_pool
tf.nn.max_pool(value, ksize, strides, padding,
data_format='NHWC', name=None)
Args:
● value: A 4-D Tensor with shape [batch, height, width, channels]
and type tf.float32.
● ksize: A list of ints that has length >= 4. The size of the window for
each dimension of the input tensor.
● strides: A list of ints that has length >= 4. The stride of the sliding
window for each dimension of the input tensor.
● padding: A string, either 'VALID' or 'SAME'. The padding algorithm.
● data_format: A string. 'NHWC' and 'NCHW' are supported.
● name: Optional name for the operation.
tf.reshape
tf.reshape(tensor, shape, name=None)
Reshapes a tensor.
Given tensor, this operation returns a tensor that has the
same values as tensor with shape shape.
If one component of shape is the special value -1, the size
of that dimension is computed so that the total size remains
constant. In particular, a shape of [-1] flattens into 1-D.
At most one component of shape can be -1.
If x has 117,600 elements, then
tf.reshape(x, [-1, 28, 28, 3]) will produce 50
images, each of size 28*28 with 3 channels
tf.nn.dropout
tf.nn.dropout(x, keep_prob, noise_shape=None, seed=None, name=None)
Args:
● x: A tensor.
● keep_prob: A scalar Tensor with the same type as x. The probability that
each element is kept.
● noise_shape: A 1-D Tensor of type int32, representing the shape for
randomly generated keep/drop flags.
● seed: A Python integer. Used to create random seeds.
● name: A name for this operation (optional).
With probability keep_prob, outputs the input element scaled up by 1 /
keep_prob, otherwise outputs 0. The scaling is so that the expected sum is
unchanged.
By default, each element is kept or dropped independently.
● Input image: 28*28 pixels, 3 channels (28*28*3)
● Convolve with 32 filters, each 5*5*3
● Output: 32 feature maps, each 28*28 (using padding)
● ReLU activation function
● 2*2 pooling
● Output (after pooling): 32 feature maps, each 14*14
● Convolve the 32 maps with 64 filters, each 5*5*32
● Output: 64 feature maps, each 14*14 (using padding)
● ReLU activation function
● 2*2 pooling
● Output (after pooling): 64 feature maps, each 7*7
● Flatten to one feature vector of size 7*7*64
● Fully connected (FC) hidden layer: 1024 neurons
● Fully connected (FC) output layer: 10 neurons
● Softmax activation function, one-hot output
● Cross-entropy loss, computed against the (one-hot) training labels
Parameters Calculations
● Input image: 28*28*3
● Conv layer 1: 32 filters, each 5*5*3 → filter parameters: (5*5*3) * 32 + 32 {bias}
● Feature maps: 32, each 28*28 → size (28*28) * 32
● ReLU activation function, then 2*2 pooling
● Feature maps: 32, each 14*14 → size (14*14) * 32
● Conv layer 2: 64 filters, each 5*5*32 → filter parameters: (5*5*32) * 64 + 64 {bias}
● Feature maps: 64, each 14*14 → size (14*14) * 64
● ReLU activation function, then 2*2 pooling
● Feature maps: 64, each 7*7 → size (7*7) * 64
● Flattened feature vector: 7*7*64 = 3136 values
● FC hidden layer (1024 neurons): weight parameters (7*7*64) * 1024 + 1024 {bias}
● FC output layer (10 neurons): weight parameters 1024 * 10 + 10 {bias}
● Softmax activation function, one-hot output; cross-entropy loss against the training labels
Total Number of Parameters
Conv_1 = (5*5*3) * 32 + 32
= 2,432
Conv_2 = (5*5*32) * 64 + 64
= 51,264
FC_Hidden = (7*7*64) * 1024 + 1024
= 3,212,288
FC_Output = 1024 * 10 + 10
= 10,250
Total = 3,276,234
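The totals can be re-derived with a few lines of Python:

conv_1 = (5 * 5 * 3) * 32 + 32             # 2,432
conv_2 = (5 * 5 * 32) * 64 + 64            # 51,264
fc_hidden = (7 * 7 * 64) * 1024 + 1024     # 3,212,288
fc_output = 1024 * 10 + 10                 # 10,250
print(conv_1 + conv_2 + fc_hidden + fc_output)   # 3,276,234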
Implementation Guide
For Convolutional part, Define:
● Convolutional Network Architecture
● Number of Conv. Layers
● Number of filters in each layer
● Filter Parameters
● Bias Parameters
Define the parameters of Convolution Layer 1 (filters + bias) and Convolution Layer 2 (filters + bias):

def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

W_conv1 = weight_variable([5, 5, 3, 32])   # 32 filters, each 5*5*3
b_conv1 = bias_variable([32])

W_conv2 = weight_variable([5, 5, 32, 64])  # 64 filters, each 5*5*32
b_conv2 = bias_variable([64])
For Convolutional Part, Apply:
● 2-D Convolution (on each Layer)
● ReLU Activation function
● Max-Pool
Apply convolution, the ReLU activation function, and 2x2 max-pooling on each layer:

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                          strides=[1, 2, 2, 1], padding='SAME')

h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)
Define Fully Connected Layers:
● Number of Neurons in Fully Connected Layer
● Initialize Weights and Biases for Hidden and
output layers
● Dropout Layer (if required)
Without a dropout layer (weight_variable and bias_variable as defined above):

W_FC1 = weight_variable([7 * 7 * 64, 1024])   # hidden layer: 1024 neurons
b_FC1 = bias_variable([1024])

W_FC2 = weight_variable([1024, 10])           # output layer: 10 neurons
b_FC2 = bias_variable([10])

Total Number of Parameters
Conv_1 = (5*5*3) * 32 + 32 = 2,432
Conv_2 = (5*5*32) * 64 + 64 = 51,264
FC_Hidden = (7*7*64) * 1024 + 1024 = 3,212,288
FC_Output = 1024 * 10 + 10 = 10,250
Total = 3,276,234
With a dropout layer between the hidden layer and the output layer:

W_FC1 = weight_variable([7 * 7 * 64, 1024])   # hidden layer: 1024 neurons
b_FC1 = bias_variable([1024])

# Dropout layer
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

W_FC2 = weight_variable([1024, 10])           # output layer: 10 neurons
b_FC2 = bias_variable([10])
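The slides use h_fc1 without showing how it is computed from h_pool2; a hedged sketch of the usual intermediate step (the names h_pool2_flat and h_fc1 are assumptions chosen to match the surrounding code):

# Flatten the 7*7*64 feature maps, then apply the fully connected hidden layer.
h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_FC1) + b_FC1)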
Train the Network
● Feed Forward, Calculate Estimated output
● Apply “Softmax” on Output
● Loss Calculation (Cross_Entropy)
● Optimization (Minimize Loss and Update
Weights)
y_conv = tf.matmul(h_fc1_drop, W_FC2) + b_FC2    # network output (logits)

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))  # y_: one-hot training labels

train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))

accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
Model Evaluation
(Training Data / Test Data)
The evaluation path through the graph, in the order the ops are built:
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
y_conv = tf.matmul(h_fc1_drop, W_FC2) + b_FC2
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
keep_prob should be fed as 1.0 for test data and < 1.0 for training data
Model Training
(Use Training Data)
The training path through the graph, in the order the ops are built:
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
y_conv = tf.matmul(h_fc1_drop, W_FC2) + b_FC2
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
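A hedged sketch of the surrounding TF 1.x training/evaluation loop; the session setup, the hypothetical next_batch source, the placeholders x / y_, the test arrays, and the keep_prob values 0.5 and 1.0 are illustrative assumptions not shown on the slides.

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(1000):
        batch_x, batch_y = next_batch(50)   # hypothetical mini-batch source
        sess.run(train_step,
                 feed_dict={x: batch_x, y_: batch_y, keep_prob: 0.5})   # dropout active
    print(sess.run(accuracy,
                   feed_dict={x: test_x, y_: test_y, keep_prob: 1.0}))  # dropout off for evaluation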
Useful Operations
in TensorFlow
Some Useful Arithmetic Operations
Operation     Description
tf.add        addition
tf.sub        subtraction
tf.mul        multiplication
tf.div        division
tf.mod        modulo
tf.abs        returns the absolute value
tf.neg        returns the negative value
tf.sign       returns the sign
tf.inv        returns the inverse (reciprocal)
tf.square     calculates the square
tf.round      returns the nearest integer
tf.sqrt       calculates the square root
tf.pow        calculates the power
tf.exp        calculates the exponential
tf.log        calculates the logarithm
tf.maximum    returns the maximum
tf.minimum    returns the minimum
tf.cos        calculates the cosine
tf.sin        calculates the sine
Some useful Matrix Operations
Operation Description
tf.diag returns a diagonal tensor with the given diagonal values
tf.transpose returns the transpose of the argument
tf.matmul returns the product of the two tensors given as arguments
tf.matrix_determinant returns the determinant of the square matrix given as an argument
tf.matrix_inverse returns the inverse of the square matrix given as an argument