[Fall 2024] Deep Learning 1


Deep Learning 1

By: ML@B Edu Team

● Quiz 1 released and due next Monday September 16
○ It spans content from lectures 1 (Intro to ML) and 2 (Deep Learning 1)
● Assignment 1 released and due next Monday, September 23
Outline
1. Math Review
2. Universal Model Class
3. Neural Networks
4. Gradient Descent
5. More Neural Networks
6. Deep Double Descent
Math Review
Vector Dot Products

● The dot product of 2 vectors is the sum of their element-wise products

● We can also write this as the matrix product of w_transpose and x
○ Transposing a vector changes it from a column vector to a row vector, and vice versa
● For neural networks, we will frequently think about weighting a vector’s individual
components and adding them, so keep this in mind
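As a small sketch of the idea above (plain Python with no libraries; the names `dot`, `w`, and `x` are illustrative and not from the slides), the dot product multiplies matching components and adds up the results:

```python
def dot(w, x):
    """Sum of element-wise products of two equal-length vectors."""
    assert len(w) == len(x), "vectors must have the same length"
    return sum(w_i * x_i for w_i, x_i in zip(w, x))

w = [3, -1, -1]   # weights
x = [3, 2, 1]     # inputs
print(dot(w, x))  # 3*3 + (-1)*2 + (-1)*1 = 6
```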
Matrix Vector Product
● Matrix vector products can be thought of
as repeated vector dot products between
the rows of matrix W and the vector x
● This is just an extension of the last slide,
except now we are stacking dot products
on top of each other to form a new vector
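A minimal sketch of the "stacked dot products" view, in plain Python (the helper name `matvec` and the numbers are made up for illustration):

```python
def matvec(W, x):
    """Each output entry is the dot product of one row of W with x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

W = [[3, -1, -1],    # row 1
     [-2, 1, 1]]     # row 2
x = [3, 2, 1]
print(matvec(W, x))  # [6, -3]
```

The output has one entry per row of W, so a matrix with 2 rows turns a 3-dimensional input into a 2-dimensional output.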
Vector Function Notation
● With neural networks we are going to deal with vector/matrix functions a lot
● Here is some notation that we will consistently use throughout this lecture:
○ x — This usually denotes an input vector
○ y — This usually denotes the label corresponding to x
○ W, b, or theta — parameters that we control, usually W for a matrix, b for vector, and theta for a set
of parameter(s) in general (also, when we are feeling lazy and don’t want to write them all out)
● In this dummy example below, the subscript i on a vector denotes the ith value of
the vector in question
○ theta_i and theta_j are DIFFERENT SCALAR PARAMETERS residing inside the parameter vector
Derivatives of Vector Functions
● Vectors are just composed of scalars, so there is no reason we can’t take the
derivatives of our function output (a scalar) with respect to some component of
the input vector (a scalar)
○ This is super important!
○ Don’t worry about having to take derivatives with respect to matrices or vectors in this course
(though, if you are curious, take EECS 127) — we will use software that will handle this for us
● In the function below, you can take the derivative of a scalar function with
respect to one of the elements in a parameter vector or matrix taken as input
Scalar Partial Derivatives of Vector Functions
● The gradient is just the vector of the partial
derivatives w.r.t. each function input
● Math fact: In vector calculus, the direction of
the steepest increase in a function’s value
from a point is along the gradient vector
○ If you aren’t convinced check out Khan Academy
starting with directional derivatives… it will help
● Steepest descent is along the negative
gradient vector
○ → If we want to go down a hill, follow the negative
gradient evaluated at our current position
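To make the gradient concrete, here is a rough sketch that approximates each partial derivative numerically with central finite differences. This is only for intuition and is not how the course software computes gradients; the function `hill` and the helper `numerical_gradient` are hypothetical examples.

```python
def numerical_gradient(f, theta, eps=1e-6):
    """Approximate the gradient of a scalar function f at the point theta,
    one component at a time, using central finite differences."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def hill(theta):
    """Hypothetical elevation: theta_0^2 + 3 * theta_1^2."""
    return theta[0] ** 2 + 3 * theta[1] ** 2

g = numerical_gradient(hill, [1.0, 2.0])  # analytic gradient is [2, 12]
downhill = [-g_i for g_i in g]            # direction of steepest descent
```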
Universal Function Approximator
Motivation for Neural Networks
● Lots of problems that we can (or want to) solve — regression, classification, etc —
frequently center around creating functions that are super non-linear
● Hard to figure out which class of models works the best for each task / dataset
○ Is there some model class that can do any one of these tasks almost straight out of the box?
Motivation for Neural Networks
● Our brain is able to do lots of tasks with the same
○ What if we tried to create a model of the brain?
● Our brain has neurons
○ Neurons take in signal from surrounding neurons
○ Neurons output a signal based on the amount of signal
taken in
○ Output(inputs) = ReLU(weighted sum of inputs)
○ ReLU(x) = max(0, x) = x if x > 0 else 0
● Fair warning: deep learning isn’t the same as
cognitive science
Neural Networks (graphical view)
Perceptron with ReLU and Bias
● Rough model of a neuron
● Weight each component of the input by some amount, then “activate” on the sum
○ Note: b is just a scalar being added to the weighted sum and is independent of the input; we call this term a “bias” value
● The function we use for “activating” is a ReLU, where ReLU(x) = max(0, x)
○ The motivation for choosing it will be clear later
[Figure: inputs x1 … xn, scaled by weights w1 … wn, feed a weighted sum ∑ followed by ReLU]

(Don’t worry about solving for the weights yet)

Perceptron Example with ReLU instead of the Step Function
● Example 1: x1 * w1 + x2 * w2 + x3 * w3 + b = (3)(3) + (2)(-1) + (1)(-1) + (-2) = 4, and ReLU(4) = max(0, 4) = 4
● Example 2: x1 * w1 + x2 * w2 + x3 * w3 + b = (3)(-2) + (2)(1) + (1)(1) + (-2) = -5, and ReLU(-5) = max(0, -5) = 0
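The two worked examples can be reproduced with a short sketch. It assumes the first factor in each product is the input, so x = (3, 2, 1) and b = -2 in both cases; the function names are illustrative.

```python
def relu(z):
    return max(0, z)

def perceptron(x, w, b):
    """Weighted sum of the inputs plus a bias, passed through ReLU."""
    return relu(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)

x = [3, 2, 1]
print(perceptron(x, [3, -1, -1], -2))  # ReLU(4)  = 4
print(perceptron(x, [-2, 1, 1], -2))   # ReLU(-5) = 0
```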
Perceptron with ReLU and Bias
● Let’s simplify the notation a bit
● The sum in the previous slide can be rewritten as a dot product between a weight vector and an input vector plus a bias scalar
[Figure: inputs x1 … xn, weights w1 … wn, weighted sum ∑, ReLU = max(0, x), output]
Perceptron Layer
● What if we have multiple perceptrons that share the same input?
○ Neurons in a brain form all kinds of connections
○ Maybe different perceptrons can be used to extract different kinds of signals out of the same input
● Each perceptron (with its unique weights and biases) can be stacked into a “layer”
[Figure: the same input feeds perceptron 1 … perceptron n, with weights w1 … w_n]
Perceptron Layer
● To calculate the output of a layer, we can stack the output of each individual perceptron into a vector
● Note: The ReLU is applied to each perceptron output independently
○ Can view ReLU as a vector/matrix function that applies the ReLU operation element-wise to each component of its input
● Each row is just the perceptron equation for an independent perceptron
[Figure: the same input feeds perceptron 1 … perceptron n, with weights w1 … w_n]
Perceptron Layer
● If we stack the bias terms into a vector, we can rewrite everything with a single matrix-vector product and vector addition
● This is the most compact form of writing a layer
● We can now just abandon the view of stacked perceptrons if we want and start thinking in terms of entire layers at once
[Figure: the same input feeds perceptron 1 … perceptron n, with weights w1 … w_n]
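In code, the compact form of a layer is ReLU(W x + b). The sketch below uses plain Python lists, and the weights are made-up numbers: each row of W holds the weights of one perceptron, and b holds one bias per perceptron.

```python
def relu_vec(z):
    """Apply ReLU element-wise to a vector."""
    return [max(0, z_i) for z_i in z]

def layer(W, b, x):
    """One perceptron layer: ReLU(W x + b)."""
    pre_activation = [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]
    return relu_vec(pre_activation)

W = [[3, -1, -1],   # weights of perceptron 1
     [-2, 1, 1]]    # weights of perceptron 2
b = [-2, -2]        # one bias per perceptron
print(layer(W, b, [3, 2, 1]))  # [4, 0]
```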
Review of How We Got Here
[Figure panels: w/ ReLU, Matrix Form]
Neural Network
● Each node (except for the column of input nodes) is just a perceptron
● Each perceptron layer is also called a neural network layer
○ We can choose to stack arbitrary numbers of perceptrons at each layer or cascade with an arbitrary number of layers
○ Each layer can have a different number of perceptrons
○ The middle layers are often called “hidden layers” since they aren’t immediately interpretable
● We will still choose to use the ReLU function for each layer for now
○ Note: We don’t activate with ReLU on the output layer
● This model is also called a “Multi-Layer Perceptron” or MLP for obvious reasons
Cascading Layers
● We can start to cascade these layers to form the full neural network
● We will choose the dimensions of each matrix W_i to get the right sized layers that we want
○ The width of W has to equal the size of the previous layer
○ The height of W is equal to the number of perceptrons we are stacking for this layer
[Figure: input x passes through layer (W1, b1), then layer (W2, b2)]
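A rough sketch of two cascaded layers, with the shape rule checked explicitly. The layer sizes (3 inputs, 2 hidden perceptrons, 1 output) and all weights are hypothetical.

```python
def affine(W, b, x):
    """Compute W x + b, checking that the width of W matches len(x)."""
    assert all(len(row) == len(x) for row in W), "width of W must equal input size"
    return [sum(w * x_j for w, x_j in zip(row, x)) + b_i for row, b_i in zip(W, b)]

def relu_vec(z):
    return [max(0.0, z_i) for z_i in z]

# W1 is 2 x 3: height 2 = hidden perceptrons, width 3 = input size
W1, b1 = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]], [0.0, -1.0]
# W2 is 1 x 2: height 1 = output size, width 2 = size of the previous layer
W2, b2 = [[2.0, -1.0]], [0.5]

x = [1.0, 2.0, 3.0]
hidden = relu_vec(affine(W1, b1, x))  # size 2
output = affine(W2, b2, hidden)       # size 1, no ReLU on the output layer
print(hidden, output)                 # [0.0, 2.0] [-1.5]
```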
What is wrong with abandoning the ReLU?

Without ReLU, it basically looks like we are only using one giant linear layer
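To see why, note that W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2), which is just one linear layer with weight matrix W2 W1 and bias W2 b1 + b2. The sketch below checks this numerically with made-up integer weights (all names are illustrative):

```python
def matvec(W, x):
    return [sum(w * x_j for w, x_j in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix product A B, with matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [u_i + v_i for u_i, v_i in zip(u, v)]

W1, b1 = [[1, 2], [3, 4], [5, 6]], [1, 0, -1]   # 3 x 2 layer
W2, b2 = [[1, 0, 2], [0, 1, -1]], [2, 3]         # 2 x 3 layer
x = [7, -2]

# Two layers applied one after the other, with no ReLU in between
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The same computation collapsed into a single linear layer
W, b = matmul(W2, W1), add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

print(two_layers, one_layer)  # [50, -6] [50, -6]
```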

Why do we need a ReLU?

Neural Network for a Regression Style Problem (i.e. a single scalar output in range (-inf, inf))
Important observations about this example:
1) The last layer doesn’t have a ReLU so that its output range isn’t limited
2) The perceptrons in the middle have fewer weights since they have fewer incoming connections

● Reminder: classification is the task where the output is a vector where entry i is the probability that an input belongs to the ith class
● Since we are dealing with probabilities, we need to
enforce two rules:
○ Output of any neuron in the final layer is between 0 and 1
○ All outputs sum to 1
● Our network outputs (green nodes) can right now be any
value in the range (-inf, inf)
● We need some operation/function to perform at the end
that turns our outputs into a valid probability distribution
● You don’t (and shouldn’t) have to commit the
formula to memory, just the following ideas:
○ Softmax first makes all vector entries positive by taking e^(z_i) for each entry z_i
○ We then divide each entry by the sum of all exponentiated
entries, ensuring that the sum of all entries is 1
● The largest entry before will be the largest
probability after the softmax, simulating a hard
maximum while being nicely differentiable (needed
in the next section)
● It also fills our need to turn any vector of numbers
into something that can be interpreted as a
probability distribution
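The two steps above correspond to softmax(z)_i = e^(z_i) / sum_j e^(z_j). Here is a minimal sketch in plain Python. Subtracting max(z) before exponentiating is a common numerical-stability trick that is not on the slides; it does not change the result.

```python
import math

def softmax(z):
    """Exponentiate each entry, then divide by the sum of the exponentials.
    Subtracting max(z) first leaves the result unchanged but avoids overflow."""
    m = max(z)
    exps = [math.exp(z_i - m) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, -1.0, 0.5])
print(probs)       # three positive numbers
print(sum(probs))  # 1.0 up to floating-point rounding
```

The largest input (2.0, at index 0) ends up with the largest probability, as described above.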
Word of warning…
● PyTorch (the ML library we will use in this class) already has the softmax layer
built-in to the CrossEntropyLoss function so you should never have to use this
layer explicitly when coding up a network
○ Unknowingly adding an additional softmax layer can tank your model’s performance, and this
“hidden” bug has made many people’s lives miserable before…
● However, it’s still good to know what it is, in case you ever have to use other ML libraries
1. Our modified perceptron takes in a number of
inputs and does the operation of weighting them,
adding a bias and then doing a ReLU
2. Neural Networks are basically stacked
perceptrons forming a “layer” that are then
cascaded, and the number of stacks (often called
the “size” of the layer) is arbitrary for each layer
3. If you want to do classification and enforce that
our model spits out valid probabilities, perform a
softmax operation on the outputs of your model
Gradient Descent
Loss Functions
● When we randomly initialize the parameters of our model, it is probably
outputting complete garbage
○ We need to find a way to improve the weights so they are not complete garbage anymore :)
● We need a metric we can optimize for… something that quantifies how poorly the
model is performing. This “loss” function needs to be
○ High when our model’s predictions are bad
○ Low when our model’s predictions are good
● For deep learning, the loss functions need to be differentiable, i.e., you should
be able to take the derivative of your loss with respect to its inputs (the model’s outputs)
Loss Function Example: MSE
● Classification or regression: suppose we have N total data pairs (x_i, y_i) and the
model’s prediction given x_i is (y_pred)_i
● Can we just use something like Mean Squared Error as our loss function?
○ MSE returns the average of the squared differences between y_i and (y_pred)_i across all samples
○ This quantity will be low when the model’s predictions get reasonably close to the corresponding
true labels for all (or most) of the N samples, but high otherwise
● There will be better losses, but this one is easy to write and interpret
● Note: MSE outputs a scalar value and is also a differentiable function

Note: we can stack the true labels and model predictions for all training samples into N-dimensional vectors,
and define MSE as a vector function
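A minimal sketch of MSE as a function of the stacked labels and predictions (plain Python; the sample numbers are made up):

```python
def mse(y_true, y_pred):
    """Average of the squared differences between labels and predictions."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

y_true = [1.0, 2.0, 3.0]
print(mse(y_true, [1.0, 2.0, 3.0]))  # 0.0, perfect predictions
print(mse(y_true, [3.0, 0.0, 5.0]))  # 4.0, every prediction is off by 2
```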
Hill Metaphor
● If you are on a hill and you want to reach the bottom, but
can only see a foot around you, what do you do?
○ We should just follow the slope of the hill and hope it gets us down
● We will need to take a couple steps in the downward
direction, stop and re-evaluate our direction, then take a
few more steps and so on
Which direction is the steepest?
● Suppose the hill’s elevation is given by a vector function
● Its steepest descent is along the negative gradient vector
○ → If we want to go down a hill, follow the negative gradient evaluated at our current position
Gradient Descent Example
The model’s loss function for a set of data-points (x_i, y_i) can be written as

● This loss function spits out a scalar, just like the elevation of the hill previously was also a scalar
● L is differentiable almost everywhere since all we have done so far is just a bunch of multiplications, additions
and a ReLU (the ReLU has a kink at 0, where in practice a slope of 0 is commonly used)
● The parameters of this model are just the scalar components of the weight matrices and bias vectors… we
can take those as the input variables to the loss function
● The data-points are fixed so we can treat x_i and y_i as constants
Gradient Descent
● Our loss is a function of the model parameters and we want to minimize it
● This is just the same as going down a hill
○ We want to find what direction we can step our parameters in to decrease the neural network’s loss
(on its training data) as much as possible with a single step
○ All our parameters are scalars, so we can just take the partial derivatives (i.e. gradient) of our loss
with respect to our parameters

Scalar parameters written out

Matrix form
Gradient Descent
● Bear in mind what we need to calculate: the gradient
evaluated at our current position
○ When you’re walking down a hill in the fog, you only need to know
how the hill slopes where you are currently standing
● Therefore, we only care about finding the gradient vector
evaluated at our current weights and training examples
● With neural networks being as complex as they are, it’s
hard to solve for the gradient w.r.t. their parameters
● However, it is surprisingly easy to solve for when it’s
evaluated at a given set of parameters
○ More on this next time… for now, assume that a genie gives the
gradients to us
Gradient Update
● To “reach” the parameters at the bottom of the “loss function hill”, we need to step in the direction of its steepest descent, i.e., along the negative gradient vector
○ We might also want to adjust our step size so it’s not too small or too big, and we do so by scaling the gradient update with a scalar lambda
○ This lambda is also called our “learning rate”
● We keep taking such steps to continue going down this “hill”
Notation Clarity: Here, theta represents the model’s parameters all in one vector, and data is some training example (along with its label if there is one). The first equation is a component-wise update while the second equation is the same update in vector form.
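In code, the update described here is theta_i <- theta_i - lambda * dL/dtheta_i for every component i. The sketch below applies it repeatedly to a hypothetical loss L(theta) = theta_0^2 + theta_1^2, whose gradient (2 * theta) is written out by hand rather than supplied by a "genie".

```python
def gradient_step(theta, grad, lam):
    """One update: theta_i <- theta_i - lam * dL/dtheta_i for every i."""
    return [t_i - lam * g_i for t_i, g_i in zip(theta, grad)]

theta = [4.0, -2.0]   # arbitrary starting parameters
for _ in range(50):
    grad = [2 * t_i for t_i in theta]            # gradient of the toy loss
    theta = gradient_step(theta, grad, lam=0.1)  # lam is the learning rate

print(theta)  # both entries are now very close to 0, the bottom of this "hill"
```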
Gradient Descent Convergence
● You will notice that as we get closer to the bottom
of the hill, the partial derivatives become smaller,
and we eventually slow to a halt (assuming our
learning rate isn’t too big)
● What happens when our step size is too big?
○ We will just skip across the minimum instead of slowing
down to hit it
○ This causes training instability
● What happens when our step size is too small?
○ Our steps will be so small that it will take forever to reach the minimum
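A quick way to see both failure modes is to run gradient descent on a hypothetical one-parameter loss L(theta) = theta^2 with different learning rates (the specific values below are chosen only for illustration):

```python
def run(lam, steps=20, theta=1.0):
    """Gradient descent on L(theta) = theta^2, whose gradient is 2 * theta."""
    for _ in range(steps):
        theta = theta - lam * 2 * theta
    return theta

print(abs(run(0.1)))     # shrinks toward 0: a reasonable step size
print(abs(run(1.1)))     # grows past 1: every step overshoots the minimum
print(abs(run(0.0001)))  # still about 1: the steps are too small to make progress
```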
True (Batch) Gradient Descent
● Remember that MSE averages the squared error over all training samples
○ After all, we want our loss to decrease on the entire training set and not just a single example
● Math fact: this is equivalent to averaging the gradients across all individual
training examples
○ In reality, auto-differentiation software (again, more on this next time) will handle any such
calculations by itself so you won’t have to think about this
Mini-Batch Gradient Descent
● Why might finding the true gradient on the entire dataset not be ideal?
○ Computationally inefficient, and it also overfits more easily
● How can we approximate?
○ We can instead take “mini-batches” of data, which are just chunks of the training dataset
■ This “batch size” or “mini-batch size” is a hyperparameter you have to tune
○ Again, our loss is the average of all losses on all training examples in the mini-batch
○ We will hope that this will approximate the true gradient
○ The math works out to be the same, just average the gradients for all the examples in our batch
● Smaller batches can help prevent overfitting by only allowing you to estimate the
true direction — the randomness in this process ends up helping!
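Here is a rough sketch of mini-batch gradient descent on a toy problem: fitting y = w * x + b with an MSE loss. The data, batch size, learning rate, and epoch count are all made-up values, and the gradients are written out by hand instead of coming from autodiff software.

```python
import random

def minibatch_sgd(xs, ys, batch_size=4, lam=0.05, epochs=200, seed=0):
    """Fit y ~ w * x + b with mini-batch gradient descent on MSE."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # visit the data in a new random order each epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Average the per-example gradients over the mini-batch
            gw = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            gb = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w, b = w - lam * gw, b - lam * gb
    return w, b

xs = [i / 10 for i in range(20)]  # toy inputs 0.0, 0.1, ..., 1.9
ys = [3 * x + 1 for x in xs]      # toy labels generated from y = 3x + 1
w, b = minibatch_sgd(xs, ys)
print(w, b)                       # close to 3 and 1
```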
More Neural Network Building Blocks
Activation Functions
● Do we have to use ReLU?
○ No, there are plenty of other choices, we just need something nonlinear
○ Recall what happens when we don’t have a nonlinear function!
● Examples:
○ Sigmoid Function: This is also known as the logistic function and spits out values in the range (0, 1)
■ Not really used that much anymore but historically significant!
○ LeakyReLU: ReLU, except for values less than zero we have a super small slope instead of just being zero
○ GeLU: A version of ReLU that is used in transformers (which is a popular neural network architecture)
○ Hyperbolic tangent: Like sigmoid but outputs in the range (-1, 1)
● We really just use a ReLU because it is cheap to compute and seems to work well enough for a good number of applications
[Figure: plots of the Sigmoid Function and Leaky ReLU]
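For reference, the activations listed above can be sketched as scalar functions in plain Python. The LeakyReLU slope of 0.01 is a common default and is an assumption here, and the GELU shown is the exact form based on the normal CDF; libraries may use an approximation instead.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    """Like ReLU, but negative inputs keep a small slope instead of going to zero."""
    return x if x > 0 else slope * x

def sigmoid(x):
    """Logistic function, with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Compare the activations on a negative, zero, and positive input
for f in (relu, leaky_relu, sigmoid, math.tanh, gelu):
    print(f.__name__, [round(f(x), 3) for x in (-2.0, 0.0, 2.0)])
```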
Loss Functions
● Beyond MSE?
○ Yes, there are many different metrics you might want to optimize for, so long as they are
differentiable w.r.t the model outputs
○ Something like Cross Entropy will generally be preferred for classification
■ Go start with “binary cross entropy” on your own time, it is pretty good to know
● If you want to optimize for two things, you can just create a second loss function
and add it to any existing loss function you’re using
○ Math fact: when you add 2 loss functions, the gradient of the combined loss function is the sum of
the gradients of the loss functions independently
○ Autodiff software will once again do this math for you anyways, but good to know
○ You can scale each loss term by a constant weight to emphasize its effects less or more
Weight Regularization / Weight Decay
● Our weights can be as wild and crazy as they want… is this desirable?
● If a model’s weights are allowed to grow huge, then it can start to predict wild,
erratic, insane decision boundaries
○ Large weights make it so that small changes to input can still result in big changes to the output
○ Maybe we can try and constrain the weights?
● We can add a term to our loss function that is the sum of all the weights squared
○ This is called L2 regularization, and penalizes the network for having large weights
○ We scale this loss term by some constant (denoted by the lambda below) to control just how much
we want to enforce it
● Weight decay can be very useful sometimes so it is good to have this tool in your toolbox
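A minimal sketch of the regularized loss: the data loss (MSE here, as an example) plus lambda times the sum of squared weights. The function names and numbers are illustrative only.

```python
def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def l2_penalty(weights):
    """Sum of all the weights squared."""
    return sum(w ** 2 for w in weights)

def regularized_loss(y_true, y_pred, weights, lam):
    """Data loss plus lam times the L2 penalty on the weights."""
    return mse(y_true, y_pred) + lam * l2_penalty(weights)

y_true, y_pred = [1.0, 2.0], [1.0, 4.0]   # MSE = (0 + 4) / 2 = 2.0
small, large = [0.5, -0.5], [10.0, -10.0]
print(regularized_loss(y_true, y_pred, small, lam=0.5))  # 2.0 + 0.5 * 0.5   = 2.25
print(regularized_loss(y_true, y_pred, large, lam=0.5))  # 2.0 + 0.5 * 200.0 = 102.0
```

With the same predictions, the model with larger weights gets a much higher loss. The gradient of the penalty term with respect to a weight w is 2 * lam * w, so each gradient step also pulls every weight toward zero, which is where the name "weight decay" comes from.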
Deep Double Descent
Why is Deep Learning Pervasive?
● When you start making your model larger, you can start to overfit, leading to worse test-time performance on unseen data
● But when you keep adding more layers, and train for longer, you may very well see the phenomenon of “deep double descent” for certain classes of models
○ You will see your test accuracy go down when you begin overparameterizing, but as you keep on going, you will observe the model’s test accuracy begin to increase for a second time
● This is very weird (but useful) behavior that is not really explainable by classical theory
[Figure: the x-axis represents the size of the model]
● A neural network is a cascade of stacked perceptrons
○ We can choose the number of layers we want, the activation functions we want, and the loss
function we want
● We can optimize this with gradient descent
○ We take the partial derivative of the loss (scalar) on a batch of training examples with respect to
each of the parameters
○ We step down the hill in the direction of steepest descent (the negative gradient)
○ We can tune the learning rate (step size), batch size
● Your job as an ML engineer:
○ Pick the number (and, later, the type) of layers and layer sizes that will work well, the activation
functions, and any other model architecture specific details
○ Pick your loss function or come up with a new one (so long as it is a differentiable scalar function)
○ Pick how you will optimize the neural network / use gradient descent (this is your batch size,
learning rate, etc). You can even choose to vary your learning rate over time (usually a decaying
learning rate)!
■ We will touch on this more in the next lecture
● Picking the best configuration of your network will be trial and error, so make sure
you have a split of data just for tuning these hyperparameters
○ Remember validation sets?
Additional Resources
Deep Learning Additional Video Content

- 3Blue1Brown Videos for an additional perspective :


Play With Neural Networks:

- https://playground.tensorflow.org/
Lecture Attendance

● Slides by Jake Austin
● Edited by Aryan Jain
