[Fall 2024] Deep Learning 1
● What if we connect multiple perceptrons to the same input?
○ Neurons in a brain form all kinds of connections
○ Maybe different perceptrons can be used to extract different kinds of signals out of the same input
● Each perceptron (with its unique weights and biases) can be stacked into a “layer”
[Figure: inputs x1, x2, x3 fanning out to perceptron 1 through perceptron n, each with its own weights]
Perceptron Layer
● To calculate the output of a layer, we can stack the output of each individual perceptron into a vector
● Note: The ReLU is applied to each perceptron output independently
[Figure: inputs x1, x2, x3 feeding perceptron 1 through perceptron n]
Stacking Perceptrons w/ ReLU: Matrix Form
● We can rewrite everything as a single matrix-vector multiplication followed by an element-wise ReLU
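As a quick illustrative sketch (NumPy, with made-up layer sizes): a layer with weight matrix W and bias vector b computes ReLU(Wx + b), evaluating every perceptron in one matrix-vector product.

```python
import numpy as np

def relu(z):
    # ReLU is applied element-wise to the vector of perceptron outputs
    return np.maximum(0, z)

def layer_forward(W, b, x):
    # Each row of W holds one perceptron's weights, b holds the biases;
    # one matrix-vector product evaluates every perceptron at once.
    return relu(W @ x + b)

# 3 inputs (x1, x2, x3) feeding a layer of 4 perceptrons
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
b = rng.normal(size=4)
x = np.array([1.0, -2.0, 0.5])
print(layer_forward(W, b, x))  # a 4-dimensional output vector
```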
Neural Network
● Each node (except for the column of input nodes) is just a perceptron
● Each perceptron layer is also called a neural network layer
○ We can choose to stack arbitrary numbers of perceptrons at each layer or cascade with an arbitrary number of layers
○ Each layer can have a different number of perceptrons
○ The middle layers are often called “hidden layers” since they aren’t immediately interpretable
● We will still choose to use the ReLU function for each layer for now
Note: We don’t activate with ReLU on the output layer. This model is also called a “Multi-Layer Perceptron” or MLP for obvious reasons.
Cascading Layers
● We can start to cascade these layers to form the full neural network: the input x flows through layer (W1, b1), then layer (W2, b2), and so on
Without ReLU, it basically looks like we are only using one giant linear layer
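A minimal NumPy sketch of that note (shapes made up for illustration): two cascaded layers with no activation in between collapse into a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two cascaded layers with NO activation in between...
y = W2 @ (W1 @ x + b1) + b2

# ...collapse into one linear layer: W = W2 W1, b = W2 b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
print(np.allclose(y, W @ x + b))  # True: no extra expressive power
```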
Note: we can stack the true labels and model predictions for all training samples into N-dimensional vectors,
and define MSE as a vector function
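Written out (one standard way, consistent with the note above):

$$\mathrm{MSE}(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 = \frac{1}{N}\,\lVert \mathbf{y} - \hat{\mathbf{y}} \rVert_2^2$$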
Hill Metaphor
● If you are on a hill and you want to reach the bottom, but
can only see a foot around you, what do you do?
○ We should just follow the slope of the hill and hope it gets us down
right?
● We will need to take a couple steps in the downward
direction, stop and re-evaluate our direction, then take a
few more steps and so on
Which direction is the steepest?
● Suppose the hill’s elevation is given by a scalar function of our (vector) position
● Its steepest descent is along the negative gradient vector
○ → If we want to go down a hill, follow the negative gradient evaluated at our current position
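In symbols: for an elevation function f, the direction of steepest descent at position p is the negative gradient,

$$\text{steepest descent at } \mathbf{p} \;=\; -\nabla f(\mathbf{p})$$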
Gradient Descent Example
The model’s loss function for a set of data-points (x_i, y_i) can be written as
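For instance, for the two-layer network from before with an MSE loss, it would look like:

$$L(W_1, b_1, W_2, b_2) = \frac{1}{N}\sum_{i=1}^{N}\Bigl(y_i - \bigl(W_2\,\mathrm{ReLU}(W_1 x_i + b_1) + b_2\bigr)\Bigr)^2$$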
Notice:
● This loss function spits out a scalar, just like the elevation of the hill previously was also a scalar
● L is differentiable (almost) everywhere, since all we have done so far is just a bunch of multiplications, additions, and a ReLU (the ReLU has a kink at zero, but this doesn’t matter in practice)
● The parameters of this model are just the scalar components of the weight matrices and bias vectors… we
can take those as the input variables to the loss function
● The data-points are fixed so we can treat x_i and y_i as constants
Gradient Descent
● Our loss is a function of the model parameters and we want to minimize it
● This is just the same as going down a hill
○ We want to find what direction we can step our parameters in to decrease the neural network’s loss
(on its training data) as much as possible with a single step
○ All our parameters are scalars, so we can just take the partial derivatives (i.e. gradient) of our loss
with respect to our parameters
Matrix form:
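Collecting every partial derivative into a single vector gives the gradient (presumably what the slide showed here):

$$\nabla_\theta L = \begin{bmatrix} \dfrac{\partial L}{\partial \theta_1} & \dfrac{\partial L}{\partial \theta_2} & \cdots & \dfrac{\partial L}{\partial \theta_p} \end{bmatrix}^{\top}$$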
Gradient Descent
● Bear in mind what we need to calculate: the gradient
evaluated at our current position
○ When we’re walking down a hill in the fog, we only want to know how the hill slopes where we are currently standing
● Therefore, we only care about finding the gradient vector
evaluated at our current weights and training examples
● With neural networks being as complex as they are, it’s
hard to solve for the gradient w.r.t. their parameters
symbolically
● However, it is surprisingly easy to solve for when it’s
evaluated at a given set of parameters
○ More on this next time… for now, assume that a genie gives the
gradients to us
Gradient Update
● To “reach” the parameters at the bottom of the “loss function hill”, we need to step in the direction of its steepest descent, i.e., along the negative gradient vector
○ We might also want to adjust our step size so it’s not too small or too big, and we do so by scaling the gradient update with a scalar lambda
○ This lambda is also called our “learning rate”
● We keep taking such steps to continue going down this “hill”
Notation Clarity: Here, theta represents the model’s parameters all in one vector, and data is some training example (along with its label if there is one). The first equation is a component-wise update while the second equation is the same update in vector form.
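In standard notation, the two updates described above are presumably:

$$\theta_j \leftarrow \theta_j - \lambda\,\frac{\partial L(\theta;\,\text{data})}{\partial \theta_j} \qquad \text{(component-wise)}$$

$$\theta \leftarrow \theta - \lambda\,\nabla_\theta L(\theta;\,\text{data}) \qquad \text{(vector form)}$$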
Gradient Descent Convergence
● You will notice that as we get closer to the bottom
of the hill, the partial derivatives become smaller,
and we eventually slow to a halt (assuming our
learning rate isn’t too big)
● What happens when our step size is too big?
○ We will just skip across the minimum instead of slowing
down to hit it
○ This causes training instability
● What happens when our step size is too small?
○ Our steps will be so small that it will take forever to
converge
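A tiny demo of these failure modes (my own sketch: minimizing f(x) = x², whose gradient is exactly 2x):

```python
def gradient_descent(lr, steps=20, x=5.0):
    # Minimize f(x) = x**2; its gradient is 2*x.
    for _ in range(steps):
        x = x - lr * 2 * x  # step along the negative gradient
    return x

print(gradient_descent(lr=0.1))    # converges smoothly toward 0
print(gradient_descent(lr=0.001))  # too small: barely moves in 20 steps
print(gradient_descent(lr=1.1))    # too big: skips across the minimum and blows up
```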
True (Batch) Gradient Descent
● Remember that MSE averages the squared error over all training samples
○ After all, we want our loss to decrease on the entire training set and not just a single example
● Math fact: this is equivalent to averaging the gradients across all individual
training examples
○ In reality, auto-differentiation software (again, more on this next time) will handle any such
calculations for you, so you won’t have to think about this
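That “math fact” is just linearity of differentiation: the gradient of the average is the average of the gradients.

$$\nabla_\theta\left[\frac{1}{N}\sum_{i=1}^{N} L_i(\theta)\right] = \frac{1}{N}\sum_{i=1}^{N}\nabla_\theta L_i(\theta)$$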
Mini-Batch Gradient Descent
● Why might finding the true gradient on the entire dataset not be ideal?
○ It is computationally inefficient, and it can also overfit more easily
● How can we approximate?
○ We can instead take “mini-batches” of data, which are just chunks of the training dataset
■ This “batch size” or “mini-batch size” is a hyperparameter you have to tune
○ Again, our loss is the average of all losses on all training examples in the mini-batch
○ We will hope that this will approximate the true gradient
○ The math works out to be the same, just average the gradients for all the examples in our batch
● Smaller batches can help prevent overfitting because you can only estimate the true gradient
direction: the randomness in this process ends up helping!
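A sketch of one epoch of the mini-batch procedure described above (the grad_loss function here is a hypothetical stand-in for whatever the autodiff software computes):

```python
import numpy as np

def minibatch_epoch(params, X, Y, grad_loss, batch_size=32, lr=0.01):
    # One epoch of mini-batch gradient descent.
    # grad_loss(params, X_batch, Y_batch) is assumed to return the
    # average gradient over the batch (autodiff would supply this).
    n = len(X)
    order = np.random.permutation(n)  # shuffle so batches are random chunks
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        g = grad_loss(params, X[idx], Y[idx])  # approx. of the true gradient
        params = params - lr * g               # step down the hill
    return params
```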
More Neural Network Building Blocks
Activation Functions
● Do we have to use ReLU?
○ No, there are plenty of other choices, we just need something nonlinear
○ Recall what happens when we don’t have a nonlinear function!
● Examples:
○ Sigmoid Function: This is also known as the logistic function and spits out values between (0, 1)
■ Not really used that much anymore but historically significant!
○ LeakyReLU: ReLU, except for values less than zero we have a super small slope instead of just being zero
○ GeLU: A version of ReLU that is used in transformers (which is a popular neural network architecture)
○ Hyperbolic tangent: Like sigmoid but outputs in the range (-1, 1)
● We really just use a ReLU because it is cheap to compute and seems to work well enough for a good amount of applications
[Figures: plots of the Sigmoid and Leaky ReLU functions]
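For reference, simple NumPy versions of these activations (the GeLU shown is the common tanh approximation):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))           # outputs in (0, 1)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)  # small slope below zero

def gelu(z):
    # Common tanh approximation of GeLU
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def tanh(z):
    return np.tanh(z)                     # outputs in (-1, 1)
```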
Loss Functions
● Beyond MSE?
○ Yes, there are many different metrics you might want to optimize for, so long as they are
differentiable w.r.t. the model outputs
○ Something like Cross Entropy will generally be preferred for classification
■ Go look up “binary cross entropy” on your own time; it is good to know
● If you want to optimize for two things, you can just create a second loss function
and add it to any existing loss function you’re using
○ Math fact: when you add 2 loss functions, the gradient of the combined loss function is the sum of
the gradients of the loss functions independently
○ Autodiff software will once again do this math for you anyway, but it is good to know
○ You can scale each loss term by a constant weight to emphasize its effects less or more
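In symbols, with a constant weight α on the second loss term, the combined loss and its gradient are:

$$L_{\text{total}} = L_1 + \alpha L_2 \quad\Longrightarrow\quad \nabla_\theta L_{\text{total}} = \nabla_\theta L_1 + \alpha\,\nabla_\theta L_2$$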
Weight Regularization / Weight Decay
● Our weights can be as wild and crazy as they want… is this desirable?
● If a model’s weights are allowed to grow huge, then it can start to predict wild,
erratic, insane decision boundaries
○ Large weights make it so that small changes to input can still result in big changes to the output
○ Maybe we can try and constrain the weights?
● We can add a term to our loss function that is the sum of all the weights squared
○ This is called L2 regularization, and penalizes the network for having large weights
○ We scale this loss term by some constant (denoted by the lambda below) to control just how much
we want to enforce it
● Weight decay can be very useful sometimes so it is good to have this tool in your
toolkit
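As an equation (reconstructed from the description above; note this lambda is a separate hyperparameter from the learning-rate lambda earlier):

$$L_{\text{total}} = L_{\text{data}} + \lambda \sum_{j} w_j^{2}$$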
Deep Double Descent
Why is Deep Learning Pervasive?
● When you start making your model larger, you can start to overfit, leading to worse test-time performance on unseen data
● But as you keep adding more layers and training for longer, you may very well see the phenomenon of “deep double descent” for certain classes of models
○ You will see your test accuracy go down when you begin overparameterizing, but as you keep on going, you will observe the model’s test accuracy begin to increase for a second time
● This is very weird (but useful) behavior that is not really explainable by classical theory
[Figure: the double descent curve; the x-axis represents the size of the model]
Takeaways
● A neural network is a cascade of stacked perceptrons
○ We can choose the number of layers we want, the activation functions we want, and the loss
function we want
● We can optimize this with gradient descent
○ We take the partial derivative of the loss (scalar) on a batch of training examples with respect to
each of the parameters
○ We step down the hill in the direction of steepest descent
○ We can tune the learning rate (step size), batch size
Takeaways
● Your job as an ML engineer:
○ Pick the number (and, later, the type) of layers and layer sizes that will work well, the activation
functions, and any other model architecture specific details
○ Pick your loss function or come up with a new one (so long as it is a scalar function)
○ Pick how you will optimize the neural network / use gradient descent (this is your batch size,
learning rate, etc). You can even choose to vary your learning rate over time (usually a decaying
learning rate)!
■ We will touch on this more in the next lecture
● Picking the best configuration of your network will be trial and error, so make sure
you have a split of data just for tuning these hyperparameters
○ Remember validation sets?
Additional Resources
Deep Learning Additional Video Content
- https://playground.tensorflow.org/
Lecture Attendance
http://tinyurl.com/fa24-dl4cv
Contributors
● Slides by Jake Austin
● Edited by Aryan Jain