CS 224D: Deep Learning For NLP: Lecture Notes: Part III Spring 2015
z = Wx + b
The activations of the sigmoid function can then be written as:

[a^(1), a^(2), …, a^(m)]^T = σ(z) = σ(Wx + b)

Figure 3: This image captures how multiple sigmoid units are stacked on the right, all of which receive the same input x.
So what do these activations really tell us? Well, one can think
of these activations as indicators of the presence of some weighted
combination of features. We can then use a combination of these
activations to perform classification tasks.
cs 224d: deep learning for nlp 3
minimize J = max(s_c − s, 0)

minimize J = max(∆ + s_c − s, 0)
We can scale this margin such that ∆ = 1 and let the other parameters in the optimization problem adapt to this without any change in performance. For more information on this, read about functional and geometric margins, a topic often covered in the study of Support Vector Machines. (The max-margin objective function is most commonly associated with Support Vector Machines (SVMs).) Finally, we define the following optimization objective which we optimize over all training windows:

minimize J = max(1 + s_c − s, 0)
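As a small illustration (a minimal sketch that is not part of the original notes; the helper name max_margin_loss is ours, and the scores s and s_c would come from forward passes on the true and corrupt windows), the objective for one window can be computed as:

def max_margin_loss(s, s_c, delta=1.0):
    # J = max(delta + s_c - s, 0): the loss is zero once the true window's
    # score s beats the corrupt window's score s_c by the margin delta
    return max(delta + s_c - s, 0.0)

# usage: the loss is positive only when the margin is violated
print(max_margin_loss(s=2.0, s_c=1.5))   # 0.5
print(max_margin_loss(s=3.0, s_c=1.0))   # 0.0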
θ^(t+1) = θ^(t) − α ∇_{θ^(t)} J
• x_i is an input to the neural network and s is the output (score) of the neural network.

• Each layer (including the input and output layers) has neurons which receive an input and produce an output. The j-th neuron of layer k receives the scalar input z_j^(k) and produces the scalar activation output a_j^(k).

• We will call the backpropagated error calculated at z_j^(k) as δ_j^(k).

• Layer 1 refers to the input layer and not the first hidden layer. For the input layer, x_j = z_j^(1) = a_j^(1).

• W^(k) is the transfer matrix that maps the output from the k-th layer to the input of the (k + 1)-th layer. Thus, W^(1) = W and W^(2) = U^T, to put this new generalized notation in perspective of Section 1.3 (see the sketch below).
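To make this notation concrete, here is a minimal numpy sketch (not from the original notes; the layer sizes and initialization are hypothetical placeholders) of the forward pass for the one-hidden-layer network of Section 1.3, with W^(1) = W and W^(2) = U^T:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_input, n_hidden = 20, 8                         # hypothetical layer sizes
x = np.random.randn(n_input)                      # layer 1: x = z^(1) = a^(1)
W = np.random.randn(n_hidden, n_input) * 0.01     # W^(1) = W
b = np.zeros(n_hidden)
U = np.random.randn(n_hidden) * 0.01              # W^(2) = U^T

z2 = W.dot(x) + b          # input to layer 2
a2 = sigmoid(z2)           # activation of layer 2
s = U.dot(a2)              # score s = U^T a^(2), the output of the network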
Let us begin: suppose the cost J = (1 + s_c − s) is positive and we want to perform the update of parameter W_14^(1) (in Figure 5 and Figure 6). We must realize that W_14^(1) only contributes to z_1^(2) and thus a_1^(2). This fact is crucial to understanding backpropagation – backpropagated gradients are only affected by values they contribute to. a_1^(2) is consequently used in the forward computation of the score through multiplication with W_1^(2). We can see from the max-margin loss that:

∂J/∂s = −∂J/∂s_c = −1

Therefore we will work with ∂s/∂W_ij^(1) here for simplicity. Thus,

∂s/∂W_ij^(1) = W_i^(2) ∂a_i^(2)/∂W_ij^(1) = W_i^(2) σ'(z_i^(2)) ∂z_i^(2)/∂W_ij^(1) = W_i^(2) σ'(z_i^(2)) a_j^(1) = δ_i^(2) · a_j^(1)
We see above that the gradient reduces to the product δ_i^(2) · a_j^(1), where δ_i^(2) is essentially the error propagating backwards from the i-th neuron in layer 2, and a_j^(1) is the input fed to the i-th neuron in layer 2 when scaled by W_ij^(1).
Let us discuss the "error sharing/distribution" interpretation of backpropagation better using Figure 6 as an example. Say we were to update W_14^(1):

Figure 6: This subnetwork shows the relevant parts of the network required to update W_ij^(1).

1. We start with an error signal of 1 propagating backwards from a_1^(3).
Bias Updates: Bias terms (such as b_1^(1)) are mathematically equivalent to other weights contributing to the neuron input (z_1^(2)) as long as the input being forwarded is 1. As such, the bias gradient for neuron i in layer k is simply δ_i^(k). For instance, if we were updating b_1^(1) instead of W_14^(1) above, the gradient would simply be σ(z_1^(2))(1 − σ(z_1^(2))) W_1^(2).
2. We propagate this error backwards to a_j^(k−1) by multiplying δ_i^(k) by the path weight W_ij^(k−1).

3. Thus, the error received at a_j^(k−1) is δ_i^(k) W_ij^(k−1).
4. However, a_j^(k−1) may have been forwarded to multiple nodes in the next layer as shown in Figure 8. It should receive responsibility for errors propagating backward from node m in layer k too, using the exact same mechanism.

5. Thus, the error received at a_j^(k−1) is δ_i^(k) W_ij^(k−1) + δ_m^(k) W_mj^(k−1).
6. In fact, we can generalize this to be ∑_i δ_i^(k) W_ij^(k−1).
7. Now that we have the correct error at a_j^(k−1), we move it across neuron j at layer k − 1 by multiplying with the local gradient σ'(z_j^(k−1)).
8. Thus, the error that reaches z_j^(k−1), called δ_j^(k−1), is σ'(z_j^(k−1)) ∑_i δ_i^(k) W_ij^(k−1).
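The element-wise recipe above can be written directly as a loop. The following is a minimal sketch (not from the original notes; the argument names and shapes are our assumptions, with sigmoid activations):

import numpy as np

def sigmoid_grad(z):
    # derivative of the sigmoid evaluated at the pre-activation z
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# delta_k: error vector at layer k, shape (n_k,)
# W_prev:  transfer matrix W^(k-1), shape (n_k, n_{k-1})
# z_prev:  pre-activations z^(k-1), shape (n_{k-1},)
def backprop_error(delta_k, W_prev, z_prev):
    delta_prev = np.zeros_like(z_prev)
    for j in range(len(z_prev)):
        # step 6: sum the error arriving at a_j^(k-1) from every neuron i of layer k
        incoming = sum(delta_k[i] * W_prev[i, j] for i in range(len(delta_k)))
        # steps 7-8: move the error across neuron j via the local gradient
        delta_prev[j] = sigmoid_grad(z_prev[j]) * incoming
    return delta_prev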
For a given parameter W_ij^(k), we identified that the error gradient is simply δ_i^(k+1) · a_j^(k). As a reminder, W^(k) is the matrix that maps a^(k) to z^(k+1). We can thus establish that the error gradient for the entire matrix W^(k) is:

∇_{W^(k)} = [ δ_1^(k+1) a_1^(k)   δ_1^(k+1) a_2^(k)   ···
              δ_2^(k+1) a_1^(k)   δ_2^(k+1) a_2^(k)   ···
                     ⋮                    ⋮            ⋱  ]  = δ^(k+1) a^(k)T
Thus, we can write an entire matrix gradient using the outer product of the error vector propagating into the matrix and the activations forwarded by the matrix.

Now, we will see how we can calculate the error vector δ^(k). We established earlier using Figure 8 that δ_j^(k) = σ'(z_j^(k)) ∑_i δ_i^(k+1) W_ij^(k). This can easily generalize to matrices such that:

δ^(k) = σ'(z^(k)) ◦ (W^(k)T δ^(k+1))

Error propagates from layer (k + 1) to layer k in the following manner:

δ^(k) = f'(z^(k)) ◦ (W^(k)T δ^(k+1))

Of course, this assumes that in the forward propagation the signal z^(k) first goes through activation neurons f to generate activations a^(k), and these are then linearly combined to yield z^(k+1) via the transfer matrix W^(k).
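Putting the two formulas together, one backward step can be vectorized as in the following minimal sketch (not the course's reference implementation; argument names and shapes are our assumptions, with sigmoid activations):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One backward step from layer k+1 to layer k, following
#   grad_W^(k) = delta^(k+1) a^(k)T        (outer product)
#   delta^(k)  = sigma'(z^(k)) o (W^(k)T delta^(k+1))
def backward_step(delta_next, a_k, z_k, W_k):
    grad_W = np.outer(delta_next, a_k)        # gradient for the matrix W^(k)
    grad_b = delta_next                       # bias gradient is just delta^(k+1)
    s = sigmoid(z_k)
    delta_k = (s * (1.0 - s)) * W_k.T.dot(delta_next)   # error passed further back
    return grad_W, grad_b, delta_k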
∂J/∂θ ≈ lim_{h→0} [J(θ + h) − J(θ − h)] / [(θ + h) − (θ − h)] = lim_{h→0} [J(θ + h) − J(θ − h)] / 2h
We know the above holds true from the definition of differentiation, but how does this apply to getting an error gradient? Well, the term J(θ + h) is simply the error calculated on a forward pass for a given dataset when we perturb the parameter θ by +h. Similarly, the term J(θ − h) is the error calculated on a forward pass for the same dataset when we perturb the parameter θ by −h. Thus, using two forward passes, we can approximate the gradient with respect to any given parameter in the model. Of course, considering how many operations are required for even a forward pass, this is an extremely expensive procedure for evaluating gradients – however, it is a great way of verifying implementations of backpropagation.

Gradient checks are a great way to compare analytical and numerical gradients. The analytical gradient should be close to the numerical gradient, which can be calculated using:

∂J/∂θ ≈ lim_{h→0} [J(θ + h) − J(θ − h)] / 2h

J(θ + h) and J(θ − h) can be evaluated using two forward passes. An implementation of this can be seen in Snippet 2.1.

A naive implementation of gradient check can be implemented in the following manner:

Snippet 2.1
def eval_numerical_gradient(f, x):
    """
    a naive implementation of numerical gradient of f at x
    - f should be a function that takes a single argument
    - x is the point (numpy array) to evaluate the gradient at
    """
    grad = np.zeros_like(x)
    h = 1e-5   # step for the centered difference (two forward passes per dimension)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old_value = x[ix]
        x[ix] = old_value + h
        fxph = f(x)               # evaluate J(theta + h)
        x[ix] = old_value - h
        fxmh = f(x)               # evaluate J(theta - h)
        x[ix] = old_value         # restore the original value (important!)
        grad[ix] = (fxph - fxmh) / (2 * h)
        it.iternext()
    return grad
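For example (a hypothetical check that is not part of the original snippet), one can compare the numerical gradient of a simple quadratic loss against its known analytical gradient:

import numpy as np

# J(theta) = 0.5 * ||theta||^2 has the analytical gradient dJ/dtheta = theta
J = lambda theta: 0.5 * np.sum(theta ** 2)

theta = np.random.randn(5)
numerical = eval_numerical_gradient(J, theta)
analytical = theta

# the two should agree up to the O(h^2) error of the centered difference
print(np.max(np.abs(numerical - analytical)))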
2.2 Regularization
Like most classifiers, neural networks are also prone to overfitting
which causes their validation and test performances to be subop-
timal. We can implement L2 regularization such that the loss with
regularization, JR , is calculated to be:
J_R = J + λ ∑_{i=1}^{L} ‖W^(i)‖_F

In the above formulation, ‖W^(i)‖_F is the Frobenius norm of the matrix W^(i) and λ is the relative weight assigned to regularization
in the weighted-sum objective. Adding this type of regularization
penalizes weights for being large in magnitude by contributing
to the cost quadratically with it. This reduces the flexibility of the
target function (i.e. classifier) thereby reducing the overfitting phe-
nomenon. Imposing such a constraint can be interpreted as the prior
Bayesian belief that the optimal weights are close to zero. How close
you wonder? Well, that is what λ controls – large values of λ lead to a stronger belief that the weights should be close to zero. It must be
noted that the bias terms are not regularized and do not contribute to
the cost term above – try thinking about why this is the case.
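A minimal sketch of adding this penalty to the loss (not from the original notes; the helper name and arguments are ours), leaving the bias terms out of the penalty as noted above:

import numpy as np

def regularized_loss(J, weight_matrices, lam):
    # J_R = J + lambda * sum_i ||W^(i)||_F ; only transfer matrices enter the sum
    penalty = sum(np.linalg.norm(W) for W in weight_matrices)   # Frobenius norms
    return J + lam * penalty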
Sigmoid: This is the default choice we have discussed and the activation function σ is:

σ(z) = 1 / (1 + exp(−z))

Figure 9: The response of a sigmoid nonlinearity.

The gradient of σ(z) is:

σ'(z) = exp(−z) / (1 + exp(−z))^2 = σ(z)(1 − σ(z))
Tanh: The tanh function is an alternative nonlinearity to the sigmoid; its activation is:

tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z)) = 2σ(2z) − 1

where tanh(z) ∈ (−1, 1).

Figure 10: The response of a tanh nonlinearity.
Hard tanh: The hard tanh function is sometimes preferred over the tanh function since it is computationally cheaper. It does however saturate for magnitudes of z greater than 1. The activation of the hard tanh is:

hardtanh(z) = { −1 : z < −1
                 z : −1 ≤ z ≤ 1
                 1 : z > 1 }

Figure 11: The response of a hard tanh nonlinearity.
The derivative can also be expressed in a piecewise functional form:

hardtanh'(z) = { 1 : −1 ≤ z ≤ 1
                 0 : otherwise }
Soft sign: The soft sign function is another nonlinearity which can
be considered an alternative to tanh since it too does not saturate as
easily as hard clipped functions:
softsign(z) = z / (1 + |z|)
Figure 12: The response of a soft sign nonlinearity.

The derivative is then expressed as:

softsign'(z) = 1 / (1 + |z|)^2
ReLU: The rectified linear unit (ReLU) is a popular choice of nonlinearity; its activation is:

rect(z) = max(z, 0)

Figure 13: The response of a ReLU nonlinearity.

The derivative is then the piecewise function:

rect'(z) = { 1 : z > 0
             0 : otherwise }
Leaky ReLU: The leaky ReLU allows a small gradient to flow even when z is negative, where k is a small constant with 0 < k < 1:

leaky(z) = max(z, k · z)
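As a reference sketch (not part of the original notes), the nonlinearities above and their derivatives can be written compactly in numpy:

import numpy as np

def sigmoid(z):      return 1.0 / (1.0 + np.exp(-z))
def d_sigmoid(z):    s = sigmoid(z); return s * (1.0 - s)

def tanh(z):         return np.tanh(z)
def d_tanh(z):       return 1.0 - np.tanh(z) ** 2

def hardtanh(z):     return np.clip(z, -1.0, 1.0)
def d_hardtanh(z):   return ((z >= -1.0) & (z <= 1.0)).astype(float)

def softsign(z):     return z / (1.0 + np.abs(z))
def d_softsign(z):   return 1.0 / (1.0 + np.abs(z)) ** 2

def relu(z):         return np.maximum(z, 0.0)
def d_relu(z):       return (z > 0.0).astype(float)

def leaky_relu(z, k=0.01):    return np.maximum(z, k * z)
def d_leaky_relu(z, k=0.01):  return np.where(z > 0.0, 1.0, k)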
θ^(t+1) = θ^(t) − α ∇_{θ^(t)} J
You might think that for fast convergence rates, we should set α to larger values; however, faster convergence is not guaranteed with larger learning rates. In fact, with very large learning rates, we might experience that the loss function actually diverges because the parameter updates cause the model to overshoot the convex minima as shown in Figure 15. In non-convex models (most of those we work with), the outcome of a large learning rate is unpredictable, but the chances of diverging loss functions are very high.
The simple solution to avoiding a diverging loss is to use a very small learning rate so that we carefully scan the parameter space. Furthermore, we can keep this learning rate fixed for all parameters in the model instead of having a more complex scheme in which each parameter has its own learning rate.

Figure 15: Here we see that updating parameter w2 with a large learning rate can lead to divergence of the error.
Since training is the most expensive phase in a deep learning system, some research has attempted to improve this naive approach to setting learning rates. For instance, Ronan Collobert scales the learning rate of a weight W_ij (where W ∈ R^{n^(l+1) × n^(l)}) by the inverse square root of the fan-in of the neuron (n^(l)). Another approach is to allow the learning rate to decrease over time such that:

α(t) = α(0) τ / max(t, τ)
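A one-line sketch of this annealing schedule (the function name is ours; α(0) and τ are hyperparameters one would tune):

def annealed_lr(t, alpha0, tau):
    # alpha(t) = alpha(0) * tau / max(t, tau): constant for the first tau
    # iterations, then decaying roughly as 1/t
    return alpha0 * tau / max(t, tau)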
AdaGrad takes this further by giving each parameter its own learning rate that depends on the history of that parameter's gradients: parameters that have not been updated much in the past are likelier to have higher learning rates now. Formally:
θ^(t) = θ^(t−1) − α / √(∑_{τ=1}^{t} (∇_θ J^(τ))^2) · ∇_θ J^(t)
In this technique, we see that if the RMS of the history of gradients
is extremely low, the learning rate is very high. A simple implemen-
tation of this technique is:
Snippet 2.2
# Assume the gradient dx and parameter vector x
cache += dx**2
x += - learning_rate * dx / np.sqrt(cache + 1e-8)
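In practice (a hypothetical wrapper around Snippet 2.2; the function and variable names are illustrative), the cache must be initialized to zeros and persists across updates:

import numpy as np

def adagrad_update(x, dx, cache, learning_rate=0.01):
    # cache accumulates the squared gradient history for each parameter
    cache += dx ** 2
    x += -learning_rate * dx / np.sqrt(cache + 1e-8)
    return x, cache

# usage inside a training loop:
# cache = np.zeros_like(x)
# for each minibatch: compute dx, then x, cache = adagrad_update(x, dx, cache)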