cs519 hw2
cs519 hw2
cs519 hw2
Professor: Fuxin Li
3. report.pdf
This report is divided into two part according the assignment requirements: Decision
Tree, and Random Forest.
• In the first part, I would explain how I build a network on the training dataset
with loss function. Specifically how to determine the best parameter and the
corresponding threshold for accuracy and smoothness. At last, I show the exper-
iment result given different tuning plan and would give an explanation.
• In the second part, I would discuss some math tricks and precious experience
from the time-consuming implementation.
2
Fully Connected Neural Network
1. Problem Description
The task is to implement a one hidden layer fully connected neural network using
Python.
Given N training examples in 2 categories {(x1 , y1 ), . . . , (xN , yN )}, I need to imple-
ment backpropagation using the cross-entropy loss on top of a softmax layer: (p(c1 |x) =
1 1
1+exp(−f (x))
, p(c2 |x) = 1+exp(f (x))
), where the output is f (x) = wT2 g(W1T x + b) + c, and
g(x) = max(0, x) is the ReLU activation function. To be simple, I give some symbols
to all positions on the network. The training network works on the 2-class dataset
extracted from CIFAR-10. The data has 10,000 training examples in 3072 dimensions
and 2,000 testing examples. For this assignment, we just treat each dimension as
uncorrelated to each other.
z1 = W1T x + b
z2 = g(z1 ) = max(0, z1 )
z3 = wT2 z2 + c
1
p = σ(z3 ) =
1 + e−z3
E = −y log(p) − (1 − y) log(1 − p)
Figure 1: momentum
2. Network Structure
I construct a network structure with the data structure of class in Python from the
given skeleton. In order to accelerate the running, I put examples within one mini-batch
into a matrix, extending the input x to matrix x, and dimension [d, 1] → [d, batch]. So
Fi.2 the list illustrate the dimension of all data and parameters:
3
Figure 2: dimension in the network
The forward part is easy to implement. Here are some functions for backproro-
gation.
– Output layer
In output layer, the forward direction is z2 → z3 , and the backward direction
is from SigmoidCrossEntropy layer→ back out of output layer.
4
∂E j ∂pj ∂z3j
∂E j
= = (p − y)j zij
i j j i 2
∂w2 ∂p ∂z3 ∂w2
∂E j ∂pj ∂z3j
∂E j
= j j = (p − y)j
∂c
∂p ∂z3 ∂c
j
∂E j ∂pj ∂z3j
∂E j i
ij = ij = (p − y) w2
j j
∂z2 ∂p ∂z3 ∂z2
– Hidden layer
∂E ∂E ∂z2 ∂z1
=
∂W1 ∂z2 ∂z1 ∂W1
∂E ∂E ∂z2 ∂z1
=
∂b ∂z2 ∂z1 ∂b
To simplify, I use matrix to implement and treat the two linear transform layer
as one group of expression. Here, b is denoted as input dimension, m is denoted
as output dimension after linear combination.
batch
∂E X
= z[:, j] · grad output[:, j]T /batch (1)
∂W
j=1
batch
∂E X
= grad output[:, j]/batch (2)
∂b
j=1
∂E
= W · grad output (3)
∂z
Here, grad output means the next layer’s backward output, can be either a
vector or matrix, in output layer or hidden layer, respectively.
In equation (1), ∂E/∂W is [d, m] = [d, 1]·[1, m], then get mean on each batch(all
j).
In equation (2), ∂E/b is [m, 1], derivatived from the mean of a [m, batch] on each
batch(all j).
In equation (3),∂E/z is [d, batch], derivated from the produt of [d, m] · [m, batch].
5
between lowest and largest eigenvalue is large then performing a gradient descent
is slow even if the learning rate large due to the conditioning of the matrix. The
momentum introduces some balancing in the update between the eigenvectors
associated to lower and larger eigenvalues.[1]:
Figure 3: momentum
Listing 3 shows the detailed structure of a ReLU layer, including its forward and
backward calculation.
Listing 4 shows the pseudocode for constructing a layer for probability and loss
function calculation.
This layer
6
Listing 4: class for sigmoid and cross Entropy layer
1 c l a s s SigmoidCrossEntropy(object):
2 def __init__(self, x=0):
3 self.layer_i ← x
4 self.layer_o ← 0 # p, sigmoid output
5 self.label ← 0 # true label
6 self.loss ← 0 # cross Entropy
7 self.back ← 0 # back output
8 def forward(self, x,y): # [1,batch]->[1,batch]
9 def backward(self): # [1,batch]->[1,batch]
7
(7) WriteData
I use this function for saving data and figure results into files. Due to different
type of tuning conditions, the permutation of parameters should be corresponding
to the condition. This is why I need this part.
8
Network training
1. Training Monitoring
I used print and savefig commends to monitor all the training loss, testing loss,training
accuracy, and testing accuracy(from misclassifies) calculated. More results are shown
in subsequent tuning sections.
9
2. Tuning learning rate
In this part, I tuned the learning rate with proper values, and fix other parameters:
learning rate = [0.005,0.001,0.0005,0.0001]
hidden units=100
num_batch=1000
momentum=0.8
The results are shown in Fig.5-8, which are train loss,test loss,train accuracy and test
accuracy against epoch with different learning rate, respectively.
We can see from the results that, with smaller learning rate, the better stability
will the trend of loss and accuracy be. 0.005 seems a good learning rate with fast
convergence and higher accuracy within 30 epochs. But it will impact the testing
accuracy in Fig.6. That means this learning rate causes some sort of over-fitting.
Figure 5: Train Loss with learning rate Figure 6: Test Loss with learning rate
Figure 7: Train Accuracy with learning rate Figure 8: Test Accuracy with learning rate
10
3. Tuning number of hidden units
In this part, I tuned the number of hidden units with proper values, and fix other
parameters with:
learning rate = 0.001
hidden units = [10,100,1000]
num_batch = 1000
momentum = 0.8
The results are shown in Fig.9-12, which are train loss,test loss,train accuracy and test
accuracy against epoch with different number of hidden units, respectively.
We can see from the results that, with more hidden units, the better accuracy we
can get. 1000 seems a good hidden units with fast convergence and higher accuracy
within 30 epochs. But it will impact the testing accuracy in Fig.10. That means this
number of hidden units causes some sort of over-fitting. So a reasonable hidden units
cannot be too big, making 100 is a good choice in my future trail.
Figure 9: Train Loss with hidden units Figure 10: Test Loss with hidden units
Figure 11: Train Accuracy with hidden Figure 12: Test Accuracy with hidden units
units
11
4. Tuning the number of batches
In this part, I tuned the number of hidden units with proper values, and fix other
parameters with:
Figure 13: Train Loss with batches Figure 14: Test Loss with batches
Figure 15: Train Accuracy with batches Figure 16: Test Accuracy with batches
The results are shown in Fig.13-16, which are train loss,test loss,train accuracy and
test accuracy against epoch with different number of batches, respectively.
We can see from the results that, with more number of batches, the better stability
will the trend of loss and accuracy be. 2000 seems a good number of batches with
fast convergence and higher accuracy within 30 epochs. But it will impact the testing
accuracy in Fig.14. The reason is that, if number of batches are big enough to the
number of examples, the network will update every backward procedure of example, so
12
the last status of a epoch is mainly related to the last example, which can not reflect
the whole dataset. So a reasonable number of batches cannot be too big, making 1000
is a good choice.
13
Discussion
1.Initialization
In my strategy,elements in weight W is [−1, 1], then after fist epoch, average loss is
2.64, accuracy is 54%. However if I set the initial value in W ∈ [−0.1, 0.1], after first
epoch, the average loss is 0.55, accuracy is 74%. So making the initialization of weights
can help fast convergence.
2. Overflow in calculation
Even though I set initial values for weights and bias very small, they can grow after
several updates if learning rate is not small enough, and they eventually accumulate to
z, the output of hidden layer. It is possible that z < −50, making the output of soft-
max layer, p = σ(z), naturally to 0. However, considering the intermediate calculation
e−z , I would encounter some N AN results due to overflow. To solve this problem, I
need to put every calculation in safe place, and use a better equation for both Sigmoid
and Croos-Entropy[2].
Sigmoid:
1
,z > 0
1 1 + e−|z|
p = σ(z) =
1 + e−z e−|z|
,z < 0
1 + e−|z|
Cross-Entropy:
E = −y log(p) − (1 − y) log(1 − p)
= log(1 + e−z ) − zy + z %z > 0, otherwise would overflow
= log(1 + ez ) − zy %z < 0, otherwise would overflow
= log(1 + e−|z| ) − zy + max(z, 0) % overflow free
14
Reference
15