Programming Test: Learning Activations in Neural Networks: Monk AI
Abstract: The choice of Activation Function (AF) has proven to be an important factor that affects the performance of an Artificial Neural Network (ANN). Use a 1-hidden-layer neural network model that adapts to the most suitable activation function according to the data set. The ANN model can learn for itself the best AF to use by exploiting a flexible functional form, $k_0 + k_1 x$, with parameters $k_0, k_1$ being learned from multiple runs. You can use this code base for implementation guidelines and help: https://github.com/sahamath/MultiLayerPerceptron

I. BACKGROUND

Selecting the best-performing AF for a classification task is essentially a naive (or brute-force) procedure: a popularly used AF is picked and used in the network to approximate the optimal function. If this function fails, the process is repeated with a different AF until the network learns to approximate the ideal function. It is interesting to ask whether a framework can be built that uses the inherent clues and insights from the data to bring about the most suitable AF. Such an approach could not only save significant time and effort in tuning the model, but could also open up new ways of discovering essential features of not-so-popular AFs.

II. PROBLEM STATEMENT

Given a specific activation function

$$g(x) = k_0 + k_1 x \tag{1}$$

and categorical cross-entropy loss, design a neural network on the Banknote, MNIST or IRIS data, where the activation function parameters $k_0, k_1$ are learned from the data set you choose. Your solution must include the learned parameter values, i.e. the final $k_0, k_1$ values at the end of training, a plot depicting the changes in $k_0, k_1$ at each epoch, training vs. test loss, training vs. test accuracy, and a loss function plot.

III. MATHEMATICAL FRAMEWORK

A. Compact Representation

Let the proposed Ada-Act activation function be mathematically defined as:

$$g(x) = k_0 + k_1 x \tag{2}$$

where the coefficients $k_0, k_1$ have to be learned during training via back-propagation of error gradients, on a particular data set specified in the problem statement.

For the purpose of demonstration, consider a feed-forward neural network consisting of an input layer $L_0$ of $m$ nodes for $m$ features, two hidden layers $L_1$ and $L_2$ of $n$ and $p$ nodes respectively, and an output layer $L_3$ of $k$ nodes for $k$ classes. Let $z_i$ and $a_i$ denote the inputs to and the activations of the nodes in layer $L_i$ respectively. Let $w_i$ and $b_i$ denote the weights and the biases applied to the nodes of layer $L_{i-1}$, and let the activations of layer $L_0$ be the input features of the training examples. Finally, let $K$ denote the column matrix containing the equation coefficients,

$$K = \begin{bmatrix} k_0 \\ k_1 \\ k_2 \end{bmatrix}$$

and let $t$ denote the number of training examples taken in one batch. Then the forward-propagation equations will be:

$$\begin{aligned}
z_1 &= a_0 \times w_1 + b_1 \\
a_1 &= g(z_1) \\
z_2 &= a_1 \times w_2 + b_2 \\
a_2 &= g(z_2) \\
z_3 &= a_2 \times w_3 + b_3 \\
a_3 &= \mathrm{Softmax}(z_3)
\end{aligned}$$

where $\times$ denotes the matrix multiplication operation and $\mathrm{Softmax}()$ denotes the Softmax activation function.

For back-propagation, let the loss function used in this model be the Categorical Cross-Entropy Loss, let $df_i$ denote the gradient matrix of the loss with respect to the matrix $f_i$, where $f$ can be substituted with $z$, $a$, $b$, or $w$, and let there be matrices $dK_2$ and $dK_1$ of dimension $3 \times 1$. Then the back-propagation equations will be:

$$\begin{aligned}
dz_3 &= a_3 - y \\
dw_3 &= \frac{1}{t}\, a_2^T \times dz_3 \\
db_3 &= \mathrm{avg}_{col}(dz_3) \\
da_2 &= dz_3 \times w_3^T \\
dz_2 &= g'(z_2) * da_2 \\
dw_2 &= \frac{1}{t}\, a_1^T \times dz_2 \\
db_2 &= \mathrm{avg}_{col}(dz_2)
\end{aligned}$$
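As an illustration only, the forward-propagation equations and the back-propagation equations listed so far (through db2) could be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the reference implementation. The function names (`g`, `softmax`, `forward`, `backward_upper_layers`, `mean_cross_entropy`) are hypothetical. Each row of a0 is assumed to be one training example, avg col is read as the average over the batch axis, k0 and k1 are assumed to be shared by both hidden layers, and the loss is assumed to be averaged over the batch. Gradients for the first layer and for k0, k1 are not covered here.

```python
import numpy as np

def g(z, k0, k1):
    """Learnable linear activation g(x) = k0 + k1 * x, applied element-wise."""
    return k0 + k1 * z

def softmax(z):
    """Row-wise softmax; the row maximum is subtracted for numerical stability."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(a0, params, k0, k1):
    """Forward pass; a0 has shape (t, m), one training example per row (assumption)."""
    w1, b1, w2, b2, w3, b3 = params
    z1 = a0 @ w1 + b1
    a1 = g(z1, k0, k1)
    z2 = a1 @ w2 + b2
    a2 = g(z2, k0, k1)
    z3 = a2 @ w3 + b3
    a3 = softmax(z3)
    return z1, a1, z2, a2, z3, a3

def backward_upper_layers(cache, y, params, k1):
    """Gradients for layers 3 and 2 only; y is one-hot with shape (t, k)."""
    z1, a1, z2, a2, z3, a3 = cache
    w3 = params[4]
    t = y.shape[0]
    dz3 = a3 - y
    dw3 = a2.T @ dz3 / t
    db3 = dz3.mean(axis=0)   # avg col, read here as the mean over the batch axis
    da2 = dz3 @ w3.T
    dz2 = k1 * da2           # g'(z2) = k1 for g(x) = k0 + k1 * x
    dw2 = a1.T @ dz2 / t
    db2 = dz2.mean(axis=0)
    return dw3, db3, dw2, db2

def mean_cross_entropy(a3, y):
    """Categorical cross-entropy averaged over the batch (assumed normalization)."""
    return -np.mean(np.sum(y * np.log(a3), axis=1))
```

A finite-difference comparison of `mean_cross_entropy` against `dw3`, `dw2` and `db2` on a small random batch is a quick way to check that the sketch agrees with the equations before adding the remaining gradients.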