DL Assignment Solution 00 To 10


NPTEL Online Certification Courses

Indian Institute of Technology Kharagpur

Deep Learning
Assignment- Week 0
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 1 = 10
______________________________________________________________________________

QUESTION 1:
Find df/dx, where f = |x|. (|x| denotes the absolute value of x.)

a. 1
b. 𝑆𝑖𝑔𝑛(𝑥)
c. 0
d. ∞

Correct Answer: b

Detailed Solution:

df/dx = 1 for x > 0 and −1 for x < 0 (the derivative is not defined at x = 0), i.e. df/dx = sign(x).
______________________________________________________________________________

QUESTION 2:
Find dσ/dx, where σ(x) = 1/(1 + e^(−x)).

a. dσ/dx = 1 − σ(x)
b. dσ/dx = 1 + σ(x)
c. dσ/dx = σ(x)(1 − σ(x))
d. dσ/dx = σ(x)(1 + σ(x))

Correct Answer: c

Detailed Solution:

σ(x) = 1/(1 + e^(−x))

dσ/dx = (1 + e^(−x))^(−2) · e^(−x)

dσ/dx = e^(−x)/(1 + e^(−x))²
      = (1 + e^(−x) − 1)/(1 + e^(−x))²
      = 1/(1 + e^(−x)) − 1/(1 + e^(−x))²
      = 1/(1 + e^(−x)) · (1 − 1/(1 + e^(−x)))

dσ/dx = σ(x)(1 − σ(x))
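The identity dσ/dx = σ(x)(1 − σ(x)) can also be verified numerically against a finite-difference approximation; the sketch below is illustrative (the helper names are ours):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Analytic derivative: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_grad(f, x, h=1e-6):
    # Central finite-difference approximation of df/dx
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Analytic and numeric derivatives agree at several sample points
for x in (-2.0, 0.0, 0.5, 3.0):
    assert abs(sigmoid_grad(x) - numeric_grad(sigmoid, x)) < 1e-6
```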
______________________________________________________________________________

QUESTION 3:
There are 5 black and 7 white balls. Assume we draw two balls randomly, one by one, without replacement. What is the probability that both balls are black?

a. 20/132
b. 25/144
c. 20/144
d. 25/132

Correct Answer: a

Detailed Solution:

Probability that the first ball is black = 5/(5 + 7) = 5/12.

Probability that the second ball is black, given the first was black = 4/(4 + 7) = 4/11.

So the probability that both balls are black = (5/12) × (4/11) = 20/132.
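The same computation can be done with exact fractions in Python (illustrative sketch):

```python
from fractions import Fraction

# Two draws without replacement from 5 black + 7 white balls
p_first_black = Fraction(5, 12)               # 5 black out of 12
p_second_black_given_first = Fraction(4, 11)  # 4 black out of 11 remaining
p_both_black = p_first_black * p_second_black_given_first

assert p_both_black == Fraction(20, 132) == Fraction(5, 33)
```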

______________________________________________________________________________

QUESTION 4:
Two dice are rolled together. What is the probability of getting a 1 and a 4 together?

a. 1/18
b. 1/36
c. 1
d. None of the above

Correct Answer: a

Detailed Solution:

Number of possible outcomes = 6 × 6 = 36.

Number of outcomes with 1 and 4 together = 2 (1 on the first die and 4 on the second, or 4 on the first die and 1 on the second).

So, probability = 2/36 = 1/18.
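Brute-force enumeration of the 36 outcomes confirms the count (illustrative sketch):

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))       # all 36 rolls of two dice
favourable = [o for o in outcomes if sorted(o) == [1, 4]]
probability = len(favourable) / len(outcomes)

assert len(outcomes) == 36 and len(favourable) == 2
assert abs(probability - 1 / 18) < 1e-12
```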

_____________________________________________________________________________

QUESTION 5:
What will be the possible median of the distribution?

a. 26
b. 34
c. 43
d. 55

Correct Answer: b

Detailed Solution:

Total population = 275 + 291 + 105 + 123 + 131 + 150 + 110 + 90 + 60 + 49 + 50 = 1434.

So, the median is the average of the (1434/2) = 717th and 718th values.

So, the median of the distribution lies in the range 30–40.

So, option b may be the result.
______________________________________________________________________________

QUESTION 6:

The image shows three normally distributed probability density functions with zero mean and three different variances (σ1, σ2, σ3). Which of the following relationships is valid?

a. 𝜎1 > 𝜎2 > 𝜎3
b. 𝜎1 < 𝜎2 < 𝜎3
c. 𝜎1 = 𝜎2 = 𝜎3
d. 𝜎1 > 𝜎2 < 𝜎3

Correct Answer: b

Detailed Solution:

Higher variance means the spread of the distribution will be higher. So, 𝝈𝟏 < 𝝈𝟐 < 𝝈𝟑

____________________________________________________________________________

QUESTION 7:
The inverse of a square matrix A exists if:

a. Determinant of A, det(A) = 0
b. Eigenvalues of A are non-zero
c. Sum of the eigenvalues is non-zero
d. None of the above
d. None of the above

Correct Answer: b

Detailed Solution:

The matrix inverse exists iff det(A) is not equal to zero. Since det(A) equals the product of all the eigenvalues of the square matrix, the inverse exists iff all eigenvalues are non-zero.

_________________________________________________________________________

QUESTION 8:
x1, x2, x3 are linearly independent vectors. If x1 = (1, 3, 0)ᵀ and x2 = (−2, 4, −5)ᵀ, what is a possible value of x3?

a. (−1, 7, −5)ᵀ
b. (0, 10, −5)ᵀ
c. (3, 4, 5)ᵀ
d. (5, −5, 10)ᵀ
Correct Answer: c

Detailed Solution:

Let X = [x1 x2 x3]. x1, x2, x3 are linearly independent if and only if det(X) ≠ 0.

For option c:

det([1 −2 3; 3 4 4; 0 −5 5]) = 25 ≠ 0

We can also verify the linear dependence of options a, b, and d:

Option a: x1 + x2 = x3

Option b: 2x1 + x2 = x3

Option d: x1 − 2x2 = x3
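The determinant test can be reproduced in plain Python; a small sketch (the helper name det3 is ours), with the candidate vectors stacked as columns:

```python
def det3(m):
    # Cofactor expansion along the first row of a 3x3 matrix
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

x1, x2, x3 = (1, 3, 0), (-2, 4, -5), (3, 4, 5)    # option c as the third vector
X = [[x1[r], x2[r], x3[r]] for r in range(3)]     # vectors as columns
assert det3(X) == 25                              # non-zero => independent

# Option a is dependent: x1 + x2 equals the candidate (-1, 7, -5)
dep = [[x1[r], x2[r], x1[r] + x2[r]] for r in range(3)]
assert det3(dep) == 0
```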

______________________________________________________________________________

QUESTION 9:
x + 2y − z = 1 … (1)
−2x − 4y + 2z = −2 … (2)
z = 2 … (3)

What are the values of 𝑥, 𝑦, 𝑧?

a. 𝑥 = 0, 𝑦 = 0, 𝑧 = 2
b. 𝑧 = 2 and infinitely possible 𝑥, 𝑦
c. 𝑧 = 2 and no possible 𝑥, 𝑦
d. None of the above

Correct Answer: b

Detailed Solution:

Equation (2) is simply −2 times equation (1), so it adds no independent constraint. From equation (3), z = 2; substituting into equation (1) gives x + 2y = 3, a single equation in two unknowns. Hence z = 2 and there are infinitely many possible (x, y).
____________________________________________________________________________

QUESTION 10:
What are the eigenvalues of the matrix A?
5 4
𝐴=[ ]
−3 −2
a. 4, −3
b. 5, −2
c. −2, −1
d. 2, 1

Correct Answer: d

Detailed Solution:

det(λI − A) = 0
or, det([λ − 5, −4; 3, λ + 2]) = 0
or, (λ − 5)(λ + 2) + 12 = 0
or, λ² − 3λ + 2 = 0
or, λ = 2, 1
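For a 2×2 matrix the eigenvalues follow directly from the characteristic polynomial λ² − tr(A)λ + det(A) = 0; a plain-Python sketch (helper name ours, assumes real eigenvalues):

```python
import math

def eig2(a, b, c, d):
    # Eigenvalues of [[a, b], [c, d]] via the quadratic formula applied to
    # lambda^2 - (a + d)*lambda + (a*d - b*c) = 0 (assumes real roots)
    tr, det = a + d, a * d - b * c
    disc = math.sqrt(tr * tr - 4 * det)
    return sorted(((tr - disc) / 2, (tr + disc) / 2))

assert eig2(5, 4, -3, -2) == [1.0, 2.0]
```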

______________________________________________________________________________

************END*******

Deep Learning
Assignment- Week 1
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 2= 20
______________________________________________________________________________

QUESTION 1:
The signature descriptor of an unknown shape is given in the figure. Can you identify the unknown shape?

a. Circle
b. Square
c. Straight line
d. Cannot be predicted
Correct Answer: a
Detailed Solution:
The distance from the centroid to the boundary is the same for every value of θ. This is true for a circle with radius k.

______________________________________________________________________________

QUESTION 2:
To measure the smoothness, coarseness, and regularity of a region, which of the following transformations do we use to extract features?
a. Gabor Transformation
b. Wavelet Transformation
c. Both Gabor, and Wavelet Transformation.
d. None of the Above.

Correct Answer: c
Detailed Solution:
One of the important approaches to region description is texture content. It provides measures of important properties of an image region such as smoothness, coarseness, and regularity. We use the Gabor filter and the Wavelet transformation to extract texture features.

QUESTION 3:
Suppose the Fourier descriptor of a shape has K coefficients, and we remove the last few coefficients and use only the first m (m < K) coefficients to reconstruct the shape. What will be the effect of using this truncated Fourier descriptor on the reconstructed shape?
a. We will get a smoothed boundary version of the shape.
b. We will get only the fine details of the boundary of the shape.
c. Full shape will be reconstructed without any loss of information.
d. Low frequency component of the boundary will be removed from contour of the
shape.

Correct Answer: a
Detailed Solution:
The low-frequency components of the Fourier descriptor capture the general shape properties of the object, while the high-frequency components capture the finer detail. So, if we remove the last few components, the finer details are lost, and the reconstructed shape is a smoothed version of the original. The boundary of the reconstructed shape is a low-frequency approximation of the original shape boundary.
______________________________________________________________________________
QUESTION 4:

While computing the polygonal descriptor of an arbitrary shape using the splitting technique, which of the following do we take as the starting guess?

a. The vertex joining the two closest points above a threshold on the boundary.
b. The vertex joining the two farthest points on the boundary.
c. The vertex joining any two arbitrary points on the boundary.
d. None of the above.

Correct Answer: b

Detailed Solution:

Options are self-explanatory.

_____________________________________________________________________________

QUESTION 5:
Consider a two-class Bayes' minimum-risk classifier. The probabilities of classes ω1 and ω2 are P(ω1) = 0.3 and P(ω2) = 0.7 respectively. P(x) = 0.545, P(x|ω1) = 0.65, P(x|ω2) = 0.5, and the loss matrix is

[λ11 λ12]
[λ21 λ22]

If the classifier assigns x to class ω1, then which one of the following is true?

a. (λ21 − λ11)/(λ12 − λ22) < 1.79

b. (λ21 − λ11)/(λ12 − λ22) > 1.79

c. (λ21 − λ11)/(λ12 − λ22) < 1.09

d. (λ21 − λ11)/(λ12 − λ22) > 1.09

Correct Answer: b

Detailed Solution:

Deciding x ∈ ω1 under minimum risk requires

(λ21 − λ11)/(λ12 − λ22) > P(ω2|x)/P(ω1|x)

Now, P(ω1|x) = P(ω1) · P(x|ω1) / P(x) = 0.3 × 0.65/0.545 = 0.358

P(ω2|x) = P(ω2) · P(x|ω2) / P(x) = 0.7 × 0.50/0.545 = 0.642

Therefore (λ21 − λ11)/(λ12 − λ22) > 0.642/0.358 ≈ 1.79.
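Plugging the question's numbers into Bayes' rule reproduces the threshold (illustrative sketch; variable names are ours):

```python
# Numbers from the question
p_w1, p_w2 = 0.3, 0.7
p_x = 0.545
p_x_given_w1, p_x_given_w2 = 0.65, 0.5

post_w1 = p_w1 * p_x_given_w1 / p_x   # P(w1 | x)
post_w2 = p_w2 * p_x_given_w2 / p_x   # P(w2 | x)
ratio = post_w2 / post_w1             # bound on (l21 - l11)/(l12 - l22)

assert abs(post_w1 - 0.358) < 1e-3
assert abs(post_w2 - 0.642) < 1e-3
assert abs(ratio - 1.79) < 0.01
```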
____________________________________________________________________________

QUESTION 6:
The Fourier transform of a complex sequence of numbers s(k), for k = 0, …, N − 1, is given by:

a. a(u) = Σ_{k=0}^{N−1} s(k) e^{j2πuk/N}

b. a(u) = Σ_{k=0}^{N} s(k) e^{j2πuk/N}

c. a(u) = Σ_{k=0}^{N−1} s(k) e^{−j2πuk/N}

d. a(u) = Σ_{k=−N/2}^{N/2} s(k) e^{−j2πuk/N}

Correct Answer: c

Detailed Solution:

The forward discrete Fourier transform sums over k = 0, …, N − 1 with a negative exponent in the kernel, so option c is correct.

_____________________________________________________________________________

QUESTION 7:
The gray-level co-occurrence matrix C of an unknown image is given below. What is the value of the maximum-probability descriptor?

    [1 2 2]
C = [2 1 2]
    [2 3 2]

Fig 1: C

a. 3/17

b. 1/12

c. 3/16

d. 5/16

Correct Answer: a

Detailed Solution:

Maximum probability = max(c_ij), where c_ij is the normalized co-occurrence matrix. The sum of all the entries of C is 17 and the largest entry is 3, so the maximum-probability descriptor is 3/17.
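The descriptor can be computed directly from the matrix entries (illustrative sketch):

```python
C = [[1, 2, 2],
     [2, 1, 2],
     [2, 3, 2]]

total = sum(sum(row) for row in C)             # normalizing constant: 17
max_prob = max(max(row) for row in C) / total  # maximum-probability descriptor

assert total == 17
assert abs(max_prob - 3 / 17) < 1e-12
```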

______________________________________________________________________________

QUESTION 8:
Which of the following is not a boundary descriptor?

a. Polygonal Representation
b. Fourier descriptor
c. Signature
d. Histogram.

Correct Answer: d

Detailed Solution:

Histogram is a region descriptor.

______________________________________________________________________________

QUESTION 9:

We use the gray-level co-occurrence matrix to extract which type of information?


a. Boundary

b. Texture

c. MFCC

d. Zero Crossing rate.


Correct Answer: b
Detailed Explanation: We use different features from the gray-level co-occurrence matrix to determine the textural content of an image region.

______________________________________________________________________________

QUESTION 10:
If the larger values of the gray-level co-occurrence matrix are concentrated around the main diagonal, then which one of the following will be true?

a. The value of the element difference moment will be low.

b. The value of the inverse element difference moment will be low.

c. The value of entropy will be very low.

d. None of the above.

Correct Answer: a

Detailed Solution:

We cannot comment on the entropy based only on the diagonal values, because entropy depends on the randomness of the values. However, the element difference moment will be low, and the inverse element difference moment will be high.

______________________________________________________________________________

************END***********

Deep Learning
Assignment- Week 2
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 2 = 20
______________________________________________________________________________

QUESTION 1:
Suppose you are solving an n-class problem. How many discriminant functions will you need?

a. n-1
b. n
c. n+1
d. n-2

Correct Answer: b

Detailed Solution: For an n-class problem we need n discriminant functions.

______________________________________________________________________________

QUESTION 2:
If we choose the discriminant function g_i(x) as a function of the posterior probability, i.e. g_i(x) = f(p(ω_i|x)), then which of the following cannot be the function f( )?

a. f(x) = a^x, where a > 1
b. f(x) = a^(−x), where a > 1
c. f(x) = 2x + 3
d. f(x) = exp(x)

Correct Answer: b

Detailed Solution:

The function f( ) must be monotonically increasing; f(x) = a^(−x) with a > 1 is monotonically decreasing, so it cannot be used.

______________________________________________________________________________

QUESTION 3:
What will be the nature of the decision surface when the covariance matrices of the different classes are identical but otherwise arbitrary? (Given that all the classes have equal class probabilities.)

a. Always orthogonal to the two surfaces
b. Generally not orthogonal to the two surfaces
c. Bisector of the line joining the two means, but not always orthogonal to the two surfaces
d. Arbitrary

Correct Answer: c

Detailed Solution:

With identical covariances and equal priors, the decision boundary passes through the midpoint of the line joining the two means, but it is orthogonal to Σ⁻¹(μ1 − μ2), so it is not in general orthogonal to the line joining the means.

_____________________________________________________________________________

QUESTION 4:
The means and covariance matrices of all the samples of two normally distributed classes ω1 and ω2 are given as

μ1 = (3, 6)ᵀ, Σ1 = [1/2 0; 0 2] and μ2 = (3, −2)ᵀ, Σ2 = [2 0; 0 2]

What will be the expression of the decision boundary between these two classes if both classes have equal class probability 0.5? For the input sample x = (x1, x2)ᵀ, consider

g_i(x) = −(1/2) xᵀ Σ_i⁻¹ x + (Σ_i⁻¹ μ_i)ᵀ x − (1/2) μ_iᵀ Σ_i⁻¹ μ_i − (1/2) ln|Σ_i| + ln P(ω_i)

a. x2 = 3.514 − 1.12x1 + 0.187x1²

b. x1 = 3.514 − 1.12x2 + 0.187x2²
c. x1 = 0.514 − 1.12x2 + 0.187x2²
d. x2 = 0.514 − 1.12x2 + 0.187x2²

Correct Answer: a

Detailed Solution:

This is the most general case of the discriminant function for a normal density. The inverse covariance matrices are

Σ1⁻¹ = [2 0; 0 1/2] and Σ2⁻¹ = [1/2 0; 0 1/2]

Setting g1(x) = g2(x), we get the decision boundary x2 = 3.514 − 1.12x1 + 0.187x1².

QUESTION 5:

For a two-class problem, the linear discriminant function is given by g(x) = aᵀy, where y is the augmented feature vector. What is the updating rule for finding the weight vector a?

a. Adding the sum of all misclassified augmented feature vectors, multiplied by the learning rate, to the current weight vector.
b. Subtracting the sum of all misclassified augmented feature vectors, multiplied by the learning rate, from the current weight vector.
c. Adding the sum of all augmented feature vectors belonging to the positive class, multiplied by the learning rate, to the current weight vector.
d. Subtracting the sum of all augmented feature vectors belonging to the negative class, multiplied by the learning rate, from the current weight vector.

Correct Answer: a
Detailed Solution:

a(k + 1) = a(k) + η Σ_{y ∈ Y_k} y, where Y_k is the set of misclassified augmented feature vectors at step k.

For the derivation, refer to the video lectures.
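A minimal sketch of this batch update rule on toy 2-D data (data and names are ours; class −1 samples are negated so that correct classification means a·y > 0):

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Augmented vectors y = (1, x1, x2); class -1 samples are negated
ys = [(1, 2.0, 1.0), (1, 1.0, 2.0),    # class +1
      (-1, 1.0, 1.0), (-1, 2.0, 0.5)]  # class -1, negated
a, eta = [0.0, 0.0, 0.0], 1.0

for _ in range(100):
    mis = [y for y in ys if dot(a, y) <= 0]   # misclassified samples
    if not mis:
        break
    # a(k+1) = a(k) + eta * sum of misclassified augmented vectors
    a = [ai + eta * sum(y[i] for y in mis) for i, ai in enumerate(a)]

assert all(dot(a, y) > 0 for y in ys)   # converged: all correctly classified
```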

____________________________________________________________________________

QUESTION 6:

For the minimum distance classifier, which of the following must be satisfied?

a. All the classes should have an identical, diagonal covariance matrix.
b. All the classes should have an identical covariance matrix, but otherwise arbitrary.
c. All the classes should have equal class probability.
d. None of the above.

Correct Answer: c

Detailed Solution: Options are self-explanatory.



QUESTION 7:
Which of the following is the updating rule of the gradient descent algorithm? Here ∇ is the gradient operator and η is the learning rate.

a. 𝑎𝑛+1 = 𝑎𝑛 − 𝜂∇𝐹(𝑎𝑛 )

b. 𝑎𝑛+1 = 𝑎𝑛 + 𝜂∇𝐹(𝑎𝑛 )

c. 𝑎𝑛+1 = 𝑎𝑛 − 𝜂∇𝐹(𝑎𝑛−1 )

d. 𝑎𝑛+1 = 𝑎𝑛 + 𝜂∇𝐹(𝑎𝑛−1 )

Correct Answer: a

Detailed Solution:

Gradient descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient.

______________________________________________________________________________

QUESTION 8:
The decision surface between two normally distributed classes ω1 and ω2 is shown in the figure. Which of the following is true?

a. 𝑝(𝜔1 ) = 𝑝(𝜔2 )

b. 𝑝(𝜔2 ) > 𝑝(𝜔1 )

c. 𝑝(𝜔1 ) > 𝑝(𝜔2 )

d. None of the above.

Correct Answer: c

Detailed Solution:

If the prior probabilities are not equal, the optimal boundary hyperplane is shifted away
from the more likely mean.

______________________________________________________________________________

QUESTION 9:

In the k-nearest neighbours algorithm (k-NN), how do we classify an unknown object?

a. Assigning the label that is most frequent among the k nearest training samples.
b. Assigning the unknown object to the class of its nearest neighbour among the training samples.
c. Assigning the label that is most frequent among all the training samples except the k farthest neighbours.
d. None of these.

Correct Answer: a

Detailed Solution:

Options are self-explanatory.

QUESTION 10:

What is the direction of the weight vector w.r.t. the decision surface for a linear classifier?

a. Parallel
b. Normal
c. At an inclination of 45°
d. Arbitrary

Correct Answer: b

Detailed Solution:

For a linear classifier g(x) = wᵀx + b, g(x) is constant on the decision surface, so w is normal (perpendicular) to that surface.

************END*******

Deep Learning
Assignment- Week 3
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 1 = 10
______________________________________________________________________________

QUESTION 1:
Find the distance of the 3D point, 𝑃 = (−3, 1, 3) from the plane defined by
2𝑥 + 2𝑦 + 5𝑧 + 9 = 0?

a. 3.1
b. 4.6
c. 0
d. ∞ (infinity)

Correct Answer: b

Detailed Solution:

Distance = (−3×2 + 1×2 + 3×5 + 9) / √((−3)×(−3) + 1×1 + 3×3) = 4.6
______________________________________________________________________________

QUESTION 2:
What is the shape of the loss landscape during optimization of SVM?

a. Linear
b. Paraboloid
c. Ellipsoidal
d. Non-convex with multiple possible local minimum

Correct Answer: b

Detailed Solution:

In SVM the objective is to find the maximum-margin hyperplane (W) such that

Wᵀx + b = 1 for class +1, else Wᵀx + b = −1.

For the max-margin condition to be satisfied, we solve to minimize ||W||.

The above optimization is a quadratic optimization with a paraboloid landscape for the loss function.

______________________________________________________________________________

QUESTION 3:
How many local minimum can be encountered while solving the optimization for maximizing
margin for SVM?

a. 1
b. 2
c. ∞ (infinite)
d. 0

Correct Answer: a

Detailed Solution:

In SVM the objective is to find the maximum-margin hyperplane (W) such that

Wᵀx + b = 1 for class +1, else Wᵀx + b = −1.

For the max-margin condition to be satisfied, we solve to minimize ||W||.

The above optimization is a quadratic optimization with a paraboloid landscape for the loss function. Since the shape is a paraboloid, there can be only 1 (global) minimum.

______________________________________________________________________________

QUESTION 4:
Which of the following classifiers can be replaced by a linear SVM?

a. Logistic Regression
b. Neural Networks
c. Decision Trees
d. None of the above

Correct Answer: a

Detailed Solution:

The logistic regression framework belongs to the genre of linear classifiers, which means its decision boundary can segregate classes only if they are linearly separable. SVM is also capable of doing so and thus can be used instead of a logistic regression classifier. Neural networks and decision trees can model non-linear decision boundaries, which a linear SVM cannot model directly.

______________________________________________________________________________

QUESTION 5:
Find the scalar projection of vector b = ⟨−2, 3⟩ onto vector a = ⟨1, 2⟩.

a. 0

b. 4/√5

c. 2/√17

d. −2/17

Correct Answer: b

Detailed Solution:
The scalar projection of b onto a is given by (b · a)/|a| = (−2×1 + 3×2)/√(1² + 2²) = 4/√5.
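A quick numeric check (helper name ours):

```python
import math

def scalar_projection(b, a):
    # (b . a) / |a|
    dot = sum(bi * ai for bi, ai in zip(b, a))
    return dot / math.sqrt(sum(ai * ai for ai in a))

proj = scalar_projection((-2.0, 3.0), (1.0, 2.0))
assert abs(proj - 4 / math.sqrt(5)) < 1e-12
```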

____________________________________________________________________________

QUESTION 6:
For a 2-class problem what is the minimum possible number of support vectors. Assume there
are more than 4 examples from each class?

a. 4
b. 1
c. 2
d. 8

Correct Answer: c

Detailed Solution:

To determine the separating hyperplane, we need at least 1 example (which becomes a support vector) from each of the classes.

____________________________________________________________________________

QUESTION 7:
Which one of the following is a valid representation of hinge loss (with margin = 1) for a two-class problem?

y = class label (+1 or −1).

p = predicted value (not normalized to denote any probability) for a class.

a. L(y, p) = max(0, 1 − yp)
b. L(y, p) = min(0, 1 − yp)
c. L(y, p) = max(0, 1 + yp)
d. None of the above

Correct Answer: a

Detailed Solution:

Hinge loss is meant to yield a value of 0 if the predicted output p has the same sign as the class label and satisfies the margin condition |p| ≥ 1. Otherwise, the loss increases linearly as yp decreases. Option (a) satisfies these criteria.
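The formula in option (a) behaves exactly this way; a minimal sketch:

```python
def hinge_loss(y, p, margin=1.0):
    # Zero when y*p >= margin; grows linearly as y*p falls below it
    return max(0.0, margin - y * p)

assert hinge_loss(+1, 2.0) == 0.0   # correct side, outside the margin
assert hinge_loss(+1, 0.5) == 0.5   # correct side, but inside the margin
assert hinge_loss(-1, 2.0) == 3.0   # wrong side: penalized linearly
```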

______________________________________________________________________________

QUESTION 8:
Suppose we have one feature x ∈ ℝ and a binary class y. The dataset consists of 3 points: p1: (x1, y1) = (−1, −1), p2: (x2, y2) = (1, 1), p3: (x3, y3) = (3, 1). Which of the following is true with respect to SVM?

a. Maximum margin will increase if we remove the point p2 from the training
set.
b. Maximum margin will increase if we remove the point p3 from the training
set.
c. Maximum margin will remain same if we remove the point p2 from the
training set.
d. None of the above.

Correct Answer: a

Detailed Solution:

Here the point p2 is a support vector, if we remove the point p2 then maximum margin will
increase.

____________________________________________________________________________

QUESTION 9:

If we employ SVM to realize two-input logic gates, then which of the following will be true?

a. The weight vector for AND gate and OR gate will be same.
b. The margin for AND gate and OR gate will be same.
c. Both the margin and weight vector will be same for AND gate and OR
gate.
d. None of the weight vector and margin will be same for AND gate and
OR gate.

Correct Answer: b

Detailed Solution:

As we can see, although the weight vectors are not the same, the margin is the same for the AND and OR gates.

______________________________________________________________________________

QUESTION 10:
What will happen to the margin length of a max-margin linear SVM if one of the non-support-vector training examples is removed?

a. Margin will be scaled down by the magnitude of that vector


b. Margin will be scaled up by the magnitude of that vector
c. Margin will be unaltered
d. Cannot be determined from the information provided

Correct Answer: c

Detailed Solution:

In max-margin linear SVM, the separating hyper-planes are determined only by the
training examples which are support vectors. The non-support vector training examples do
not influence the geometry of the separating planes. Thus, the margin, in our case, will be
unaltered.

____________________________________________________________________________

************END*******

Deep Learning
Assignment- Week 4
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 1 = 10
______________________________________________________________________________

QUESTION 1:
A given cost function is of the form J(θ) = θ² − θ + 2. What is the weight update rule for gradient descent optimization at step t + 1? Consider α = 0.01 to be the learning rate.

a. 𝜃𝑡+1 = 𝜃𝑡 − 0.01(2𝜃 − 1)
b. 𝜃𝑡+1 = 𝜃𝑡 + 0.01(2𝜃)
c. 𝜃𝑡+1 = 𝜃𝑡 − (2𝜃 − 1)
d. 𝜃𝑡+1 = 𝜃𝑡 − 0.01(𝜃 − 1)

Correct Answer: a

Detailed Solution:

∂J(θ)/∂θ = 2θ − 1

So the weight update will be θ_{t+1} = θ_t − 0.01(2θ − 1).
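Iterating this update drives θ to the minimizer of J; a short sketch:

```python
def grad(theta):
    # dJ/dtheta for J(theta) = theta**2 - theta + 2
    return 2 * theta - 1

theta, alpha = 0.0, 0.01
for _ in range(2000):
    theta -= alpha * grad(theta)   # theta_{t+1} = theta_t - 0.01*(2*theta_t - 1)

# J has its unique minimum at theta = 0.5
assert abs(theta - 0.5) < 1e-3
```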
______________________________________________________________________________

QUESTION 2:
Can you identify in which of the following graphs gradient descent will not work correctly?

a. First figure
b. Second figure
c. First and second figure
d. Fourth figure
Correct Answer: b

Detailed Solution:

This is a classic example of the saddle point problem of gradient descent. In the second graph, gradient descent may get stuck at the saddle point.
______________________________________________________________________________

QUESTION 3:
From the following two figures can you identify which one corresponds to batch gradient
descent and which one to Stochastic gradient descent?

a. Graph-A: Batch gradient descent, Graph-B: Stochastic gradient descent


b. Graph-B: Batch gradient descent, Graph-A: Stochastic gradient descent
c. Graph-A: Batch gradient descent, Graph-B: Not Stochastic gradient descent
d. Graph-A: Not batch gradient descent, Graph-B: Not Stochastic gradient descent

Correct Answer: a

Detailed Solution:

The graph of cost vs. epochs is quite smooth for batch gradient descent because we average over all the gradients of the training data for a single step. In stochastic gradient descent, the average cost over the epochs fluctuates because we use one example at a time.

______________________________________________________________________________

QUESTION 4:
Suppose for the cost function J(θ) = 0.25θ², shown in the graph below, at which point do you feel the magnitude of the weight update will be greater? θ is plotted along the horizontal axis.

a. Red point (Point 1)


b. Green point (Point 2)
c. Yellow point (Point 3)
d. Red (Point 1) and yellow (Point 3) have same magnitude of weight update

Correct Answer: a

Detailed Solution:

The weight update is directly proportional to the magnitude of the gradient of the cost function. In our case, ∂J(θ)/∂θ = 0.5θ. So the weight update will be larger for higher values of θ.

______________________________________________________________________________

QUESTION 5:
Which logic function can be performed using a 2-layered Neural Network?

a. AND
b. OR
c. XOR
d. All

Correct Answer: d

Detailed Solution:

A two-layer neural network can implement any logic gate (linear or non-linear).
____________________________________________________________________________

QUESTION 6:
Let X and Y be two features used to discriminate between two classes. The feature values and class labels are given below. The minimum number of neuron layers required to design the neural network classifier is:

X    Y    Class
0    2    Class-II
1    2    Class-I
2    2    Class-I
1    3    Class-I
1   −3    Class-II

a. 1
b. 2
c. 4
d. 5
Correct Answer: a.

Detailed Solution:

Plot the feature points: they are linearly separable. Hence a single layer is able to do the classification task.

____________________________________________________________________________

QUESTION 7:
Which among the following options give the range for a logistic function?

a. -1 to 1
b. -1 to 0
c. 0 to 1
d. 0 to infinity

Correct Answer: c

Detailed Solution:

Refer to lectures, specifically the formula for logistic function.

______________________________________________________________________________

QUESTION 8:
The number of weights (including biases) to be learned by a neural network having 3 inputs, 2 classes, and a hidden layer with 5 neurons is:

a. 12
b. 15
c. 25
d. 32
Correct Answer: d

Detailed Solution:

Please refer to lecture note week 4

Weights in the 1st layer: (#inputs + 1 bias) × (#hidden nodes) = (3 + 1) × 5 = 20

Weights in the 2nd layer: (#hidden nodes + 1 bias) × (#classes) = (5 + 1) × 2 = 12

Hence, total weights = 20 + 12 = 32.
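The same count can be computed programmatically (helper name ours):

```python
def count_weights(layer_sizes, bias=True):
    # Sum of (fan_in + optional bias) * fan_out over consecutive layer pairs
    extra = 1 if bias else 0
    return sum((n_in + extra) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 3 inputs -> 5 hidden -> 2 outputs, with bias weights
assert count_weights([3, 5, 2]) == 32
assert count_weights([3, 5, 2], bias=False) == 25
```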



______________________________________________________________________________

QUESTION 9:
For the XNOR function given in the figure below, the activation function of each node is f(x) = 1 if x ≥ 0, and 0 otherwise. Consider X1 = 1 and X2 = 0. What will be the output of the above neural network?

a. 1.5
b. 2
c. 0
d. 1

Correct Answer: c

Detailed Solution:

Output of a1: f(0.5×1 + (−1)×1 + (−1)×0) = f(−0.5) = 0

Output of a2: f(−1.5×1 + 1×1 + 1×0) = f(−0.5) = 0

Output of a3: f(−0.5×1 + 1×0 + 1×0) = f(−0.5) = 0

So, the correct answer is c.
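The forward pass can be replayed in code using the weights read off the solution above (a sketch; node names follow the figure):

```python
def f(x):
    # Threshold activation: 1 if x >= 0 else 0
    return 1 if x >= 0 else 0

def xnor_net(x1, x2):
    a1 = f(0.5 * 1 + (-1) * x1 + (-1) * x2)   # fires only when both inputs are 0
    a2 = f(-1.5 * 1 + 1 * x1 + 1 * x2)        # fires only when both inputs are 1
    a3 = f(-0.5 * 1 + 1 * a1 + 1 * a2)        # OR of a1 and a2
    return a3

assert xnor_net(1, 0) == 0   # the case asked in the question
# Full truth table of XNOR
assert [xnor_net(p, q) for p, q in ((0, 0), (0, 1), (1, 0), (1, 1))] == [1, 0, 0, 1]
```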

____________________________________________________________________________

QUESTION 10:
Which activation function is more prone to vanishing gradient problem?

a. ReLU

b. Tanh

c. sigmoid

d. Threshold

Correct Answer: b

Detailed Solution:

Please refer to the lectures of week 4.

************END*******
Deep Learning
Assignment- Week 5
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 2 = 20

_____________________________________________________________________________

QUESTION 1:
Suppose a fully-connected neural network has a single hidden layer with 30 nodes. The input is
represented by a 3D feature vector and we have a binary classification problem. Calculate the
number of parameters of the network. Consider there are NO bias nodes in the network.

a. 100
b. 120
c. 140
d. 125

Correct Answer: b

Detailed Solution:

Number of parameters = (3 * 30) + (30 * 1) = 120

--------------------------------------------------------------------------------------------------------------------

QUESTION 2:
For a binary classification setting, if the probability of belonging to class= +1 is 0.22, what is the
probability of belonging to class= -1 ?

a. 0
b. 0.22
c. 0.78
d. -0.22
Correct Answer: c
Detailed Solution:

In the binary classification setting we keep a single output node, which denotes the probability p of belonging to class +1. So the probability of belonging to class −1 is (1 − p), since the 2 classes are mutually exclusive.

______________________________________________________________________________

QUESTION 3:
Input to SoftMax activation function is [2,4,6]. What will be the output?

a. [0.11,0.78,0.11]
b. [0.016,0.117, 0.867]
c. [0.045,0.910,0.045]
d. [0.21, 0.58,0.21]

Correct Answer: b

Detailed Solution:
SoftMax: σ(x_j) = e^{x_j} / Σ_{k=1}^{n} e^{x_k}, for j = 1, 2, …, n.

Therefore, σ(2) = e² / (e² + e⁴ + e⁶) ≈ 0.016, and similarly for the other values.
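The computation can be checked with a few lines of Python (illustrative sketch):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]   # no max-subtraction; fine for small inputs
    s = sum(exps)
    return [e / s for e in exps]

out = softmax([2, 4, 6])
assert all(abs(o - e) < 1e-3 for o, e in zip(out, [0.016, 0.117, 0.867]))
assert abs(sum(out) - 1.0) < 1e-9
```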

______________________________________________________________________________

QUESTION 4:
A 3-input neuron has weights 1, 0.5, 2. The transfer function is linear, with the constant of
proportionality being equal to 2. The inputs are 2, 20, 4 respectively. The output will be:
a. 40
b. 20
c. 80
d. 10
Correct Answer: a

Detailed Solution:

To find the output, we multiply each weight by its respective input, add the results, and then multiply the sum by the constant of proportionality of the linear transfer function:

output = 2 × (1×2 + 0.5×20 + 2×4) = 40

______________________________________________________________________

QUESTION 5:
Which one of the following activation functions is NOT analytically differentiable for all real
values of the given input?

a. Sigmoid
b. Tanh
c. ReLU
d. None of the above
Correct Answer: c

Detailed Solution:

ReLU(x) is not differentiable at x = 0, where x is the input to the ReLU layer.

______________________________________________________________________________

QUESTION 6:

Which function does the following perceptron realize? x1 and x2 can take only binary values. h(x)
is the activation function: h(x) = 1 if x > 0, else 0.
a. NAND
b. NOR
c. AND
d. OR
Correct Answer: b

Detailed Solution:

In the above figure, when either i1 or i2 is 1, the output is 0. When both i1 and i2 are 0, the
output is 1. When both i1 and i2 are 1, the output is 0. This is NOR logic.
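The perceptron figure is not reproduced here, so the following sketch uses one hypothetical choice of weights (w1 = w2 = −1, bias = 0.5) that realizes NOR with the given activation h(x):

```python
# Threshold activation from the question: h(x) = 1 if x > 0, else 0.
def h(x):
    return 1 if x > 0 else 0

# Hypothetical NOR weights (the original figure's weights may differ).
def perceptron(i1, i2, w1=-1.0, w2=-1.0, bias=0.5):
    return h(w1 * i1 + w2 * i2 + bias)

for i1 in (0, 1):
    for i2 in (0, 1):
        print(i1, i2, perceptron(i1, i2))  # output is 1 only for (0, 0)
```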

______________________________________________________________________________

QUESTION 7:
In a simple MLP model with 10 neurons in the input layer, 100 neurons in the hidden layer and
1 neuron in the output layer, what are the sizes of the weight matrices between the hidden and
output layers and between the input and hidden layers?
a. [10x1] , [100 X 2]
b. [100x1] , [ 10 X 1]
c. [100 x 10], [10 x 1]
d. [100x 1] , [10 x 100]
Correct Answer: d

Detailed Solution:

The size of the weight matrix between any layer 1 and layer 2 is given by [nodes in layer 1 X
nodes in layer 2]. Hence the hidden-output matrix is [100 x 1] and the input-hidden matrix is [10 x 100].

______________________________________________________________________________

QUESTION 8:

Consider a fully connected neural network with input, one hidden layer, and output layer with
40, 2, 1 nodes respectively in each layer. What is the total number of learnable parameters (no
biases)?

a. 2
b. 82
c. 80
d. 40

Correct Answer: b

Detailed Solution:

The learnable parameters are the weights and biases; here there are no bias nodes, so only
weights count. In a fully connected network every node in one layer is connected to every node
in the next, so the total is (40*2) + (2*1) = 82.

QUESTION 9:
You want to build a 10-class neural network classifier, given a cat image, you want to classify
which of the 10 cat breeds it belongs to. Which among the 4 options would be an appropriate
loss function to use for this task?

a. Cross Entropy Loss


b. MSE Loss
c. SSIM Loss
d. None of the above

Correct Answer: a

Detailed Solution:

Out of the given options, Cross Entropy Loss is well suited for classification problems which is
the end task given in the question.

______________________________________________________________________________

QUESTION 10:

You’d like to train a fully-connected neural network with 5 hidden layers, each with 10 hidden
units. The input is 20-dimensional and the output is a scalar. What is the total number of
trainable parameters in your network? There is no bias.
a. (20+1)*10 + (10+1)*10*4 + (10+1)*1
b. (20)*10 + (10)*10*4 + (10)*1
c. (20)*10 + (10)*10*5 + (10)*1
d. (20+1)*10 + (10+1)*10*5 + (10+1)*1
Correct Answer: b

Detailed Solution:

With no bias terms, the parameter count is the sum of products of consecutive layer sizes: 20*10
weights from the input to the first hidden layer, 10*10 for each of the 4 hidden-to-hidden
connections, and 10*1 from the last hidden layer to the output, i.e. (20)*10 + (10)*10*4 + (10)*1 = 610,
which is option (b).
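As an illustrative check (not part of the original solution), the count can be computed by summing the products of consecutive layer sizes:

```python
# 20 inputs -> five hidden layers of 10 units each -> 1 output, no biases.
layers = [20, 10, 10, 10, 10, 10, 1]
params = sum(n_in * n_out for n_in, n_out in zip(layers, layers[1:]))
print(params)  # 610, i.e. 20*10 + 10*10*4 + 10*1
```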

______________________________________________________________________________

************END*******
NPTEL Online Certification Courses
Indian Institute of Technology Kharagpur

Deep Learning
Assignment- Week 6
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 1 = 10
______________________________________________________________________________

QUESTION 1:
Suppose a neural network has 3 input nodes, a, b, c. There are 2 neurons, X and Y. X = a+ b and
Y = X * c. What is the gradient of Y with respect to a, b and c? Assume, (a, b, c) = (6, -1, -4).
a. (5, -4, -4)
b. (4, 4, -5)
c. (-4, -4, 5)
d. (3, 3, 4)

Correct Answer: c

Detailed Solution:

Y = X·c, so ∂Y/∂c = X = a + b = 5

Y = X·c = (a + b)·c, so ∂Y/∂a = c = −4 and ∂Y/∂b = c = −4
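For illustration, the same gradients can be obtained with hand-coded backpropagation through the two-node graph:

```python
# Computation graph: X = a + b, then Y = X * c.
a, b, c = 6.0, -1.0, -4.0

X = a + b          # forward pass
Y = X * c

dY_dc = X          # local gradient of the multiply node
dY_dX = c
dY_da = dY_dX * 1  # chain rule through X = a + b
dY_db = dY_dX * 1

print(dY_da, dY_db, dY_dc)  # -4.0 -4.0 5.0
```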
______________________________________________________________________________

QUESTION 2:
y = max(a, b) and a > b. What are the values of dy/da and dy/db?

a. 1, 0
b. 0, 1
c. 0, 0
d. 1, 1

Correct Answer: a

Detailed Solution:

y = max(a, b) and a > b.

So y = a, which gives dy/da = 1 and dy/db = 0.

______________________________________________________________________________

QUESTION 3:
PCA reduces the dimension by finding a few________.

a. Hexagonal linear combination


b. Orthogonal linear combinations
c. Octagonal linear combination
d. Pentagonal Linear Combination

Correct Answer: b

Detailed Solution:

Direct from classroom lecture

______________________________________________________________________________

QUESTION 4:
Consider the four sample points below, 𝑋𝑖 ∈ ℝ2 .

We want to represent the data in 1D using PCA. Compute the unit-length principal component
directions of X, and then choose from the options below which one the PCA algorithm would
choose if you request just one principal component.

a. [1/√2 1/√2]𝑇
b. [1/√2 −1/√2]𝑇
c. [−1/√2 1/√2]𝑇
d. [1/√2 1/√2]𝑇

Correct Answer: d

Detailed Solution:

Centering X,

The above matrix is X_c. Now,

X_c^T X_c = [10 6; 6 10]

The eigenvector with eigenvalue 16 is [1/√2 1/√2]^T, and the eigenvector with eigenvalue 4 is
[1/√2 −1/√2]^T. If just one principal component is requested, PCA returns the eigenvector with
the largest eigenvalue, [1/√2 1/√2]^T.
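As an illustrative check (not part of the original solution), the leading eigenvector of this 2x2 matrix can be found with a simple power iteration:

```python
import math

# Power iteration: repeatedly apply A and renormalize; the iterate
# converges to the eigenvector of the largest eigenvalue.
def power_iteration(A, iters=100):
    v = [1.0, 0.0]
    for _ in range(iters):
        w = [A[0][0] * v[0] + A[0][1] * v[1],
             A[1][0] * v[0] + A[1][1] * v[1]]
        norm = math.sqrt(w[0] ** 2 + w[1] ** 2)
        v = [w[0] / norm, w[1] / norm]
    return v

A = [[10, 6], [6, 10]]  # X_c^T X_c from the solution above
v = power_iteration(A)
print(v)  # approximately [0.7071, 0.7071], i.e. [1/sqrt(2), 1/sqrt(2)]^T
```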

QUESTION 5:

Which of the following is FALSE about PCA and Autoencoders?

a. Both PCA and Autoencoders can be used for dimensionality reduction


b. PCA works well with non-linear data but Autoencoders are best suited for linear
data
c. Output of both PCA and Autoencoders is lossy
d. None of the above

Correct Answer: b

Detailed Solution:

PCA is a linear method, whereas autoencoders with non-linear activations can model non-linear
structure in the data, so statement (b) reverses the truth. The other statements are true.

____________________________________________________________________________

QUESTION 6:
What is true regarding backpropagation rule?
a. It is a feedback neural network
b. Gradient of the final layer of weights being calculated first and the gradient of the first
layer of weights being calculated last
c. Hidden layers are not important, only meant for supporting input and output layers
d. None of the mentioned
Correct Answer: b

Detailed Solution:

In backpropagation, gradients are computed in the reverse order of the forward pass: the
gradient of the final layer of weights is calculated first and the gradient of the first layer of
weights is calculated last.

_____________________________________________________________________________

QUESTION 7:
Which of the following is true for PCA? Tick all the options that are correct.

a. Rotates the axes to lie along the principal components


b. Is calculated from the covariance matrix
c. Removes some information from the data
d. Eigenvectors describe the length of the principal components

Correct Answer: a,b,c

Detailed Solution:

PCA rotates the axes to lie along the principal components, which are calculated from the
covariance matrix, and discarding low-variance components removes some information from the
data. Option (d) is false: it is the eigenvalues, not the eigenvectors, that describe the lengths
(variances) along the principal components.

_________________________________________________________________________

QUESTION 8:
A single hidden and no-bias autoencoder has 100 input neurons and 10 hidden neurons. What
will be the number of parameters associated with this autoencoder?

a. 1000
b. 2000
c. 2110
d. 1010

Correct Answer: b

Detailed Solution:

For a single-hidden-layer autoencoder with no biases:

Input neurons = 100, hidden neurons = 10, so output neurons = 100 (the output reconstructs the input).

Total number of parameters = 100*10 + 10*100 = 2000

______________________________________________________________________________

QUESTION 9:
Which of the following two vectors can form the first two principal components?

a. {2; 3; 1} and {3; 1; −9}


b. {2; 4; 1} and {−2; 1; −8}
c. {2; 3; 1} and {−3; 1; −9}
d. {2; 3; −1} and {3; 1; −9}

Correct Answer: a

Detailed Solution:

Only in option (a) are the vectors orthogonal: {2; 3; 1} · {3; 1; −9} = 6 + 3 − 9 = 0.
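Since principal components must be mutually orthogonal, each option can be checked (illustratively) by computing the dot product of its two vectors:

```python
# Dot product of two vectors; principal components must give 0.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

options = {
    "a": ([2, 3, 1], [3, 1, -9]),
    "b": ([2, 4, 1], [-2, 1, -8]),
    "c": ([2, 3, 1], [-3, 1, -9]),
    "d": ([2, 3, -1], [3, 1, -9]),
}
for name, (u, v) in options.items():
    print(name, dot(u, v))  # only option a gives 0
```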

____________________________________________________________________________

QUESTION 10:
Lets say vectors 𝑎⃗ = {2; 4} and 𝑏⃗⃗ = {𝑛; 1} forms the first two principle components after
applying PCA. Under such circumstances, which among the following can be a possible value
of n?
a. 2
b. -2
c. 0
d. 1

Correct Answer: b

Detailed Solution:

Only option (b) makes the two vectors orthogonal: 2n + 4 = 0 gives n = −2.



______________________________________________________________________________

************END*******
NPTEL Online Certification Courses
Indian Institute of Technology Kharagpur

Deep Learning
Assignment- Week 7
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 1 = 10
______________________________________________________________________________

QUESTION 1:
Select the correct option about sparse autoencoders.

Statement 1: Sparse autoencoders introduce an information bottleneck by reducing the number
of nodes at the hidden layers

Statement 2: The idea is to encourage the network to learn an encoding and decoding which
relies on activating only a small number of neurons

a. Both the statements are true


b. Statement 1 is true, but Statement 2 is false
c. Statement 1 is false, but statement 2 is true
d. Both the statements are false

Correct Answer: c

Detailed Solution:

Sparse autoencoders introduce an information bottleneck without requiring a reduction in the
number of nodes at the hidden layers, so Statement 1 is false. They encourage the network to
learn an encoding and decoding which relies on activating only a small number of neurons, so
Statement 2 is true.

______________________________________________________________________________

QUESTION 2:
Select the correct option about denoising autoencoders.

Statement A: The loss is between the original input and the reconstruction from a noisy version
of the input

Statement B: Denoising autoencoders can be used as a tool for feature extraction.

a. Both the statements are false


b. Statement A is false but Statement B is true

c. Statement A is true but Statement B is false


d. Both the statements are true

Correct Answer: d

Detailed Solution:

For a denoising autoencoder, both Statement A and Statement B are true. Thus option (d) is correct.

______________________________________________________________________________

QUESTION 3:
Which of the following autoencoder methods uses corrupted versions of the input?

a. Overcomplete design
b. Undercomplete Design
c. Sparse Design
d. Denoising Design

Correct Answer: d

Detailed Solution:

Refer to classroom lecture.

______________________________________________________________________________

QUESTION 4:
Which of the following autoencoder methods uses a hidden layer with fewer units than the
input layer?

a. Overcomplete design
b. Undercomplete Design
c. Sparse Design
d. Denoising Design

Correct Answer: b

Detailed Solution:

Refer to classroom lecture.



QUESTION 5:

Which of the following is false about autoencoder?

a. Autoencoders possesses generalization capabilities


b. Autoencoders are best suited for image captioning task
c. Its objective is to minimize the reconstruction loss so that output is similar to
input
d. It compresses the input into a latent space representation and then reconstruct
the output from it

Correct Answer: b

Detailed Solution:

Except option (b), all the other options are true about autoencoders.

____________________________________________________________________________

QUESTION 6:
Find the value of 𝑑(𝑡 − 34) ∗ 𝑥(𝑡 + 56); 𝑑(𝑡) being the delta function and * being the
convolution operation.

a. 𝑥(𝑡 + 56)
b. 𝑥(𝑡 + 32)
c. 𝑥(𝑡 + 22)
d. 𝑥(𝑡 − 56)

Correct Answer: c

Detailed Solution:

Convolving a function with a shifted delta shifts the function by the same amount:
δ(t − 34) ∗ x(t + 56) = x(t + 56 − 34) = x(t + 22).
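The shifting property has a simple discrete analogue, sketched here for illustration: convolving a sequence with a delta shifted by k moves the sequence by k samples.

```python
# Full discrete convolution of two finite sequences.
def convolve(x, h):
    y = [0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

x = [1, 2, 3]
delta_shift2 = [0, 0, 1]          # delta[n - 2]
print(convolve(x, delta_shift2))  # [0, 0, 1, 2, 3]: x delayed by 2
```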

_____________________________________________________________________________

QUESTION 7:
Impulse response is the output of ________________system due to impulse input applied at
time=0. Fill in the blanks from the options below.

a. Linear

b. Time Varying
c. Time Invariant
d. Linear And Time Invariant

Correct Answer: d

Detailed Solution:

The impulse response is the output of a linear and time-invariant (LTI) system due to an impulse
input applied at time t = 0 (or n = 0). The behaviour of an LTI system is characterized by its
impulse response.

_________________________________________________________________________

QUESTION 8:
Convolution of an input with the system impulse function gives the output of a___ system. Fill
in the blanks.

a. Linear Time Invariant


b. Non-linear system
c. Time Invariant system
d. None of the above

Correct Answer: a

Detailed Solution:

Direct from classroom lecture

______________________________________________________________________________

QUESTION 9:
Given the image below where, Row 1: Original Input, Row 2: Noisy input, Row 3: Reconstructed
output. Choose one of the following variants of autoencoder that is most suited to get Row 3
from Row 2.

a. Stacked autoencoder
b. Sparse autoencoder
c. Denoising autoencoder
d. None of the above

Correct Answer: c

Detailed Solution:

Reconstruction of original noise-free data from noisy input is the tasks of denoising
autoencoder

____________________________________________________________________________

QUESTION 10:
Which of the following is true for Contractive Autoencoders?

a. penalizing instances where a small change in the input leads to a large change in
the encoding space
b. penalizing instances where a large change in the input leads to a small change in
the encoding space
c. penalizing instances where a small change in the input leads to a small change in
the encoding space
d. None of the above

Correct Answer: a

Detailed Solution:

Direct from definition of Contractive autoencoders

______________________________________________________________________________

************END*******
NPTEL Online Certification Courses
Indian Institute of Technology Kharagpur

Deep Learning
Assignment- Week 8
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 1 = 10
______________________________________________________________________________

QUESTION 1:
Which of the following is false about CNN?

a. Output should be flattened before feeding it to a fully connected layer
b. There can be only one fully connected layer in a CNN
c. We can use as many convolutional layers as needed in a CNN
d. None of the above
Correct Answer: b

Detailed Solution:

A CNN can have more than one fully connected layer, so option (b) is false; the other statements are true.


______________________________________________________________________________

QUESTION 2:
The input image has been converted into a matrix of size 64 X 64 and a kernel/filter of size 5x5
with a stride of 1 and no padding. What will be the size of the convoluted matrix?

a. 5x5
b. 59x59
c. 64x64
d. 60x60

Correct Answer: d

Detailed Solution:

The size of the convoluted matrix is given by CxC where C=((I-F+2P)/S)+1, where C is the
size of the Convoluted matrix, I is the size of the input matrix, F the size of the filter matrix
and P the padding applied to the input matrix. Here P=0, I=64, F=5 and S=1. Therefore,
the answer is 60x60.
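The output-size formula can be wrapped in a small helper for illustration (the function name is ours, not from the lecture):

```python
# Convolution output size: C = (I - F + 2P) / S + 1.
def conv_out(i, f, p, s):
    return (i - f + 2 * p) // s + 1

print(conv_out(64, 5, 0, 1))  # 60 (this question)
print(conv_out(4, 3, 0, 1))   # 2  (valid padding, as in Question 3)
```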
______________________________________________________________________________

QUESTION 3:

Filter size of 3x3 is convolved with matrix of size 4x4 (stride=1). What will be the size of output
matrix if valid padding is applied:

a. 4x4
b. 3x3
c. 2x2
d. 1x1

Correct Answer: c

Detailed Solution:

Valid padding means no padding is applied (P = 0). The output matrix after convolution has
dimension ((n − f + 2P)/S + 1) x ((n − f + 2P)/S + 1) = ((4 − 3 + 0)/1 + 1) x ((4 − 3 + 0)/1 + 1) = 2 x 2.

______________________________________________________________________________

QUESTION 4:
Let us consider a Convolutional Neural Network having three different convolutional layers in
its architecture as:

Layer-1: Filter Size – 3 X 3, Number of Filters – 10, Stride – 1, Padding – 0

Layer-2: Filter Size – 5 X 5, Number of Filters – 20, Stride – 2, Padding – 0

Layer-3: Filter Size – 5 X5 , Number of Filters – 40, Stride – 2, Padding – 0

Layer 3 of the above network is followed by a fully connected layer. If we give a 3-D
image input of dimension 39 X 39 to the network, then which of the following is the input
dimension of the fully connected layer.

a. 1960
b. 2200
c. 4563
d. 13690

Correct Answer: a

Detailed Solution:

The input image of dimension 39 X 39 X 3 convolves with 10 filters of size 3 X 3, stride 1 and no
padding. After these operations, we get an output of 37 X 37 X 10.

Output of layer 2 would be 17 x 17 x 20.

Output of layer 3 would be 7 x 7 x 40. Flattening this gives 7 * 7 * 40 = 1960.
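The layer-by-layer trace above can be sketched in Python (illustrative only; the helper name conv_out is ours):

```python
# Spatial size of a square convolution output with no padding.
def conv_out(size, f, stride, p=0):
    return (size - f + 2 * p) // stride + 1

size, depth = 39, 3
# (filter size, number of filters, stride) for the three layers.
for f, n_filters, stride in [(3, 10, 1), (5, 20, 2), (5, 40, 2)]:
    size = conv_out(size, f, stride)
    depth = n_filters
print(size, depth, size * size * depth)  # 7 40 1960
```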

______________________________________________________________________________

QUESTION 5:
Suppose you have 40 convolutional kernels of size 3 x 3 with no padding and stride 1 in the first
layer of a convolutional neural network. You pass an input of dimension 1024x1024x3 through
this layer. What are the dimensions of the data which the next layer will receive?

a. 1020x1020x40
b. 1022x1022x40
c. 1021x1021x40
d. 1022x1022x3

Correct Answer: b

Detailed Solution:

The layer accepts a volume of size W1 x H1 x D1; in our case, 1024 x 1024 x 3.

It requires four hyperparameters: number of filters K = 40, their spatial extent F = 3, the
stride S = 1, and the amount of padding P = 0.

It produces a volume of size W2 x H2 x D2, where W2 = (W1 − F + 2P)/S + 1 = (1024 − 3)/1 + 1 = 1022,
H2 = (H1 − F + 2P)/S + 1 = (1024 − 3)/1 + 1 = 1022 (width and height are computed equally by
symmetry), and D2 = K = 40.

____________________________________________________________________________

QUESTION 6:
Consider a CNN model which aims at classifying an image as either a rose, or a marigold, or a lily
or an orchid (consider the test image can have only 1 of the classes at a time). The last (fully-
connected) layer of the CNN outputs a vector of logits, L, that is passed through a ____
activation that transforms the logits into probabilities, P. These probabilities are the model
predictions for each of the 4 classes. Fill in the blanks with the appropriate option.

a. Leaky ReLU
b. Tanh
c. ReLU
d. Softmax

Correct Answer: d

Detailed Solution:

Softmax works best if there is one true class per example, because it outputs a probability
vector whose entries sum to 1.

____________________________________________________________________________

QUESTION 7:
Suppose your input is a 300 by 300 color (RGB) image, and you use a convolutional layer with
100 filters that are each 5x5. How many parameters does this hidden layer have (without bias)

a. 2501
b. 2600
c. 7500
d. 7600

Correct Answer: c

Detailed Solution:

As we have an RGB image, each filter is 3-D with dimension 5 * 5 * 3 = 75.

We have 100 such filters and there is no bias, so the total number of parameters = 5 * 5 * 3 * 100 = 7500.
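For illustration, this count generalizes to any bias-free convolutional layer:

```python
# Each filter spans the full input depth, so (without bias)
# params = k_h * k_w * in_channels * num_filters.
def conv_params(kernel, in_channels, num_filters):
    return kernel * kernel * in_channels * num_filters

print(conv_params(5, 3, 100))  # 7500
```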

______________________________________________________________________________

QUESTION 8:
Which of the following activation functions can lead to vanishing gradients?

a. ReLU
b. Sigmoid
c. Leaky ReLU
d. None of the above

Correct Answer: b

Detailed Solution:

For sigmoid activation with large-magnitude inputs, even a large change in the input causes only
a small change in the output, because the function saturates; hence the derivative becomes
small. When more and more layers use such activations, the gradient of the loss function
becomes very small, making the network difficult to train.
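The shrinking effect can be illustrated numerically: the sigmoid derivative is at most 0.25, so stacking n sigmoid layers multiplies the backpropagated gradient by at most 0.25^n.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)  # maximum value 0.25, attained at x = 0

print(sigmoid_grad(0))  # 0.25
print(0.25 ** 10)       # the per-layer cap after 10 layers: nearly zero
```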

___________________________________________________________________________

QUESTION 9:
Statement 1: Residual networks can be a solution for vanishing gradient problem

Statement 2: Residual networks provide residual connections straight to earlier layers

Statement 3: Residual networks can never be a solution for vanishing gradient problem

Which of the following option is correct?

a. Statement 2 is correct
b. Statement 3 is correct
c. Both Statement 1 and Statement 2 are correct
d. Both Statement 2 and Statement 3 are correct

Correct Answer: c

Detailed Solution:

Residual networks can be a solution to the vanishing gradient problem, as they provide residual
connections straight to earlier layers. These residual connections don't go through activation
functions that "squash" the derivatives, resulting in a higher overall derivative of the block.

____________________________________________________________________________

QUESTION 10:
Input to SoftMax activation function is [0.5,0.5,1]. What will be the output?

a. [0.28,0.28,0.44]
b. [0.022,0.956, 0.022]
c. [0.045,0.910,0.045]
d. [0.42, 0.42,0.16]

Correct Answer: a

Detailed Solution:

SoftMax: σ(x_j) = e^(x_j) / Σ_{k=1..n} e^(x_k), for j = 1, 2, …, n

Therefore, σ(0.5) = e^0.5 / (e^0.5 + e^0.5 + e^1) ≈ 0.28, and similarly for the other values.

______________________________________________________________________

______________________________________________________________________________

************END*******
NPTEL Online Certification Courses
Indian Institute of Technology Kharagpur

Deep Learning
Assignment- Week 9
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 1 = 10
______________________________________________________________________________

QUESTION 1:
What can be a possible consequence of choosing a very small learning rate?
a. Slow convergence
b. Overshooting minima
c. Oscillations around the minima
d. All of the above

Correct Answer: a
Detailed Solution:
Choosing a very small learning rate can lead to slower convergence and thus option (a) is
correct.
______________________________________________________________________________

QUESTION 2:
The following is the equation of update vector for momentum optimizer. Which of the
following is true for 𝛾?
𝑉𝑡 = 𝛾𝑉𝑡−1 + 𝜂∇𝜃 𝐽(𝜃)
a. 𝛾 is the momentum term which indicates acceleration
b. 𝛾 is the step size
c. 𝛾 is the first order moment
d. 𝛾 is the second order moment

Correct Answer: a
Detailed Solution:
A fraction of the update vector of the past time step is added to the current update vector. 𝛾 is
that fraction which indicates how much acceleration you want and its value lies between 0 and 1.
______________________________________________________________________________

QUESTION 3:
Which of the following is true about momentum optimizer?

a. It helps accelerating Stochastic Gradient Descent in right direction


b. It helps prevent unwanted oscillations
c. It helps to know the direction of the next step with knowledge of the previous step
d. All of the above

Correct Answer: d
Detailed Solution:
Option (a), (b) and (c) all are true for momentum optimiser. Thus, option (d) is correct.
______________________________________________________________________________

QUESTION 4:
Let J(θ) be the cost function, and let the gradient descent update rule for θ_i be

θ_{i+1} = θ_i + ∇θ_i

What is the correct expression for ∇θ_i? α is the learning rate.

a. −α dJ(θ_i)/dθ_i
b. α dJ(θ_i)/dθ_i
c. −dJ(θ_i)/dθ_{i+1}
d. dJ(θ_i)/dθ_i

Correct Answer: a
Detailed Solution:
The gradient descent update rule for θ_i is

θ_{i+1} = θ_i − α dJ(θ_i)/dθ_i, where α is the learning rate,

so ∇θ_i = −α dJ(θ_i)/dθ_i.
______________________________________________________________________________

QUESTION 5:
A given cost function is of the form J(θ) = 6θ² − 6θ + 6. What is the weight update rule for
gradient descent optimization at step t+1? Consider α to be the learning rate.

a. 𝜃𝑡+1 = 𝜃𝑡 − 6𝛼(2𝜃 − 1)
b. 𝜃𝑡+1 = 𝜃𝑡 + 6𝛼(2𝜃)
c. 𝜃𝑡+1 = 𝜃𝑡 − 𝛼(12𝜃 − 6 + 6)

d. 𝜃𝑡+1 = 𝜃𝑡 − 6𝛼(2𝜃 + 1)

Correct Answer: a
Detailed Solution:

∂J(θ)/∂θ = 12θ − 6

So the weight update will be

θ_{t+1} = θ_t − α(12θ − 6) = θ_t − 6α(2θ − 1)
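Running this update (with an illustrative learning rate, not one given in the question) converges to the minimum of J at θ = 0.5:

```python
# Gradient descent on J(theta) = 6*theta^2 - 6*theta + 6
# using the update theta <- theta - 6*alpha*(2*theta - 1).
alpha, theta = 0.05, 0.0
for _ in range(200):
    theta = theta - 6 * alpha * (2 * theta - 1)
print(round(theta, 4))  # 0.5, the minimizer of J
```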
______________________________________________________________________________

QUESTION 6:
If the first few iterations of gradient descent cause the function f(θ0,θ1) to increase rather than
decrease, then what could be the most likely cause for this?

a. we have set the learning rate to too large a value


b. we have set the learning rate to zero
c. we have set the learning rate to a very small value
d. learning rate is gradually decreased by a constant value after every epoch

Correct Answer: a
Detailed Solution:
If the learning rate were small enough, gradient descent would take a small downhill step and
decrease f(θ0,θ1) at least a little. If gradient descent instead increases the objective value,
the learning rate is too high.
______________________________________________________________________________

QUESTION 7:
For a function f(θ0,θ1), if θ0 and θ1 are initialized at a global minimum, then what should be the
values of θ0 and θ1 after a single iteration of gradient descent?

a. θ0 and θ1 will update as per gradient descent rule


b. θ0 and θ1 will remain same
c. Depends on the values of θ0 and θ1
d. Depends on the learning rate

Correct Answer: b
Detailed Solution:

At a minimum (global or local), the derivative (gradient) is zero, so gradient descent will not
change the parameters.
______________________________________________________________________________

QUESTION 8:
What can be one of the practical problems of exploding gradient?
a. Too large update of weight values leading to unstable network
b. Too small update of weight values inhibiting the network to learn
c. Too large update of weight values leading to faster convergence
d. Too small update of weight values leading to slower convergence

Correct Answer: a
Detailed Solution:
Exploding gradients are a problem where large error gradients accumulate and result in very
large updates to neural network model weights during training. This has the effect of your model
being unstable and unable to learn from your training data.
______________________________________________________________________________

QUESTION 9:
What are the steps for using a gradient descent algorithm?

1. Calculate error between the actual value and the predicted value
2. Update the weights and biases using gradient descent formula
3. Pass an input through the network and get values from output layer
4. Initialize weights and biases of the network with random values
5. Calculate gradient value corresponding to each weight and bias

a. 1, 2, 3, 4, 5
b. 5, 4, 3, 2, 1
c. 3, 2, 1, 5, 4
d. 4, 3, 1, 5, 2

Correct Answer: d
Detailed Solution:
Initialize random weights, then pass input instances through the network, calculate the error at
the output layer, and back-propagate the error through the preceding layers. Then update the
neuron weights using the learning rate and the gradient of the error. Please refer to the lectures of week 4.
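The step order (4, 3, 1, 5, 2) can be sketched for a single linear neuron with a squared-error loss; this is an illustrative toy, not the lecture's code:

```python
import random

random.seed(0)
x, target = [1.0, 2.0], 1.0
w = [random.uniform(-1, 1) for _ in x]     # step 4: random initialization
b = random.uniform(-1, 1)
lr = 0.1

for _ in range(100):
    y = sum(wi * xi for wi, xi in zip(w, x)) + b      # step 3: forward pass
    error = y - target                                # step 1: error vs. actual
    grad_w = [2 * error * xi for xi in x]             # step 5: gradients (MSE)
    grad_b = 2 * error
    w = [wi - lr * gi for wi, gi in zip(w, grad_w)]   # step 2: update
    b -= lr * grad_b

pred = sum(wi * xi for wi, xi in zip(w, x)) + b
print(round(pred, 4))  # converges to the target, 1.0
```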
______________________________________________________________________________

QUESTION 10:
You run gradient descent for 15 iterations with learning rate 𝜂 = 0.3 and compute error after
each iteration. You find that the value of error decreases very slowly. Based on this, which of
the following conclusions seems most plausible?

a. Rather than using the current value of η, use a larger value of η
b. Rather than using the current value of η, use a smaller value of η
c. Keep 𝜂 = 0.3
d. None of the above

Correct Answer: a
Detailed Solution:
Error rate is decreasing very slowly. Therefore increasing the learning rate is a most plausible
solution.
______________________________________________________________________________

______________________________________________________________________________

************END*******
NPTEL Online Certification Courses
Indian Institute of Technology Kharagpur

Deep Learning
Assignment- Week 10
TYPE OF QUESTION: MCQ/MSQ
Number of questions: 10 Total mark: 10 X 1 = 10
______________________________________________________________________________

QUESTION 1:

What is not a reason for using batch normalization?

a. Prevent overfitting
b. Faster convergence
c. Faster inference time
d. Prevent Co-variant shift

Correct Answer: c

Detailed Solution:
Inference time does not become faster due to batch normalization; the normalization adds
computation, so inference time increases.
____________________________________________________________________________

QUESTION 2:
A neural network has 3 neurons in a hidden layer. The activations of the neurons for three
batches are [1, 2, 3]^T, [0, 2, 5]^T and [6, 9, 2]^T respectively. What will be the value of the
mean if we use batch normalization in this layer?

a. [2.33, 4.33, 3.33]^T
b. [2.00, 2.33, 5.66]^T
c. [1.00, 1.00, 1.00]^T
d. [0.00, 0.00, 0.00]^T

Correct Answer: a

Detailed Solution:
(1/3) × ([1, 2, 3]^T + [0, 2, 5]^T + [6, 9, 2]^T) = [2.33, 4.33, 3.33]^T
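The per-neuron mean over the batch, the first step of batch normalization, can be sketched as:

```python
# Three samples, each with 3 neuron activations.
batch = [[1, 2, 3], [0, 2, 5], [6, 9, 2]]

n = len(batch)
mean = [sum(sample[j] for sample in batch) / n for j in range(3)]
print([round(m, 2) for m in mean])  # [2.33, 4.33, 3.33]
```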
______________________________________________________________________________

QUESTION 3:
How can we prevent underfitting?

a. Increase the number of data samples


b. Increase the number of features
c. Decrease the number of features
d. Decrease the number of data samples

Correct Answer: b

Detailed Solution:
Underfitting happens when the features are not expressive enough to capture the data
distribution. We need to increase the number of features so the data can be fitted well.
______________________________________________________________________________

QUESTION 4:
How do we generally calculate mean and variance during testing?

a. Batch normalization is not required during testing


b. Mean and variance based on test image
c. Estimated mean and variance statistics during training
d. None of the above

Correct Answer: c

Detailed Solution:
We generally calculate batch mean and variance statistics during training and use the estimated
batch mean and variance during testing.
______________________________________________________________________________

QUESTION 5:
Which one of the following is not an advantage of dropout?

a. Regularization
b. Prevent Overfitting
c. Improve Accuracy
d. Reduce computational cost during testing

Correct Answer: d

Detailed Solution:
Dropout randomly zeroes out some features during training, but while testing we don't zero out
any feature. So there is no reduction of computational cost during testing.
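An illustrative sketch of (inverted) dropout makes the train/test asymmetry explicit; the helper names are ours:

```python
import random

def dropout_train(activations, p=0.5, seed=0):
    # Inverted dropout: zero each activation with probability p and
    # rescale survivors by 1/(1-p), so test time needs no rescaling.
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else a / (1 - p) for a in activations]

def dropout_test(activations):
    return list(activations)  # identity: every unit runs at test time

acts = [0.2, 0.9, 0.4, 0.7]
print(dropout_train(acts))  # some entries zeroed, the rest scaled by 2
print(dropout_test(acts))   # unchanged: [0.2, 0.9, 0.4, 0.7]
```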
______________________________________________________________________________

QUESTION 6:
What is the main advantage of layer normalization over batch normalization?

a. Faster convergence
b. Lesser computation
c. Useful in recurrent neural network
d. None of these

Correct Answer: c

Detailed Solution:
See the lectures/lecture materials.
______________________________________________________________________________

QUESTION 7:
While training a neural network for image recognition task, we plot the graph of training error
and validation error. Which is the best for early stopping?

a. A
b. B
c. C
d. D

Correct Answer: c

Detailed Solution:
Minimum validation point is the best for early stopping.
______________________________________________________________________________

QUESTION 8:
Which among the following is NOT a data augmentation technique?

a. Random horizontal and vertical flip of image


b. Random shuffle all the pixels of an image
c. Random color jittering
d. All the above are data augmentation techniques

Correct Answer: b

Detailed Solution:
Randomly shuffling all the pixels of an image will destroy its structure, and the neural network
will be unable to learn anything. So, it is not a data augmentation technique.
______________________________________________________________________________

QUESTION 9:
Which of the following is true about model capacity (where model capacity means the ability of
neural network to approximate complex functions)?

a. As number of hidden layers increase, model capacity increases


b. As dropout ratio increases, model capacity increases
c. As learning rate increases, model capacity increases
d. None of these

Correct Answer: a

Detailed Solution:

Dropout and learning rate have nothing to do with model capacity. Increasing the number of
hidden layers increases the number of learnable parameters; therefore, model capacity increases.
______________________________________________________________________________

QUESTION 10:
Batch Normalization is helpful because

a. It normalizes all the input before sending it to the next layer


b. It returns back the normalized mean and standard deviation of weights
c. It is a very efficient back-propagation technique
d. None of these

Correct Answer: a

Detailed Solution:
Batch normalization layer normalizes the input.

______________________________________________________________________________

______________________________________________________________________________

************END*******
