Data Science Interview Questions 30 Days 1686062665

30 Days of Interview
Preparation
Q1. What is the difference between AI, Data Science, ML, and DL?
Ans 1 :
Artificial Intelligence
: AI is purely math and scientific exercise, but when it became computational, it
started to solve human problems formalized into a subset of computer science. Artificial intelligence
has changed the original computational statistics paradigm to the modern idea that machines could
mimic actual human capabilities, such as decision making and performing more “human” tasks.
Modern AI into two categories
1. General AI - Planning, decision making, identifying objects, recognizing sounds, social &
business transactions
2. Applied AI - driverless/ Autonomous car or machine smartly trade stocks
Machine Learning: Instead of engineers “teaching” or programming computers to have what they
need
to carry out tasks, that perhaps computers could teach themselves – learn something without being
explicitly programmed to do so. ML is a form of AI where based on more data, and they can change
actions and response, which will make more efficient, adaptable and scalable. e.g., navigation apps
and
recommendation engines. Classified into:-
1. Supervised
2. Unsupervised
3. Reinforcement learning
Data Science: Data science has many tools, techniques, and algorithms called from these fields, plus
others –to handle big data
The goal of data science, somewhat similar to machine learning, is to make accurate predictions and
to automate and perform transactions in real-time, such as purchasing internet traffic or automatically
generating content.
Page 2
Data science relies less on math and coding and more on data and building new systems to process the
data. Relying on the fields of data integration, distributed architecture, automated machine learning, data
visualization, data engineering, and automated data-driven decisions, data science can cover an entire
spectrum of data processing, not only the algorithms or statistics related to data.
Deep Learning: It is a technique for implementing ML.

ML provides the desired output from a given input, but DL reads the input and applies it to another data. In
ML, we can easily classify the flower based upon the features. Suppose you want a machine to look at an
image and determine what it represents to the human eye, whether a face, flower, landscape, truck,
building, etc.
Machine learning is not sufficient for this task because machine learning can only produce an output from
a data set – whether according to a known algorithm or based on the inherent structure of the data. You
might be able to use machine learning to determine whether an image was of an “X” – a flower, say – and
it would learn and get more accurate. But that output is binary (yes/no) and is dependent on the
algorithm, not the data. In the image recognition case, the outcome is not binary and not dependent on
the algorithm.
The neural network performs MICRO calculations with computational on many layers. Neural networks
also support weighting data for ‘confidence. These results in a probabilistic system, vs. deterministic, and
can handle tasks that we think of as requiring more ‘human-like’ judgment.
Q2. What is the difference between Supervised learning, Unsupervised learning and
Reinforcement learning?
Ans 2 :
Machine Learning
Machine learning is the scientific study of algorithms and statistical models that computer systems use to
effectively perform a specific task without using explicit instructions, relying on patterns and inference
instead.
Building a model by learning the patterns of historical data with some relationship between data to make
a data-driven prediction.
Types of Machine Learning

• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Supervised learning
In a supervised learning model, the algorithm learns on a labeled dataset, to generate reasonable
predictions for the response to new data. (Forecasting outcome of new data)
• Regression
• Classification
Page 3
Unsupervised learning
An unsupervised model, in contrast, provides unlabelled data that the algorithm tries to make sense of by
extracting features, co-occurrence and underlying patterns on its own. We use unsupervised learning for
• Clustering
• Anomaly detection
• Association
• Autoencoders
Reinforcement Learning
Reinforcement learning is less supervised and depends on the learning agent in determining the output
solutions by arriving at different possible ways to achieve the best possible solution.
Q3. Describe the general architecture of Machine learning.
Business understanding: Understand the give use case, and also, it's good to know more about the
domain for which the use cases are built.
Data Acquisition and Understanding: Data gathering from different sources and understanding the
data. Cleaning the data, handling the missing data if any, data wrangling, and EDA( Exploratory data
analysis).
Page 4
Modeling: Feature Engineering - scaling the data, feature selection - not all features are important. We
use the backward elimination method, correlation factors, PCA and domain knowledge to select the
features.
Model Training based on trial and error method or by experience, we select the algorithm and train with
the selected features.
Model evaluation Accuracy of the model , confusion matrix and cross-validation.
If accuracy is not high, to achieve higher accuracy, we tune the model...either by changing the algorithm
used or by feature selection or by gathering more data, etc.
Deployment - Once the model has good accuracy, we deploy the model either in the cloud or Rasberry
py or any other place. Once we deploy, we monitor the performance of the model.if its good...we go live
with the model or reiterate the all process until our model performance is good.
It's not done yet!!!
What if, after a few days, our model performs badly because of new data. In that case, we do all the
process again by collecting new data and redeploy the model.
Q4. What is Linear Regression?
Ans 4:
Linear Regression tends to establish a relationship between a dependent variable(Y) and one or more
independent variable(X) by finding the best fit of the straight line.
The equation for the Linear model is Y = mX+c, where m is the slope and c is the intercept
In the above diagram, the blue dots we see are the distribution of 'y' w.r.t 'x.' There is no straight line that
runs through all the data points. So, the objective here is to fit the best fit of a straight line that will try to
minimize the error between the expected and actual value.
Page 5
Q5. OLS Stats Model (Ordinary Least Square)
Ans 5:
OLS is a stats model, which will help us in identifying the more significant features that can has an
influence on the output. OLS model in python is executed as:
lm = smf.ols(formula = 'Sales ~ am+constant', data = data).fit() lm.conf_int() lm.summary()
And we get the output as below,
The higher the t-value for the feature, the more significant the feature is to the output variable. And
also, the p-value plays a rule in rejecting the Null hypothesis(Null hypothesis stating the features has zero
significance on the target variable.). If the p-value is less than 0.05(95% confidence interval) for a
feature, then we can consider the feature to be significant.
Q6. What is L1 Regularization (L1 = lasso) ?
Ans 6:
The main objective of creating a model(training data) is making sure it fits the data properly and reduce
the loss. Sometimes the model that is trained which will fit the data but it may fail and give a poor
performance during analyzing of data (test data). This leads to overfitting. Regularization came to
overcome overfitting.
Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds “Absolute value of
magnitude” of coefficient, as penalty term to the loss function.
Page 6
Lasso shrinks the less important feature’s coefficient to zero; thus, removing some feature altogether. So,
this works well for feature selection in case we have a huge number of features.
Methods like Cross-validation, Stepwise Regression are there to handle overfitting and perform feature
selection work well with a small set of features. These techniques are good when we are dealing with a
large set of features.
Along with shrinking coefficients, the lasso performs feature selection, as well. (Remember the
‘selection‘ in the lasso full-form?) Because some of the coefficients become exactly zero, which is
equivalent to the particular feature being excluded from the model.
.Q7L2 Regularization(L2 =Ridge Regression)
Ans 7:
Overfitting happens when the model learns signal as well as noise in the training data and wouldn’t
perform well on new/unseen data on which model wasn’t trained on.
To avoid overfitting your model on training data like , cross-validation samplingreducing the
, f, feetca.tures, pruningregularization
o number
.So to avoid overfitting, we perform Regularization
Page 7
The Regression model that uses L2 regularization is called Ridge Regression. The
formula for Ridge Regression:-
Regularization adds the penalty as model complexity increases. The regularization parameter
(lambda) penalizes all the parameters except intercept so that the model generalizes the data and
won’t overfit.
Ridge regression adds “squared magnitude of the coefficient" as penalty term to the loss
function. Here the box part in the above image represents the L2 regularization element/term.
Lambda is a hyperparameter.
Page 8
Iiffllaammbbddaaissvzeerryo,lathrgeen,itthiesneiqtuwivilalaedndtttooOoLmSu.cBhut
.weight, and it will lead to under-fitting
Ridge regularization and forces the weights to be small but does not make them zerodoes not give
.the sparse solution
nRoidtgreobisusatstosqouuatrlieertesrmsblowuptheerrordiferencesoftheoutliers,andtheregularizationtermtriestofixitbypenalizing
the weights
Rwiediggehtrsegressionperformsbetterwhenalltheinputfeaturesinfluencetheoutput,andallwith
are of roughly equal size.
L2 regularization can learn complex data patterns.
Q8. What is R square(where to use and where not)?
Ans 8.
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also
known as the coefficient of determination, or the coefficient of multiple determination for
multiple regression.
The definition of R-squared is the percentage of the response variable variation that is explained by a
linear model.
R-squared = Explained variation / Total variation
R-squared is always between 0 and 100%.
0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data.
Page 9
There is a problem with the R-Square. The problem arises when we ask this question to ourselves.** Is it
good to help as many independent variables as possible?**
The answer is No because we understood that each independent variable should have a meaningful
impact. But, even** if we add independent variables which are not meaningful**, will it improve R-
Square value?
Yes, this is the basic problem with R-Square. How many junk independent variables or important
independent variable or impactful independent variable you add to your model, the R-Squared value will
always increase. It will never decrease with the addition of a newly independent variable, whether it
could be an impactful, non-impactful, or bad variable, so we need another way to measure
equivalent R- Square, which penalizes our model with any junk independent variable.
So, we calculate the Adjusted R-Square with a better adjustment in the formula of generic R-square.
Q9. What is Mean Square Error?
The mean squared error tells you how close a regression line is to a set of points. It does this by taking
the distances from the points to the regression line (these distances are the “errors”) and squaring
them.
Giving an intuition
The line equationy=Mx+B. We want to findM (slope)andB (y-intercept)that minimizes the

is squared error.
Page 10
Q10. Why Support Vector Regression? Difference between SVR and a simple regression
model?
Ans 10:
In simple linear regression, try to minimize the error rate. But in SVR, we try to fit the error within
a certain threshold.
Main Concepts:-
1. Boundary
2. Kernel
3. Support Vector
4. Hyper Plane
Blueline: Hyper Plane; Red Line: Boundary-Line
Page 11
Our best fit line is the one where the hyperplane has the maximum number of points.
We are trying to do here is trying to decide a decision boundary at ‘e’ distance from the
original hyperplane such that data points closest to the hyperplane or the support
vectors are within that boundary line
Page 12
DATA SCIENCE
INTERVIEW
PREPARATION
Preparation)
(30 Days of Interview
#DAY 02
Page 1 | 22
Q1. What is Logistic Regression?
Answer:
Thelogistic regressiontechniqueinvolves thedependentvariable,whichcanberepresentedin the
binary(0 or1, trueorfalse, yesorno)values,which meansthattheoutcomecould only bein either
oneformof two. For example, it can beutilized whenweneedtofind theprobability of a
successful orfail event.
Logistic Regressionis used whenthedependent variable (target) is categorical.
Model
Output = 0or 1
Z = WX + B
hΘ (x) = sigmoid(Z)
hΘ (x) = log(P(X) /1- P(X) ) = WX +B
If ‘Z’ goestoinfinity, Y(predicted) will become1,andif ‘Z’ goestonegativeinfinity, Y(predicted)

will become0.
The outputfrom thehypothesis is the estimatedprobability. This is used toinfer howconfident can
predicted value beactual value whengiven aninput X.
Page 2 | 22
CostFunction
Cost( hΘ (x) , Y(Actual)) = -log (hΘ (x)) if y=1

-log (1- hΘ (x)) if y=0
This implementationis forbinarylogistic regression.For datawithmorethan2classes, softmaxre

gression hastobeused.
Q2. Difference between logistic and linear regression?

Answer:
Linear and Logistic regression are the most basic form of regression which are commonly used. The
essential difference between these two is that Logistic regression is used when the dependent
variable is binary. In contrast, Linear regression is used when the dependent variable is continuous,
andthenatureof theregressionlineis linear.
Key Differences between Linear and Logistic Regression
Linearregression modelsdatausingcontinuousnumeric value. As against, logistic regression

modelsthedatain thebinary values.
Linear regressionrequirestoestablishthelinearrelationshipamongdependentandindependent
variables, whereas it is notnecessary for logistic regression.
In linear regression, theindependentvariable can becorrelated with each other. On thecontrary, in
thelogistic regression,thevariable mustnotbecorrelated with eachother.
Q3. Why we can’t do a classification problem using Regression?
Answer:-
Withlinear regression you fit apolynomial through thedata- say, like ontheexample below, wefit a
straightlinethrough{tumor size, tumor type} sampleset:
Page 3 | 22
Above, malignant tumors get 1, andnon-malignant ones get 0, and thegreen line is our hypothesis
h(x). To make predictions, we may say that for any given tumor size x, if h(x) gets bigger than 0.5,
wepredict malignant tumors.Otherwise, wepredict benignly.
It looks like this way, wecould correctly predict every single training set sample, but now let's change
thetask abit.
Intuitively it's clear that all tumors larger certain threshold are malignant. So let's addanother sample
withhugetumor size, andrunlinear regressionagain:
Nowourh(x)>0.5→malignant doesn't workanymore. To keep making correct predictions, we

need tochangeittoh(x)>0.2orsomething-butthatnothowthealgorithmshouldwork.
Wecannotchangethehypothesiseachtimeanewsamplearrives.Instead,weshouldlearnitoffthe
training setdata,andthen(using thehypothesiswe'velearned)makecorrectpredictionsfor the
data wehaven'tseen before.
Linear regression is unbounded.
Q4. What is Decision Tree?
A decision tree is a type of supervised learning algorithm that can be used in

classification aswell as regressor problems. The input to a decision tree canbe both continuous
as well as categorical. The decision tree works on an if-then statement. Decision tree tries to solve a
problembyusingtree representation(Nodeand Leaf)
Assumptionswhilecreatingadecisiontree:1)Initiallyallthetrainingsetisconsideredasa root2)
Featurevaluesarepreferredtobecategorical,ifcontinuousthentheyarediscretized3)Recordsare
Page 4 | 22
distributedrecursively onthebasisof attributevalues4) Whichattributesareconsideredtobein root
nodeorinternal nodeis doneby using astatistical approach.
Q5. Entropy, Information Gain, Gini Index, Reducing Impurity?

Answer:
Therearedifferentattributeswhichdefinethesplit of nodesin adecision tree.Therearefew
algorithms tofindtheoptimal split.
1) ID3(Iterative Dichotomiser 3): This solution uses Entropy and Information gain as metrics
to form a betterdecision tree. The attribute with the highest information gain is used as a root
node,andasimilar approachis followedafter that. Entropy is themeasurethatcharacterizes
theimpurityof anarbitrarycollection of examples.
Entropyvariesfrom0to1.0if all thedatabelongtoasingle class and1if theclass distributionis

equal.Inthisway,entropywillgiveameasure ofimpurityinthedataset.
Steps to decide which attribute to split:
Page 5 | 22
1.Computetheentropy forthedataset
2.For every attribute:
1. Calculateentropy forall categorical values.
2. Take average informationentropy fortheattribute.
3. Calculate gain forthecurrent attribute.

3. Pick theattributewiththehighest informationgain.
4.Repeat until wegetthedesiredtree.
A leaf nodeis decidedwhenentropy is zero

InformationGain= 1- ∑ (Sb/S)*Entropy (Sb)
Sb- Subset, S - entiredata
2)CART Algorithm (Classification andRegression trees): In CART, weusetheGINI index as

ametric. Gini index is usedasacost function toevaluate split in adataset
Steps tocalculateGini for asplit:
1. Calculate Gini forsubnodes,using formulasumofthesquareofprobability for success and
failure (p2+q2).
2. Calculate Gini for split using weighted Gini scoreof eachnodeof thatsplit.
Choosethesplitbasedonhigher Gini value
SplitonGender:
Gini for sub-nodeFemale = (0.2)*(0.2)+(0.8)*(0.8)=0.68
Gini for sub-nodeMale= (0.65)*(0.65)+(0.35)*(0.35)=0.55
Page 6 | 22
Q6. How to control leaf height and Pruning?
Answer:
Tocontrol theleaf size, wecanset theparameters:-
1. Maximumdepth:
Maximumtreedepthis alimit tostopthefurthersplitting ofnodeswhenthespecified treedepthhas
beenreachedduring thebuilding of theinitial decisiontree.
NEVER use maximum depth to limit the further splitting of nodes. In other words: use the
largestpossiblevalue.
2. Minimumsplit size:
Minimumsplit size is alimit tostopthefurther splitting of nodes whenthenumberof observations
in thenodeis lowerthantheminimumsplit size.
This is agoodwaytolimit thegrowthof thetree. Whenaleaf containstoofewobservations,further
splitting will result in overfitting (modeling of noise in thedata).
3. Minimumleaf size
Minimumleafsize is alimit tosplit anodewhenthenumberof observations in oneof thechild nodes
is lowerthan theminimumleaf size.
Pruning is mostly donetoreducethechances of overfitting the treetothe training dataandreduce
theoverall complexity of thetree.
Therearetwotypes of pruning: Pre-pruning andPost-pruning.
1. Pre-pruningis alsoknownastheearly stoppingcriteria. As thenamesuggests, thecriteria

areset as parametervalues while building themodel.The treestops growing whenit meets
any of thesepre-pruningcriteria, orit discovers thepureclasses.
Page 7 | 22
2.InPost-pruning, theideaistoallowthedecisiontreetogrowfullyandobservetheCPvalue.
Next,weprune/cut thetreewiththeoptimalCP(Complexity Parameter) value as
the
parameter.
TheCP (complexity parameter)is usedtocontroltreegrowth.If thecostofaddingavariable is

higher,thenthevalueofCP,treegrowthstops.
Q7. How to handle a decision tree for numerical and categorical

data?
Answer:
Decisiontrees canhandlebothcategorical andnumerical variables at thesametime as features.
There is notany problemin doing that.
Every splitinadecisiontreeis basedonafeature.
1. If thefeatureis categorical, thesplit is donewith theelementsbelonging toa particular

class.
2.If thefeatureis continuous,thesplit is donewith theelementshigher thana threshold.
At everysplit, thedecisiontreewill takethebestvariableatthatmoment.This will bedone
according
to an impurity measure with the split branches. And the fact that the variable used to do split is
categorical or continuous is irrelevant (in fact, decision trees categorize continuous variables by
creating binary regions withthethreshold).
At last, thegoodapproachis toalwaysconvert your categoricals tocontinuous
using L abelEncoder or OneHotEncoding.
Page 8 | 22
Q8. What is the Random Forest Algorithm?
Answer:
Random Forest is an ensemble machine learning algorithm that follows the bagging technique. The
base estimators in the random forest are decision trees. Random forest randomly selects a set of
features that areusedtodecide thebestsplit ateachnodeof thedecision tree.
Looking atit step-by-step,this is whatarandomforest modeldoes:
1. Randomsubsets arecreatedfromtheoriginal dataset (bootstrapping).
2. At eachnodein thedecisiontree,only arandomsetoffeaturesareconsideredtodecidethe best
split.
3.A decision treemodelis fitted oneachof thesubsets.
4. The final prediction is calculated by averagingthepredictions fromall decision trees.
To sumup,theRandomforestrandomlyselectsdatapointsandfeaturesandbuilds multiple trees
(Forest).
RandomForestis usedfor featureimportanceselection.Theattribute(.feature_importances_)is
usedtofind feature importance.
SomeImportantParameters:-
1. n_estimators:-It defines thenumberof decision treestobecreatedin arandomforest.
2.criterion:- "Gini" or "Entropy."
3. min_samples_split:- Usedtodefine theminimumnumberof samples required in aleaf
nodebeforeasplit is attempted
4. max_features:-It definesthemaximumnumberof featuresallowedforthesplit in each
decision tree.
5. n_jobs:- The numberof jobs torunin parallel for bothfit andpredict. Always keep(-1) to use
all thecores for parallel processing.
Q9. What is Variance and Bias tradeoff?
Answer:
In predicting models,theprediction erroris composedof two different errors
1. Bias
2. Variance
Page 9 | 22
It is important to understand thevariance and bias trade-off which tells about to minimize theBias
andVariance in theprediction andavoidsoverfitting & underfitting of themodel.
Bias: It is thedifference betweentheexpected or average prediction of the model and thecorrect
value which we are trying to predict. Imagine if we are trying to build more than one model by
collecting different data sets, and later on, evaluating the prediction, we may end up by different
prediction for all themodels. So, bias is something which measures howfar these model prediction
fromthecorrect prediction.It always leadstoahigh errorin training andtestdata.
Variance: Variability of amodelprediction foragiven datapoint. Wecan build themodelmultiple

times,sothevarianceis howmuchthepredictionsforagiven pointvary betweendifferent
realizations of themodel.
For example:VotingRepublican-13VotingDemocratic-16Non-Respondent-21Total-50 The

probability ofvotingRepublicanis13/(13+16),or44.8%.Weputoutourpressreleasethatthe
Democratsaregoingtowinbyover10points;but,whentheelectioncomesaround,itturnsoutthey
loseby10points.Thatcertainlyreflectspoorlyonus.Wheredidwegowronginourmodel?
Bias scenario's: usingaphonebooktoselectparticipants in oursurveyis oneofoursourcesof
bias. By only surveying certainclassesof people,it skewstheresultsin awaythatwill beconsistent
if we repeated the entire model building exercise. Similarly, not following up with
respondentsisanother sourceofbias,asitconsistentlychangesthemixtureofresponsesweget.Onourbulls-
eyediagram,
thesemoveusawayfromthecenterofthetarget,buttheywouldnotresultinanincreasedscatterofestimates.
Variancescenarios:thesmallsamplesizeisasourceofvariance.Ifweincreasedoursamplesize,
theresultswouldbemoreconsistenteachtimewerepeatedthesurveyandprediction.Theresults stillmightbe
highly inaccurate due to our large sources of bias, but the variance of predictions will be
reduced
Page 10 | 22
Q10. What are Ensemble Methods?
Answer
1. Bagging andBoosting
Decision treeshave beenaroundfor along time andalso knowntosuffer frombias andvariance.
You will have alarge bias withsimple treesandalarge variance withcomplex trees.
Ensemble methods- which combines several decision treestoproducebetterpredictive
performancethanutilizing asingle decisiontree.Themainprinciple behindtheensemble modelis
thatagroup of weaklearners cometogether toformastrong learner.
Twotechniques toperformensembledecisiontrees:
1. Bagging
2. Boosting
Bagging (Bootstrap Aggregation)is usedwhenourgoal is toreducethevariance of adecision tree.

Here the idea is to create several subsets of data from the training sample chosen randomly with
replacement. Now, each collection of subset data is used to train their decision trees. As a result, we
end upwith an ensemble of different models. Average of all thepredictions from different trees are
usedwhichis morerobustthanasingledecisiontree.
Boosting is another ensemble technique to create a collection of predictors. In this technique, learners
arelearned sequentiallywith early learners fitting simple modelstothedataandthenanalyzing data
Page 11 | 22
for errors.In otherwords,wefit consecutivetrees(randomsample),andateverystep,thegoal is to
solve forneterrorfromtheprior tree.
Whenahypothesismisclassifies aninput, its weightis increased,sothatthenexthypothesisis more
likely toclassify it correctly. By combining thewhole set at theendconverts weak learners into a
better performing model.
Thedifferenttypes of boosting algorithms are:

1. AdaBoost
2.GradientBoosting
3. XGBoost
Q11. What is SVM Classification?
Answer:
SVM orLarge marginclassifier is asupervisedlearningalgorithmthatusesapowerfultechnique

called SVM forclassification.
Wehavetwotypes of SVMclassifiers:
1) Linear SVM: In Linear SVM, the data points are expected to beseparated by someapparent
gap. Therefore, the SVM algorithm predicts a straight hyperplane dividing the two classes. The
hyperplane is also called asmaximummargin hyperplane
Page 12 | 22
2) Non-Linear SVM: It is possiblethatourdatapoints arenotlinearly separable in ap-
dimensionalspace,butcanbelinearly separablein ahigherdimension.Kernel tricks makeit
possibletodrawnonlinearhyperplanes.Somestandardkernelsarea) Polynomial Kernel b) RBF
kernel(mostlyused).
Advantages of SVM classifier:

1) SVMs are effective when the number of features is quite large.
2) It works effectively even if the number of features is greater than the number of samples.
3) Non-Lineardatacanalsobeclassifiedusingcustomizedhyperplanesbuiltbyusingkerneltrick.
4) It is a robust model to solve prediction problems since it maximizes margin.
Disadvantagesof SVM classifier:

1)Thebiggestlimitationof theSupportVectorMachineis thechoiceof thekernel.Thewrongchoice of
thekernelcanleadtoanincreaseinerrorpercentage.
2) Withagreaternumberofsamples,itstartsgivingpoorperformances.
3) SVMs have good generalization performance, but they can be extremely slow in the test
phase.
4)SVMs have high algorithmic complexity and extensive memory requirements due to
theuseof quadraticprogramming.
Q11. What is Naive Bayes Classification and Gaussian Naive Bayes

Page 13 | 22
Answer:
Bayes’Theoremfindstheprobabilityofaneventoccurringgiventheprobabilityofanother
event thathasalreadyoccurred.Bayes’theoremisstatedmathematicallyasthefollowingequation:
Now,withregardstoourdataset,wecanapplyBayes’ theoremin following way:

P(y|X) = {P(X|y) P(y)}/{P(X)}
where,y is class variableandX is adependentfeature vector (of size n) where:
X = (x_1,x_2,x_3,.....,x_n)
To clear, anexampleof afeaturevectorandcorresponding class variable can be: (refer 1st rowof
thedataset)
Page 14 | 22
X = (Rainy, Hot, High, False) y = No So basically, P(X|y) here means,
theprobabilityof“Not playinggolf”giventhattheweatherconditionsare“Rainyoutlook”,“Temperatureis
hot”,“high humidity”and “nowind”.
NaiveBayesClassification:
1. Weassumethatnopairoffeaturesaredependent.Forexample,thetemperaturebeing‘Hot’ hasnothing
todowiththehumidity,ortheoutlookbeing‘Rainy’doesnotaffectthewinds.
Hence,thefeaturesareassumed tobeindependent.
2.Secondly,eachfeatureisgiventhesameweight(orimportance).Forexample, knowing
the onlytemperatureandhumidityalonecan’tpredicttheoutcomeaccurately.Noneofthe
attributesisirrelevantandassumedtobecontributingequallytotheoutcome
Gaussian Naive Bayes
Continuousvaluesassociatedwitheachfeatureareassumedtobedistributedaccordingtoa Gaussian
distribution.A Gaussiandistributionis alsocalledNormaldistribution.Whenplotted,it givesabell-
shapedcurvewhichissymmetricaboutthemeanofthefeaturevaluesasshownbelow:
This is as simple as calculating the mean and standard deviation values of each input
variable(x)for eachclassvalue.
Mean(x)=1/n*sum(x)
Wherenisthenumberofinstances,andxisthevaluesforaninputvariableinyourtrainingdata. Wecan
calculatethestandarddeviationusingthefollowingequation:
Standarddeviation(x)=sqrt(1/n*sum(xi-mean(x)^2))
Whentousewhat?StandardNaiveBayesonlysupportscategoricalfeatures,whileGaussianNaive Bayes
onlysupportscontinuouslyvaluedfeatures.
Q12. What is the Confusion Matrix?
Answer:
A confusion matrix is a table that is often used to describe theperformance of aclassification model
(or “classifier”) ona set of test data for which the true values are known. It allows the visualization
of theperformanceof analgorithm.
Page 15 | 22
A confusion matrix is a summary of prediction results on a classification
problem.Thenumberof correctandincorrectpredictionsaresummarizedwithcountvaluesand
brokendownbyeachclass.
This is the key to the confusion matrix.
It gives us insight not only into the errors being made by a classifier but, more importantly,
thetypes oferrorsthatarebeingmade.
Here,
Class 1: Positive
Class 2: Negative
Definitionof theTerms:
1. Positive (P): Observation is positive (for example: is an apple).
2. Negative (N): Observation is notpositive (for example: is notanapple).
3.True Positive (TP): Observation is positive, andis predicted to bepositive.
4. False Negative (FN): Observation is positive, butis predicted negative.
5.True Negative (TN): Observation is negative, andis predicted tobenegative.
6. False Positive (FP): Observation is negative, but is predicted positive.
Q13. What is Accuracy and Misclassification Rate?
Answer:
Accuracy
Accuracy is defined as theratio of thesumof TruePositive andTrue
Negativeby Total(TP+TN+FP+FN)
Page 16 | 22
However, there are problems with accuracy. It assumes equal costs for both kinds of errors.
A 99% accuracy can beexcellent,good,mediocre,poor,or terrible dependingupon the
problem.
MisclassificationRate
Misclassification Rateis definedastheratioofthesumofFalse Positive andFalse

Negativeby Total(TP+TN+FP+FN)
Misclassification Rate is also called Error Rate.
Q14. True Positive Rate & True Negative Rate
Answer:
True Positive Rate:
Sisecnaslictiuvliatyte(dSN as) the number of correct positive predictions divided by the
total number of positives. It is also called Recall (REC) or true positive rate (TPR).
The best sensitivity is 1.0, whereas the worst is 0.0.
Page 17 | 22
True Negative Rate
Specificity (SP) is calculated as the number of correct negative predictions divided by the
total number of negatives. It is also called a true negative rate (TNR). The best specificity is
1.0, whereas the worst is 0.0.
Q15. What is False Positive Rate & False negative Rate?

FalsePositiveRate
False positive rate (FPR) is calculated as the numberof incorrect positive predictions divided by the
total number of negatives. The best false positive rate is 0.0, whereas the worst is 1.0. It can also be
calculated as 1– specificity.
FalseNegativeRate
False Negativerate(FPR) is calculatedasthenumberof incorrect positive predictions divided by
thetotal numberof positives. The bestfalse negative rateis 0.0, whereastheworstis 1.0.
Page 18 | 22
Q16. What are F1 Score, precision and recall?
Recall:-
Recall canbe defined asthe ratioofthe total number ofcorrectlyclassified positive
examples dividetothetotalnumberofpositiveexamples.
1. High Recall indicates the class iscorrectly recognized (small number ofFN).
2. LowRecall indicates the class is incorrectly recognized (large number ofFN).
Recall is given by the relation:
Precision:
Toget the value ofprecision, wedivide the total number ofcorrectly classified positive
examples bythetotalnumberofpredictedpositiveexamples.
1. High Precision indicates anexample labeled as positive is indeed positive (a small
number
ofFP).
2. LowPrecision indicates anexample labeled aspositive isindeed positive (large
numberof
FP).
The relation gives precision:
Remember:-
High recall, lowprecision: This means that mostofthe positive examples are correctly
recognized (lowFN),buttherearealotoffalsepositives.
Lowrecall, high precision: This shows that wemiss alot ofpositive examples (high FN),but
those
wepredictaspositiveareindeedpositive(lowFP).
F-measure/F1-Score:
Page 19 | 22
Since we have two measures (Precision and Recall), it helps to have a measurement that
represents both of them. We calculate an F-measure, which uses Harmonic Mean in place of
Arithmetic Mean as it punishes the extremevalues more.
TheF-MeasurewillalwaysbenearertothesmallervalueofPrecisionorRecall.
Q17. What is RandomizedSearchCV?
Answer:
RandomizedsearchCV is usedtoperformarandomsearchonhyperparameters.Randomized
searchCV uses afit and score method, predict proba, decision_func, transform,etc..,
Theparametersof theestimatorusedtoapplythesemethodsareoptimizedbycross-validated
searchoverparameter settings.
In contrast toGridSearchCV, notall parametervalues are tried out, but rather a fixed number of
parametersettings is sampledfromthespecified distributions. The numberof parametersettings that
aretriedis given by n_iter.
CodeExample:
class sklearn.model_selection.RandomizedSearchCV(estimator,
param_distributions, n_iter=10, scoring=None, fit_params=None,
n_jobs=None, iid=’warn’, refit=Trnu_ejo,
cbvs=’,rwaanrdno’,mv_e
deprecating’, srbtaotsee==N0o,nper,e_edriosrp_astccohr=e‘=2’raise-
return_train_score=’warn’)
Q18. What is
GridSearchCV?
Answer:
Page 20 | 22
Grid search is theprocess of performing hyperparameter tuning todetermine theoptimal values for
agiven model.
CODE Example:-
fromsklearn.model_selectionimportGridSearchCV fromsklearn.svmimportSVR gsc =
GridSearchCV( estimator=SVR(kernel='rbf'),param_grid={ 'C': [0.1, 1, 100, 1000],'epsilon':
[0.0001, 0.0005, 0.001,0.005,0.01, 0.05, 0.1, 0.5, 1, 5, 10], 'gamma':[0.0001, 0.001,0.005,0.1, 1,
3, 5] }, cv=5, scoring='neg_mean_squared_error', verbose=0,n_jobs=-1)
Grid searchrunsthemodelonall thepossiblerangeof hyperparametervaluesandoutputsthebest
model
Q19. What is BaysianSearchCV?
Answer:
Bayesiansearch,incontrasttothegridandrandomsearch,keepstrackofpastevaluationresults,
whichtheyusetoformaprobabilistic modelmappinghyperparameterstoaprobability ofascore
ontheobjectivefunction.
Code:
fromskopt import BayesSearchCV
opt=BayesSearchCV(
SVC(),
Page 21 | 22
{
'C': (1e-6, 1e+6, 'log-uniform'),
'gamma': (1e-6, 1e+1, 'log-uniform') ,
'degree':(1, 8), # integer valued parameter
'kernel': ['linear', 'poly', 'rbf']
},
n_iter=32,
cv=3)
Q20. What is ZCA Whitening?
Answer:
ZeroComponentAnalysis:
Makingtheco-variancematrixastheIdentity matrix is called whitening. This will remove thefirst
andsecond-orderstatistical structure
ZCA transformsthedata tozero meansandmakesthefeatures linearly independent of eachother
In someimage analysis applications, especially whenworkingwithimages of thecolor andtiny typ
e, it is frequently interesting toapply somewhitening tothedatabefore,e.g. training aclassifier.
Page 22 | 22
DATA SCIENCE
INTERVIEW PREPARATION
Preparation)
# DAY 03
Page 1 | 18
Q1. How do you treat heteroscedasticity in regression?
Heteroscedasticitymeansunequalscattereddistribution. In regression analysis, wegenerally talk about

theheteroscedasticityin thecontextof theerrorterm.Heteroscedasticityis thesystematicchangein the
spreadof theresidualsorerrorsovertherangeof measuredvalues.Heteroscedasticityis theproblem
becauseOrdinary leastsquares(OLS) regressionassumesthatall residualsaredrawnfromarandom
populationthathasaconstantvariance.
WhatcausesHeteroscedasticity?
Heteroscedasticity occurs more often in datasets, where we have a large range between the largest
and the smallest observed values. There are many reasons why heteroscedasticity can exist, and a generic
explanationisthattheerrorvariancechangesproportionallywithafactor.
WecancategorizeHeteroscedasticityintotwogeneraltypes:-
Pure heteroscedasticity:- It refers to cases where we specify the correct model and let us
observethe non-constantvarianceinresidualplots.
Impure heteroscedasticity:- It refers to cases where you incorrectly specify the model, and that
causes
thenon-constantvariance.Whenyouleaveanimportantvariableoutofamodel,theomittedeffectis absorbedinto
theerrorterm.Iftheeffectoftheomittedvariablevariesthroughouttheobservedrangeof data,itcanproducethe
telltalesignsofheteroscedasticityintheresidualplots.
How to Fix Heteroscedasticity
Redefining the variables:

If your model is a cross-sectional model that includes large differences between the
sizesof the observations,youcanfinddifferentways tospecify the model that reduces the impact
ofthesize
Page 2 | 18
differential.Todothis,changethemodelfromusingtherawmeasuretousingratesandpercapita
values. Of course,thistypeofmodelanswersaslightly differentkind ofquestion.You’ll needtodetermine
whetherthisapproachissuitableforbothyourdataandwhatyouneedtolearn.
Weightedregression:
It is a method that assigns each data point to a weight based on the variance of its fitted value.
Theidea is togive small weightstoobservationsassociatedwithhighervariancestoshrink theirsquared
residuals. Weightedregressionminimizesthesumoftheweightedsquaredresiduals.Whenyouusethe
correct weights,heteroscedasticityisreplacedbyhomoscedasticity.
Q2. What is multicollinearity, and how do you treat it?
Multicollinearity meansindependentvariablesarehighlycorrelatedtoeachother.Inregression analysis,it'san

importantassumptionthattheregression modelshouldnotbefacedwithaproblemof multicollinearity.
If two explanatory variables are highly correlated, it's hard to tell, which affects the
dependentvariable. Let'ssayYisregressedagainstX1andX2andwhereX1andX2arehighlycorrelated.Thentheeffect
ofX1onYishardtodistinguish fromtheeffectofX2onYbecauseanyincreaseinX1tendstobe
associatedwithan increaseinX2.
Anotherwaytolookatthemulticollinearityproblemis:Individualt-testPvaluescanbemisleading.It means
aP-valuecanbehigh,whichmeansthevariableisnotimportant,eventhoughthevariableisimportant.
Correcting Multicollinearity:
1)Removeoneofthehighly correlatedindependentvariables fromthemodel.Ifyouhavetwoormore
factors with a high VIF, remove one from the model.
2) PrincipleComponentAnalysis(PCA)-Itcutthenumberofinterdependentvariablestoasmallerset
ofuncorrelatedcomponents.Insteadofusinghighlycorrelatedvariables,usecomponents inthe
model thathaveeigenvaluegreaterthan1.
3) RunPROCVARCLUSandchoosethevariablethathasaminimum(1-R2)ratiowithinacluster.
4)Ridge Regression-It is atechniqueforanalyzing multipleregressiondatathatsuffer
from multicollinearity.
5)If you include an interaction term (the product of two independent variables), you
canalsoreduce
multicollinearity by"centering"thevariables.By "centering,"it meanssubtractingthemeanfrom
the valuesoftheindependentvariablebeforecreatingtheproducts.
Page 3 | 18
Whenismulticollinearitynotaproblem?
1)If your goal is to predict Y from a set of X variables, then multicollinearity is not a
problem.The predictionswillstillbeaccurate,andtheoverallR2(oradjustedR2)quantifieshowwellthe
model predictstheYvalues.
2) Multipledummy(binary)variablesthatrepresentacategoricalvariablewiththreeormorecategories.
Q3. What is market basket analysis? How would you do it in Python?
Market basket analysis is the study of items that are purchased or grouped in a single transaction or
multiple, sequential transactions. Understanding the relationships and the strength of those relationships
is valuable information that can be used to make recommendations, cross-sell, up-sell, offer coupons,
etc.
Market Basket Analysis is oneof thekey techniques used by large retailers touncover associations
betweenitems.It worksby looking for combinationsof itemsthatoccurtogether frequently in
transactions.To putit anotherway,it allowsretailerstoidentify relationships betweentheitemsthat
peoplebuy.
Q4. What is Association Analysis? Where is it used?
Association analysis usesasetoftransactionstodiscover rules thatindicate thelikely occurrenceof an

itembasedontheoccurrences ofother items in thetransaction.
The technique of association rules is widely usedfor retail basket analysis. It can also beused for
classification byusing ruleswithclass labels ontheright-handside. It is evenusedforoutlier detection
withrules indicating infrequent/abnormal association.
Association analysis also helps us to identify cross-selling opportunities, for example, wecan use the
rules resulting from the analysis to place associated products together in a catalog, in thesupermarket,
or the Webshop, or apply them when targeting a marketing campaign for product B at customers who
havealready purchasedproduct A.
Page 4 | 18
Associationrules aregivenintheformas below:
A=>B[Support,Confidence] The partbefore=> is referredtoasif (Antecedent)andthepartafter=> is
referred toasthen(Consequent).
WhereA andB aresetsof itemsin thetransactiondata,aandB aredisjoint sets.
Computer=>Anti−virusSoftware[Support=20%,confidence=60%] Above rule says:
1. 20% transaction showAnti-virus software is bought withpurchase of aComputer
2. 60%of customerswhopurchaseAnti-virus softwareis boughtwith purchase of aComputer An
exampleof AssociationRules * Assumethereare100 customers
1. 10of thembought milk, 8bought butter and6bought bothof them2.bought milk => bought
butter
2. support= P(Milk & Butter) = 6/100 = 0.06
3. confidence = support/P(Butter) = 0.06/0.08= 0.75
4.lift= confidence/P(Milk) = 0.75/0.10 = 7.5
Q5. What is KNN Classifier ?
KNN meansK-Nearest Neighbour Algorithm. It can beusedforbothclassification andregression.
It is the simplest machine learning algorithm. Also known as lazy learning (why? Because it
doesnot createageneralizedmodelduringthetimeoftraining,sothetestingphaseisveryimportantwhereit
doestheactualjob.HenceTestingisverycostly-intermsoftime&money).Alsocalledaninstance- basedor
memory-basedlearning
In k-NN classification, the output is a class membership. An object is classified by a plurality vote
of its neighbors, with the object being assigned to the class most common among its k nearest
neighbors(kis apositiveinteger,typicallysmall).Ifk=1,thentheobjectisassignedtotheclassofthatsingle
nearest neighbor.
Page 5 | 18
In k-NN regression,theoutputis theproperty value fortheobject. This value is theaverage of the
values of k nearest neighbors.
All three distance measures are only valid for continuous variables. In the
instanceofcategorical variables,theHammingdistancemustbeused.
How to choose thevalue of K: K value is ahyperparameterwhich needs tochoose during thetime of

modelbuilding
Also, asmall numberof neighbors aremostflexible fit, which will have alow bias, butthehigh variance
andalarge numberof neighbors will have asmootherdecision boundary, whichmeanslowervariance
buthigher bias.
Weshouldchooseanoddnumberif thenumberof classes is even. It is said themostcommonvalues are
tobe3& 5.
Q6. What is Pipeline in sklearn ?

A pipelineis whatchainsseveralstepstogether,oncetheinitial explorationis done.For example,some
codesaremeanttotransform features—normalize numerically, orturntext into vectors, orfill up
missing data,andtheyaretransformers;othercodesaremeanttopredictvariables byfitting an
algorithm,
Page 6 | 18
suchasrandomforestorsupportvectormachine,theyareestimators.Pipeline chains all thesetogether,
which can thenbeappliedtotraining datain block.
Exampleof apipeline thatimputes datawiththemostfrequentvalue of each column, andthenfit a
decisiontreeclassifier.
From sklearn.pipeline import
Pipeline
steps=[('imputation',Imputer(missing_values='NaN',strategy='most_frequent',axis=0)),
('clf', DecisionTreeClassifier())]
pipeline= Pipeline(steps)
clf = pipeline.fit(X_train,y_train)```
Instead of fitting toonemodel, it can beloopedoverseveral modelstofind thebestone.
classifiers= [ KNeighborsClassifier(5), RandomForestClassifier(),GradientBoostingClassifier()]
forclf in classifiers:
steps= [('imputation',Imputer(missing_values='NaN', strategy = 'most_frequent', axis=0)),
('clf', clf)]
pipeline= Pipeline(steps)
I alsolearnedthepipelineitself canbeusedasanestimatorandpassedtocross-validation orgrid

search.
fromsklearn.model_selection importKFold
fromsklearn.model_selectionimportcross_val_score
kfold= KFold(n_splits=10, random_state=seed)
results = cross_val_score(pipeline, X_train, y_train, cv=kfold)
print(results.mean())
Q7. What is Principal Component Analysis(PCA), and why we do?
Themainideaofprincipal componentanalysis (PCA) is toreducethedimensionality ofadataset

consisting ofmanyvariables correlatedwitheachother,eitherheavily orlightly, whileretainingthe
variation present in the dataset, up tothe maximum extent. The same is done by transforming the
variablestoanewsetofvariables,whichareknownastheprincipal components(orsimply, thePCs)
andareorthogonal,orderedsuchthattheretentionofvariationpresentintheoriginalvariablesdecreases aswemove
downintheorder.So,inthisway,the1stprincipalcomponentretainsmaximumvariation thatwaspresentinthe
original components.The principal components are the eigenvectors of a covariance matrix,
andhencetheyareorthogonal.
Page 7 | 18
Mainimportantpointstobeconsidered:
1. Normalizethe data
2.Calculate the covariance matrix
3. Calculate theeigenvaluesand eigenvectors
4. Choosingcomponentsandformingafeaturevector
5.Forming Principal Components
Q8. What is t-SNE?
(t-SNE) t-Distributed Stochastic Neighbor Embedding is a non-linear dimensionality reduction

algorithm usedforexploring high-dimensionaldata.It mapsmulti-dimensionaldatatotwoormore
dimensions suitableforhumanobservation.Withthehelpofthet-SNEalgorithms,youmayhavetoplot
fewer exploratorydataanalysisplotsnexttimeyouworkwithhighdimensionaldata.
Q9. VIF(Variation Inflation Factor),Weight of Evidence &

Information Value. Why and when to use?
Variation Inflation Factor

It provides an index that measures how much the variance (the square of the estimate's
standard deviation)ofan estimatedregressioncoefficientisincreasedbecauseofcollinearity.
VIF = 1 / (1-R-Square of j-th variable) where R2 of jth variable is the coefficient of
determinationof
themodelthat includesallindependentvariablesexceptthejthpredictor.
WhereR-Squareofj-thvariableisthemultipleR2fortheregressionofXjontheotherindependent variables(a
regressionthatdoesnot involvethedependent variableY).
If VIF > 5, then there is a problem with multicollinearity.
UnderstandingVIF
If the variance inflation factor of a predictor variable is 5 this means that variance
for the coefficient of that predictor variable is 5 times as largeas it would be if that predictor variable
wereuncorrelatedwiththeotherpredictorvariables.
In other words, if theofvariance
errorforthecoefficient inflation
thatpredictor factor
variable is 2.23times (√5 =variable
of a predictor is 5 this
2.23) as large as it means that the
would beif
that
standard
predictorvariablewereuncorrelatedwiththeother predictorvariables.
Page 8 | 18
Weightofevidence(WOE) andinformation value(IV) aresimple, yetpowerfultechniquesto
performvariable transformation andselection.
TheformulatocreateWOE andIV is
Hereisasimpletablethatshowshowtocalculatethesevalues.
The IV value can beusedtoselect variables quickly.
Page 9 | 18
Q10: How to evaluate that data does not have any outliers ?
In statistics, outliers are data points that don’t belong to a certain population. It is an abnormal
observation thatliesfarawayfromothervalues.Anoutlierisanobservationthatdivergesfromotherwisewell-
structureddata.
Detection:
Method1—StandardDeviation:In statistics, If a data distribution is approximately normal, then
about68%ofthedata valuesliewithinonestandarddeviationofthemean,andabout95%arewithin two
standarddeviations,and about 99.7%liewithinthreestandarddeviations.
Therefore, if you have any data point that is morethan3 times the standard deviation, then those points
arevery likely tobeanomalousoroutliers.
Method2—Boxplots:Box plots areagraphical depictionof numerical datathrough their quantiles. It

is a very simple but effective way to visualize outliers. Think about the lower andupper whiskers as the
boundaries of the data distribution. Any data points that show above or below the whiskers can be
consideredoutliersoranomalous.
Page 10 | 18
Method3-ViolinPlots: Violin plotsaresimilar toboxplots,exceptthattheyalsoshowtheprobability
densityof thedataatdifferentvalues, usually smoothedby akerneldensityestimator.Typically aviolin plot
will include all the data that is in a box plot: a marker for the median of the data, a box or
marker indicatingtheinterquartilerange,andpossiblyallsamplepointsifthenumberofsamplesisnottoohigh.
Method4- Scatter Plots: A scatterplotis atypeofplotormathematicaldiagramusing Cartesian

coordinatestodisplay values fortypically twovariables forasetofdata.Thedataaredisplayedasa
collection of points, eachhaving thevalue of onevariable determining theposition onthehorizontalaxis
andthevalue of theothervariable determining theposition onthevertical axis.
The pointswhicharevery far awayfromthegeneralspreadof dataandhaveavery fewneighborsare

consideredtobeoutliers
Page 11 | 18
Q11: What you do if there are outliers?
Following aretheapproachestohandle theoutliers:

1.Droptheoutlier records
2. Assign anewvalue: If anoutlierseemstobeduetoamistake in your data,youtry imputing a
value.
3.If percentage-wisethenumberofoutliersis less, butwhenweseenumbers,thereareseveral,
then,in thatcase, dropping themmight cause aloss in insight. Weshould group themin that
caseandrunour analysis separately onthem.
Q12: What are the encoding techniques you have applied with
Examples ?
In many practical data science activities, the data set will contain categorical variables. These variables
are typically stored as text values". Since machine learning is based on mathematical equations, it
wouldcause aproblemwhenwekeepcategorical variables as is.
Let's considerthefollowing datasetoffruit namesandtheirweights.

Someof thecommonencoding techniques are:
Label encoding: In label encoding, wemapeachcategory to a numberor a label. The labels chosenfor
sthuechcategorieshavenorelationship.So categoriesthathave someties orareclose toeachotherlose
informationafter encoding.
One - hot encoding: In this method, we mapeach category to a vector that contains 1 and 0 denoting
the presence of the feature or not. The number of vectors depends on the categories which we want to
keep. For high cardinality features, this methodproduces a lot of columns that slows downthe learning
significantly.
Q13: Tradeoff between bias and variances, the relationship between

them.
Whenever we discuss model prediction, it’s important to understand prediction errors (bias and
variance). The prediction error for any machine learning algorithm can be broken down into three parts:
Bias Error
Variance Error
IrreducibleError
Page 12 | 18
The irreducible error cannot be reduced regardless of what algorithm is used. It is the error introduced
from the chosen framing of the problem and may be caused by factors like unknown variables that
influencethemapping of theinput variables totheoutput variable.
Bias: Bias meansthatthemodelfavors oneresult morethan theothers. Bias is thesimplifying
assumptions madeby a model to make the target function easier to learn. The model with high bias pays
very little attention to the training data and oversimplifies the model. It always leads to a high error in
training andtestdata.
Variance: Variance is theamountthattheestimateof thetarget function will change if different training
datawasused.Themodelwithhigh variancepaysalot ofattentiontotraining dataanddoesnot
generalizeonthedatawhichit hasn’tseenbefore.As aresult,such models performvery well ontraining
databuthave high errorrateson testdata.
So, the end goal is to come up with a model that balances both Bias and Variance. This is called
Bias Variance Trade-off.To builda goodmodel,we needto finda good balance betweenbias and
variance suchthatitminimizesthetotalerror.
Q14: What is the difference between Type 1 and Type 2 error

and severity of the error?
Type I Error
A Type I error is often referred to as a “false positive" and is the incorrect
rejectionofthetruenull hypothesisinfavorofthealternative.
In the example above, the null hypothesis refers to the natural state of things or the absence of
thetested
effectorphenomenon,i.e.,statingthatthepatientisHIVnegative.Thealternativehypothesisstates that
thepatientisHIVpositive.Manymedicaltestswillhavethediseasetheyaretestingforasthealternativehypothesisand
thelackofthatdiseaseasthenullhypothesis.
Page 13 | 18
A Type I errorwouldthusoccurwhenthepatient doesn’t have thevirus, butthetestshows thatthey do.
In otherwords,thetestincorrectly rejects thetruenull hypothesis thatthepatient is HIV negative.
TypeII Error
A Type II error is theinverse of a Type I error andis thefalse acceptance of a null hypothesis that is not
true, i.e., a false negative. A Type II error would entail the test telling the patient they are free of HIV
whentheyarenot.
Considering this HIV example, which errortypedoyouthink is moreacceptable? In otherwords,would
youratherhave atest thatwas morepronetoType I or Types II error? WithHIV, themomentarystress of
afalse positiveis likely betterthanfeeling relieved atafalse negativeandthenfailing totakestepsto treat
the disease. Pregnancy tests, blood tests, and any diagnostic tool that has serious consequences for the
healthofapatientareusually overlysensitiveforthisreason–theyshoulderronthesideofafalse positive.
But in mostfields of science, Type II errors areseenasless serious thanType I errors.WiththeType II
error, a chance to reject the null hypothesis was lost, and no conclusion is inferred from a non-rejected
null. But the Type I error is more serious because you have wrongly rejected the null hypothesis and
ultimately madeaclaim that is not true. In science, finding a phenomenonwhere there is noneis more
egregiousthanfailing tofind aphenomenonwherethereis.
Q15: What is binomial distribution and polynomial distribution?

Binomial Distribution: A binomialdistributioncanbethoughtofassimply theprobability of a
SUCCESS orFAILURE outcomein anexperimentorsurveythatis repeatedmultiple times.The
binomialis atypeofdistributionthat has twopossible outcomes (the prefix “bi” meanstwo, or twice).
For example,acoin tosshasonly twopossibleoutcomes:headsortails, andtakingatestcouldhavetwo
possibleoutcomes:passorfail.
Multimonial/PolynomialDistribution: Multi orPoly meansmany. In probability theory, the
multinomialdistribution is ageneralizationofthebinomialdistribution. For example,it modelsthe
probability of counts of eachside for rolling ak-sided die ntimes.For nindependent trials eachof which
leadstosuccessforexactly oneofk categories,witheachcategoryhaving agiven fixed success
probability,themultinomial distribution gives theprobability of any particular combination of numbers
of successesforthevarious categories
Page 14 | 18
Q16: What is the Mean Median Mode standard deviation for
the sample and population?
MeanIt is an important technique in statistics. Arithmetic Mean can also be called an average.
It is the number of the quantity obtained by summing two or more numbers/variables and
thendividingthesumbythenumberofnumbers/variables.
ModeThemodeisalsooneofthetypesforfindingtheaverage.Amodeisanumberthatoccursmost
frequentlyinagroupofnumbers.Someseriesmightnothaveanymode;somemighthavetwomodes,
whichiscalleda bimodalseries.
In the study of statistics, the three most common ‘averages’ in statistics are mean, median, and
mode.
Median is also a way of finding the average of a group of data points. It’s the middle number
ofasetof numbers.Therearetwopossibilities,thedatapointscanbeanoddnumbergroup,oritcan
beanevennumbergroup.
If the group is odd, arrange the numbers in the group from smallest to largest. The
medianwillbethe onewhichisexactlysittinginthemiddle,withanequalnumberoneithersideofit.Ifthe
groupis even,arrangethenumbersinorderandpickthetwomiddlenumbersandaddthemthendivideby
2. Itwillbethemediannumberofthatset.
Standard Deviation (Sigma) Standard Deviation is a measure of how much your data is spread out in
statistics.
Q17: What is Mean Absolute Error ?

What is Absolute Error? Absolute Error is the amount of error in your measurements. It is the
difference between the measured value and the “true” value. For example, if a scale states 90
pounds,butyouknowyourtrueweightis89pounds,thenthescalehasanabsoluteerrorof90 lbs–89
lbs = 1 lbs.
This can be caused by your scale, not measuring the exact amount you are trying to
measure.For example,yourscalemaybeaccuratetothenearestpound.Ifyouweigh89.6lbs,thescale
may “roundup”andgiveyou90lbs.Inthiscasetheabsoluteerroris90lbs–89.6lbs=.4lbs.
Mean Absolute Error The Mean Absolute Error(MAE) is the average of all absolute errors. The
formula
is: mean absolute error
Page 15 | 18
Where,
n= thenumberoferrors,Σ = summationsymbol(whichmeans“add themall up”), |xi – x|= the
absolute errors.The formula maylook alittle daunting, butthesteps areeasy:
Find all of your absolute errors, xi – x. Add themall up. Divide by thenumberof errors. For example,
if youhad 10measurements,divide by 10.
Q18: What is the difference between long data and wide data?
There are many different ways that you can present thesamedataset totheworld. Let's take alook at
oneof themostimportantandfundamentaldistinctions, whetheradataset is wideorlong.
Thedifferencebetweenwideandlong datasets boils downto whetherweprefer tohave morecolumns
in ourdataset ormorerows.
Wide Data A dataset thatemphasizes putting additional dataabout asingle subject in columns is called
awidedataset because, asweadd morecolumns,thedataset becomes wider.
Long Data Similarly, adataset thatemphasizes including additional dataaboutasubject in rowsis called
along datasetbecause,asweaddmorerows,thedatasetbecomeslonger.It's importanttopointoutthat
there's nothing inherently goodorbadaboutwideorlong data.
In theworldof datawrangling, wesometimes needtomakealong dataset wider, andwesometimes need
tomakeawidedatasetlonger. However,it is truethat,asageneralrule, datascientists whoembracethe
conceptof tidy datausually prefer longer datasets over wider ones.
Q19: What are the data normalization method you have applied, and
why?
Normalization is a technique often applied as part of data preparation for machine learning. The goal
of normalization is to change the values of numeric columns in the dataset to a common scale, without
distorting differences in the ranges of values. For machine learning, every dataset does not require
normalization.It is required only whenfeatures have different ranges.
In simple words, when multiple attributes are there, but attributes have values on different scales, this
may lead to poor data models while performing data mining operations. So they are normalized to bring
all theattributes onthesamescale, usually something between(0,1).
It is not always a good idea to normalize thedata since wemight lose information about maximum and
minimumvalues. Sometimesit is agoodidea todoso.
For example, ML algorithms such asLinear RegressionorSupportVector Machines typically
converge
cfaosuteldronnormalizeddata. But onalgorithms like K-means orK NearestNeighbours, normalization
Page 16 | 18
beagoodchoiceorabaddependingontheusecasesincethedistancebetweenthepointsplaysakey rolehere.
Types of Normalisation :
1Min-MaxNormalization:In most cases, standardization is used feature-
wise
2Z-score normalizationIn this technique,valuesarenormalizedbasedonameanandstandard

deviation of thedata
v’, v is newandoldof eachentryin datarespectively.σA, A is thestandarddeviationandmeanof A

respectively.
standardization(orZ-score normalization)is thatthefeatureswill berescaledsothatthey’ll havethe
properties of astandardnormal distributionwith
μ=0 andσ=1 whereμ is themean(average) andσ is thestandarddeviation fromthemean;standard
scores (also called z scores)of thesamples arecalculated asfollows:
z=(x−μ)/σ
Page 17 | 18
Q20: What is the difference between normalization and
Standardization with example?
In ML, every practitioner knows that feature scaling is an important issue. The two most discussed
scaling methods are Normalization and Standardization. Normalization typically means it
rescales thevalues into arange of [0,1].
It is an alternative approach to Z-score normalization (or standardization) is the so-called Min-Max
scaling (often also called “normalization” - a commoncause for ambiguities). In this approach, thedata
is scaled toafixed range - usually 0to1. Scikit-Learn providesatransformer called MinMaxScaler
for
this. A Min-Max scaling is typically donevia thefollowing equation:
Xnorm= X-Xmin/Xmax-Xmin
Example withsampledata: BeforeNormalization: AttributePrice in Dollars StorageSpaceCamera
AttributePriceinDollars StorageSpaceCamera
Mobile1250 1612
Mobile2200 168
Mobile3300 3216
Mobile4275 328
Mobile5225 1616
After Normalization: (Values ranges from 0-1 whichis working as expected)
AttributePriceinDollars StorageSpaceCamera
Mobile10.50 0.5
Mobile2000
Mobile3111
Mobile40.751 0
Mobile50.250 1
Standardization (or Z-score normalization) typically means rescales data to have a mean of 0 and a
standarddeviation of 1(unit variance) Formula:Z or X_new=(x−μ)/σ whereμ is themean(average),
andσ is thestandard deviation from themean;standard scores (also called z scores) Scikit-Learn
provides a transformer called StandardScaler for standardization Example: Let’s take an approximately
normally distributedsetof numbers:1, 2, 2, 3, 3, 3, 4, 4, and5. Its meanis 3, andits standarddeviation:
1.22. Now,let’s subtractthe meanfromall datapoints.weget anewdataset of: -2, -1, -1, 0, 0, 0, 1, 1,
and2. Now,let’s divide eachdatapoint by 1.22. As youcan seein thepicture below, weget: -1.6,
-0.82,
-0.82, 0, 0, 0, 0.82, 0.82, and1.63
Page 18 | 18
DATA SCIENCE
INTERVIEW
PREPARATION
Preparation)
# DAY 04
Q1. What is upsampling and downsampling with examples?
The classification data set with skewed class proportions is called an
imbalanced data set. Classes which make up a large proportion of the
data sets are called majority classes. Those make up smaller
proportions are minority classes.
Degree of imbalance Proportion of Minority Class
1>> Mild 20-40% of the data set
2>> Moderate 1-20% of the data set
3>> Extreme <1% of the data set
If we have an imbalanced data set, first try training on the true
distribution.
If the model works well and generalises, you are done! If not, try
the following up sampling and down sampling technique.
1. Up-sampling
Upsampling is the process of randomly duplicating observations from
the minority class to reinforce its signal.
First, we will import the resampling module from Scikit-
Learn: Module for resampling Python
1- From sklearn.utils import resample
Next, we will create a new Data Frame with an up-sampled minority
class. Here are the steps:
1First, we will separate observations from each class into different
Data Frames.
2Next, we will resample the minority class with replacement, setting the
number of samples to match that of the majority class.
3Finally, we'll combine the up-sampled minority class Data Frame with the
original majority class Data Frame.
2-Down-sampling
Downsampling involves randomly removing observations from the
majority class to prevent its signal from dominating the learning
algorithm.
The process is similar to that of sampling. Here are the steps:
1-First, we will separate observations from each class into different
Data Frames.
2Next, we will resample the majority class without replacement, setting the
number of samples to match that of the minority class.
3Finally, we will combine the down-sampled majority class Data Frame
with the original minority class Data Frame.
Q2. What is the statistical test for data validation with an example,
Chi-square, ANOVA test, Z statics, T statics, F statics,
Hypothesis Testing?
Before discussing the different statistical test, we need to get a clear

understanding of what a null hypothesis is. A null hypothesis proposes that
has no significant difference exists in the set of a given observation.
Null: Two samples mean are equal. Alternate: Two samples mean are not
equal.
For rejecting the null hypothesis, a test is calculated. Then the test statistic
is compared with a critical value, and if found to be greater than the critical
value, the hypothesis will be rejected.
Critical Value:-
Critical values are the point beyond which we reject the null hypothesis.
Critical value tells us, what is the probability of N number of samples,
belonging to the same distribution. Higher, the critical value which means
lower the probability of N number of samples belonging to the same
distribution.
Critical values can be used to do hypothesis testing in the following way.
1. Calculate test statistic
2. Calculate critical values based on the significance level alpha
3. Compare test statistics with critical values.
IMP-If the test statistic is lower than the critical value, accept the hypothesis
or else reject the hypothesis.
Chi-Square Test:-
A chi-square test is used if there is a relationship between two categorical
variables.
Chi-Square test is used to determine whether there is a significant
difference between the expected frequency and the observed frequency in
one or more categories. Chi-square is also called the non-parametric test
as it will not use any parameter
2-Anova test:-
ANOVA, also called an analysis of variance, is used to compare multiples
(three or more) samples with a single test.
Useful when there are more than three populations. Anova compares the
variance within and between the groups of the population. If the variation is
much larger than the within variation, the means of different samples will
not be equal. If the between and within variations are approximately the
same size, then there will be no significant difference between sample
means. Assumptions of ANOVA: 1-All populations involved follow a normal
distribution. 2-All populations have the same variance (or standard
deviation). 3-The samples are randomly selected and independent of one
another.
ANOVA uses the mean of the samples or the population to reject or
support the null hypothesis. Hence it is called parametric testing.
3-Z Statics:-
In a z-test, the samples are assumed to be normal distributed. A z score is
calculated with population parameters as “population mean” and
“population standard deviation” and it is used to validate a hypothesis that
the sample drawn belongs to the same population.
The statistics used for this hypothesis testing is called z-statistic, the score
for which is calculated as z = (x —μ) / (σ / √n), where x= sample mean μ =
population mean σ / √n =population standard deviation If the test statistic is
lower than the critical value, accept the hypothesis or else reject the
hypothesis
4- T Statics:-
A t-test used to compare the mean of the given samples. Like z-test, t-test
also assumed a normal distribution of the samples. A t-test is used when
the population parameters (mean and standard deviation) are unknown.
There are three versions of t-test
1. Independent samples t-test which compare means for two groups
2. Paired sample t-test which compares mean from the same group at
different times
3. Sample t-test, which tests the mean of the single group against the
known mean. The statistic for hypothesis testing is called t-statistic,
the score for which is calculated as t =(x1 —x2) / (σ / √n1 +σ / √n2),
where
x1 =It is mean of sample A, x2 =mean of sample B,

n1 =size of sample 1 n2 =size of sample 2
5- F Statics:-
The F-test is designed to test if the two population variances are equal. It
compares the ratio of the two variances. Therefore, if the variances are
equal, then the ratio of the variances will be 1.
The F-distribution is the ratio of two independent chi-square variables
divided by their respective degrees of freedom.
F =s1^2 / s2^2 and where s1^2 >s2^2.
If the null hypothesis is true, then the F test-statistic given above can be
simplified. This ratio of sample variances will be tested statistic used. If the
null hypothesis is false, then we will reject the null hypothesis that the ratio
was equal to 1 and our assumption that they were equal.
Q3. What is the Central limit theorem?

Central Limit Theorem
Definition: The theorem states that as the size of the sample increases, the
distribution of the mean across multiple samples will approximate a
Gaussian distribution (Normal). Generally, sample sizes equal to or greater
than 30 are consider sufficient for the CLT to hold. It means that the
distribution of the sample means is normally distributed. The average of the
sample means will be equal to the population mean. This is the key
aspect of the theorem.
Assumptions:
1. The data must follow the randomization condition. It must be sampled

randomly
2. Samples should be independent of each other. One sample
should not influence the other samples
3. Sample size should be no more than 10% of the population when
sampling is done without replacement
4. The sample size should be sufficiently large. The mean of the sample
means is denoted as:
µ X̄ =µ
Where,
µ X̄ =Meanofthesamplemeansµ=Populationmeanand,thestandard deviationofthe
sample mean is denoted as:
σ X̄ =σ/sqrt(n)
Where,
σ X̄ =Standarddeviationofthesamplemeanσ=Populationstandard deviationn=sample
size
A sufficiently large sample size can predict the characteristics of a
population accurately. For Example, we shall take a uniformly
distributed data:
Randomly distributed data: Even for a randomly (Exponential)
distributed data the plot of the means is normally distributed.
The advantage of CLT is that we need not worry about the actual data
since the means of it will always be normally distributed. With this, we
can
create component intervals, perform T-tests and ANOVA tests from
the given samples.
Q4. What is the correlation and

coefficient? What is the Correlation
Coefficient?
The correlation coefficient is a statistical measure that calculates the
strength of the relationship between the relative movements of two
variables. We use it to measure both the strength and direction of a linear
relationship between two variables the values range between -1.0 and 1.0.
A calculated number greater than 1.0 or less than -1.0 means that there
was an error in the correlation measurement. A correlation of -1.0 shows a
perfect negative correlation, while a correlation of 1.0 shows a perfect
positive correlation.
Correlation coefficient formulas are used to find how strong a relationship is

between data. The formulas return a value between -1 and 1, where:
1 indicates a strong positive relationship. -1 indicates a strong negative
relationship. A result of zero indicates no relationship at all.
Meaning
1. A correlation coefficient of 1 means that for every positive increase in

one variable, there is a positive increase in a fixed proportion in the
other. For example, shoe sizes go up in (almost) perfect correlation
with foot length.
2. A correlation coefficient of -1 means that for every positive increase
in one variable, there is a negative decrease of a fixed proportion in
the other. For example, the amount of gas in a tank decreases in
(almost)
perfect correlation with speed.
3.Zero means that for every increase, there isn’t a positive or negative
increase. The two just aren’t related.
What is a Negative Correlation?
Negative correlation is a relationship between two variables in which
one variable increases as the other decreases, and vice versa. In
statistics, a perfect negative correlation is represented by the value -1.
Negative correlation or inverse correlation is a relationship between two
variables whereby they move in opposite directions. If variables X and
Y have a negative correlation (or are negatively correlated), as X
increases in value, Y will decrease; similarly, if X decreases in value, Y
will increase.
What Is Positive Correlation?
Positive correlation is a relationship between two variables in which
both variables move in tandem—that is, in the same direction. A
positive correlation exists when one variable decreases as the other
variable decreases or one variable increases while the other increases.
We use the correlation coefficient to measure the strength and direction of

the linear relationship between two numerical variables X and Y.
The correlation coefficient for a sample of data is denoted by r.
Pearson Correlation Coefficient
Pearson is the most widely used correlation coefficient. Pearson
correlation measures the linear association between continuous variables.
In other words, this coefficient quantifies the degree to which a
relationship between two variables can be described by a line. Formula
developed by Karl Pearson over 120 years ago is still the most widely
used today. The
formula for the correlation (r) is
Where n is the number of pairs of data;
Are the sample means of all the x-values and all the y-values, respectively;
and sx and sy are the sample standard deviations of all the x- and y-values,
respectively.
1. Find the mean of all the x-values and mean of all y-values.
2.Find the standard deviation of all the x-values (call it sx) and the
standard deviation of all the y-values (call it sy). For example, to find
sx, you would use the following equation:
3. For each of the n pairs (x, y) in the data set, take
4. Add up the n results from Step 3.
5. Divide the sum by sx ∗ sy.
6. Divide the result by n – 1, where n is the number of (x, y) pairs. (It’s
the same as multiplying by 1 over n – 1.) This gives you the
correlation, r.
Q5: What is the difference between machine learning and deep

learning?
Machine Learning |deep learning

Machine Learning is a technique to learn from that data and then apply wha
t has been learnt to make an informed decision | The main difference betwe
en deep and machine learning is, machine learning models become better
progressively but the model still needs some guidance. If a machine-
learning model returns an inaccurate prediction then the programmer need
s to fix that problem explicitly but in the case of deep learning, the model do
es it by himself.
>Machine Learning can perform well with small size data also | Deep Learn
ing does not perform as good with smaller datasets.
>Machine learning can work on some low-

end machines also | Deep Learning involves many matrix multiplication op
erations which are better suited for GPUs
>Features need to be identified and extracted as per the domain before pu

shing them to the algorithm | Deep learning algorithms try to learn high-
level features from data.
>It is generally recommended to break the problem into smaller chunks, sol
ve them and then combine the results | It generally focusses on solving the
problem end to end
>Training time is comparatively less | Training time is comparatively more

>Results are more interpretable | Results Maybe more accurate but less int
erpretable
>No use of Neural networks | uses neural networks
>Solves comparatively less complex problems | Solves more complex prob
lems.
Q6: What is perceptron and how it is related to human neurons?
If we focus on the structure of a biological neuron, it has dendrites, which
are used to receive inputs. These inputs are summed in the cell body and
using the Axon it is passed on to the next biological neuron as shown
below.
Dendrite: Receives signals from other neurons
Cell Body: Sums all the inputs
Axon: It is used to transmit signals to the other cells
Similarly, a perceptron receives multiple inputs, applies various
transformations and functions and provides an output. A Perceptron is a
linear model used for binary classification. It models a neuron, which has a
set of inputs, each of which is given a specific weight. The neuron
computes some function on these weighted inputs and gives the output.
Q7: Why deep learning is better than machine learning?

Though traditional ML algorithms solve a lot of our cases, they are not
useful while working with high dimensional data that is where we
have a large number of inputs and outputs. For example, in the case
of handwriting recognition, we have a large amount of input where
we will
have different types of inputs associated with different types of
handwriting.
The second major challenge is to tell the computer what are the features it
should look for that will play an important role in predicting the outcome as
well as to achieve better accuracy while doing so.
Q8: What kind of problem can be solved by using deep learning?

Deep Learning is a branch of Machine Learning, which is used to solve
problems in a way that mimics the human way of solving problems.
Examples:
Image recognition
Object Detection
Natural Language processing- Translation, Sentence formations, text

to speech, speech to text
understand the semantics of actions
Q9: List down all the activation function using mathematical
Expression and example. What is the activation function?

Activation functions are very important for an Artificial Neural Network to
learn and make sense of something complicated and the Non-linear
complex functional mappings between the inputs and response variable.
They introduce non-linear properties to our Network. Their main purposes
are to convert an input signal of a node in an A-NN to an output signal.
So why do we need Non-Linearities?

Non-linear functions are those, which have a degree more than one, and
they have a curvature when we plot a Non-Linear function. Now we need
a Neural Network Model to learn and represent almost anything and any
arbitrary complex function, which maps inputs to outputs. Neural-Networks
are considered Universal Function Approximations. It means that they can
compute and learn any function at all.
Most popular types of Activation functions -
Sigmoid or Logistic
Tanh — Hyperbolic tangent
ReLu -Rectified linear
units
Sigmoid Activation function: It is a activation function of form f(x) = 1 /

1
+exp(-x).ItsRangeisbetween0and1.ItisanS-shapedcurve.Itiseasy tounderstand.
Hyperbolic Tangent function- Tanh : It’s mathematical formula is f(x) =1

—exp(-2x) / 1 +exp(-2x). Now it’s the output is zero centred because its
range in between -1 to 1 i.e. -1 <output <1 . Hence optimisation is easier
in this method; Hence in practice, it is always preferred over Sigmoid
function.
ReLu- Rectified Linear units: It has become more popular in the past
couple of years. It was recently proved that it has six times improvement in
convergence from Tanh function. It’s R(x) =max (0,x) i.e. if x <0 , R(x) =0
and if x >=0 , R(x) = x. Hence as seen that mathematical form of this
function, we can see that it is very simple and efficient. Many times in
Machine learning and computer science we notice that most simple and
consistent techniques and methods are only preferred and are the best.
Hence, it avoids and rectifies the vanishing gradient problem. Almost all the
deep learning Models use ReLu nowadays.
Q10: Detail explanation about gradient decent using example and

Mathematical expression?
Gradient descent is an optimisation algorithm used to minimize some
function by iteratively moving in the direction of steepest descent as
defined by negative of the gradient. In machine learning, we used gradient
descent to update the parameters of our model. Parameters refer to
coefficients in the Linear Regression and weights in neural networks.
The size of these steps called the learning rate. With the high learning rate,
we can cover more ground each step, but we risk overshooting the lower
point since the slope of the hill is constantly changing. With a very lower
learning rate, we can confidently move in the direction of the negative
gradient because we are recalculating it so frequently. The Lower learning
rate is more precise, but calculating the gradient is time-consuming, so it
will take a very large time to get to the bottom.
Math
Now let’s run gradient descent using new cost function. There are two
parameters in cost function we can control: m (weight) and b (bias). Since
we need to consider that the impact each one has on the final prediction,
we need to use partial derivatives. We calculate the partial derivative of the
cost function concerning each parameter and store the results in a
gradient.
Math
Given the cost function:
To solve for the gradient, we iterate by our data points using our new m
and b values and compute the partial derivatives. This new gradient tells
us about the slope of the cost function at our current position (current
parameter values) and the directions we should move to update our
parameters. The learning rate controls the size of our update.
Q11: What is backward propagation?

Back-propagation is the essence of the neural net training and this
method of fine-tuning the weights of a neural net based on the errors
rate obtained in the previous epoch. Proper tuning of the weights
allows us to reduce error rates and to make the model reliable by
increasing its generalisation.
Backpropagation is a short form of "backward propagation of errors."
This is the standard method of training artificial neural networks. This
helps to calculate the gradient of a loss function with respects to all
the weights in the network.
Most prominent advantages of Backpropagation are:
Backpropagation is the fast, simple and easy to program.
It has no parameters to tune apart from the numbers of input.

It is the flexible method as it does not require prior knowledge about
the network
It is the standard method that generally works well.
It does not need any special mentions of the features of the function
to be learned.
Q12: How we assign weights in deep learning?

We already know that in a neural network, weights are usually initialised
randomly and that kind of initialisation takes a fair/significant amount of
repetitions to converge to the least loss and reach the ideal weight matrix.
The problem is, that kind of initialisation is prone to vanishing or exploding
gradient problems.
General ways to make it initialise better weights:
ReLu activation function in the deep nets.
1. Generate a random sample of weights from a Gaussian
distribution having mean 0 and a standard deviation of 1.
2. Multiply the sample with the square root of (2/ni). Where ni is
the number of input units for that layer.
b) Likewise, if you’re using Tanh activation function :
1. Generate a random sample of weights from a Gaussian
distribution having mean 0 and a standard deviation of 1.
2. Multiply the sample with the square root of (1/ni) where ni is
several input units for that layer.
Q13: What is optimiser is deep learning, and which one is the best?
Deep learning is an iterative process. With so many hyperparameters
to tune or methods to try, it is important to be able to train models
fast, to quickly complete the iterative cycle. This is the key to increase
the speed and efficiency of a machine learning team.
Hence the importance of optimisation algorithms such as
stochastic gradient descent, min-batch gradient descent,
gradient descent with momentum and the Adam optimiser.
Adam optimiser is the best one.

Given an algorithm f(x), it helps in either minimisation or maximisation
of the value of f(x). In this context of deep learning, we use
optimisation algorithms to train the neural network by optimising the
cost function J.
The cost function is defined as:

The value of the cost function J is the mean of the loss L between the
predicted value y’and actual value y. The value y” is obtained during the
forward propagation step and makes use of the Weights W and biases b of
the network. With the help of optimisation algorithms, we minimise the
value of Cost Function J by updating the values of trainable
parameters W and b.
Q14: What is gradient descent, mini-batch gradient descent, batch
gradient decent, stochastic gradient decent and adam?
Gradient Descent
it is an iterative machine learning optimisation algorithm to reduce the cost
function, and help models to make accurate predictions.
Gradient indicates the direction of increase. As we want to find the
minimum points in the valley, we need to go in the opposite direction of the
gradient. We update the parameters in the negative gradient direction to
minimise the loss.
Where θ is the weight parameter, η is the learning rate, and ∇ J(θ ;x,y)
is the gradient of weight parameter θ
Types of Gradient Descent
Different types of Gradient descents are
Batch Gradient Descent or Vanilla Gradient
Descent Stochastic Gradient Descent
Mini batch Gradient Descent
Batch Gradient Descent
In the batch gradient, we use the entire dataset to compute the gradient of
the cost function for each iteration for gradient descent and then update the
weights.
Stochastic Gradient descent

Stochastic gradient descent, we use a single data point or example to
calculate the gradient and update the weights with every iteration.
We first need to shuffle the datasets so that we get a completely
randomised dataset. As the datasets are random and weights, are updated
for every single example, an update of the weights and the cost functions
will be noisy jumping all over the place
Mini Batch Gradient descent
Mini-batch gradients is a variation of stochastic gradient descent where
instead of a single training example, a mini-batch of samples are used.
Mini -batch gradient descent is widely used and converges faster and is
more stable.
The batch size can vary depending upon the dataset.
As we take batches with different samples, it reduces the noise which is a
variance of the weights updates, and that helps to have a more stable
converge faster.
Q15: What are autoencoders?

An autoencoder, neural networks that have three layers:
An input layer, a hidden layer which is also known as encoding layer, and a
decoding layer. This network is trained to reconstruct its inputs, which
forces the hidden layer to try to learn good representations of the inputs.
An autoencoder neural network is an unsupervised Machine-learning
algorithm that applies backpropagation, setting the target values to be
equal to the inputs. An autoencoder is trained to attempts to copy its
input to its output. Internally, it has a hidden layer which describes a
code used to represent the input.
Autoencoder Components:
Autoencoders consists of 4 main parts:

1Encoder: In this, the model learns how to reduce the input dimensions
and compress the input data into an encoded representation.
2Bottleneck: In this, the layer that contains the compressed
representation of the input data. This is the lowest possible dimension of
the input data.
3Decoder: In this, the model learns how to reconstruct the data from the
encod represented to be as close to the original inputs as possible.
4Reconstruction Loss: In this method that measures measure how well
the decoder is performing and how closed the output is related to the
original input.
Types of Autoencoders :
1. Denoising auto encoder
2. Sparse auto encoder

3. Variational auto encoder (VAE)
4. Contractive auto encoder (CAE)

Q16: What is CNN?
This is the simple application of a filter to an input that results in
inactivation. Repeated application of the same filter to input results in a
map of activations called a feature map, indicating the locations and
strength of a detected feature in input, such as an image.
Convolutional layers are the major building blocks which are used in
convolutional neural networks.
A covnets is the sequence of layers, and every layer transforms one
volume to another through differentiable functions.
Different types of layers in CNN:
Let’s take an example by running a covnets on of image of dimensions 32 x
32 x 3.
1. Input Layer: It holds the raw input of image with width 32, height 32
and depth 3.
2. Convolution Layer: It computes the output volume by computing dot
products between all filters and image patches. Suppose we use a
total of 12 filters for this layer we’ll get output volume of dimension 32
x 32 x 12.
3.Activation Function Layer: This layer will apply the element-wise
activation function to the output of the convolution layer. Some
activation functions are RELU: max(0, x), Sigmoid: 1/(1+e^-x), Tanh,
Leaky RELU, etc. So the volume remains unchanged. Hence output
volume will have dimensions 32 x 32 x 12.
4. Pool Layer: This layer is periodically inserted within the covnets, and
its main function is to reduce the size of volume which makes the
computation fast reduces memory and also prevents overfitting.
Two common types of pooling layers are max pooling and average
pooling. If we use a max pool with 2 x 2 filters and stride 2, the
resultant volume will be of dimension 16x16x12.
5. Fully-Connected Layer: This layer is a regular neural network layer

that takes input from the previous layer and computes the class
scores and outputs the 1-D array of size equal to the number of
classes.
Q17: What is pooling, padding, filtering operations on CNN?

Pooling Layer
It is commonly used to periodically insert a Pooling layer in-between
successive Conv layers in a ConvNet architecture. Its function is to
progressively reduce the spatial size of the representation to reduce the
number of parameters and computation in the network, and hence to
also
control overfitting. The Pooling Layer operates independently on every
depth slice of the input and resizes it spatially, using the MAX operation.
The most common form is a pooling layer with filters of size 2x2 applied
with a stride of 2 downsamples every depth slice in the input by two along
both width and height, discarding 75% of the activations. Every MAX
operation would, in this case, be taking a max over four numbers (little 2x2
region in some depth slice). The depth dimension remains unchanged.
Q18: What is the Evolution technique of CNN?

It all started with LeNet in 1998 and eventually, after nearly 15 years, lead
to groundbreaking models winning the ImageNet Large Scale Visual
Recognition Challenge which includes AlexNet in 2012 to Google Net in
2014 to ResNet in 2015 to an ensemble of previous models in 2016. In the
last two years, no significant progress has been made, and the new models
are an ensemble of previous groundbreaking models.
LeNet in 1998
LeNet is a 7-level convolutional network by LeCun in 1998 that classifies
digits and used by several banks to recognise the hand-written numbers
on cheques digitised in 32x32 pixel greyscale input images.
AlexNet in 2012
AlexNet: It is considered to be the first paper/ model, which rose the
interest in CNNs when it won the ImageNet challenge in the year 2012.
It is a deep CNN trained on ImageNet and outperformed all the entries
that year.
VGG in 2014
VGG was submitted in the year 2013, and it became a runner up in
the ImageNet contest in 2014. It is widely used as a simple
architecture compared to AlexNet.
GoogleNet in 2014
In 2014, several great models were developed like VGG, but the winner of
the ImageNet contest was GoogleNet.
GoogLeNet proposed a module called the inception modules that includes
skipping connections in the network, forming a mini-module, and this
module is repeated throughout the network.
ResNet in 2015
There are 152 layers in the Microsoft ResNet. The authors showed
empirically that if you keep on adding layers, the error rate should keep on
decreasing in contrast to “plain nets” we're adding a few layers resulted in
higher training and test errors.
Q19: How to initialise biases in deep learning?
It is possible and common to initialise the biases to be zero since the
random numbers in the weights provide the asymmetry braking. For ReLU
non-linearities, some people like to use small constant value such as 0.01
for all biases because this ensures that all ReLU units fire in the beginning,
therefore obtain, and propagate some gradient. However, it is unclear if this
provides a consistent improvement (in fact some results seem to indicates
that this performs worst) and it is more commonly used to use 0 bias
initialisation.
Q20: What is learning Rate?
Learning Rate
The learning rate controls how much we should adjust the weights
concerning the loss gradient. Learning rates are randomly initialised.
Lower the values of the learning rate slower will be the convergence to
global minima.
Higher values for the learning rate will not allow the gradient descent to
converge
Since our goal is to minimise the function cost to find the optimised
value for weights, we run multiples iteration with different weights and
calculate the cost to arrive at a minimum cost
--- -- --- -- --- -- -- --- -- -- --- -- -- --- -- --- -- -- --- -- -- -- --- -- --- -- --- -- -- --- -
- -- --- --
DATA SCIENCE
INTERVIEW
PREPARATION
Preparation)
# Day-5
Data Science Interview Questions Page 1

Q1: What are Epochs?
OneEpoch is anENTIRE datasetis passedforwards andbackwards through theneural network.
Since oneepochis toolarge tofeedtothecomputeratonce, wedivide it into several smaller batches.
WealwaysusemorethanoneEpoch because oneepoch leadstounderfitting.
As thenumberof epochsincreases,severaltimestheweightarechangedin theneuralnetworkandthe

curvegoes fromunderfitting uptooptimal tooverfitting curve.
Q2. What is the batch size?

Batch Size
The total numberof training andexamples present in asingle batch.

Unlike the learning rate hyperparameter where its value doesn’t affect computational time, the batch
sizes mustbeexaminedin conjunctions with theexecution time of training. The batch size is limited by
hardware’s memory, while the learning rate is not. Leslie recommends using a batch size that fits in
hardware’s memoryandenables using largerlearningrate.
If ourserverhasmultiple GPUs, thetotal batch size is thebatchsize onaGPU multiplied by thenumbers
ofGPU. If thearchitecturesaresmall oryourhardwarepermitsverylargebatchsizes, thenyoumight
comparetheperformanceofdifferentbatchsizes. Also, recall thatsmallbatchsizes addregularization
whilelargebatchsizes addless, soutilize thiswhilebalancingtheproperamountof regularization. It is
oftenbettertouselargebatchsizes soalargerlearningratecanbeused.
Q3: What is dropout in Neural network?

Dropout refers to ignoring units during the training phase of a certain set of neurons which is chosen
randomly. These units are notconsidered during theparticular forwardor backward pass.
Moretechnically, ateachtrainingstage,individual nodesareeitherdroppedoutofthenetwith
epdrogbeasbilty1-p orkept withprobability p, sothata reduced networkis left; incomingand outgoing
toadropped-outnodearealsoremoved.

WeneedDropouttopreventover-fitting
A dropoutis anapproachtoregularization in neuralnetworkswhichhelpstoreduceinterdependent
learningamongsttheneurons.
Wheretouse
Dropout is implemented per-layer in aneuralnetwork.

It canbeusedwithmosttypesof layers, suchasdensefully connectedlayers, convolutionallayers, and
recurrent layers suchasthelong short-termmemorynetworklayer.
Dropout maybeimplementedonany orall hidden layers in thenetwork as well as thevisible orinput
layer. It is notusedontheoutput layer.
Benefits:-
1. Dropoutforcesaneuralnetworktolearnmorerobustfeaturesthatare very useful in conjunction with

different randomsubsets oftheotherneurons.
2. Dropoutgenerally doublesthenumberof iterations requiredtoconverge.However,thetraining time
foreachepochis less.
Q4: List down hyperparameter tuning in deep learning.

The process of setting the hyper-parameters requires expertise and extensive trial and error. There are
nosimpleandeasy ways toset hyper-parameters — specifically,learningrate, batchsize,momentum, and
weight decay.
Approachestosearching for thebestconfiguration:
•GridSearch
•RandomSearch

Approach
1. Observeandunderstandtheclues available duringtraining bymonitoringvalidation/test loss
early in training, tuneyour architecture andhyper-parameters with short runs of afew
epochs.
2. Signs of underfitting oroverfitting of thetestorvalidation loss early in thetraining process are

usefulfortuning thehyper-parameters.
Toolsfor Optimizing Hyperparameters
•SageMaker
•Comet.ml
•Weights &Biases
•DeepCognition
•AzureML
Q5: What do you understand by activation function and error

functions?
Error functions
In mostlearning networks,an erroris calculated as the difference between the predicted
outputandthe actualoutput.
Thefunctionthatis usedtocomputethis error is known as Loss Function J(.). Different loss functions
will give different errors for thesameprediction, andthus have aconsiderable effect ontheperformance
of themodel.Oneof themostwidely usedloss function is meansquare error, which calculates thesquare
of thedifferencebetweentheactualvaluesandpredictedvalue. Different loss functions areusedtodeals
withadifferenttypeof tasks, i.e. regressionandclassification.

Regressive loss functions:
MeanSquareError
Absolute error
Smooth Absolute Error
Classification loss functions:
1.Binary Cross-Entropy
2. Negative Log-Likelihood
3. MarginClassifier
4.Soft Margin Classifier
Activation functionsdecide whetheraneuronshouldbeactivated ornotbycalculating aweighted
sum andaddingbiaswithit.Thepurposeoftheactivationfunctionistointroducenon-linearityintotheoutput ofa
neuron.
In a neural network, we would update the weights and biases of the neurons based on the
error at the outputs. This process is known as back-propagation. Activation function makes the back-
propagation possible since the gradients are supplied alongwith the errors toupdate the weights and
biases.
Q6: Why do we need Non-linear activation functions?
A neural network without activation functions is essentially a linear regression

model.Theactivation functionsdothenon-lineartransformationtotheinput,makingitcapableoflearning
andperforming morecomplextasks.
1. Identity
2.Binary Step
3. Sigmoid
4. Tanh
5.ReLU
6.Leaky ReLU
7. Softmax
The activation functionsdothenon-lineartransformationtotheinput,makingit capableof learning and
performing morecomplex tasks.
Q7: What do you under by vanishing gradient problem and how can
Do we solve that?
Theproblem:
As morelayers using certain activation function areaddedto neural networks, thegradients of theloss
functionapproachzero, making thenetworks tougher totrain.
Why:
Certain activation functions, like thesigmoid function, squishes a large input space into a small input
space between0 and1. Therefore, alarge change in theinput of thesigmoid function will cause a small
change in theoutput.Hence,thederivativebecomessmall.
For shallownetworkswithonly afewlayers thatusetheseactivations, this isn’t abig problem. However,

whenmorelayers areused, it cancause thegradient tobetoo small fortraining toworkeffectively.
However,whennhiddenlayers useanactivationlike thesigmoidfunction, nsmall derivativesare
multiplied together. Thus, thegradient decreases exponentiallyaswepropagate downtotheinitial layers.

Solutions:
Thesimplest solutionistouseotheractivationfunctions, suchasReLU,whichdoesn’tcauseasmall
derivative.
Residualnetworksareanothersolution,astheyprovideresidualconnectionsstraighttoearlierlayers.
Finally, batch normalization layers can also resolve the issue.
Q8: What is Transfer learning in deep learning ?

Transferlearning:Itisamachinelearningmethodwhereamodelisdevelopedforthetaskisagainused
asthestartingpointforamodelonasecondtask.
It is a popular approach in deep learning where pre-trained models are used as the starting
point on computer vision and natural language processing tasks given the vast compute and time
resources requiredtodevelopneuralnetworkmodelsontheseproblems
Transferlearningisamachinelearningtechniquewhereamodeltrainedononetaskisre-purposedon
asecondrelatedtask.
Transferlearningisanoptimizationthatallowsrapidprogressorimprovedperformancewhen
modelling
thesecondtask.
Transferlearningonlyworksindeeplearningif themodelfeatureslearnedfromthefirsttaskare
general.

Q9: What is VGG16 and explain the architecture of VGG16?
VGG-16 is a simpler architecture model since it’s not using many hyperparameters. It always uses 3 x 3
filters with the stride of 1 in convolution layer and uses SAME padding in pooling layers 2 x 2 with a
stride of2.
This architectureis fromtheVGG group,Oxford. It improvesAlexNet byreplacingthelarge kernel-

sizedfilter withmultiple3X3 kernel-sizedfilters oneafteranother.Withagivenreceptivefield(the
effective areasize of input image onwhichoutputdepends), multiple stacked smaller size kernel is better
thantheonewithalargersize kernelbecausemultiplenon-linear layersincreasesthedepthofthe
networkwhichenables it tolearn morecomplex features,andthattooatalowercost.
Three fully connected layers follow the VGG convolutional layers. The width of the networks starts at
thesmall value of 64 and increases by afactor of 2 after every sub-sampling/pooling layer. It achieves
thetop-5accuracy of92.3% onImageNet.
Q10: What is RESNET?

The winner of ILSRVC 2015, it also called as Residual Neural Network (ResNet) by Kaiming. This
architecture introduced a concept called “skip connections”. Typically, the input matrix calculates in
two linear transformations with ReLU activation function. In Residual network, it directly copies the
input matrix tothesecond transformation output andsums theoutput in final ReLU function.

Skip
Connection
Experimentsin paperfourcanjudgethepowerof theresidual network. The plain 34 layer network had

highvalidationerrorthanthe18layersplain network.This is wherewerealize thedegradation problems.
Andthesame34layersnetworkwhenconverted totheresidual network has much less training error
thanthe 18layers residual network.
Q11: What is I mageNet?

ImageNetis aproject aimed at(manually) labelling andcategorizing images into almost 22,000
separateobjectcategories for computer visionresearches.
Whenweheartheabout“ImageNet”in thecontextof deeplearning andConvolutional Neural Network,
wearereferringtoImageNetLarge Scale Visual RecognitionChallenge.
Themainaimof this imageclassification challengeis totrainthemodelthatcan correctly classify an
input image intothe1,000 separateobjects category.

Models are trained onthe~1.2 million training images with another 50,000 images for validation and
100,000 imagesfortesting.
These1,000imagecategoriesrepresentobjectclasses thatweencounterin ourday-to-daylives, suchas
speciesofdogs, cats,varioushouseholdobjects, vehicle types, andmuchmore.
Whenit comes totheimage classification, theImageNet challenge is the“de facto “ benchmark for
computervision classificationalgorithms —andtheleaderboardforthis challenge has
beendominatedby Convolutional Neural NetworksandDeeplearning techniques since 2012.
Q12: What is DarkNet?

DarkNet is a framework used to train neural networks; it is open source and written in C/CUDA and
serves as the basis for YOLO. Darknet is also used as the framework for training YOLO, meaning it
sets thearchitecture of thenetwork.
Clone therepolocally, andyouhaveit. To compileit, runamake.But first, if youintendtousethe
GPU
ucaspagabeilty,youneedtoedit theMakefile in thefirst twolines, whereyoutell it tocompileforGPU
Qwi1th3C:UWDhAadtriivserYs.OLOand explain the architecture of YOLO (you only
Look Once).One use case?

YOLO v1
The first YOLO You only look once (YOLO) version came about May 2016 and sets the core of the
algorithm, thefollowing versionsareimprovementsthatfix somedrawbacks.

In short, YOLO is a network “inspired by” Google Net. It has 24 convolutional layers working as the
feature extractors and two dense layers for making the predictions. The architecture works upon is called
Darknet, aneuralnetworkframeworkcreated by thefirst authorof theYOLO paper.
CoreConcept:-
Thealgorithmworksoff bydividing theimageintothegrid of thecells, for each cell bounding boxes
andtheirscoresarepredicted,alongsideclass probabilities. Theconfidenceis given in termsof IOU
(intersection over union), metric, which is measuring howmuchthedetected object overlaps with the
groundtruthasafractionof thetotal areaspanned by thetwotogether (the union).
YOLO v2-
This improves onsomeof theshortcomings of thefirst version, namely thefact that it is notvery good
atdetecting objects thatarevery nearandtendstomakesomeof themistakes onlocalization.
It introduces a few newer things: Which are anchor boxes (pre-determined sets of boxes such that the
network moves from predicting the bounding boxes to predicting the offsets from these) and the use of
features thataremorefine-grainedsosmaller objects canbepredicted better.
YOLO v3-
YOLOv3 came about April 2018, and it adds small improvements, including the fact that
boundingboxes getpredictedatthedifferentscales. TheunderlyingmeatypartoftheYOLOnetwork,Darknet,
is expanded in this version to have 53 convolutional layers

DATA SCI ENCE
INTERVIEW
PREPARATI ON
(30 Days of I nterview
Preparation)
# DAY 06
Q1. What is NLP?
Natural language processing (NLP): It is the branch of artificial intelligence that helps computers
understand, interpret and manipulate human language. NLP draws from many disciplines,
including computer science and computational linguistics, in its pursuit to fill the gap between
humancommunication andcomputerunderstanding.
Q2. What are the Libraries we used for NLP?

WeusuallyusetheselibrariesinNLP,whichare:
NLTK (Natural language Tool kit), TextBlob,
CoreNLP, Polyglot, Gensim, SpaCy, Scikit-learn
And the new one is Megatron library launched recently.
Q3. What do you understand by tokenisation?

Tokenisation is the act of breaking a sequence of strings into pieces such as words, keywords,
phrases, symbols and other elements called tokens. Tokens can be individual words, phrases or even
wholesentences.Intheprocessoftokenisation,somecharacterslikepunctuationmarksare discarded.
Q4. What do you understand by stemming?
Stemming:It is theprocessofreducinginflexions in wordstotheirrootformssuchasmappinga
group of words tothesamestemeven if stemitself is notavalid wordin theLanguage.
Q5. What is lemmatisation?

Lemmatisation: It is the process of the group together the different inflected forms of the word so
thattheycanbeanalysedasasingleitem.Itisquitesimilartostemming,butitbringscontextto thewords.So
it links words with similar kind meaning to one word.
Q6. What is Bag-of-words model?

We need the way to represent text data for the machine learning algorithms, and the bag-of-words
model helps us to achieve the task. This model is very understandable and to implement. It is the
way of extracting features from thetext for theuse in machine learning algorithms.
In this approach, we use the tokenised words for each of observation and find out the frequency of
eachtoken.
Let’s doanexampletounderstandthis conceptindepth.
“Itis going toraintoday.”
“Today, I amnotgoing outside.”
“I amgoing towatchtheseasonpremiere.”
Wetreateachsentenceastheseparatedocumentandwemakethelist ofall wordsfromall thethree
documentsexcluding thepunctuation. Weget,
‘It’, ’is’, ’going’, ‘to’, ‘rain’, ‘today’ ‘I’, ‘am’, ‘not’, ‘outside’, ‘watch’, ‘the’, ‘season’, ‘premiere.’
The next step is thecreate vectors. Vectors convert text thatcan beusedby themachine learning
algorithm.
Wetakethefirst document—“It is going toraintoday”, and wecheck thefrequency of words
from
thetenuniquewords.
“It” = 1
“is” = 1
“going” = 1
“to” = 1
“rain” = 1
“today” = 1
“I” = 0
“am” = 0
“not” = 0
“outside” = 0
Restof thedocuments will be:
“Itis going toraintoday” = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
“Today I amnotgoing outside” = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]
“I amgoing towatchtheseasonpremiere” = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0]
In this approach, eachword(a token)is called a“gram”. Creating thevocabulary oftwo-word
pairs
is calledabigrammodel.
The processof converting theNLP text into numbersis called vectorisation in ML. There are
different
ways toconverttext intothevectors :
•Counting thenumber of times that eachwordappears inthedocument.
•I amcalculating thefrequencythateachwordappears in adocumentoutofall thewordsin
thedocument.
Q7.What do you understand by TF-I DF?
TF-IDF: It stands forthetermof frequency-inversedocumentfrequency.
TF-IDF weight: It is astatistical measureusedtoevaluate howimportantawordis toadocumentin
a collection or corpus. The importance increases proportionally to the numberof times a word appears
in thedocumentbutis offset by thefrequency ofthewordin thecorpus.
•Term Frequency (TF): is a scoring of thefrequency of theword in the current document.

Since every document is different in length, it is possible that a term would appear much
moretimes in long documentsthanshorterones. The termfrequency is oftendivided by the
documentlengthtonormalise.
• Inverse Document Frequency(IDF): It is a scoring of how rare the word is

acrossthe documents.Itisameasureofhowrareatermis,Rarertheterm,andmoreis
theIDFscore.
Thus,
Q8. What is Word2vec?
Word2Vecisashallow, two-layerneuralnetworkwhichistrainedtoreconstructlinguistic
contextsofwords.Ittakesasitsinputalargecorpusofwordsandproducesavectorspace,
typically of several of hundred dimensions, with each of unique wordin the corpus
being assignedtothecorrespondingvectorinspace.
Wordvectorsarepositionedinavectorspacesuchthatwordswhichsharecommon
contextsinthecorpusarelocatedclosetooneanotherinthespace.
Word2Vecisaparticularly computationally-efficientpredictivemodelforlearningword
embeddingsfromrawtext.
Word2Vecisagroupofmodelswhichhelpsderive relations betweenawordandits
contextualwords.Let’s lookattwoimportantmodelsinside Word2Vec:Skip-grams and
CBOW.
Skip-grams
In Skip-gram model, wetake acentre wordandawindow of context (neighbour) words, and

wetrytopredictthecontextofwordsouttosomewindow size for each centre word. So, our
modelis going todefineaprobabilitydistribution,i.e. probabilityofawordappearingin the
contextgiven acentrewordandwearegoing tochooseourvector representations tomaximise
theprobability.
ContinuousBag-of-Words(CBOW)
CBOW predictstargetwords(e.g. ‘mat’) fromthesurrounding context words(‘the cat sits
onthe’).
Statistically, it affectsthatCBOW smoothes overa lot of distributional information (by
treatinganentirecontextasoneobservation).For themostpart,this turnsouttobeauseful
thing forsmaller datasets.
This wasaboutconvertingwordsintovectors.But wheredoesthe“learning” happen?
Essentially, webeginwithsmall randominitialisation of wordvectors.Ourpredictivemodel
learnsthevectorsbyminimising theloss function.In Word2vec,this happenswithfeed-
forwardneuralnetworksandoptimisationtechniquessuchasStochastic gradientdescent.
There arealso count-based models which maketheco-occurrence count matrix of thewords
in ourcorpus;wehaveavery large matrixwitheachrowfor the“words”andcolumnsfor the
“context”.Thenumberof“contexts” is, ofcourseverylarge, sinceit is veryessentially
combinatorialin size. To overcomethisissue,weapplySVD toamatrix.This reducesthe
dimensionsofthematrix toretainmaximumpieces of information.
Q9. What is Doc2vec?
Paragraph Vector(morepopularlyknownasDoc2Vec) —DistributedMemory(PV-DM)
Paragraph Vector(Doc2Vec) is supposedtobeanextension toWord2Vec such

thatWord2Veclearns toproject wordsinto alatentd-dimensionalspacewhereasDoc2Vec
aimsatlearning howtoproject adocumentinto alatent d-dimensional space.
The basic idea behindPV-DM is inspiredby Word2Vec. In CBOW modelof Word2Vec, the
model learns topredict acentre wordbasedonthecontexts. For example- given asentence
“The catsatonthetable”, CBOW modelwouldlearntopredictthewords “sat” given the
contextwords—thecat, onandtable.Similarly,in PV-DM themainideais: randomlysample
consecutivewordsfrom theparagraph andpredict acentre wordfrom therandomly sampled
setofwordsby taking astheinput—thecontextwordsandtheparagraphid.
Let’s have alook at themodel diagram for somemoreclarity. In this given model, wesee
Paragraph matrix, (Average/Concatenate) andclassifier sections.
Paragraph matrix: It is the matrix where each column represents the vector of a paragraph.
Average/Concatenate: It means that whether the word vectors and paragraph vector are
averagedorconcatenated.
Classifier: In this, it takes the hidden layer vector (the one that was concatenated/averaged) as
input andpredicts theCentre word.
In the matrix D, It has the embeddings for “seen” paragraphs (i.e. arbitrary length
documents),thesamewayWord2Vecmodelslearnsembeddingsforwords.Forunseen
paragraphs,themodelisagainrunthroughgradientdescent(5orsoiterations)toinfera
documentvector.
Q9. What is Time-Series forecasting?
Time series forecasting is a technique for the prediction of events through a sequence of
time. The technique is used across many fields of study, from the geology to behaviour to
economics.Thetechniquespredictfutureeventsbyanalysingthetrendsofthepast,onthe assumptionthat
futuretrendswillholdsimilartohistoricaltrends.
Q10. What is the difference between in Time series

and regression?
Time-series:
1. Wheneverdata isrecordedatregularintervalsoftime.
2. Time-seriesforecastisExtrapolation.
3. Time-seriesreferstoan orderedseriesofdata.
Regression:
1. Whereasinregression,whether data isrecordedatregularorirregularintervalsoftime, wecan
apply.
2.Regression is Interpolation.
3. Regressionreferbothorderedandunorderedseriesofdata.
Q11. What is the difference between stationery and non-
stationary data?
Stationary: A series is said tobe"STRICTLY STATIONARY” if theMean, Variance &
Covarianceis constantoversometimeortime-invariant.
Non-Stationary:
A seriesis said tobe"STRICTLY STATIONARY” if theMean,Variance & Covarianceis

notconstantover sometimeor time-invariant.
Q12. Why you cannot take non-stationary data to solve time
series Problem?
o Mostmodelsassumestationaryofdata.Inotherwords,standardtechniquesare invalidifdatais
"NON-STATIONARY".
Autoocorrelationmayresultdueto"NON-STATIONARY".
o Non-stationaryprocessesarearandomwalkwithorwithoutadrift(aslow,steady
change).
o Deterministic trends (trends that are constant, positive or negative, independent of
timeforthewholelifeof theseries).
DATA SCIENCE
INTERVIEW
PREPARATION
Preparation)
# DAY 07
Q1. What is the process to make data stationery from non-
stationary in time series?
Ans:
The twomostcommonwaystomakeanon-stationary timeseriesstationaryare:
Differencing
Transforming
Let uslook atsomedetailsfor eachof them:
Differencing:
To makeyour series stationary, you take adifference betweenthedata points. So let us say, your
original timeserieswas:
X1, X2, X3,...........Xn
Your series withadifferenceof degree1becomes:
(X2- X1, X3- X2, X4- X3,.......Xn- X(n-1)
Once, you make the difference, plot theseries andsee if there is any improvement in theACF curve.
If not, you can try a second or even a third-order differencing. Remember, the more you difference,
themorecomplicated your analysis is becoming.
Transforming:
If we cannot make a time series stationary, you can try out transforming the
variables. Log transform isprobably themostcommonlyusedtransformation ifweseethe
divergingtimeseries.
However,itissuggestedthatyouusetransformationonlyincasedifferencingisnotworking.
Q2. What is the process to check stationary data ?
Ans:
Stationaryseries:It is onein whichtheproperties–mean,varianceandcovariance,donotvary

withtime.
Let us get an idea with these three plots:
In the first plot, we can see that the mean varies (increases) with time, which
resultsinan upwardtrend.Thisisthenon-stationaryseries.
Fortheseriesclassification asstationary, itshouldnotexhibit thetrend.
Movingontothesecondplot,wedonotseeatrendin theseries,butthevarianceof theseries
is a function of time. As mentioned previously, a stationary series must have a
constant variance.
If welook atthethirdplot,thespreadbecomescloser, asthetimeincreases,whichimplies that
covariance is afunction of time.
These threeplots refer tothenon-stationary timeseries. Nowgive your attention tofourth:
In this case,Mean,Variance andCovariance areconstantwithtime.This is howastationarytime

serieslooks like.
Mostof thestatistical modelsrequiretheseriestobestationarytomakeaneffectiveandprecise
prediction.
The various processyoucan usetofind outyour datais stationaryornotby thefollowing terms:

1.Visual Test
2.Statistical Test
3.ADF(AugmentedDickey-Fuller) Test
4.KPSS(Kwiatkowski-Phillips-Schmidt-Shin) Test
Q3. What are ACF and PACF?.

Ans:
ACF is a(complete)auto-correlationfunctionwhich gives us thevalues of theauto-correlation of
anyseries withlaggedvalues. Weplotthesevalues alongwithaconfidenceband.Wehave anACF
plot.In simple terms, it describes howwell thepresent value of theseries is related toits past
values.A timeseriescanhavecomponentslike thetrend,seasonality,cyclic andresidual.ACF
considersall thecomponentswhile finding correlations; hence, it’s a‘complete auto-correlation
plot’.
PACF is apartialautocorrelationfunction.Insteadof finding correlationsof presentwithlags like
ACF, it finds thecorrelations of theresidualswiththenext lag value thus ‘partial’ andnot
‘complete’asweremovealready found variations before wefind next correlation. So if thereare any
hiddenpiecesof informationin theresidual whichcanbemodelledby nextlag, wemightget a good
correlation, andwe’ll keepthatnext lag asafeaturewhile modelling. Remember, while
modellingwedon’twanttokeeptoomanycorrelatedfeatures,asthatit cancreatemulticollinearity
issues. Hence weneedtoretain only relevant features.
Q4. What do you understand by the trend of data?

Ans:
A generalsystematiclinear or(mostoften)nonlinear component that changes over time and does
notrepeat.
Therearedifferentapproachestounderstandingtrend.A positive trendmeansit is likely that
growthcontinues. Let's illustrate this withasimple example:
Hmm,thislookslikethereisatrend.Tobuildupconfidence,let'saddalinearregressionforthis graph:
Great, nowit’s clear theirs atrendin thegraph by adding Linear Regression.

Q5. What is the Augmented Dickey-Fuller
Test? Ans:
TheDickey-Fullertest:It is oneofthemostpopularstatisticaltests.It is usedtodeterminethe
presenceofunitrootinaseries,andhencehelpustounderstandiftheseriesisstationaryornot. Thenull
andalternatehypothesisforthistestis:
Null Hypothesis: The series has a unit root (value of a
=1) Alternate Hypothesis: The series has no unit root.
If we fail to reject the null hypothesis, we can say that the series is non-
stationary.This meansthat theseriescanbelinearordifferencestationary.
Q6. What is AIC and BIC into time series?

Ans:
Akaike’s information criterion (AIC) compares the quality of a set of statistical models to
each other.Forexample,youmightbeinterestedinwhatvariablescontributetolowsocioeconomic status
andhowthevariablescontributetothatstatus.Let’ssayyoucreateseveralregressionmodels forvariousfactors
like education, family size, or disability status; The AIC will take each model and rank them
frombesttoworst.The“best”modelwillbetheonethatneitherunder-fitsnorover- fits.
TheBayesianInformationCriterion(BIC)canbedefinedas:
klog(n)-2log(L(θ̂̂ )).
Here n is the sample size.
K is the number of parameters which your model estimates.
θ is the set of all parameter.
L (θ̂̂ ) represents the likelihood of the model tested, when evaluated at
maximumlikelihoodvalues ofθ.
Q7. What are the components of the Time -Series?

Ans:
Time seriesanalysis: It providesabodyof techniquestounderstandadataset better.The
most usefuloneisthedecompositionofthetimeseriesintofourconstituentparts-
1.Level- The baseline value for the series if it were a straight line.
2. Trend-Theoptionalandlinear,increasingordecreasingbehaviourofseriesovertime.
3. Seasonality-Optionalrepeatedpatterns/cyclesofbehaviourovertime.
4. Noise-Theoptionalvariabilityintheobservationsthatcannotbeexplainedbythemodel.
Q8. What is Time Series Analysis?

Ans:
Time seriesanalysis: It involves developingmodelsthatbestcapture ordescribeanobserved
time seriestounderstandtheunderlyingcause.Thisstudyseeksthe“why”behindthetime-series
datasets.Thisinvolvesmakingassumptionsabouttheformofdataanddecomposingtime-series intothe
constitutioncomponent.
Quality ofdescriptive modelis determinedbyhowwell it describesall availabledataand

the interpretationitprovidestoinformtheproblemdomainbetter.
Q9. Give some examples of the Time-Series forecast?
Ans:
Thereis almostanendlesssupplyof thetimeseriesforecastingproblems.Below aretenexamples
fromarange of industries tomakethenotions of time series analysis andforecasting more
concrete.
1. Forecastingthecornyield in tonsby thestate eachyear.
2. ForecastingwhetheranEEG trace in seconds indicates apatient is having aseizure ornot.
3. Forecastingtheclosing price of stockseveryday.
4. Forecastingthebirth ratesatall hospitalsin thecity everyyear.
5. Forecastingproduct sales in theunits sold eachday for thestore.
6. Forecastingthenumberof passengers through thetrain station eachday.
7. Forecastingunemploymentfor astateeachquarter.
8. Forecastingtheutilisation demandontheserver every hour.
9. Forecastingthesize of therabbit populations in thestate eachbreeding season.
10. Forecastingtheaverage price of gasolinein acity eachday.
Q10. What are the techniques of Forecasting?

Ans:
Therearesomanystatistical techniquesavailable fortimeseriesforecast however wehave found a few
effective oneswhich arelisted below:
SimpleMoving Average(SMA)
Exponential Smoothing (SES)
Autoregressive Integration MovingAverage(ARIMA)
Q11. What is the Moving Average?

Ans:
The moving average modelis probably themostnaive approachtotime series modelling. This
modelstatesthatthenext observationis themeanof all pastobservations.
Althoughsimple, this modelmight besurprisingly good, andit representsagoodstarting point.
Otherwise,themovingaveragecanbeusedtoidentify interestingtrendsin thedata.Wecandefine

tarewnidnsd.owtoapply themoving average modeltosmooththetimeseriesandhighlight different
Example of a moving average on a 24h window
In the plot above, we applied the moving average model to a 24h window. The
green linesmoothedthetimeseries,andwecanseethattherearetwopeaksinthe24hperiod.
Thelongerthewindow,thesmootherthetrendwillbe.
Below is an example of moving average on a smaller window.
Exampleof amoving averageona12hwindow

Q12. What is Exponential smoothing?
Ans:
Exponential smoothingusessimilar logic tomovingaverage,butthis time,differentdecreasing
weightis assignedtoeachobservation.Wecanalsosay, less importanceis given tothe
observations as wemovefurther fromthepresent.
Mathematically, exponential smoothing is expressed as:
Here,alphais thesmoothingfactorwhichtakesvalues between0to1. It determineshowfast the

weight will decreasefortheprevious observations.
Fromtheaboveplot,thedarkbluelinerepresentstheexponentialsmoothingofthetimeseries
using a smoothing factor of 0.3, and the orange line uses a smoothing factor of 0.05. As we can
see, thesmallerthesmoothingfactor,thesmootherthetimeserieswillbe.Becauseassmoothingfactor
approaches0,weapproachtothemovingaveragemodel
DATA
SCIENCE
INTERVIEW
PREPARATI ON
Preparation)
# DAY 08
Page 1 | 7
Q1. What is Tensorflow?
Ans:
TensorFlow: TensorFlow is an open-source software library released in 2015 by Google to make it
easier for the developers to design, build, and train deep learning models. TensorFlow is originated
as an internal library that the Google developers used to build the models in house, and we expect
additional functionality to be added in the open-source version as they are tested and vetted in
internalflavour.AlthoughTensorFlowis theonlyoneofseveraloptionsavailabletothedevelopers and
wechoosetouseit herebecauseof thoughtful design andeaseof use.
At ahigh level, TensorFlow is aPython library thatallows users toexpress arbitrary computationas
agraph of data flows.Nodes in this graph represent mathematical operations, whereasedges
represent data thatis communicated fromonenodetoanother. Data in TensorFlow arerepresented
astensors,whicharemultidimensionalarrays.Although this framework for thinking about
computation is valuable in manydifferent fields, TensorFlow is primarily usedfor deeplearning in
practiceand research.
Q2. What are Tensors?

Ans:
Tensor: In mathematics, it is an algebraic object that describes the linear mapping from one
set of algebraicobjectsto the another. Objects that the tensors may mapbetweeninclude, but are not
limitedtothevectors,scalarsandrecursively,evenothertensors(forexample,amatrixisthemap
betweenvectorsandthusatensor.Thereforethelinearmapbetweenmatricesisalsothetensor). Tensorsare
inherentlyrelatedtothevectorspacesandtheirdualspacesandcantakeseveral
differentforms. For
Page 2 | 7
example, a scalar, avector, a dual vector ata point, oramulti-linear mapbetweenvector
spaces. Euclidean vectorsandscalars aresimple tensors.While tensorsaredefined asindependentof
anybasis. Theliterature onphysics, oftenreferredbytheircomponentsonabasis relatedtoa
particularcoordinate system.
Q3. What is TensorBoard?

Ans:
TensorBoard, asuitofvisualisingtools,isaneasysolutiontoTensorflowofferedbythecreators thatletsyou
visualise the graphs, plot quantitative metrics about the graph with additional data like
imagestopassthroughit.
This one is some example of how the TensorBoardis

working.
Page 3 | 7
Q4. What are the features of TensorFlow?
Ans:
• Oneof themainfeatures of TensorFlowis its ability tobuild neuralnetworks.

• By using theseneuralnetworks,machinescanperformlogical thinking andlearnsimilar to
humans.
• Therearetheothertensorsfor processing,suchasdataloading, preprocessing,calculation,
stateandoutputs.
• It considered notonly as deeplearning butalso as thelibrary for performingthetensor
calculations, andit is themostexcellentlibrary whenconsideredasthedeeplearning
frameworkthatcanalso describe basic calculation processing.
• TensorFlow describesall calculation processes by calculation graph, nomatterhowsimple
thecalculation is.
Q5. What are the advantages of TensorFlow?

Ans:
• It allows Deep
• Learning. It is
• open-sourceandfree.
• It is reliable (and without major
• bugs)
• It is backed by Google and a good
community. Itisaskillrecognisedbymany
employers.
Q6. List a few limitations of Tensorflow.
It is easy to implement.
Ans :
•Has theGPU memoryconflicts withTheano if importedin thesamescope.

•Ithas dependencies withother libraries.
• Requirespriorknowledgeof theadvancedcalculus andlinear algebraalongwiththepretty
goodunderstanding of machinelearning.
Page 4 | 7
Q7. What are the use cases of Tensor flow?
Ans:
Tensorflowis animportanttoolof deeplearning, it hasmainly five usecases, andthey are:
•TimeSeries
•Imagerecognition
•SoundRecognition
•Videodetection
•Text-basedApplications
Q8. What are the very important steps of Tensorflow architecture? Ans:
There arethree mainsteps in theTensorflowarchitectureare:
•Pre-process theData
•BuildaModel
•Trainandestimatethemodel
Page 5 | 7
Q9. What is
Keras? Ans:
Keras: It is an Open Source Neural Network library written in Python that runs on the top
of TheanoorTensorflow.Itisdesignedtobethemodular,fastandeasytouse.Itwasdevelopedby
FrançoisChollet, aGoogleengineer.
Q10. What is a pooling layer?

Ans:
Poolinglayer: It is generallyusedinreducingthespatialdimensionsandnotdepth,ona
convolutional neural network model.
Q11. What is the difference between CNN and RNN?

Ans:
CNN (Convolutional Neural Network)
•Bestsuited forspatial datalike images

•CNN is powerful comparedtoRNN
•This network takes afixedtypeof inputs andoutputs
•Thesearetheideal forvideoandimageprocessing
Page 6 | 7
RNN (Recurrent Neural Network)
•Best suited for sequential data
•RNN supports less feature set than CNN.
•This networkcan manage the arbitrary input and output lengths.
•It is ideal for text and speech analysis.
Q12. What are the benefits of Tensorflow over other libraries?

Ans:
The following benefits are:
•Scalability
•Visualisation of Data
•Debugging facility
•Pipelining
Page 7 | 7
DATA SCI ENCE
INTERVIEW
PREPARATI ON
Preparation)
# DAY 09
Page 1 | 9
Q1: How would you define Machine
Learning? Ans:
Machinelearning:It isanapplicationofartificialintelligence(AI)thatprovidessystems theability
to learn automatically and to improve from experiences without being programmed. It focuses on the
developmentofcomputerapplicationsthatcanaccessthedataandusedittolearnforthemselves.
Theprocessoflearningstartswiththeobservationsordata,suchasexamples,directexperience,or
instruction, to look for the patterns in data and to make better decisions in the future based
on
examples that we provide. The primary aim is to allow the computers to learn automatically without
humaninterventionorassistanceandadjustactionsaccordingly.
Q2. What is a labeled training set?

Ans:
Machine learning is derived from the availability of the labeled data in the form of a training
set and test set that is used by the learning algorithm. The separation of data into the training
portion and a test portion is the way the algorithm learns. We split up the data containing
known response variable values into two pieces. The training set is used to train the algorithm,
and then you use the trained model on the test set to predict the variable response values that
are already known. The final step is to compare with the predicted responses against actual
(observed) responses to see how close they are. The difference is the test error metric.
Depending on the test error, you can go back to refine the model and repeat the process until
you’re satisfied with the accuracy.
Page 2 | 9
Q3. What are the two common supervised tasks?
Ans:
The twocommonsupervisedtasks areregressionandclassification.
Regression-
The regressionproblemis whentheoutputvariable is thereal orcontinuous value, such as “salary”
or“weight.”Manydifferent models can beused, andthesimplest is linear regression. It tries tofit
thedatawiththebesthyper-plane,which goes throughthepoints.
Classification
It is thetype of supervised learning. It specifies theclass towhich thedata elements belong toand
is best used whenthe output has finite anddiscrete values. It predicts a class for aninput variable,
aswell.
Q4. Can you name four common unsupervised tasks? Ans:
Thecommonunsupervisedtasksincludeclustering, visualization, dimensionalityreduction,and

association rule learning.
Clustering
Page 3 | 9
It is a Machine Learning technique that involves the grouping of the data points. Given a
setofdata points,andwecanuseaclusteringalgorithmtoclassifyeachdatapointintothespecificgroup.In
theory,datapointsthatliein thesamegroupshouldhavesimilar propertiesand/orfeatures,and data
pointsin thedifferentgroupsshouldhavehigh dissimilar propertiesand/orfeatures.Clustering isthe
methodofunsupervisedlearningandisacommontechniqueforstatisticaldataanalysisused inmanyfields.
Visualization
Data visualization is the technique that uses an array of static and interactive visuals within
the specificcontexttohelppeopletounderstandandmakesenseofthelargeamountsofdata.Thedata isoften
displayed in thestoryformatthatvisualizespatterns,trends,andcorrelationsthatmaygo otherwise
unnoticed.Itisregularlyusedasanavenuetomonetizedataastheproduct.Anexample ofusing
monetizationanddatavisualizationisUber.Theappcombinesvisualizationwithreal-time datasothat
customerscanrequestaride.
Q5. What type of Machine Learning algorithm we use to

allow a robot to walk in various unknown terrains?
Ans:
ReinforcementLearning is likely toperformthebestif wewantarobottolearnhowtowalkin the
various unknownterrains since this is typically thetype of problemthat thereinforcement learning
Page 4 | 9
tackles. It may be possible to express the problem as a supervised or semisupervised learning
problem, butit wouldbeless natural.
ReinforcementL earning-
It’s about to take suitable actions to maximize rewards in a particular situation. It is employed by
the various software and machines to find out the best possible behavior/path it should take in
specific situations. Reinforcement learning is different from the supervised learning in a way that in
supervised learning, training data has answer key with it so that the model is trained with the correct
answer itself, but in reinforcement learning, there is no answer, and the reinforcement agent decides
what to do to perform the given task. In the absence of the training dataset, it is bound to learn from
its experience.
Q6. What type of algorithm would we use to segment your

customers into multiple groups?
Ans:
If wedon’tknowhowtodefinethegroups,thenwecanusetheclustering algorithm(unsupervised
learning) tosegmentourcustomersintoclustersof similar customers.However,if weknowwhat
groupswewouldlike tohave,thenwecanfeedmanyexamplesof eachgrouptoaclassification
algorithm (supervised learning), andit will classify all your customersintothesegroups.
Q7: What is an online machine learning?

Ans:
Online machinelearning: It is amethodofmachinelearningin whichdatabecomesavailable in
sequentialorderandtoupdateourbest predictor for thefuture dataat each step, as opposedto
batchlearningtechniquesthatgeneratethebestpredictorbylearningonentire training dataset at
once.Online learningis acommontechnique andusedin theareasof machine learning whereit is
computationallyinfeasible totrain overthedatasets, requiringtheneedforOut- of-
Corealgorithms.It is alsousedin situationswherethealgorithmmustadapt tonewpatterns in the
datadynamically orwhenthedataitself is generatedas thefunction of time, for example, stock
Page 5 | 9
price prediction.Online learningalgorithms might bepronetocatastrophic interference and
problemthatcan beaddressedby theincremental learningapproaches.
Q8: What is out-of-core learning?

Ans:
Out-of-core: It refers to the processing data that is too large to fit into the
computer’smain memory.Typically,whenthedatasetfitsneatlyintothecomputer’smain
memory,randomly accessingsectionsofdatahasa(relatively)smallperformancepenalty.
Whendatamustbestoredinamediumlikealargespinningharddriveoranexternalcomputernetwork,it
becomesveryexpensivetoseekanarbitrarysectionofdatarandomlyortoprocessthe
samedatamultipletimes.Insuchacase,anout-of-corealgorithmwilltrytoaccessalltherelevant dataina
sequence.
However,moderncomputershavedeepmemoryhierarchy,andreplacing randomaccesswiththe
sequentialaccesscanincreasetheperformanceevenondatasetsthatfitwithinmemory.
Page 6 | 9
Q9. What is the Model Parameter?
Ans:
Modelparameter:It is aconfigurationvariable thatis internaltoamodelandwhosevaluecanbe
predictedfromthedata.
Whilemakingpredictions,themodelparameteris
needed. Thevaluesdefinetheskillofamodelonproblems.
It is estimated or learned from data.
It is often not set manually by the
practitioner. Itisoftensavedaspartofthelearned
model.
Page 7 | 9
Parametersarekey tomachinelearningalgorithms.They arepartof themodelthatis learnedfrom
historical training data.
Q11: What is Model Hyperparameter?
Ans:
Modelhyperparameter:It is aconfigurationthatis external toamodel andwhosevalues cannot
beestimatedfromthedata.
Itis oftenusedinprocesses tohelpestimatemodel parameters.
The practitioneroftenspecifies them.
Itcanoftenbetheset using heuristics.
It is tunedforthegiven predictive modeling problems.
Wecannotknowthebestvalueforthemodel hyperparameter onthegiven problem. Wemayusethe
rulesof thumb,copyvaluesusedonotherproblems,orsearchfor thebest value by trial and error.
Q12. What is cross-validation?

Ans:
Cross-validation: It is atechniqueforevaluatingMachineLearning modelsbytrainingseveral
MachineLearning modelsonsubsetsof available inputdataandevaluatingthemonthe
complementarysubsetof data.Use cross-validation todetect overfitting, i.e., failing togeneralizea
pattern.
There arethreesteps involved in cross-validation areas follows :
Reservesomeportionof thesampledataset.
Page 8 | 9
Using therestdataset andtrainmodels.
Testthemodel using areserveportionof thedata-set.
Page 9 | 9
DATA SCIENCE
INTERVIEW
PREPARATION
Preparation)
# DAY 10
Page 1 | 11
Q1. What is a Recommender System?
Answer:
A recommendersystemis todaywidely deployedin multiplefields like movierecommendations,music
preferences,social tags, researcharticles, search queries andso on. The recommender systems work as
percollaborativeandcontent-basedfiltering orbydeployingapersonality-basedapproach.This typeof
systemworksbasedonaperson’spast behavior in ordertobuild amodel for thefuture. This will predict
thefutureproductbuying, movieviewing orbookreadingbypeople.It alsocreates afiltering approach
using thediscrete characteristicsof items while recommendingadditionalitems.
Q2. Compare SAS, R and Python programming?
Answer:
SAS: it is one of the most widely used analytics tools used by some of the biggest
companiesonearth. Ithassomeofthebeststatisticalfunctions,graphicaluserinterface,butcancomewitha
pricetagand henceitcannotbereadilyadoptedbysmallerenterprises
Page 2 | 11
R: The best part about R is that it is an Open Source tool and hence used generously
byacademiaand theresearchcommunity.It is arobusttoolforstatistical computation,graphical
representationand reporting.Duetoits opensourcenatureit isalwaysbeingupdatedwiththelatest
featuresandthenreadily availabletoeverybody.
Python:Pythonisapowerfulopensourceprogramming language thatiseasytolearn,workswellwith
mostothertoolsandtechnologies.ThebestpartaboutPythonisthatithasinnumerablelibrariesand community
createdmodulesmakingitveryrobust.Ithasfunctionsforstatisticaloperation,model buildingand more.
Q3. Why is important in data analysis?
Answer:
With datacomingin frommultiple sourcesit is importanttoensure thatdatais goodenoughfor analysis.
This is wheredatacleansing becomesextremely vital. Data cleansing extensively deals with theprocess
of detectingandcorrectingof datarecords, ensuring thatdatais completeandaccurate andthe
componentsof datathatareirrelevant aredeletedormodifiedaspertheneeds.This processcanbe
deployed in concurrence withdatawrangling orbatchprocessing.
Page 3 | 11
Oncethedatais cleanedit confirmswiththerulesofthedatasetsinthesystem.Datacleansingis an
essential partofthedatasciencebecause thedatacanbepronetoerrorduetohumannegligence,
corruptionduring transmissionorstorageamongotherthings. Datacleansing takesahugechunkoftime
andeffortofaDataScientistbecauseofthemultiplesourcesfromwhichdataemanatesandthespeedat whichit
comes.
Q4. What are the various aspects of a Machine Learning process?
Answer:
Herewewill discuss thecomponentsinvolved in solving aproblemusing machinelearning.
Domainknowledge
This is thefirst step wherein weneedto understand howto extract the various features from the data and
learn moreabout thedata that wearedealing with. It has got moreto dowith thetype of domain that we
aredealing withandfamiliarizing thesystem tolearnmoreabout it.
Page 4 | 11
FeatureSelection
This stephasgotmoretodowiththefeaturethatweareselecting fromthesetof featuresthatwehave.
Sometimesit happensthattherearealot of features andwehavetomakeanintelligent decision
regarding
thetype of feature thatwewanttoselect togo aheadwithourmachinelearning endeavor.
Algorithm
This is avital stepsincethealgorithmsthatwechoosewill haveavery majorimpactontheentire
process
of machine learning. You can choosebetweenthelinear andnonlinear algorithm. Someof the
algorithms
usedareSupportVector Machines,Decision Trees, Naïve Bayes, K-Means Clustering, etc.
Training
This is the most important part of the machine learning technique and this is where it differs from the
traditional programming. The training is done based on the data that we have and providing morereal
world experiences. With each consequent training step themachine gets better andsmarter andable to
takeimproved decisions.
Evaluation
In this stepweactually evaluate thedecisions takenby themachinein ordertodecide whether it is up
to
themarkornot.Therearevarious metricsthatareinvolved in this processandwehavetoclosed
deploy
eachof thesetodecide ontheefficacy of thewholemachine learning endeavor.
Optimization
This processinvolves improvingtheperformanceof themachinelearningprocessusing various
optimizationtechniques.Optimizationof machinelearningis oneof themostvital componentswherein
theperformanceof thealgorithmis vastly improved.Thebestpartof optimization techniquesis that
machine learning is not just aconsumer of optimization techniques but it also provides newideas for
optimization too.
Testing
Here various tests are carried out and some these are unseen set of test cases. The data is partitioned
into test and training set. There are various testing techniques like cross-validation in order to deal with
multiplesituations.
Page 5 | 11
Q4. What is I nterpolation and Extrapolation?
Answer:
The termsof interpolationandextrapolationareextremelyimportantin anystatistical analysis.
Extrapolation is thedetermination orestimationusing aknownsetof values orfacts by extending it and
taking it toanareaor regionthat is unknown. It is thetechnique of inferring something using data that
is available.
Interpolationontheotherhandis themethodof determining a certainvalue which falls betweenacertain
setof valuesorthesequenceof values. This is especially useful whenyou have dataat thetwoextremities
of acertainregionbutyoudon’thaveenoughdatapointsatthespecific point. This is whenyoudeploy
interpolationtodeterminethevaluethat youneed.
Page 6 | 11
Q5. What does P-value signify about the statistical data?
Answer:
P-valueisusedtodetermine thesignificance ofresultsafterahypothesis testinstatistics. P-valuehelps
thereaderstodrawconclusionsand isalwaysbetween0and 1.
•P- Value > 0.05 denotes weak evidence against the null hypothesis which means the null
hypothesis cannotberejected.
•P-value <= 0.05denotes strongevidence against thenull hypothesis whichmeansthenull
hypothesis canberejected.
•P-value=0.05isthemarginalvalueindicatingitispossibletogoeitherway.
Q6. During analysis, how do you treat missing values?

Answer:
Theextentofthemissing valuesis identifiedafteridentifying thevariableswithmissing values.If any
patternsareidentifiedtheanalysthastoconcentrateonthemasitcouldleadtointerestingandmeaningful
businessinsights.Iftherearenopatternsidentified, thenthemissingvaluescanbesubstitutedwithmean or
medianvalues(imputation)ortheycansimplybeignored.
Page 7 | 11
Therearevariousfactorstobeconsideredwhenansweringthisquestion-
Understandtheproblemstatement,understandthedataandthengivetheanswer.Assigningadefault
valuewhichcanbemean,minimumormaximumvalue.Gettingintothedataisimportant.
If it is a categorical variable, the default value is assigned. The missing value is
assigneda defaultvalue.
If you have a distribution of data coming, for normal distribution give the mean value.
Shouldweeventreatmissingvaluesisanother important pointtoconsider? If 80%ofthevaluesfora
variablearemissingthenyoucananswerthat youwouldbedroppingthevariableinsteadoftreating
the missingvalues.
Q7. Explain the difference between a Test Set and a Validation Set?
Answer:
Validation setcanbeconsideredasapartof thetrainingsetasit is usedforparameterselection andto
avoidOverfitting of themodel being built. On theotherhand, test set is used for testing or evaluating
theperformanceof atrainedmachineleaning model.
In simpleterms,thedifferencescanbesummarizedas-
Training Setis tofittheparameters i.e. weights.
Test Setis toassesstheperformanceof themodeli.e. evaluatingthepredictivepowerand
generalization.
Validation setis totunetheparameters.
Page 8 | 11
Q8. What is the curse of dimensionality? Can you list some ways to deal
with it?
Answer:
Thecurseofdimensionalityis whenthetraining datahas ahigh feature count, butthedataset does not

haveenoughsamplesfor amodel tolearn correctly from so manyfeatures. For example, atraining dataset of
100sampleswith100featureswill bevery hardtolearnfrombecausethemodelwill find random relations
betweenthefeaturesandthetarget.However,if wehadadatasetof 100ksampleswith 100 features,the
modelcould probablylearnthecorrect relationshipsbetweenthefeatures andthetarget.
There aredifferent optionstofight thecurse of dimensionality:
Featureselection.Instead of using all thefeatures, wecantrain onasmallersubsetof features.

Dimensionalityreduction. There aremanytechniques thatallow toreducethedimensionality
ofthefeatures.Principal componentanalysis(PCA) andusingautoencodersareexamplesof
dimensionalityreductiontechniques.
L1 regularization. Becauseit producessparseparameters,L1 helps todealwithhigh-
dimensionalityinput.
Feature engineering. It’s possibletocreatenewfeaturesthatsumupmultipleexisting
features.For example, wecan get statistics such asthemeanormedian.
Page 9 | 11
Q9. What is data augmentation? Can you give some examples?
Answer:
Dataaugmentationisatechniqueforsynthesizing newdatabymodifying existing datainsuchaway that

thetargetisnotchanged,oritischangedinaknownway.
Computervisionisoneoffieldswheredataaugmentationisveryuseful.Therearemanymodifications thatwe
cando toimages:
Resize
Horizontal or vertical
flip Rotate
Add noise
Deform
Modifycolors
Each problemneedsacustomizeddataaugmentationpipeline.For example,onOCR, doingflips
will change the text and won’t be beneficial; however, resizes and small rotations may
help.
Q10. What is stratified cross-validation and when should we use it?

Answer:
Cross-validation is a technique for dividing data between training and validation sets. On typical
cross- validation this split is done randomly. But in stratified cross-validation, the split preserves the
ratioof thecategoriesonboththetrainingandvalidationdatasets.
Page 10 | 11
For example, if wehave a dataset with 10% of category A and90% of category B, andweuse stratified
cross-validation, we will have the same proportions in training and validation. In contrast, if we use
simple cross-validation, in the worst case we may find that there are no samples of category A in the
validation set.
Stratified cross-validation maybeapplied in thefollowing scenarios:
On a datasetwith multiple categories.The smallerthedataset andthemoreimbalanced the

categories, themoreimportantit will betousestratified cross-validation.
On adatasetwithdataofdifferent distributions. For example,in adatasetfor autonomous
driving, wemayhaveimages takenduring theday andatnight. If wedonotensure thatboth
types arepresent in training andvalidation, wewill have generalizationproblems.
Page 11 | 11
DATA SCIENCE
INTERVIEW
PREPARATION
Preparation)
# DAY 11
Page 1 | 12
Q1. What are tensors?
Answer:
The tensors are nomorethana methodof presenting the data in deeplearning. If put in thesimple term,
tensors are just multidimensional arrays that allow developers to represent the data in a layer, which
means deep learning you are using contains high-level data sets where each dimension represents a
different feature.
The foremost benefit of using tensors is it provides the much-needed platform-flexibility and is easy to
trainable on CPU. Apart from this, tensors have the auto differentiation capabilities, advanced support
system for queues,threads, andasynchronous computation. All thesefeatures also makeit customizable.
Q2. Define the concept of RNN?

Answer:
RNN is the artificial neutral which were created to analyze and recognize the patterns in the
sequences ofthedata.Duetotheirinternalmemory,RNNcancertainlyrememberthethingsabouttheinputs
they receive.
Page 2 | 12
MostcommonissuesfacedwithRNN
AlthoughRNNis aroundforawhileandusesbackpropagation,therearesomecommonissuesfaced
bydeveloperswhoworkit.Outofall,someofthemostcommonissuesare:
Exploding
gradients
Vanishing
gradients
Page 3 | 12
Q3. What is a ResNet, and where would you use it? Is it efficient?
Answer:
Amongthevariousneuralnetworksthatareused for computervision, ResNet (Residual Neural

Networks),is oneof themostpopular ones. It allows us totrain extremely deepneural networks, which
is theprimereasonforits hugeusageandpopularity. Beforetheinventionofthis network,training
extremely deepneuralnetworkswasalmost impossible.
To understand why wemustlook at thevanishing gradient problem which is anissue that arises when
thegradient is backpropagated toall thelayers. As alarge numberof multiplications areperformed,the
size of thenetwork keeps decreasing till it becomesextremely small, andthus, thenetwork starts
performing badly. ResNet helps tocounterthevanishing gradient problem.
The efficiency of this network is highly dependent onthe concept of skip connections. Skip connections
are a method of allowing a shortcut path through which the gradient can flow, which in effect helps
counterthevanishing gradient problem.
Anexampleof askipconnectionis shownbelow:
In general, a skip connection allows us to skip the training of a few layers. Skip connections are also
called identity shortcut connections as they allow us to directly compute an identity function by just
relying onthese connections andnot having to look at thewhole network.
The skipping of theselayers makesResNet anextremely efficient network.
Page 4 | 12
Q4. Transfer learning is one of the most useful concepts today. Where
can it be used?
Answer:
Pre-trainedmodelsareprobably oneof themostcommonusecases fortransfer learning.
For anyone who does not have access to huge computational power, training complex models is always
a challenge. Transfer learning aims to help by both improving the performance and speeding up your
network.
In laymanterms,transferlearning is atechnique in which amodel that has already beentrained to do

onetaskis usedforanotherwithoutmuchchange.This typeof learningis alsocalled multi-tasklearning.
Many models that are pre-trained are available online. Any of these models can beused as astarting
point in thecreation of thenewmodelrequired. After just using theweights,themodelmustberefined
andadaptedon therequired databy tuning theparametersof themodel.
Page 5 | 12
The general idea behind transfer learning is totransfer knowledge notdata.For humans,this task is easy
–wecangeneralize models that wehave mentally created along time ago for adifferent purpose. One
ortwosamplesis almostalwaysenough.However,in thecase of neural networks, ahuge amountof data
andcomputational power arerequired.
Transfer learning shouldgenerally beusedwhenwedon’t have alot of labeled training data, orif there
already exists anetworkfor thetask youaretrying toachieve, probably trained onamuchmoremassive
dataset. Note, however, that the input of the model must have the same size during training. Also, this
works only if the tasks are fairly similar to each other, and the features learned can be generalized. For
example, something like learning how to recognize vehicles can probably be extended to learn how to
recognize airplanes andhelicopters.
Q5. What does tuning of hyperparameters signify? Explain with

examples.
Answer:
A hyperparameter is just a variable that defines the structure of the network.

Let’s go through some hyperparameters and see the effect of tuning them.
1.A number of hidden layers – Most times, the presence or absence of a
largenumberof hidden layersmaydeterminetheoutput,accuracyandtrainingtimeofthe
neuralnetwork.Havinga
largenumberoftheselayersmaysometimescauseanincreaseinaccuracy.
2. Learningrate–Thisissimplyameasureofhowfasttheneuralnetworkwillchangeits
parameters.Alargelearningratemayleadtothenetworknotbeingabletoconverge,butmight also
speeduplearning.Ontheotherhand,asmallervalueforthelearningratewillprobablyslow
downthenetworkbutmightleadtothenetworkbeingabletoconverge.
3. Numberofepochs–Thisisthenumberoftimestheentiretrainingdataisrunthroughthe network.
Increasing thenumberofepochsleadstobetteraccuracy.
4. Momentum–Momentumisameasureofhowandwherethenetworkwil gowhiletakinginto
accountallofitspastactions.Apropermeasureofmomentumcanleadtoabetternetwork.
5. BatchSize –Batchsize determinesthenumberofsubsamplesthatareinputstothenetwork
beforeeveryparameter update.
Page 6 | 12
Q6. Why are deep learning models referred as black boxes?
Answer:
Lately, theconcept of deeplearningbeing ablack box hasbeenfloating around. A black boxis asystem
whosefunctioning cannotbeproperlygrasped,buttheoutputproduced can beunderstood andutilized.
Now,sincemostmodelsaremathematicallysoundandarecreatedbasedonlegit equations,howis it
possiblethat wedonotknowhowthesystemworks?
First, it is almostimpossibletovisualize thefunctionsthataregeneratedbyasystem.Mostmachine

learningmodelsendupwithsuch complex outputthatahumancan't makesenseof it.
Second,therearenetworkswithmillions of hyperparameters.As ahuman,wecangrasparound10to15
parameters.But analysing amillion of themseems outof thequestion.
Third andmostimportant, it becomes very hard, if not impossible, totrace back why thesystem made
thedecisions it did. This maynot sound like ahuge problem to worry about but consider thecase of a
self driving car.If thecarhits someoneontheroad, weneedtounderstand why thathappenedandprevent
it. Butthis isn’tpossibleif wedonotunderstandhowthesystemworks.
Page 7 | 12
To make a deeplearning model not bea black box, a newfield called Explainable Artificial Intelligence
or simply, Explainable AI is emerging. This field aims to be able to create intermediate results and trace
back thedecision-making process of asystem.
Q7. Why do we have gates in neural networks?
Answer:
To understand gates, wemustfirst understand recurrent neural networks.
Recurrentneural networks allow information tobestored as amemoryusing loops. Thus, theoutputof

arecurrentneuralnetworkis notonly basedonthecurrent input butalso thepast inputs which arestored
in thememoryofthenetwork.Backpropagationis donethroughtime,butin general,thetruncated
versionof this is usedforlonger sequences.
Gatesaregenerally usedin networksthataredependentontime.In effect,anynetworkwhichwould
require memory,sotospeak, would benefit fromtheuseof gates. These gates aregenerally usedtokeep
track of any information that is required by the network without leading to a state of either vanishing or
exploding gradients. Such anetwork can also preserve theerror through time. Since asense of constant
error is maintained, thenetwork canlearnbetter.
Thesegatedunits canbeconsideredasunits withrecurrentconnections.They also containadditional

neurons,whicharegates.Ifyourelatethisprocesstoasignalprocessingsystem,thegateisusedto
Page 8 | 12
regulatewhichpartofthesignalpassesthrough.Asigmoidactivationfunctionisusedwhichmeansthat the
valuestakenarefrom0to1.
An advantage of using gates is that it enables the network to either forget information that it has
already learnedortoselectivelyignoreinformationeitherbasedonthestateofthenetworkortheinput
thegatereceives.
Gatesareextensivelyusedinrecurrentneuralnetworks,especiallyinLongShort-TermMemory(LSTM)
networks.A generalLSTMnetworkwillhave3to5gates,typically aninputgate,outputgate,hidden gate,
and activationgate.
Q8. What is a Sobel filter?

Answer:
The Sobel filter performs a two-dimensional spatial gradient measurement ona given image, which then
emphasizes regions thathave ahigh spatial frequency.In effect, this meansfinding edges.
In most cases, Sobel filters are used to find the approximate absolute gradient magnitude for every point
in a grayscale image. The operator consists of a pair of 3×3 convolution kernels. One of these kernels is
rotatedby 90degrees.
Page 9 | 12
Thesekernelsrespondtoedgesthatrunhorizontalorverticalwithrespecttothepixelgrid,onekernel for
eachorientation.Apointtonoteisthatthesekernelscanbeappliedeitherseparatelyorcanbe combinedto
findtheabsolutemagnitudeofthegradientateverypoint.
The Sobel operator has a large convolution kernel,which ends upsmoothing the image to a greater
extent, and thus, the operator becomes less sensitive to noise. It also produces higher output values for
similaredgescomparedtoothermethods.
To overcome the problem of output values from the operator overflowing the maximum
allowedpixel valueperimagetype,avoidusingimagetypesthatsupportpixelvalues.
Q9. What is the purpose of a Boltzmann Machine?
Answer:
Boltzmannmachinesarealgorithmsthatarebasedonphysics,specificallythermalequilibrium.Aspecial and
morewell-knowncaseofBoltzmannmachinesistheRestrictedBoltzmannmachine,whichisatype ofBoltzmann
machinewheretherearenoconnectionsbetweenhiddenlayersofthenetwork.
Page 10 | 12
The concept was coined by Geoff Hinton, who most recently won the Turing award. In general, the
algorithm uses the laws of thermodynamics and tries to optimize a global distribution of energy in the
system.
In discrete mathematical terms, a restricted Boltzmann machine can be called a symmetric

bipartite graph,i.e.twosymmetriclayers.Thesemachinesareaformofunsupervisedlearning,which
means thattherearenolabelsprovidedwithdata.Itusesstochasticbinaryunitstoreachthisstate.
BoltzmannmachinesarederivedfromMarkovstatemachines.AMarkovStateMachineisamodelthat
can be used to represent almost any computable function. The restricted Boltzmann machine can be
regardedasanundirectedgraphicalmodel.Itisusedindimensionalityreduction,collaborative
filtering,
learning featuresaswell asmodeling.It canalsobeusedfor classification andregression.In general,
restrictedBoltzmannmachinesarecomposedofatwo-layernetwork,whichcanthenbeextended
further.
Note that these models are probabilistic since each of the nodes present in the system learns low-level
features from items in the dataset. For example, if we take a grayscale image, each node that is
responsibleforthevisiblelayerwilltakejustone-pixelvaluefromtheimage.
A part of the process of creating such a machine is a feature hierarchy where

sequencesofactivations tahregroupedinterms offeatures.Inthermodynamicsprinciples,simulatedannealingis
a processthat
machinefollowstoseparatesignalandnoise.
Page 11 | 12
Q10. What are the types of weight initialization?
Answer:
Therearetwomajortypesofweightinitialization:-zeroinitializationandrandominitialization.
Zero initialization:Inthis process, biases and weights are initialisedto 0. If the weights are
setto
0, all derivativeswithrespecttotheloss functionsin theweightmatrixbecomeequal.Hence,none
of theweightschangeduringsubsequent iterations. Settingthebiasto0cancelsoutanyeffectit
may have.
All hidden units become symmetric due to zero initialization. In general, zero
initialization is notvery usefuloraccurate forclassification andthus mustbeavoidedwhenany
classificationtaskisrequired.
Randominitialization: As compared to0 initialization, this involvessetting randomvalues for the

weights.The only disadvantage is that set very high values will increase thelearning time as the
sigmoid activationfunctionmapsclose to1. Likewise, if low values areset, thelearning time increases
astheactivationfunction is mappedclose to0.
Setting toohigh ortoolow values thus generally leads totheexploding orvanishing gradient problem.
Newtypes of weightinitializationlike“Heinitialization” and“Xavier initialization” havealso

emerged.Thesearebasedonspecific equationsandarenotmentionedhereduetotheirsheer
complexity.
Page 12 | 12
DATA SCIENCE
INTERVIEW
PREPARATION
Preparation)
# DAY 12
Page 1 | 11
Q1. Where is the confusion matrix used? Which module would you use to
show it?
Answer:
In machinelearning, confusionmatrixis oneof theeasiestwaystosummarizetheperformanceof

youralgorithm.
At times,it is difficult tojudgetheaccuracyof amodelbyjustlooking attheaccuracybecauseof
problemslike unequaldistribution. So, abetterwaytocheckhowgoodyourmodelis, is tousea
confusion matrix.
First, let’s look atsomekey terms.
Classification accuracy –This is theratioofthenumberofcorrectpredictionstothenumberof

predictions made
Truepositives – Correctpredictions of trueevents
False positives– Incorrectpredictions of true events
Truenegatives – Correctpredictions of falseevents
False negatives– Incorrectpredictions of false events.
Theconfusionmatrixis nowsimply amatrixcontainingtruepositives, false positives, true

negatives, false negatives.
Page 2 | 11
Q2: What is Accuracy?
Answer:
It is the most intuitive performance measure and it simply a ratio of correctly predicted to
thetotal observations.Wecansayas,ifwehavehighaccuracy,thenourmodelisbest.Yes,wecouldsaythat
accuracyis agreatmeasurebutonlywhenyouhavesymmetricdatasetswherefalsepositivesand
falsenegativesarealmostsame.
Accuracy = True Positive+ True Negative/ (True Positive+False Positive+ False

Negative+TrueNegative)
Q3: What is Precision?

Answer:
It is also called as thepositive predictive value. Numberof correct positives in your model that
predicts comparedtothetotal numberof positives it predicts.
Precision = True Positives /(True Positives + False Positives)

Precision= True Positives /Total predicted positive
It is thenumberof positive elements predicted properly divided by thetotal numberof positive

elements predicted.
Wecansay Precision is ameasureof exactness, quality, oraccuracy. High precision
Meansthatmoreorall of thepositive results youpredicted arecorrect.
Page 3 | 11
Q4: What is Recall?
Answer:
Recall wecanalso called assensitivity ortruepositive rate.
It is several positives thatourmodelpredicts comparedtotheactual numberof positives in ourdata.
Recall = True Positives/(True Positives+ FalsePositives)
Recall = TruePositives /Total Actual Positive
Recall is a measure of completeness. High recall which meansthat our model classified most or all
of thepossiblepositiveelements as positive.
Q5: What is F1 Score?

Answer:
Weuse Precision and recall together because they complement each other in how they describe the
effectiveness of a model. The F1 score that combines these two as theweighted harmonic meanof
precision andrecall.
F1Score= 2* (Precision* Recall) /(Precision+ Recall)
Page 4 | 11
Q6: What is Bias and Variance trade-off?
Answer:
Bias
Bias meansit’s howfar arethepredictvaluesfromtheactual values. If theaverage predicted values
arefar off fromtheactual values, thenwecalled asthis onehave high bias.
Whenourmodelhas ahigh bias, thenit meansthat ourmodelis toosimple anddoesnotcapturethe
complexity of data, thus underfitting thedata.
Variance
It occurs whenour model performs good onthetrained dataset but doesnot dowell ona dataset that
it is not trained on, like a test dataset or validation dataset. It tells us that actual value is how much
scattered fromthepredicted value.
Because of High variance it cause overfitting that implies that thealgorithm models random noise
present in thetraining data.
Whenmodel have high variance, thenmodel becomes very flexible andtuneitself to the data points
of thetraining set.
Page 5 | 11
Bias-variance: It decomposition essentially decomposes the learning error from any algorithm by
adding bias, the variance and a bit of irreducible error due to noise in the underlying dataset.
Essentially, if we make the model more complex and add more variables, We’ll lose bias but gain
some variance —to get the optimally reduced amount of error, you’ll have to tradeoff bias and
variance.Wedon’t wanteither high bias orhigh variance in your model.
Bias and variance using bulls-eye

diagram
Q7. What is data wrangling? Mention three points to consider in the

process.
Answer:
Datawranglingisaprocessbywhichweconvertandmapdata.Thischangesdatafromitsraw
form to a format that is a lot more valuable.
Datawranglingisthefirststepformachinelearninganddeeplearning.Theendgoalistoprovide data
thatisactionableand toprovideitasfastaspossible.
Page 6 | 11
There arethreemajor things tofocus onwhile talking aboutdatawrangling –
1.Acquiring data
Thefirst andprobablythemostimportant step in datascience is theacquiring, sorting andcleaning
of data.This is an extremely tediousprocessandrequires themost amountof time.
Oneneeds to:
Check if thedatais validandup-to-date.

Check if thedataacquired is relevant fortheproblemathand.
Sources for datacollection Data is publicly available onvarious websiteslike
kaggle.com,data.gov,World Bank, Five Thirty Eight Datasets,AWS Datasets,Google
Datasets.
2.Datacleaning
Datacleaningis anessentialcomponentof datawranglingandrequires alot of patience. To makethe
job easier it is first essential toformatthedatamakethedatareadable for humansatfirst.
Theessentials involvedare:
Formatthedatatomakeitmorereadable
Find outliers(data points thatdonotmatchtherest of thedataset) in data
Find missing valuesandremovethemfromthedataset(withoutthis, anymodelbeing
trainedbecomesincomplete anduseless)
3.DataComputation
At times,yourmachinenothaveenoughresourcestorunyouralgorithm e.g. youmightnothave a
GPU. In thesecases,youcanusepublicly available APIs torunyouralgorithm. These arestandard
endpoints found onthewebwhich allow you tousecomputing powerover thewebandprocess
datawithout having torely onyour ownsystem. An example wouldbetheGoogleColab Platform.
Page 7 | 11
Q8. Why is normalization required before applying any machine
learning model? What module can you use to perform normalization?
Answer:
Normalizationis aprocessthatis requiredwhenanalgorithmusessomethinglike distance

measures.Exampleswouldbeclusteringdata,finding cosinesimilarities, creatingrecommender
systems.
Normalizationis notalwaysrequiredandis donetopreventvariablesthatareonhigherscalefrom

affecting outcomesthatareonlowerlevels. For example, consideradataset of employees’ income.
This datawon’t beonthesamescale if youtry toclusterit. Hence,wewould have tonormalize the
datatopreventincorrect clustering.
A key point tonoteis thatnormalization doesnotdistortthedifferences in therangeof values.
A problemwemightface if wedon’t normalize datais thatgradients would take avery long time
todescendandreachtheglobal maxima/minima.
For numericaldata, normalization is generally donebetweentherange of 0 to1.
Thegeneral formulais:
Xnew= (x-xmin)/(xmax-xmin)
Page 8 | 11
Q9. What is the difference between feature selection and
feature extraction?
Featureselectionandfeatureextractionaretwomajorwaysoffixingthecurseofdimensionality
1.Feature selection:
Feature selectionis usedto filter asubsetof input variablesonwhichtheattentionshouldfocus. Every
othervariableisignored.Thisissomethingwhichwe,ashumans,tendtodosubconsciously.
Many domains have tens of thousands of variables out of which most are irrelevant and redundant.
Featureselectionlimitsthetrainingdataandreducestheamountofcomputationalresourcesused. Itcan
significantly improve a learning algorithms performance.
In summary, we can say that the goal of feature selection is to find out an optimal feature
subset. Thismightnotbeentirelyaccurate,however,methodsofunderstandingtheimportanceof
features alsoexist.SomemodulesinpythonsuchasXgboosthelpachievethesame.
2.Feature extraction
Featureextractioninvolves transformationoffeaturessothatwecanextractfeaturestoimprovethe
processoffeatureselection.Forexample,inanunsupervisedlearningproblem,theextractionof bigrams
fromatext,ortheextractionofcontoursfromanimageareexamplesoffeatureextraction.
Thegeneralworkflowinvolvesapplying featureextractionongiven datatoextractfeaturesand
thenapplyfeatureselectionwithrespecttothetargetvariabletoselectasubsetofdata.Ineffect, thishelps
improve the accuracy of a model.
Page 9 | 11
Q10. Why is polarity and subjectivity an issue?
Polarity andsubjectivity aretermswhich aregenerally usedin sentimentanalysis.
Polarity is thevariationofemotionsin asentence.Since sentimentanalysisis widelydependenton
emotionsandtheir intensity, polarity turns outtobeanextremely important factor.
In mostcases,opinionsandsentimentanalysis areevaluations.They fall underthecategoriesof
emotional and rationalevaluations.
Rational evaluations,asthenamesuggests,arebasedonfactsandrationality whileemotional
evaluations arebasedonnon-tangibleresponses, which arenotalwayseasy todetect.
Subjectivity in sentimentanalysis, is amatterof personal feelings andbeliefs which may or may
notbebasedonany fact. Whenthereis alot of subjectivity in atext, it mustbeexplained and
analysedin context.Onthecontrary, if therewasalot of polarity in thetext, it could beexpressed
asapositive, negative orneutral emotion.
Q11. When would you use ARIMA?

Answer:
ARIMA is a widely used statistical method which stands for Auto Regressive
IntegratedMoving Average.Itisgenerallyusedforanalyzingtimeseriesdataandtimeseriesforecasting.
Let’s take a quick look at the terms involved.
AutoRegressionis amodelthatusestherelationshipbetweentheobservationandsomenumbersof
laggingobservations.
Page 10 | 11
Integratedmeansuseofdifferencesinrawobservationswhichhelpmakethetimeseriesstationary.
MovingAveragesis amodelthatusestherelationshipanddependencybetweenthe
observation andresidualerrorfromthemodelsbeingappliedtothelaggingobservations.
Notethateachofthesecomponentsareusedasparameters.Aftertheconstructionofthemodel,a linear
regressionmodelisconstructed.
Data is prepared by:
Finding out the differences
Removingtrendsandstructuresthatwillnegativelyaffectthemodel
Finally, making the model stationary.
Page 11 | 11
DATA SCI ENCE
INTERVIEW
PREPARATI ON
Preparation)
# Day13
Page 1 | 10
Q1. What is Autoregression?
Answer:
Theautoregressive(AR)modeliscommonlyusedtomodeltime-varyingprocessesandsolve
problemsin thefields of natural science, economicsandfinance, andothers. The modelshave
always
beendiscussedin thecontextof randomprocessandareoftenperceivedasstatistical toolsfor time
seriesdata.
A regressionmodel,like linear regression,modelsanoutputvalue whicharebasedonalinear
combinationof input values.
Example: y^ = b0+ b1*X1
Wherey^ is theprediction,b0andb1arecoefficients foundbyoptimising themodel ontraining
data,and X is an input value.
This modeltechniquecan beused onthetime serieswhereinput variables aretaken as observations
atprevious timesteps, called lag variables.
For example,wecanpredictthevaluefor thenext time step(t+1) given theobservations at thelast
twotimesteps(t-1 andt-2). As aregressionmodel, this would look asfollows:
X(t+1) = b0+ b1*X(t-1) + b2*X(t-2)-
Becausetheregressionmodelusesthedatafromthesameinputvariable atprevioustimesteps,it is
referred toasan autoregression.
The notationAR(p) referstotheautoregressive modelof orderp.The AR(p) modelis written
Page 2 | 10
Q2. What is Moving Average?
Answer:
Movingaverage:Fromadataset, wewill get anoverall idea oftrends by this technique; it is an
averageofanysubsetofnumbers.Forforecastinglong-termtrends,themovingaverageis extremelyuseful
forit. Wecancalculateit foranyperiod.For example:if wehavesalesdatafor twentyyears,we
cancalculate thefive-yearmovingaverage,afour-yearmovingaverage,athree- yearmovingaverageand
soon.Stockmarketanalystswilloftenusea50or200-daymovingaverage tohelpthemseetrendsinthe
stockmarketand(hopefully)forecastwherethestocksareheaded.
The notationMA(q) referstothemoving average modelof orderq:
Q3. What is Autoregressive Moving Average (ARMA)?

Answer:
ARMA: It is a model of forecasting in which the methods of autoregression (AR) analysis and
moving average (MA) are both applied to time-series data that is well behaved. In ARMA it is
assumed that the time series is stationary and when it fluctuates, it does so uniformly around a
particulartime.
AR (Autoregressionmodel)-
Autoregression(AR) modelis commonly used in current spectrumestimation.
Page 3 | 10
ThefollowingistheprocedureforusingARMA.
Selecting theAR modelandthenequalizing theoutputtoequalthesignal being studied
if the input is an impulse function or the white noise. It should at least be
goodapproximation
ofsignal.
Findinga model’sparameters numberusingtheknownautocorrelationfunctionorthe
data .
Usingthederivedmodelparameterstoestimatethepowerspectrumofthesignal.
MovingAverage(MA)model-
It is a commonly used model in the modern spectrum estimation and is also one of the
methods of the model parametric spectrum analysis. The procedure for estimating MA model’s signal
spectrumisasfollows.
SelectingtheMA modelandthenequalisingtheoutputtoequalthesignal understudyin

the casewheretheinputisanimpulsefunctionorwhitenoise.Itshouldbeatleastagood
approximationofthesignal.
Finding themodel’sparametersusingtheknownautocorrelation
function. Estimatingthesignal’spowerspectrumusingthederivedmodel
parameters.
In the estimation of the ARMA parameter spectrum, the AR parameters are first estimated,
andthen theMAparametersareestimatedbasedontheseARparameters.ThespectralestimatesoftheARMA
modelarethenobtained.TheparameterestimationoftheMAmodelis,thereforeoftencalculatedas aprocess
ofARMAparameterspectrumassociation.
ThenotationARMA(p, q)referstothemodelwithpautoregressivetermsandqmoving-average
terms.ThismodelcontainstheAR(p) and MA(q)models,
Page 4 | 10
Q4. What is Autoregressive Integrated Moving Average (ARIMA)?
Answer:
ARIMA: It is a statistical analysis model that uses time-series data to either better
understandthedata setortopredictfuturetrends.
AnARIMAmodelcan be understoodby the outlining each ofits components as follows-
Autoregression (AR): It refers to a model that shows a changing variable that
regresseson itsownlagged,orprior,values.
Integrated(I):Itrepresentsthedifferencingofrawobservationstoallowforthetimeseries to
becomestationary,i.e.,datavaluesarereplacedbythedifferencebetweenthedatavalues
andthepreviousvalues.
Moving average (MA): It incorporates the dependency between an observation and
theresidualerrorfromthemovingaveragemodelappliedtothelaggedobservations.
Each component functions as the parameter with a standard notation. For ARIMA models,
the standard notation would be the ARIMA with p, d, and q, where integer values substitute for the
parameterstoindicatethetypeoftheARIMAmodelused.Theparameterscanbedefinedas-
p:Itthenumberoflagobservations inthemodel;alsoknownasthelagorder.
d: It thenumberof timesthattherawobservations aredifferenced; alsoknownasthe
degree ofdifferencing.
q:Itthesizeofthemovingaveragewindow;alsoknownastheorderofthemovingaverage.
Page 5 | 10
Q5.What is SARIMA (Seasonal Autoregressive Integrated Moving-
Average)?
Answer:
Seasonal ARIMA: It is an extension of ARIMA that explicitly supports the univariate time series
datawiththeseasonalcomponent.
It adds three new hyper-parameters to specify the autoregression (AR), differencing (I) and the
moving average (MA) for theseasonal component of theseries, as well as an additional parameter
fortheperiodof theseasonality.
Configuring the SARIMA requires selecting hyperparameters for both the trend and seasonal
elements of theseries.
Trend Elements
Threetrendelementsrequirestheconfiguration.
They aresameas theARIMA model, specifically-
p: Itis Trendautoregressionorder.
d: Itis Trenddifferenceorder.
q: Itis Trendmoving averageorder.
Seasonal Elements-
Fourseasonalelements arenotthepartof theARIMA thatmustbeconfigured,they are-

P: It is Seasonalautoregressive order.
D: Itis Seasonal differenceorder.
Q: Itis Seasonal moving averageorder.
m:It is thenumberof timesteps forthesingle seasonal period.
Together, thenotationfortheSARIMA modelis specified as-
SARIMA(p,d,q)(P,D,Q)m-
The elements can bechosen through careful analysis of theACF andPACF plots looking atthe
correlations of recenttimesteps.
Page 6 | 10
Q6. What is Seasonal Autoregressive Integrated Moving-Average
with Exogenous Regressors (SARIMAX) ?
Answer:
SARIMAX: It is anextensionof theSARIMA modelthatalsoincludesthemodellingof the
exogenousvariables.
Exogenousvariablesarealsocalled thecovariatesandcanbethought of as parallel input sequences
thathaveobservationsatthesametime steps as theoriginal series. The primary series may be
referredasendogenousdatatocontrastit fromexogenous sequence(s). The observations for
exogenousvariablesareincludedin themodeldirectly ateachtimestepandare notmodeledin the
samewayastheprimaryendogenoussequence(e.g. asanAR, MA, etc.process).
The SARIMAX methodcan also beusedtomodelthesubsumedmodelswith exogenous variables,

suchas ARX, MAX, ARMAX, andARIMAX.
The methodis suitablefor univariatetimeseries withtrendand/orseasonalcomponentsand
exogenousvariables.
Page 7 | 10
Q7. What is Vector autoregression (VAR)?
Answer:
VAR: It is a stochastic process model used to capture the linear interdependencies
among
multipletimeseries.VARmodelsgeneralisetheunivariateautoregressivemodel(ARmodel)by
allowing formorethanoneevolving variable.All variablesin theVAR enterthemodelin thesame
way:eachvariablehasanequationexplaining its evolutionbasedonits ownlaggedvalues,thelagged
valuesof theothermodelvariables, andanerrorterm.VAR modelling doesnotrequires asmuch
knowledgeabouttheforcesinfluencingthevariableasdostructuralmodelswithsimultaneous equations:
Theonly priorknowledgerequiredis alist ofvariableswhichcanbehypothesisedto affecteach
otherintertemporally.
A VAR model describes the evolution of the set of kvariablesoverthesamesample
period(t= 1,
..., T) as the linear function of only their past values. The variables are collected
in the k- vector((k × 1)-matrix) y,,whichhasasthtthe(i )element,yi,t, the
observationattimetof
the(i th)variable.Example:ifthe(i th)variableistheGDP,thenyi,t is the value of GDP at time “t”.
-
wheretheobservationyt−i is called the (i-th) lagofy, c is the k-vectorofconstants(intercepts),Ai is a
time-invariant(k× k)-matrix,and etis a k-vectoroferrortermssatisfying.
Q8. What is Vector Autoregression Moving-Average (VARMA)?

Answer:
VARMA: It is method models the next step in each time series using an ARMA
model.Itisthe generalisationofARMAtomultipleparalleltimeseries,Example-multivariatetime
series.
Page 8 | 10
The notation for a model involves specifying the order for the AR(p) and the MA(q) models as
parameters to the VARMA function, e.g. VARMA (p, q). The VARMA model can also be used to
developVAR orVMA models.
This methodis suitable formultivariate timeserieswithout trend andseasonalcomponents.
Q9. What is Vector Autoregression Moving-Average with Exogenous

Regressors (VARMAX)?
Answer:
VARMAX: It is an extension of the VARMA model that also includes the
modellingofthe exogenousvariables.ItisthemultivariateversionoftheARMAXmethod.
Exogenousvariablesarealsocalled thecovariatesandcanbethoughtof asparallel inputsequences
thathaveobservationsatthesametimestepsastheoriginalseries.Theprimaryseries(es)arereferred
as the endogenous data to contrast it from the exogenous sequence(s). The observations for the
exogenousvariablesareincludedinthemodeldirectlyateachtimestepandarenotmodeledinthe same
wayastheprimaryendogenoussequence(Example-asanAR,MA,etc.).
This method can also be used to model subsumed models with exogenous variables, such
asVARX andtheVMAX.
This method is suitable for multivariate time series without trend and seasonal
componentsand exogenousvariables.
Page 9 | 10
Q10. What is Simple Exponential Smoothing (SES)?
Answer:
SES: It methodmodelsthenexttimestepasanexponentiallyweightedlinear functionof
observations
atprior timesteps.
This methodis suitable for univariate timeserieswithout trend andseasonalcomponents.
Exponential smoothing is the rule of thumb technique for smoothing time series data using the
exponential window function. Whereas in the simple moving average, the past observations are
weighted equally, exponential functions are used to assign exponentially decreasing weights
over time. It is easily learned and easily applied procedure for making some determination
based on prior assumptions by the user, such as seasonality. Exponential smoothing is often
usedfor analysis of time-series data.
Exponential smoothing is oneof manywindowfunctions commonly applied tosmoothdatain
signal
processing,acting as low-pass filters toremovehigh-frequency noise.
The raw data sequence is often represented by {xt} beginning at time t = 0, and the output of the
exponential smoothing algorithm is commonly written as {st} which may be regarded as a best
estimateof whatthenext value of x will be. Whenthesequenceof observations begins attimet=
0, thesimplest formof exponential smoothing is given by theformulas:
Page 10 | 10
DATA SCIENCE
INTERVIEW
PREPARATION
Preparation)
# DAY 14
Page 1 | 11
Q1. What is
Alexnet? Answer:
The Alex Krizhevsky, Geoffrey Hinton and Ilya Sutskever created the neural network architecture
called ‘AlexNet’ and won Image Classification Challenge (ILSVRC) in 2012. They trained their
networkon1.2millionhigh-resolutionimagesinto1000differentclasseswith60million parametersand
650,000neurons.ThetrainingwasdoneontwoGPUswithsplitlayerconcept becauseGPUswerea
little bit slow at that time.
AlexNet is the name of convolutional neural network which has had a large impact on the
field of machine learning, specifically in the application of deep learning to machine vision.
The network had very similar architecture as the LeNet by Yann LeCun et al. but was deeper with
morefilters
per
layer,andwiththestackedconvolutionallayers.Itconsistof(11×11,5×5,3×3,convolutions),max
pooling, dropout, data augmentation, ReLU activations and SGD with the momentum. It
attached withReLUactivationsaftereveryconvolutionalandfullyconnectedlayer.AlexNetwastrainedfor
six days simultaneously on two Nvidia Geforce GTX 580 GPUs, which is the reason for
whytheir networkissplitintothetwo pipelines.
Architecture
AlexNet containseightlayers withweights,first five areconvolutional, andtheremainingthreeare

fully connected. The outputof last fully-connected layer is fed toa1000-way softmax which
producesadistribution overthe1000class labels. The networkmaximisesthemultinomiallogistic
regressionobjective,whichis equivalenttomaximising theaverageacrosstrainingcasesof thelog-
probability ofthecorrectlabelunder theprediction distribution. The kernels of second, fourth, and
thefifth convolutionallayersareconnectedonlywiththosekernelmapsin theprevious layer which
resideonthesameGPU. The kernels of third convolutional layer areconnected toall thekernel maps
in secondlayer.Theneuronsin fully connectedlayersareconnectedtoall theneuronsin theprevious
layers.
Page 2 | 11
In short, AlexNet contains five convolutional layers and three fully connected layers. Relu is
applied aftertheveryconvolutionalandthefullyconnectedlayer.Dropoutis appliedbeforethefirst
and secondfullyconnectedyear.Thenetworkhasthe62.3millionparametersandneeds1.1billion
computationunitsinaforwardpass.Wecanalsoseeconvolutionlayers,whichaccountsfor6%of allthe
parameters,consumes95%ofthecomputation.
Q2. What is VGGNet?

Answer:
VGGNet consists of 16 convolutional layers and is very appealing because of its very
uniform architecture. Similar to AlexNet, only 3x3 convolutions, but lots of filters. Trained on 4
GPUs for 2– 3 weeks. It is currently the most preferred choice in the community for
extracting features from images. The weight configuration of the VGGNet is publicly available
and has beenusedinmany other applications andchallenges as a baseline feature extractor. However,
VGGNet consists of 138 million parameters, which can be a bit challenging to handle.
There are multiple variants of the VGGNet (VGG16, VGG19 etc.) which differ only in total number of
layersinthenetworks.ThestructuraldetailsoftheVGG16networkhasbeenshown:
The idea behind having thefixed size kernels is that all the variable size convolutional kernels used
in the Alexnet (11x11, 5x5, 3x3) can be replicated by making use of multiple 3x3 kernels as the
building blocks. The replication is in termof thereceptive field covered by kernels .
Page 3 | 11
Let’s consider the example. Say we have an input layer of the size 5x5x1. Implementing
theconv layerwithkernelsizeof5x5andstrideonewilltheresultsandoutputfeaturemapof(1x1).Thesame
outputfeaturemapcanobtainedbyimplementingthetwo(3x3)Convlayerswithstrideof1as below:
Now, let’s look at the number of the variables needed to be trained. For a 5x5 Conv layer filter, the
numberofvariablesis25.Ontheotherhand,twoconvlayersofkernelsize3x3haveatotalof 3x3x2=18
variables (a reductionof 28%).
Page 4 | 11
Q3. What is VGG16?
Answer:
VGG16: It is a convolutional neural network model proposed by the K. Simonyan andA. Zisserman
from the University of Oxford in the paper “Very Deep Convolutional Networks for the Large-Scale
Image Recognition”. The model achieves 92.7% top 5 test accuracy in ImageNet, which is the
dataset of over 14 million images belonging to the 1000 classes. It was one of famous model
submitted to ILSVRC-2014. It improves AlexNet by replacing the large kernel-sized filters (11 and
5inthefirstandsecondconvolutionallayer,respectively)withmultiple3×3kernel-sizedfiltersone after
another.VGG16 wastrainedforweeksandwasusing NVIDIA Titan Black GPU’s.
The Architecture
The architecture depicted below is
VGG16.
Theinput totheCov1 layer is of fixed size of 224 x 224 RGB image. The image is passed through
thestackofconvolutional(conv.) layers, wherethefilters wereusedwith avery small receptive field:
3×3(whichis thesmallestsize tocapturethenotionofleft/right,up/down,centre).In oneofthe
configurations,it also utilises the1×1 convolution filters, which can beseenas thelinear
transformationof theinputchannels . The convolution stride is fixed tothe1 pixel, thespatial padding
of theConv. layer input is such that, thespatial resolutionis preserved aftertheconvolution, i.e. the
Page 5 | 11
padding is 1-pixel for 3×3 Conv. layers. Spatial pooling is carried out by the five max-pooling layers,
which follows some of the Conv. Layers. Max-pooling is performed over the 2×2 pixel window, with
stride2.
ThreeFully-Connected(FC)layersfollowthestackofconvolutionallayers(whichhasthedifferent depth
in different architectures): the first two have 4096 channels each, the third performs 1000-way
ILSVRC classification and thus contains 1000 channels . The final layer is
softmaxlayer.The configurations ofthefullyconnectedlayersissameinallthenetworks.
All hidden layers are equipped with rectification (ReLU) non-linearity. It is also
noted that none of the networks (except for one) contain the Local Response Normalisation (LRN), such
normalisation
does not improve the performance on the ILSVRC dataset, but leads to increased memory
consumptionandcomputationtime.
Q4. What is ResNet?

Answer:
At the ILSVRC 2015, so-called Residual Neural Network (ResNet) by the Kaiming He et al
introducedtheanovelarchitecturewith“skipconnections”andfeaturesheavybatchnormalisation. Such
skip connections are also known as the gated units or gated recurrent units and have the
strong similaritytorecentsuccessful elementsappliedinRNNs.Thanksto thistechniqueastheywere
able totraintheNNwith152layerswhilestillhavinglowercomplexitythantheVGGNet.Itachieves the
top-5errorrateof3.57%,whichbeatshuman-levelperformanceon thisdataset.
Q5. What is HAAR CASCADE?

Answer:
HaarCascade:Itisthemachinelearningobjectdetectionsalgorithmusedtoidentifytheobjectsin an
imageorthevideoandbasedontheconceptoffeaturesproposedbyPaulViolaandMichael Jonesin
theirpaper"RapidObjectDetectionusingaBoostedCascadeofSimpleFeatures"in 2001.
It is a machine learning-based approach where the cascade function is trained from the lot
of
positive
andnegativeimages.Itisthenusedtodetecttheobjectsinotherimages. The Page 6 | 11
algorithmhasfourstages:
Haar FeatureSelection
Creating Integral Images
AdaboostTraining
Cascading Classifiers
It is well knownfor being able todetect faces andbody parts in animage but can betrained to
identify almost any object.
Q6. What is Transfer Learning?

Answer:
Transferlearning: It is themachinelearning methodwherethemodeldevelopedforataskis
reusedasthestartingpointforthemodelonthesecondtask.
TransferLearningdiffersfromthetraditionalMachineLearninginthat itistheuseofpre-
trained
tmasokdelsthathavebeenusedforanothertasktojump-startthedevelopmentprocessonanew
orproblem.
Page 7 | 11
The benefits of theTransfer Learning are thatit can speedupthe time as it takes to develop and
train themodel by reusing these pieces or modules of already developed models. This helps to
speedupthemodeltraining processandaccelerateresults.
Q7. What is Faster, R-CNN?

Answer:
Faster R-CNN: It has two networks: region proposal network (RPN) for generating region
proposals and a network using these proposals to detect objects. The main difference here
with the Fast R-CNN is that the later uses selective search to generate the region
proposals. The time cost of generating the region proposals is much smaller in the RPN
than selective search, when RPN shares the most computation with object detection
network. In brief, RPN ranks region boxes (called anchors) and proposes the ones most
likely containing objects.
Anchors
Anchors play an very important role in Faster R-CNN. An anchor is the box. In default
configuration of Faster R-CNN, there are nine anchors at the position of an image. The graphs
shown9anchors attheposition (320, 320) of animage withsize (600, 800).
Page 8 | 11
Region Proposal Network:
Theoutputoftheregionproposalnetworkisthebunchofboxes/proposalsthatwillbeexamined
byaclassifierand regressortochecktheoccurrenceofobjectseventually.Tobemore
precise,RPN predictsthepossibility of ananchorbeingbackgroundorforeground,andrefine
theanchor.
Q8. What is RCNN?

Answer:
To bypass the problem of selecting the huge number of regions, Ross Girshick et al.
proposeda methodwhereweusetheselectivesearchtoextractjust2000regionsfromtheimage,and
he calledthemasregionproposals.Therefore,insteadoftryingtoclassifythehugenumberof regions,youcan
workwith2000regions.
Page 9 | 11
Problems withR-CNN:
It still takesthehugeamountof timetotrain thenetworkaswewouldhavetoclassify

2000 region proposals perimage.
It cannotbeimplemented real-timeasit takes around47seconds for eachtestimage.
The selective searchalgorithm is thefixed algorithm.Therefore,nolearning is happening
atthatstage. This leadstothegeneration of thebadcandidate region proposals.
Q9.What is GoogLeNet/Inception?
Answer:
The winner of the ILSVRC 2014 competition was GoogLeNet from Google. It achieved a top-5
error rate of 6.67%! This was very close to human-level performance which the organisers of the
challenge were now forced to evaluate. As it turns out, this was rather hard to do and required some
human training to beat GoogLeNets accuracy. After the few days of training, the human expert
(Andrej Karpathy) was able to achieve the top-5 error rate of 5.1%(single model) and 3.6%
(ensemble). The network used the CNN inspired by LeNet but implemented a novel element which
is dubbed an inception module. It used batch normalisation, image distortions and RMSprop. This
module is based on the several very small convolutions to reduce the number of parameters
drastically. Their architecture consisted of the 22 layer deep CNN but reduced the number of
parametersfrom60million (AlexNet) to4million.
It contains 1×1 Convolution at themiddle of network, andglobal average pooling is used at theend
of the network instead of using the fully connected layers. These two techniques are from another
paper“Network In-Network” (NIN). Another technique, called inceptionmodule,is tohave
different
sizes/typesof convolutions forthesameinput andtostackall theoutputs.
Page 10 | 11
Q10. What is LeNet-5?
Answer:
LeNet-5, a pioneering 7-level convolutional network by the LeCun et al in 1998, that classifies
digits, was applied by several banks to recognise hand-written numbers on checks (cheques)
digitised in 32x32 pixel greyscale input images. The ability to process higher-resolution images
requireslargerandmoreconvolutionallayers,sotheavailabilityofcomputingresourcesconstrains this
technique.
LeNet-5is verysimplenetwork.It onlyhassevenlayers,amongwhichtherearethreeconvolutional

layers(C1, C3 andC5), twosub-sampling(pooling) layers(S2 andS4), andonefully connectedlayer
(F6), thatarefollowedbyoutputlayers. Convolutional layers use5by5convolutionswithstride1. Sub-
sampling layers are2 by 2 average pooling layers. Tanh sigmoid activations are usedto throughoutthe
network.Several interestingarchitectural choices weremadein LeNet-5 that are not very commonin
themoderneraof deeplearning.
Page 11 | 11
DATA SCIENCE
INTERVIEW
PREPARATION
Preparation)
# DAY 15
Page 1 of 12
Q1. What is Autoencoder?
Answer:
Autoencoderneuralnetwork:ItisanunsupervisedMachinelearningalgorithmthatapplies
backpropagation,settingthetargetvalues tobeequaltotheinputs.It istrainedtoattempttocopyits
inputtoitsoutput.Internally,ithasthehiddenlayerthatdescribesacodeusedtorepresenttheinput.
It is trying to learn the approximation to the identity function, to output x̂ ̂ x^thatissimilar

tothexx.
Autoencodersbelongstotheneuralnetworkfamily, buttheyarealsocloselyrelatedtoPCA
(principalcomponentsanalysis).
Autoencoders,althoughit is quitesimilartoPCA, butits Autoencodersaremuchmoreflexible than
PCA. Autoencoders can represent both liners and non-linear transformation in
encoding,butPCA
canperformlineartransformation.Autoencoderscanbelayeredtoformdeeplearningnetworkdue to
itsNetworkrepresentation.
Types of Autoencoders:
1.Denoisingautoencoder
AutoencodersareNeuralNetworkswhichareusedforfeatureselectionandextraction.
However,whentherearemorenodesin hiddenlayerthanthereareinputs,theNetworkis
risking to learn so-called “Identity Function”, also called “Null Function”, meaning
that
outputequalstheinput,markingtheAutoencoderuseless.
Page 2 of 12
Denoising Autoencoders solve this problembycorruptingthedataonpurposeby
randomly turningsomeoftheinputvaluestozero.Ingeneral,thepercentageofinputnodes
which arebeing settozerois about50%. Othersourcessuggestalowercount,suchas30%.
It depends on the amount of data and input nodes you have.
2. Sparseautoencoder
An autoencoder takes the input image orvector andlearns codedictionary that changes the
rawinputfromonerepresentationtoanother.Wherein sparseautoencoderswith asparsity
enforcerthatdirectsasingle-layer networktolearn codedictionary which in turnminimizes
theerrorin reproducingtheinputwhilerestricting numberof code words for reconstruction.
Thesparseautoencoderconsistsasingle hidden layer, which is connected totheinput vector
by aweight matrixforming theencoding step.The hidden layer thenoutputs toa
reconstruction vector, using atied weight matrixtoformthedecoder.
Q2. What Is Text Similarity?

Answer:
Whentalkingabouttextsimilarity,differentpeoplehaveaslightlydifferentnotiononwhattext similaritymeans.
In essence, the goal is to compute how ‘close’ two pieces of text are in (1) meaning or (2)
surfacecloseness.Thefirst is referredtoassemanticsimilarity, andthelatteris referredto aslexical
similarity. Although the methods forlexical similarity areoftenusedtoachiesveemanticsimilarity
(toacertainextent),achievingtruesemanticsimilarityisoftenmuchmore involved.
Page 3 of 12
Lexical or Word Level Similarity
Whenreferringtotextsimilarity,peoplerefertohowsimilarthetwopiecesoftextareatthesurfacelevel. Example-
howsimilararethephrases“the catatethemouse”with“the mouseatethecat food” byjust looking
atthewords?Onthesurface,ifyouconsideronlyword-levelsimilarity,these twophrases(withdeterminers
disregarded)appearverysimilaras3ofthe4uniquewordsarean exactoverlap.
Semantic Similarity:
Another notion of similarity mostly explored by NLP research community is how similar in meaning
are any two phrases? If we look at the phrases, “ the cat ate the mouse ” and “ the mouse ate
the cat food”. As we know that while the words significantly overlaps, these two phrases have
different
meaning. Meaning out of the phrases is often the more difficult task as it requires deeper level of
analysis.Example, wecan actually look at thesimple aspects like order of
words: “cat==>ate==>mouse” and “mouse==>ate==>cat food”. Words overlap in this
case, the
order of the occurrence is different, and we can tell that, these two phrases have different meaning.
This is just the one example. Most people use the syntactic parsing to help with the semantic
similarity. Let’s have alook attheparse trees for these twophrases.What canyouget fromit?
Page 4 of 12
Q3. What is dropout in neural networks?
Answer:
Whenwetrainingourneuralnetwork(ormodel)byupdatingeachofitsweights,itmightbecome toodependent
onthedatasetweareusing.Therefore,whenthismodelhastomakeapredictionor classification,itwillnotgive
satisfactoryresults.Thisisknownasover-fitting.Wemightunderstand
thisproblemthroughareal-world example:Ifastudentofsciencelearnsonlyonechapterofabook
andthentakesatestonthewholesyllabus,hewillprobablyfail.
To overcome this problem, we use a technique that was introduced by Geoffrey Hinton in
2012.This techniqueisknownasdropout.
Dropoutreferstoignoringunits(i.e.,neurons)duringthetrainingphaseofcertainsetofneurons, which
is chosen at random. By “ignoring”, I mean these units are not considered during a
particular forwardorbackwardpass.
At each training stage, individual nodes are either dropped out of the net with probability 1-
porkept withprobabilityp, sothatareducednetworkis left; incoming andoutgoingedgestoa
dropped-out nodearealsoremoved.
Page 5 of 12
Q4. What is Forward Propagation?
Answer:
InputX providestheinformationthatthenpropagatestohiddenunitsateachlayerandthenfinally
producetheoutputy. Thearchitectureofnetworkentailsdeterminingits depth,width,andthe activation
functionsusedoneachlayer. Depthis thenumberofthehiddenlayers. Widthis the numberof
units(nodes)oneachhiddenlayer since wedon’tcontrolneitherinputlayer noroutput layer
dimensions.Therearequiteafewsetof activation functionssuchRectified Linear Unit, Sigmoid,
Hyperbolic tangent,etc. Researchhasproventhatdeepernetworksoutperformnetworks with
morehiddenunits.Therefore,it’salwaysbetterandwon’thurttotrainadeepernetwork.
Q5. What is Text Mining?

Answer:
Text mining: It is also referred as textdatamining, roughly equivalent totext analytics, is theprocess
of deriving high-quality informationfromtext. High-quality informationis typically derived through
thedevising ofpatternsandtrendsthroughmeanssuchasstatistical patternlearning.Text mining
usually involves theprocessof structuringtheinputtext(usually parsing, alongwiththeaddition of
somederivedlinguistic featuresandtheremovalof others,andsubsequentinsertionintoadatabase),
deriving patternswithinthestructureddata,andfinally evaluationandinterpretationof theoutput.
'High quality' in textminingusually referstosomecombinationofrelevance,novelty,andinterest.
Typical textminingtasksincludetextcategorization,textclustering, concept/entityextraction,
productionof granulartaxonomies,sentimentanalysis, documentsummarization,andentityrelation
modeling(i.e., learning relationsbetweennamedentities).
Page 6 of 12
Q6. What is Information Extraction?
Answer:
Information extraction (IE): It is the task of automatically extracting structured information from the
unstructured and/or semi-structured machine-readable documents. In most of the cases, this activity
concernsprocessinghumanlanguagetextsusingnaturallanguageprocessing(NLP).
Informationextractiondepends onnamedentityrecognition(NER),a sub-toolusedtofindtargeted
informationtoextract.NERrecognizesentitiesfirstasoneofseveralcategories,suchaslocation
(LOC), persons (PER), or organizations (ORG). Once the information category is
recognized,an informationextractionutility extractsthenamedentity’s relatedinformationand
constructsa machine-readabledocumentfromit, whichalgorithms canfurtherprocesstoextract
meaning.IE findsmeaningbywayofothersubtasks,including co-referenceresolution,relationship
extraction, language,and vocabularyanalysis,and sometimesaudioextraction.
Page 7 of 12
Q7. What is Text Generation?
Answer:
Text Generation: It is a type of the Language Modelling problem. Language Modelling is the core
problem for several of natural language processing tasks such as speech to text, conversational
system, and the text summarization. The trained language model learns the likelihood of occurrence
of the word based on the previous sequence of words used in the text. Language models can be
operatedatthecharacterlevel, n-gramlevel, sentence level orevenparagraphlevel.
A language model is at thecore of many NLP tasks, andis simply a probability distribution over a
sequenceof words:
It can also be used to estimate the conditional probability of the next word in a
sequence:
Page 8 of 12
Q8. What is Text Summarization?
Answer:
Weall interactwiththeapplicationswhichusesthetext summarization. Many of theapplications arefor
theplatformwhichpublishesarticles onthedaily news,entertainment,sports.Withourbusy
schedule,welike toreadthesummaryof thosearticles beforewedecidetojumpin forreadingentire
article. Reading asummaryhelps us toidentify theinterest area, gives abrief context of thestory.
Text summarizationis asubdomainof Natural Language Processing (NLP) thatdeals withextracting
summariesfromhugechunksof texts.Therearetwomaintypesof techniquesusedfor text
summarization: NLP-based techniques anddeeplearning-basedtechniques.
Text summarization:It referstothetechniqueof shorteninglong piecesof text.Theintentionis to
createthecoherent andfluent summaryhaving only themainpoints outlined in thedocument.
Howtext summarizationworks:
The twotypes of summarization, abstractiveandtheextractivesummarization.
1. Abstractive Summarization:It selectwordsbasedonthesemanticunderstanding;eventhose
wordsdid notappearin thesource documents.It aims atproducingimportantmaterial in the
newway.They interpretsandexamines thetext using advanced natural language techniques
togenerate thenewshorter text that conveysthe most critical informationfromthe original
text.
It canbecorrelatedin thewayhumanreadsthetextarticle orblogpostandthensummarizes in
their word.
2. Extractive Summarization:It attempt tosummarizearticles byselectingthesubset of

words thatretainthemostimportantpoints.
This approach weights the most important part of sentences and uses the same to
formthe summary.Differentalgorithmandthetechniquesareusedtodefinetheweightsfor
the sentencesandfurtherrankthembasedonimportanceandsimilarityamongeachother.
Page 9 of 12
Q9. What is Topic Modelling?
Answer:
Topic Modelling is the taskof using unsupervised learning toextract the maintopics
(representedas asetofwords)thatoccurinacollectionofdocuments.
Topic modeling, in the context of NaturalLanguageProcessing, is described asa
methodof uncoveringhiddenstructureinacollectionoftexts.
Dimensionality Reduction:
Topic modeling is the form of dimensionality reduction. Rather than representing the text T
in its feature space as {Word_i: count(Word_i, T) for Word_i in V}, we can represent the text
inits topic space as(Topic_i: weight(Topic_i, T) forTopic_i inTopics ).
Unsupervised learning:
Topic modeling can be compared to the clustering. As in the case of clustering, the number of
topics,
like the number of clusters, is the hyperparameter. By doing the topic modeling, we build
clustersof wordsratherthanclustersoftexts.Atextisthusamixtureofallthetopics,eachhavinga
certain weight.
A Form of Tagging
If document classification is assigning a single category to a text, topic modeling is
assigningmultiple tagstoatext.A humanexpertcanlabel theresultingtopicswithhuman-
readablelabelsanduse differentheuristicstoconverttheweightedtopicstoasetoftags.
Page 10 of 12
Q10.What is Hidden Markov Models?
Answer:
HiddenMarkovModels(HMMs)aretheclass ofprobabilistic graphical modelthatallowusto
predictthesequenceof unknown(hidden)variables fromthesetof observedvariables. Thesimple
exampleofanHMMis predictingtheweather(hiddenvariable) basedonthetypeofclothesthat
someonewears(observed).An HMMcanbeviewedastheBayes Netunrolledthroughtimewith
observationsmadeatthesequenceof timestepsbeingusedtopredictthebestsequenceof thehidden
states.
Thebelowdiagram from Wikipedia shows that HMM andits transitions. The scenario is the room
thatcontainsurns X1, X2, andX3, eachof which containsaknownmix of balls, eachball labeled y1,
y2, y3, andy4. Thesequenceoffourballs is randomlydrawn. In this particular case, theuser observes
thesequenceofballs y1,y2,y3, andy4andis attemptingtodiscernthehiddenstate,whichis theright
sequenceof threeurnsthatthesefourballs werepulled from.
WhyHidden,MarkovModel?
ThereasonitiscalledtheHiddenMarkovModelisbecauseweareconstructinganinferencemodel
basedontheassumptionsofaMarkovprocess.TheMarkovprocessassumptionissimplythatthe “futureis
independentofthepastgiventhepresent”.
To make this point clear, let us consider the scenario below where the weather, the hidden
variable, can be hot, mild or cold, and the observed variables are the type of clothing worn. The
arrows representtransitionsfromahiddenstatetoanotherhiddenstateorfromahiddenstatetoan
observedvariable.
Page 11 of 12
Page 12 of 12
DATA SCIENCE
INTERVIEW
PREPARATION
Preparation)
# Day-16
Q1.What is Statistics Learning?
Answer:
Statistical learning: It is the framework for understanding data based on the statistics, which
canbe classified as the supervisedor unsupervised. Supervisedstatistical learning involves building the
statisticalmodelforpredicting,orestimating,anoutputbasedononeormoreinputs,while
in unsupervised statistical learning, there are inputs but no supervising output, but we can
learn relationships andstructurefromsuchdata.
Y = f(X) + ɛ ,X = (X1,X2, . . .,Xp),
f : It is an unknown function & ɛ is random error (reducible &
irreducible). Prediction&Inference:
In the situations , where the set of inputs X are readily available, but the output Y is not
known,we oftentreatfastheblackbox(notconcernedwiththeexactformof“f”),aslongasityields
the accuratepredictionsforY.Thisistheprediction.
There are the situations where we are interested in understanding the way that Y is affected as X
change. In this type of situation, we wish to estimate f, but our goal is not necessarily to make the
predictionsforY.HerewearemoreinterestedinunderstandingtherelationshipbetweentheXand
Y. Now f cannot be treated as the black box, because we need to know
it’s exact form. This
is inference.
Parametric&Non-parametric
methods
Parametricstatistics: This statistical testsbasedonunderlyingtheassumptionsaboutdata’s distribution.
In other words,It is based on the parameters of the normal curve. Because parametric
statistics arebasedonthenormalcurve,datamustmeetcertainassumptions, orparametric
statistics cannotbecalculated. Before runninganyparametricstatistics, youshouldalways besure
totestthe assumptionsfortheteststhatyouareplanningtorun.
f(X)=β0+β1X1+β2X2+...+βpXp
As by the name, nonparametric statistics are not based on parameters of the normal
curve. Therefore, if our data violate the assumptions of a usual parametric and nonparametric
statistics mightbetterdefinethedata,tryrunningthenonparametricequivalentoftheparametric
test.We shouldalsoconsiderusingnonparametricequivalenttestswhenwehavelimitedsamplesizes
(e.g., n<30).Thoughthenonparametricstatisticaltestshavemoreflexibilitythandoparametric
statistical tests, nonparametric tests are not as robust; therefore, most statisticians recommend
thatwhenappropriate,parametricstatisticsarepreferred.
.
PredictionAccuracy andModel Interpretability:
Out of many methods that we use for the statistical learning, some are less flexible and more
restrictive . When inference is the goal, then there are clear advantages of using the simple and
relatively inflexible statistical learning methods. Whenwe are only interested in theprediction, we
useflexible modelsavailable.
Q2. What is ANOVA?
Answer:
ANOVA: it stands for “ Analysis of Variance ” is an extremely important tool for
analysis ofdata (bothOneWayandTwoWayANOVA is used).It is astatistical methodto
comparethepopulation meansoftwoormoregroupsbyanalyzingvariance.Thevariancewoulddifferonly
whenthemeans aresignificantlydifferent.
ANOVA test is the way to find out if survey or experiment results are significant. In
otherwords,It helpsustofigureout ifweneedtorejectthenullhypothesisoracceptthealternatehypothesis.
We
aretestinggroupstoseeifthere’sadifferencebetweenthem.Examplesofwhenwemightwantto test
differentgroups:
Thegroupof psychiatricpatientsaretrying threedifferenttherapies:counseling,medication,
andbiofeedback.Wewant toseeifonetherapyisbetterthan theothers.
The manufacturerhastwodifferentprocessestomakelight bulbsif theywanttoknowwhich
oneisbetter.
Studentsfromthedifferentcollegestakethesameexam.Wewanttoseeifonecollege
outperformstheother.
Types of ANOVA:
One-wayANOVA
Two-way
ANOVA
One-wayANOVAisthehypothesistestinwhichonlyonecategoricalvariableorthesingle factor
is taken into consideration. With the help of F-distribution, it enables us to
compare
meansofthreeormoresamples.TheNullhypothesis(H0)istheequityinallpopulation meanswhile
anAlternativehypothesisisthedifferenceinatleastonemean.
There are two-ways ANOVA examines the effect of two independent factors on a dependent
variable. It also studies the inter-relationship between independent variables influencing the
values of thedependent variable, if any.
Q3. What is ANCOVA?

Answer:
Analysis ofCovariance (ANCOVA): It is theinclusion ofthecontinuousvariable in additiontothe
variablesof interest( thedependentandindependentvariable) asmeansforthecontrol.Becausethe
ANCOVA is theextensionoftheANOVA, theresearchercanstill assessmaineffectsandthe
interactionsto answer their research hypotheses. The difference betweenANCOVA andanANOVA
is thatanANCOVA modelincludes the“covariate” thatis correlatedwithdependentvariableand
meansondependentvariable areadjusted duetoeffectsthecovariate has onit. Covariates can also
beusedinmanyANOVAbaseddesigns:suchasbetween-subjects,within-subjects(repeated measures),mixed
(between –andwithin–designs),etc.Thus,thistechniqueanswersthequestion
In simple terms, The difference between ANOVA and the ANCOVA is the letter "C", which
stands for'covariance'.LikeANOVA,"AnalysisofCovariance"(ANCOVA)hasthesinglecontinuous
responsevariable.UnlikeANOVA,ANCOVAcomparestheresponsevariablebyboththefactorand a
continuousindependentvariable(examplecomparingtestscorebyboth'levelofeducation'andthe
'numberofhoursspentinstudying').Thetermsforthecontinuousindependentvariable(IV)usedin theANCOVA
is "covariate".
Example of ANCOVA
Q4. What is MANOVA?
Answer:
MANOVA (multivariate analysis of variance): It is a type of multivariate analysis used to
analyze datathatinvolves morethanonedependent variableatatime.MANOVA allows ustotest
hypotheses regardingtheeffectofoneormoreindependentvariablesontwoormoredependentvariables.
TheobviousdifferencebetweenANOVAandthe"MultivariateAnalysisofVariance"(MANOVA) is
the“M”,whichstandsformultivariate.Inbasicterms,MANOVAisanANOVAwithtwoormore continuous
responsevariables.LikeANOVA,MANOVAhasboththeone-wayflavorandatwo- wayflavor.The
numberoffactorvariablesinvolveddistinguishtheone-wayMANOVAfromatwo-wayMANOVA.
Whencomparingthetwoormorecontinuousresponsevariablesbythesinglefactor,aone-way MANOVAis
appropriate(e.g.comparing‘testscore’and‘annualincome’togetherby‘levelof
education’). The two-way MANOVA also entails two or more continuous response variables, but
compares themby at least two factors (e.g. comparing ‘test score’ and ‘annual income’ together by
both‘level of education’ and‘zodiac sign’).
Q5. What is MANCOVA?

Answer:
Multivariateanalysis ofcovariance(MANCOVA): It is astatistical techniquethatis theextensionof
analysis ofcovariance(ANCOVA). It is themultivariateanalysis ofvariance (MANOVA) witha
covariate(s).). In MANCOVA, weassessforstatistical differencesonmultiplecontinuousdependent
variables byanindependentgroupingvariable, whilecontrolling forathirdvariablecalled the
covariate;multiple covariatescanbeused, depending onthesample size. Covariates are addedso
thatit canreduceerrortermsandsothattheanalysis eliminates thecovariates’ effectonthe
relationship betweentheindependent grouping variable andthecontinuous dependentvariables.
ANOVA and ANCOVA, the main difference between the MANOVA and MANCOVA, is the “C,”
which again stands for the “covariance.” Both the MANOVA and MANCOVA feature two or more
response variables, but the key difference between the two is the nature of the IVs. While the
MANOVA can includeonlyfactors,ananalysisevolvesfromMANOVA toMANCOVA whenone or
morecovariates areaddedtothemix.
Q6. Explain the differences between KNN
classifier and KNN regression methods.
Answer:
They arequite similar. Given avalue forKK andaprediction pointx0x0, KNN regressionfirst
identifiestheKKtrainingobservationsthatareclosestox0x0,representedbyN0.Itthen
estimatesf(x0)usingtheaverageofallthetrainingresponsesinN0.Inotherwords,
So themain difference is the fact that for the classifier approach, the algorithm assumes the outcome
as the class of more presence, and on the regression approach, the response is the average value of
thenearest neighbors.
Q7. What is t-test?

Answer:
To understand T-Test Distribution, Consider the situation, youwant tocomparethe
performanceof twoworkersofyourcompanybycheckingtheaveragesalesdonebyeachofthem,orto
compare theperformanceofa workerbycomparingtheaveragesalesdonebyhimwiththestandardvalue.
In such situations of daily life, t distribution is applicable.
A t-test is the type of inferential statistic used to determine if there is a
significant differencebetween themeansoftwogroups,whichmayberelatedincertain
features.Itismostlyusedwhenthedata
sets,like thedatasetrecordedastheoutcomefromflippingacoin100times,wouldfollowanormal
distributionandmayhaveunknownvariances.A t-testis usedasahypothesistestingtool,which
allowstesting ofan assumptionapplicable toapopulation.
Understandt-testwithExample:Let’ssayyouhaveacold,andyoutryanaturopathicremedy.Your coldlasts
acoupleofdays.Thenexttimewhenyouhaveacold,youbuyanover-the-counter pharmaceutical,andthe
cold lasts a week. You survey your friends, and they all tell you that their colds were of a shorter
duration(anaverageof3days)whentheytookthehomeopathicremedy. Whatyouwanttoknowis,
aretheseresultsrepeatable?At-testcantellyoubycomparingthemeans ofthetwogroupsandlettingyou
knowtheprobabilityofthoseresultshappeningbychance.
Q8. What is Z-test?

Answer:
z-test: It is astatistical testusedtodetermine whetherthetwopopulation meansare different
thevariances areknown,andthesamplesize is large. The teststatistic is assumedtohave thenormal
distribution, andnuisanceparameterssuch as standarddeviation should beknownfor anaccurate z-
testtobeperformed.
AnotherdefinitionofZ-test:AZ-testisatypeofhypothesistest.Hypothesistestingisjusttheway foryou
tofigureoutifresultsfromatestarevalidorrepeatable.Example,ifsomeonesaidtheyhadfoundthenewdrug
thatcurescancer,youwouldwanttobesureitwasprobablytrue.Hypothesis testwilltellyouifit’sprobably
trueorprobablynottrue.AZtestisusedwhenyourdatais approximatelynormallydistributed.
Z-Tests Working :
Tests that can be conducted as the z-tests include one-sample location test, a two-sample
location test,apaireddifferencetest,andamaximumlikelihoodestimate.Z-testsarerelatedtot-
tests, but t- tests are best performed when an experiment has the small sample size. Also, T-tests
assumes the standard deviation is unknown, while z-tests assumes that it is known. If the standard
deviationof thepopulationisunknown,thentheassumptionofthesamplevarianceequalingthe
population varianceismade.
WhenwecanruntheZ-test:
Different typesoftests areusedinthe statistics (i.e.,ftest, chi-square test, t-test). Youwouldusea
Z test if:
Yoursample sizeis greater than 30.Otherwise, use a t-test.
Datapointsshouldbeindependent fromeachother.Someother words,onedata pointisnot
relatedordoesn’taffectanotherdatapoint.
Yourdata shouldbe normally distributed. However,forlarge sample sizes(over
30),this doesn’talwaysmatter.
Yourdata shouldbe randomly selectedfroma population, whereeachitem has
anequal chanceofbeingselected.
Sample sizes should be equal, if at all possible.
Q9. What is Chi-Square test?
Answer:
Chi-square (χ2) statistic: It is a test that measures how expectations compare to actual
observeddata (ormodelresults).Thedatausedincalculatingachi-squarestatisticmustberandom,raw,
mutually exclusive, drawnfromindependentvariables, anddrawnfromalargeenoughsample.For
example, theresultsoftossingacoin100timesmeetthesecriteria.
Chi-square testisintendedto test howitisthatanobserveddistributionisduetochance.Itisalso
called the "goodness of fit" statistic because it measures how well the observed distribution of
thedatafitswiththedistributionthatisexpectedifthevariablesareindependent.
Chi-squaretestisdesignedtoanalyzethecategoricaldata.Thatmeansthatthedatahasbeen
counted anddividedintocategories.Itwillnotworkwithparametricorcontinuousdata(suchasheightin inches).
For example,if youwanttotestwhetherattending class influences howstudents performon an
exam,usingtestscores(from0-100)asdatawouldnotbeappropriateforaChi-squaretest. However,
arrangingstudentsintothecategories"Pass" and"Fail" would.Additionally, thedatain a Chi-square
grid shouldnotbein the formof percentages, oranything otherthanfrequency(count) data.
Q10. What is correlation and the covariance in the statistics?
Answer:
TheCovariance andCorrelation aretwomathematicalconcepts;thesetwoapproachesarewidely
usedin thestatistics. BothCorrelation andtheCovariance establish therelationship andalso
measuresthedependencybetweenthetworandomvariables, thework is similar betweenthesetwo,
in themathematical terms,they aredifferent fromeachother.
Correlation: It is thestatistical technique that canshowwhetherandhowstrongly pairs of variables
arerelated. For example, height andweight arerelated; taller people tendtobeheavier thanshorter
people.Therelationshipisn't perfect.Peopleof thesameheightvary in weight, andyou can easily
thinkoftwopeopleyouknowwheretheshorteroneis heavierthanthetaller one.Nonetheless,the
averageweight of people5'5'' is less thantheaverage weight of people5'6'', andtheir average weight
is less thanthatof people5'7'',etc.Correlationcantell youjusthowmuchof thevariationin peoples'
weights is related totheir heights.
Covariance: It measuresthedirectionalrelationshipbetweenthereturnsontwoassets.The
positive covariancemeansthatassetreturnsmovetogetherwhileanegative covariance means
theymove inversely.Covarianceiscalculatedbyanalyzingat-returnsurprises(standarddeviationsfrom
the expectedreturn)orbymultiplyingthecorrelationbetweenthetwovariablesbythestandard deviationof
eachvariable.
DATA SCIENCE
INTERVIEW
PREPARATION
Preparation)
# DAY 17
Page 1 | 11
Q1. What is ERM (Empirical Risk Minimization)?
Answer:
Empirical risk minimization (ERM): It is a principle in statistical learning theory which defines a
family of learning algorithms andis used to give theoretical bounds on their performance. The idea
is that wedon’t know exactly howwell an algorithm will work in practice (the true "risk") because
wedon't know thetrue distribution of data that thealgorithm will work on, but as an alternative we
canmeasureits performanceonaknownsetof training data.
Weassumedthat our samples comefrom this distribution anduse our dataset as anapproximation.
If wecomputetheloss usingthedatapointsin ourdataset, it’s called empirical risk. It is “empirical”
andnot“true” because weareusing adataset that’s asubsetof thewholepopulation.
Whenour learning model is built, wehave topick afunction that minimizes theempirical risk that
is thedeltabetweenpredictedoutputandactualoutputfordatapointsin thedataset.This processof
finding this function is called empirical risk minimization(ERM). Wewanttominimize thetruerisk.
Wedon’thaveinformationthatallowsustoachievethat,sowehopethatthis empirical risk will
almost bethesameasthetrueempirical risk.
Let’s get abetter understanding by Example
Wewouldwanttobuild amodelthatcandifferentiatebetweenamaleandafemalebasedonspecific
features.If weselect 150 randompeoplewherewomenare really short, andmenare really tall, then
themodelmightincorrectly assumethatheightis thedifferentiating feature.Forbuilding atruly
accuratemodel,wehavetogatherall thewomenandmenin theworldtoextractdifferentiating
features.Unfortunately,thatis notpossible!So weselectasmall numberof peopleandhopethatthis
sampleis representative of thewholepopulation.
Page 2 | 11
Q2. What is PAC (Probably Approximately Correct)?
Answer:
PAC: In computational learning theory, probably approximately correct (PAC)
learning is a frameworkformathematicalanalysisofmachinelearning.
Thelearnerreceivessamplesandmusthavetopickageneralizationfunction(calledthehypothesis)
fromaspecificclassofpossiblefunctions.Ourgoalisthat,withhighprobability,theselected function
will have low generalization error. The learner must be able to learn the concept given
any arbitraryapproximationratio,probabilityofsuccess,ordistributionofthesamples.
Hypothesis class is PAC(Probably Approximately Correct) learnable if there exists a function m_H and
algorithmthatforanylabelingfunctionf,distributionDoverthedomainofinputsX,
delta and epsilon that with m ≥ m_H produces a hypothesis h like that with probability 1-delta it
returnsatrueerrorlowerthanepsilon.Labelingfunctionisnothingotherthansayingthatwehavea specific
function f that labels the data in the domain.
Page 3 | 11
Q3. What is ELMo?
Answer:
ELMo is anovel waytorepresent wordsin vectorsorembeddings. These wordembeddingshelp
achieve state-of-the-art (SOTA) results in several NLP tasks:
It is a deep contextualized word representation that models both complex characteristics

of worduse (e.g., syntax andsemantics), andhowtheseusesvary acrosslinguistic contexts.These
wordvectors arelearnedfunctionsofinternalstatesofadeepbiLM(bidirectional languagemodel),whichispre-
trainedonlargetextcorpus.Theycouldbeeasilyaddedtoexistingmodelsandsignificantlyimprove state
oftheartacrossabroadrangeofchallenging NLP problems,including questionanswering, textual
entailmentandsentimentanalysis.
Page 4 | 11
Q4. What is Pragmatic Analysis in NLP?
Answer:
Pragmatic Analysis(PA): It deals with outside word knowledge, which means understanding i.e
externaltodocumentsandqueries.PAthatfocusesonwhatwasdescribedisreinterpretedbywhatit
actuallymeant,derivingthevariousaspectsoflanguagethatrequirereal-worldknowledge.
It deals with overall communicative and social content and its effect on interpretation. It
meansabstractingthemeaningfuluseoflanguageinsituations.Inthisanalysis,themainfocusalwaysonwhat
wassaidinreinterpretedonwhatisintended.
It helps users to discover this intended effect by applying a set of rules that characterize
cooperative dialogues.
E.g., "close the window?" should be interpreted as a request instead of an order.
Q5. What is Syntactic Parsing?

Answer:
Syntactic Parsing orDependencyParsing: It is ataskofrecognizing asentenceandassigning a
syntacticstructuretoit. MostWidely weusedsyntactic structureis theparsetreewhichcanbe
generatedusingsomeparsingalgorithms.Theseparsetreesareusefulin variousapplicationslike
grammarchecking ormoreimportantly,it playsacritical rolein thesemanticanalysis stage.For
exampletoanswerthequestion“Whois thepoint guardfor theLA Laker in thenext game?” we
needtofigure outits subject, objects, attributesto help us figure out that theuser wants thepoint
guard of theLA Lakers specifically forthenext game.
Example:
Page 5 | 11
Q6. What is ULMFit?
Answer:
TransferLearning in NLP(Natural languageProcessing) is anareathathadnotbeenexploredwith
greatsuccess.But,inMay2018,JeremyHowardandSebastianRudercameupwiththepaper
–UniversalLanguageModelFine-tuningforTextClassification(ULMFit)whichexploresthe
benefitsofusingapretrainedmodelontextclassification. It proposesULMFiT(Universal Language Model
Fine-tuningforText Classification), atransferlearningmethodthatcouldbeappliedtoany taskin
NLP. In this method outperforms the state-of-the-art on six text classification tasks.
ULMFiT uses a regular LSTMwhich is the state-of-the-art language model architecture
(AWD- LSTM). The LSTM network has three layers. Single architecture is used throughout –
forpre-training aswellasforfine-tuning.
ULMFiT achieves the state-of-the-art result using novel techniques like:
Discriminative fine-tuning
Slantedtriangularlearningrates
Gradual unfreezing
Discriminative Fine-Tuning
Page 6 | 11
Different layers of a neural network capture different types of information so they should be fine-
tuned to varying extents. Instead of using the same learning rates for all layers of the model,
discriminative fine-tuning allows ustotuneeachlayer withdifferent learningrates.
Slantedtriangularlearning
Themodelshould quickly convergetoasuitableregionoftheparameterspaceinthe beginningof

training and then later refine its parameters. Using a constant learning rate throughout training is
not thebestwaytoachievethisbehaviour.InsteadSlantedTriangularLearningRates(STLR)linearly increases
thelearningrateatfirstandthenlinearlydecaysit.
Gradual Unfreezing
Gradualunfreezingistheconceptofunfreezingthelayersgradually,whichavoidsthecatastrophicloss of
knowledgepossessedbythemodel.Itfirstunfreezesthetoplayerandfine-tunesalltheunfrozen layersfor1
epoch.Itthenunfreezesthenextlowerfrozenlayerandrepeatsuntilallthelayershavebeen fine-tuneduntil
convergenceatthelastiteration.
Q7. What is BERT?

Answer:
BERT (Bidirectional Encoder Representations from Transformers) is an open-
sourcedNLPpre- training modeldevelopedbyresearchersatGooglein 2018.Adirectdescendantto
GPT (Generalized Language Models), BERT has outperformed several models in NLP and
providedtopresultsin QuestionAnswering,NaturalLanguageInference(MNLI),andother
frameworks.
Whatmakesit’suniquefromtherestofthemodelisthatitisthefirstdeeplybidirectional, unsupervised
languagerepresentation,pre-trainedusingonlyaplaintextcorpus.Sinceit’sopen-
sourced,anyonewithmachinelearningknowledgecaneasilybuildanNLPmodelwithouttheneed for
sourcing massive datasets for training the model, thus saving time, energy, knowledge and
resources.
Page 7 | 11
How does it work?
Traditionalcontext-freemodels(like word2vecorGloVe) generateasingle wordembedding
representationforeachwordinthevocabularywhichmeanstheword“right” wouldhavethesame
context-freerepresentationin“I’msureI’mright”and“Takea rightturn.”However,BERTwould
representbasedonbothpreviousandnextcontext,makingitbidirectional.Whiletheconceptof
bidirectional wasaroundfora long time, BERT wasfirst onits kind to successfully pre-train
bidirectionalinadeepneuralnetwork.
Q8.What is XLNet?
Answer:
XLNet is a BERT-like model instead of a totally different one. But it is an auspicious and
potential one.Inoneword,XLNetisageneralizedautoregressivepretrainingmethod.
Autoregressive(AR) languagemodel:It is akind ofmodelthatusing thecontextwordtopredictthe
nextword.Butherethecontextwordisconstrainedtotwodirections,eitherforwardorbackwards.
The advantages of AR language model are good at generative Natural language Process(NLP) tasks.
Because when generating context, usually is the forward direction. AR language model naturally
workswell onsuch NLP tasks.
Page 8 | 11
But Autoregressive language modelhassomedisadvantages, andit only canuseforwardcontext
or backwardcontext,whichmeansitcan'tuseforwardandbackwardcontextatthesametime.
Q9. What is the transformer?

Answer:
Transformer:It is adeepmachinelearning modelintroducedin2017,usedprimarilyinthefield of
naturallanguageprocessing(NLP).Likerecurrentneuralnetworks(RNN),Itisdesignedtohandle ordered
sequencesofdata,suchasnaturallanguage,forvarioustaskslikemachine
translationandtextsummarization.However,Unlikerecurrentneuralnetworks(RNN),Transformers
donotrequirethatthesequencebeprocessedintheorder.So,ifthedatainquestionisanatural language,the
Transformerdoesnotneedtoprocessthebeginningof asentencebeforeit processes theend.Dueto
thisfeature,theTransformerallowsformuchmoreparallelizationthanRNNsduring training.
Transformersaredevelopedtosolvetheproblemofsequencetransductioncurrentneuralnetworks.It
meansanytaskthattransformsaninputsequencetoanoutputsequence.Thisincludesspeech recognition,
text-to-speechtransformation,etc.
Formodelsto performa sequence transduction, it is necessary to have somesortof
memory. example,letussaythatwearetranslatingthefollowingsentencetoanotherlanguage
(French):
Page 9 | 11
“The Transformers”is aJapaneseband.Thatbandwasformedin 1968,duringtheheightof the
Japanese music history.”
In theaboveexample, theword“the band” in thesecondsentence refers totheband“The
Transformers” introduced in thefirst sentence. Whenyoureadaboutthebandin thesecondsentence,
youknowthatit is referencing tothe“The Transformers” band. That maybeimportantfor translation.
For translating other sentences like that, a model needs to figure out these sort of dependencies and
connections. Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) have
beenusedtodeal withthis problembecauseof their properties.
Q10. What is Text summarization?

Answer:
Text summarization:It is the processofshortening a text document, to createa summaryof
the significantpointsoftheoriginaldocument.
Types of Text Summarization Methods :
Textsummarizationmethodscanbeclassifiedintodifferenttypes.
Page 10 | 11
Exampl
e:
Page 11 | 11
DATA SCIENCE
INTERVIEW
PREPARATION
Preparation)
# Day-18
Q1. What is Levenshtein Algorithm?
Answer:
Levenshtein distance is a string metric for measuring the difference between two sequences. The
Levenshtein distance between two words is the minimum number of single-character edits (i.e.
insertions,deletions orsubstitutions) required tochange onewordinto theother.
By Mathematically, the Levenshtein distance between the two strings a, b (of length |a|and |b|
respectively) is given by theleva, b( |a|, |b|) where:
Where,1(ai≠bi): This is the indicator function equal to zero when ai≠ bi and
equalto1
otherwise,
andleva,b(i,j)isthedistancebetweenthefirsticharactersofaandthefirstjcharactersofb. Example:
TheLevenshteindistancebetween"HONDA"and"HYUNDAI"is3,sincethefollowingthreeedits
changeoneintotheother,and thereisno waytodo itwithfewerthan threeedits:
Q2. What is Soundex?
Answer:
Soundexattemptstofind similar namesorhomophonesusing phoneticnotation.Theprogramretains
lettersaccordingtodetailedequations,tomatchindividual titles for purposesofamplevolume
research.
Soundexphoneticalgorithm:Its indexesstrings dependontheirEnglish pronunciation.The algorithmis
usedtodescribehomophones,wordsthatarepronouncedthesame,butspeltdifferently.
Suppose we have the following sourceDF.
Let’s runbelowcodeandseehowthesoundex algorithm encodes theabovewords.

Let’s summarizetheaboveresults:
"two"and"to"bothareencoded as T000
"break" and"brake" bothareencodedasB620
"hear" and"here" both areencoded asH600
"free" is encodedas F600 and"tree" is encodedas T600: Encodings aresimilar, butwordis
different
The Soundex algorithm wasoftenusedtocomparefirst namesthatwerespelt differently.
Q3. What is Constituency parse?

Answer:
A constituency parse tree breaks a text into sub-phrases. Non-terminals in the tree are types of
phrases, the terminals are the words in the sentence, andthe edges are unlabeled. For a simple
sentence, "Johnsees Bill", aconstituency parsewouldbe:
Aboveapproachesconverttheparsetreeintoasequencefollowingadepth-firsttraversaltobeable
toapplysequence-to-sequencemodelstoit.Thelinearizedversionoftheaboveparsetreelooksas follows:(S
(N) (VP V N)).
Q4. What is LDA(Latent Dirichlet Allocation)?

Answer:
LDA: It is used to classify text in the document to a specific topic. LDA builds a
topicperdocument modelandwordspertopicmodel,modelledasDirichletdistributions.
Each documentis modeledasadistributionof topics, andeachtopicis modelled
as multinomialdistributionofwords.
LDA assumes that every chunk of text we feed into it will contain
wordsthataresomehow related.Thereforechoosingtherightcorpusofdataiscrucial.
It also assumes documents are produced from a mixture of topics. Those topics
thengeneratewordsbasedontheirprobabilitydistribution.
TheBayesianversionofPLSAisLDA.ItusesDirichletpriorsfortheword-topicanddocument- topic
distributions,lendingitselftobettergeneralization.
WhatLDAgiveus?
It is a probabilistic method. For every document, the results give us a mixture of topics that
makeup thedocument.Tobeprecise,wecangetprobabilitydistributionoverthektopicsforevery document.
Every word in the document is attributed to the particular topic with probability given by
distribution.
Thesetopicsthemselvesweredefinedasprobabilitydistributionsovervocabulary.Ourresultsare
twosetsofprobabilitydistributions:
Thecollectionofdistributionsoftopicsforeachdocument
Thecollectionofdistributionsofwordsforeachtopic.
Q5.What is LSA?
Answer:
Latent SemanticAnalysis(LSA):Itisatheory andthe methodforextract andrepresents the
contextualusagemeaningofwordsbystatisticalcomputationappliedtolargecorpusoftexts.
It is an informationretrieval technique which analyzes and identifies the pattern in an
unstructured collectionoftextand relationshipbetweenthem.
LatentSemanticAnalysis itself is anunsupervisedwayofuncoveringsynonymsin acollection of
documents.
WhyLSA(LatentSemanticAnalysis)?
LSA is a technique for creating vector representation of the document. Having a
vector representation of the document gives us a way to compare documents for their similarity by
calculating the distance between vectors. In turn, means we can do handy things such as classify
documentstofindoutwhichofasetknowstopicstheymostlikelyresideto.
Classification implies we have some known topics that we want to group documents into,
andthat youhavesomelabelledtrainingdata.Ifyou'regoingtoidentifynaturalgroupingsofthedocuments
withoutanylabelleddata,youcanuseclustering
Q6. What is PLSA?
Answer:
PLSA standsforProbabilistic LatentSemantic Analysis, uses aprobabilistic methodinstead of SVD
totackle problem. The main idea is tofind theprobabilistic model with latent topics thatwe
can generate dataweobserve in ourdocument termmatrix. Specifically, wewantamodelP(D, W)
suchthatfor any documentdandwordw,P(d,w) correspondstothatentryin document-termmatrix.
Each documentis found in themixture of topics, andeachtopic consists of thecollection of words.
PLSA addstheprobabilisticspin totheseassumptions:
Given documentd, topicz is available in thatdocumentwiththeprobability P(z|d)
Given thetopicz, wordwis drawnfromz withprobability P(w|z)
Thejointprobabilityofseeingthegivendocumentandwordtogetheris:
In the above case, P(D), P(Z|D), and P(W|Z) are the parameters of our models.P(D)
canbe determineddirectlyfromcorpus.P(Z|D)andtheP(W|Z)aremodelledasmultinomial
distributionsandcanbetrainedusingtheexpectation-maximisationalgorithm(EM).
Q7. What is LDA2Vec?

Answer:
It is inspired by LDA, word2vec model is expanded to simultaneously learn word,
document,topic andparagraphtopicvectors.
Lda2vec is obtainedbymodifying theskip-gramword2vecvariant.In theoriginal skip-gram
method, themodelistrainedtopredictcontextwordsbasedonapivotword.Inlda2vec,thepivotwordvector
andadocumentvectorareaddedtoobtainacontextvector.Thiscontextvectoristhenusedtopredict context
words.
At the document level, weknow how to represent thetext as mixtures of topics. At theword-level,
we typically used something like word2vec to obtain vector representations. It is an extension of
word2vec andLDA thatjointly learnsword, document, andtopicvectors.
Howdoes itwork?
It correctly builds ontopof theskip-gram modelof word2vectogenerate wordvectors.Neural net
thatlearns wordembedding by trying touseinput wordtopredict enclosing context words.
With Lda2vec, otherthan using the word vector directly to predict context words, you leverage
acontextvectortomakethepredictions.Context vector is createdas thesumof twoothervectors:
thewordvector andthedocumentvector.
Thesameskip-gram word2vecmodelgeneratesthewordvector.Thedocumentvectoris most
impressive.It is areally weightedcombination of two othercomponents:
thedocumentweightvector,representingthe“weights”of eachtopicin adocument
Topic matrixrepresentseachtopicandits corresponding vector embedding.
Together, adocumentvector andwordvector generate“context” vectorsforeachwordin a
document.lda2vecpowerlies in thefactthatit notonly learnswordembeddingsforwords;it
simultaneously learnstopicrepresentations anddocumentrepresentations as well.
Q8. What is Expectation-Maximization Algorithm(EM)?
Answer:
TheExpectation-Maximization Algorithm, in short,EM algorithm,is anapproachformaximum
likelihood estimation in thepresenceof latent variables.
This algorithmis aniterative approachthatcycles betweentwomodes.Thefirst modeattemptsto
predict themissing orlatent variables called theestimation-stepor E-step. The secondmodeattempts
tooptimisetheparametersof themodeltoexplain thedatabestcalled themaximization-step or M-
step.
E-Step. Estimatethemissing variables in thedataset.
M-Step. Maximize theparametersof themodelin thepresenceof thedata.
TheEM algorithmcanbeappliedquitewidely, althoughit is perhapsmostwell knownin machine
learning forusein unsupervised learning problems,such as density estimationandclustering.
For detail explanationofEM is, letusfirst considerthisexample.Say thatwearein aschool,and
interested tolearntheheight distributionof female andmalestudentsin theschool. The mostsensible
thing to do, as weprobably would agree with me,is torandomly take a sample of N students of both
genders, collect their height information and estimate the meanand standard deviation for male and
femaleseparately by way of maximumlikelihoodmethod.
Nowsay thatyouarenotable toknowthegender of studentwhile wecollect their height information,
andsotherearetwothingsyouhavetoguess/estimate:(1) whethertheindividual sampleofheight
informationbelongs to amale or afemale and(2) theparameters (μ, θ) for each gender which is
nowunobservable. This is tricky becauseonly with theknowledge of whobelongs towhichgroup,
canwemakereasonable estimates of thegroup parametersseparately.Similarly, only if weknowthe
parametersthatdefinethegroups,canweassign asubjectproperly.Howdoyoubreakoutof this
infinite loop?Well, EM algorithm just says tostartwithinitial randomguesses.
Q9.What is Text classification in NLP?
Answer:
Text classification is also knownas text tagging or text categorization is a process of categorizing
textintoorganizedgroups.ByusingNLP,textclassificationcanautomaticallyanalyzetextandthen assigna
setofpre-definedtagsorcategoriesbasedoncontent.
Unstructuredtextiseverywhereontheinternet,suchasemails, chatconversations, websites,andthe
social media but it’s hardto extract value fromgiven data unless it’s organizedin a certain
way. Doing sousedtobeadifficult andexpensive processsinceit required spending timeand
resources tomanuallysortthedataorcreatinghandcraftedrulesthataredifficulttomaintain.Text
classifiers with NLP have proven to be agreat alternative to structure textual data in afast,
cost-effective, and scalable way.
Text classification isbecoming an increasingly important part ofbusinesses asit allows us to
get insightsfromdataand automatebusinessprocessesquickly.Someofthemost common
examplesandtheusecasesforautomatictextclassificationincludethefollowing:
SentimentAnalysis:Itistheprocessofunderstandingifagiventextistalkingpositivelyor
negativelyabout agivensubject(e.g.forbrandmonitoringpurposes).
Topic Detection: Inthis, the task ofidentifying the theme ortopic ofapiece oftext
(e.g. knowifaproductreviewisaboutEaseofUse,CustomerSupport,orPricingwhen
analysing
customerfeedback).
LanguageDetection:theprocedureof detectingthelanguageof agiven text(e.g.knowif an
incomingsupportticketiswritteninEnglishorSpanishforautomaticallyroutingticketsto
theappropriateteam).
Q10. What is Word Sense Disambiguation (WSD)?
Answer:
WSD (Word Sense Disambiguation) is a solution to the ambiguity which arises due to
different meaningofwordsinadifferentcontext.
In natural language processing, word sense disambiguation (WSD) is the problem of
determining which"sense"(meaning)ofawordisactivatedbytheuseofthewordinaparticular
context,a
processwhichappearstobemostlyunconsciousinpeople.WSDisthenaturalclassificationproblem: Given
awordanditspossiblesenses,asdefinedbythedictionary,classifyanoccurrenceoftheword inthecontextinto
oneormoreofitssenseclasses.Thefeaturesofthecontext(suchasthe neighbouringwords)providetheevidencefor
classification.
Forexample,considerthese twobelowsentences.
“ The bank will notbe accepting the cashon
Saturdays.” “Theriveroverflowedthebank.”
The word “ bank “ in the given sentence refers to commercial (finance) banks, while in the second
sentence,itreferstoariverbank.Theuncertaintythatarises,duetothisistoughforthemachineto detectand
resolve.Detectionofchangeisthefirstissueandfixingitanddisplayingthecorrectoutput isthesecond
issue.
DATA SCIENCE
INTERVIEW
PREPARATION
Preparation)
# DAY 19
Page 1 | 12
Q1. What is LSI(Latent Semantic Indexing)?
Answer:
Latent Semantic Indexing (LSI): It is an indexing and retrieval method that uses a
mathematical technique called SVD(Singular value decomposition) to find patterns in relationships
betweenterms andconceptscontainedinanunstructuredcollectionoftext.Itisbasedontheprinciplethatwords
thatareusedinthesamecontextstendtohavesimilarmeanings.
For example, Tiger and Woods are associated with men instead of an animal, and a Wood,
Parris, and Hilton are associated with the singer.
Example:
If you use LSI to index a collection of articles and the words “fan” and “regulator” appear
togetherfrequently enough,thesearchalgorithmwouldnoticethatthetwotermsaresemanticallyclose.Asearchfor
“fan” will, therefore, return a set of items containing that phrase, but also items that contain just the word
“regulator”.It doesn't understandworddistance,but by examiningasufficient numberof documents,itonly
knowsthetwo termsareinterrelated.Itthenusesthatinformationtoprovideanexpandedsetofresultswithbetter
recallthan anunderstandable keywordsearch.
ThediagrambelowdescribestheeffectbetweenLSIandkeywordsearches.Wstandsforadocument.
Page 2 | 12
Q2. What is Named Entity Recognition? And tell some use cases of
NER?
Answer:
Named-entityrecognition (NER): It is also knownasentityextraction,andentityidentification is a
subtaskof informationextractionthatexploretolocate andclassify atomicelements in
textintopredefinedcategorieslike thenamesof persons,organizations,places, expressionsof
times,quantities, monetary values, percentagesandmore.
In eachtext document, particular termsrepresent specific entities thatare moreinformative andhave
adifferent context. These entities arecalled namedentities, which moreaccurately refer toconditions
that represent real-world objects like people, places, organizations or institutions, andso on, which
areoftenexpressedbypropernames.The naive approach could betofind these by having alook at
thenounphrasesin textdocuments.It also is knownasentitychunking/extraction,whichis apopular
technique used in information extraction toanalyze andsegment thenamedentities andcategorize
orclassify themundervarious predefined classes.
Named Entity Recognition use-case
Classifyingcontent for news providers-

NER can automatically scan entire articles and reveal which are the significant people,
organizations, and places discussed in them. Knowing the relevant tags for each item helps
in automatically categorizing thearticles in defined hierarchies andenable smoothcontent
discovery.
Customer Support:
Let’s say weare handling thecustomer support departmentof anelectronics store with
multiple branches worldwide;wego through anumbermentionsin ourcustomers’ feedback.
Suchas this for instance.
Page 3 | 12
Now, if we pass it through the Named Entity Recognition API, it pulls out the entities Bangalore
(location) and Fitbit (Product). This can be then used to categorize the complaint and assign it to
therelevantdepartmentwithintheorganizationthatshouldbehandlingthis.
Q3. What is perplexity?

Answer:
Perplexity: It is ameasurement ofhowwell aprobability modelpredicts asample.In thecontext
of NLP,perplexity(Confusion)isonewaytoevaluatelanguagemodels.
Thetermperplexity hasthreeclosely relatedmeanings.It is ameasureofhoweasyaprobability
distributionistopredict.Itisameasureofhowvariableapredictionmodelis.AndItisameasure
ofpredictionerror.Thethirdmeaningofperplexityiscalculatedslightlydifferently,butallthree
havethesamefundamentalidea.
Page 4 | 12
Q4. What is the language model?
Answer:
LanguageModelling (LM): It is oneof theessentialpartsof modernNLP. Therearemanysortsof
applicationsforLanguageModelling, like Machine Translation, Spell Correction Speech
Recognition, Summarization, Question Answering, Sentiment analysis, etc. Each of those tasks
requirestheuseof thelanguagemodel.The languagemodelis neededtorepresentthetexttoaform
understandablefromthemachinepointof view.
Thestatistical languagemodelis aprobability distributionoveraseriesof words.Given sucha
series,
say of lengthm, itassigns aprobability tothewholeseries.
It provides context todistinguish betweenphrases andwords that sounds aresimilar. For example,
mineAanmericanEnglish, thephrases " wreckanice beach" and"recognizespeech" soundalike but
differentthings.
Data sparsity is a significant problem in building language models. Most possible word sequences
are notnoticed in training. One solution is to makethe inference that theprobability of a word only
depends ontheprevious nwords. This is called as ann-gram model or unigram model whenn = 1.
The unigrammodelis also knownasthebag of wordsmodel.
Howdoesthis L anguageModel help in NL P Tasks?
The probabilities restoration by a language model is most useful to compare the likelihood that
different sentences are"good sentences." This was useful in manypractical tasks, for example:
Spell checking: You observe a word that is not identified as a known word as part of a sentence.
aUrseingtheedit distance algorithm, wefind theclosestknownwordstotheunknownwords.These
the candidate corrections. For example, we observe the word "wurd" in the context of the sentence,
"I like to write this wurd." The candidate corrections are ["word", "weird", "wind"]. How can we
select amongthesecandidates themostlikely correctionforthesuspectederror"weird"?
AutomaticSpeechRecognition:wereceive as input astring of phonemes; a first modelpredicts
for
sub-sequencesof thestreamof phonemescandidatewords;thelanguagemodelhelpsin rankingthe
mostlikely sequenceof wordscompatiblewiththecandidate wordsproducedby theacoustic
model.
Machine Translation: each word from the source language is mappedto multiple candidate words
in thetarget language; the language model in thetarget language can rank the most likely sequence
of candidate target words.
Page 5 | 12
Q5. What is Word Embedding?
Answer:
A word embedding is a learned representation for text where words that have
thesamemeaninghave asimilarobservation.
It is basically a form of word representation that bridges the human understanding of
languagetothat ofamachine.Wordembeddingsdivide representationsoftextinann-dimensional
space.Theseare essentialforsolvingmost NLPproblems.
And the other point worth considering is how we obtain word embeddings as no two sets of
word embeddingsaresimilar.Wordembeddingsaren'trandom;they'redevelopedbytrainingtheneural
network. A recent powerful word embedding usage comes from Google named Word2Vec, which is
trainedbypredictingseveralwordsthatappearnexttootherwordsinalanguage.Forexample,the word
"cat", the neural network would predict the words like "kitten" and "feline." This intuition of
wordscomesout"near"eachotherallowsustoplacetheminvectorspace.
Q6. Do you have an idea about fastText?

Answer:
fastText: It is anotherwordembedding methodthatis anextension of theword2vec
model. Alternatively, learning vectors for words directly. It represents each wordas ann-gram of
characters. So, for example, take theword, “artificial” with n=3, thefastText representation of this
wordis <ar, art,rti, tif, ifi, fic, ici, ial, al>, wheretheangular brackets indicate thebeginning andend of
theword.
This helps to capture the meaning of shorter words and grant the embeddings to understand prefixes
andsuffixes. Once the word has beenshowed using character skip-grams, a n-gram model is trained
tolearn theembeddings. This model is acknowledged tobeabag of words model with asliding
Page 6 | 12
windowoverawordbecausenointernalstructureofthewordistakenintoaccount.Aslongasthe
charactersarewithinthiswindow,theorderofthen-gramsdoesn’tmatter.
fastText workswell withrarewords.So evenif awordwasn’tseenduringtraining, it canbebroken
downinton-gramstogetitsembeddings.
Word2vecandGloVe bothfail toprovideanyvectorrepresentationforwordsthatarenotin the
modeldictionary.Thisisahugeadvantageofthismethod.
Q7. What is GloVe?

Answer:
GloVe(global vectors) is for word representation. GloVe is an unsupervised learning
algorithm developed by Stanfordfor achieving word embeddings by aggregatinga global word-
word co- occurrence matrix from a corpus. The resulting embeddings show interesting linear
substructuresof thewordinvectorspace.
The GloVe modelproducesavectorspacewithmeaningfulsubstructure,asevidencedby its
performanceof75%onanewwordanalogytask.Italsooutperformsrelatedmodelsonsimilarity tasks
and namedentityrecognition.
How GloVe find meaning in statistics?
Producesavectorspacewithmeaningfulsubstructure,asevidencedbyitsperformanceof75%on a
newwordanalogytask.Italsooutperformsrelatedmodelsonsimilaritytasksandnamedentityrecognition.
Page 7 | 12
GloVe aims to achieve two goals:
(1) Createwordvectorsthatcapturemeaninginvectorspace
(2) Takes advantage of globalcountstatisticsinstead of only local
information Unlike word2vec – which learns by streaming sentences – GloVe
determinesbasedonaco- occurrence matrix andtrainswordvectors,sotheirdifferencespredictco-
occurrence ratios GloVe weights the loss based on wordfrequency.
Somewhatsurprisingly,word2vecandGloVeturnouttoberemarkablysimilar,despitestartingoff from
entirelydifferent startingpoints.
Q8. Explain Gensim?

Answer:
Gensim: It is billed asaNatural Language Processing package thatdoes‘Topic Modelingfor
Humans’. Butits practically muchmorethanthat.
If you are unfamiliar with topic modeling, it is a technique to extract the underlying topics from
large volumes of text. Gensim provides algorithms like LDA and LSI (which we already seen in
previous interview questions) andthenecessary sophisticationtobuilt high-quality topicmodels.
Page 8 | 12
It is an excellent library package for processing texts, working with word vector models (such as
FastText, Word2Vec, etc) and for building the topic models. Another significant advantage with
gensimis: it lets us handle large text files without having toloadtheentire file in memory.
Wecan also tell as It is an open-source library for unsupervised topic modeling and natural
language processing,using modernstatistical machinelearning.
Gensimis implementedin PythonandCython. Gensimis designedtohandleextensivetext

collections using datastreamingandincrementalonlinealgorithms,whichdifferentiatesit from
mostothermachine learning software packages thattarget only in-memoryprocessing.
Page 9 | 12
Q9. What is Encoder-Decoder Architecture?
Answer:
The encoder-decoderarchitecture consists of two mainparts :
Encoder:
Encodersimply takestheinputdata,andtrainsonit, thenit passesthefinal stateof its
recurrent layer asaninitial statetothefirst recurrent layer of thedecoder part.
Decoder:
The decoder takes the final state of encoder’s final recurrent layer and uses it as an initial
statetoitsinitial,recurrentlayer,theinputofthedecoderissequencesthatwewanttoget French
sentences.
Some more example for better

understanding:
Page 10 | 12
Q10. What is Context2Vec?
Answer:
Assumeacasewhereyouhaveasentencelike. Ican’t find May. WordMaymaybereferstoamonth's
nameoraperson'sname.Youusethewordssurroundit(context)tohelpyourselftodeterminethe bestsuitableoption.
Actually, this problem refers tothe WordSense Disambiguation task, onwhich youinvestigate
theactualsemanticsofthewordbasedonseveralsemanticandlinguistictechniques. TheContext2Vecideaistaken
fromtheoriginal CBOWWord2Vecmodel,butinsteadofrelying on averagingtheembeddingofthewords,
it relies on a much more complex parametric model that is based on one layer of Bi-LSTM.
Figure1 shows the architecture of the CBOW model.
Figure1
Page 11 | 12
Context2Vecappliedthesameconceptof windowing,butinsteadofusing asimpleaverage
function,ituses3stagestolearncomplexparametricnetworks.
A Bi-LSTM layer that takes left-to-right and right-to-left representations
A feedforward network that takes the concatenated hidden
representationandproducesa hiddenrepresentationthroughlearningthenetwork
parameters.
Finally, we apply the objective function to the network output.
WeusedtheWord2Vecnegativesampling ideatoget betterperformancewhile calculating

theloss value.
The following aresomesamples of theclosestwordstoagiven context.
Page 12 | 12
DATA SCIENCE
INTERVIEW
PREPARATION
Preparation)
# DAY 20
Q1. Do you have any idea about Event2Mind in NLP?
Answer:
Yes, it is basedonNLP research papertounderstand thecommon-senseinference fromsentences.
Event2Mind: Common-senseInferenceonEvents,Intents, and Reactions
The study of “Commonsense Reasoning” in NLP deals with teaching computers howto gain and
employcommonsenseknowledge.NLP systemsrequirecommonsensetoadaptquickly and
understand humansaswetalk toeachotherin anatural environment.
This paper proposesa newtask toteachsystemscommonsensereasoning: given aneventdescribed
in ashort“eventphrase”(e.g. “PersonX drinks coffee in themorning”), theresearchers teach asystem
toreasonaboutthelikely intents(“PersonX wantstostay awake”) and reactions (“PersonX feels
alert”) of theevent’s participants.
Understandinganarrativerequirescommon-sensereasoningaboutthementalstatesof peoplein
relationtoevents.Forexample,if “Robertis dragginghis feetatwork,”pragmaticimplications about
Robert’s intentarethat“Robertwantstoavoiddoingthings”(AboveFig).Youcanalsoinferthat Robert’s
emotionalreactionmightbefeeling “bored”or“lazy.” Furthermore,whilenotexplicitly mentioned,you
canassumethatpeopleotherthanRobertareaffectedby thesituation, andthese peoplearelikelyto
feel“impatient”or“frustrated.”
Page 1 | 12
This type of pragmatic inference can likely be useful for a wide range of NLP applications that
require accurateanticipationofpeople’sintentsandemotionalreactions,evenwhentheyarenotexpressly
mentioned.Forexample,anidealdialoguesystemshouldreactinempatheticwaysbyreasoning aboutthe
humanuser’smentalstatebasedontheeventstheuserhasexperienced,withouttheuser explicitlystatinghowthey
arefeeling.Furthermore,advertisementsystemsonsocialmediashould beabletoreasonabouttheemotional
reactionsof peopleaftereventssuchasmassshootingsand removeadsforguns,whichmight
increasesocial distress.Also, thepragmaticinferenceis a necessarysteptowardautomatic
narrativeunderstandingandgeneration.However,this typeof commonsensesocial reasoninggoes
far beyondthewidely studied entailment tasksandthusfalls outside thescopeofexisting
benchmarks.
Q2. What is SWAG in NLP?

Answer:
SWAG stands for SituationswithAdversarial Generationsisadatasetconsistingof113kmultiple-
choicequestionsaboutarichspectrumofgroundedsituations.
Swag:ALargeScaleAdversarialDatasetforGroundedCommonsenseInference
According to NLP research paper on SWAG is “Given a partial description like “he opened the
hood of the car,” humans can reason about the situation and anticipate what might come next (“then, he
examinedtheengine”).Inthispaper,youintroducethetaskofgroundedcommonsense inference,unifying
naturallanguageinference(NLI),andcommon-sensereasoning.
We present SWAG, a dataset with 113k multiple-choice questions about the rich spectrum of grounded
positions.Toaddressrecurringchallengesofannotationartifactsandhumanbiasesfound inmany
existing datasets, we propose AF(Adversarial Filtering), a novel procedure that constructs a de-
biaseddatasetbyiterativelytraininganensembleofstylisticclassifiers,andusingthemtofilter thedata.
To account for the aggressive adversarial filtering, we use state-of-the-art languagemodels to
oversample a diverse set of potential counterfactuals massively. Empirical results present that while
humanscansolvetheresultinginferenceproblemswithhighaccuracy(88%),various
competitive modelsmakeaneffortonourtask. Weprovide a comprehensive analysis that indicates
significantopportunitiesforfutureresearch.
Whenwereadatale,webringtoitalargebodyofimpliedknowledgeaboutthephysicalworld.
For
instance,giventhecontext“onstage,amantakesaseatatthepiano,”wecaneasilyinferwhatthe
situationmightlooklike:amanisgivingapianoperformance,withacrowdwatchinghim.Wecan
furthermoreinfer himlikely next action: hewill mostlikely sethis fingers onthepianokey and
start
playing.
Page 2 | 12
This type of natural language inference(NLI) requires common-sensereasoning,
substantially broadeningthescopeofpriorworkthatfocusedprimarilyonlinguisticentailment.
Whereasthe
dominantentailmentparadigmasksif2naturallanguagesentences(the‘premise’andthe ‘hypothesis’)
describethesamesetofpossibleworlds,herewefocusonwhethera (multiple-choice)
endingrepresentsapossible(future) worldthatcanafromthesituationdescribedin thepremise,
evenwhenitisnotstrictlyentailed.Makingsuchinferencenecessitatesarichunderstandingof everyday
physical conditions, including objectaffordancesandframesemantics.
Page 3 | 12
Q3. What is the Pix2Pix network?
Answer:
Pix2Pix network: It is a Conditional GANs (cGAN) that learn the mapping from an input
imageto outputan image.
Image-To-Image Translation is the process for translating one representation of the image into
anotherrepresentation.
The image-to-image translation is another example of a task that GANs (Generative Adversarial
Networks) areideally suitedfor. These aretasks in whichit is nearly impossible tohard-code aloss
function. Studies on GANs are concerned with novel image synthesis, translating from a random
vector z into an image. Image-to-Image translation converts one image to another like the edges of
thebag belowtothephotoimage.Anotherexciting example of this is shownbelow:
In Pix2Pix Dual Objective Function withan Adversarial andL1 Loss

A naive way to do Image-to-Image translation would be to discard the adversarial framework
altogether. A source image would just be passed through a parametric function, and the difference in
theresulting image andthegroundtruthoutputwould beusedtoupdatetheweights of the
network. However, designing this loss function with standard distance measures such as L1 and L2
will fail tocapture manyof theessential distinctive characteristics betweenthese images. However,
Page 4 | 12
authors dofind somevalue totheL1 loss function as aweighted sidekick totheadversarial loss
function.
The Conditional-Adversarial Loss (Generator versus Discriminator) is very popularly formatted as
follows:
TheL1lossfunctionpreviouslymentionedisshownbelow:
Combining these functions

resultsin:
In the experiments, the authors report that they foundthe mostsuccess with the lambda
parameter equalto100.
Q4. Explain UNet Architecture?
Answer:
U-Netarchitecture: It is built upon the Fully Convolutional Network and modified in a way
thatit yieldsbettersegmentationinmedicalimaging.ComparedtoFCN-8,thetwomaindifferencesare(a) U-
netis symmetricand(b)theskipconnectionsbetweenthedownsamplingpathandupsampling path
apply aconcatenationoperatorinsteadofasum.Theseskip connectionsintendtoprovidelocal
informationtotheglobalinformationwhileupsampling.Becauseofitssymmetry,thenetworkhasa
largenumberoffeaturemapsin theupsampling path,whichallowstransferring information.By
comparison,theunderlyingFCNarchitectureonlyhadthenumberofclassesfeaturemapsinits
upsamplingway.
Page 5 | 12
Howdoesit work?
TheUNetarchitecturelookslike a‘U,’ whichjustifiesitsname.This UNetarchitectureconsistsof 3

sections:Thecontraction,thebottleneck,andtheexpansionsection.Thecontractionsectionis
madeofmanycontractionblocks.Eachblocktakesaninputthatappliestwo3X3convolutionlayers, followedbya2X2
maxpooling.Thenumberoffeaturesorkernelmapsaftereachblockdoublesso thatUNetarchitecturecan
learncomplexstructures.Bottommostlayermediatesbetweenthe contractionlayerandthe
expansionlayer. Itusestwo3X3 CNNlayers followedby2X2up convolutionlayer.
Buttheheartofthisarchitecturelies in theexpansion section.Similar tothecontractionlayer,it also
hasseveralexpansionblocks.Eachblockpassesinputtotwo3X3CNNlayers, followedbya2X2 upsampling
layer.Aftereachblocknumberoffeaturemapsusedbytheconvolutionallayer,gethalf to
maintainsymmetry.However,everytimeinputis also getappendedbyfeaturemapsof the
correspondingcontractionlayer. This actionwouldensurethatfeaturesthatarelearnedwhile
contractingtheimagewillbeusedtoreconstructit.Thenumberofexpansionblocksisassameas
thenumberofcontractionblocks.Afterthat,theresultantmappingpassesthroughanother3X3CNN layer,with
thenumberoffeaturemapsequaltothenumberofsegmentsdesired.
Page 6 | 12
Q5. What is pair2vec?
Answer:
This paper pre trains wordpair representationsbymaximizing pointwise mutualinformationof
pairs of wordswiththeircontext.This encouragesamodeltolearnmoremeaningfulrepresentations
ofwordpairsthanwithmoregeneralobjectives, like modeling.Thepre-trainedrepresentationsare
useful in taskslike SQuAD andMultiNLI thatrequirecross-sentenceinference. You canexpect to see
morepretrainingtasksthatcapturepropertiesparticularlysuitedtospecificdownstreamtasksand are
complementarytomoregeneral-purposetaskslikelanguagemodeling.
Reasoningaboutimpliedrelationshipsbetweenpairsofwordsiscrucialforcrosssentencesinference
problemslikequestionanswering(QA)andnaturallanguageinference(NLI).InNLI,e.g.,givena
premise such as “golf is prohibitively expensive,” inferring that the hypothesis “golf is a cheap
pastime”isacontradictionrequiresonetoknowthatexpensiveandcheapareantonyms.Recent work
hasshownthatcurrentmodels,whichrelyheavilyonunsupervisedsingle-wordembeddings, struggleto
grasp such relationships. In this pair2vec paper, we show that they can be learned with word
pair2vec(pair vector), which are trained, unsupervised, at a huge scale, and which significantly
improveperformancewhenaddedtoexistingcross-sentenceattentionmechanisms.
Page 7 | 12
Unlike single wordrepresentations,whicharetypically trainedbymodelingtheco-occurrenceofa
targetwordx withits contextc, ourword-pairrepresentationsarelearnedby modelingthethree-way co-
occurrencebetweentwowords(x,y) andthecontextc thattiesthemtogether, as illustrated in above
Table. While similar training signal hasbeenusedtolearnmodelsfor ontology
construction and knowledge base completion, this paper shows, for the first time, that considerable
scale learning of pairwise embeddings can be used to improve the performance of neural cross-
sentence inference modelsdirectly.
Q6. What is Meta-Learning?

Answer:
Meta-learning:It is anexciting areaofresearchthattacklestheproblemoflearningtolearn.Thegoal is
todesignmodelsthatcanlearnnewskillsorfastlytoadapttonewenvironmentswithminimum training
examples.Notonly doesthis dramaticallyspeedupandimprovethedesignofML(Machine Learning)
pipelinesorneuralarchitectures,butit alsoallows ustoreplacehand-engineered algorithmswith
novelapproacheslearnedinadata-drivenway.
Page 8 | 12
The goal of meta-learning is to train the model on a variety of learning tasks, such that it can solve
newlearningtaskswithonlyasmall numberoftrainingsamples.Ittendstofocusonfindingmodel agnostic
solutions,whereasmulti-tasklearningremainsdeeplytiedtomodelarchitecture.
Thus, meta-level AI algorithms make AI systems:
· Learn faster
· Generalizable to many tasks
· Adaptableto environmentalchangeslikeinReinforcementLearning
Onecansolveanyproblemwithasinglemodel,butmeta-learningshouldnotbeconfusedwithone- shot
learning.
Q7. What is ALiPy(Active Learning in Python)?

Answer:
SupervisedML methodsusually requirealargesetof labeledexamples for modeltraining.
However,in manyrealapplications,thereareampleunlabeleddatabutlimitedlabeleddata;and
acquisitionoflabelsis costly. Active learning(AL) reduceslabeling costsby iteratively selecting
themostvaluable datatoquery their labels fromtheannotator.
Active learningis theleadingapproachtolearningwithlimited labeleddata.It tries toreduce
humanefforts ondataannotationby actively querying themostprominentexamples.
ALiPy is a Pythontoolboxfor active learning(AL), which is suitable for various users.On theone
hand, theentire process of active learning has beenwell implemented.Users can efficiently
performexperiments by manylines of codestofinish theentire processfromdatapre-processesto
Page 9 | 12
resultin visualization. Morethan20commonlyusedactivelearning(AL) methodshavebeen
implemented in the toolbox, providing users many choices.
Q8.What is the Lingvo model?

Answer:
Lingvo: It is aTensorflowframeworkoffering acompletesolutionforcollaborative deeplearning
research, with aparticularfocustowardssequence-to-sequencemodels. These models arecomposed
ofmodularbuilding blocksthatareflexible andeasily extensible,andexperimentconfigurationsare
centralizedandhighly customizable.Distributed training andquantizedinferencearesupported
directly withinaframework,andit contains existing implementations of anample numberof utilities,
helperfunctions,andnewestresearchideas.This modelhasbeenusedin collaborationbydozensof
researchers in morethan20papersoverthelast twoyears.
Why doesthis L ingvoresearchmatter?
Theprocessofestablishing anewdeeplearning(DL) systemis quitecomplicated. It involves
exploring anample space of design choices involving training data, data processing logic, thesize,
andtypeof modelcomponents,theoptimization procedures,andthepathtodeployment.This
complexityrequirestheframeworkthatquickly facilitates theproduction of newcombinations and
themodificationsfromexisting documentsandexperimentsandsharesthesenewresults.It is a
workspacereadytobeusedby deeplearning researchersordevelopers.NguyenSays: “Wehave
researchersworkingonstate-of-the-art(SOTA) productsandresearch algorithms, basing their
researchoff ofthesamecodebase. This ensures that code is battle-tested. Our collective experience
is encodedin meansof gooddefaults andprimitives thatwehave founduseful overthesetasks.”
Page 10 | 12
Q9. What is Dropout Neural Networks?
Answer:
Theterm“dropout”referstodroppingoutunits(bothhiddenandvisible)inaneuralnetwork.
At each training stage, individual nodes are either dropped out of the net with probability 1-
p or kept with probability p, so that a reduced network is left; incoming and outgoing edges to a
dropped-outnodearealsoremoved.
WhydoweneedDropout?
Theanswertothesequestionsis“topreventover-fitting.”
A fully connected layer occupies most of the parameters, and hence, neurons
develop co-dependency amongst each other during training, which curbs the individual power of each
neuronleadingtoover-fittingoftrainingdata.
Page 11 | 12
Q10. What is GAN?
Answer:
A generativeadversarial network(GAN): It is aclass of machinelearningsystemsinventedbyIan
Goodfellowandhis colleaguesin 2014.Twoneuralnetworksarecontestingwitheach other in a
game(in theideaofgametheory,oftenbutnotalwaysin theformofazero-sumgame).Given a
training set,thistechniquelearnstogeneratenewdata with thesamestatistics as thetraining set.
E.g., aGAN trainedonphotographscanproduceoriginal picturesthatlookatleastsuperficially
authentictohumanobservers,havingmanyrealistic characteristics.Thoughinitially proposedasa
formofagenerativemodelforunsupervisedlearning, GANs havealsoprovenusefulforsemi-
supervisedlearning,[2] fully supervisedlearning, andreinforcement learning.
Exampleof GAN
Given animageof aface,thenetworkcanconstructanimagethatrepresentshowthat

person couldlookwhentheyareold.
GenerativeAdversarialNetworkstakesupagame-theoreticapproach,unlike aconventionalneural
network.Thenetworklearnstogeneratefromatrainingdistributionthrougha2-playergame.The
twoentitiesareGeneratorandDiscriminator.Thesetwoadversariesareinconstantbattlethroughout thetraining
process.
Page 12 | 12
DATA
SCIENCE
INTERVIEW
PREPARATION
Preparation)
# Day21
Page 1 | 15
Q1. Explain Grad-CAM architecture?
Answer:
According totheresearchpaper,“Weproposeatechniquefor makingConvolutional NeuralNetwork
(CNN)-based models moretransparent by visualizing input regions thatare‘important’ for
predictions–producingvisual explanations.Our approachis called Gradient-weighted Class
Activation Mapping(Grad-CAM), whichusesclass-specific gradientinformationtolocalize the
crucial regions.Theselocalizations arecombinedwiththeexisting pixel-spacevisualizations to
createanewhigh-resolution,andclass-discriminative display calledtheGuidedGrad-CAM. These
methodshelpbettertounderstandCNN-based models,including imagecaptioningandtheapparent
questionanswering(VQA) models.Weevaluateourvisual explanationsby measuringtheability to
discriminate betweentheclasses andtoinspire trustin humans,andtheircorrelationwiththe
occlusion maps.Grad-CAM providesanewwaytounderstand theCNN-based models.”
A technique for making CNN(Convolutional Neural Network)-based models more transparent by
visualizing theregions of input that are “important” for predictions from these models — or visual
explanations.
This visualization is both high-resolution (when the class of interest is ‘tiger cat,’ it identifies
crucial ‘tiger cat’featureslike stripes, pointy ears andeyes)andclass-discriminative(it shows the
‘tiger cat’ but notthe ‘boxer(dog)’).
Page 2 | 15
Q2.Explain squeeze-net architecture?
Answer:
Nowadays, technology is at its peak. Self-driving cars andIoT is going to be household talks in the
next few years to come. Therefore, everything is controlled remotely, say, e.g., in self-driving cars,
we will need our system to communicate with the servers regularly. So accordingly, if we have a
model that has a small size, thenwecan quickly deploy it in the cloud. So that’s why weneededan
architecture that is less in size and also achieves the same level of accuracy that other architecture
achieves.
It’s Architecture
Replace 3x3 filters with 1x1 filter- Weplan to use themaximum numberof 1x1 filters as
using a 1X1 filter rather than a 3X3 filter can reduce the number of parameters by 9X. We
may think that replacing 3X3 filters with 1X1 filters may perform badly as it has less
information to work on. But this is not a case. Typically 3X3 filter may capture the spatial
informationof pixels close toeachother while the1X1 filter zeros in onpixel andcaptures
features amongstits channels.
Decrease number of input channels to 3x3 filters- to maintain a small total number of
parametersin aCNN, andit is crucial notonly todecreasethenumberof 3x3 filters, butalso
to decrease the number of input channels to 3x3 filters. We reduce the number of input
channels to 3x3 filters using squeeze layers. The author of this paper has used a termcalled
the“fire module,”inwhichthereis asqueezelayerandanexpandedlayer.Inthesqueeze layer,
weareusing1X1 filters, whileintheexpandedlayer,weareusingacomboof3X3 filters and
1X1 filters. The author is trying to limit the number of inputs to 3X3 filters to
reducethenumber of parameters inthelayer.
Page 3 | 15
Downsamplelatein anetworksothatconvolutionlayers havealargeactivation map-
Having got an intuition about contracting the sheer number ofparameterswe are
workingwith,howthemodelis getting mostoutoftheremainingsetofparameters.The
authorinthispaperhasdownsampledthefeaturemapinlaterlayers,andthisincreases the
accuracy.Butthisisanexcellent contrasttonetworkslikeVGGwherealargefeature map
is taken, and then it gets smaller as network approach towards the end. This
different approachis toointeresting,andtheycite thepaperbyK. HeandH. Sunthat
similarly applies delayed downsampling that leads to higher classification accuracy.
This architecture consists of the fire module, which enables it to bring down the
number ofparameters.
And other thing that surprises me is the lack of fully connected layers or dense layers at the end,
which one will see in a typical CNN architecture. The dense layers, in the end, learn all the
relationships between the high-level features and the classes it is trying to identify. The fully
connectedlayersaredesignedtolearnthatnosesandearsmakeupaface,andwheelsandlights indicate
cars. However, in this architecture, that extra learning step seems to be embedded within the
transformationsbetweenvarious“firemodules.”
Page 4 | 15
The squeeze-net can accomplish an accuracy nearly equal to AlexNet with 50X less number of
parameters. The most impressive part is that if we apply Deep compression to the already smaller
model, thenit can reducethesize of thesqueeze-netmodelto510x times thatof AlexNet.
Q3.ZFNet architecture
Answer:
The architecture of the network is an optimized version of the last year’s winner - AlexNet. The
authors spent some time to find out the bottlenecks of AlexNet and removing them, achieving
superiorperformance.
(a): First layerZFNET featureswithoutfeaturescaleclipping. (b): theFirst layerfeaturesfrom

AlexNet. Notethattherearelotof deadfeatures- oneswherethenetworkdid notlearn any patterns.
(c): theFirst layerfeaturesforZFNet. Notethatthereareonlyafewdeadfeatures.(d): Secondlayer
features from AlexNet. The grid-like patterns areso-called aliasing artifacts. They appearwhen
Page 5 | 15
receptive fields of convolutional neurons overlap, and neighboring neurons learn similar structures.
(e): 2ndlayer features for ZFNet. Notethattherearenoaliasing artifacts. Source:original paper.
In particular, they reduced the filter size in the 1st convolutional layer from 11x11 to 7x7, which
resulted in fewerdeadfeatures learned in thefirst layer (see theimage belowforanexample of that).
A deadfeature is a situation wherea convolutional kernel fails to learn any significant representation.
Visually it looks like amonotonicsingle-color image,whereall thevalues areclose toeachother.
In addition to changing the filter size, the authors of FZNet have doubled thenumberof filters in all
convolutional layers andthenumberof neurons in thefully connected layers as compared tothe
AlexNet. In the AlexNet, there were 48-128-192-192-128-2048-2048 kernels/neurons, and in the
ZFNet, all these doubled to 96-256-384-384-256-4096-4096. This modification allowed the network
to increase the complexity of internal representations and as a result, decrease the error rate from
15.4% forlast year’s winner,to14.8% tobecomethewinnerin 2013.
Q4. What is NAS (Neural Architecture Search)?
Answer:
Developing the neural network models often requires significant architecture engineering. We
can sometimesgetbywithtransferlearning,butifwewantthebestpossibleperformance,it’susually
besttodesignyournetwork.Thisrequiresspecializedskillsandischallengingingeneral;wemay noteven
know the limits of the current state-of-the-art(SOTA) techniques. Its a lot of trial and error, and
experimentationitselfistime-consumingandexpensive.
This is the NAS(Neural Architecture Search) comes in. NAS(Neural Architecture Search)
is an algorithm that searches for the best neural network architecture. Most of the
algorithms work in the following way. Start off by defining the set of “building blocks” that can be
usedforournetwork. E.g.,thestate-of-the-art(SOTA)NASNetpaperproposesthesecommonlyusedblocks
foranimage recognition network-
Page 6 | 15
In the NAS algorithm, the controller Recurrent Neural Network (RNN) samples the building
blocks, putting them together to create some end to end architecture. Architecture generally
combinesthe samestyleasstate-of-the-art(SOTA)networks,suchasDenseNetsorResNets,butusesa
much differentcombinationandtheconfigurationofblocks.
This new network architecture is then trained to convergence to obtain the least accuracy
ontheheld-outvalidationset.Theresultingefficienciesareusedtoupdatethecontrollersothatthecontroller
will generate better architectures over time, perhaps by selecting better blocks or
makingbetter connections.Thecontrollerweightsareupdatedwithapolicygradient.Thewhole
end-to-endsetup isshownbelow.
It’s a reasonably intuitive approach! In simple means: have an algorithm grab different
blocks and put those blocks together to make the network. Train and test out that network.
Based on our results, adjust the blocks we used to make the network and how you put them
together!
Q5. What is SENets?

Answer:
SENets stands for Squeeze-and-Excitation Networks introduces a building block for CNNs
that improveschannelinterdependenciesatalmostnocomputationalcost.Theyhaveusedinthe2017
ImageNet competition and helped to improve the result from last year by 25%. Besides this large
performanceboost,theycanbeeasilyaddedtoexistingarchitectures.Theideaisthis:
Page 7 | 15
Let’s add parameters to each channel of the convolutional block so that the network can adaptively
adjust theweightingof eachfeature map.
As simple asmayit sound, this is it. So, let’s takeacloser look atwhythis workssowell.
Why it workstoowell?
CNN's uses its convolutional filters to extract hierarchal information from the images. Lower layers
find little pieces of context like high frequencies or edges, while upper layers can detect faces, text,
orothercomplex geometrical shapes. They extract whatever is necessary tosolve thetask precisely.
All of this worksbyfusing spatialandchannelinformationof animage.Thedifferent filters will first
find thespatial featuresin eachinputchannelbeforeaddingtheinformation across all available
output channels.
All we need to understand for now is that the network weights each of its channels equally when
creating output feature maps. It is all about changing this by adding a content-aware mechanism to
weight each channel adaptively. In its too basic form, this could meanadding a single parameter to
eachchannel andgiving it linear scalar howrelevant eachoneis.
However, the authors push it a little further. First, they get the global understanding of each channel
by squeezing feature mapstoasingle numeric value. This results in thevectorof size n, wherenis
equal to the number of convolutional channels. Afterward, it is fed through a two-layer neural
network, which outputsa vector of thesamesize. These n values can nowbeusedas weights onthe
original features maps,scaling eachchannel basedonits importance.
Q6. Feature Pyramid Network (FPN)

Answer:
Page 8 | 15
The Bottom-Up Pathway
The bottom-up pathway is feedforward computation of backbone ConvNet. It is known as one
pyramidlevelisforeachstage.Theoutputoflastlayerofeachstepwillbeusedasthereferenceset of
featuremapsforenrichingthetop-downpathwaybylateralconnection.
Top-Down Pathway and Lateral Connection
Thehigherresolutionfeaturesareupsampledspatiallycoarser,butsemanticallystronger,
featuremapsfromhigherpyramidlevels.Moreparticularly,thespatialresolution
is upsampled by a factor of 2 using nearest neighbor for simplicity.
Eachlateralconnectionaddsfeaturemapsofthesamespatialsizefromthebottom-
uppathwayand top-downpathway.
Specifically, the feature maps from the bottom-up pathway undergo 1×1
convolutionstoreducechanneldimensions.
Andfeaturemapsfromthebottom-uppathwayandtop-downpathwayaremerged
byelement-wiseaddition.
Predictionin FPN
Finally, the3×3 convolution is appended on each merged map to generate a final feature
map,which istoreducethealiasingeffectofupsampling.This last set of feature maps is
called{P2,P3,P4,P5},correspondingto{C2,C3,C4,C5}thatarerespectivelyofsame
spatial sizes.
Because all levels of pyramid use shared classifiers/regressors as in a traditional featured
imagepyramid,featuredimensionatoutputdis fixed with d= 256. Thus, all extra
convolutionallayershave256channeloutputs.
Q7. DeepID-Net( Def-Pooling Layer)
Answer:
A new def-pooling (deformable constrained pooling) layer is used to model
thedeformationofthe objectpartswithgeometricconstraintsandpenalties.Thatmeans,exceptdetectingthe
wholeobject directly,itisalsoimportanttoidentifyobjectparts,whichcanthenassistindetectingthewhole
object.
Page 9 | 15
Thestepsinblack coloraretheoldstuffthatexistedinR-CNN. The stages in redcolordonot
appearin R-CNN.
1.Selective Search
First, colorsimilarities, texturesimilarities, regionssize, andregionfilling areusedasnon-

object-basedsegmentation.Thereforeyouobtainmanysmall segmentedareas as shown
atthebottomleft of theimage above.
Then,thebottom-upapproachis usedthatsmall segmentedareasare mergedtoformthe
larger segmentareas.
Thus, about2K regions, proposals (bounding box candidates) are generated,as shown
in theaboveimage.
2. Box Rejection
R-CNN is usedtoreject bounding boxesthataremostlikely tobethebackground.
Page 10 | 15
3. Pre train Using Object-Level Annotations
Usually, pretraining is onimage-levelannotation.It is notgoodwhenanobjectistoosmallwithin the

imagebecausetheobjectshouldoccupyalargeareawithintheboundingboxcreatedbythe selective
search.
Thus, pretraining is onobject-levelannotation.Andthedeeplearning(DL)modelcanbeany
modelssuchasZFNet,VGGNet,and GoogLeNet.
4.Def-PoolingLayer
Page 11 | 15
For thedef-poolingpath,output fromconv5,goesthroughtheConv layer, thengoesthrough
the def-
poolinglayer,andthenhasamax-poolinglayer.
In simple terms, the summation of ac multiplied by dc,n, is the 5×5deformationpenaltyinthe
figure
above.Thepenaltyofplacingobjectpartfromassumedthecentralposition.
By training the DeepID-Net, object parts of the object to be detected will give a high
activation
value
afterthedef-poolinglayeriftheyareclosedtotheiranchorplaces.Andthisoutputwillconnectto 200-
classscoresforimprovement.
5.Context Modeling
In object detection tasks in ILSVRC, there are 200 classes.And there is also the
classification mcoomrepetitiontaskinILSVRCforclassifyingandlocalizing1000-classobjects.Thecontentsare
diversecomparedwiththeobjectdetectiontask. Hence,1000-class scores, obtainedby
classificationnetwork,areusedtorefine200-classscores.
Page 12 | 15
6.The Model Averaging-
Multiplemodelsareusedtoincreasetheaccuracy,andtheresultsfromallmodelsareaveraged. This
7.BoundingBoxRegression
Boundingboxregressionistofine-tunetheboundingboxlocation,whichhasbeenusedinR-CNN.
Q8. What is FractalNet Architecture?

Answer:
In2015, aftertheinventionof ResNet, withnumerouschampionwon,thereareplentyof researchers
workingonhowtoimprovetheResNet,suchasPre-ActivationResNet,RiR,RoR,StochasticDepth,
andWRN.Inthisstory,conversely,anon-residual-networkapproach,FractalNet,isshortly reviewed.When
VGGNet is starting todegrade whenit goesfrom16layers (VGG-16) to19layers (VGG-19),
FractalNetcangoup to40layersoreven80layers.
Architecture
In the above picture: A Simple Fractal Expansion (on Left), Recursively Stacking of
Fractal ExpansionasOneBlock(intheMiddle),5BlocksCascadedasFractalNet(onthe
Right)
For the base case, f1(z) is the convolutional layer:
After that, recursive

fractals are:
Page 13 | 15
WhereC is anumberof columns as in themiddle of theabovefigure. The numberof the
convolutionallayers atthedeepest pathwithin theblock will have 2^(C-1). In this case, C=4, thereby,
anumber of convolutional layers are2³=8layers.
For thejoin layer (green), theelement-wise meanis computed. It is notconcatenation oraddition.
Withfive blocks (B=5) cascadedasFractalNet attheright ofthefigure, thenthenumberof
convolutionallayersatthemostprofoundpathwithinthewholenetworkis B×2^(C-1), i.e., 5×2³=40
layers.
In between2 blocks, 2×2 max pooling is donetoreduce thesize of feature maps. Batch Normand
ReLU areusedaftereachconvolution.
Q9. What is the SSPNet architecture?

Answer:
SPPNet has introduced thenewtechnique in CNN called Spatial Pyramid Pooling (SPP) at the
transition of theconvolutionallayer andfully connected layer. This is aworkfromMicrosoft.
Conventionally,atthetransformationoftheConvlayerandFC layer,thereis onesinglepooling

layerorevennopoolinglayer.InSPPNet,itsuggestshavingmultiplepoolinglayerswithdifferent
scales.
In the figure, 3-levelSPPis used. Suppose conv5 layer has 256 feature maps. Then at the
SPP layer,
Page 14 | 15
1. first, each feature mapis pooledtobecomeonevalue(whichisgrey).Thus 256-dvector is
formed.
2. Then,eachfeaturemapis pooledtohavefourvalues(whichisgreen), andformthe4×256- d
vector.
3. Similarly, each feature map is pooledtohave16values(inblue), andformthe16×256-d
vector.
4. Theabovethreevectorsareconcatenatedtoforma1-dvector.
5.Finally, this 1-dvectorisgoingintoFClayersasusual.
WithSPP,youdon’tneedtocroptheimagetoafixedsize,like AlexNet,beforegoingintoCNN.Any
imagesizescanbeinputted.
Page 15 | 15
DATA SCIENCE
INTERVIEW
PREPARATION
(30 Days of Interview Preparation)
# Day22
Page 1 | 16
Q1. Explain V-Net (Volumetric Convolution) Architecture with related to
Biomedical Image Segmentation?
Answer:
There were several medical data used in clinical practice consists of 3D volumes, such as MRI
volumes illustrate prostate, while most approaches are only able to process 2D images. A 3D image
segmentationbasedonavolumetric,fullyconvolutionalneuralnetworkisproposedinthiswork.
Slices fromMRI volumesdepictingprostate

Prostatesegmentationneverthelessis thecrucial taskhaving clinical relevancebothduringdiagnosis,
wherethevolumeoftheprostateneedstobeassessedandduringtreatmentplanning, wherethe
estimateof theanatomical boundaryneeds tobeaccurate.
Architecture
Page 2 | 16
V-Net, justifies by its name, it isshown as V-shape. The left part ofthe network
consistsof acompressionpath,whileontherightpartdecompressessignaluntilitsoriginal
size is
reached.
This is the same as U-Net, but with some difference.
On Left
Theleftsideofthenetworkisdividedintodifferentstagesthat operateatvariousresolutions.
Each stage comprises one to 3 convolutional layers.
At eachstage,a residual function is learned.Theinputofeachstageis used
in convolutional layers andprocessedthroughnon-linearities andadded tothe output
of thelast convolutional layerofthatstagetoenablelearningaresidualfunction.This V-net
architectureensuresconvergencecomparedwithnon-residuallearningnetworkssuchasU-
Net.
Theconvolutionsperformedineachstageusevolumetrickernelshavingthesizeof5×5×5
voxels.(Avoxelrepresents avalue onaregular gridin3D-space. Thetermvoxelis
commonlyusedin3Dmuch 3Dspace,justlikevoxelizationinapointcloud.)
Along the compression path, the resolution is reduced by convolution with 2×2×2
voxels full kernels applied with stride 2. Thus, the size of the resulting feature maps is
halved, with a similar purpose as pooling layers. And number of feature channels
doublesat
eachstageofthecompressionpathofV-Net.
Replacing pooling operations with convolutional ones helps to have a smaller
memoryfootprintduringtrainingbecausenoswitchesmappingtheoutput ofpoolinglayersbackto
theirinputsareneededforback-propagation.
Downsamplinghelpstoincreasethereceptivefield.
PReLU is used as a non-linearity activation function.
On Right Part
The networkextractsfeaturesandexpandsspatial supportof thelowerresolution

feature mapstogatherandassemblethenecessaryinformationtooutputatwo-channel
volumetric
segmentation.
Page 3 | 16
At each stage, a deconvolutionoperationisemployedtoincrease thesize oftheinputs
followedbyonetothreeconvolutional layers,involving half thenumberof5×5×5 kernels
appliedinthepreviouslayer.
Theresidualfunctionis learned, similar to left part of the network.
The2featuresmapscomputed byaverylastconvolutionallayer, having 1×1×1kernel
sizeand producingoutputsofthesamesizeasinputvolume.
Thesetwooutputfeaturemapsaretheprobabilistic segmentationoftheforegroundand
backgroundregionsbyapplyingsoft-maxvoxelwise.
Q2. Highway Networks- Gating Function to highway
Answer:
It is found that difficulties are optimizing a very deep neural network.However, it’s still an
open problemwithwhyitisdifficulttooptimizeadeepnetwork.(itisduetogradientvanishingproblem.) Inspired
byLSTM (Long Short-TermMemory),authorsthereby makeuseofgating functionto adaptively
bypassor transformthesignalsothatthenetworkcangodeeper.Thedeepnetwork withmore
than1000layerscanalsobeoptimized.
Plain Network
Beforegoing intoHighway Networks,Let usstartwithplain networkwhichconsists of L layers
wherethel-thlayer(withomittingthesymbolforthelayer):
Wherexis input, WHis the weight, H is the transform function followed by an activation
function, andyis the output. And for i-thunit:
Wecomputetheyiandpassittothenextlayer.
Highway Network
Page 4 | 16
In a highway network, 2 non-linear transforms T andC are
introduced:
whereT is Transform Gate, and C is the Carry

Gate. In particular, C = 1 - T:
Wecanhavebelowconditions for specific T values:
WhenT=0, we passinputas outputdirectly,whichcreatesan informationhighway.

That’s why
it is called the Highway Network.
WhenT=1, weuse non-linear activated transformed input asoutput.
Here,incontrasttothei-thunitinplainnetwork,theauthorsintroducetheblockconcept.Fori-th
block, there is a block state Hi(x), and transform gate output Ti(x). And the corresponding
block
outputyi:
whichisconnectedtothenextlayer.
Formally,T(x) is the sigmoid function:
Page 5 | 16
Sigmoidfunctioncapstheoutput between 0to1.Whentheinputhasa too-smallvalue, it
becomes
0.Whentheinputhasatoo-largeamount,itbecomes1.Therefore,by learningWTandbT, a
networkcanadaptivelypassH(x) or pass x tothenextlayer.
Andthe authorclaimsthat this helpstohavethe simpleinitialization schemeforWTwhichis
independentofnatureofH.
bTcanbeinitializedwiththenegativevalue(e.g.,-1,-3,etc.)suchthat thenetworkisinitially
biased
towardscarryingbehaviour.
LSTM inspires the above idea as the authors mentioned.
QHo3w.evWer,hthaetexaisctreDsGradientDescent)did
AndSGD(Stochastic uetltsNhAavSe:notNbeeeunrparlovA
iderdc.hitecture
notstall fornetworks withmorethan1000 layers.
Search(NAS) on Object
Detection?
Answer:
Objectdetectionis oneofthemostfundamentalcomputervision(OpenCV) tasksandhasbeenwidely
usedinreal-worldapplications.Theperformanceofobjectdetectorshighlyreliesonfeatures extractedby
backbones.However,mostworksonobjectdetectiondirectlyusenetworksdesigned forclassificationas
abackbonethefeatureextractors,e.g.,ResNet.Thearchitecturesoptimizedon imageclassificationcannot
guaranteeperformanceonobjectdetection.Itisknownthatthereisan essentialgapbetweenthesetwo
different tasks. Imageclassification basically focuseson”What” mainobjectoftheimageis,while
objectdetectionaimsatfinding”Where”and”What”eachobject
Page 6 | 16
instance in an image. There have beenlittle works focusing onbackbone design for object detector,
except thehand-craft network, DetNet.
Neural architecture search (NAS) has achieved significant progress in image classification and
semantic segmentation. The networks produced by search have reached or even surpassed the
performance of thehand-crafted ones on this task. But object detection has never beensupported by
NAS before. Some NAS(Neural architecture search) work directly applies architecture searched on
CIFAR-10classificationonobjectdetection.
In this work, we present the first effort towards learning a backbone network for object
detection tasks. Unlike previous NAS works, our method does not involve any architecture-level
transfer. We proposeDetNAS to conduct neural architecture search directlyonthe target tasks. The
quests are even performed with precisely the same settings to the target task. Training an objector
detector usuallyneedsseveraldaysandGPUs,nomatterusingapre-train-finetuneschemeortraining
from scratch.Thus,itisnotaffordabletodirectlyusereinforcementlearning(RL)orevolutionalgorithm
(EA) to search the architectures independently. To overcome this obstacle, we
formulatethis
problem
intosearchingtheoptimalpathin thelarge graphorsupernet.In simpleterms,DetNAS consists of
threesteps:(1)trainingasupernetthatincludesallsub-networksinsearchspace;(2)searchingfor
thesub-networkwtihthehighest performanceonthevalidationsetwithEA;(3)retrainingthe res4u.Y

l to
i ngunheatwvoerkaannyd
iQ
d e a a b o t E C Eu(Em o
e v aluati n gti o n th etes ts et .tioncauseextraction).
Answer:
Emotioncauseextraction(ECE)aimsatextractingpotentialcausesthatleadtoemotionexpressions in
thetext.TheECEtaskwasfirstproposedanddefinedasaword-levelsequencelabelingproblem inLeeetal.To
solvetheshortcomingofextractingcausesatthewordlevel,Guietal.2016released anewcorpus
whichhasreceivedmuchattentioninthefollowingstudyandbecomesabenchmark datasetforECE
research.
Page 7 | 16
BelowFig.Displaysanexamplefromthiscorpus,therearefiveclausesinadocument.Theemotion
“happy” is contained in fourth clause. We denote this clause as an emotion clause, which refers to a
termthatincludesemotions.Ithastwocorrespondingcauses:“apolicemanvisitedtheoldmanwith thelost
money”inthesecondclauseand,“toldhimthatthethiefwascaught”inthethirdclause.We namethemas
cause clause, which refers to a term that contains causes.
In this work,we proposea new task: emotion-causepair extraction (ECPE), which aims to
extract all potentialpairsof emotionsandcorrespondingcausesin thedocument.In AboveFig, we
showthe differencebetweenthetraditionalECEtaskandournewECPEtask.ThegoalofECEistoextract the
correspondingcauseclauseofthegiven emotion.In additiontoadocumentastheinput,ECE needsto
provideannotated feelingatfirstbeforecauseextraction.
In contrast, the output of our ECPE task is a pair of emotion-cause, without the need of
providing emotionannotationin advance.FromAbovefig., e.g., given theannotationof feeling:
“happy,”the goal of ECE is totrackthetwocorrespondingcauseclauses:“apolicemanvisited theold
manwiththelostmoney”and“andtoldhimthatthethiefwascaught.”WhileintheECPEtask,thegoalistodirectly
extractall pairsofemotionclauseandcauseclause,including (“The oldmanwasdelighted”, “apoliceman
visited theoldmanwiththelostmoney”)and(“The oldmanwaspleased”, “andtold himthat
thethiefwascaught”),withoutprovidingtheemotionannotation“happy”.
To address this new ECPE task, we propose a two-step framework. Step 1 converts the
emotion- cause pair extraction task to two individual sub-tasks (emotion extraction and cause
extraction respectively)viatwokindsofmulti-tasklearningnetworks,intendingtoextractasetof
emotion clauses and a set of cause clauses. Step 2 performs emotion-cause pairing and filtering. We
combine alltheelementsofthetwosetsintopairsandfinallytrainafiltertoeliminatethecouplesthatdonot
containacausalrelationship.
Q5.What is DST (Dialogue state tracking)?

Answer:
Page 8 | 16
Dialogue state tracking (DST) is a core component in task-oriented dialogue systems, such as
restaurant reservations or ticket bookings. The goal of DST is to extract user goals expressed during
conversation andto encodethemas a compact set of thedialogue states, i.e., a set of slots andtheir
corresponding values. E.g., asshownin belowfig., (slot, value) pairs such as(price,
cheap)and(area, centre)areextractedfromtheconversation. Accurate DST performanceis
importantfor appropriatedialogue management,whereuserintentiondeterminesthenextsystem
actionandthecontent toquery fromthedatabases.
Statetrackingapproachesarebasedontheassumptionthatontologyisdefinedinadvance,where all
slotsandtheirvaluesareknown.Havingapredefined ontologycansimplifyDSTintoa
classification problem and improve performance (Henderson et al., 2014b; Mrkšić et al., 2017;
Zhong et al., 2018). However, there are two significant drawbacks to this approach: 1) A full
ontology is hard to obtain in advance (Xu and Hu, 2018). In the industry, databases are usually
exposedthroughanexternalAPIonly,whichisownedandmaintainedbyothers.Itisnotfeasibleto gain
accesstoenumerateallthepossiblevaluesforeachslot.2)Evenifafullontologyexists,the numberof
possible slot values could be significant and variable. For example, a restaurant name or a train
departuretimecancontainalargenumberofpossiblevalues.Therefore,manyoftheprevious worksthatare
basedonneuralclassificationmodelsmaynotbeapplicableinrealscenarios.
Q6.What is NMT(Neural machine translation)?

Answer:
NMT stands for Neuralmachine translation, whichis the useof neural networkmodelsto
learnthe statisticalmodelformachinetranslation.
Page 9 | 16
Thekeybenefittotheapproachisthatthesinglesystemcanbetraineddirectlyonthesourceand
targettext,nolongerrequiringthepipelineofspecializedmethodsusedinstatistical(ML)machine learning.
Unlike the traditional phrase-based translation system which consists of many sub-
componentsthat aretunedseparately,neuralmachinetranslationattemptstobuildandtrainasingle,large
neural networkthatreadsasentenceand outputsacorrecttranslation.
As such, neural machine translation(NMT) systems are said to be end-to-end systems
asonlyone modelisrequiredforthetranslation.
In Encoder
Thetaskoftheencoderistoprovidetherepresentationofainputsentence.Theinputsentenceisa
sequenceofwords,forwhichwefirstconsultembeddingmatrix.Then,asintheprimarylanguage model
described previously,we processthesewordswitha recurrent neural network(RNN).This results in
hiddenstatesthatencodeeachwordwithitsleftcontext,i.e.,alltheprecedingwords.To alsogettheright
context,wealsobuildarecurrentneuralnetwork(RNN)thatrunsright-to-left,or, fromtheendofthe
sentencetobeginning.Havingtworecurrentneuralnetworks(RNN)runningin twodirectionsisknownas
thebidirectionalrecurrentneuralnetwork(RNN).
In Decoder
Thedecoderis therecurrentneuralnetwork(RNN). It takessomerepresentationof inputcontext
(moreonthatinthenextsectionontheattentionmechanism)andprevioushiddenstateandtheoutput word
prediction,andgeneratesanewhiddendecoderstateandthenewoutputwordprediction.
If you use LSTMs for the encoder, then you also use LSTMs for the decoder. From
hiddenstate.You nowpredicttheoutputword.Thispredictiontakestheformoftheprobabilitydistribution
overentire outputvocabulary.Ifyouhaveavocabularyof,say,50,000words,thenthepredictionisa
50,000 dimensionalvector,eachelementcorrespondingtotheprobability predictedforonewordin
the vocabulary.
Q7. What is Character-Level models (CLM)?

Answer:
In English, there is strong empirical evidence that the character sequence that create up
proper nouns tend to be distinctive. Even divorced of context,human reader can predict that
“hoekstenberger”isan entity,but“abstractually”isnot.SomeNERresearchexploresuseofcharacter-level
features
Page 10 | 16
including capitalization, prefixesandsuffixes CucerzanandYarowsky; RatinovandRoth(2009),
andcharacter-levelmodels(CLMs)Kleinetal.(2003)toimprovetheperformanceofNER,butto datetherehas
beennosystematicstudyisolatingutilityofCLMsincapturingthedistinctionsbetween nameandnon-nametokensin
English oracrossother languages.
Weconducttheexperimentalassessmentofthediscriminative powerofCLMs forarangeof

languages:English, Arabic, Amharic,Bengali, Farsi, Hindi, Somali, andTagalog. These languages
usethevariety of scripts andorthographic conventions (e.g, only three usecapitalization), comefrom
different language families, andvary in their morphological complexity. Werepresent the effectiveness
ofCLMs(character-level models) in distinguishing nametokens from non-name tokens, asillustrated
bytheaboveFigure, whichshowsconfusionin histogramsfrom aCLM trained on entitytokens.Our
modelsuseindividual tokens,butperformextremelywell in spiteoftaking no accountof theword
context.
Wethenassesstheutility ofdirectly adding simplefeaturesbasedonthisCLM(character-level
model) implementation toanexisting NER system,andshowthat they have thesignificant positive
impact on performance across many of thelanguages wetried. By adding very simple CLM-based
features to the system, our scores approach those of a state-of-the-art(SOTA) NER system Lample
et al. (2016) across multiple languages, representing boththe unique importance and broad utility of
this approach.
Q8.What is LexNLP package?

Answer:
Overthelast 2decades,manyhigh-quality, open-sourcepackagesfor naturallanguage
processing(NLP) andmachinelearning(ML) havebeenreleased.Developersandresearcherscan
quickly write applications in languages such as Python, Java, andR that stand onshoulders of
Page 11 | 16
comprehensive, well-tested libraries such as Stanford NLP (Manning et al. (2014)), OpenNLP
(ApacheOpenNLP (2018)), NLTK (Bird etal. (2009)), spaCy (Honnibal andMontani (2017), scikit-
learn library (Buitinck et al. (2013), Pedregosa et al. (2011)), and Gensim (Řehůřek and Sojka
(2010)). Consequently, for most of the domains, rate of research has increased and cost of the
applicationdevelopment hasdecreased.
Forsomespecialized areas like marketing andmedicines, therearefocused libraries and
organizationslike BioMedICUS (Consortium(2018)), RadLex (Langlotz (2006)), andtheOpen
HealthNatural Language Processing(NLP) Consortium. Law, however, has received substantially
less attentionthanothers, despite its ubiquity, societal importance, andthespecialized form. LexNLP
is designedtofill this gapbyproviding bothtools anddata for developers andresearchers towork
withreallegal andregulatorytext,including statutes,regulations,thecourt opinions, briefs, contracts,
andtheotherlegal workproducts.
Law is thedomaindrivenbylanguage, logic, andtheconceptual relationships, ripe for computation
andanalysis(Ruhl etal. (2017)). However,in ourexperience,naturallanguageprocessing(NLP) and
machinelearning(ML) havenotbeenappliedasfruitfully orwidely in legal asonemighthope.We
believethatthekeyimpedimenttoacademicandcommercialapplicationhasbeenlack of tools that
allow userstoturnthereal, unstructuredlegal documentintostructureddataobjects.TheGoal of
LexNLP is tomakethis tasksimple,whetherfortheanalysis of statutes,regulations,courtopinions,
briefs orthemigration of legacy contractstosmartcontractordistributedledger systems.
Q9.Explain The Architecture of LeNet-5.

Answer:
Yann LeCun, LeonBottou,YosuhaBengioandPatrick Haffnerproposedtheneuralnetwork
architecturefor thehandwrittenandmachine-printedcharacterrecognitionin the1990’s whichthey
called themLeNet-5. The architectureis straightforward andtoosimple to understandthat’s why it
is mostly usedasafirst stepfor teaching (CNN)Convolutional Neural Network.
Architecture
Page 12 | 16
This architecture consists of two sets of convolutional and average pooling layers,
followedbythe flatteningconvolutionallayer,then2fully-connectedlayersandfinallythesoftmax
classifier.
In the First Layer:
The input for LeNet-5 is the 32×32 grayscale image which passes through first convolutional layer
with6featuremapsorfiltershavingsize5×5andthestrideofone.Imagedimensionschanges from
32x32x1to28x28x6.
In Second Layer:
Thenit appliesaveragepoolinglayer orsub-samplinglayer withthefilter size 2×2andstrideof two. The
resultingimagedimensionwillbereducedto14x14x6.
ThirdL ayer:
Next, thereis thesecondconvolutional layer with 16feature mapshaving size 5×5 andthestrideof
1. In this layer, only tenoutof sixteen feature mapsareconnected to6feature mapsof previous layer
asshownbelow.
Page 13 | 16
The main reason is to break symmetry in the network and keeps a number of connections within
reasonable bounds. That is why the numberof training parameters in this layers are 1516 instead of
2400 andsimilarly, numberof connections are151600 insteadof 240000.
Fourth Layer:
In the fourth layer (S4) is an average pooling layer with filter size 2×2andstrideof2.Thislayer
is same as second layer (S2) except it has 16 feature maps so output will be reduced to
5x5x16.
Page 14 | 16
Fifth Layer:
Thefifthlayer(C5)isthefullyconnectedconvolutionallayerwith120featuremapseachofthesize
1×1.Eachof120unitsinC5isconnectedtoallthe400nodes(5x5x16)inthefourthlayerS4.
SixthL ayer:
The sixth layer is also fully connected layer (F6) with84units.
Page 15 | 16
OutputLayer:
Finally, there is fully connected softmax output layer ŷ with 10possible values
correspondingto digits from0to9.
Page 16 | 16
DATA SCIENCE
INTERVIEW
PREPARATION
Preparation)
# DAY 23
Page 1 of 18
Q1.Explain Overfeat in Object detection.
Answer:
Overfeat: It is a typical model of integrating object detection, localization, and classification tasks whole
into one convolutional neural network(CNN). The main idea is to do image classification at different
locations on regions of multiple scales of the image in a sliding window fashion, and second, predict
bounding boxlocationswiththeregressor trained ontopof thesameconvolution layers.
This modelarchitecture is toosimilar toAlexNet. This modelis trained asfollows:
1. Traina CNNmodel(identicaltoAlexNet)onimageclassificationtasks.
2. Then,wereplacetopclassifierlayers bytheregressionnetworkandtrainedit topredictobject
boundingboxesateachspatiallocationandscale.Regressorisclass-specific,eachgeneratedfor
oneclassimage.
•Input: Images with classification and bounding box.
• Output:(xleft,xright,ytop,ybottom)(xleft,xright,ytop,ybottom), 4valuesintotal,
representingthecoordinatesoftheboundingboxedges.
• Loss: The regressor is trained to minimize l2 norm between the generated
boundingbox andtruthforeachtrainingexample.
At the detection time,
1.It Performs classification at each location using the pretrained CNN model.
2.It Predicts object bounding boxes on all classified regions generated by the classifier.
3.Mergeboundingboxeswithsufficientoverlapfromlocalizationandsufficientconfidenceofbeing thesame
objectfromtheclassifier.
Page 2 of 18
Q2. What is Multipath: Multiple Probabilistic Anchor Trajectory
Hypotheses for Behavior Prediction?
Answer:
In this paper, we focus on problem of predicting future agent states, which is the crucial task
for robot planning in real-world environments. We are specifically interested in addressing this
problemforself- drivingvehicles,applicationwithapotentiallyenormoussocietalimpact.Mainly,predictingthe
future
ofother agents in this domain is vital for safe, comfortable, andefficient operation. E.g., it is important
toknowwhethertoyieldtothevehicleiftheyaregoingtocutinfrontofourrobotorwhenwouldbe thebesttime
toaddintotraffic.Suchfuturepredictionrequiresanunderstandingofastaticanddynamic worldcontext:road
semantics (like lane connectivity, stop lines), traffic light informations, and past observations of other
agents,asinbelowFig.
A fundamental aspect of the future state prediction is that it is inherently stochastic,
asagentscan’tknow
eachother’smotivations.Whenwearedriving,wecanneverreallybesurewhatotherdriverswilldo next,and
it is essential to considermultiple outcomesand their likelihood.
Weseekthemodelofthefuturethat canprovideboth(i)aweighted,parsimonioussetofdiscrete
trajectoriesthatcoversspaceoflikelyoutcomesand(ii)aclosed-formevaluationofthelikelihoodof
anytrajectory.Thesetwoattributesenableefficient reasoningin relevantplanninguse-cases,e.g.,
human-likereactionstodiscretetrajectoryhypotheses (e.g., yielding, following), and probabilistic
queriessuchastheexpectedriskofcollisioninaspace-timeregion.
This model addresses these issues with critical insight: it employs a fixed set of trajectory anchors
asthe basisofourmodeling.Thisletsusfactorstochasticuncertaintyhierarchically:First,intent
uncertaintycapturestheuncertaintyofwhatanagentintendstodoandisencodedasadistributionover thesetof
anchortrajectories.Second,givenanintent,control uncertainty representsouruncertainty overhowthey
mightachieveit.Weassumecontroluncertaintyisnormallydistributedateachfuture timestep[Thrun05],
parameterized such that the mean corresponds to a context-specific offset from the anchor state, with the
associated covariance capturing the unimodal aleatoric uncertainty [Kendall17]. In Fig. Illustrates a typical
scenariowheretherearethreelikelyintentsgiventhescenecontext,withcontrol meanoffsetrefinements
respectingroadgeometry,andcontroluncertaintyintuitivelygrowingovertime.
Ourtrajectoryanchorsaremodesfoundinourtrainingdatainstate-sequencespaceviaunsupervised
learning. Theseanchorsprovidetemplatesfor coarse-granularityfuturesfor anagentandmight
correspondtosemanticconceptslike“changelanes,”or“slowdown”(althoughtobeclear,wedon’tuse any
semanticconceptsinourmodeling).
Page 3 of 18
Ourcompletemodelpredicts aGaussian mixturemodel(GMM) ateachtimestep,withthemixture
weights(intentdistribution) fixed overtime.Given suchaparametricdistributionmodel,wecandirectly
evaluatethelikelihood of anyfuturetrajectoryandhaveasimplewaytoobtainacompact,diverse
weightedsetoftrajectorysamples: theMAP samplefromeachanchor-intent.
Q3. An Object detection approach using MR-CNN

Answer:
Multi-Region CNN (MR-CNN): Object representation using multiple regions to capture several
different aspectsofoneobject.
Page 4 of 18
Network Architectureof MR-CNN
• First, theinputimagegoesthroughActivation MapsModule,asshownabove,andoutputsthe

activationmap.
•Bounding boxorRegion proposals candidates aregeneratedusing Selective Search.
• For eachbounding box candidate B, aset of regions {Ri}, with i=1 tok, aregenerated,thatis why
it is knownasmulti-region. More details about thechoicesof multipleareasaredescribedin next
sub-section.
• ROI pooling is performedfor eachregionRi,croppedorpooledareagoesthroughfully connected (FC)

layers, ateachRegionAdaptationModule.
• Finally, theoutputfromall FC layers areaddedtogethertoforma1D featurevector,whichis an object
representation ofthebounding boxB.
• Here,VGG-16 ImageNetpre-trainedmodelis used.Themax-pooling layer afterthelast conv
layer is removed.
Q4. Object detection using Segmentation-aware CNN

Answer:
Page 5 of 18
• Therearecloseconnectionsbetweensegmentationanddetection.Andsegmentation
relatedquesareempiricallyknowntohelpobjectdetectionoften.
Twomodulesareadded:Activationmapsmoduleforsemanticsegmentation-
•
aware
features,andregionsadaptationmodulefor grammarly segmentation-aware
feature.
•
Thereisnoadditionalannotationusedfortraininghere.
• FCN is usedforanactivationmapmodule.
ThelastFC7layerchannelsnumberischangedfrom4096to512.
•
• Theweaklysupervisedtrainingstrategyisused.Artificial foregroundclass-specific
segmentationmaskiscreatedusingboundingboxannotations.
Page 6 of 18
• More particularly, the ground truth bounding boxes of an image are projected on the spatial
domain of the last hidden layer of theFCN, and the”pixels” that lay inside theprojected boxes
arelabelled asforeground while therest arelabelledasbackground.
• Aftertraining theFCN using the mask,the lastclassification layeris dropped.Only therest

ofFCN isused.
Thoughit isweaklysupervisedtraining, theforegroundprobabilitiesshownasabovestillcarry
•
someinformation,asshownabove.
Theboundingboxusedis1.5× larger than the original bounding box.
•
Q5. What is CRAFT (Object detection)?

Answer:
CRAFT stands for CascadeRegion-proposal-networkAndFasT R-CNN. It is reviewed by the
ChineseAcademyofSciencesandTsinghuaUniversity.InFasterR-CNN,regionproposal
networkisusedtogenerateproposals.Theseproposals,afterROIpooling,aregoing through
networkforclassification.However,CRAFTisfoundthatthereisacoreproblem inFasterR-CNN:
• In proposal generation, there is still a large proportion of background regions. The existence
of manybackgroundsamplecausesmanyfalsepositives.
Page 7 of 18
In CRAFT(Cascade Region-proposal-network), asshownabove,anotherCNN(Convolutional neural
network)is addedafterRPN togeneratefewer proposals (i.e., 300 here). Then, classification is performed
on300proposalsandoutputsabout20first detectionresults. For eachprimitive result, refinedobject
detection is performedusing one-vs-restclassification.
CascadeProposal Generation
BaselineRPN
•An ideal proposalgeneratorshould generateas few proposal as possiblewhile covering almost all
object instances.Duetoresolutionloss causedby CNN pooling operationand thefixed aspect ratio
of thesliding window,RPN is weakatcovering objects withextremeshapesorscales.
Page 8 of 18
Recall Rates (is in %), Overall is 94.87%, lowerthan94.87% is boldin thetext.
• TheaboveresultsarebaselineRPNbasedonVGG_MtrainedusingPASCALVOC
2007train+val,andtestedonthetestset.
• Therecall rateoneachobjectcategoryvaries alot. Objectwithextremeaspectratioandscale are
hardtobedetected,suchasboatandbottle.
Proposed Cascade
Structure
Page 9 of 18
TheconcatenationclassificationnetworkafterRPNisdenotedasFRCNNethere
• Anadditional classification network that comes after RPN.

• Theadditionalnetworkis the2-classdetectionnetworkdenotedasFRCNnetin abovefigure.It
usesoutputofRPNastrainingdata.
AfterRPNnetistrained,the2000firstproposalsofeachtrainingimageareusedastraining
•
data fortheFRCNnet.
Duringtraining,+veand -vesamplingarebasedon0.7IoUfornegativesand below0.3IoU
• for negatives,respectively.
There are 2 advantages:
•
1)First, additional FRCN net further improvesqualityoftheobjectproposalsandshrinks
• morebackgroundregions,makingproposalsfitbetterwithtaskrequirement.
2)Second,proposalsfrommultiplesourcescanbemergedastheinputoftheFRCNnetsothat
•
complementaryinformationcanbeused.
Q6. Explain YOLOv1 for Object

Detection.
Answer:
Page 10 of 18
YOLOv1 stands for You Look Only Once, it is reviewed by FAIR (Facebook AI
Research).Thenetwork onlylooksattheimageoncetodetectmultipleobjects.
By just looking image once, thedetection speedis in real-time(45 fps). Fast YOLOv1 achieves155 fps.
YOLO suggestshaving aunified networktoperformall atonce. Also, anend-to-endtraining networkcan

beachieved.
Theinputimageis dividedintotheS×S grid (S=7). If the center of the object falls into the grid
cell, that grid cell is responsible for detecting that object.
Each grid cell predict B boundingboxes(B=2) andconfidencescoresfor thoseboxes.Theseconfidence
scorereflecthowconfidentmodelisthattheboxcontainsanobject,i.e.,anyobjectsinthebox,P(Objects). Each
boundingboxconsistsoffivepredictions:x,y,w,h,andconfidence.
Page 11 of 18
•The(x,y)coordinatesrepresentcenteroftheboxrelativetotheboundofthegridcell.
•Theheighthandwidthwarepredictedrelativetowholeimage.
• The confidencerepresentstheIOU (IntersectionOverUnion) betweenthepredictedboxandany

groundtruthbox.
Eachgridcellalsopredictsconditional classprobabilities,P(Class|Object). (Totalnumberofclasses=20)
The output size becomes:

7×7×(2×5+20)=1470
Network Architectureof YOLOv1
Page 12 of 18
Themodelconsistsof 24convolutionallayers, followedbytwofully connectedlayers. Alternating 1×1
convolutionallayersreducefeaturesspacefrompreceding layers. (1×1)Conv has beenusedin GoogLeNet for
reducing thenumberof parameters.)
Fast YOLO fewerconvolutional layers (9 instead of 24) andfewer filters in thoselayers. The network
pipelineis summarizedlike below:
Therefore,wecanseethattheinputimagegoesthroughnetworkonceandthenobjectscanbedetected.
Andwecanhaveend-to-endlearning.
Page 13 of 18
Q7. Adversarial Examples Improve Image Recognition
Answer:
Adversarial examples crafted by adding imperceptible perturbations to images can lead to

(ConvNets)Convolutional Neural Networks to make wrong predictions. The existence of
adversarial
examplesnotonlyreveallimitedgeneralizationability ofConvNets,butalsoposessecurity threatson the
real-world deployment of these models. Since the first discovery of the vulnerability of ConvNets to
adversarialattacks,manyeffortshavebeenmadetoimprovenetworkrobustness.
Page 14 of 18
Above Fig.: AdvProp improves image recognition. By training model on ImageNet, AdvProp helps
EfficientNet-B7 to achieve 85.2% accuracy on ImageNet, 52.9% mCE (mean corruption error,
loweris better)onImageNet-C,44.7%accuracyonImageNet-Aand26.6%accuracyonStylized-ImageNet,
beating its vanilla counterpart by 0.7%, 6.5%, 7.0% and 4.8%, respectively. Theses sample images are
randomlyselectedfromcategory“goldfinch.”
In this paper, rather than focusing on defending against adversarial examples, we shift our
attentionto
leveragingadversarialexamplestoimproveaccuracy.Previousworksshowthattrainingwithadversarial
examplescanenhancemodelgeneralizationbutarerestrictedtocertainsituations—theimprovementis only
observedeitheronsmalldatasets(e.g., MNIST) in the fully-supervisedsetting [5], oronlarger
datasetsbutinthesemi-supervisedsetting[21,22].Meanwhile,recentworks[15,13,31]alsosuggest that
training withadversarialexamples onlargedatasets,e.g., ImageNet[23], withsupervisedlearning
resultsin performancedegradationoncleanimages.To summarize,it remainsanopenquestionof how
adversarialexamplescanbeusedeffectivelytohelpvisionmodels.
Weobserveallpreviousmethodsjointlytrainovercleanimagesandadversarialexampleswithout
distinction, even though they should be drawn from different underlying distributions. We
hypothesize thisdistributionmismatchbetweenfreshexamplesandadversarialexamplesisakeyfactor
thatcauses performancedegradationinpreviousworks.
Q8. Advancing NLP with Cognitive Language Processing Signals

Answer:
Whenreading,humansprocesslanguage“automatically”withoutreflectingoneachstep—Humans
string words together into sentences, understand the meaning of spoken and written ideas, and
process languagewithoutoverthinkingabouthowtheunderlyingcognitiveprocesshappens.Thisprocess
generatescognitivesignalsthatcouldpotentiallyfacilitatenaturallanguageprocessingtasks.
In recent years, collecting these signals has become increasingly accessible and less
expensivePapoutsakietal.(2016); asaresult,using cognitivefeaturestoimproveNLPtaskshasbecome
morepopular.Forexample,researchershaveproposedarangeofworkthatuseseye-trackingorgaze signalsto
improvepart-of-speechtagging(Barrettetal.,2016),sentimentanalysis(Mishraetal.,2017), namedentity
recognitionHollensteinand Zhang(2019),amongothertasks.Moreover,thesesignalshave beenused
successfully toregularizeattention inneuralnetworksforNLP Barrettetal.(2018).
However,mostpreviousworkleveragesonlyeye-tracking data,presumablybecauseitisthemost
accessibleformofcognitive languageprocessingsignal. Also, moststate-of-the-artwork(SOTA)focused
onimprovingasingletaskwithasingletypeofcognitivesignal.Butcancognitiveprocessingsignals bringconsistent
improvementsacrossmodality(e.g.,eye-trackingandEEG)andacrossvariousNLP
Page 15 of 18

Data Science Interview Questions 30 Days 1686062665

Uploaded by

Copyright:

Available Formats

Data Science Interview Questions 30 Days 1686062665

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Data Science Interview Questions 30 Days 1686062665

Uploaded by

Copyright:

Available Formats

30 Days of Interview

Deep Learning: It is a technique for implementing ML.

Types of Machine Learning

Q3. Describe the general architecture of Machine learning.

Q4. What is Linear Regression?

Q6. What is L1 Regularization (L1 = lasso) ?

.Q7L2 Regularization(L2 =Ridge Regression)

.So to avoid overfitting, we perform Regularization

formula for Ridge Regression:-

Q8. What is R square(where to use and where not)?

R-squared = Explained variation / Total variation

R-squared is always between 0 and 100%.

Q9. What is Mean Square Error?

The line equationy=Mx+B. We want to findM (slope)andB (y-intercept)that minimizes the

Blueline: Hyper Plane; Red Line: Boundary-Line

Logistic Regressionis used whenthedependent variable (target) is categorical.

If ‘Z’ goestoinfinity, Y(predicted) will become1,andif ‘Z’ goestonegativeinfinity, Y(predicted)

Cost( hΘ (x) , Y(Actual)) = -log (hΘ (x)) if y=1

This implementationis forbinarylogistic regression.For datawithmorethan2classes, softmaxre

Q2. Difference between logistic and linear regression?

Key Differences between Linear and Logistic Regression

Linearregression modelsdatausingcontinuousnumeric value. As against, logistic regression

Q3. Why we can’t do a classification problem using Regression?

Nowourh(x)>0.5→malignant doesn't workanymore. To keep making correct predictions, we

Q4. What is Decision Tree?

A decision tree is a type of supervised learning algorithm that can be used in

Q5. Entropy, Information Gain, Gini Index, Reducing Impurity?

Entropyvariesfrom0to1.0if all thedatabelongtoasingle class and1if theclass distributionis

2. Take average informationentropy fortheattribute.

3. Calculate gain forthecurrent attribute.

A leaf nodeis decidedwhenentropy is zero

2)CART Algorithm (Classification andRegression trees): In CART, weusetheGINI index as

Choosethesplitbasedonhigher Gini value

Therearetwotypes of pruning: Pre-pruning andPost-pruning.

1. Pre-pruningis alsoknownastheearly stoppingcriteria. As thenamesuggests, thecriteria

TheCP (complexity parameter)is usedtocontroltreegrowth.If thecostofaddingavariable is

Q7. How to handle a decision tree for numerical and categorical

1. If thefeatureis categorical, thesplit is donewith theelementsbelonging toa particular

At everysplit, thedecisiontreewill takethebestvariableatthatmoment.This will bedone

Q9. What is Variance and Bias tradeoff?

Variance: Variability of amodelprediction foragiven datapoint. Wecan build themodelmultiple

For example:VotingRepublican-13VotingDemocratic-16Non-Respondent-21Total-50 The

Bagging (Bootstrap Aggregation)is usedwhenourgoal is toreducethevariance of adecision tree.

Thedifferenttypes of boosting algorithms are:

Q11. What is SVM Classification?

SVM orLarge marginclassifier is asupervisedlearningalgorithmthatusesapowerfultechnique

Advantages of SVM classifier:

Disadvantagesof SVM classifier:

Q11. What is Naive Bayes Classification and Gaussian Naive Bayes

Now,withregardstoourdataset,wecanapplyBayes’ theoremin following way:

Q12. What is the Confusion Matrix?

Q13. What is Accuracy and Misclassification Rate?

Misclassification Rateis definedastheratioofthesumofFalse Positive andFalse

Q14. True Positive Rate & True Negative Rate

Q15. What is False Positive Rate & False negative Rate?

Recall is given by the relation:

The relation gives precision:

Q17. What is RandomizedSearchCV?

Q19. What is BaysianSearchCV?