DATA SCIENCE
INTERVIEW PREPARATION
(30 Days of Interview Preparation)
# DAY 01
Q1. What is the difference between AI, Data Science, ML, and DL?
Ans 1 :
Artificial Intelligence: AI began as a purely mathematical and scientific exercise, but once it became computational, it started to solve human problems and was formalized into a subset of computer science. Artificial intelligence has moved the original computational-statistics paradigm towards the modern idea that machines can mimic actual human capabilities, such as decision making and performing more "human" tasks.
Modern AI falls into two categories:
1. General AI - planning, decision making, identifying objects, recognizing sounds, social and business transactions
2. Applied AI - driverless/autonomous cars, or machines that smartly trade stocks
Machine Learning: Instead of engineers "teaching" or programming computers to have what they need to carry out tasks, the idea is that computers can teach themselves, i.e. learn something without being explicitly programmed to do so. ML is a form of AI where, given more data, systems can change their actions and responses, making them more efficient, adaptable and scalable, e.g., navigation apps and recommendation engines. It is classified into:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
Data Science: Data science uses many tools, techniques, and algorithms drawn from these fields, plus others, to handle big data.
The goal of data science, somewhat similar to machine learning, is to make accurate predictions and to automate and perform transactions in real time, such as purchasing internet traffic or automatically generating content.
Data science relies less on math and coding and more on data and building new systems to process the
data. Relying on the fields of data integration, distributed architecture, automated machine learning, data
visualization, data engineering, and automated data-driven decisions, data science can cover an entire
spectrum of data processing, not only the algorithms or statistics related to data.
Q2. What is the difference between Supervised learning, Unsupervised learning and
Reinforcement learning?
Ans 2 :
Machine Learning
Machine learning is the scientific study of algorithms and statistical models that computer systems use to
effectively perform a specific task without using explicit instructions, relying on patterns and inference
instead.
We build a model by learning the patterns in historical data, and the relationships within that data, in order to make data-driven predictions.
Supervised learning
In a supervised learning model, the algorithm learns on a labeled dataset, to generate reasonable
predictions for the response to new data. (Forecasting outcome of new data)
• Regression
• Classification
Unsupervised learning
An unsupervised model, in contrast, is given unlabelled data that the algorithm tries to make sense of by extracting features, co-occurrences and underlying patterns on its own. We use unsupervised learning for
• Clustering
• Anomaly detection
• Association
• Autoencoders
Reinforcement Learning
Reinforcement learning is less supervised: it depends on a learning agent that determines its outputs by exploring different possible actions and converging on the best possible solution.
Business understanding: Understand the given use case; it is also good to know more about the domain for which the use case is built.
Data Acquisition and Understanding: Data gathering from different sources and understanding the data. Cleaning the data, handling missing data if any, data wrangling, and EDA (Exploratory Data Analysis).
Modeling: Feature Engineering - scaling the data, feature selection - not all features are important. We
use the backward elimination method, correlation factors, PCA and domain knowledge to select the
features.
Model Training based on trial and error method or by experience, we select the algorithm and train with
the selected features.
Model evaluation: accuracy of the model, confusion matrix and cross-validation.
If accuracy is not high, then to achieve higher accuracy we tune the model, either by changing the algorithm used, by feature selection, or by gathering more data, etc.
Deployment - Once the model has good accuracy, we deploy the model, either in the cloud or on a Raspberry Pi or any other place. Once we deploy, we monitor the performance of the model. If it is good, we go live with the model; otherwise we reiterate the whole process until the model's performance is good.
It's not done yet!!!
What if, after a few days, our model performs badly because of new data? In that case, we do the whole process again by collecting new data and redeploying the model.
Ans 4:
Linear Regression tends to establish a relationship between a dependent variable(Y) and one or more
independent variable(X) by finding the best fit of the straight line.
The equation for the Linear model is Y = mX+c, where m is the slope and c is the intercept
In the above diagram, the blue dots show the distribution of 'y' with respect to 'x'. There is no straight line that runs through all the data points. So, the objective here is to fit the best straight line, one that minimizes the error between the expected and actual values.
Q5. OLS Stats Model (Ordinary Least Square)
Ans 5:
OLS is a statistical model which will help us identify the more significant features, i.e. those that have an influence on the output. The OLS model in Python is executed as:
import statsmodels.formula.api as smf
lm = smf.ols(formula='Sales ~ am + constant', data=data).fit()
lm.conf_int()
lm.summary()
And we get the output as below,
The higher the t-value for a feature, the more significant that feature is to the output variable. The p-value also plays a role in rejecting the null hypothesis (the null hypothesis states that the feature has zero significance for the target variable). If the p-value is less than 0.05 (95% confidence level) for a feature, then we can consider the feature to be significant.
Ans 6:
The main objective of creating a model (training on data) is to make sure it fits the data properly and reduces the loss. Sometimes a model that is trained fits the training data but fails and gives poor performance when analyzing new data (test data). This is overfitting. Regularization was introduced to overcome overfitting.
Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds “Absolute value of
magnitude” of coefficient, as penalty term to the loss function.
Lasso shrinks the less important features' coefficients to zero, thus removing some features altogether. So this works well for feature selection in case we have a huge number of features.
Methods like cross-validation and stepwise regression also handle overfitting and perform feature selection, but they work well with a small set of features. Lasso is the better choice when we are dealing with a large set of features.
Along with shrinking coefficients, the lasso performs feature selection, as well. (Remember the
‘selection‘ in the lasso full-form?) Because some of the coefficients become exactly zero, which is
equivalent to the particular feature being excluded from the model.
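As a rough illustration (not from the original text), here is a minimal scikit-learn sketch of lasso driving some coefficients to exactly zero; the toy dataset and alpha value are arbitrary assumptions:
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression

# toy data: only 5 of the 20 features actually carry signal
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=10, random_state=0)
lasso = Lasso(alpha=1.0)      # larger alpha -> stronger shrinkage
lasso.fit(X, y)
# coefficients of the uninformative features are typically driven exactly to zero,
# which is equivalent to dropping those features from the model
print(sum(coef == 0 for coef in lasso.coef_), "of", len(lasso.coef_), "coefficients are zero")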
Ans 7:
Overfitting happens when the model learns the signal as well as the noise in the training data, and so doesn't perform well on new/unseen data the model wasn't trained on.
To avoid overfitting your model on the training data, use techniques like cross-validation sampling, reducing the number of features, pruning, and regularization.
The Regression model that uses L2 regularization is called Ridge Regression. The
Regularization adds the penalty as model complexity increases. The regularization parameter
(lambda) penalizes all the parameters except intercept so that the model generalizes the data and
won’t overfit.
Ridge regression adds “squared magnitude of the coefficient" as penalty term to the loss
function. Here the box part in the above image represents the L2 regularization element/term.
Lambda is a hyperparameter.
If lambda is zero, then it is equivalent to OLS. But if lambda is very large, then it will add too much weight, and it will lead to under-fitting.
Ridge regularization forces the weights to be small but does not make them zero and does not give a sparse solution.
Ridge is not robust to outliers, as the squared error term blows up the differences for the outliers, and the regularization term tries to fix this by penalizing the weights.
Ridge regression performs better when all the input features influence the output and all the weights are of roughly equal size.
L2 regularization can learn complex data patterns.
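A minimal hedged sketch (assuming scikit-learn and toy data, neither of which comes from the original) showing how the lambda/alpha hyperparameter controls how strongly ridge shrinks the weights:
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)
for alpha in [0.01, 1.0, 1000.0]:          # alpha plays the role of lambda
    ridge = Ridge(alpha=alpha).fit(X, y)
    # weights shrink towards (but never exactly to) zero as alpha grows
    print(alpha, [round(w, 2) for w in ridge.coef_[:3]])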
Ans 8.
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also
known as the coefficient of determination, or the coefficient of multiple determination for
multiple regression.
The definition of R-squared is the percentage of the response variable variation that is explained by a
linear model.
0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data.
There is a problem with R-squared. The problem arises when we ask ourselves: is it good to add as many independent variables as possible?
The answer is no, because we understand that each independent variable should have a meaningful impact. But even if we add independent variables that are not meaningful, will the R-squared value improve?
Yes, and this is the basic problem with R-squared. No matter how many junk, non-impactful or impactful independent variables you add to your model, the R-squared value will always increase. It will never decrease with the addition of a new independent variable, whether it is impactful, non-impactful, or a bad variable. So we need another measure, equivalent to R-squared, which penalizes our model for any junk independent variable.
So, we calculate the Adjusted R-Square with a better adjustment in the formula of generic R-square.
The mean squared error tells you how close a regression line is to a set of points. It does this by taking
the distances from the points to the regression line (these distances are the “errors”) and squaring
them.
Giving an intuition
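As a rough numeric intuition (a sketch, not from the original), R-squared, adjusted R-squared and MSE can be computed as follows; the y_true/y_pred values and the number of predictors p are made-up assumptions:
from sklearn.metrics import r2_score, mean_squared_error

y_true = [3.0, 5.0, 7.5, 10.0]
y_pred = [2.8, 5.3, 7.0, 10.4]

r2 = r2_score(y_true, y_pred)             # coefficient of determination
mse = mean_squared_error(y_true, y_pred)  # average of the squared errors
n, p = len(y_true), 2                     # n samples, p independent variables (assumed)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # penalizes junk variables
print(r2, adj_r2, mse)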
Q10. Why Support Vector Regression? Difference between SVR and a simple regression
model?
Ans 10:
In simple linear regression, try to minimize the error rate. But in SVR, we try to fit the error within
a certain threshold.
Main Concepts:-
1. Boundary
2. Kernel
3. Support Vector
4. Hyper Plane
Our best-fit line is the hyperplane that has the maximum number of points inside the margin.
What we are trying to do here is decide a decision boundary at distance 'e' from the original hyperplane, such that the data points closest to the hyperplane, the support vectors, are within that boundary line.
DATA SCIENCE
INTERVIEW PREPARATION
(30 Days of Interview Preparation)
# DAY 02
Q1. What is Logistic Regression?
Answer:
The logistic regression technique involves a dependent variable which can be represented in binary (0 or 1, true or false, yes or no) values, which means that the outcome can only be in one of two forms. For example, it can be utilized when we need to find the probability of a successful or failed event.
Model
Output = 0 or 1
Z = WX + B
hΘ(x) = sigmoid(Z)
logit: log(P(X) / (1 - P(X))) = WX + B
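A minimal sketch (assuming scikit-learn; the one-feature binary data is a made-up assumption, not from the original) of fitting and using a logistic regression model:
from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # e.g. tumor size
y = np.array([0, 0, 0, 1, 1, 1])                          # 0 = benign, 1 = malignant

clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.5]]))        # predicted class (0 or 1)
print(clf.predict_proba([[2.5]]))  # sigmoid output: P(y=0), P(y=1)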
Cost Function
Answer:-
With linear regression, you fit a polynomial through the data - say, like in the example below, we fit a straight line through a {tumor size, tumor type} sample set:
Above, malignant tumors get 1, and non-malignant ones get 0, and the green line is our hypothesis h(x). To make predictions, we may say that for any given tumor size x, if h(x) gets bigger than 0.5, we predict a malignant tumor; otherwise, we predict benign.
It looks like this way we could correctly predict every single training-set sample, but now let's change the task a bit.
Intuitively it's clear that all tumors larger than a certain threshold are malignant. So let's add another sample with a huge tumor size, and run linear regression again:
We cannot change the hypothesis each time a new sample arrives. Instead, we should learn it from the training-set data, and then (using the hypothesis we've learned) make correct predictions for data we haven't seen before.
Linear regression is unbounded.
the data is distributed recursively on the basis of attribute values. 4) Deciding which attributes are placed at the root node or an internal node is done by using a statistical approach.
1) ID3 (Iterative Dichotomiser 3): This solution uses Entropy and Information Gain as metrics to form a better decision tree. The attribute with the highest information gain is used as the root node, and a similar approach is followed after that. Entropy is the measure that characterizes the impurity of an arbitrary collection of examples.
1. Compute the entropy for the data set.
2. For every attribute:
   1. Calculate the entropy for all categorical values.
Split on Gender:
Gini for sub-node Female = (0.2)*(0.2) + (0.8)*(0.8) = 0.68
Gini for sub-node Male = (0.65)*(0.65) + (0.35)*(0.35) = 0.55
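To make the arithmetic above concrete, here is a small hand-rolled sketch (plain Python; the proportions simply restate the example):
def gini(p):
    # purity score used above: p^2 + (1-p)^2 (higher = purer sub-node)
    return p * p + (1 - p) * (1 - p)

print(gini(0.2))    # Female sub-node: 0.2*0.2 + 0.8*0.8 = 0.68
print(gini(0.65))   # Male sub-node:   0.65*0.65 + 0.35*0.35 ~= 0.55
# the weighted Gini of the split combines the sub-nodes by their sample fractions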
Q6. How to control leaf height and Pruning?
Answer:
To control the leaf size, we can set the parameters:-
1. Maximum depth:
Maximum tree depth is a limit to stop further splitting of nodes when the specified tree depth has been reached during the building of the initial decision tree.
NEVER use maximum depth to limit the further splitting of nodes. In other words: use the largest possible value.
2. Minimum split size:
Minimum split size is a limit to stop further splitting of nodes when the number of observations in the node is lower than the minimum split size.
This is a good way to limit the growth of the tree. When a leaf contains too few observations, further splitting will result in overfitting (modeling of noise in the data).
3. Minimum leaf size:
Minimum leaf size is a limit to split a node when the number of observations in one of the child nodes is lower than the minimum leaf size.
Pruning is mostly done to reduce the chances of overfitting the tree to the training data and to reduce the overall complexity of the tree.
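A hedged scikit-learn sketch (the specific values are arbitrary assumptions, not from the original) of how these limits map onto DecisionTreeClassifier arguments:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=None,        # maximum depth: None lets the tree grow as deep as possible
    min_samples_split=20,  # minimum split size: don't split nodes with fewer samples
    min_samples_leaf=10,   # minimum leaf size: each child must keep at least this many
    ccp_alpha=0.01,        # cost-complexity (post-)pruning strength
)
# tree.fit(X_train, y_train) would then build and prune the tree on your data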
2. In post-pruning, the idea is to allow the decision tree to grow fully and observe the CP (Complexity Parameter) value. Next, we prune/cut the tree with the optimal CP value as the parameter.
Answer:
Decision trees can handle both categorical and numerical variables at the same time as features. There is not any problem in doing that.
Every split in a decision tree is based on a feature. The split is chosen according to an impurity measure evaluated on the resulting branches, and the fact that the variable used for the split is categorical or continuous is irrelevant (in fact, decision trees categorize continuous variables by creating binary regions with the threshold).
At last, a good approach is to always convert your categorical variables to a numeric representation using LabelEncoder or OneHotEncoding.
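A minimal sketch of both encoders (assuming scikit-learn; the toy color column is an assumption):
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = [["red"], ["green"], ["blue"], ["green"]]

le = LabelEncoder()
print(le.fit_transform([c[0] for c in colors]))   # one integer code per category

ohe = OneHotEncoder()
print(ohe.fit_transform(colors).toarray())        # one 0/1 column per category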
Q8. What is the Random Forest Algorithm?
Answer:
Random Forest is an ensemble machine learning algorithm that follows the bagging technique. The base estimators in the random forest are decision trees. Random forest randomly selects a set of features that are used to decide the best split at each node of the decision tree.
Looking at it step-by-step, this is what a random forest model does:
1. Random subsets are created from the original dataset (bootstrapping).
2. At each node in the decision tree, only a random set of features is considered to decide the best split.
3. A decision tree model is fitted on each of the subsets.
4. The final prediction is calculated by averaging the predictions from all decision trees.
To sum up, the Random Forest randomly selects data points and features and builds multiple trees (the Forest).
Random Forest is used for feature importance selection. The attribute (.feature_importances_) is used to find feature importance.
Some Important Parameters:-
1. n_estimators:- It defines the number of decision trees to be created in the random forest.
2. criterion:- "Gini" or "Entropy."
3. min_samples_split:- Used to define the minimum number of samples required in a node before a split is attempted.
4. max_features:- It defines the maximum number of features considered for the split in each decision tree.
5. n_jobs:- The number of jobs to run in parallel for both fit and predict. Always keep (-1) to use all the cores for parallel processing.
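A hedged sketch (assuming scikit-learn; the data and parameter values are placeholders, not from the original) wiring these parameters into RandomForestClassifier:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rf = RandomForestClassifier(
    n_estimators=100,      # number of decision trees
    criterion="gini",      # or "entropy"
    min_samples_split=4,   # minimum samples needed to attempt a split
    max_features="sqrt",   # features considered at each split
    n_jobs=-1,             # use all cores
    random_state=0,
).fit(X, y)
print(rf.feature_importances_)   # feature importance selection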
Answer:
In predictive models, the prediction error is composed of two different errors:
1. Bias
2. Variance
It is important to understand the bias-variance trade-off, which tells us how to minimize the bias and the variance in the prediction and avoid overfitting and underfitting of the model.
Bias: It is the difference between the expected or average prediction of the model and the correct value which we are trying to predict. Imagine we try to build more than one model by collecting different data sets; later on, when evaluating the predictions, we may end up with different predictions from each model. Bias measures how far these model predictions are from the correct prediction. High bias always leads to a high error on training and test data.
Q10. What are Ensemble Methods?
Answer
1. Bagging and Boosting
Decision trees have been around for a long time and are known to suffer from bias and variance. You will have a large bias with simple trees and a large variance with complex trees.
Ensemble methods combine several decision trees to produce better predictive performance than utilizing a single decision tree. The main principle behind the ensemble model is that a group of weak learners come together to form a strong learner.
Two techniques to perform ensemble decision trees:
1. Bagging
2. Boosting
Boosting is another ensemble technique to create a collection of predictors. In this technique, learners are learned sequentially, with early learners fitting simple models to the data and then analyzing the data for errors. In other words, we fit consecutive trees (on random samples), and at every step, the goal is to solve for the net error from the prior tree.
When a hypothesis misclassifies an input, its weight is increased, so that the next hypothesis is more likely to classify it correctly. Combining the whole set at the end converts weak learners into a better-performing model.
Answer:
1) Linear SVM: In Linear SVM, the data points are expected to be separated by some apparent gap. Therefore, the SVM algorithm predicts a straight hyperplane dividing the two classes. The hyperplane is also called the maximum-margin hyperplane.
2) Non-Linear SVM: It is possible that our data points are not linearly separable in a p-dimensional space but can be linearly separable in a higher dimension. Kernel tricks make it possible to draw nonlinear hyperplanes. Some standard kernels are a) Polynomial Kernel b) RBF kernel (mostly used).
To be clear, an example of a feature vector and corresponding class variable can be (refer to the 1st row of the dataset):
X = (Rainy, Hot, High, False), y = No. So basically, P(y|X) here means the probability of "Not playing golf" given that the weather conditions are "Rainy outlook", "Temperature is hot", "high humidity" and "no wind".
Naive Bayes Classification:
1. We assume that no pair of features is dependent. For example, the temperature being 'Hot' has nothing to do with the humidity, and the outlook being 'Rainy' does not affect the winds. Hence, the features are assumed to be independent.
2. Secondly, each feature is given the same weight (or importance). For example, knowing only the temperature and humidity alone can't predict the outcome accurately. None of the attributes is irrelevant, and each is assumed to contribute equally to the outcome.
Gaussian Naive Bayes
Continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution. A Gaussian distribution is also called a Normal distribution. When plotted, it gives a bell-shaped curve which is symmetric about the mean of the feature values, as shown below:
This is as simple as calculating the mean and standard deviation values of each input variable (x) for each class value.
Mean(x) = 1/n * sum(x)
Where n is the number of instances, and x are the values of an input variable in your training data. We can calculate the standard deviation using the following equation:
Standard deviation(x) = sqrt(1/n * sum((xi - mean(x))^2))
When to use what? Standard Naive Bayes only supports categorical features, while Gaussian Naive Bayes only supports continuously valued features.
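A minimal sketch (assuming scikit-learn; the numeric toy data is an assumption) contrasting the two variants:
from sklearn.naive_bayes import GaussianNB, CategoricalNB
import numpy as np

# continuous features -> GaussianNB (per-class mean and std, as described above)
Xc = np.array([[25.0, 80.0], [30.0, 85.0], [15.0, 60.0], [10.0, 55.0]])
y = np.array([0, 0, 1, 1])
print(GaussianNB().fit(Xc, y).predict([[28.0, 82.0]]))

# categorical features (encoded as integers) -> CategoricalNB
Xd = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
print(CategoricalNB().fit(Xd, y).predict([[0, 1]]))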
Answer:
A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm.
A confusion matrix is a summary of prediction results on a classification problem. The number of correct and incorrect predictions is summarized with count values and broken down by each class. This is the key to the confusion matrix.
It gives us insight not only into the errors being made by a classifier but, more importantly, the types of errors that are being made.
Here,
Class 1: Positive
Class 2: Negative
Definition of the Terms:
1. Positive (P): Observation is positive (for example: is an apple).
2. Negative (N): Observation is not positive (for example: is not an apple).
3. True Positive (TP): Observation is positive, and is predicted to be positive.
4. False Negative (FN): Observation is positive, but is predicted negative.
5. True Negative (TN): Observation is negative, and is predicted to be negative.
6. False Positive (FP): Observation is negative, but is predicted positive.
Answer:
Accuracy
Accuracy is defined as the ratio of the sum of True Positives and True Negatives to the total: Accuracy = (TP + TN) / (TP + TN + FP + FN).
However, there are problems with accuracy. It assumes equal costs for both kinds of errors. A 99% accuracy can be excellent, good, mediocre, poor, or terrible depending upon the problem.
Misclassification Rate: the ratio of incorrect predictions to the total, i.e. (FP + FN) / (TP + TN + FP + FN).
Answer:
True Positive Rate:
Sensitivity (SN) is calculated as the number of correct positive predictions divided by the total number of positives. It is also called Recall (REC) or true positive rate (TPR). The best sensitivity is 1.0, whereas the worst is 0.0.
True Negative Rate
Specificity (SP) is calculated as the number of correct negative predictions divided by the total number of negatives. It is also called the true negative rate (TNR). The best specificity is 1.0, whereas the worst is 0.0.
The false positive rate (FPR) is calculated as the number of incorrect positive predictions divided by the total number of negatives. The best false positive rate is 0.0, whereas the worst is 1.0. It can also be calculated as 1 - specificity.
False Negative Rate
The false negative rate (FNR) is calculated as the number of positives incorrectly predicted as negative divided by the total number of positives. The best false negative rate is 0.0, whereas the worst is 1.0.
Q16. What are F1 Score, precision and recall?
Recall:-
Recall can be defined as the ratio of the total number of correctly classified positive examples divided by the total number of positive examples.
1. High Recall indicates the class is correctly recognized (small number of FN).
2. Low Recall indicates the class is incorrectly recognized (large number of FN).
Precision:
To get the value of precision, we divide the total number of correctly classified positive examples by the total number of predicted positive examples.
1. High Precision indicates that an example labeled as positive is indeed likely to be positive (a small number of FP).
2. Low Precision indicates that many examples labeled as positive are actually negative (a large number of FP).
Remember:-
High recall, low precision: This means that most of the positive examples are correctly recognized (low FN), but there are a lot of false positives.
Low recall, high precision: This shows that we miss a lot of positive examples (high FN), but those we predict as positive are indeed positive (low FP).
F-measure/F1-Score:
Since we have two measures (Precision and Recall), it helps to have a measurement that
represents both of them. We calculate an F-measure, which uses Harmonic Mean in place of
Arithmetic Mean, as it punishes the extreme values more.
The F-Measure will always be nearer to the smaller value of Precision or Recall.
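For reference, the usual formula is F1 = 2 * (Precision * Recall) / (Precision + Recall). A tiny hedged sketch (assuming scikit-learn; the labels are made up):
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall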
Answer:
RandomizedSearchCV is used to perform a random search over hyperparameters. RandomizedSearchCV uses the fit and score methods, predict_proba, decision_function, transform, etc.
The parameters of the estimator used to apply these methods are optimized by cross-validated search over parameter settings.
In contrast to GridSearchCV, not all parameter values are tried out; rather, a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter.
Code Example:
class sklearn.model_selection.RandomizedSearchCV(estimator, param_distributions,
    n_iter=10, scoring=None, fit_params=None, n_jobs=None, iid='warn',
    refit=True, cv='warn', verbose=0, pre_dispatch='2*n_jobs',
    random_state=None, error_score='raise-deprecating',
    return_train_score='warn')
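A hedged usage sketch (the estimator, distributions and data are assumptions, not from the original):
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from scipy.stats import randint

X, y = make_classification(n_samples=300, random_state=0)
search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(50, 300), "max_depth": randint(2, 10)},
    n_iter=10,        # only 10 sampled parameter settings are tried
    cv=3,
    random_state=0,
).fit(X, y)
print(search.best_params_)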
Q18. What is GridSearchCV?
Answer:
Grid search is the process of performing hyperparameter tuning to determine the optimal values for a given model.
CODE Example:-
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
gsc = GridSearchCV(
    estimator=SVR(kernel='rbf'),
    param_grid={
        'C': [0.1, 1, 100, 1000],
        'epsilon': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10],
        'gamma': [0.0001, 0.001, 0.005, 0.1, 1, 3, 5]
    },
    cv=5, scoring='neg_mean_squared_error', verbose=0, n_jobs=-1)
Grid search runs the model on all the possible combinations of hyperparameter values and outputs the best model.
Answer:
Bayesian search, in contrast to grid and random search, keeps track of past evaluation results, which it uses to form a probabilistic model mapping hyperparameters to a probability of a score on the objective function.
Code:
from skopt import BayesSearchCV
from sklearn.svm import SVC

opt = BayesSearchCV(
    SVC(),
{
'C': (1e-6, 1e+6, 'log-uniform'),
'gamma': (1e-6, 1e+1, 'log-uniform') ,
'degree':(1, 8), # integer valued parameter
'kernel': ['linear', 'poly', 'rbf']
},
n_iter=32,
cv=3)
Answer:
Zero Component Analysis (ZCA):
Making the covariance matrix the identity matrix is called whitening. This will remove the first- and second-order statistical structure.
ZCA transforms the data to zero mean and makes the features linearly independent of each other.
In some image analysis applications, especially when working with tiny color images, it is frequently useful to apply some whitening to the data before, e.g., training a classifier.
DATA SCIENCE
INTERVIEW PREPARATION
(30 Days of Interview
Preparation)
# DAY 03
Q1. How do you treat heteroscedasticity in regression?
What causes Heteroscedasticity?
Heteroscedasticity occurs more often in datasets where we have a large range between the largest and the smallest observed values. There are many reasons why heteroscedasticity can exist, and a generic explanation is that the error variance changes proportionally with a factor.
We can categorize Heteroscedasticity into two general types:-
Pure heteroscedasticity:- It refers to cases where we specify the correct model and yet observe non-constant variance in the residual plots.
Impure heteroscedasticity:- It refers to cases where you incorrectly specify the model, and that causes the non-constant variance. When you leave an important variable out of a model, the omitted effect is absorbed into the error term. If the effect of the omitted variable varies throughout the observed range of data, it can produce the telltale signs of heteroscedasticity in the residual plots.
How to Fix Heteroscedasticity
differential. To do this, change the model from using the raw measure to using rates and per-capita values. Of course, this type of model answers a slightly different kind of question. You'll need to determine whether this approach is suitable for both your data and what you need to learn.
Weighted regression:
It is a method that assigns each data point a weight based on the variance of its fitted value. The idea is to give small weights to observations associated with higher variances to shrink their squared residuals. Weighted regression minimizes the sum of the weighted squared residuals. When you use the correct weights, heteroscedasticity is replaced by homoscedasticity.
Correcting Multicollinearity:
1) Remove one of the highly correlated independent variables from the model. If you have two or more factors with a high VIF, remove one from the model.
2) Principal Component Analysis (PCA) - It cuts the number of interdependent variables down to a smaller set of uncorrelated components. Instead of using highly correlated variables, use components in the model that have an eigenvalue greater than 1.
3) Run PROC VARCLUS and choose the variable that has a minimum (1-R2) ratio within a cluster.
4) Ridge Regression - It is a technique for analyzing multiple regression data that suffer from multicollinearity.
5) If you include an interaction term (the product of two independent variables), you can also reduce multicollinearity by "centering" the variables. "Centering" means subtracting the mean from the values of the independent variable before creating the products.
When is multicollinearity not a problem?
1) If your goal is to predict Y from a set of X variables, then multicollinearity is not a problem. The predictions will still be accurate, and the overall R2 (or adjusted R2) quantifies how well the model predicts the Y values.
2) Multiple dummy (binary) variables that represent a categorical variable with three or more categories.
Market basket analysis is the study of items that are purchased or grouped in a single transaction or
multiple, sequential transactions. Understanding the relationships and the strength of those relationships
is valuable information that can be used to make recommendations, cross-sell, up-sell, offer coupons,
etc.
Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy.
The technique of association rules is widely used for retail basket analysis. It can also be used for classification by using rules with class labels on the right-hand side. It is even used for outlier detection, with rules indicating infrequent/abnormal associations.
Association analysis also helps us to identify cross-selling opportunities; for example, we can use the rules resulting from the analysis to place associated products together in a catalog, in the supermarket, or in the Webshop, or apply them when targeting a marketing campaign for product B at customers who have already purchased product A.
Association rules are given in the form below:
A => B [Support, Confidence]. The part before => is referred to as 'if' (Antecedent), and the part after => is referred to as 'then' (Consequent).
Here A and B are sets of items in the transaction data, and A and B are disjoint sets.
Computer => Anti-virus Software [Support = 20%, Confidence = 60%]. The above rule says:
1. 20% of transactions show Anti-virus software bought together with a Computer.
2. 60% of customers who purchase a Computer also buy Anti-virus software.
An example of Association Rules: Assume there are 100 customers.
1. 10 of them bought milk, 8 bought butter, and 6 bought both of them. Consider the rule: bought milk => bought butter.
2. support = P(Milk & Butter) = 6/100 = 0.06
3. confidence = support / P(Milk) = 0.06/0.10 = 0.6
4. lift = confidence / P(Butter) = 0.6/0.08 = 7.5
It is the simplest machine learning algorithm. It is also known as lazy learning (why? Because it does not create a generalized model during training, so the testing phase is very important, where it does the actual job; hence testing is very costly in terms of time and money). It is also called instance-based or memory-based learning.
In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is assigned to the class of that single nearest neighbor.
In k-NN regression, the output is the property value for the object. This value is the average of the values of the k nearest neighbors.
All three distance measures are only valid for continuous variables. In the case of categorical variables, the Hamming distance must be used.
pipeline= Pipeline(steps)
Main important points to be considered:
1. Normalize the data
2. Calculate the covariance matrix
3. Calculate the eigenvalues and eigenvectors
4. Choose components and form a feature vector
5. Form the Principal Components (a minimal sketch follows below)
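A hedged illustration of those steps with scikit-learn (the iris dataset and the choice of two components are assumptions):
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

X = load_iris().data
X_std = StandardScaler().fit_transform(X)   # 1. normalize the data
pca = PCA(n_components=2)                   # 2-4. covariance, eigen-decomposition and component choice happen internally
X_pca = pca.fit_transform(X_std)            # 5. the principal components
print(pca.explained_variance_ratio_)        # variance captured by each component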
Understanding VIF
If the variance inflation factor of a predictor variable is 5, this means that the variance for the coefficient of that predictor variable is 5 times as large as it would be if that predictor variable were uncorrelated with the other predictor variables.
In other words, if the variance inflation factor of a predictor variable is 5, the standard error for the coefficient of that predictor variable is 2.23 times (√5 = 2.23) as large as it would be if that predictor variable were uncorrelated with the other predictor variables.
Weight of evidence (WOE) and information value (IV) are simple yet powerful techniques to perform variable transformation and selection.
The formula to create WOE and IV is given below.
Here is a simple table that shows how to calculate these values.
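The original formula image is not reproduced here; one commonly used convention (an assumption about the intended formula) is:
WOE (per bin) = ln( % of non-events in the bin / % of events in the bin )
IV = sum over all bins of ( % of non-events - % of events ) * WOE
A commonly quoted rule of thumb: IV < 0.02 is not useful, 0.02-0.1 weak, 0.1-0.3 medium, and 0.3-0.5 a strong predictor.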
Q10: How to evaluate that data does not have any outliers ?
In statistics, outliers are data points that don't belong to a certain population. An outlier is an abnormal observation that lies far away from other values; it diverges from otherwise well-structured data.
Detection:
Method 1 - Standard Deviation: In statistics, if a data distribution is approximately normal, then about 68% of the data values lie within one standard deviation of the mean, about 95% are within two standard deviations, and about 99.7% lie within three standard deviations.
Therefore, if you have any data point that is more than 3 standard deviations away from the mean, that point is very likely to be anomalous, i.e. an outlier.
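A small sketch of the 3-standard-deviation rule (the data is a made-up assumption with one injected outlier):
import numpy as np

rng = np.random.default_rng(0)
data = np.append(rng.normal(50, 5, size=200), 120.0)  # 120 is an injected outlier
z_scores = (data - data.mean()) / data.std()
print(data[np.abs(z_scores) > 3])                     # flags the injected point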
Method 3 - Violin Plots: Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator. Typically a violin plot will include all the data that is in a box plot: a marker for the median of the data, a box or marker indicating the interquartile range, and possibly all sample points if the number of samples is not too high.
Q11: What do you do if there are outliers?
Q12: What are the encoding techniques you have applied, with examples?
In many practical data science activities, the data set will contain categorical variables. These variables are typically stored as text values. Since machine learning is based on mathematical equations, it would cause a problem if we kept categorical variables as they are.
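A minimal pandas sketch (column names and values are assumptions) of two of the most common techniques, one-hot encoding and label encoding:
import pandas as pd

df = pd.DataFrame({"city": ["Pune", "Delhi", "Pune", "Mumbai"]})

# one-hot encoding: one 0/1 column per category
print(pd.get_dummies(df, columns=["city"]))

# label (ordinal) encoding: one integer code per category
df["city_code"] = df["city"].astype("category").cat.codes
print(df)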
The irreducible error cannot be reduced regardless of what algorithm is used. It is the error introduced by the chosen framing of the problem and may be caused by factors like unknown variables that influence the mapping of the input variables to the output variable.
Bias: Bias means that the model favors one result more than the others. Bias comes from the simplifying assumptions made by a model to make the target function easier to learn. A model with high bias pays very little attention to the training data and oversimplifies the model. It always leads to a high error on training and test data.
Variance: Variance is the amount that the estimate of the target function will change if different training data were used. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data.
So, the end goal is to come up with a model that balances both bias and variance. This is called the bias-variance trade-off. To build a good model, we need to find a good balance between bias and variance such that it minimizes the total error.
A Type I error would thus occur when the patient doesn't have the virus, but the test shows that they do. In other words, the test incorrectly rejects the true null hypothesis that the patient is HIV negative.
Type II Error
A Type II error is the inverse of a Type I error and is the false acceptance of a null hypothesis that is not true, i.e., a false negative. A Type II error would entail the test telling the patient they are free of HIV when they are not.
Considering this HIV example, which error type do you think is more acceptable? In other words, would you rather have a test that was more prone to Type I or Type II errors? With HIV, the momentary stress of a false positive is likely better than feeling relieved at a false negative and then failing to take steps to treat the disease. Pregnancy tests, blood tests, and any diagnostic tool that has serious consequences for the health of a patient are usually overly sensitive for this reason: they should err on the side of a false positive.
But in most fields of science, Type II errors are seen as less serious than Type I errors. With a Type II error, a chance to reject the null hypothesis was lost, and no conclusion is inferred from a non-rejected null. But a Type I error is more serious, because you have wrongly rejected the null hypothesis and ultimately made a claim that is not true. In science, finding a phenomenon where there is none is more egregious than failing to find a phenomenon where there is one.
Q16: What is the Mean Median Mode standard deviation for
the sample and population?
Mean: It is an important technique in statistics. The arithmetic mean can also be called the average. It is obtained by summing two or more numbers/variables and then dividing the sum by the number of numbers/variables.
Mode: The mode is also one of the ways of finding an average. The mode is the number that occurs most frequently in a group of numbers. Some series might not have any mode; some might have two modes, which is called a bimodal series.
In the study of statistics, the three most common 'averages' are mean, median, and mode.
Median is also a way of finding the average of a group of data points. It's the middle number of a set of numbers. There are two possibilities: the data points can form an odd-numbered group or an even-numbered group.
If the group is odd, arrange the numbers in the group from smallest to largest. The median will be the one which is exactly sitting in the middle, with an equal number on either side of it. If the group is even, arrange the numbers in order, pick the two middle numbers, add them, and then divide by 2. That will be the median of that set.
Standard Deviation (Sigma) Standard Deviation is a measure of how much your data is spread out in
statistics.
Mean Absolute Error: The Mean Absolute Error (MAE) is the average of all absolute errors. The formula is:
MAE = (1/n) * Σ |xi - x|
Where:
n = the number of errors, Σ = the summation symbol (which means "add them all up"), and |xi - x| = the absolute errors. The formula may look a little daunting, but the steps are easy:
Find all of your absolute errors, |xi - x|. Add them all up. Divide by the number of errors. For example, if you had 10 measurements, divide by 10.
Q18: What is the difference between long data and wide data?
There are many different ways that you can present the same dataset to the world. Let's take a look at one of the most important and fundamental distinctions: whether a dataset is wide or long.
The difference between wide and long datasets boils down to whether we prefer to have more columns in our dataset or more rows.
Wide Data: A dataset that emphasizes putting additional data about a single subject in columns is called a wide dataset because, as we add more columns, the dataset becomes wider.
Long Data: Similarly, a dataset that emphasizes including additional data about a subject in rows is called a long dataset because, as we add more rows, the dataset becomes longer. It's important to point out that there's nothing inherently good or bad about wide or long data.
In the world of data wrangling, we sometimes need to make a long dataset wider, and we sometimes need to make a wide dataset longer. However, it is true that, as a general rule, data scientists who embrace the concept of tidy data usually prefer longer datasets over wider ones.
Q19: What are the data normalization methods you have applied, and why?
Normalization is a technique often applied as part of data preparation for machine learning. The goal
of normalization is to change the values of numeric columns in the dataset to a common scale, without
distorting differences in the ranges of values. For machine learning, every dataset does not require
normalization.It is required only whenfeatures have different ranges.
In simple words, when multiple attributes are there, but attributes have values on different scales, this
may lead to poor data models while performing data mining operations. So they are normalized to bring
all theattributes onthesamescale, usually something between(0,1).
It is not always a good idea to normalize thedata since wemight lose information about maximum and
minimumvalues. Sometimesit is agoodidea todoso.
For example, ML algorithms such as Linear Regression or Support Vector Machines typically converge faster on normalized data. But for algorithms like K-means or K-Nearest Neighbours, normalization could be a good or a bad choice depending on the use case, since the distance between the points plays a key role there.
Types of Normalisation:
1. Min-Max Normalization: it rescales each feature to a fixed range, usually [0, 1]. In most cases, it is applied feature-wise.
Q20: What is the difference between normalization and
Standardization with example?
In ML, every practitioner knows that feature scaling is an important issue. The two most discussed
scaling methods are Normalization and Standardization. Normalization typically means it
rescales thevalues into arange of [0,1].
It is an alternative approach to Z-score normalization (or standardization) is the so-called Min-Max
scaling (often also called “normalization” - a commoncause for ambiguities). In this approach, thedata
is scaled toafixed range - usually 0to1. Scikit-Learn providesatransformer called MinMaxScaler
for
this. A Min-Max scaling is typically donevia thefollowing equation:
Xnorm = (X - Xmin) / (Xmax - Xmin)
Example with sample data. Before Normalization:

Attribute   Price in Dollars   Storage Space   Camera
Mobile 1    250                16              12
Mobile 2    200                16              8
Mobile 3    300                32              16
Mobile 4    275                32              8
Mobile 5    225                16              16

After Normalization (values range from 0 to 1, which is working as expected):

Attribute   Price in Dollars   Storage Space   Camera
Mobile 1    0.5                0               0.5
Mobile 2    0                  0               0
Mobile 3    1                  1               1
Mobile 4    0.75               1               0
Mobile 5    0.25               0               1
Standardization (or Z-score normalization) typically means rescaling data to have a mean of 0 and a standard deviation of 1 (unit variance). Formula: Z or X_new = (x − μ) / σ, where μ is the mean (average) and σ is the standard deviation from the mean; standard scores are also called z-scores. Scikit-Learn provides a transformer called StandardScaler for standardization.
Example: Let's take an approximately normally distributed set of numbers: 1, 2, 2, 3, 3, 3, 4, 4, and 5. Its mean is 3, and its standard deviation is 1.22. Now, let's subtract the mean from all data points; we get a new dataset of: -2, -1, -1, 0, 0, 0, 1, 1, and 2. Now, let's divide each data point by 1.22. We get: -1.63, -0.82, -0.82, 0, 0, 0, 0.82, 0.82, and 1.63.
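The same numbers can be pushed through scikit-learn's transformers; this is only a sketch restating the values above:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np

prices = np.array([[250.0], [200.0], [300.0], [275.0], [225.0]])
print(MinMaxScaler().fit_transform(prices).ravel())   # [0.5, 0.0, 1.0, 0.75, 0.25]

x = np.array([[1.0], [2.0], [2.0], [3.0], [3.0], [3.0], [4.0], [4.0], [5.0]])
# note: StandardScaler divides by the population standard deviation,
# so the extreme values come out as about +/-1.73 rather than +/-1.63
print(StandardScaler().fit_transform(x).ravel())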
DATA SCIENCE
INTERVIEW
PREPARATION
(30 Days of Interview
Preparation)
# DAY 04
Q1. What is upsampling and downsampling with examples?
The classification data set with skewed class proportions is called an
imbalanced data set. Classes which make up a large proportion of the
data sets are called majority classes. Those make up smaller
proportions are minority classes.
Degree of imbalance Proportion of Minority Class
1>> Mild 20-40% of the data set
2>> Moderate 1-20% of the data set
3>> Extreme <1% of the data set
If we have an imbalanced data set, first try training on the true
distribution.
If the model works well and generalises, you are done! If not, try
the following up sampling and down sampling technique.
1. Up-sampling
Upsampling is the process of randomly duplicating observations from
the minority class to reinforce its signal.
First, we will import the resampling module from Scikit-Learn:
from sklearn.utils import resample
Next, we will create a new Data Frame with an up-sampled minority
class. Here are the steps:
1. First, we will separate observations from each class into different Data Frames.
2. Next, we will resample the minority class with replacement, setting the number of samples to match that of the majority class.
3. Finally, we'll combine the up-sampled minority class Data Frame with the original majority class Data Frame.
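Putting those steps together, a hedged sketch (the toy DataFrame and its 'label' column are assumptions, not from the original):
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})  # toy imbalanced data
df_majority = df[df.label == 0]
df_minority = df[df.label == 1]
df_minority_upsampled = resample(df_minority, replace=True,
                                 n_samples=len(df_majority), random_state=42)
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
print(df_upsampled.label.value_counts())   # both classes now have 8 rows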
2-Down-sampling
Downsampling involves randomly removing observations from the
majority class to prevent its signal from dominating the learning
algorithm.
The process is similar to that of up-sampling. Here are the steps:
1. First, we will separate observations from each class into different Data Frames.
2. Next, we will resample the majority class without replacement, setting the number of samples to match that of the minority class.
3. Finally, we will combine the down-sampled majority class Data Frame with the original minority class Data Frame.
Q2. What is the statistical test for data validation with an example,
Chi-square, ANOVA test, Z statics, T statics, F statics,
Hypothesis Testing?
IMP - If the test statistic is lower than the critical value, we fail to reject the null hypothesis; otherwise, we reject it.
Chi-Square Test:-
A chi-square test is used if there is a relationship between two categorical
variables.
Chi-Square test is used to determine whether there is a significant
difference between the expected frequency and the observed frequency in
one or more categories. Chi-square is also called the non-parametric test
as it will not use any parameter
2-Anova test:-
ANOVA, also called an analysis of variance, is used to compare multiples
(three or more) samples with a single test.
It is useful when there are three or more populations. ANOVA compares the variance within and between the groups of the population. If the between-group variation is much larger than the within-group variation, the means of the different samples will not be equal. If the between and within variations are approximately the same size, then there will be no significant difference between sample means.
Assumptions of ANOVA: 1- All populations involved follow a normal distribution. 2- All populations have the same variance (or standard deviation). 3- The samples are randomly selected and independent of one another.
ANOVA uses the mean of the samples or the population to reject or
support the null hypothesis. Hence it is called parametric testing.
3-Z Statics:-
In a z-test, the samples are assumed to be normally distributed. A z-score is calculated with population parameters such as the "population mean" and "population standard deviation", and it is used to validate the hypothesis that the sample drawn belongs to the same population.
The statistic used for this hypothesis testing is called the z-statistic, and its score is calculated as z = (x − μ) / (σ / √n), where x = sample mean, μ = population mean, σ = population standard deviation, and n = sample size (σ / √n is the standard error of the mean). If the test statistic is lower than the critical value, we fail to reject the hypothesis; otherwise, we reject it.
4- T Statics:-
A t-test used to compare the mean of the given samples. Like z-test, t-test
also assumed a normal distribution of the samples. A t-test is used when
the population parameters (mean and standard deviation) are unknown.
There are three versions of t-test
1. Independent samples t-test which compare means for two groups
2. Paired sample t-test which compares mean from the same group at
different times
3. One-sample t-test, which tests the mean of a single group against a known mean.
The statistic for hypothesis testing is called the t-statistic; for two independent samples it can be calculated as t = (x1 − x2) / sqrt(s1^2/n1 + s2^2/n2), where x1 and x2 are the sample means, s1 and s2 are the sample standard deviations, and n1 and n2 are the sample sizes.
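A hedged SciPy sketch of an independent two-sample t-test (the sample values are made-up assumptions):
from scipy import stats

group_a = [12.1, 11.8, 12.4, 12.0, 11.9]
group_b = [12.9, 13.1, 12.7, 13.0, 12.8]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)   # a small p-value means we reject the null hypothesis of equal means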
5- F Statics:-
The F-test is designed to test if the two population variances are equal. It
compares the ratio of the two variances. Therefore, if the variances are
equal, then the ratio of the variances will be 1.
The F-distribution is the ratio of two independent chi-square variables
divided by their respective degrees of freedom.
F =s1^2 / s2^2 and where s1^2 >s2^2.
If the null hypothesis is true, then the F test-statistic given above can be simplified. This ratio of sample variances is the test statistic used. If the null hypothesis is false, then we reject the null hypothesis that the ratio was equal to 1, along with our assumption that the variances were equal.
Meaning
> It is generally recommended to break the problem into smaller chunks, solve them and then combine the results | It generally focuses on solving the problem end to end
Sigmoid or Logistic
Tanh — Hyperbolic tangent
ReLu -Rectified linear
units
The size of these steps is called the learning rate. With a high learning rate, we can cover more ground each step, but we risk overshooting the lowest point since the slope of the hill is constantly changing. With a very low learning rate, we can confidently move in the direction of the negative gradient because we are recalculating it so frequently. A lower learning rate is more precise, but calculating the gradient is time-consuming, so it will take a very long time to get to the bottom.
Math
Now let’s run gradient descent using new cost function. There are two
parameters in cost function we can control: m (weight) and b (bias). Since
we need to consider that the impact each one has on the final prediction,
we need to use partial derivatives. We calculate the partial derivative of the
cost function concerning each parameter and store the results in a
gradient.
Math
Given the cost function:
To solve for the gradient, we iterate by our data points using our new m
and b values and compute the partial derivatives. This new gradient tells
us about the slope of the cost function at our current position (current
parameter values) and the directions we should move to update our
parameters. The learning rate controls the size of our update.
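A minimal hand-rolled sketch of these updates for simple linear regression (cost = mean squared error; the data and learning rate are assumptions, not from the original):
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])          # underlying relation: y = 2x + 1
m, b, lr = 0.0, 0.0, 0.01                          # weight, bias, learning rate

for _ in range(2000):
    y_pred = m * x + b
    dm = (-2 / len(x)) * np.sum(x * (y - y_pred))  # partial derivative of MSE w.r.t. m
    db = (-2 / len(x)) * np.sum(y - y_pred)        # partial derivative of MSE w.r.t. b
    m -= lr * dm                                   # step in the negative gradient direction
    b -= lr * db
print(m, b)                                        # approaches 2 and 1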
Q13: What is an optimiser in deep learning, and which one is the best?
Deep learning is an iterative process. With so many hyperparameters
to tune or methods to try, it is important to be able to train models
fast, to quickly complete the iterative cycle. This is the key to increase
the speed and efficiency of a machine learning team.
Hence the importance of optimisation algorithms such as stochastic gradient descent, mini-batch gradient descent, gradient descent with momentum, and the Adam optimiser.
Gradient Descent
it is an iterative machine learning optimisation algorithm to reduce the cost
function, and help models to make accurate predictions.
Gradient indicates the direction of increase. As we want to find the
minimum points in the valley, we need to go in the opposite direction of the
gradient. We update the parameters in the negative gradient direction to
minimise the loss.
Where θ is the weight parameter, η is the learning rate, and ∇ J(θ ;x,y)
is the gradient of weight parameter θ
Types of Gradient Descent
Different types of Gradient descents are
Batch Gradient Descent or Vanilla Gradient
Descent Stochastic Gradient Descent
Mini batch Gradient Descent
Batch Gradient Descent
In the batch gradient, we use the entire dataset to compute the gradient of
the cost function for each iteration for gradient descent and then update the
weights.
Autoencoder Components:
3. Decoder: In this component, the model learns how to reconstruct the data from the encoded representation so that it is as close to the original input as possible.
4. Reconstruction Loss: This is the method that measures how well the decoder is performing and how close the output is to the original input.
Types of Autoencoders :
1. Input Layer: It holds the raw input of image with width 32, height 32
and depth 3.
2. Convolution Layer: It computes the output volume by computing dot
products between all filters and image patches. Suppose we use a
total of 12 filters for this layer we’ll get output volume of dimension 32
x 32 x 12.
3.Activation Function Layer: This layer will apply the element-wise
activation function to the output of the convolution layer. Some
activation functions are RELU: max(0, x), Sigmoid: 1/(1+e^-x), Tanh,
Leaky RELU, etc. So the volume remains unchanged. Hence output
volume will have dimensions 32 x 32 x 12.
4. Pool Layer: This layer is periodically inserted within the covnets, and
its main function is to reduce the size of volume which makes the
computation fast reduces memory and also prevents overfitting.
Two common types of pooling layers are max pooling and average
pooling. If we use a max pool with 2 x 2 filters and stride 2, the
resultant volume will be of dimension 16x16x12.
The most common form is a pooling layer with filters of size 2x2 applied
with a stride of 2 downsamples every depth slice in the input by two along
both width and height, discarding 75% of the activations. Every MAX
operation would, in this case, be taking a max over four numbers (little 2x2
region in some depth slice). The depth dimension remains unchanged.
LeNet in 1998
LeNet is a 7-level convolutional network by LeCun in 1998 that classifies digits; it was used by several banks to recognise hand-written numbers on cheques, digitised into 32x32 pixel greyscale input images.
AlexNet in 2012
AlexNet: It is considered to be the first paper/ model, which rose the
interest in CNNs when it won the ImageNet challenge in the year 2012.
It is a deep CNN trained on ImageNet and outperformed all the entries
that year.
VGG in 2014
VGG was submitted in the year 2014, and it became the runner-up in the ImageNet contest that year. It is widely used as a simple
architecture compared to AlexNet.
GoogleNet in 2014
In 2014, several great models were developed like VGG, but the winner of
the ImageNet contest was GoogleNet.
GoogLeNet proposed a module called the inception modules that includes
skipping connections in the network, forming a mini-module, and this
module is repeated throughout the network.
ResNet in 2015
There are 152 layers in the Microsoft ResNet. The authors showed empirically that if you keep on adding layers, the error rate keeps decreasing, in contrast to "plain nets", where adding a few more layers resulted in higher training and test errors.
Q19: How to initialise biases in deep learning?
It is possible and common to initialise the biases to zero, since the random numbers in the weights provide the symmetry breaking. For ReLU non-linearities, some people like to use a small constant value such as 0.01 for all biases, because this ensures that all ReLU units fire in the beginning and therefore obtain and propagate some gradient. However, it is unclear whether this provides a consistent improvement (in fact, some results seem to indicate that this performs worse), and it is more common to use 0 bias initialisation.
Q20: What is learning Rate?
Learning Rate
The learning rate controls how much we adjust the weights with respect to the loss gradient. Learning rates are randomly initialised.
The lower the value of the learning rate, the slower the convergence to the global minima will be.
Higher values of the learning rate may not allow gradient descent to converge.
Since our goal is to minimise the cost function to find the optimised values for the weights, we run multiple iterations with different weights and calculate the cost to arrive at the minimum cost.
--- -- --- -- --- -- -- --- -- -- --- -- -- --- -- --- -- -- --- -- -- -- --- -- --- -- --- -- -- --- -
- -- --- --
DATA SCIENCE
INTERVIEW
PREPARATION
(30 Days of Interview
Preparation)
# Day-5
Benefits:-
• GridSearch
• RandomSearch
• SageMaker
• Comet.ml
• Weights & Biases
• DeepCognition
• AzureML
The function that is used to compute this error is known as the Loss Function J(.). Different loss functions will give different errors for the same prediction, and thus have a considerable effect on the performance of the model. One of the most widely used loss functions is mean squared error, which calculates the square of the difference between the actual value and the predicted value. Different loss functions are used to deal with different types of tasks, i.e., regression and classification.
Absolute error
1. Binary Cross-Entropy
2. Negative Log-Likelihood
3. Margin Classifier
4. Soft Margin Classifier
Activation functions decide whether a neuron should be activated or not by calculating a weighted sum and adding a bias to it. The purpose of the activation function is to introduce non-linearity into the output of a neuron.
In a neural network, we update the weights and biases of the neurons based on the error at the outputs. This process is known as back-propagation. Activation functions make back-propagation possible, since the gradients are supplied along with the errors to update the weights and biases.
2. Binary Step
3. Sigmoid
4. Tanh
5. ReLU
6. Leaky ReLU
7. Softmax
Activation functions apply a non-linear transformation to the input, making the network capable of learning and performing more complex tasks.
Q7: What do you understand by the vanishing gradient problem, and how can we solve it?
The problem:
As more layers using certain activation functions are added to a neural network, the gradients of the loss function approach zero, making the network harder to train.
Why:
Certain activation functions, like the sigmoid function, squish a large input space into a small output space between 0 and 1. Therefore, a large change in the input of the sigmoid function will cause a small change in the output. Hence, the derivative becomes small.
Transfer learning is a machine learning technique where a model trained on one task is re-purposed on a second related task.
Transfer learning is an optimization that allows rapid progress or improved performance when modelling the second task.
Transfer learning only works in deep learning if the model features learned from the first task are general.
VGG-16 is a simpler architecture model since it does not use many hyperparameters. It always uses 3 x 3 filters with a stride of 1 in the convolution layers, with SAME padding, and 2 x 2 pooling layers with a stride of 2.
Three fully connected layers follow the VGG convolutional layers. The width of the network starts at a small value of 64 and increases by a factor of 2 after every sub-sampling/pooling layer. It achieves a top-5 accuracy of 92.3% on ImageNet.
To use GPU capability, you need to edit the Makefile in the first two lines, where you tell it to compile for GPU with CUDA drivers.
Q13: What is YOLO, and explain the architecture of YOLO (you only look once)?
Core Concept:-
The algorithm works by dividing the image into a grid of cells; for each cell, bounding boxes and their scores are predicted, alongside class probabilities. The confidence is given in terms of IOU (intersection over union), a metric which measures how much the detected object overlaps with the ground truth as a fraction of the total area spanned by the two together (the union).
YOLO v2-
This improves on some of the shortcomings of the first version, namely the fact that it is not very good at detecting objects that are very near and tends to make some mistakes on localization.
It introduces a few newer things: anchor boxes (pre-determined sets of boxes, such that the network moves from predicting the bounding boxes to predicting the offsets from these) and the use of features that are more fine-grained, so smaller objects can be predicted better.
YOLO v3-
YOLO v3 came out around April 2018, and it adds small improvements, including the fact that bounding boxes get predicted at different scales. The underlying meaty part of the YOLO network, Darknet, is expanded in this version to have 53 convolutional layers.
# DAY 06
Q1. What is NLP?
Natural language processing (NLP): It is the branch of artificial intelligence that helps computers understand, interpret and manipulate human language. NLP draws from many disciplines, including computer science and computational linguistics, in its pursuit to fill the gap between human communication and computer understanding.
from the ten unique words.
“It” = 1
“is” = 1
“going” = 1
“to” = 1
“rain” = 1
“today” = 1
“I” = 0
“am” = 0
“not” = 0
“outside” = 0
The rest of the documents will be:
"It is going to rain today" = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
"Today I am not going outside" = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]
"I am going to watch the season premiere" = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0]
In this approach, each word (a token) is called a "gram". Creating a vocabulary of two-word pairs is called a bigram model.
The process of converting NLP text into numbers is called vectorisation in ML. There are different ways to convert text into vectors:
• Counting the number of times that each word appears in the document.
• Calculating the frequency that each word appears in a document out of all the words in the document.
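As a hedged sketch, the same bag-of-words counts can be produced with scikit-learn's CountVectorizer; note that its default tokenizer lowercases text and drops one-letter tokens such as "I", so the vocabulary may differ slightly from the hand-built example above.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "It is going to rain today",
    "Today I am not going outside",
    "I am going to watch the season premiere",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)            # document-term count matrix
print(vectorizer.get_feature_names_out())     # the learned vocabulary
print(X.toarray())                            # one row of word counts per sentence
```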
Q7. What do you understand by TF-IDF?
TF-IDF: It stands for term frequency-inverse document frequency.
TF-IDF weight: It is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
Thus, the TF-IDF weight of a term t in a document d is tf(t, d) × log(N / df(t)), where N is the total number of documents in the corpus and df(t) is the number of documents containing t.
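A hedged scikit-learn sketch of computing TF-IDF weights with TfidfVectorizer, reusing the earlier example sentences; the exact weights shown depend on sklearn's smoothed IDF formula.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "It is going to rain today",
    "Today I am not going outside",
    "I am going to watch the season premiere",
]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)                 # rows: documents, columns: TF-IDF weights
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))
```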
Word2Vec is a shallow, two-layer neural network which is trained to reconstruct linguistic contexts of words. It takes as its input a large corpus of words and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.
Word vectors are positioned in the vector space such that words which share common contexts in the corpus are located close to one another in the space.
Word2Vec is a particularly computationally-efficient predictive model for learning word embeddings from raw text.
Word2Vec is a group of models which helps derive relations between a word and its contextual words. Let's look at two important models inside Word2Vec: Skip-grams and CBOW.
Skip-grams
Continuous Bag-of-Words (CBOW)
CBOW predicts target words (e.g. 'mat') from the surrounding context words ('the cat sits on the').
Statistically, this means that CBOW smoothes over a lot of distributional information (by treating an entire context as one observation). For the most part, this turns out to be a useful thing for smaller datasets.
This was about converting words into vectors. But where does the "learning" happen? Essentially, we begin with a small random initialisation of word vectors. Our predictive model learns the vectors by minimising the loss function. In Word2Vec, this happens with feed-forward neural networks and optimisation techniques such as stochastic gradient descent.
There are also count-based models which build the co-occurrence count matrix of the words in our corpus; we have a very large matrix with a row for each "word" and a column for each "context". The number of "contexts" is, of course, very large, since it is essentially combinatorial in size. To overcome this issue, we apply SVD to the matrix. This reduces the dimensions of the matrix while retaining the maximum amount of information.
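A small sketch, assuming the gensim library is available, of training CBOW and Skip-gram Word2Vec models on a toy corpus; the corpus and parameter values are illustrative only.

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)  # CBOW
skip = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # Skip-gram

print(cbow.wv["cat"].shape)                   # 50-dimensional word vector
print(skip.wv.most_similar("cat", topn=2))    # nearest words in the vector space
```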
The basic idea behind PV-DM is inspired by Word2Vec. In the CBOW model of Word2Vec, the model learns to predict a centre word based on the context. For example, given the sentence "The cat sat on the table", the CBOW model would learn to predict the word "sat" given the context words — the, cat, on and table. Similarly, in PV-DM, the main idea is: randomly sample consecutive words from the paragraph and predict a centre word from the randomly sampled set of words by taking as input the context words and the paragraph id.
Let's have a look at the model diagram for some more clarity. In this model, we see the Paragraph matrix, Average/Concatenate and classifier sections.
Paragraph matrix: It is the matrix where each column represents the vector of a paragraph.
Average/Concatenate: It denotes whether the word vectors and paragraph vector are averaged or concatenated.
Classifier: It takes the hidden layer vector (the one that was concatenated/averaged) as input and predicts the centre word.
The matrix D has the embeddings for "seen" paragraphs (i.e. arbitrary-length documents), the same way Word2Vec models learn embeddings for words. For unseen paragraphs, the model is again run through gradient descent (5 or so iterations) to infer a document vector.
Time series forecasting is a technique for the prediction of events through a sequence of time. The technique is used across many fields of study, from geology to behaviour to economics. The techniques predict future events by analysing the trends of the past, on the assumption that future trends will hold similar to historical trends.
Time-series:
1. Whenever data is recorded at regular intervals of time.
2. Time-series forecasting is extrapolation.
3. Time-series refers to an ordered series of data.
Regression:
1. Whereas in regression, we can apply it whether data is recorded at regular or irregular intervals of time.
2. Regression is interpolation.
3. Regression refers to both ordered and unordered series of data.
Q11. What is the difference between stationary and non-stationary data?
Stationary: A series is said to be "STRICTLY STATIONARY" if the mean, variance & covariance are constant over time, or time-invariant.
Non-Stationary:
o Most models assume stationarity of data. In other words, standard techniques are invalid if the data is "NON-STATIONARY".
o Autocorrelation may result due to "NON-STATIONARY" data.
o Non-stationary processes are a random walk with or without a drift (a slow, steady change).
o Deterministic trends (trends that are constant, positive or negative, independent of time for the whole life of the series).
# DAY 07
Q1. What is the process to make data stationary from non-stationary in time series?
Ans:
The two most common ways to make a non-stationary time series stationary are:
Differencing
Transforming
Differencing:
To make your series stationary, you take a difference between the data points. So let us say your original time series was:
Once you take the difference, plot the series and see if there is any improvement in the ACF curve. If not, you can try a second or even a third-order differencing. Remember, the more you difference, the more complicated your analysis becomes.
Transforming:
If we cannot make a time series stationary, you can try transforming the variables. The log transform is probably the most commonly used transformation if we see a diverging time series.
However, it is suggested that you use transformation only in case differencing is not working.
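A hedged pandas sketch of first- and second-order differencing plus a log transform; the series values are synthetic and only illustrate the idea.

```python
import numpy as np
import pandas as pd

ts = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119])

first_diff = ts.diff().dropna()               # y_t - y_(t-1)
second_diff = ts.diff().diff().dropna()       # difference of the differences
log_transformed = np.log(ts)                  # useful when the variance grows over time

print(first_diff.head())
print(second_diff.head())
print(log_transformed.head())
```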
Ans:
In the first plot, we can see that the mean varies (increases) with time, which results in an upward trend. This is a non-stationary series. For the series to be classified as stationary, it should not exhibit a trend.
Moving on to the second plot, we do not see a trend in the series, but the variance of the series is a function of time. As mentioned previously, a stationary series must have a constant variance.
If we look at the third plot, the spread becomes closer as time increases, which implies that the covariance is a function of time.
Hmm, this looks like there is a trend. To build up confidence, let's add a linear regression for this graph:
In the plot above, we applied the moving average model to a 24h window. The green line smoothed the time series, and we can see that there are two peaks in the 24h period. The longer the window, the smoother the trend will be.
From the above plot, the dark blue line represents the exponential smoothing of the time series using a smoothing factor of 0.3, and the orange line uses a smoothing factor of 0.05. As we can see, the smaller the smoothing factor, the smoother the time series will be, because as the smoothing factor approaches 0, we approach the moving average model.
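A hedged pandas sketch of the two smoothers discussed above, a 24-hour moving-average window and exponential smoothing with factors 0.3 and 0.05; the series is synthetic.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2023-01-01", periods=96, freq="H")
ts = pd.Series(np.sin(np.arange(96) / 6.0) + np.random.normal(0, 0.2, 96), index=rng)

rolling_24h = ts.rolling(window=24).mean()    # moving average over a 24h window
smooth_03 = ts.ewm(alpha=0.3).mean()          # smoothing factor 0.3
smooth_005 = ts.ewm(alpha=0.05).mean()        # smaller factor -> smoother curve

print(rolling_24h.dropna().head())
print(smooth_03.head())
```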
# DAY 08
Q1. What is Tensorflow?
Ans:
TensorFlow: TensorFlow is an open-source software library released in 2015 by Google to make it easier for developers to design, build, and train deep learning models. TensorFlow originated as an internal library that Google developers used to build models in house, and we expect additional functionality to be added to the open-source version as it is tested and vetted internally. Although TensorFlow is only one of several options available to developers, we choose to use it here because of its thoughtful design and ease of use.
At a high level, TensorFlow is a Python library that allows users to express arbitrary computation as a graph of data flows. Nodes in this graph represent mathematical operations, whereas edges represent data that is communicated from one node to another. Data in TensorFlow is represented as tensors, which are multidimensional arrays. Although this framework for thinking about computation is valuable in many different fields, TensorFlow is primarily used for deep learning in practice and research.
• It allows deep learning.
• It is open-source and free.
• It is easy to implement.
• It is reliable (and without major bugs).
• It is backed by Google and a good community.
• It is a skill recognised by many employers.
Q6. List a few limitations of Tensorflow.
Ans :
Q7. What are the use cases of Tensorflow?
Ans:
Tensorflow is an important tool of deep learning; it has mainly five use cases, and they are:
• Time Series
• Image recognition
• Sound Recognition
• Video detection
• Text-based Applications
Q8. What are the very important steps of Tensorflow architecture?
Ans:
• Pre-process the Data
• Build a Model
• Train and estimate the model
Q9. What is Keras?
Ans:
Keras: It is an Open Source Neural Network library written in Python that runs on top of Theano or Tensorflow. It is designed to be modular, fast and easy to use. It was developed by François Chollet, a Google engineer.
RNN (Recurrent Neural Network)
• Best suited for sequential data
• RNN supports a smaller feature set than CNN.
• This network can manage arbitrary input and output lengths.
• It is ideal for text and speech analysis.
• Scalability
• Visualisation of Data
• Debugging facility
• Pipelining
# DAY 09
Q1: How would you define Machine Learning?
Ans:
Machine learning: It is an application of artificial intelligence (AI) that provides systems the ability to learn automatically and to improve from experience without being explicitly programmed. It focuses on the development of computer applications that can access data and use it to learn for themselves.
The process of learning starts with observations or data, such as examples, direct experience, or instruction, to look for patterns in data and to make better decisions in the future based on the examples that we provide. The primary aim is to allow computers to learn automatically without human intervention or assistance and adjust actions accordingly.
Q3. What are the two common supervised tasks?
Ans:
The two common supervised tasks are regression and classification.
Regression-
The regression problem is when the output variable is a real or continuous value, such as "salary" or "weight." Many different models can be used, and the simplest is linear regression. It tries to fit the data with the best hyper-plane, which goes through the points.
Classification
It is a type of supervised learning. It specifies the class to which the data elements belong and is best used when the output has finite and discrete values. It predicts a class for an input variable as well.
It is a Machine Learning technique that involves the grouping of data points. Given a set of data points, we can use a clustering algorithm to classify each data point into a specific group. In theory, data points that lie in the same group should have similar properties and/or features, and data points in different groups should have highly dissimilar properties and/or features. Clustering is a method of unsupervised learning and is a common technique for statistical data analysis used in many fields.
Visualization
Data visualization is the technique that uses an array of static and interactive visuals within a specific context to help people understand and make sense of large amounts of data. The data is often displayed in a story format that visualizes patterns, trends, and correlations that may otherwise go unnoticed. It is regularly used as an avenue to monetize data as a product. An example of using monetization and data visualization is Uber. The app combines visualization with real-time data so that customers can request a ride.
price prediction. Online learning algorithms might be prone to catastrophic interference, a problem that can be addressed by incremental learning approaches.
Q9. What is the Model Parameter?
Ans:
Model parameter: It is a configuration variable that is internal to a model and whose value can be estimated from the data.
While making predictions, the model parameters are needed. Their values define the skill of a model on a problem.
It is estimated or learned from data.
It is often not set manually by the practitioner.
It is often saved as part of the learned model.
Parameters are key to machine learning algorithms. They are the part of the model that is learned from historical training data.
Q11: What is Model Hyperparameter?
Ans:
Model hyperparameter: It is a configuration that is external to a model and whose value cannot be estimated from the data.
It is often used in processes to help estimate model parameters.
The practitioner often specifies them.
It can often be set using heuristics.
It is tuned for a given predictive modeling problem.
We cannot know the best value for a model hyperparameter on a given problem. We may use rules of thumb, copy values used on other problems, or search for the best value by trial and error.
Reserve some portion of the sample dataset.
Use the rest of the dataset to train the model.
Test the model using the reserved portion of the dataset.
# DAY 10
Q1. What is a Recommender System?
Answer:
A recommender system is today widely deployed in multiple fields like movie recommendations, music preferences, social tags, research articles, search queries and so on. Recommender systems work as per collaborative and content-based filtering or by deploying a personality-based approach. This type of system works based on a person's past behaviour in order to build a model for the future. This will predict future product buying, movie viewing or book reading by people. It also creates a filtering approach using the discrete characteristics of items while recommending additional items.
Answer:
SAS: It is one of the most widely used analytics tools, used by some of the biggest companies on earth. It has some of the best statistical functions and a graphical user interface, but can come with a price tag and hence cannot be readily adopted by smaller enterprises.
R: The best part about R is that it is an Open Source tool and hence used generously by academia and the research community. It is a robust tool for statistical computation, graphical representation and reporting. Due to its open source nature, it is always being updated with the latest features and is then readily available to everybody.
Python: Python is a powerful open source programming language that is easy to learn and works well with most other tools and technologies. The best part about Python is that it has innumerable libraries and community-created modules, making it very robust. It has functions for statistical operations, model building and more.
Answer:
With data coming in from multiple sources, it is important to ensure that data is good enough for analysis. This is where data cleansing becomes extremely vital. Data cleansing extensively deals with the process of detecting and correcting data records, ensuring that data is complete and accurate and that the components of data that are irrelevant are deleted or modified as per the needs. This process can be deployed in concurrence with data wrangling or batch processing.
Once the data is cleaned, it conforms with the rules of the datasets in the system. Data cleansing is an essential part of data science because data can be prone to error due to human negligence, corruption during transmission or storage, among other things. Data cleansing takes a huge chunk of the time and effort of a Data Scientist because of the multiple sources from which data emanates and the speed at which it comes.
Answer:
Here we will discuss the components involved in solving a problem using machine learning.
Domain knowledge
This is the first step wherein we need to understand how to extract the various features from the data and learn more about the data that we are dealing with. It has got more to do with the type of domain that we are dealing with and familiarizing the system to learn more about it.
Feature Selection
This step has got more to do with the features that we are selecting from the set of features that we have. Sometimes it happens that there are a lot of features and we have to make an intelligent decision regarding the type of features that we want to select to go ahead with our machine learning endeavour.
Algorithm
This is a vital step since the algorithms that we choose will have a very major impact on the entire process of machine learning. You can choose between linear and nonlinear algorithms. Some of the algorithms used are Support Vector Machines, Decision Trees, Naïve Bayes, K-Means Clustering, etc.
Training
This is the most important part of the machine learning technique, and this is where it differs from traditional programming. The training is done based on the data that we have, providing more real-world experiences. With each consequent training step, the machine gets better and smarter and able to take improved decisions.
Evaluation
In this step, we actually evaluate the decisions taken by the machine in order to decide whether it is up to the mark or not. There are various metrics that are involved in this process, and we have to closely deploy each of these to decide on the efficacy of the whole machine learning endeavour.
Optimization
This process involves improving the performance of the machine learning process using various optimization techniques. Optimization of machine learning is one of the most vital components wherein the performance of the algorithm is vastly improved. The best part of optimization techniques is that machine learning is not just a consumer of optimization techniques but it also provides new ideas for optimization too.
Testing
Here various tests are carried out, and some of these are unseen sets of test cases. The data is partitioned into test and training sets. There are various testing techniques like cross-validation in order to deal with multiple situations.
Q4. What is Interpolation and Extrapolation?
Answer:
The terms interpolation and extrapolation are extremely important in any statistical analysis. Extrapolation is the determination or estimation using a known set of values or facts by extending it and taking it to an area or region that is unknown. It is the technique of inferring something using data that is available.
Interpolation, on the other hand, is the method of determining a certain value which falls between a certain set of values or a sequence of values. This is especially useful when you have data at the two extremities of a certain region but you don't have enough data points at the specific point. This is when you deploy interpolation to determine the value that you need.
Q5. What does P-value signify about the statistical data?
Answer:
The p-value is used to determine the significance of results after a hypothesis test in statistics. The p-value helps the reader to draw conclusions and is always between 0 and 1.
• P-value > 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected.
• P-value <= 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected.
• P-value = 0.05 is the marginal value, indicating it is possible to go either way.
There are various factors to be considered when answering this question -
Understand the problem statement, understand the data and then give the answer. Assign a default value, which can be the mean, minimum or maximum value. Getting into the data is important.
If it is a categorical variable, the default value is assigned. The missing value is assigned a default value.
If you have a distribution of data coming in, for a normal distribution give the mean value.
Whether we should even treat missing values is another important point to consider. If 80% of the values for a variable are missing, then you can answer that you would be dropping the variable instead of treating the missing values.
Q7. Explain the difference between a Test Set and a Validation Set?
Answer:
The validation set can be considered as a part of the training set, as it is used for parameter selection and to avoid overfitting of the model being built. On the other hand, the test set is used for testing or evaluating the performance of a trained machine learning model.
In simple terms, the differences can be summarized as -
Training Set is to fit the parameters, i.e. weights.
Test Set is to assess the performance of the model, i.e. evaluating its predictive power and generalization.
Validation Set is to tune the hyperparameters.
Q8. What is the curse of dimensionality? Can you list some ways to deal
with it?
Answer:
Q9. What is data augmentation? Can you give some examples?
Answer:
Cross-validation is a technique for dividing data between training and validation sets. In typical cross-validation, this split is done randomly. But in stratified cross-validation, the split preserves the ratio of the categories on both the training and validation datasets.
For example, if we have a dataset with 10% of category A and 90% of category B, and we use stratified cross-validation, we will have the same proportions in training and validation. In contrast, if we use simple cross-validation, in the worst case we may find that there are no samples of category A in the validation set.
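A hedged scikit-learn sketch of stratified cross-validation on an imbalanced toy dataset; the model and data are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# roughly 90% / 10% class imbalance, as in the example above
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # preserves class ratio per fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf)
print(scores, scores.mean())
```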
Stratified cross-validation may be applied in the following scenarios:
# DAY 11
Q1. What are tensors?
Answer:
Tensors are nothing more than a method of presenting data in deep learning. Put in simple terms, tensors are just multidimensional arrays that allow developers to represent data in a layer, which means the deep learning you are using contains high-level data sets where each dimension represents a different feature.
The foremost benefit of using tensors is that they provide much-needed platform flexibility and are easily trainable on CPU. Apart from this, tensors have auto-differentiation capabilities and advanced support for queues, threads, and asynchronous computation. All these features also make them customizable.
Most common issues faced with RNN
Although RNN has been around for a while and uses backpropagation, there are some common issues faced by developers who work with it. Out of all of them, some of the most common issues are:
Exploding gradients
Vanishing gradients
Q3. What is a ResNet, and where would you use it? Is it efficient?
Answer:
In general, a skip connection allows us to skip the training of a few layers. Skip connections are also
called identity shortcut connections as they allow us to directly compute an identity function by just
relying onthese connections andnot having to look at thewhole network.
The skipping of theselayers makesResNet anextremely efficient network.
Q4. Transfer learning is one of the most useful concepts today. Where
can it be used?
Answer:
For anyone who does not have access to huge computational power, training complex models is always
a challenge. Transfer learning aims to help by both improving the performance and speeding up your
network.
The general idea behind transfer learning is to transfer knowledge, not data. For humans, this task is easy – we can generalize models that we created mentally a long time ago for a different purpose. One or two samples are almost always enough. However, in the case of neural networks, a huge amount of data and computational power are required.
Transfer learning should generally be used when we don't have a lot of labeled training data, or if there already exists a network for the task you are trying to achieve, probably trained on a much more massive dataset. Note, however, that the input of the model must have the same size during training. Also, this works only if the tasks are fairly similar to each other, and the features learned can be generalized. For example, something like learning how to recognize vehicles can probably be extended to learn how to recognize airplanes and helicopters.
Answer:
Q6. Why are deep learning models referred to as black boxes?
Answer:
Lately, the concept of deep learning being a black box has been floating around. A black box is a system whose functioning cannot be properly grasped, but the output produced can be understood and utilized.
Now, since most models are mathematically sound and are created based on legitimate equations, how is it possible that we do not know how the system works?
To make a deep learning model not be a black box, a new field called Explainable Artificial Intelligence, or simply Explainable AI, is emerging. This field aims to be able to create intermediate results and trace back the decision-making process of a system.
Answer:
An advantage of using gates is that they enable the network to either forget information that it has already learned or to selectively ignore information, either based on the state of the network or on the input the gate receives.
Gates are extensively used in recurrent neural networks, especially in Long Short-Term Memory (LSTM) networks. A standard LSTM cell has three gates: an input gate, a forget gate, and an output gate.
The Sobel filter performs a two-dimensional spatial gradient measurement on a given image, which then emphasizes regions that have a high spatial frequency. In effect, this means finding edges.
In most cases, Sobel filters are used to find the approximate absolute gradient magnitude for every point in a grayscale image. The operator consists of a pair of 3×3 convolution kernels. One of these kernels is rotated by 90 degrees.
These kernels respond to edges that run horizontally or vertically with respect to the pixel grid, one kernel for each orientation. A point to note is that these kernels can be applied either separately or combined to find the absolute magnitude of the gradient at every point.
The Sobel operator has a large convolution kernel, which ends up smoothing the image to a greater extent, and thus the operator becomes less sensitive to noise. It also produces higher output values for similar edges compared to other methods.
To overcome the problem of output values from the operator overflowing the maximum allowed pixel value per image type, avoid using image types that support only a small range of pixel values.
Answer:
Boltzmann machines are algorithms that are based on physics, specifically thermal equilibrium. A special and more well-known case of Boltzmann machines is the Restricted Boltzmann Machine, which is a type of Boltzmann machine where there are no connections between the hidden layers of the network.
The concept was coined by Geoff Hinton, who most recently won the Turing Award. In general, the algorithm uses the laws of thermodynamics and tries to optimize a global distribution of energy in the system.
Restricted Boltzmann machines can be used for filtering, learning features, as well as modelling. They can also be used for classification and regression. In general, restricted Boltzmann machines are composed of a two-layer network, which can then be extended further.
Note that these models are probabilistic, since each of the nodes present in the system learns low-level features from items in the dataset. For example, if we take a grayscale image, each node that is responsible for the visible layer will take just one pixel value from the image.
Answer:
There are two major types of weight initialization: zero initialization and random initialization.
Zero initialization: In this process, biases and weights are initialised to 0. If the weights are set to 0, all derivatives with respect to the loss function in the weight matrix become equal. Hence, none of the weights change during subsequent iterations. Setting the bias to 0 cancels out any effect it may have.
All hidden units become symmetric due to zero initialization. In general, zero initialization is not very useful or accurate for classification and thus must be avoided when any classification task is required.
# DAY 12
Q1. Where is the confusion matrix used? Which module would you use to
show it?
Answer:
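Since the original answer text is not reproduced here, the following is only a hedged sketch of the usual approach: the confusion_matrix utility (and ConfusionMatrixDisplay) from sklearn.metrics, applied to toy labels.

```python
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)         # rows: actual classes, columns: predicted classes
print(cm)
ConfusionMatrixDisplay(cm).plot()             # visual display (requires matplotlib)
```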
Q2: What is Accuracy?
Answer:
It is the most intuitive performance measure, and it is simply the ratio of correctly predicted observations to the total observations. We can say that if we have high accuracy, then our model is best. Yes, accuracy is a great measure, but only when you have symmetric datasets where the numbers of false positives and false negatives are almost the same.
Accuracy = (True Positives + True Negatives) / Total Observations
Q4: What is Recall?
Answer:
Recall can also be called sensitivity or the true positive rate.
It is the number of positives that our model predicts compared to the actual number of positives in our data.
Recall = True Positives / (True Positives + False Negatives)
Recall = True Positives / Total Actual Positives
Recall is a measure of completeness. High recall means that our model classified most or all of the possible positive elements as positive.
Q6: What is Bias and Variance trade-off?
Answer:
Bias
Bias means how far the predicted values are from the actual values. If the average predicted values are far off from the actual values, then we say the model has high bias.
When our model has a high bias, it means that our model is too simple and does not capture the complexity of the data, thus underfitting the data.
Variance
It occurs when our model performs well on the training dataset but does not do well on a dataset that it is not trained on, like a test dataset or validation dataset. It tells us how scattered the predicted values are from the actual values.
High variance causes overfitting, which implies that the algorithm models the random noise present in the training data.
When a model has high variance, the model becomes very flexible and tunes itself to the data points of the training set.
Bias-variance decomposition essentially decomposes the learning error from any algorithm by adding the bias, the variance and a bit of irreducible error due to noise in the underlying dataset. Essentially, if we make the model more complex and add more variables, we'll lose bias but gain some variance; to get the optimally reduced amount of error, you'll have to trade off bias and variance. We don't want either high bias or high variance in the model.
Data wrangling is a process by which we convert and map data. This changes data from its raw form to a format that is a lot more valuable.
Data wrangling is the first step for machine learning and deep learning. The end goal is to provide data that is actionable and to provide it as fast as possible.
There are three major things to focus on while talking about data wrangling –
1. Acquiring data
The first and probably the most important step in data science is the acquiring, sorting and cleaning of data. This is an extremely tedious process and requires the most amount of time.
One needs to:
2. Data cleaning
Data cleaning is an essential component of data wrangling and requires a lot of patience. To make the job easier, it is first essential to format the data to make it readable for humans. The essentials involved are:
Format the data to make it more readable
Find outliers (data points that do not match the rest of the dataset) in the data
Find missing values and remove them from the dataset (without this, any model being trained becomes incomplete and useless)
3. Data Computation
At times, your machine may not have enough resources to run your algorithm, e.g. you might not have a GPU. In these cases, you can use publicly available APIs to run your algorithm. These are standard endpoints found on the web which allow you to use computing power over the web and process data without having to rely on your own system. An example would be the Google Colab Platform.
Q8. Why is normalization required before applying any machine
learning model? What module can you use to perform normalization?
Answer:
A problem we might face if we don't normalize the data is that gradients would take a very long time to descend and reach the global maxima/minima.
For numerical data, normalization is generally done between the range of 0 to 1.
The general formula is:
X_new = (x - x_min) / (x_max - x_min)
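A hedged sketch of the min-max formula above, computed both by hand and with scikit-learn's MinMaxScaler; the sample values are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[10.0], [20.0], [35.0], [50.0]])

x_manual = (x - x.min()) / (x.max() - x.min())     # X_new = (x - x_min) / (x_max - x_min)
x_scaled = MinMaxScaler().fit_transform(x)         # same result, as a reusable transformer

print(x_manual.ravel())
print(x_scaled.ravel())
```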
Q9. What is the difference between feature selection and
feature extraction?
Feature selection and feature extraction are two major ways of fixing the curse of dimensionality.
1. Feature selection:
Feature selection is used to filter a subset of input variables on which the attention should focus. Every other variable is ignored. This is something which we, as humans, tend to do subconsciously.
Many domains have tens of thousands of variables, out of which most are irrelevant and redundant. Feature selection limits the training data and reduces the amount of computational resources used. It can significantly improve a learning algorithm's performance.
In summary, we can say that the goal of feature selection is to find an optimal feature subset. This might not be entirely accurate; however, methods of understanding the importance of features also exist. Some modules in Python, such as XGBoost, help achieve the same.
2. Feature extraction
Feature extraction involves the transformation of features so that we can extract features to improve the process of feature selection. For example, in an unsupervised learning problem, the extraction of bigrams from a text, or the extraction of contours from an image, are examples of feature extraction.
The general workflow involves applying feature extraction on given data to extract features and then applying feature selection with respect to the target variable to select a subset of data. In effect, this helps improve the accuracy of a model.
Q10. Why are polarity and subjectivity an issue?
Polarity and subjectivity are terms which are generally used in sentiment analysis.
Polarity is the variation of emotions in a sentence. Since sentiment analysis is widely dependent on emotions and their intensity, polarity turns out to be an extremely important factor.
In most cases, opinions and sentiment analysis are evaluations. They fall under the categories of emotional and rational evaluations.
Rational evaluations, as the name suggests, are based on facts and rationality, while emotional evaluations are based on non-tangible responses, which are not always easy to detect.
Subjectivity in sentiment analysis is a matter of personal feelings and beliefs, which may or may not be based on any fact. When there is a lot of subjectivity in a text, it must be explained and analysed in context. On the contrary, if there was a lot of polarity in the text, it could be expressed as a positive, negative or neutral emotion.
ARIMA is a widely used statistical method which stands for Auto Regressive Integrated Moving Average. It is generally used for analysing time series data and time series forecasting. Let's take a quick look at the terms involved.
Auto Regression is a model that uses the relationship between the observation and some number of lagging observations.
Integrated means the use of differences in raw observations, which helps make the time series stationary.
Moving Average is a model that uses the relationship and dependency between the observation and the residual error from the models being applied to the lagging observations.
Note that each of these components is used as a parameter. After the construction of the model, a linear regression model is constructed.
Data is prepared by:
Removing trends and structures that will negatively affect the model
# Day13
Q1. What is Autoregression?
Answer:
The autoregressive (AR) model is commonly used to model time-varying processes and solve problems in the fields of natural science, economics and finance, and others. The models have always been discussed in the context of random processes and are often perceived as statistical tools for time series data.
A regression model, like linear regression, models an output value based on a linear combination of input values.
Example: y^ = b0 + b1*X1
Where y^ is the prediction, b0 and b1 are coefficients found by optimising the model on training data, and X1 is an input value.
This modelling technique can be used on time series where input variables are taken as observations at previous time steps, called lag variables.
For example, we can predict the value for the next time step (t+1) given the observations at the last two time steps (t-1 and t-2). As a regression model, this would look as follows:
X(t+1) = b0 + b1*X(t-1) + b2*X(t-2)
Because the regression model uses data from the same input variable at previous time steps, it is referred to as an autoregression.
The notation AR(p) refers to the autoregressive model of order p. The AR(p) model is written as:
X(t) = c + a1*X(t-1) + a2*X(t-2) + ... + ap*X(t-p) + e(t)
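A hedged sketch, assuming the statsmodels library, of fitting an AR(2) model; the series and the lag order are illustrative only.

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

series = np.array([21.0, 22.4, 23.1, 22.8, 24.0, 25.1, 24.7, 26.0, 27.2, 26.8,
                   28.1, 29.0, 28.4, 30.1, 31.0])

model = AutoReg(series, lags=2).fit()         # AR(2): uses the two previous time steps
print(model.params)                           # intercept and the lag coefficients
print(model.predict(start=len(series), end=len(series)))   # one-step-ahead forecast
```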
Q2. What is Moving Average?
Answer:
Moving average: From a dataset, we get an overall idea of trends with this technique; it is an average of any subset of numbers. The moving average is extremely useful for forecasting long-term trends. We can calculate it for any period. For example, if we have sales data for twenty years, we can calculate the five-year moving average, a four-year moving average, a three-year moving average and so on. Stock market analysts will often use a 50 or 200-day moving average to help them see trends in the stock market and (hopefully) forecast where the stocks are headed.
The following is the procedure for using ARMA.
Selecting the AR model and then equalizing the output to equal the signal being studied if the input is an impulse function or white noise. It should at least be a good approximation of the signal.
Finding the model's number of parameters using the known autocorrelation function or the data.
Using the derived model parameters to estimate the power spectrum of the signal.
Moving Average (MA) model -
It is a commonly used model in modern spectrum estimation and is also one of the methods of model parametric spectrum analysis. The procedure for estimating the MA model's signal spectrum is as follows.
Q4. What is Autoregressive Integrated Moving Average (ARIMA)?
Answer:
ARIMA: It is a statistical analysis model that uses time-series data to either better understand the data set or to predict future trends.
An ARIMA model can be understood by outlining each of its components as follows -
Autoregression (AR): It refers to a model that shows a changing variable that regresses on its own lagged, or prior, values.
Integrated (I): It represents the differencing of raw observations to allow the time series to become stationary, i.e., data values are replaced by the difference between the data values and the previous values.
Moving average (MA): It incorporates the dependency between an observation and the residual error from the moving average model applied to the lagged observations.
Each component functions as a parameter with a standard notation. For ARIMA models, the standard notation would be ARIMA with p, d, and q, where integer values substitute for the parameters to indicate the type of ARIMA model used. The parameters can be defined as -
p: It is the number of lag observations in the model; also known as the lag order.
d: It is the number of times that the raw observations are differenced; also known as the degree of differencing.
q: It is the size of the moving average window; also known as the order of the moving average.
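A hedged sketch, assuming the statsmodels library, of fitting an ARIMA(1, 1, 1) model on a synthetic series; the order is illustrative, not a recommendation.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

series = np.cumsum(np.random.normal(0.5, 1.0, 100))   # synthetic trending series

model = ARIMA(series, order=(1, 1, 1)).fit()           # p=1 lag, d=1 difference, q=1 MA term
print(model.params)
print(model.forecast(steps=5))                          # forecast the next five values
```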
Q5. What is SARIMA (Seasonal Autoregressive Integrated Moving-Average)?
Answer:
Seasonal ARIMA: It is an extension of ARIMA that explicitly supports univariate time series data with a seasonal component.
It adds three new hyper-parameters to specify the autoregression (AR), differencing (I) and moving average (MA) for the seasonal component of the series, as well as an additional parameter for the period of the seasonality.
Configuring a SARIMA requires selecting hyperparameters for both the trend and seasonal elements of the series.
Trend Elements
Three trend elements require configuration.
They are the same as in the ARIMA model, specifically -
p: It is the trend autoregression order.
d: It is the trend difference order.
q: It is the trend moving average order.
Seasonal Elements -
SARIMA(p,d,q)(P,D,Q)m -
The elements can be chosen through careful analysis of the ACF and PACF plots, looking at the correlations of recent time steps.
Q6. What is Seasonal Autoregressive Integrated Moving-Average
with Exogenous Regressors (SARIMAX) ?
Answer:
SARIMAX: It is an extension of the SARIMA model that also includes the modelling of exogenous variables.
Exogenous variables are also called covariates and can be thought of as parallel input sequences that have observations at the same time steps as the original series. The primary series may be referred to as endogenous data to contrast it from the exogenous sequence(s). The observations for exogenous variables are included in the model directly at each time step and are not modelled in the same way as the primary endogenous sequence (e.g. as an AR, MA, etc. process).
Q7. What is Vector autoregression (VAR)?
Answer:
VAR: It is a stochastic process model used to capture the linear interdependencies among multiple time series. VAR models generalise the univariate autoregressive model (AR model) by allowing for more than one evolving variable. All variables in a VAR enter the model in the same way: each variable has an equation explaining its evolution based on its own lagged values, the lagged values of the other model variables, and an error term. VAR modelling does not require as much knowledge about the forces influencing the variables as do structural models with simultaneous equations: the only prior knowledge required is a list of variables which can be hypothesised to affect each other intertemporally.
A VAR model describes the evolution of a set of k variables over the same sample period (t = 1, ..., T) as a linear function of only their past values. The variables are collected in the k-vector ((k × 1)-matrix) y_t, which has as its i-th element, y_{i,t}, the observation at time t of the i-th variable. Example: if the i-th variable is GDP, then y_{i,t} is the value of GDP at time "t".
A VAR(p) model is written as:
y_t = c + A_1*y_{t-1} + A_2*y_{t-2} + ... + A_p*y_{t-p} + e_t
where the observation y_{t-i} is called the i-th lag of y, c is a k-vector of constants (intercepts), A_i is a time-invariant (k × k)-matrix, and e_t is a k-vector of error terms.
Q10. What is Simple Exponential Smoothing (SES)?
Answer:
SES: This method models the next time step as an exponentially weighted linear function of observations at prior time steps.
This method is suitable for univariate time series without trend and seasonal components.
Exponential smoothing is a rule-of-thumb technique for smoothing time series data using the exponential window function. Whereas in the simple moving average the past observations are weighted equally, exponential functions are used to assign exponentially decreasing weights over time. It is an easily learned and easily applied procedure for making some determination based on prior assumptions by the user, such as seasonality. Exponential smoothing is often used for analysis of time-series data.
Exponential smoothing is one of many window functions commonly applied to smooth data in signal processing, acting as a low-pass filter to remove high-frequency noise.
The raw data sequence is often represented by {x_t} beginning at time t = 0, and the output of the exponential smoothing algorithm is commonly written as {s_t}, which may be regarded as a best estimate of what the next value of x will be. When the sequence of observations begins at time t = 0, the simplest form of exponential smoothing is given by the formulas:
s_0 = x_0
s_t = a*x_t + (1 - a)*s_{t-1}, for t > 0, where 0 < a < 1 is the smoothing factor.
# DAY 14
Q1. What is AlexNet?
Answer:
Alex Krizhevsky, Geoffrey Hinton and Ilya Sutskever created the neural network architecture called 'AlexNet' and won the ImageNet Classification Challenge (ILSVRC) in 2012. They trained their network on 1.2 million high-resolution images in 1000 different classes, with 60 million parameters and 650,000 neurons. The training was done on two GPUs with a split-layer concept because GPUs were a little bit slow at that time.
AlexNet is the name of a convolutional neural network which has had a large impact on the field of machine learning, specifically in the application of deep learning to machine vision. The network had a very similar architecture to LeNet by Yann LeCun et al., but was deeper, with more filters per layer, and with stacked convolutional layers. It consists of 11×11, 5×5 and 3×3 convolutions, max pooling, dropout, data augmentation, ReLU activations and SGD with momentum. It attached ReLU activations after every convolutional and fully connected layer. AlexNet was trained for six days simultaneously on two Nvidia GeForce GTX 580 GPUs, which is the reason why their network is split into two pipelines.
Architecture
In short, AlexNet contains five convolutional layers and three fully connected layers. ReLU is applied after every convolutional and fully connected layer. Dropout is applied before the first and second fully connected layers. The network has 62.3 million parameters and needs 1.1 billion computation units in a forward pass. We can also see that the convolution layers, which account for 6% of all the parameters, consume 95% of the computation.
The idea behind having fixed-size kernels is that all the variable-size convolutional kernels used in AlexNet (11x11, 5x5, 3x3) can be replicated by making use of multiple 3x3 kernels as building blocks. The replication is in terms of the receptive field covered by the kernels.
Let's consider an example. Say we have an input layer of size 5x5x1. Implementing a conv layer with a kernel size of 5x5 and stride one will result in an output feature map of 1x1. The same output feature map can be obtained by implementing two 3x3 conv layers with a stride of 1, as below:
Now, let's look at the number of variables that need to be trained. For a 5x5 conv layer filter, the number of variables is 25. On the other hand, two conv layers of kernel size 3x3 have a total of 3x3x2 = 18 variables (a reduction of 28%).
Q3. What is VGG16?
Answer:
VGG16: It is a convolutional neural network model proposed by K. Simonyan and A. Zisserman from the University of Oxford in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition". The model achieves 92.7% top-5 test accuracy on ImageNet, which is a dataset of over 14 million images belonging to 1000 classes. It was one of the famous models submitted to ILSVRC-2014. It improves on AlexNet by replacing the large kernel-sized filters (11 and 5 in the first and second convolutional layers, respectively) with multiple 3×3 kernel-sized filters one after another. VGG16 was trained for weeks using NVIDIA Titan Black GPUs.
The Architecture
The architecture depicted below is VGG16.
The input to the Conv1 layer is a fixed-size 224 x 224 RGB image. The image is passed through a stack of convolutional (conv.) layers, where the filters are used with a very small receptive field: 3×3 (which is the smallest size to capture the notion of left/right, up/down, centre). In one of the configurations, it also utilises 1×1 convolution filters, which can be seen as a linear transformation of the input channels. The convolution stride is fixed to 1 pixel; the spatial padding of the conv. layer input is such that the spatial resolution is preserved after the convolution, i.e. the padding is 1 pixel for 3×3 conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers. Max-pooling is performed over a 2×2 pixel window, with stride 2.
Three Fully-Connected (FC) layers follow the stack of convolutional layers (which has a different depth in different architectures): the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels. The final layer is the softmax layer. The configuration of the fully connected layers is the same in all the networks.
All hidden layers are equipped with the rectification (ReLU) non-linearity. It is also noted that none of the networks (except for one) contain Local Response Normalisation (LRN); such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time.
positive and negative images. It is then used to detect objects in other images. The algorithm has four stages:
Haar Feature Selection
Creating Integral Images
Adaboost Training
Cascading Classifiers
It is well known for being able to detect faces and body parts in an image, but can be trained to identify almost any object.
trained models that have been used for another task to jump-start the development process on a new task or problem.
The benefits of Transfer Learning are that it can speed up the time it takes to develop and train a model by reusing these pieces or modules of already developed models. This helps to speed up the model training process and accelerate results.
Region Proposal Network:
The output of the region proposal network is a bunch of boxes/proposals that will be examined by a classifier and regressor to eventually check the occurrence of objects. To be more precise, the RPN predicts the possibility of an anchor being background or foreground, and refines the anchor.
Problems with R-CNN:
Q9. What is GoogLeNet/Inception?
Answer:
The winner of the ILSVRC 2014 competition was GoogLeNet from Google. It achieved a top-5 error rate of 6.67%! This was very close to human-level performance, which the organisers of the challenge were now forced to evaluate. As it turns out, this was rather hard to do and required some human training to beat GoogLeNet's accuracy. After a few days of training, the human expert (Andrej Karpathy) was able to achieve a top-5 error rate of 5.1% (single model) and 3.6% (ensemble). The network used a CNN inspired by LeNet but implemented a novel element which is dubbed an inception module. It used batch normalisation, image distortions and RMSprop. This module is based on several very small convolutions in order to reduce the number of parameters drastically. Their architecture consisted of a 22-layer deep CNN but reduced the number of parameters from 60 million (AlexNet) to 4 million.
It contains 1×1 convolutions at the middle of the network, and global average pooling is used at the end of the network instead of fully connected layers. These two techniques are from another paper, "Network In Network" (NIN). Another technique, called the inception module, is to have different sizes/types of convolutions for the same input and to stack all the outputs.
Q10. What is LeNet-5?
Answer:
LeNet-5, a pioneering 7-level convolutional network by LeCun et al. in 1998 that classifies digits, was applied by several banks to recognise hand-written numbers on checks (cheques) digitised in 32x32 pixel greyscale input images. The ability to process higher-resolution images requires larger and more convolutional layers, so the availability of computing resources constrains this technique.
# DAY 15
Q1. What is Autoencoder?
Answer:
Autoencoder neural network: It is an unsupervised machine learning algorithm that applies backpropagation, setting the target values to be equal to the inputs. It is trained to attempt to copy its input to its output. Internally, it has a hidden layer that describes a code used to represent the input.
2. Sparse autoencoder
An autoencoder takes the input image or vector and learns a code dictionary that changes the raw input from one representation to another. A sparse autoencoder adds a sparsity enforcer that directs a single-layer network to learn a code dictionary which minimizes the error in reproducing the input while restricting the number of code words required for reconstruction.
The sparse autoencoder consists of a single hidden layer, which is connected to the input vector by a weight matrix forming the encoding step. The hidden layer then outputs to a reconstruction vector, using a tied weight matrix to form the decoder.
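A hedged Keras sketch of a small autoencoder, where the target is set equal to the input as described above; the layer sizes and data are illustrative only.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(64,))
encoded = layers.Dense(8, activation="relu")(inputs)        # bottleneck "code"
decoded = layers.Dense(64, activation="sigmoid")(encoded)   # reconstruction of the input

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(256, 64).astype("float32")
autoencoder.fit(X, X, epochs=3, batch_size=32, verbose=0)   # target equals the input
print(autoencoder.predict(X[:1]).shape)
```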
Lexical or Word Level Similarity
When referring to text similarity, people refer to how similar two pieces of text are at the surface level. Example: how similar are the phrases "the cat ate the mouse" and "the mouse ate the cat food" by just looking at the words? On the surface, if you consider only word-level similarity, these two phrases (with determiners disregarded) appear very similar, as 3 of the 4 unique words are an exact overlap.
Semantic Similarity:
Another notion of similarity mostly explored by the NLP research community is how similar in meaning any two phrases are. If we look at the phrases "the cat ate the mouse" and "the mouse ate the cat food", we know that while the words significantly overlap, these two phrases have different meanings. Extracting meaning from the phrases is often the more difficult task, as it requires a deeper level of analysis. For example, we can actually look at simple aspects like the order of words: "cat ==> ate ==> mouse" and "mouse ==> ate ==> cat food". The words overlap in this case, but the order of occurrence is different, and we can tell that these two phrases have different meanings. This is just one example. Most people use syntactic parsing to help with semantic similarity. Let's have a look at the parse trees for these two phrases. What can you get from it?
Q3. What is dropout in neural networks?
Answer:
When we train our neural network (or model) by updating each of its weights, it might become too dependent on the dataset we are using. Therefore, when this model has to make a prediction or classification, it will not give satisfactory results. This is known as over-fitting. We might understand this problem through a real-world example: if a student of science learns only one chapter of a book and then takes a test on the whole syllabus, he will probably fail.
To overcome this problem, we use a technique that was introduced by Geoffrey Hinton in 2012. This technique is known as dropout.
Dropout refers to ignoring units (i.e., neurons) during the training phase of a certain set of neurons, which is chosen at random. By "ignoring", I mean these units are not considered during a particular forward or backward pass.
At each training stage, individual nodes are either dropped out of the net with probability 1-p or kept with probability p, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed.
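A hedged Keras sketch of dropout layers that randomly ignore 20% of the previous layer's units during training; the architecture is illustrative only.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.2),                      # each unit is dropped with probability 0.2
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```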
Q4. What is Forward Propagation?
Answer:
Input X provides the information that then propagates to the hidden units at each layer and finally produces the output y. The architecture of the network entails determining its depth, width, and the activation functions used on each layer. Depth is the number of hidden layers. Width is the number of units (nodes) on each hidden layer, since we don't control either the input layer or output layer dimensions. There are quite a few activation functions, such as the Rectified Linear Unit, Sigmoid, Hyperbolic tangent, etc. Research has shown that deeper networks outperform networks with more hidden units. Therefore, it's always better and won't hurt to train a deeper network.
Q6. What is Information Extraction?
Answer:
Information extraction (IE): It is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most cases, this activity concerns processing human language texts using natural language processing (NLP).
Information extraction depends on named entity recognition (NER), a sub-tool used to find targeted information to extract. NER recognizes entities first as one of several categories, such as location (LOC), persons (PER), or organizations (ORG). Once the information category is recognized, an information extraction utility extracts the named entity's related information and constructs a machine-readable document from it, which algorithms can further process to extract meaning. IE finds meaning by way of other subtasks, including co-reference resolution, relationship extraction, language and vocabulary analysis, and sometimes audio extraction.
Q7. What is Text Generation?
Answer:
Text Generation: It is a type of Language Modelling problem. Language Modelling is the core problem for several natural language processing tasks such as speech-to-text, conversational systems, and text summarization. A trained language model learns the likelihood of occurrence of a word based on the previous sequence of words used in the text. Language models can operate at the character level, n-gram level, sentence level or even paragraph level.
A language model is at the core of many NLP tasks, and is simply a probability distribution over a sequence of words:
P(w_1, w_2, ..., w_m)
It can also be used to estimate the conditional probability of the next word in a sequence:
P(w_m | w_1, ..., w_{m-1})
Q8. What is Text Summarization?
Answer:
We all interact with applications that use text summarization. Many of these applications are platforms that publish articles on daily news, entertainment, and sports. With our busy schedules, we like to read a summary of those articles before we decide to jump in and read the entire article. Reading a summary helps us identify the area of interest and gives a brief context of the story.
Text summarization is a subdomain of Natural Language Processing (NLP) that deals with extracting summaries from huge chunks of text. There are two main types of techniques used for text summarization: NLP-based techniques and deep learning-based techniques.
Text summarization: It refers to the technique of shortening long pieces of text. The intention is to create a coherent and fluent summary having only the main points outlined in the document.
How text summarization works:
There are two types of summarization: abstractive and extractive summarization.
1. Abstractive Summarization: It selects words based on semantic understanding, even words that did not appear in the source documents. It aims at producing the important material in a new way. It interprets and examines the text using advanced natural language techniques to generate a new, shorter text that conveys the most critical information from the original text. It can be compared to the way a human reads a text article or blog post and then summarizes it in their own words.
2. Extractive Summarization: This approach weights the most important parts of sentences and uses them to form the summary. Different algorithms and techniques are used to define the weights for the sentences and further rank them based on importance and similarity among each other.
Q9. What is Topic Modelling?
Answer:
Topic Modelling is the task of using unsupervised learning to extract the main topics (represented as a set of words) that occur in a collection of documents.
Topic modeling, in the context of Natural Language Processing, is described as a method of uncovering hidden structure in a collection of texts.
Dimensionality Reduction:
Topic modeling is a form of dimensionality reduction. Rather than representing a text T in its feature space as {Word_i: count(Word_i, T) for Word_i in V}, we can represent the text in its topic space as {Topic_i: weight(Topic_i, T) for Topic_i in Topics}.
Unsupervised learning:
Topic modeling can be compared to clustering. As in the case of clustering, the number of topics, like the number of clusters, is a hyperparameter. By doing topic modeling, we build clusters of words rather than clusters of texts. A text is thus a mixture of all the topics, each having a certain weight.
A Form of Tagging:
If document classification is assigning a single category to a text, topic modeling is assigning multiple tags to a text. A human expert can label the resulting topics with human-readable labels and use different heuristics to convert the weighted topics to a set of tags.
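A hedged scikit-learn sketch of topic modeling with LDA; the toy documents and the choice of two topics are illustrative assumptions (get_feature_names_out assumes a recent scikit-learn release):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors worry about the markets",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)                       # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)  # 2 topics (hyperparameter)
doc_topics = lda.fit_transform(X)                 # each row: topic weights for one document

terms = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = [terms[i] for i in comp.argsort()[-3:]]
    print(f"topic {k}:", top)                     # topics as clusters of words
```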
Q10. What are Hidden Markov Models?
Answer:
Hidden Markov Models (HMMs) are a class of probabilistic graphical model that allow us to predict a sequence of unknown (hidden) variables from a set of observed variables. A simple example of an HMM is predicting the weather (hidden variable) based on the type of clothes that someone wears (observed). An HMM can be viewed as a Bayes Net unrolled through time, with observations made at a sequence of time steps being used to predict the best sequence of hidden states.
The diagram from Wikipedia shows an HMM and its transitions. The scenario is a room that contains urns X1, X2, and X3, each of which contains a known mix of balls, each ball labeled y1, y2, y3, or y4. A sequence of four balls is randomly drawn. In this particular case, the user observes the sequence of balls y1, y2, y3, and y4 and is attempting to discern the hidden state, which is the right sequence of three urns that these four balls were pulled from.
Why "Hidden" Markov Model?
The reason it is called the Hidden Markov Model is that we are constructing an inference model based on the assumptions of a Markov process. The Markov process assumption is simply that the "future is independent of the past given the present".
To make this point clear, consider the scenario below where the weather, the hidden variable, can be Hot, Mild or Cold, and the observed variables are the types of clothing worn. The arrows represent transitions from a hidden state to another hidden state or from a hidden state to an observed variable.
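A small illustrative Viterbi decode for the weather/clothing example; all probabilities below are made-up assumptions, not values from the original text:

```python
import numpy as np

states = ["Hot", "Mild", "Cold"]                 # hidden states
obs_symbols = ["T-shirt", "Jacket", "Coat"]      # observed clothing

start_p = np.array([0.4, 0.4, 0.2])              # assumed initial state probabilities
trans_p = np.array([[0.6, 0.3, 0.1],             # P(next state | current state)
                    [0.3, 0.4, 0.3],
                    [0.1, 0.3, 0.6]])
emit_p = np.array([[0.7, 0.2, 0.1],              # P(observation | state)
                   [0.3, 0.5, 0.2],
                   [0.1, 0.3, 0.6]])

def viterbi(obs):
    """Return the most likely hidden-state sequence for an observed sequence."""
    n, m = len(obs), len(states)
    v = np.zeros((n, m))                         # best path probability ending in each state
    back = np.zeros((n, m), dtype=int)           # backpointers
    v[0] = start_p * emit_p[:, obs[0]]
    for t in range(1, n):
        for j in range(m):
            scores = v[t - 1] * trans_p[:, j]
            back[t, j] = scores.argmax()
            v[t, j] = scores.max() * emit_p[j, obs[t]]
    path = [int(v[-1].argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[s] for s in reversed(path)]

observed = [obs_symbols.index(o) for o in ["Coat", "Jacket", "T-shirt"]]
print(viterbi(observed))                         # e.g. ['Cold', 'Mild', 'Hot']
```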
# Day-16
Q1. What is Statistical Learning?
Answer:
Statistical learning: It is a framework for understanding data based on statistics, which can be classified as supervised or unsupervised. Supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs, while in unsupervised statistical learning there are inputs but no supervising output; nevertheless, we can learn relationships and structure from such data.
Y = f(X) + ε, where X = (X1, X2, ..., Xp),
f is an unknown function and ε is the random error (reducible and irreducible).
Prediction & Inference:
In situations where a set of inputs X is readily available but the output Y is not known, we often treat f as a black box (not concerned with the exact form of f), as long as it yields accurate predictions for Y. This is prediction.
There are situations where we are interested in understanding the way that Y is affected as X changes. In this type of situation, we wish to estimate f, but our goal is not necessarily to make predictions for Y. Here we are more interested in understanding the relationship between X and Y. Now f cannot be treated as a black box, because we need to know its exact form. This is inference.
Parametric & Non-parametric methods:
Parametric statistics: These are statistical tests based on underlying assumptions about the data's distribution. In other words, they are based on the parameters of the normal curve. Because parametric statistics are based on the normal curve, data must meet certain assumptions, or parametric statistics cannot be calculated. Before running any parametric statistics, you should always be sure to test the assumptions for the tests that you are planning to run. A typical parametric form for f is the linear model:
f(X) = β0 + β1X1 + β2X2 + ... + βpXp
As the name suggests, nonparametric statistics are not based on the parameters of the normal curve. Therefore, if our data violate the assumptions of the usual parametric test, nonparametric statistics might better describe the data, and we should try running the nonparametric equivalent of the parametric test. We should also consider using nonparametric equivalent tests when we have limited sample sizes (e.g., n < 30). Though nonparametric statistical tests have more flexibility than parametric statistical tests, they are not as powerful; therefore, most statisticians recommend that, when appropriate, parametric statistics are preferred.
Prediction Accuracy and Model Interpretability:
Of the many methods that we use for statistical learning, some are less flexible and more restrictive. When inference is the goal, there are clear advantages to using simple and relatively inflexible statistical learning methods. When we are only interested in prediction, we use the most flexible models available.
Q2. What is ANOVA?
Answer:
ANOVA: It stands for "Analysis of Variance" and is an extremely important tool for analysis of data (both One-Way and Two-Way ANOVA are used). It is a statistical method to compare the population means of two or more groups by analyzing variance. The variance would differ only when the means are significantly different.
The ANOVA test is a way to find out if survey or experiment results are significant. In other words, it helps us figure out if we need to reject the null hypothesis or accept the alternative hypothesis. We are testing groups to see if there's a difference between them. Examples of when we might want to test different groups:
• A group of psychiatric patients is trying three different therapies: counseling, medication, and biofeedback. We want to see if one therapy is better than the others.
• A manufacturer has two different processes to make light bulbs and wants to know which one is better.
• Students from different colleges take the same exam. We want to see if one college outperforms the other.
Types of ANOVA:
• One-way ANOVA
• Two-way ANOVA
One-way ANOVA is a hypothesis test in which only one categorical variable, or single factor, is taken into consideration. With the help of the F-distribution, it enables us to compare the means of three or more samples. The null hypothesis (H0) is the equality of all population means, while the alternative hypothesis is a difference in at least one mean.
Two-way ANOVA examines the effect of two independent factors on a dependent variable. It also studies the inter-relationship between the independent variables influencing the values of the dependent variable, if any.
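A small sketch of a one-way ANOVA with SciPy's f_oneway, using hypothetical exam scores from three colleges (the numbers are purely illustrative):

```python
from scipy import stats

# Hypothetical exam scores from three colleges (illustrative values only).
college_a = [78, 85, 90, 72, 88]
college_b = [81, 79, 86, 84, 80]
college_c = [65, 70, 74, 68, 72]

f_stat, p_value = stats.f_oneway(college_a, college_b, college_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# If p < 0.05 we reject the null hypothesis that all group means are equal.
```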
When comparing two or more continuous response variables by a single factor, a one-way MANOVA is appropriate (e.g., comparing 'test score' and 'annual income' together by 'level of education'). A two-way MANOVA also entails two or more continuous response variables, but compares them by at least two factors (e.g., comparing 'test score' and 'annual income' together by both 'level of education' and 'zodiac sign').
So the main difference is that for the classifier approach, the algorithm assigns the outcome as the most frequent class among the nearest neighbors, while for the regression approach, the response is the average value of the nearest neighbors.
Covariance: It measures the directional relationship between the returns on two assets. A positive covariance means that asset returns move together, while a negative covariance means they move inversely. Covariance is calculated by analyzing at-return surprises (standard deviations from the expected return) or by multiplying the correlation between the two variables by the standard deviation of each variable.
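A minimal NumPy sketch of both routes to the covariance; the daily-return numbers are hypothetical:

```python
import numpy as np

# Hypothetical daily returns for two assets (illustrative values only).
asset_x = np.array([0.010, -0.004, 0.006, 0.002, -0.001])
asset_y = np.array([0.008, -0.006, 0.005, 0.001, -0.002])

cov_matrix = np.cov(asset_x, asset_y)          # 2x2 sample covariance matrix
covariance = cov_matrix[0, 1]

# Equivalent route mentioned above: correlation times the two standard deviations.
corr = np.corrcoef(asset_x, asset_y)[0, 1]
approx_cov = corr * asset_x.std(ddof=1) * asset_y.std(ddof=1)
print(covariance, approx_cov)                  # the two values agree
```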
# DAY 17
Q1. What is ERM (Empirical Risk Minimization)?
Answer:
Empirical risk minimization (ERM): It is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance. The idea is that we don't know exactly how well an algorithm will work in practice (the true "risk") because we don't know the true distribution of data that the algorithm will work on, but as an alternative we can measure its performance on a known set of training data.
We assume that our samples come from this distribution and use our dataset as an approximation. If we compute the loss using the data points in our dataset, it's called the empirical risk. It is "empirical" and not "true" because we are using a dataset that's a subset of the whole population.
When our learning model is built, we have to pick a function that minimizes the empirical risk, that is, the delta between predicted output and actual output for the data points in the dataset. This process of finding such a function is called empirical risk minimization (ERM). We want to minimize the true risk; we don't have the information that allows us to achieve that, so we hope that the empirical risk will be almost the same as the true risk.
Let's get a better understanding through an example.
We want to build a model that can differentiate between a male and a female based on specific features. If we select 150 random people where the women are really short and the men are really tall, then the model might incorrectly assume that height is the differentiating feature. For building a truly accurate model, we would have to gather all the women and men in the world to extract differentiating features. Unfortunately, that is not possible! So we select a small number of people and hope that this sample is representative of the whole population.
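A NumPy sketch of the idea: the empirical risk is just the average loss over the observed sample, and ERM picks the parameters that minimise it. The data and the squared-loss/linear-model choice are assumptions for illustration:

```python
import numpy as np

def empirical_risk(w, X, y):
    """Mean squared loss of a linear predictor on the observed sample (the empirical risk)."""
    preds = X @ w
    return np.mean((preds - y) ** 2)

# Hypothetical sample drawn from an unknown true distribution.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))              # e.g. 150 people, 2 features
true_w = np.array([1.5, -0.5])
y = X @ true_w + rng.normal(scale=0.1, size=150)

# ERM: pick the parameters that minimise the empirical risk (closed form for least squares).
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(empirical_risk(w_hat, X, y))         # small, but only an estimate of the true risk
```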
Q2. What is PAC (Probably Approximately Correct)?
Answer:
PAC: In computational learning theory, probably approximately correct (PAC) learning is a framework for the mathematical analysis of machine learning.
The learner receives samples and must pick a generalization function (called the hypothesis) from a specific class of possible functions. Our goal is that, with high probability, the selected function will have low generalization error. The learner must be able to learn the concept given any arbitrary approximation ratio, probability of success, or distribution of the samples.
A hypothesis class is PAC (Probably Approximately Correct) learnable if there exists a function m_H and an algorithm such that for any labeling function f, any distribution D over the domain of inputs X, and any delta and epsilon, running the algorithm on m ≥ m_H samples produces a hypothesis h such that, with probability 1-delta, its true error is lower than epsilon. A labeling function is nothing other than saying that we have a specific function f that labels the data in the domain.
Q3. What is ELMo?
Answer:
ELMo is a novel way to represent words as vectors or embeddings. These word embeddings help achieve state-of-the-art (SOTA) results in several NLP tasks.
Q4. What is Pragmatic Analysis in NLP?
Answer:
Pragmatic Analysis (PA): It deals with outside-world knowledge, which means understanding that is external to the documents and queries. In PA, what was described is reinterpreted by what it actually meant, deriving the various aspects of language that require real-world knowledge.
It deals with the overall communicative and social content and its effect on interpretation. It means abstracting the meaningful use of language in situations. In this analysis, the main focus is always on what was said being reinterpreted as what was intended.
It helps users to discover this intended effect by applying a set of rules that characterize cooperative dialogues.
E.g., "close the window?" should be interpreted as a request instead of an order.
Q6. What is ULMFit?
Answer:
Transfer Learning in NLP (Natural Language Processing) is an area that had not been explored with great success. But in May 2018, Jeremy Howard and Sebastian Ruder came up with the paper Universal Language Model Fine-tuning for Text Classification (ULMFiT), which explores the benefits of using a pretrained model for text classification. It proposes ULMFiT, a transfer learning method that could be applied to any task in NLP. The method outperforms the state-of-the-art on six text classification tasks.
ULMFiT uses a regular LSTM which is the state-of-the-art language model architecture (AWD-LSTM). The LSTM network has three layers. A single architecture is used throughout, for pre-training as well as for fine-tuning.
ULMFiT achieves state-of-the-art results using novel techniques like:
• Discriminative fine-tuning
• Slanted triangular learning rates
• Gradual unfreezing
Discriminative Fine-Tuning
Different layers of a neural network capture different types of information, so they should be fine-tuned to varying extents. Instead of using the same learning rate for all layers of the model, discriminative fine-tuning allows us to tune each layer with a different learning rate.
Slanted triangular learning rates
Q8. What is XLNet?
Answer:
XLNet is a BERT-like model rather than a totally different one, but it is an auspicious and promising one. In one word, XLNet is a generalized autoregressive pretraining method.
Autoregressive (AR) language model: It is a kind of model that uses the context words to predict the next word. But here the context is constrained to one of two directions, either forward or backward.
An advantage of AR language models is that they are good at generative Natural Language Processing (NLP) tasks: because generation usually proceeds in the forward direction, AR language models naturally work well on such NLP tasks.
But the autoregressive language model has a disadvantage: it can only use forward context or backward context, which means it can't use forward and backward context at the same time.
"The Transformers" is a Japanese band. That band was formed in 1968, during the height of Japanese music history.
In the above example, the words "the band" in the second sentence refer to the band "The Transformers" introduced in the first sentence. When you read about the band in the second sentence, you know that it is referencing the "The Transformers" band. That may be important for translation. To translate sentences like that, a model needs to figure out these sorts of dependencies and connections. Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) have been used to deal with this problem because of their properties.
# Day-18
Q1. What is Levenshtein Algorithm?
Answer:
Levenshtein distance is a string metric for measuring the difference between two sequences. The Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
Mathematically, the Levenshtein distance between two strings a, b (of length |a| and |b| respectively) is given by lev_a,b(|a|, |b|), where:
lev_a,b(i, j) = max(i, j) if min(i, j) = 0, and otherwise
lev_a,b(i, j) = min( lev_a,b(i-1, j) + 1, lev_a,b(i, j-1) + 1, lev_a,b(i-1, j-1) + 1_(a_i ≠ b_j) )
Here 1_(a_i ≠ b_j) is the indicator function, equal to 0 when a_i = b_j and equal to 1 otherwise, and lev_a,b(i, j) is the distance between the first i characters of a and the first j characters of b.
Example: The Levenshtein distance between "HONDA" and "HYUNDAI" is 3, since three edits (insert Y, substitute O with U, insert I) change one into the other, and there is no way to do it with fewer than three edits.
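A plain-Python dynamic-programming sketch of the distance defined above (a minimal illustration, not an optimized implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    m, n = len(a), len(b)
    # dist[i][j] = distance between the first i chars of a and the first j chars of b
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1          # the indicator term
            dist[i][j] = min(dist[i - 1][j] + 1,              # deletion
                             dist[i][j - 1] + 1,              # insertion
                             dist[i - 1][j - 1] + cost)       # substitution (or match)
    return dist[m][n]

print(levenshtein("HONDA", "HYUNDAI"))   # 3, as in the example above
```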
Q2. What is Soundex?
Answer:
Soundex attempts to find similar names or homophones using phonetic notation. The program retains letters according to detailed rules, to match individual names for purposes of large-volume research.
Soundex phonetic algorithm: It indexes strings depending on their English pronunciation. The algorithm is used to describe homophones, words that are pronounced the same but spelt differently.
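A minimal sketch of the classic American Soundex coding, assuming the standard digit mapping; real implementations handle more edge cases:

```python
def soundex(name: str) -> str:
    """Keep the first letter, code the rest, collapse duplicates, pad/truncate to 4 chars."""
    codes = {c: d for d, letters in
             {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L",
              "5": "MN", "6": "R"}.items() for c in letters}
    name = name.upper()
    first = name[0]
    digits = [codes.get(first, "")]
    for ch in name[1:]:
        d = codes.get(ch, "")            # vowels and H/W/Y map to no digit
        if d and d != digits[-1]:        # collapse adjacent letters with the same code
            digits.append(d)
        elif not d and ch not in "HW":   # a vowel separates duplicates, so reset
            digits.append("")
    code = first + "".join(d for d in digits[1:] if d)
    return (code + "000")[:4]

# Homophones map to the same code:
print(soundex("Robert"), soundex("Rupert"))   # both R163
```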
The above approaches convert the parse tree into a sequence following a depth-first traversal, so that sequence-to-sequence models can be applied to it. The linearized version of the above parse tree looks as follows: (S (N) (VP V N)).
Two sets of probability distributions:
• The collection of distributions of topics for each document
• The collection of distributions of words for each topic
Q5.What is LSA?
Answer:
Latent Semantic Analysis (LSA): It is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text.
It is an information retrieval technique which analyzes and identifies patterns in an unstructured collection of text and the relationships between them.
Latent Semantic Analysis itself is an unsupervised way of uncovering synonyms in a collection of documents.
Why LSA (Latent Semantic Analysis)?
LSA is a technique for creating vector representations of documents. Having a vector representation of a document gives us a way to compare documents for their similarity by calculating the distance between the vectors. This, in turn, means we can do handy things such as classify documents to find out which of a set of known topics they most likely belong to.
Classification implies we have some known topics that we want to group documents into, and that you have some labelled training data. If you want to identify natural groupings of the documents without any labelled data, you can use clustering.
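A hedged scikit-learn sketch of LSA via truncated SVD on a TF-IDF matrix; the toy documents and the two-component choice are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the pitch was slow and the batsman scored runs",
    "the bowler took wickets on a fast pitch",
    "the bank approved the loan application",
    "interest rates at the bank went up",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

svd = TruncatedSVD(n_components=2, random_state=0)   # LSA = SVD on the term-document matrix
doc_vectors = svd.fit_transform(X)                   # each document as a 2-d "concept" vector

print(cosine_similarity(doc_vectors))                # documents on the same theme score higher
```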
Q6. What is PLSA?
Answer:
PLSA stands for Probabilistic Latent Semantic Analysis; it uses a probabilistic method instead of SVD to tackle the problem. The main idea is to find a probabilistic model with latent topics that can generate the data we observe in our document-term matrix. Specifically, we want a model P(D, W) such that for any document d and word w, P(d, w) corresponds to that entry in the document-term matrix.
Each document is seen as a mixture of topics, and each topic consists of a collection of words. PLSA adds a probabilistic spin to these assumptions:
• Given a document d, topic z is present in that document with probability P(z|d)
• Given a topic z, word w is drawn from z with probability P(w|z)
The joint probability of seeing a given document and word together is:
P(d, w) = P(d) Σ_z P(z|d) P(w|z)
In the above case, P(D), P(Z|D), and P(W|Z) are the parameters of our model. P(D) can be determined directly from the corpus. P(Z|D) and P(W|Z) are modelled as multinomial distributions and can be trained using the expectation-maximisation algorithm (EM).
Sentiment Analysis: It is the process of understanding if a given text is talking positively or negatively about a given subject (e.g., for brand-monitoring purposes).
Topic Detection: The task of identifying the theme or topic of a piece of text (e.g., knowing if a product review is about Ease of Use, Customer Support, or Pricing when analysing customer feedback).
Language Detection: The procedure of detecting the language of a given text (e.g., knowing if an incoming support ticket is written in English or Spanish for automatically routing tickets to the appropriate team).
Q10. What is Word Sense Disambiguation (WSD)?
Answer:
WSD (Word Sense Disambiguation) is a solution to the ambiguity which arises due to the different meanings of words in different contexts.
In natural language processing, word sense disambiguation (WSD) is the problem of determining which "sense" (meaning) of a word is activated by the use of the word in a particular context, a process which appears to be mostly unconscious in people. WSD is a natural classification problem: given a word and its possible senses, as defined by a dictionary, classify an occurrence of the word in context into one or more of its sense classes. The features of the context (such as the neighbouring words) provide the evidence for classification.
For example, consider these two sentences:
"The bank will not be accepting the cash on Saturdays."
"The river overflowed the bank."
The word "bank" in the first sentence refers to a commercial (finance) bank, while in the second sentence it refers to a river bank. The uncertainty that arises due to this is tough for a machine to detect and resolve. Detecting the change is the first issue, and fixing it and displaying the correct output is the second issue.
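A hedged NLTK sketch using the Lesk algorithm (one classic WSD heuristic, not necessarily the approach the text has in mind); it assumes the WordNet corpus can be downloaded, and the exact senses returned depend on WordNet and the Lesk overlap heuristic:

```python
import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)   # WordNet senses are needed by the Lesk algorithm

s1 = "The bank will not be accepting the cash on Saturdays".split()
s2 = "The river overflowed the bank".split()

# Lesk picks the WordNet sense whose definition overlaps most with the context words.
print(lesk(s1, "bank").definition())
print(lesk(s2, "bank").definition())
```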
# DAY 19
Q1. What is LSI(Latent Semantic Indexing)?
Answer:
Latent Semantic Indexing (LSI): It is an indexing and retrieval method that uses a mathematical technique called SVD (Singular Value Decomposition) to find patterns in the relationships between terms and concepts contained in an unstructured collection of text. It is based on the principle that words that are used in the same contexts tend to have similar meanings.
For example, "Tiger" and "Woods" appearing together are associated with the golfer rather than the animal, and "Paris" and "Hilton" appearing together are associated with the celebrity rather than the city.
Example:
If you use LSI to index a collection of articles and the words "fan" and "regulator" appear together frequently enough, the search algorithm would notice that the two terms are semantically close. A search for "fan" will therefore return a set of items containing that term, but also items that contain just the word "regulator". It doesn't understand word distance, but by examining a sufficient number of documents, it only knows the two terms are interrelated. It then uses that information to provide an expanded set of results with better recall than a plain keyword search.
The diagram (omitted here) describes the difference between LSI and keyword searches; W stands for a document.
Q2. What is Named Entity Recognition? And tell some use cases of
NER?
Answer:
Named-entity recognition (NER): It is also known as entity extraction or entity identification, and is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories like the names of persons, organizations, places, expressions of time, quantities, monetary values, percentages and more.
In each text document, particular terms represent specific entities that are more informative and have a different context. These entities are called named entities, which more accurately refers to terms that represent real-world objects like people, places, organizations or institutions, and so on, which are often expressed by proper names. A naive approach could be to find these by having a look at the noun phrases in text documents. NER is also known as entity chunking/extraction, and is a popular technique used in information extraction to analyze and segment the named entities and categorize or classify them under various predefined classes.
Now, if we pass it through a Named Entity Recognition API, it pulls out the entities Bangalore (location) and Fitbit (product). This can then be used to categorize the complaint and assign it to the relevant department within the organization that should be handling this.
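A hedged spaCy sketch of the same idea; it assumes the small English model has been installed, and the exact labels (GPE, ORG, PRODUCT, ...) depend on the model:

```python
import spacy

# Assumes the model is installed:  python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "I bought a Fitbit in Bangalore last week and the strap broke."
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Bangalore -> GPE (location); Fitbit -> ORG/PRODUCT
```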
Q4. What is the language model?
Answer:
Language Modelling (LM): It is one of the essential parts of modern NLP. There are many sorts of applications for Language Modelling, like Machine Translation, Spell Correction, Speech Recognition, Summarization, Question Answering, Sentiment Analysis, etc. Each of those tasks requires the use of a language model. The language model is needed to represent text in a form understandable from the machine's point of view.
A statistical language model is a probability distribution over a series of words. Given such a series, say of length m, it assigns a probability to the whole series.
It provides context to distinguish between phrases and words that sound similar. For example, in American English, the phrases "wreck a nice beach" and "recognize speech" sound alike but mean different things.
Data sparsity is a significant problem in building language models. Most possible word sequences are not observed in training. One solution is to make the assumption that the probability of a word only depends on the previous n words. This is called an n-gram model, or a unigram model when n = 1. The unigram model is also known as the bag-of-words model.
How does this language model help in NLP tasks?
The probabilities returned by a language model are mostly useful to compare the likelihood that different sentences are "good sentences." This is useful in many practical tasks, for example:
Spell checking: We observe a word that is not identified as a known word as part of a sentence. Using the edit distance algorithm, we find the closest known words to the unknown word; these are the candidate corrections. For example, we observe the word "wurd" in the context of the sentence "I like to write this wurd." The candidate corrections are ["word", "weird", "wind"]. How can we select among these candidates the most likely correction for the suspected error "wurd"?
Automatic Speech Recognition: We receive as input a string of phonemes; a first model predicts candidate words for sub-sequences of the stream of phonemes; the language model helps in ranking the most likely sequence of words compatible with the candidate words produced by the acoustic model.
Machine Translation: Each word from the source language is mapped to multiple candidate words in the target language; the language model in the target language can rank the most likely sequence of candidate target words.
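A toy add-one-smoothed bigram language model (a minimal sketch, with a made-up three-sentence corpus) showing how such probabilities let us rank a fluent sentence above a scrambled one:

```python
from collections import Counter

corpus = [
    "i like to write this word",
    "i like to read",
    "we like to write",
]

# Count unigrams and bigrams from the toy corpus.
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """P(word | prev) with add-one smoothing over the observed vocabulary."""
    vocab = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)

def sentence_prob(sentence):
    tokens = ["<s>"] + sentence.split()
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob("i like to write this word"))   # higher
print(sentence_prob("word this write to like i"))   # lower
```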
Q5. What is Word Embedding?
Answer:
A word embedding is a learned representation for text where words that have the same meaning have a similar representation.
It is basically a form of word representation that bridges the human understanding of language to that of a machine. Word embeddings are representations of text in an n-dimensional space. They are essential for solving most NLP problems.
Another point worth considering is how we obtain word embeddings, as no two sets of word embeddings are identical. Word embeddings aren't random; they're developed by training a neural network. A powerful word-embedding method from Google is named Word2Vec, which is trained by predicting words that appear next to other words in a language. For example, for the word "cat", the neural network would predict words like "kitten" and "feline." This intuition that related words occur "near" each other allows us to place them in vector space.
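A hedged gensim sketch of training Word2Vec on a toy corpus; the parameter name vector_size assumes gensim >= 4 (it was called size in older releases), and real embeddings need far larger corpora:

```python
from gensim.models import Word2Vec

# A tiny toy corpus; real embeddings need millions of sentences.
sentences = [
    ["the", "cat", "chased", "the", "mouse"],
    ["the", "kitten", "chased", "a", "mouse"],
    ["dogs", "and", "cats", "are", "pets"],
    ["the", "feline", "purred", "at", "the", "kitten"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=200)

print(model.wv["cat"][:5])                  # first few dimensions of the "cat" vector
print(model.wv.most_similar("cat", topn=3)) # nearby words in the learned vector space
```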
window over a word, because no internal structure of the word is taken into account. As long as the characters are within this window, the order of the n-grams doesn't matter.
fastText works well with rare words. So even if a word wasn't seen during training, it can be broken down into n-grams to get its embeddings.
Word2vec and GloVe both fail to provide any vector representation for words that are not in the model dictionary; this is a huge advantage of the fastText method.
GloVe aims to achieve two goals:
(1) Create word vectors that capture meaning in vector space
(2) Take advantage of global count statistics instead of only local information
Unlike word2vec, which learns by streaming sentences, GloVe learns based on a co-occurrence matrix and trains word vectors so that their differences predict co-occurrence ratios. GloVe weights the loss based on word frequency.
Somewhat surprisingly, word2vec and GloVe turn out to be remarkably similar, despite starting off from entirely different starting points.
Gensim is an excellent library package for processing texts, working with word vector models (such as FastText, Word2Vec, etc.) and for building topic models. Another significant advantage of gensim is that it lets us handle large text files without having to load the entire file into memory.
We can also say it is an open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning.
Q9. What is Encoder-Decoder Architecture?
Answer:
Encoder:
The encoder simply takes the input data and trains on it, then it passes the final state of its recurrent layer as the initial state to the first recurrent layer of the decoder part.
Decoder:
The decoder takes the final state of the encoder's final recurrent layer and uses it as the initial state of its own first recurrent layer; the input of the decoder is the sequence that we want to obtain (for example, the French sentences in a translation task).
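A minimal Keras sketch of this encoder-decoder wiring; the vocabulary sizes, embedding and hidden dimensions are illustrative assumptions, and the full teacher-forcing training loop is omitted:

```python
from tensorflow import keras
from tensorflow.keras import layers

src_vocab, tgt_vocab, emb_dim, units = 8000, 10000, 128, 256   # hypothetical sizes

# Encoder: embeds the source sequence and returns its final LSTM states.
enc_inputs = keras.Input(shape=(None,))
enc_emb = layers.Embedding(src_vocab, emb_dim)(enc_inputs)
_, state_h, state_c = layers.LSTM(units, return_state=True)(enc_emb)

# Decoder: starts from the encoder's final states and predicts the target sequence.
dec_inputs = keras.Input(shape=(None,))
dec_emb = layers.Embedding(tgt_vocab, emb_dim)(dec_inputs)
dec_out, _, _ = layers.LSTM(units, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
outputs = layers.Dense(tgt_vocab, activation="softmax")(dec_out)

model = keras.Model([enc_inputs, dec_inputs], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```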
Q10. What is Context2Vec?
Answer:
Assume a case where you have a sentence like "I can't find May." The word "May" may refer to a month's name or to a person's name. You use the words surrounding it (the context) to determine the best suitable option.
This problem actually refers to the Word Sense Disambiguation task, in which you investigate the actual semantics of the word based on several semantic and linguistic techniques. The Context2Vec idea is taken from the original CBOW Word2Vec model, but instead of relying on averaging the embeddings of the words, it relies on a much more complex parametric model that is based on one layer of Bi-LSTM. Figure 1 (omitted here) shows the architecture of the CBOW model.
Context2Vec applies the same concept of windowing, but instead of using a simple average function, it uses 3 stages to learn a complex parametric network:
• A Bi-LSTM layer that takes left-to-right and right-to-left representations
• A feedforward network that takes the concatenated hidden representation and produces a hidden representation through learning the network parameters
• Finally, the objective function is applied to the network output.
# DAY 20
Q1. Do you have any idea about Event2Mind in NLP?
Answer:
Yes, it is based on an NLP research paper about common-sense inference from sentences.
Event2Mind: Common-sense Inference on Events, Intents, and Reactions
The study of "Commonsense Reasoning" in NLP deals with teaching computers how to gain and employ commonsense knowledge. NLP systems require commonsense to adapt quickly and understand humans as we talk to each other in a natural environment.
This paper proposes a new task to teach systems commonsense reasoning: given an event described in a short "event phrase" (e.g., "PersonX drinks coffee in the morning"), the researchers teach a system to reason about the likely intents ("PersonX wants to stay awake") and reactions ("PersonX feels alert") of the event's participants.
Understanding a narrative requires common-sense reasoning about the mental states of people in relation to events. For example, if "Robert is dragging his feet at work," a pragmatic implication about Robert's intent is that "Robert wants to avoid doing things" (above figure). You can also infer that Robert's emotional reaction might be feeling "bored" or "lazy." Furthermore, while not explicitly mentioned, you can assume that people other than Robert are affected by the situation, and these people are likely to feel "impatient" or "frustrated."
This type of pragmatic inference can likely be useful for a wide range of NLP applications that require accurate anticipation of people's intents and emotional reactions, even when they are not expressly mentioned. For example, an ideal dialogue system should react in empathetic ways by reasoning about the human user's mental state based on the events the user has experienced, without the user explicitly stating how they are feeling. Furthermore, advertisement systems on social media should be able to reason about the emotional reactions of people after events such as mass shootings and remove ads for guns, which might increase social distress. Also, pragmatic inference is a necessary step toward automatic narrative understanding and generation. However, this type of commonsense social reasoning goes far beyond the widely studied entailment tasks and thus falls outside the scope of existing benchmarks.
Q3. What is the Pix2Pix network?
Answer:
Pix2Pix network: It is a conditional GAN (cGAN) that learns the mapping from an input image to an output image.
Image-to-Image Translation is the process of translating one representation of an image into another representation.
Image-to-image translation is another example of a task that GANs (Generative Adversarial Networks) are ideally suited for. These are tasks for which it is nearly impossible to hard-code a loss function. Most studies on GANs are concerned with novel image synthesis, translating from a random vector z into an image. Image-to-image translation instead converts one image into another, like translating the edges of a bag into a photo image of the bag.
The authors do find some value in the L1 loss function as a weighted sidekick to the adversarial loss function.
The conditional-adversarial loss (generator versus discriminator) is very popularly formatted as follows:
L_cGAN(G, D) = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 - D(x, G(x, z)))]
The L1 loss function previously mentioned is shown below:
L_L1(G) = E_{x,y,z}[||y - G(x, z)||_1]
In the experiments, the authors report that they found the most success with the lambda parameter equal to 100.
Q4. What is U-Net?
Answer:
U-Net architecture: It is built upon the Fully Convolutional Network and modified in a way that yields better segmentation in medical imaging. Compared to FCN-8, the two main differences are (a) U-Net is symmetric, and (b) the skip connections between the downsampling path and the upsampling path apply a concatenation operator instead of a sum. These skip connections intend to provide local information to the global information while upsampling. Because of its symmetry, the network has a large number of feature maps in the upsampling path, which allows transferring information. By comparison, the underlying FCN architecture only had number-of-classes feature maps in its upsampling path.
How does it work?
Q5. What is pair2vec?
Answer:
This paper pretrains word-pair representations by maximizing the pointwise mutual information of pairs of words with their context. This encourages a model to learn more meaningful representations of word pairs than with more general objectives, like language modeling. The pretrained representations are useful in tasks like SQuAD and MultiNLI that require cross-sentence inference. You can expect to see more pretraining tasks that capture properties particularly suited to specific downstream tasks and are complementary to more general-purpose tasks like language modeling.
Reasoning about implied relationships between pairs of words is crucial for cross-sentence inference problems like question answering (QA) and natural language inference (NLI). In NLI, for example, given a premise such as "golf is prohibitively expensive," inferring that the hypothesis "golf is a cheap pastime" is a contradiction requires one to know that expensive and cheap are antonyms. Recent work has shown that current models, which rely heavily on unsupervised single-word embeddings, struggle to grasp such relationships. The pair2vec paper shows that they can be learned with pair2vec (pair vectors), which are trained, unsupervised, at a huge scale, and which significantly improve performance when added to existing cross-sentence attention mechanisms.
Unlike single-word representations, which are typically trained by modeling the co-occurrence of a target word x with its context c, these word-pair representations are learned by modeling the three-way co-occurrence between two words (x, y) and the context c that ties them together, as illustrated in the table above. While a similar training signal has been used to learn models for ontology construction and knowledge base completion, this paper shows, for the first time, that large-scale learning of pairwise embeddings can be used to directly improve the performance of neural cross-sentence inference models.
The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks with only a small number of training samples. It tends to focus on finding model-agnostic solutions, whereas multi-task learning remains deeply tied to model architecture.
Thus, meta-level AI algorithms make AI systems:
· Learn faster
· Generalizable to many tasks
· Adaptable to environmental changes, as in Reinforcement Learning
One can solve any problem with a single model, but meta-learning should not be confused with one-shot learning.
result in visualization. More than 20 commonly used active learning (AL) methods have been implemented in the toolbox, providing users with many choices.
Q9. What are Dropout Neural Networks?
Answer:
The term "dropout" refers to dropping out units (both hidden and visible) in a neural network.
At each training stage, individual nodes are either dropped out of the net with probability 1-p or kept with probability p, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed.
Why do we need dropout?
The answer to this question is "to prevent over-fitting."
A fully connected layer occupies most of the parameters, and hence neurons develop co-dependency amongst each other during training, which curbs the individual power of each neuron, leading to over-fitting of the training data.
Q10. What is GAN?
Answer:
A generative adversarial network (GAN): It is a class of machine learning systems invented by Ian Goodfellow and his colleagues in 2014. Two neural networks contest with each other in a game (in the sense of game theory, often but not always in the form of a zero-sum game). Given a training set, this technique learns to generate new data with the same statistics as the training set. E.g., a GAN trained on photographs can produce original pictures that look at least superficially authentic to human observers, having many realistic characteristics. Though initially proposed as a form of generative model for unsupervised learning, GANs have also proven useful for semi-supervised learning,[2] fully supervised learning, and reinforcement learning.
Example of GAN:
Generative Adversarial Networks take a game-theoretic approach, unlike a conventional neural network. The network learns to generate from a training distribution through a 2-player game. The two entities are the Generator and the Discriminator. These two adversaries are in constant battle throughout the training process.
# Day21
Q1. Explain Grad-CAM architecture?
Answer:
According to the research paper: "We propose a technique for making Convolutional Neural Network (CNN)-based models more transparent by visualizing input regions that are 'important' for predictions, producing visual explanations. Our approach is called Gradient-weighted Class Activation Mapping (Grad-CAM), which uses class-specific gradient information to localize the crucial regions. These localizations are combined with existing pixel-space visualizations to create a new high-resolution and class-discriminative display called Guided Grad-CAM. These methods help to better understand CNN-based models, including image captioning and visual question answering (VQA) models. We evaluate our visual explanations by measuring their ability to discriminate between classes, to inspire trust in humans, and their correlation with occlusion maps. Grad-CAM provides a new way to understand CNN-based models."
It is a technique for making CNN (Convolutional Neural Network)-based models more transparent by visualizing the regions of the input that are "important" for predictions from these models, i.e., visual explanations.
This visualization is both high-resolution (when the class of interest is 'tiger cat,' it identifies crucial 'tiger cat' features like stripes, pointy ears and eyes) and class-discriminative (it shows the 'tiger cat' but not the 'boxer (dog)').
Q2. Explain squeeze-net architecture?
Answer:
Nowadays, technology is at its peak. Self-driving cars and IoT are going to be household talk in the next few years. Since everything is controlled remotely (e.g., in self-driving cars we need the system to communicate with the servers regularly), if we have a model that has a small size, then we can quickly deploy it in the cloud. That is why we needed an architecture that is smaller in size and also achieves the same level of accuracy that other architectures achieve.
Its architecture:
Replace 3x3 filters with 1x1 filters: We plan to use the maximum number of 1x1 filters, as using a 1x1 filter rather than a 3x3 filter can reduce the number of parameters by 9x. We might think that replacing 3x3 filters with 1x1 filters would perform badly as there is less information to work on, but this is not the case. Typically, a 3x3 filter may capture the spatial information of pixels close to each other, while a 1x1 filter zeroes in on a pixel and captures features amongst its channels.
Decrease the number of input channels to 3x3 filters: To maintain a small total number of parameters in a CNN, it is crucial not only to decrease the number of 3x3 filters, but also to decrease the number of input channels to the 3x3 filters. We reduce the number of input channels to 3x3 filters using squeeze layers. The authors of this paper use a term called the "fire module," in which there is a squeeze layer and an expand layer. In the squeeze layer we use 1x1 filters, while in the expand layer we use a combination of 3x3 filters and 1x1 filters. The authors try to limit the number of inputs to the 3x3 filters to reduce the number of parameters in the layer.
Downsample late in the network so that convolution layers have large activation maps: Having got an intuition about reducing the sheer number of parameters we are working with, the question is how the model gets the most out of the remaining set of parameters. The authors of this paper downsample the feature maps in later layers, and this increases the accuracy. This is an interesting contrast to networks like VGG, where a large feature map is taken and then it gets smaller as the network approaches the end. They cite the paper by K. He and H. Sun that similarly applies delayed downsampling, leading to higher classification accuracy.
This architecture consists of the fire module, which enables it to bring down the number of parameters.
Another surprising thing is the lack of fully connected (dense) layers at the end, which one would see in a typical CNN architecture. The dense layers, at the end, learn all the relationships between the high-level features and the classes the network is trying to identify. The fully connected layers are designed to learn that noses and ears make up a face, and wheels and lights indicate cars. However, in this architecture, that extra learning step seems to be embedded within the transformations between the various "fire modules."
The squeeze-net can accomplish an accuracy nearly equal to AlexNet with 50x fewer parameters. The most impressive part is that if we apply Deep Compression to the already smaller model, it can reduce the size of the squeeze-net model to around 510x smaller than AlexNet.
Q3. Explain ZFNet architecture?
Answer:
The architecture of the network is an optimized version of the previous year's winner, AlexNet. The authors spent some time finding the bottlenecks of AlexNet and removing them, achieving superior performance.
(Figure caption) Receptive fields of convolutional neurons overlap, and neighboring neurons learn similar structures. (e): 2nd-layer features for ZFNet; note that there are no aliasing artifacts. Source: original paper.
In particular, they reduced the filter size in the 1st convolutional layer from 11x11 to 7x7, which resulted in fewer dead features learned in the first layer (see the image referenced above for an example). A dead feature is a situation where a convolutional kernel fails to learn any significant representation. Visually, it looks like a monotonic single-color image, where all the values are close to each other.
In addition to changing the filter size, the authors of ZFNet doubled the number of filters in all convolutional layers and the number of neurons in the fully connected layers as compared to AlexNet. In AlexNet there were 48-128-192-192-128-2048-2048 kernels/neurons, and in ZFNet all of these were doubled to 96-256-384-384-256-4096-4096. This modification allowed the network to increase the complexity of its internal representations and, as a result, decrease the error rate from 15.4% for the previous year's winner to 14.8%, making it the winner in 2013.
Q4. What is NAS (Neural Architecture Search)?
Answer:
Developing neural network models often requires significant architecture engineering. We can sometimes get by with transfer learning, but if we want the best possible performance, it's usually best to design our own network. This requires specialized skills and is challenging in general; we may not even know the limits of the current state-of-the-art (SOTA) techniques. It's a lot of trial and error, and the experimentation itself is time-consuming and expensive.
This is where NAS (Neural Architecture Search) comes in. NAS is an algorithm that searches for the best neural network architecture. Most of the algorithms work in the following way. Start off by defining a set of "building blocks" that can be used for the network. For example, the state-of-the-art (SOTA) NASNet paper proposes commonly used blocks for an image recognition network.
In the NAS algorithm, a controller Recurrent Neural Network (RNN) samples the building blocks, putting them together to create some end-to-end architecture. The architecture generally combines the same style as state-of-the-art (SOTA) networks, such as DenseNets or ResNets, but uses a much different combination and configuration of blocks.
This new network architecture is then trained to convergence to obtain an accuracy on the held-out validation set. The resulting accuracies are used to update the controller so that the controller will generate better architectures over time, perhaps by selecting better blocks or making better connections. The controller weights are updated with a policy gradient. The whole end-to-end setup is shown below.
It's a reasonably intuitive approach! In simple terms: have an algorithm grab different blocks and put those blocks together to make a network. Train and test that network. Based on the results, adjust the blocks used to make the network and how they are put together.
SENets stands for Squeeze-and-Excitation Networks; they introduce a building block for CNNs that improves channel interdependencies at almost no computational cost. They were used in the 2017 ImageNet competition and helped to improve the result from the previous year by 25%. Besides this large performance boost, they can easily be added to existing architectures. The idea is this:
Let's add parameters to each channel of a convolutional block so that the network can adaptively adjust the weighting of each feature map.
As simple as it may sound, this is it. So, let's take a closer look at why this works so well.
Why does it work so well?
CNNs use their convolutional filters to extract hierarchical information from images. Lower layers find small pieces of context like high frequencies or edges, while upper layers can detect faces, text, or other complex geometrical shapes. They extract whatever is necessary to solve a task precisely.
All of this works by fusing the spatial and channel information of an image. The different filters will first find spatial features in each input channel before adding the information across all available output channels.
All we need to understand for now is that the network weights each of its channels equally when creating the output feature maps. SENets are all about changing this by adding a content-aware mechanism to weight each channel adaptively. In its most basic form, this could mean adding a single parameter to each channel and giving it a linear scalar for how relevant each one is.
However, the authors push it a little further. First, they get a global understanding of each channel by squeezing the feature maps to a single numeric value. This results in a vector of size n, where n is equal to the number of convolutional channels. Afterward, it is fed through a two-layer neural network, which outputs a vector of the same size. These n values can now be used as weights on the original feature maps, scaling each channel based on its importance.
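A minimal PyTorch sketch of the squeeze-and-excitation block just described (global pooling, a two-layer bottleneck, then per-channel re-weighting); the channel count and reduction ratio are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: global pooling -> 2-layer MLP -> channel re-weighting."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: one number per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                              # excitation: per-channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                   # scale each feature map by its importance

x = torch.randn(8, 64, 32, 32)        # a batch of feature maps from some convolutional block
print(SEBlock(64)(x).shape)           # torch.Size([8, 64, 32, 32])
```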
The Bottom-Up Pathway
The bottom-up pathway is the feedforward computation of the backbone ConvNet; there is one pyramid level for each stage. The output of the last layer of each stage will be used as the reference set of feature maps for enriching the top-down pathway via lateral connections.
Top-Down Pathway and Lateral Connection
• The higher-resolution features are upsampled from spatially coarser, but semantically stronger, feature maps from higher pyramid levels. More particularly, the spatial resolution is upsampled by a factor of 2 using nearest neighbor for simplicity.
• Each lateral connection adds feature maps of the same spatial size from the bottom-up pathway and the top-down pathway.
• Specifically, the feature maps from the bottom-up pathway undergo 1x1 convolutions to reduce channel dimensions.
• The feature maps from the bottom-up pathway and the top-down pathway are then merged by element-wise addition.
Prediction in FPN
Finally, a 3x3 convolution is appended to each merged map to generate a final feature map, which reduces the aliasing effect of upsampling. This last set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5}, which are respectively of the same spatial sizes.
Because all levels of the pyramid use shared classifiers/regressors, as in a traditional featurized image pyramid, the feature dimension at the output is fixed with d = 256. Thus, all extra convolutional layers have 256-channel outputs.
Answer:
A new def-pooling (deformation constrained pooling) layer is used to model the deformation of object parts with geometric constraints and penalties. That means, besides detecting the whole object directly, it is also important to identify object parts, which can then assist in detecting the whole object.
The steps in black are the old stuff that existed in R-CNN. The stages in red do not appear in R-CNN.
1. Selective Search
2. Box Rejection
R-CNN is used to reject bounding boxes that are most likely to be the background.
3. Pretrain Using Object-Level Annotations
4. Def-Pooling Layer
For the def-pooling path, the output from conv5 goes through a conv layer, then through the def-pooling layer, and then through a max-pooling layer.
In simple terms, the summation of ac multiplied by dc,n is the 5x5 deformation penalty in the figure above, i.e., the penalty for placing the object part away from its assumed central position.
By training the DeepID-Net, object parts of the object to be detected will give a high activation value after the def-pooling layer if they are close to their anchor places. This output is then connected to the 200-class scores for improvement.
5. Context Modeling
In the object detection task in ILSVRC, there are 200 classes. There is also a classification competition task in ILSVRC for classifying and localizing 1000-class objects, whose contents are diverse compared with the object detection task. Hence, the 1000-class scores, obtained by the classification network, are used to refine the 200-class scores.
6. Model Averaging
Multiple models are used to increase the accuracy, and the results from all models are averaged.
7. Bounding Box Regression
Bounding box regression is used to fine-tune the bounding box location, as in R-CNN.
In the above picture: a simple fractal expansion (on the left), recursive stacking of the fractal expansion as one block (in the middle), and 5 blocks cascaded as a FractalNet (on the right).
For the base case, f1(z) is a convolutional layer:
f1(z) = conv(z)
where C is the number of columns, as in the middle of the above figure. The deepest path within the block has 2^(C-1) convolutional layers. In this case C = 4, so the number of convolutional layers on the deepest path is 2^3 = 8.
For the join layer (green), the element-wise mean is computed; it is not concatenation or addition.
With five blocks (B = 5) cascaded as FractalNet, as at the right of the figure, the number of convolutional layers along the deepest path within the whole network is B x 2^(C-1), i.e., 5 x 2^3 = 40 layers.
Between two blocks, 2x2 max pooling is done to reduce the size of the feature maps. Batch Norm and ReLU are used after each convolution.
# Day22
Q1. Explain V-Net (Volumetric Convolution) Architecture with related to
Biomedical Image Segmentation?
Answer:
Much of the medical data used in clinical practice consists of 3D volumes, such as MRI volumes depicting the prostate, while most approaches are only able to process 2D images. A 3D image segmentation based on a volumetric, fully convolutional neural network is proposed in this work.
V-Net, as justified by its name, is shaped like a V. The left part of the network consists of a compression path, while the right part decompresses the signal until its original size is reached.
This is the same as U-Net, but with some differences.
On the Left Part
• The left side of the network is divided into different stages that operate at various resolutions. Each stage comprises one to three convolutional layers.
• At each stage, a residual function is learned. The input of each stage is used in the convolutional layers, processed through the non-linearities, and added to the output of the last convolutional layer of that stage in order to learn a residual function. This V-Net architecture ensures convergence compared with non-residual learning networks such as U-Net.
• The convolutions performed in each stage use volumetric kernels with a size of 5x5x5 voxels. (A voxel represents a value on a regular grid in 3D space; the term voxel is commonly used in 3D, just like voxelization in a point cloud.)
• Along the compression path, the resolution is reduced by convolution with 2x2x2 voxel kernels applied with stride 2. Thus, the size of the resulting feature maps is halved, with a purpose similar to pooling layers. The number of feature channels doubles at each stage of the compression path of V-Net.
• Replacing pooling operations with convolutional ones helps to have a smaller memory footprint during training, because no switches mapping the output of pooling layers back to their inputs are needed for back-propagation.
• Downsampling helps to increase the receptive field.
• PReLU is used as the non-linearity activation function.
On the Right Part
• At each stage, a deconvolution operation is employed to increase the size of the inputs, followed by one to three convolutional layers, involving half the number of 5x5x5 kernels applied in the previous layer.
• A residual function is learned, similar to the left part of the network.
• The two feature maps computed by the very last convolutional layer, which has a 1x1x1 kernel size, produce outputs of the same size as the input volume.
• These two output feature maps are the probabilistic segmentations of the foreground and background regions, obtained by applying soft-max voxelwise.
Q2. What are Highway Networks?
Answer:
It has been found that optimizing a very deep neural network is difficult. However, it is still an open problem why it is difficult to optimize a deep network (it is largely due to the gradient vanishing problem). Inspired by LSTM (Long Short-Term Memory), the authors make use of gating functions to adaptively bypass or transform the signal so that the network can go deeper. A deep network with more than 1000 layers can then also be optimized.
Plain Network
Before going into Highway Networks, let us start with a plain network which consists of L layers, where the l-th layer (omitting the layer symbol) is:
y = H(x, W_H)
where x is the input, W_H is the weight, H is the transform function followed by an activation function, and y is the output. And for the i-th unit:
y_i = H_i(x)
We compute y_i and pass it to the next layer.
Highway Network
In a highway network, two non-linear transforms T and C are introduced:
y = H(x, W_H) · T(x, W_T) + x · C(x, W_C)
where T is the transform gate and C is the carry gate (usually set to C = 1 - T), and y is connected to the next layer. Formally, T(x) is a sigmoid function:
T(x) = σ(W_T^T x + b_T)
The sigmoid function caps the output between 0 and 1. When the input has a very small value, it becomes 0; when the input has a very large value, it becomes 1. Therefore, by learning W_T and b_T, the network can adaptively pass H(x) or pass x to the next layer.
The authors claim that this allows a simple initialization scheme for W_T which is independent of the nature of H.
b_T can be initialized with a negative value (e.g., -1, -3, etc.) such that the network is initially biased towards the carrying behaviour.
The above idea is inspired by LSTM, as the authors mention.
However, the exact results have not been provided. And SGD (Stochastic Gradient Descent) did not stall for networks with more than 1000 layers.
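A minimal PyTorch sketch of a single highway layer following the formulas above (the ReLU choice for H and the gate-bias value are illustrative assumptions):

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """y = H(x) * T(x) + x * (1 - T(x)), with carry gate C = 1 - T."""
    def __init__(self, dim: int):
        super().__init__()
        self.H = nn.Linear(dim, dim)
        self.T = nn.Linear(dim, dim)
        nn.init.constant_(self.T.bias, -1.0)   # bias the gate towards carrying x, as suggested

    def forward(self, x):
        h = torch.relu(self.H(x))              # transform path
        t = torch.sigmoid(self.T(x))           # transform gate in (0, 1)
        return h * t + x * (1.0 - t)           # gated mix of transformed and carried signal

x = torch.randn(4, 128)
layer = HighwayLayer(128)
print(layer(x).shape)                          # torch.Size([4, 128])
```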
Q3. What is DetNAS: Neural Architecture Search (NAS) on Object Detection?
Answer:
Object detection is one of the most fundamental computer vision tasks and has been widely used in real-world applications. The performance of object detectors relies heavily on the features extracted by their backbones. However, most works on object detection directly use networks designed for classification as backbone feature extractors, e.g., ResNet. Architectures optimized on image classification cannot guarantee performance on object detection. It is known that there is an essential gap between these two different tasks: image classification basically focuses on "what" the main object of the image is, while object detection aims at finding "where" and "what" each object
Page 6 | 16
instance in an image. There have beenlittle works focusing onbackbone design for object detector,
except thehand-craft network, DetNet.
Neural architecture search (NAS) has achieved significant progress in image classification and semantic segmentation. The networks produced by search have reached or even surpassed the performance of hand-crafted ones on these tasks. But object detection had never been supported by NAS before. Some NAS works directly apply architectures searched on CIFAR-10 classification to object detection.
In this work, we present the first effort towards learning a backbone network for object detection tasks. Unlike previous NAS works, our method does not involve any architecture-level transfer. We propose DetNAS to conduct neural architecture search directly on the target tasks; the searches are even performed with precisely the same settings as the target task. Training an object detector usually needs several days and many GPUs, whether using a pre-train-and-finetune scheme or training from scratch. Thus, it is not affordable to directly use reinforcement learning (RL) or an evolutionary algorithm (EA) to search architectures independently. To overcome this obstacle, we formulate the problem as searching for the optimal path in a large graph, or supernet. In simple terms, DetNAS consists of three steps: (1) training a supernet that includes all sub-networks in the search space; (2) searching for sub-networks (paths) on the trained supernet on the target detection task; and (3) retraining the best-found architecture.
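As a toy illustration of the "path in a supernet" idea (names and the random sampling are purely illustrative; the actual search procedure in the paper is more involved): each layer offers a few candidate blocks, and one concrete sub-network is a choice of one block per layer.

import random

# A hypothetical search space: 20 layers, each with 4 candidate blocks.
search_space = [["conv3x3", "conv5x5", "shufflev2_3x3", "shufflev2_5x5"] for _ in range(20)]

def sample_path(space):
    # One path through the supernet = one sub-network architecture.
    return [random.choice(layer_choices) for layer_choices in space]

print(sample_path(search_space))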
Answer:
Emotion cause extraction (ECE) aims at extracting the potential causes that lead to emotion expressions in text. The ECE task was first proposed and defined as a word-level sequence labeling problem in Lee et al. To address the shortcoming of extracting causes at the word level, Gui et al. (2016) released a new corpus which has received much attention in the following studies and has become a benchmark dataset for ECE research.
The below Fig. displays an example from this corpus; there are five clauses in a document. The emotion "happy" is contained in the fourth clause. We denote this clause as the emotion clause, which refers to a clause that contains emotions. It has two corresponding causes: "a policeman visited the old man with the lost money" in the second clause, and "told him that the thief was caught" in the third clause. We name them cause clauses, i.e., clauses that contain the causes.
In this work, we propose a new task: emotion-cause pair extraction (ECPE), which aims to extract all potential pairs of emotions and corresponding causes in a document. In the above Fig., we show the difference between the traditional ECE task and our new ECPE task. The goal of ECE is to extract the corresponding cause clause of a given emotion; in addition to a document as the input, ECE needs to be provided with the annotated emotion before cause extraction.
In contrast, the output of our ECPE task is a set of emotion-cause pairs, without the need of providing the emotion annotation in advance. From the above fig., e.g., given the annotation of the emotion "happy," the goal of ECE is to track the two corresponding cause clauses: "a policeman visited the old man with the lost money" and "and told him that the thief was caught." In the ECPE task, the goal is instead to directly extract all pairs of emotion clause and cause clause, including ("The old man was delighted", "a policeman visited the old man with the lost money") and ("The old man was pleased", "and told him that the thief was caught"), without providing the emotion annotation "happy".
To address this new ECPE task, we propose a two-step framework. Step 1 converts the emotion-cause pair extraction task into two individual sub-tasks (emotion extraction and cause extraction, respectively) via two kinds of multi-task learning networks, intending to extract a set of emotion clauses and a set of cause clauses. Step 2 performs emotion-cause pairing and filtering: we combine all the elements of the two sets into pairs and finally train a filter to eliminate the pairs that do not contain a causal relationship.
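A minimal sketch of Step 2 (pairing and filtering) as described above, assuming the emotion and cause clause sets from Step 1 and a hypothetical filter_model with a predict_is_causal(emotion, cause) method; this is illustrative, not the paper's code.

from itertools import product

def extract_emotion_cause_pairs(emotion_clauses, cause_clauses, filter_model):
    # Combine all elements of the two sets into candidate pairs (Cartesian product).
    candidates = list(product(emotion_clauses, cause_clauses))
    # Keep only the pairs the trained filter judges to have a causal relationship.
    return [(e, c) for e, c in candidates if filter_model.predict_is_causal(e, c)]

class DummyFilter:
    def predict_is_causal(self, e, c):
        return True  # placeholder: a real filter would be a trained classifier

pairs = extract_emotion_cause_pairs(
    ["The old man was delighted"],
    ["a policeman visited the old man with the lost money"],
    DummyFilter(),
)
print(pairs)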
Dialogue state tracking (DST) is a core component in task-oriented dialogue systems, such as restaurant reservation or ticket booking. The goal of DST is to extract the user goals expressed during the conversation and to encode them as a compact set of dialogue states, i.e., a set of slots and their corresponding values. E.g., as shown in the below fig., (slot, value) pairs such as (price, cheap) and (area, centre) are extracted from the conversation. Accurate DST performance is important for appropriate dialogue management, where the user intention determines the next system action and the content to query from the databases.
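As a tiny illustration (not tied to any particular DST model), the dialogue state can be kept as a dictionary of slot-value pairs that is updated after every user turn:

# Dialogue state as slot -> value pairs, updated turn by turn.
state = {}

def update_state(state, turn_slot_values):
    state.update(turn_slot_values)   # newest user-expressed values overwrite older ones
    return state

update_state(state, {"price": "cheap"})    # "I'd like something cheap"
update_state(state, {"area": "centre"})    # "... in the centre of town"
print(state)                               # {'price': 'cheap', 'area': 'centre'}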
State tracking approaches are based on the assumption that an ontology is defined in advance, where all slots and their values are known. Having a predefined ontology can simplify DST into a classification problem and improve performance (Henderson et al., 2014b; Mrkšić et al., 2017; Zhong et al., 2018). However, there are two significant drawbacks to this approach: 1) A full ontology is hard to obtain in advance (Xu and Hu, 2018). In industry, databases are usually exposed through an external API only, which is owned and maintained by others, and it is not feasible to gain access to enumerate all the possible values for each slot. 2) Even if a full ontology exists, the number of possible slot values can be large and variable. For example, a restaurant name or a train departure time can take a large number of possible values. Therefore, many of the previous works that are based on neural classification models may not be applicable in real scenarios.
The key benefit of the approach is that a single system can be trained directly on the source and target text, no longer requiring the pipeline of specialized methods used in statistical machine learning.
Unlike the traditional phrase-based translation system, which consists of many sub-components that are tuned separately, neural machine translation attempts to build and train a single, large neural network that reads a sentence and outputs a correct translation.
As such, neural machine translation (NMT) systems are said to be end-to-end systems, as only one model is required for the translation.
In the Encoder
The task of the encoder is to provide a representation of the input sentence. The input sentence is a sequence of words, for which we first consult the embedding matrix. Then, as in the primary language model described previously, we process these words with a recurrent neural network (RNN). This results in hidden states that encode each word with its left context, i.e., all the preceding words. To also get the right context, we build a second recurrent neural network (RNN) that runs right-to-left, i.e., from the end of the sentence to the beginning. Having two recurrent neural networks (RNNs) running in the two directions is known as a bidirectional recurrent neural network (RNN).
In the Decoder
The decoder is a recurrent neural network (RNN). It takes some representation of the input context (more on that in the next section on the attention mechanism), the previous hidden state, and the previous output word prediction, and generates a new hidden decoder state and a new output word prediction.
If you use LSTMs for the encoder, then you also use LSTMs for the decoder. From the hidden state, you now predict the output word. This prediction takes the form of a probability distribution over the entire output vocabulary: if you have a vocabulary of, say, 50,000 words, then the prediction is a 50,000-dimensional vector, each element corresponding to the probability predicted for one word in the vocabulary.
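Below is a minimal PyTorch sketch of these two pieces: a bidirectional RNN encoder and an RNN decoder whose hidden state is projected to a probability distribution over the output vocabulary. The sizes (vocab_size, emb_dim, hidden) are illustrative, the context is taken crudely from the last encoder state, and no attention mechanism is included.

import torch
import torch.nn as nn

vocab_size, emb_dim, hidden = 50_000, 256, 512

embed = nn.Embedding(vocab_size, emb_dim)
encoder = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)  # left + right context
decoder = nn.GRUCell(emb_dim, 2 * hidden)        # consumes the previous output word embedding
to_vocab = nn.Linear(2 * hidden, vocab_size)     # hidden state -> score for every vocabulary word

src = torch.randint(0, vocab_size, (1, 7))       # a 7-word source sentence (word ids)
enc_states, _ = encoder(embed(src))              # (1, 7, 2*hidden): each word with both contexts

dec_h = enc_states[:, -1, :]                     # crude input context: last encoder state
prev_word = torch.zeros(1, dtype=torch.long)     # e.g., a start-of-sentence token id
dec_h = decoder(embed(prev_word), dec_h)         # new decoder hidden state
probs = torch.softmax(to_vocab(dec_h), dim=-1)   # probability for every word in the vocabulary
print(probs.shape)                               # torch.Size([1, 50000])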
This architecture consists of two sets of convolutional and average pooling layers, followed by a flattening convolutional layer, then two fully connected layers and finally a softmax classifier.
In the First Layer:
The input for LeNet-5 is a 32×32 grayscale image which passes through the first convolutional layer with 6 feature maps or filters having size 5×5 and a stride of one. The image dimensions change from 32x32x1 to 28x28x6.
In the Second Layer:
Then it applies an average pooling layer or sub-sampling layer with a filter size of 2×2 and a stride of two. The resulting image dimension is reduced to 14x14x6.
Third Layer:
Next, there is the second convolutional layer with 16 feature maps having size 5×5 and a stride of 1. In this layer, only one of the sixteen feature maps is connected to all six feature maps of the previous layer; the others are connected to subsets of three or four maps, as shown below.
The main reason is to break symmetry in the network and to keep the number of connections within reasonable bounds. That is why the number of training parameters in this layer is 1,516 instead of 2,400 and, similarly, the number of connections is 151,600 instead of 240,000.
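A quick check of the counts quoted above, using the C3 connection table from the original LeNet-5 paper (6 maps see 3 of the S2 maps, 9 maps see 4, and 1 map sees all 6):

kernel = 5 * 5
params = 6 * (3 * kernel + 1) + 9 * (4 * kernel + 1) + 1 * (6 * kernel + 1)
connections = params * 10 * 10   # each parameter is reused at every 10x10 output position
print(params, connections)       # 1516 151600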
Fourth Layer:
The fourth layer (S4) is an average pooling layer with filter size 2×2 and a stride of 2. This layer is the same as the second layer (S2) except that it has 16 feature maps, so the output is reduced to 5x5x16.
Fifth Layer:
The fifth layer (C5) is a fully connected convolutional layer with 120 feature maps, each of size 1×1. Each of the 120 units in C5 is connected to all the 400 nodes (5x5x16) in the fourth layer S4.
Sixth Layer:
The sixth layer is also a fully connected layer (F6) with 84 units.
Output Layer:
Finally, there is a fully connected softmax output layer ŷ with 10 possible values corresponding to the digits from 0 to 9.
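Below is a minimal PyTorch sketch of the LeNet-5 layout described above (32×32 grayscale input, 5×5 convolutions, 2×2 average pooling, C5/F6 fully connected layers, 10-way output). It uses modern dense C3 connections and tanh activations rather than the original sparse connection table, so it is an approximation of the architecture, not a faithful reproduction.

import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),   # C1: 32x32x1 -> 28x28x6
    nn.Tanh(),
    nn.AvgPool2d(2, stride=2),        # S2: -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5),  # C3: -> 10x10x16
    nn.Tanh(),
    nn.AvgPool2d(2, stride=2),        # S4: -> 5x5x16
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),       # C5: 120 units over all 400 S4 nodes
    nn.Tanh(),
    nn.Linear(120, 84),               # F6: 84 units
    nn.Tanh(),
    nn.Linear(84, 10),                # output layer: 10 digit classes (softmax applied at inference)
)

x = torch.randn(1, 1, 32, 32)
print(lenet5(x).shape)  # torch.Size([1, 10])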
DATA SCIENCE
INTERVIEW
PREPARATION
(30 Days of Interview
Preparation)
# DAY 23
Q1. Explain Overfeat in Object Detection.
Answer:
Overfeat: It is a typical model integrating object detection, localization, and classification tasks into one convolutional neural network (CNN). The main idea is, first, to do image classification at different locations on regions of multiple scales of the image in a sliding-window fashion, and second, to predict bounding box locations with a regressor trained on top of the same convolution layers.
This model architecture is very similar to AlexNet. The model is trained as follows:
1. Train a CNN model (identical to AlexNet) on image classification tasks.
2. Then, replace the top classifier layers with a regression network and train it to predict object bounding boxes at each spatial location and scale. The regressor is class-specific, one generated for each class of image.
• Input: Images with classification and bounding box labels.
• Output: (x_left, x_right, y_top, y_bottom), 4 values in total, representing the coordinates of the bounding box edges.
• Loss: The regressor is trained to minimize the l2 norm between the generated bounding box and the ground truth for each training example.
At detection time:
1. Perform classification at each location using the pretrained CNN model.
2. Predict object bounding boxes on all classified regions generated by the classifier.
3. Merge bounding boxes with sufficient overlap from localization and sufficient confidence of being the same object from the classifier.
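A small illustrative sketch (not the OverFeat release) of two pieces mentioned above: the l2 regression loss between a predicted and a ground-truth box, and an IoU test that could be used to decide whether two detected boxes overlap enough to be merged. The box convention follows the (x_left, x_right, y_top, y_bottom) format from the answer.

import torch

def box_l2_loss(pred, target):
    # pred / target: tensors of (x_left, x_right, y_top, y_bottom)
    return torch.sum((pred - target) ** 2)

def iou(a, b):
    # Intersection-over-union of two boxes in (x_left, x_right, y_top, y_bottom) format.
    ix = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[2], b[2]))
    inter = ix * iy
    area = lambda r: (r[1] - r[0]) * (r[3] - r[2])
    return inter / (area(a) + area(b) - inter)

print(box_l2_loss(torch.tensor([0., 10., 0., 10.]), torch.tensor([1., 11., 0., 10.])))  # tensor(2.)
print(iou((0, 10, 0, 10), (5, 15, 0, 10)))  # 0.333..., a candidate pair for merging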
Q2. What is Multipath: Multiple Probabilistic Anchor Trajectory
Hypotheses for Behavior Prediction?
Answer:
In this paper, we focus on the problem of predicting future agent states, which is a crucial task for robot planning in real-world environments. We are specifically interested in addressing this problem for self-driving vehicles, an application with a potentially enormous societal impact. Mainly, predicting the future of other agents in this domain is vital for safe, comfortable, and efficient operation. E.g., it is important to know whether to yield to a vehicle if it is going to cut in front of our robot, or when would be the best time to merge into traffic. Such future prediction requires an understanding of the static and dynamic world context: road semantics (like lane connectivity, stop lines), traffic light information, and past observations of other agents, as in the below Fig.
A fundamental aspect of future state prediction is that it is inherently stochastic, as agents can't know each other's motivations. When we are driving, we can never really be sure what other drivers will do next, and it is essential to consider multiple outcomes and their likelihood.
We seek a model of the future that can provide both (i) a weighted, parsimonious set of discrete trajectories that covers the space of likely outcomes and (ii) a closed-form evaluation of the likelihood of any trajectory. These two attributes enable efficient reasoning in relevant planning use-cases, e.g., human-like reactions to discrete trajectory hypotheses (e.g., yielding, following), and probabilistic queries such as the expected risk of collision in a space-time region.
This model addresses these issues with a critical insight: it employs a fixed set of trajectory anchors as the basis of the modeling. This lets us factor stochastic uncertainty hierarchically: First, intent uncertainty captures the uncertainty of what an agent intends to do and is encoded as a distribution over the set of anchor trajectories. Second, given an intent, control uncertainty represents our uncertainty over how they might achieve it. We assume control uncertainty is normally distributed at each future time step [Thrun05], parameterized such that the mean corresponds to a context-specific offset from the anchor state, with the associated covariance capturing the unimodal aleatoric uncertainty [Kendall17]. The Fig. illustrates a typical scenario where there are three likely intents given the scene context, with control mean offset refinements respecting road geometry, and control uncertainty intuitively growing over time.
Our trajectory anchors are modes found in our training data in state-sequence space via unsupervised learning. These anchors provide templates for coarse-granularity futures for an agent and might correspond to semantic concepts like "change lanes" or "slow down" (although, to be clear, we don't use any semantic concepts in our modeling).
Our complete model predicts a Gaussian mixture model (GMM) at each time step, with the mixture weights (intent distribution) fixed over time. Given such a parametric distribution model, we can directly evaluate the likelihood of any future trajectory and have a simple way to obtain a compact, diverse, weighted set of trajectory samples: the MAP sample from each anchor-intent.
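An illustrative sketch (under simplifying assumptions, not the paper's code) of evaluating the log-likelihood of a future trajectory under such a per-timestep Gaussian mixture, with the mixture weights (the intent distribution over anchors) fixed over time. All array names and sizes are hypothetical.

import numpy as np
from scipy.stats import multivariate_normal

def trajectory_log_likelihood(traj, anchor_means, anchor_covs, intent_probs):
    # traj:          (T, 2) trajectory to score
    # anchor_means:  (K, T, 2) per-anchor predicted means (anchor plus learned offsets)
    # anchor_covs:   (K, T, 2, 2) per-anchor, per-timestep covariances (control uncertainty)
    # intent_probs:  (K,) probability of each anchor intent
    K, T, _ = anchor_means.shape
    per_anchor = np.zeros(K)
    for k in range(K):
        per_anchor[k] = sum(
            multivariate_normal.logpdf(traj[t], anchor_means[k, t], anchor_covs[k, t])
            for t in range(T)
        )
    # Log-sum-exp over anchors, weighted by the intent distribution.
    return np.logaddexp.reduce(np.log(intent_probs) + per_anchor)

K, T = 3, 5
ll = trajectory_log_likelihood(
    np.zeros((T, 2)),
    np.random.randn(K, T, 2) * 0.1,
    np.tile(np.eye(2), (K, T, 1, 1)),
    np.array([0.5, 0.3, 0.2]),
)
print(ll)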
Multi-Region CNN (MR-CNN): Object representation using multiple regions to capture several different aspects of one object.
Network Architecture of MR-CNN
• For each bounding box candidate B, a set of regions {Ri}, with i = 1 to k, is generated; that is why it is known as multi-region. More details about the choices of the multiple regions are described in the next sub-section.
• There are close connections between segmentation and detection, and segmentation-related cues are empirically known to often help object detection.
• Two modules are added: an activation maps module for semantic segmentation-aware features, and a region adaptation module for those segmentation-aware features.
• There is no additional annotation used for training here.
• An FCN is used for the activation maps module.
• The number of channels of the last FC7 layer is changed from 4096 to 512.
• A weakly supervised training strategy is used: an artificial foreground class-specific segmentation mask is created using the bounding box annotations.
• More particularly, the ground-truth bounding boxes of an image are projected onto the spatial domain of the last hidden layer of the FCN, and the "pixels" that lie inside the projected boxes are labelled as foreground while the rest are labelled as background.
• In proposal generation, there is still a large proportion of background regions. The existence of many background samples causes many false positives.
In CRAFT (Cascade Region-proposal-network), as shown above, another CNN (convolutional neural network) is added after the RPN to generate fewer proposals (i.e., 300 here). Then, classification is performed on the 300 proposals, outputting about 20 primitive detection results. For each primitive result, refined object detection is performed using one-vs-rest classification.
Cascade Proposal Generation
Baseline RPN
• An ideal proposal generator should generate as few proposals as possible while covering almost all object instances. Due to the resolution loss caused by CNN pooling operations and the fixed aspect ratio of the sliding window, RPN is weak at covering objects with extreme shapes or scales.
Recall rates (in %): the overall rate is 94.87%; values lower than 94.87% are shown in bold in the text.
• The above results are for the baseline RPN based on VGG_M, trained using PASCAL VOC 2007 train+val and tested on the test set.
• The recall rate on each object category varies a lot. Objects with extreme aspect ratios and scales are hard to detect, such as boat and bottle.
Proposed Cascade Structure
The concatenated classification network after the RPN is denoted as FRCNN net here.
By just looking at the image once, the detection speed is real-time (45 fps). Fast YOLOv1 achieves 155 fps.
The input image is divided into an S×S grid (S=7). If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
Each grid cell predicts B bounding boxes (B=2) and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object, i.e., P(Object). Each bounding box consists of five predictions: x, y, w, h, and confidence.
• The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell.
• The height h and width w are predicted relative to the whole image.
The model consists of 24 convolutional layers, followed by two fully connected layers. Alternating 1×1 convolutional layers reduce the feature space from the preceding layers. (1×1 convolutions had been used in GoogLeNet to reduce the number of parameters.)
Fast YOLO uses fewer convolutional layers (9 instead of 24) and fewer filters in those layers. The network pipeline is summarized below.
Therefore, we can see that the input image goes through the network only once, and then objects can be detected. And we can have end-to-end learning.
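A small sketch (with illustrative sizes, not the original Darknet code) of how the YOLOv1 output tensor is laid out: an S×S grid where each cell predicts B boxes of (x, y, w, h, confidence) plus C conditional class probabilities (C = 20 for the PASCAL VOC setting).

import numpy as np

S, B, C = 7, 2, 20
pred = np.random.rand(S, S, B * 5 + C)   # network output reshaped onto the grid

cell = pred[3, 4]                         # the grid cell at row 3, column 4
boxes = cell[:B * 5].reshape(B, 5)        # B boxes: x, y (relative to cell), w, h (relative to image), confidence
class_probs = cell[B * 5:]                # C conditional class probabilities for this cell

# Class-specific confidence for each box: box confidence times the conditional class probabilities.
scores = boxes[:, 4:5] * class_probs      # shape (B, C)
print(scores.shape)                       # (2, 20)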
Q7. Adversarial Examples Improve Image Recognition
Answer:
Above Fig.: AdvProp improves image recognition. By training the model on ImageNet, AdvProp helps EfficientNet-B7 to achieve 85.2% accuracy on ImageNet, 52.9% mCE (mean corruption error, lower is better) on ImageNet-C, 44.7% accuracy on ImageNet-A and 26.6% accuracy on Stylized-ImageNet, beating its vanilla counterpart by 0.7%, 6.5%, 7.0% and 4.8%, respectively. These sample images are randomly selected from the category "goldfinch."
In this paper, rather than focusing on defending against adversarial examples, we shift our attention to leveraging adversarial examples to improve accuracy. Previous works show that training with adversarial examples can enhance model generalization but are restricted to certain situations: the improvement is only observed either on small datasets (e.g., MNIST) in the fully-supervised setting [5], or on larger datasets but in the semi-supervised setting [21, 22]. Meanwhile, recent works [15, 13, 31] also suggest that training with adversarial examples on large datasets, e.g., ImageNet [23], with supervised learning results in performance degradation on clean images. To summarize, it remains an open question how adversarial examples can be used effectively to help vision models.
We observe that all previous methods jointly train over clean images and adversarial examples without distinction, even though they should be drawn from different underlying distributions. We hypothesize that this distribution mismatch between clean examples and adversarial examples is a key factor that causes the performance degradation in previous works.
When reading, humans process language "automatically" without reflecting on each step: humans string words together into sentences, understand the meaning of spoken and written ideas, and process language without overthinking how the underlying cognitive process happens. This process generates cognitive signals that could potentially facilitate natural language processing tasks.
In recent years, collecting these signals has become increasingly accessible and less expensive (Papoutsaki et al., 2016); as a result, using cognitive features to improve NLP tasks has become more popular. For example, researchers have proposed a range of work that uses eye-tracking or gaze signals to improve part-of-speech tagging (Barrett et al., 2016), sentiment analysis (Mishra et al., 2017), and named entity recognition (Hollenstein and Zhang, 2019), among other tasks. Moreover, these signals have been used successfully to regularize attention in neural networks for NLP (Barrett et al., 2018).
However, most previous work leverages only eye-tracking data, presumably because it is the most accessible form of cognitive language processing signal. Also, most state-of-the-art (SOTA) work focused on improving a single task with a single type of cognitive signal. But can cognitive processing signals bring consistent improvements across modalities (e.g., eye-tracking and EEG) and across various NLP tasks?