
Data Science and Machine Learning - Interview Questions

Differentiate Between Data Analytics and Data Science

Data Analytics:

● Uses data to draw meaningful insights and solve problems
● Its tools include data mining, data modelling, database management, and data analysis
● Uses the existing information to uncover actionable insights
● Checks data from the given information using specialised systems and software

Data Science:

● Involves asking questions, writing algorithms, coding, and building statistical models
● Machine Learning, Hadoop, Java, Python, software development, etc., are its tools
● Discovers new questions to drive innovation
● Uses scientific methods and algorithms to extract knowledge from unstructured data

Basic and Advanced Data Science Interview Questions

Here's a list of the most popular data science interview questions on the technical concepts you can expect to face, and how to frame your answers.

1. What are the differences between supervised and


unsupervised learning?

Supervised Learning:

● Uses known and labeled data as input
● Has a feedback mechanism
● The most commonly used algorithms are decision trees, logistic regression, and support vector machines

Unsupervised Learning:

● Uses unlabeled data as input
● Has no feedback mechanism
● The most commonly used algorithms are k-means clustering, hierarchical clustering, and the apriori algorithm

2. How is logistic regression done?


Logistic regression measures the relationship between the dependent variable (our label
of what we want to predict) and one or more independent variables (our features) by
estimating probability using its underlying logistic function (sigmoid).

Logistic regression passes a linear combination of the features through the sigmoid function, which squeezes any real value into the range (0, 1) so that it can be read as a probability.

The sigmoid function is:

sigmoid(z) = 1 / (1 + e^(-z))
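As a quick illustration, here is a minimal Python sketch of the sigmoid and of how logistic regression turns a linear combination of features into a probability (the weights and inputs are made up for the example):

import numpy as np

def sigmoid(z):
    # Maps any real number into the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned weights, bias, and a single feature vector
weights = np.array([0.8, -1.2])
bias = 0.3
x = np.array([2.0, 1.5])

probability = sigmoid(np.dot(weights, x) + bias)
print(probability)  # estimated probability of the positive class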

3. Explain the steps in making a decision tree.


1. Take the entire data set as input
2. Calculate entropy of the target variable, as well as the predictor attributes
3. Calculate your information gain of all attributes (we gain information on
sorting different objects from each other)
4. Choose the attribute with the highest information gain as the root node
5. Repeat the same procedure on every branch until the decision node of each
branch is finalized
For example, let's say you want to build a decision tree to decide whether you should
accept or decline a job offer. The decision tree for this case is as shown:

It is clear from the decision tree that an offer is accepted if:

● Salary is greater than $50,000


● The commute is less than an hour
● Incentives are offered
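To make steps 2 and 3 above concrete, here is a small self-contained sketch of how entropy and information gain can be computed for one candidate split (the toy labels and the split are made up for illustration):

import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(parent_labels, child_groups):
    # Entropy of the parent minus the weighted entropy of the children
    total = len(parent_labels)
    weighted = sum(len(group) / total * entropy(group) for group in child_groups)
    return entropy(parent_labels) - weighted

# Toy target variable: accept (1) or decline (0) a job offer
target = [1, 1, 1, 0, 0, 1, 0, 1]
# Hypothetical split on "salary greater than $50,000"
high_salary = [1, 1, 1, 1, 1]
low_salary = [0, 0, 0]
print(information_gain(target, [high_salary, low_salary]))  # the higher, the better the split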

4. How do you build a random forest model?


A random forest is built up of a number of decision trees. If you split the data into
different packages and make a decision tree in each of the different groups of data, the
random forest brings all those trees together.

Steps to build a random forest model:

1. Randomly select 'k' features from a total of 'm' features where k << m
2. Among the 'k' features, calculate the node D using the best split point
3. Split the node into daughter nodes using the best split
4. Repeat steps two and three until leaf nodes are finalized
5. Build forest by repeating steps one to four for 'n' times to create 'n' number of
trees
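These steps are handled internally by library implementations; as a minimal illustration (scikit-learn and a synthetic dataset are assumptions for the example, not part of the original answer):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators is the number of trees ('n'); max_features is the size of the random feature subset ('k')
model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))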

5. How can you avoid overfitting your model?


Overfitting refers to a model that is tuned so closely to a small amount of training data that it ignores the bigger picture. There are three main methods to avoid overfitting:

1. Keep the model simple—take fewer variables into account, thereby removing
some of the noise in the training data
2. Use cross-validation techniques, such as k folds cross-validation
3. Use regularization techniques, such as LASSO, that penalize certain model
parameters if they're likely to cause overfitting
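For the cross-validation and regularization points above, a minimal scikit-learn sketch (the synthetic dataset and the parameter values are illustrative assumptions):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# LASSO penalizes large coefficients; alpha controls the strength of the penalty
model = Lasso(alpha=1.0)

# 5-fold cross-validation gives a more honest estimate than a single train/test split
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())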

6. Differentiate between univariate, bivariate, and


multivariate analysis.
Univariate

Univariate data contains only one variable. The purpose of the univariate analysis is to
describe the data and find patterns that exist within it.

Example: height of students

Height (in cm)


164

167.3

170

174.2

178

180

The patterns can be studied by drawing conclusions using mean, median, mode,
dispersion or range, minimum, maximum, etc.

Bivariate
Bivariate data involves two different variables. The analysis of this type of data deals
with causes and relationships and the analysis is done to determine the relationship
between the two variables.

Example: temperature and ice cream sales in the summer season

Temperature (in Celsius) Sales

20 2,000

25 2,100

26 2,300

28 2,400
30 2,600

36 3,100

Here, it is visible from the table that temperature and sales are directly proportional to each other. The hotter the temperature, the better the sales.

Multivariate

Data that involves three or more variables is categorized as multivariate. It is similar to bivariate data but contains more than one dependent variable.

Example: data for house price prediction

No. of rooms Floors Area (sq ft) Price

2 0 900 $400,000

3 2 1,100 $600,000
3.5 5 1,500 $900,000

4 3 2,100 $1,200,000

The patterns can be studied by drawing conclusions using mean, median, and mode,
dispersion or range, minimum, maximum, etc. You can start describing the data and
using it to guess what the price of the house will be.

7. What are the feature selection methods used to select


the right variables?
There are two main methods for feature selection, i.e., filter and wrapper methods.

Filter Methods

This involves:

● Linear discriminant analysis
● ANOVA
● Chi-Square

The best analogy for selecting features is "bad data in, bad answer out." When we're
limiting or selecting the features, it's all about cleaning up the data coming in.

Wrapper Methods
This involves:

● Forward Selection: We test one feature at a time and keep adding them until
we get a good fit
● Backward Selection: We test all the features and start removing them to see
what works better
● Recursive Feature Elimination: Recursively looks through all the different
features and how they pair together

Wrapper methods are very labor-intensive, and high-end computers are needed if a lot of
data analysis is performed with the wrapper method.

8. In your choice of language, write a program that prints


the numbers ranging from one to 50.
But for multiples of three, print "Fizz" instead of the number, and for the multiples of five,
print "Buzz." For numbers which are multiples of both three and five, print "FizzBuzz"

The solution is sketched below. Note that Python's range() excludes its upper bound, so range(1, 51) covers the numbers one through 50 asked for in the question.
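A minimal Python sketch:

for num in range(1, 51):
    if num % 15 == 0:
        print("FizzBuzz")
    elif num % 3 == 0:
        print("Fizz")
    elif num % 5 == 0:
        print("Buzz")
    else:
        print(num)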

9. You are given a data set consisting of variables with


more than 30 percent missing values. How will you deal
with them?
The following are ways to handle missing data values:

If the data set is large, we can simply remove the rows with missing values. This is the quickest way, and we then use the rest of the data to build the model.

For smaller data sets, we can substitute missing values with the mean or average of the rest of the data using a pandas DataFrame in Python. There are different ways to do so, for example df.fillna(df.mean()).

10. For the given points, how will you calculate the
Euclidean distance in Python?
plot1 = [1,3]

plot2 = [2,5]

The Euclidean distance can be calculated as follows (sqrt comes from Python's math module):

from math import sqrt
euclidean_distance = sqrt((plot1[0] - plot2[0])**2 + (plot1[1] - plot2[1])**2)


11. What are dimensionality reduction and its benefits?


Dimensionality reduction refers to the process of converting a data set with vast
dimensions into data with fewer dimensions (fields) to convey similar information
concisely.

This reduction helps in compressing data and reducing storage space. It also reduces
computation time as fewer dimensions lead to less computing. It removes redundant
features; for example, there's no point in storing a value in two different units (meters
and inches).
12. How will you calculate eigenvalues and eigenvectors
of the following 3x3 matrix?

-2 -4 2

-2 1 2

4 2 5

The characteristic equation is det(A - λI) = 0.

Expanding the determinant along the first row:

(-2 - λ)[(1 - λ)(5 - λ) - 2×2] + 4[(-2)(5 - λ) - 2×4] + 2[(-2)×2 - 4(1 - λ)] = 0

-λ³ + 4λ² + 27λ - 90 = 0, i.e.

λ³ - 4λ² - 27λ + 90 = 0

This is the characteristic polynomial, whose roots are the eigenvalues.

By trial, λ = 3 is a root:

3³ - 4×3² - 27×3 + 90 = 27 - 36 - 81 + 90 = 0

Hence, (λ - 3) is a factor:

λ³ - 4λ² - 27λ + 90 = (λ - 3)(λ² - λ - 30) = (λ - 3)(λ + 5)(λ - 6)

The eigenvalues are 3, -5, and 6.

Calculate the eigenvector for λ = 3. Substituting λ = 3 into (A - λI)v = 0 with v = (X, Y, Z) and setting X = 1, the first two equations are:

-5 - 4Y + 2Z = 0

-2 - 2Y + 2Z = 0

Subtracting the second equation from the first:

-3 - 2Y = 0, so Y = -(3/2)

Substituting back into the second equation:

Z = -(1/2)

So an eigenvector for λ = 3 is (1, -3/2, -1/2), or equivalently (2, -3, -1).

Similarly, we can calculate the eigenvectors for -5 and 6.
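As a quick cross-check (not part of the original worked solution), NumPy can compute the same eigenvalues and eigenvectors directly:

import numpy as np

A = np.array([[-2, -4, 2],
              [-2, 1, 2],
              [4, 2, 5]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)   # approximately 3, -5, 6 (order may vary)
print(eigenvectors)  # columns are the corresponding unit-norm eigenvectors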

13. How should you maintain a deployed model?


The steps to maintain a deployed model are:

Monitor

Constant monitoring of all models is needed to determine their performance accuracy.


When you change something, you want to figure out how your changes are going to
affect things. This needs to be monitored to ensure it's doing what it's supposed to do.

Evaluate

Evaluation metrics of the current model are calculated to determine if a new algorithm
is needed.

Compare

The new models are compared to each other to determine which model performs the
best.

Rebuild

The best-performing model is re-built on the current state of data.

14. What are recommender systems?


A recommender system predicts what a user would rate a specific product based on
their preferences. It can be split into two different areas:

Collaborative Filtering

As an example, Last.fm recommends tracks that other users with similar interests play
often. This is also commonly seen on Amazon after making a purchase; customers may
notice the following message accompanied by product recommendations: "Users who
bought this also bought…"

Content-based Filtering

As an example: Pandora uses the properties of a song to recommend music with similar
properties. Here, we look at content, instead of looking at who else is listening to music.

15. How do you find RMSE and MSE in a linear regression


model?
RMSE and MSE are two of the most common measures of accuracy for a linear
regression model.

MSE is the Mean Squared Error: the average of the squared differences between the predicted and actual values.

RMSE is the Root Mean Square Error: the square root of the MSE, which expresses the error in the same units as the target variable.
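A minimal sketch of both metrics (the actual and predicted values below are made up for illustration):

import numpy as np

y_actual = np.array([3.0, 5.0, 7.5, 10.0])
y_predicted = np.array([2.5, 5.5, 7.0, 9.0])

mse = np.mean((y_actual - y_predicted) ** 2)  # Mean Squared Error
rmse = np.sqrt(mse)                           # Root Mean Squared Error
print(mse, rmse)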

16. How can you select k for k-means?


We use the elbow method to select k for k-means clustering. The idea of the elbow method is to run k-means clustering on the data set for a range of values of k, where 'k' is the number of clusters.

The within-cluster sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid. We plot WSS against k and choose the value of k at the 'elbow' of the curve, beyond which increasing k gives only marginal improvement.
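A minimal sketch of the elbow method with scikit-learn (the synthetic blob data is an assumption for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squares (WSS) for this k
    print(k, model.inertia_)

# Plotting WSS against k, the bend ('elbow') in the curve suggests a good value of k.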

17. What is the significance of p-value?


p-value typically ≤ 0.05

This indicates strong evidence against the null hypothesis; so you reject the null
hypothesis.

p-value typically > 0.05

This indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.

p-value at cutoff 0.05

This is considered to be marginal, meaning it could go either way.

18. How can outlier values be treated?


You can drop an outlier only if it is a garbage value.

Example: height of an adult = abc ft. This cannot be true, as the height cannot be a
string value. In this case, outliers can be removed.
If the outliers have extreme values, they can be removed. For example, if all the data
points are clustered between zero to 10, but one point lies at 100, then we can remove
this point.

If you cannot drop outliers, you can try the following:

● Try a different model. Data detected as outliers by linear models can be fit by
nonlinear models. Therefore, be sure you are choosing the correct model.
● Try normalizing the data. This way, the extreme data points are pulled to a
similar range.
● You can use algorithms that are less affected by outliers; an example would
be random forests.

19. How can time-series data be declared as stationary?


It is stationary when the variance and mean of the series are constant with time.

Here is a visual example:


In the first graph, the variance is constant with time. Here, X is the time factor and Y is
the variable. The value of Y goes through the same points all the time; in other words, it
is stationary.

In the second graph, the waves get bigger, which means it is non-stationary and the
variance is changing with time.

20. How can you calculate accuracy using a confusion


matrix?
Consider a confusion matrix with 262 true positives, 347 true negatives, and 650 total observations; it summarizes the actual values against the predicted values.

The formula for accuracy is:

Accuracy = (True Positive + True Negative) / Total Observations

= (262 + 347) / 650

= 609 / 650
= 0.93

As a result, we get an accuracy of 93 percent.

21. Write the equation and calculate the precision and


recall rate.
Consider the same confusion matrix used in the previous question.

Precision = (True positive) / (True Positive + False Positive)

= 262 / 277

= 0.94

Recall Rate = (True Positive) / (True Positive + False Negative)

= 262 / 288

= 0.90
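A small sketch with the counts implied by the ratios above (TP = 262, TN = 347, FP = 15, FN = 26):

TP, TN, FP, FN = 262, 347, 15, 26

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)

print(accuracy, precision, recall)  # roughly 0.937, 0.946, 0.910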
22. 'People who bought this also bought…'
recommendations seen on Amazon are a result of which
algorithm?
The recommendation engine is accomplished with collaborative filtering. Collaborative
filtering explains the behavior of other users and their purchase history in terms of
ratings, selection, etc.

The engine makes predictions on what might interest a person based on the
preferences of other users. In this algorithm, item features are unknown.

For example, a sales page shows that a certain number of people buy a new phone and
also buy tempered glass at the same time. Next time, when a person buys a phone, he
or she may see a recommendation to buy tempered glass as well.

23. Write a basic SQL query that lists all orders with
customer information.
Usually, we have an Order table and a Customer table that contain the following columns:

Order table: OrderId, CustomerId, OrderNumber, TotalAmount

Customer table: Id, FirstName, LastName, City, Country

The SQL query is:

SELECT OrderNumber, TotalAmount, FirstName, LastName, City, Country
FROM Order
JOIN Customer
ON Order.CustomerId = Customer.Id

24. You are given a dataset on cancer detection. You have


built a classification model and achieved an accuracy of
96 percent. Why shouldn't you be happy with your model
performance? What can you do about it?
Cancer detection results in imbalanced data. In an imbalanced dataset, accuracy should not be used as a measure of performance. It is important to focus on the remaining four percent, which represents the patients who were wrongly diagnosed. Early diagnosis is crucial when it comes to cancer detection, and can greatly improve a patient's prognosis.

Hence, to evaluate model performance, we should use Sensitivity (True Positive Rate), Specificity (True Negative Rate), and the F measure to determine the class-wise performance of the classifier.
25. Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables?
● K-means clustering
● Linear regression
● K-NN (k-nearest neighbor)
● Decision trees

The k-nearest neighbor (K-NN) algorithm can be used because, when a value is missing, it can impute it based on the nearest neighbors computed from all the other features.

With K-means clustering or linear regression, missing values must be handled during pre-processing, otherwise these algorithms fail. Decision trees have the same problem, although there is some variation across implementations.

26. Below are the eight actual values of the target variable
in the train file. What is the entropy of the target variable?
[0, 0, 0, 1, 1, 1, 1, 1]

Choose the correct answer.

1. -(5/8 log(5/8) + 3/8 log(3/8))


2. 5/8 log(5/8) + 3/8 log(3/8)
3. 3/8 log(5/8) + 5/8 log(3/8)
4. 5/8 log(3/8) – 3/8 log(5/8)
The target variable takes the value 1 five times and the value 0 three times.

The formula for calculating the entropy is:

Entropy = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Putting p = 5 and n = 3, we get:

Entropy = A = -(5/8 log(5/8) + 3/8 log(3/8))
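A quick numerical check in Python:

from math import log2

labels = [0, 0, 0, 1, 1, 1, 1, 1]
p = labels.count(1) / len(labels)  # 5/8
q = labels.count(0) / len(labels)  # 3/8

entropy = -(p * log2(p) + q * log2(q))
print(entropy)  # approximately 0.954 bits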

27. We want to predict the probability of death from heart


disease based on three risk factors: age, gender, and blood
cholesterol level. What is the most appropriate algorithm
for this case?
Choose the correct option:

1. Logistic Regression
2. Linear Regression
3. K-means clustering
4. Apriori algorithm

The most appropriate algorithm for this case is A, logistic regression.

28. After studying the behavior of a population, you have


identified four specific individual types that are valuable to
your study. You would like to find all users who are most
similar to each individual type. Which algorithm is most
appropriate for this study?
Choose the correct option:
1. K-means clustering
2. Linear regression
3. Association rules
4. Decision trees

As we are looking to group people by their similarity to four specific individual types, this indicates the value of k (k = 4). Therefore, K-means clustering (answer A) is the most appropriate algorithm for this study.

29. You have run the association rules algorithm on your


dataset, and the two rules {banana, apple} => {grape} and
{apple, orange} => {grape} have been found to be relevant.
What else must be true?
Choose the right answer:

1. {banana, apple, grape, orange} must be a frequent itemset


2. {banana, apple} => {orange} must be a relevant rule
3. {grape} => {banana, apple} must be a relevant rule
4. {grape, apple} must be a frequent itemset

The answer is D: {grape, apple} must be a frequent itemset

30. Your organization has a website where visitors


randomly receive one of two coupons. It is also possible
that visitors to the website will not receive a coupon. You
have been asked to determine if offering a coupon to
website visitors has any impact on their purchase
decisions. Which analysis method should you use?
1. One-way ANOVA
2. K-means clustering
3. Association rules
4. Student's t-test

The answer is A: One-way ANOVA

31. What do you understand about true positive rate and


false-positive rate?
● The True Positive Rate (TPR) defines the probability that an actual positive will
turn out to be positive.

The True Positive Rate (TPR) is calculated as the ratio of True Positives (TP) to the sum of True Positives and False Negatives (TP + FN).

The formula for the same is stated below:

TPR = TP / (TP + FN)

● The False Positive Rate (FPR) defines the probability that an actual negative result will be shown as a positive one, i.e., the probability that the model will generate a false alarm.

The False Positive Rate (FPR) is calculated as the ratio of False Positives (FP) to the sum of False Positives and True Negatives (FP + TN).

The formula for the same is stated below:

FPR = FP / (FP + TN)

32. What is the ROC curve?


The graph between the True Positive Rate on the y-axis and the False Positive Rate on
the x-axis is called the ROC curve and is used in binary classification.

The False Positive Rate (FPR) is calculated by taking the ratio between False Positives
and the total number of negative samples, and the True Positive Rate (TPR) is
calculated by taking the ratio between True Positives and the total number of positive
samples.

In order to construct the ROC curve, the TPR and FPR values are plotted at multiple threshold values. The area under the ROC curve (AUC) ranges between 0 and 1. A completely random model, represented by the diagonal straight line, has an AUC of 0.5. The amount by which a ROC curve deviates above this straight line indicates how effective the model is.
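A minimal sketch of constructing a ROC curve and its AUC with scikit-learn (the labels and scores below are made up for illustration):

from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(list(zip(fpr, tpr)))              # points that trace the ROC curve
print(roc_auc_score(y_true, y_scores))  # area under the curve: 0.5 is random, 1.0 is perfect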

33. What is a Confusion Matrix?


The Confusion Matrix is the summary of prediction results of a particular problem. It is
a table that is used to describe the performance of the model. The Confusion Matrix is
an n*n matrix that evaluates the performance of the classification model.
34. What do you understand about the true-positive rate
and false-positive rate?
TRUE-POSITIVE RATE: The true-positive rate gives the proportion of correct predictions
of the positive class. It is also used to measure the percentage of actual positives that
are accurately verified.

FALSE-POSITIVE RATE: The false-positive rate gives the proportion of actual negatives that are incorrectly predicted as positive. A false positive occurs when the model determines something is true when it is actually false.

35. How is Data Science different from traditional


application programming?
The primary and vital difference between Data Science and traditional application
programming is that in traditional programming, one has to create rules to translate the
input to output. In Data Science, the rules are automatically produced from the data.

36. What is the difference between the long format data


and wide format data?
LONG FORMAT DATA: It contains values that repeat in the first column. In this format,
each row is a one-time point per subject.

WIDE FORMAT DATA: In the Wide Format Data, the data’s repeated responses will be in
a single row, and each response can be recorded in separate columns.

Long format Table:


NAME ATTRIBUTE VALUE

RAMA HEIGHT 182

SITA HEIGHT 160

Wide format Table:

NAME HEIGHT

RAMA 182

SITA 160

37. Mention some techniques used for sampling. What is


the main advantage of sampling?
Sampling is the selection of individual members or a subset of the population in order to estimate the characteristics of the whole population. The main advantage of sampling is that conclusions about a large population can be drawn from a much smaller, cheaper-to-collect subset of the data. There are two types of sampling, namely Probability and Non-Probability Sampling.

38. Why is Python used for Data Cleaning in DS?


Data scientists and technical analysts must convert huge amounts of raw data into usable data. Data cleaning includes removing malformed records, outliers, inconsistent values, redundant formatting, etc. Matplotlib, Pandas, etc., are among the most commonly used Python libraries for data cleaning.

39. What are the popular libraries used in Data Science?


The popular libraries used in Data Science are

● TensorFlow
● Pandas
● NumPy
● SciPy
● Scrapy
● Librosa
● Matplotlib

40. What is variance in Data Science?


Variance is the value that depicts how the individual figures in a set of data distribute themselves about the mean; it describes the difference of each value from the mean value. Data scientists use variance to understand the spread of a data set.
41. What is pruning in a decision tree algorithm?
In Data Science and Machine Learning, Pruning is a technique which is related to
decision trees. Pruning simplifies the decision tree by reducing the rules. Pruning helps
to avoid complexity and improves accuracy. Reduced error Pruning, cost complexity
pruning etc. are the different types of Pruning.

42. What is entropy in a decision tree algorithm?


Entropy is the measure of randomness or disorder in a group of observations. It also determines how a decision tree chooses where to split the data. Entropy is used to check the homogeneity of the given data: if the entropy is zero, the sample of data is entirely homogeneous, and if the entropy is one, the sample is equally divided between the classes.

43. What is information gain in a decision tree algorithm?

Information gain is the expected reduction in entropy. Information gain decides how the tree is built and makes the decision tree smarter. It is computed for a parent node R and a set E of K training examples, as the difference between the entropy before and after the split.

44. What is k-fold cross-validation?


K-fold cross-validation is a procedure used to estimate the model's skill on new data. In k-fold cross-validation, every observation from the original dataset appears in the test set exactly once and in the training set k - 1 times. K-fold cross-validation estimates the accuracy but does not by itself improve the accuracy.
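A minimal sketch of 5-fold cross-validation with scikit-learn (the synthetic dataset and the logistic regression model are assumptions for the example):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)
print(scores)         # one accuracy score per fold
print(scores.mean())  # estimate of the model's skill on new data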

45. What is a normal distribution?


Normal Distribution is also known as the Gaussian Distribution. The normal distribution
shows the data near the mean and the frequency of that particular data. When
represented in graphical form, normal distribution appears like a bell curve. The
parameters included in the normal distribution are Mean, Standard Deviation, Median
etc.

46. What is Deep Learning?


Deep Learning is one of the essential areas of Data Science, alongside statistics. It is loosely modeled on the human brain: the algorithms are designed to resemble the way networks of neurons process information. In Deep Learning, multiple layers are stacked between the raw input and the output to extract progressively higher-level features.

47. What is an RNN (recurrent neural network)?


An RNN is a neural network that operates on sequential data. RNNs are used in language translation, voice recognition, image captioning, etc. There are different types of RNN architectures, such as one-to-one, one-to-many, many-to-one and many-to-many. RNNs are used in Google's Voice Search and Apple's Siri.

Basic Data Science Interview Questions


Let us begin with a few basic data science interview questions!

48. What are the feature vectors?


A feature vector is an n-dimensional vector of numerical features that represent an
object. In machine learning, feature vectors are used to represent numeric or symbolic
characteristics (called features) of an object in a mathematical way that's easy to
analyze.

49. What are the steps in making a decision tree?


1. Take the entire data set as input.
2. Look for a split that maximizes the separation of the classes. A split is any
test that divides the data into two sets.
3. Apply the split to the input data (divide step).
4. Re-apply steps one and two to the divided data.
5. Stop when you meet any stopping criteria.
6. This step is called pruning. Clean up the tree if you went too far doing splits.

50. What is root cause analysis?


Root cause analysis was initially developed to analyze industrial accidents but is now
widely used in other areas. It is a problem-solving technique used for isolating the root
causes of faults or problems. A factor is called a root cause if its deduction from the
problem-fault-sequence averts the final undesirable event from recurring.

51. What is logistic regression?


Logistic regression is also known as the logit model. It is a technique used to forecast
the binary outcome from a linear combination of predictor variables.

52. What are recommender systems?


Recommender systems are a subclass of information filtering systems that are meant
to predict the preferences or ratings that a user would give to a product.

53. Explain cross-validation.


Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is forecasting and one wants to estimate how accurately a model will perform in practice.

The goal of cross-validation is to reserve a portion of the data for testing the model during the training phase (i.e., a validation data set) in order to limit problems like overfitting and to gain insight into how the model will generalize to an independent data set.

54. What is collaborative filtering?


Most recommender systems use this filtering process to find patterns and information
by collaborating perspectives, numerous data sources, and several agents.

55. Do gradient descent methods always converge to


similar points?
They do not, because in some cases, they reach a local minimum or a local optimum.
You would not reach the global optima point. This is governed by the data and the
starting conditions.

56. What is the goal of A/B Testing?


This is statistical hypothesis testing for randomized experiments with two variables, A
and B. The objective of A/B testing is to identify changes to a web page that maximize or increase the outcome of a strategy.

57. What are the drawbacks of the linear model?


● The assumption of linearity of the errors
● It can't be used for count outcomes or binary outcomes
● There are overfitting problems that it can't solve

58. What is the law of large numbers?


It is a theorem that describes the result of performing the same experiment a large number of times. This theorem forms the basis of frequency-style thinking. It states that the
sample mean, sample variance, and sample standard deviation converge to what they
are trying to estimate.

59. What are the confounding variables?


These are extraneous variables in a statistical model that correlate, directly or inversely, with both the dependent and the independent variable. The estimate fails to account for the confounding factor.
60. What is star schema?
It is a traditional database schema with a central table. Satellite tables map IDs to
physical names or descriptions and can be connected to the central fact table using the
ID fields; these tables are known as lookup tables and are principally useful in real-time
applications, as they save a lot of memory. Sometimes, star schemas involve several
layers of summarization to recover information faster.

61. How regularly must an algorithm be updated?


You will want to update an algorithm when:

● You want the model to evolve as data streams through infrastructure


● The underlying data source is changing
● There is a case of non-stationarity

62. What are eigenvalue and eigenvector?


Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching; they are used for understanding linear transformations.

Eigenvalues are the factors by which the transformation stretches or compresses the data along those directions.

In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix.

63. Why is resampling done?


Resampling is done in any of these cases:
● Estimating the accuracy of sample statistics by using subsets of accessible
data, or drawing randomly with replacement from a set of data points
● Substituting labels on data points when performing significance tests
● Validating models by using random subsets (bootstrapping, cross-validation)

64. What is selection bias?


Selection bias, in general, is a problematic situation in which error is introduced due to a
non-random population sample.

65. What are the types of biases that can occur during
sampling?
1. Selection bias
2. Undercoverage bias
3. Survivorship bias

66. What is survivorship bias?


Survivorship bias is the logical error of focusing on aspects that support surviving a
process and casually overlooking those that did not because of their lack of
prominence. This can lead to wrong conclusions in numerous ways.

67. How do you work towards a random forest?


The underlying principle of this technique is that several weak learners combine to
provide a strong learner. The steps involved are:

1. Build several decision trees on bootstrapped training samples of data
2. On each tree, each time a split is considered, a random sample of m predictors is chosen as split candidates out of all p predictors
3. Rule of thumb: at each split, m = √p
4. Predictions: by the majority rule

This exhaustive list is sure to strengthen your preparation for data science interview
questions.

68. What is a bias-variance trade-off?


Bias: Bias is the error introduced into a model by oversimplifying the machine learning algorithm. It stems from overly simple assumptions made at training time to keep the target function easier to understand, and it can lead to underfitting.

Some of the popular machine learning algorithms which are low on the bias scale are -

Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Decision Trees.

Algorithms that are high on the bias scale -

Logistic Regression and Linear Regression.

Variance: Because of a complex machine learning algorithm, a model performs really


badly on a test data set as the model learns even noise from the training data set. This
error that occurs in the Machine Learning model is called Variance and can generate
overfitting and hyper-sensitivity in Machine Learning models.
While trying to get over bias in our model, we try to increase the complexity of the
machine learning algorithm. Though it helps in reducing the bias, after a certain point, it
generates an overfitting effect on the model hence resulting in hyper-sensitivity and high
variance.

Bias-Variance trade-off: To achieve the best performance, the main target of a


supervised machine learning algorithm is to have low variance and bias.

The following things are observed regarding some of the popular machine learning
algorithms -

● The Support Vector Machine algorithm (SVM) has high variance and low bias.
In order to change the trade-off, we can increase the parameter C. The C
parameter results in a decrease in the variance and an increase in bias by
influencing the margin violations allowed in training datasets.
● In contrast to the SVM, the K-Nearest Neighbors (KNN) Machine Learning
algorithm has a high variance and low bias. To change the trade-off of this
algorithm, we can increase the prediction influencing neighbors by increasing
the K value, thus increasing the model bias.

69. Describe Markov chains.

A Markov chain is a type of stochastic process in which the probability of a future state depends only on the current state, not on the states that preceded it.

A good example of a Markov chain is a word-recommendation system. The model recognizes and recommends the next word based only on the immediately preceding word and not on anything before that. It is trained on previous text that is similar to the training data set and generates recommendations for the current text accordingly, based on the previous word.

70. Why is R used in Data Visualization?


R is widely used in Data Visualizations for the following reasons-

● We can create almost any type of graph using R.


● R has multiple libraries like lattice, ggplot2, leaflet, etc., and so many inbuilt
functions as well.
● It is easier to customize graphics in R compared to Python.
● R is used in feature engineering and in exploratory data analysis as well.

71. What is the difference between a box plot and a


histogram?
The frequency of a certain feature’s values is denoted visually by both box plots

and histograms.

Boxplots are more often used for comparing several datasets; compared to histograms, they take less space and contain fewer details. Histograms are used to know and understand the probability distribution underlying a dataset.
72. What does NLP stand for?
NLP is short for Natural Language Processing. It is the study of how computers can be programmed to learn from massive amounts of textual data. A few popular examples of NLP tasks are stemming, sentiment analysis, tokenization, removal of stop words, etc.

73. Difference between an error and a residual error


The difference between a residual error and error are defined below -

Error:

● The difference between the actual value and the predicted value is called an error.
● Some of the popular ways of calculating errors in data science are Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Squared Error (MSE).
● An error is generally unobservable.
● An error is how the actual population data and the observed data differ from each other.

Residual Error:

● The difference between the arithmetic mean of a group of values and the observed group of values is called a residual error.
● A residual error can be represented using a graph.
● A residual error is used to show how the sample population data and the observed data differ from each other.

74. Difference between Normalisation and Standardization

Standardization:

● The technique of converting data in such a way that it is normally distributed, with a standard deviation of 1 and a mean of 0.
● Standardization takes care that the data follow the standard normal distribution.
● Standardization formula:

X' = (X - 𝞵) / 𝞼

Normalization:

● The technique of converting all data values to lie between 0 and 1 is known as Normalization. This is also known as min-max scaling.
● Normalization takes care of returning the data to the 0-to-1 range.
● Normalization formula:

X' = (X - Xmin) / (Xmax - Xmin)

Here, Xmin is the feature's minimum value and Xmax is the feature's maximum value.
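A minimal sketch of both transformations with scikit-learn (the tiny data array is made up for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

standardized = StandardScaler().fit_transform(X)  # mean 0, standard deviation 1
normalized = MinMaxScaler().fit_transform(X)      # values scaled to the [0, 1] range

print(standardized.ravel())
print(normalized.ravel())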

75. Difference between Point Estimates and Confidence


Interval
Confidence Interval: A range of values likely containing the population parameter is
given by the confidence interval. Further, it even tells us how likely that particular interval
can contain the population parameter. The Confidence Coefficient (or Confidence level)
is denoted by 1-alpha, which gives the probability or likeness. The level of significance is
given by alpha.

Point Estimates: An estimate of the population parameter is given by a particular value


called the point estimate. Some popular methods used to derive Population Parameters’
Point estimators are - Maximum Likelihood estimator and the Method of Moments.
To conclude, the bias and variance are inversely proportional to each other, i.e., an
increase in bias results in a decrease in the variance, and an increase in variance results
in a decrease in bias.

One-on-One Data Science Interview Questions

To crack a data science interview is no walk in the park. It requires in-depth knowledge
and expertise in various topics. Furthermore, the projects that you have worked on can
significantly boost your potential in a lot of interviews. In order to help you with your
interviews, we have compiled a set of questions for you to relate to. Since data science
is an extensive field, there are no limitations on the type of questions that can be asked. With that being said, you can answer each of these questions depending on
the projects you have worked on and the industries you have been in. Try to answer
each one of these sample questions and then share your answer with us through the
comments.

Pro Tip: No matter how basic a question may seem, always try to view it from a
technical perspective and use each question to demonstrate your unique technical skills
and abilities.

76. Which is your favorite machine learning algorithm and why?

77. Which according to you is the most important skill that makes a good data
scientist?

78. Why do you think data science is so popular today?


79. Explain the most challenging data science project that you worked on.

80. How do you usually prefer working on a project - individually, small team, or large
team?

81. Based on your experience in the industry, tell me about your top 5 predictions for the
next 10 years.

82. What are some unique skills that you can bring to the team as a data scientist?

83. Were you always in the data science field? If not, what made you change your career
path and how did you upgrade your skills?

84. If we give you a random data set, how will you figure out whether it suits the
business needs or not?

85. Given a chance, if you could pick a career other than being a data scientist, what
would you choose?

86. Given the constant change in the data science field, how quickly can you adapt to
new technologies?

87. Have you ever been in a conflict with your colleagues regarding different strategies
to go about a project? How were you able to resolve it?

88. Can you break down an algorithm you have used on a recent project?

89. What tools did you use in your last project and why?
90. Think of the last technical problem that you solved. If you had no limitations with the
project’s budget, what would be the first thing you would do to solve the same problem?

91. When you are assigned multiple projects at the same time, how best do you
organize your time?

92. Tell me about a time when your project didn’t go according to plan and what you
learned from it.

93. Have you ever created an original algorithm? How did you go about doing that and
for what purpose?

94. What is your most favored strategy to clean a big data set and why?

95. Do you contribute to any open source projects?


Top Machine Learning Interview Questions

Let's start with some commonly asked machine learning interview questions and
answers.

1. What Are the Different Types of Machine Learning?


There are three types of machine learning:

Supervised Learning

In supervised machine learning, a model makes predictions or decisions based on past


or labeled data. Labeled data refers to sets of data that are given tags or labels, and
thus made more meaningful.

Unsupervised Learning

In unsupervised learning, we don't have labeled data. A model can identify patterns,
anomalies, and relationships in the input data.
Reinforcement Learning

Using reinforcement learning, the model can learn based on the rewards it received for
its previous action.

Consider an environment where an agent is working. The agent is given a target to


achieve. Every time the agent takes some action toward the target, it is given positive
feedback. And, if the action taken is going away from the goal, the agent is given
negative feedback.


2. What is Overfitting, and How Can You Avoid It?


Overfitting is a situation that occurs when a model learns the training set too well,
taking up random fluctuations in the training data as concepts. These impact the
model’s ability to generalize and don’t apply to new data.

When a model is fit to the training data, it can show nearly 100 percent accuracy (technically, only a slight loss). But when we use the test data, there may be large errors and low efficiency. This condition is known as overfitting.

There are multiple ways of avoiding overfitting, such as:

● Regularization. It involves a cost term for the features involved with the
objective function
● Making a simple model. With lesser variables and parameters, the variance
can be reduced
● Cross-validation methods like k-folds can also be used
● If some model parameters are likely to cause overfitting, techniques for
regularization like LASSO can be used that penalize these parameters



3. What is ‘training Set’ and ‘test Set’ in a Machine
Learning Model? How Much Data Will You Allocate for
Your Training, Validation, and Test Sets?
There is a three-step process followed to create a model:

1. Train the model


2. Test the model
3. Deploy the model

Training Set:

● The training set is the set of examples given to the model to analyze and learn from
● 70% of the total data is typically taken as the training dataset
● This is labeled data used to train the model

Test Set:

● The test set is used to test the accuracy of the hypothesis generated by the model
● The remaining 30% is typically taken as the testing dataset
● We test without the labels and then verify the results against the labels

Consider a case where you have labeled data for 1,000 records. One way to train the
model is to expose all 1,000 records during the training process. Then you take a small
set of the same data to test the model, which would give good results in this case.
But, this is not an accurate way of testing. So, we set aside a portion of that data called
the ‘test set’ before starting the training process. The remaining data is called the
‘training set’ that we use for training the model. The training set passes through the
model multiple times until the accuracy is high, and errors are minimized.

Now, we pass the test data to check if the model can accurately predict the values and
determine if training is effective. If you get errors, you either need to change your model
or retrain it with more data.
Regarding the question of how to split the data into a training set and test set, there is
no fixed rule, and the ratio can vary based on individual preferences.

4. How Do You Handle Missing or Corrupted Data in a


Dataset?
One of the easiest ways to handle missing or corrupted data is to drop those rows or
columns or replace them entirely with some other value.

There are two useful methods in Pandas:

● isnull() and dropna() will help to find the columns/rows with missing data and drop them
● fillna() will replace the missing values with a placeholder value
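A minimal pandas sketch of both approaches (the tiny DataFrame is made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40],
                   "salary": [50000, 62000, np.nan, 58000]})

print(df.isnull().sum())       # count of missing values per column
dropped = df.dropna()          # drop rows that contain any missing value
filled = df.fillna(df.mean())  # or replace missing values with the column mean
print(filled)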
5. How Can You Choose a Classifier Based on a Training
Set Data Size?
When the training set is small, a model that has high bias and low variance tends to work better, because it is less likely to overfit; Naive Bayes is a typical example.

When the training set is large, models with low bias and high variance tend to perform better, as they can capture more complex relationships.

6. Explain the Confusion Matrix with Respect to Machine


Learning Algorithms.
A confusion matrix (or error matrix) is a specific table that is used to measure the
performance of an algorithm. It is mostly used in supervised learning; in unsupervised
learning, it’s called the matching matrix.

The confusion matrix has two parameters:


● Actual
● Predicted

It also has identical sets of features in both of these dimensions.

Consider a binary confusion matrix with 12 true positives, 9 true negatives, 3 false positives, and 1 false negative.

Here,

For actual values:

Total Yes = 12+1 = 13

Total No = 3+9 = 12

Similarly, for predicted values:

Total Yes = 12+3 = 15


Total No = 1+9 = 10

For a model to be accurate, the values across the diagonals should be high. The total
sum of all the values in the matrix equals the total observations in the test data set.

For the above matrix, total observations = 12+3+1+9 = 25

Now, accuracy = sum of the values across the diagonal/total dataset

= (12+9) / 25

= 21 / 25

= 84%
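A minimal sketch of computing a confusion matrix and accuracy with scikit-learn (the labels below are made up):

from sklearn.metrics import accuracy_score, confusion_matrix

y_actual = [1, 0, 1, 1, 0, 1, 0, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_actual, y_predicted))  # rows are actual classes, columns are predicted classes
print(accuracy_score(y_actual, y_predicted))    # sum of the diagonal divided by the total observations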

7. What Is a False Positive and False Negative and How


Are They Significant?
False positives are those cases that wrongly get classified as True but are False.

False negatives are those cases that wrongly get classified as False but are True.

In the term ‘False Positive,’ the word ‘Positive’ refers to the ‘Yes’ row of the predicted
value in the confusion matrix. The complete term indicates that the system has
predicted it as a positive, but the actual value is negative.
So, looking at the confusion matrix, we get:

False-positive = 3

True positive = 12

Similarly, in the term ‘False Negative,’ the word ‘Negative’ refers to the ‘No’ row of the
predicted value in the confusion matrix. And the complete term indicates that the
system has predicted it as negative, but the actual value is positive.

So, looking at the confusion matrix, we get:

False Negative = 1

True Negative = 9

8. What Are the Three Stages of Building a Model in


Machine Learning?
The three stages of building a machine learning model are:

● Model Building
Choose a suitable algorithm for the model and train it according to the
requirement
● Model Testing
Check the accuracy of the model through the test data
● Applying the Model
Make the required changes after testing and use the final model for real-time
projects

Here, it’s important to remember that once in a while, the model needs to be checked to
make sure it’s working correctly. It should be modified to make sure that it is up-to-date.

9. What is Deep Learning?


Deep learning is a subset of machine learning that involves systems that think and
learn like humans using artificial neural networks. The term ‘deep’ comes from the fact
that you can have several layers of neural networks.

One of the primary differences between machine learning and deep learning is that
feature engineering is done manually in machine learning. In the case of deep learning,
the model consisting of neural networks will automatically determine which features to
use (and which not to use).

This question is commonly asked in both machine learning and deep learning interviews.
10. What Are the Differences Between Machine Learning
and Deep Learning?

Machine Learning:

● Enables machines to make decisions on their own, based on past data
● Needs only a small amount of data for training
● Works well on low-end systems, so you don't need large machines
● Most features need to be identified in advance and manually coded
● The problem is divided into two parts, solved individually, and then combined

Deep Learning:

● Enables machines to make decisions with the help of artificial neural networks
● Needs a large amount of training data
● Needs high-end machines because it requires a lot of computing power
● The machine learns the features from the data it is provided
● The problem is solved in an end-to-end manner
11. What Are the Applications of Supervised Machine
Learning in Modern Businesses?
Applications of supervised machine learning include:

● Email Spam Detection


Here we train the model using historical data that consists of emails
categorized as spam or not spam. This labeled information is fed as input to
the model.
● Healthcare Diagnosis
By providing images regarding a disease, a model can be trained to detect if a
person is suffering from the disease or not.
● Sentiment Analysis
This refers to the process of using algorithms to mine documents and
determine whether they’re positive, neutral, or negative in sentiment.
● Fraud Detection
By training the model to identify suspicious patterns, we can detect instances
of possible fraud.


12. What is Semi-supervised Machine Learning?


Supervised learning uses data that is completely labeled, whereas unsupervised learning uses training data with no labels.

In the case of semi-supervised learning, the training data contains a small amount of
labeled data and a large amount of unlabeled data.
13. What Are Unsupervised Machine Learning Techniques?
There are two techniques used in unsupervised learning: clustering and association.

Clustering

Clustering problems involve data to be divided into subsets. These subsets, also called
clusters, contain data that are similar to each other. Different clusters reveal different
details about the objects, unlike classification or regression.

Association
In an association problem, we identify patterns of associations between different
variables or items.

For example, an e-commerce website can suggest other items for you to buy, based on
the prior purchases that you have made, spending habits, items in your wishlist, other
customers’ purchase habits, and so on.

14. What is the Difference Between Supervised and


Unsupervised Machine Learning?
● Supervised learning - This model learns from the labeled data and makes a
future prediction as output
● Unsupervised learning - This model uses unlabeled input data and allows the
algorithm to act on that information without guidance.

15. What is the Difference Between Inductive Machine


Learning and Deductive Machine Learning?
Inductive Learning:

● It observes instances based on defined principles to draw a conclusion
● Example: explaining to a child to keep away from fire by showing a video where fire causes damage

Deductive Learning:

● It draws conclusions from past experiences
● Example: allow the child to play with fire. If he or she gets burned, they will learn that it is dangerous and will refrain from making the same mistake again

16. Compare K-means and KNN Algorithms.

K-means:

● K-means is unsupervised
● K-means is a clustering algorithm
● The points in each cluster are similar to each other, and each cluster is different from its neighboring clusters

KNN:

● KNN is supervised in nature
● KNN is a classification algorithm
● It classifies an unlabeled observation based on its K (can be any number) surrounding neighbors

17. What Is ‘naive’ in the Naive Bayes Classifier?


The classifier is called ‘naive’ because it makes assumptions that may or may not turn
out to be correct.

The algorithm assumes that the presence of one feature of a class is not related to the
presence of any other feature (absolute independence of features), given the class
variable.

For instance, a fruit may be considered to be a cherry if it is red in color and round in
shape, regardless of other features. This assumption may or may not be right (as an
apple also matches the description).

18. Explain How a System Can Play a Game of Chess


Using Reinforcement Learning.
Reinforcement learning has an environment and an agent. The agent performs some
actions to achieve a specific goal. Every time the agent performs a task that is taking it
towards the goal, it is rewarded. And, every time it takes a step that goes against that
goal or in the reverse direction, it is penalized.
Earlier, chess programs had to determine the best moves after much research on
numerous factors. Building a machine designed to play such games would require many
rules to be specified.

With reinforced learning, we don’t have to deal with this problem as the learning agent
learns by playing the game. It will make a move (decision), check if it’s the right move
(feedback), and keep the outcomes in memory for the next step it takes (learning).
There is a reward for every correct decision the system takes and punishment for the
wrong one.

19. How Will You Know Which Machine Learning Algorithm


to Choose for Your Classification Problem?
While there is no fixed rule to choose an algorithm for a classification problem, you can
follow these guidelines:

● If accuracy is a concern, test different algorithms and cross-validate them


● If the training dataset is small, use models that have low variance and high
bias
● If the training dataset is large, use models that have high variance and little
bias

20. How is Amazon Able to Recommend Other Things to


Buy? How Does the Recommendation Engine Work?
Once a user buys something from Amazon, Amazon stores that purchase data for future reference and finds products that are most likely also to be bought. This is possible because of the association algorithm, which can identify patterns in a given dataset.
21. When Will You Use Classification over Regression?
Classification is used when your target is categorical, while regression is used when
your target variable is continuous. Both classification and regression belong to the
category of supervised machine learning algorithms.

Examples of classification problems include:

● Predicting yes or no
● Estimating gender
● Breed of an animal
● Type of color

Examples of regression problems include:

● Estimating sales and price of a product


● Predicting the score of a team
● Predicting the amount of rainfall

22. How Do You Design an Email Spam Filter?


Building a spam filter involves the following process:

● The email spam filter will be fed with thousands of emails


● Each of these emails already has a label: ‘spam’ or ‘not spam.’
● The supervised machine learning algorithm will then determine which type of emails are being marked as spam based on spam words like ‘lottery,’ ‘free offer,’ ‘no money,’ ‘full refund,’ etc.
● The next time an email is about to hit your inbox, the spam filter will use
statistical analysis and algorithms like Decision Trees and SVM to determine
how likely the email is spam
● If the likelihood is high, it will label it as spam, and the email won’t hit your
inbox
● Based on the accuracy of each model, we will use the algorithm with the
highest accuracy after testing all the models
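
A minimal sketch of this workflow in Python might look like the following. The tiny hand-made email list and the choice of a linear SVM are purely illustrative; a real filter would be trained on thousands of labeled emails and would compare several models:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# A tiny, made-up labeled dataset standing in for thousands of real emails.
emails = [
    "you won the lottery claim your free offer now",
    "full refund no money needed click here",
    "meeting moved to 3 pm see agenda attached",
    "please review the quarterly report draft",
]
labels = ["spam", "spam", "not spam", "not spam"]

# Turn the raw text into word counts, then fit a linear SVM on those counts.
model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(emails, labels)

print(model.predict(["claim your free lottery offer"]))  # expected: 'spam'
print(model.predict(["agenda for tomorrow's meeting"]))  # expected: 'not spam'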

23. What is a Random Forest?


A ‘random forest’ is a supervised machine learning algorithm that is generally used for
classification problems. It operates by constructing multiple decision trees during the
training phase. The random forest chooses the decision of the majority of the trees as
the final decision.

24. Considering a Long List of Machine Learning


Algorithms, given a Data Set, How Do You Decide Which
One to Use?
There is no master algorithm for all situations. Choosing an algorithm depends on the
following questions:

● How much data do you have, and is it continuous or categorical?


● Is the problem related to classification, association, clustering, or regression?
● Predefined variables (labeled), unlabeled, or mix?
● What is the goal?

Based on the answers to these questions, you can shortlist the algorithms that fit the data and the goal, and then compare them empirically.


25. What is Bias and Variance in a Machine Learning
Model?
Bias
Bias in a machine learning model occurs when the predicted values are far from the actual values. Low bias indicates a model whose predictions are very close to the actual values.

Underfitting: High bias can cause an algorithm to miss the relevant relations between
features and target outputs.

Variance

Variance refers to the amount the target model will change when trained with different
training data. For a good model, the variance should be minimized.

Overfitting: High variance can cause an algorithm to model the random noise in the
training data rather than the intended outputs.

26. What is the Trade-off Between Bias and Variance?


The bias-variance decomposition essentially decomposes the learning error from any
algorithm by adding the bias, variance, and a bit of irreducible error due to noise in the
underlying dataset.
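
For squared-error loss, this decomposition is commonly written as:

Total Error = Bias² + Variance + Irreducible Error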

Essentially, if you make the model more complex and add more variables, you’ll lose bias but gain variance. To get the optimally reduced amount of error, you’ll have to trade off bias and variance. Neither high bias nor high variance is desired.

High bias and low variance algorithms train models that are consistent, but inaccurate
on average.

High variance and low bias algorithms train models that are accurate but inconsistent.
27. Define Precision and Recall.
Precision

Precision is the ratio of the number of events you correctly recall to the total number of events you recall (a mix of correct and incorrect recalls).

Precision = (True Positive) / (True Positive + False Positive)

Recall

Recall is the ratio of the number of events you can correctly recall to the total number of events.

Recall = (True Positive) / (True Positive + False Negative)

28. What is a Decision Tree Classification?


A decision tree builds classification (or regression) models as a tree structure, with
datasets broken up into ever-smaller subsets while developing the decision tree, literally
in a tree-like way with branches and nodes. Decision trees can handle both categorical
and numerical data.

29. What is Pruning in Decision Trees, and How Is It Done?


Pruning is a technique in machine learning that reduces the size of decision trees. It
reduces the complexity of the final classifier, and hence improves predictive accuracy by
the reduction of overfitting.

Pruning can occur in:


● Top-down fashion. It will traverse nodes and trim subtrees starting at the root
● Bottom-up fashion. It will begin at the leaf nodes

There is a popular pruning algorithm called reduced error pruning, in which:

● Starting at the leaves, each node is replaced with its most popular class
● If the prediction accuracy is not affected, the change is kept
● There is an advantage of simplicity and speed

30. Briefly Explain Logistic Regression.


Logistic regression is a classification algorithm used to predict a binary outcome for a
given set of independent variables.

The output of logistic regression is either a 0 or 1 with a threshold value of generally 0.5.
Any value above 0.5 is considered as 1, and any point below 0.5 is considered as 0.

31. Explain the K Nearest Neighbor Algorithm.


K nearest neighbor algorithm is a classification algorithm that works in a way that a new
data point is assigned to a neighboring group to which it is most similar.

In K nearest neighbors, K can be an integer greater than 1. So, for every new data point,
we want to classify, we compute to which neighboring group it is closest.

Let us classify an object using the following example. Consider there are three clusters:

● Football
● Basketball
● Tennis ball

Let the new data point to be classified be a black ball. We use KNN to classify it. Assume K = 5 (initially).

Next, we find the K (five) nearest data points to the black ball.


Observe that all five selected points do not belong to the same cluster. There are three
tennis balls and one each of basketball and football.

When multiple classes are involved, we prefer the majority. Here the majority is with the
tennis ball, so the new data point is assigned to this cluster.

32. What is a Recommendation System?


Anyone who has used Spotify or shopped at Amazon will recognize a recommendation
system: It’s an information filtering system that predicts what a user might want to hear
or see based on choice patterns provided by the user.

33. What is Kernel SVM?


Kernel SVM is the abbreviated version of the kernel support vector machine. Kernel
methods are a class of algorithms for pattern analysis, and the most common one is
the kernel SVM.
34. What Are Some Methods of Reducing Dimensionality?
You can reduce dimensionality by combining features with feature engineering,
removing collinear features, or using algorithmic dimensionality reduction.

Now that you have gone through these machine learning interview questions, you must
have got an idea of your strengths and weaknesses in this domain.

35. What is Principal Component Analysis?


Principal Component Analysis or PCA is a multivariate statistical technique that is used
for analyzing quantitative data. The objective of PCA is to reduce higher dimensional
data to lower dimensions, remove noise, and extract crucial information such as
features and attributes from large amounts of data.
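
As a small illustration (using the iris dataset and scikit-learn purely as an example), PCA can be applied like this:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                          # 150 rows, 4 numeric features
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA(n_components=2)                     # keep only 2 principal components
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                        # (150, 2)
print(pca.explained_variance_ratio_)          # share of variance kept by each component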

36. What do you understand by the F1 score?


The F1 score is a metric that combines both precision and recall. It is the harmonic mean of precision and recall.

The F1 score can be calculated using the below formula:

F1 = 2 * (P * R) / (P + R)

The F1 score is one when both Precision and Recall scores are one.

37. What do you understand by Type I vs Type II error?


Type I Error: A Type I error occurs when the null hypothesis is true but we reject it (a false positive).
Type II Error: A Type II error occurs when the null hypothesis is false but we fail to reject it (a false negative).

38. Explain Correlation and Covariance?


Correlation: Correlation tells us how strongly two random variables are related to each
other. It takes values between -1 to +1.

Formula to calculate Correlation:

Correlation(X, Y) = Covariance(X, Y) / (σX × σY)

Covariance: Covariance tells us the direction of the linear relationship between two
random variables. It can take any value between - ∞ and + ∞.

Formula to calculate Covariance:

Covariance(X, Y) = Σ (Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1)


39. What are Support Vectors in SVM?
Support Vectors are the data points that are nearest to the hyperplane. They influence the position and orientation of the hyperplane; removing a support vector will alter the position of the hyperplane. The support vectors help us build our support vector machine model.

40. What is Ensemble learning?


Ensemble learning is a combination of the results obtained from multiple machine
learning models to increase the accuracy for improved decision-making.

Example: A Random Forest with 100 trees can provide much better results than using
just one decision tree.
41. What is Cross-Validation?
Cross-Validation in Machine Learning is a statistical resampling technique that uses
different parts of the dataset to train and test a machine learning algorithm on different
iterations. The aim of cross-validation is to test the model’s ability to predict a new set
of data that was not used to train the model. Cross-validation avoids the overfitting of
data.

K-Fold Cross Validation is the most popular resampling technique that divides the whole
dataset into K sets of equal sizes.
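
A short illustrative example with scikit-learn (the dataset, the model, and K = 5 are arbitrary choices):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each of the 5 folds is used once for testing while the other 4 train the model.
scores = cross_val_score(model, X, y, cv=5)
print(scores)         # one accuracy value per fold
print(scores.mean())  # average performance across the folds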

42. What are the different methods to split a tree in a


decision tree algorithm?
Variance: Splitting the nodes of a decision tree using the variance is done when the
target variable is continuous.
Information Gain: Splitting the nodes of a decision tree using Information Gain is
preferred when the target variable is categorical.

Gini Impurity: Splitting the nodes of a decision tree using Gini Impurity is followed when
the target variable is categorical.

43. How does the Support Vector Machine algorithm


handle self-learning?
The SVM algorithm has a learning rate and expansion rate which takes care of
self-learning. The learning rate compensates or penalizes the hyperplanes for making all
the incorrect moves while the expansion rate handles finding the maximum separation
area between different classes.

44. What are the assumptions you need to take before


starting with linear regression?
There are primarily 5 assumptions for a Linear Regression model:

● Multivariate normality
● No auto-correlation
● Homoscedasticity
● Linear relationship
● No or little multicollinearity

45. What is the difference between Lasso and Ridge


regression?
Lasso (also known as L1) and Ridge (also known as L2) regression are two popular regularization techniques that are used to avoid overfitting of data. These methods penalize the coefficients to find the optimum solution and reduce complexity. Lasso regression works by penalizing the sum of the absolute values of the coefficients, whereas in Ridge (L2) regression the penalty is the sum of the squares of the coefficients.
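
A small sketch contrasting the two penalties on a toy problem (the dataset and the alpha value are only illustrative):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)  # penalizes the sum of absolute coefficient values (L1)
ridge = Ridge(alpha=1.0).fit(X, y)  # penalizes the sum of squared coefficient values (L2)

# Lasso tends to shrink some coefficients exactly to zero (implicit feature selection),
# while Ridge shrinks them smoothly without zeroing them out.
print("Lasso coefficients:", lasso.coef_.round(2))
print("Ridge coefficients:", ridge.coef_.round(2))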

Looking forward to a successful career in AI and Machine Learning? Enroll in our Artificial Intelligence Course in collaboration with Caltech University now.
Become Part of the Machine Learning Talent Pool

With technology ramping up, jobs in the field of data science and AI will continue to be
in demand. Candidates who upgrade their skills and become well-versed in these
emerging technologies can find many job opportunities with impressive salaries.
Looking forward to becoming a Machine Learning Engineer? Enroll in Simplilearn's AI and ML Course and get certified today. Additionally, based on your experience level, you may be asked to demonstrate your skills in machine learning, but this depends mostly on the role you’re pursuing. These machine learning interview questions and answers will prepare you to clear your interview on the first attempt!

Apart from the above mentioned interview questions, it is also important to have a fair
understanding of frequently asked Data Science interview questions.

Considering this trend, Simplilearn offers AI and Machine Learning certification course
to help you gain a firm hold of machine learning concepts. This course is well-suited for
those at the intermediate level, including:

● Analytics managers
● Business analysts
● Information architects
● Developers looking to become data scientists
● Graduates seeking a career in data science and machine learning

Facing the machine learning interview questions would become much easier after you
complete this course.
Top Data Science Interview Questions And
Answers
Data Science is among the leading and most popular technologies in the world today.

Major organizations are hiring professionals in this field. With the high demand and low

availability of these professionals, Data Scientists are among the highest-paid IT

professionals. This Data Science Interview preparation blog includes the most

frequently asked questions in Data Science job interviews. Here is a list of these

popular Data Science interview questions:

Q1. What is Data Science?

Q2. Differentiate between Data Analytics and Data Science

Q3. What do you understand about linear regression?

Q4. What do you understand by logistic regression?

Q5. What is a confusion matrix?

Q6. What do you understand by true-positive rate and false-positive rate?

Q7. How is Data Science different from traditional application programming?

Q8. Explain the difference between Supervised and Unsupervised Learning.

Q9. What is the difference between the long format data and wide format data?

Q10. Mention some techniques used for sampling. What is the main advantage of

sampling?

Q11. What is bias in Data Science?

Basic Data Science Interview Questions


1. What is Data Science?

Data Science is a field of computer science that explicitly deals with turning data into

information and extracting meaningful insights out of it. The reason why Data Science is

so popular is that the kind of insights it allows us to draw from the available data has led

to some major innovations in several products and companies. Using these insights, we

are able to determine the taste of a particular customer, the likelihood of a product

succeeding in a particular market, etc.

Check out Our Data Science Course in Kolkata and become a certified Data Scientist!

2. Differentiate between Data Analytics and Data Science

Data Analytics
● Data Analytics is a subset of Data Science.
● The goal of data analytics is to illustrate the precise details of retrieved insights.
● Requires just basic programming languages.
● It focuses on just finding the solutions.
● A data analyst’s job is to analyse data in order to make decisions.

Data Science
● Data Science is a broad technology that includes various subsets such as Data Analytics, Data Mining, Data Visualization, etc.
● The goal of data science is to discover meaningful insights from massive datasets and derive the best possible solutions to resolve business issues.
● Requires knowledge of advanced programming languages.
● Data Science not only focuses on finding the solutions but also predicts the future with past patterns or insights.
● A data scientist’s job is to provide insightful data visualizations from raw data that are easily understandable.

Become an expert Data Scientist. Enroll now in the PG program in Data Science and

Machine Learning from MITxMicroMasters

3. What do you understand about linear regression?

Linear regression helps in understanding the linear relationship between the dependent

and the independent variables. Linear regression is a supervised learning algorithm,

which helps in finding the linear relationship between two variables. One is the predictor

or the independent variable and the other is the response or the dependent variable. In

Linear Regression, we try to understand how the dependent variable changes w.r.t the

independent variable. If there is only one independent variable, then it is called simple

linear regression, and if there is more than one independent variable then it is known as

multiple linear regression.

4. What do you understand by logistic regression?

Logistic regression is a classification algorithm that can be used when the dependent

variable is binary. Let’s take an example. Here, we are trying to determine whether it will

rain or not on the basis of temperature and humidity.


Temperature and humidity are the independent variables, and rain would be our

dependent variable. So, the logistic regression algorithm actually produces an S shape

curve.

Now, let us look at another scenario: Let’s suppose that x-axis represents the runs

scored by Virat Kohli and the y-axis represents the probability of the team India winning

the match. From this graph, we can say that if Virat Kohli scores more than 50 runs,

then there is a greater probability for team India to win the match. Similarly, if he scores

less than 50 runs then the probability of team India winning the match is less than 50

percent.

So, basically in logistic regression, the Y value lies within the range of 0 and 1. This is

how logistic regression works.
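
The S-shaped curve mentioned above comes from the sigmoid (logistic) function. For a single independent variable x, it can be written as:

P(Y = 1) = 1 / (1 + e^-(b0 + b1*x))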


5. What is a confusion matrix?

The confusion matrix is a table that is used to estimate the performance of a model. It

tabulates the actual values and the predicted values in a 2×2 matrix.

● True Positive: This denotes all of those records where the actual values are true and the predicted values are also true.
● False Negative: This denotes all of those records where the actual values are true, but the predicted values are false.
● False Positive: In this, the actual values are false, but the predicted values are true.
● True Negative: Here, the actual values are false and the predicted values are also false.

So, if you want to get the correct predictions, they would basically be represented by all of the true positives and the true negatives. This is how the confusion matrix works.

6. What do you understand about the true-positive rate and

false-positive rate?

True positive rate: In Machine Learning, the true-positive rate, which is also referred to as sensitivity or recall, is used to measure the percentage of actual positives that are correctly identified.

Formula: True Positive Rate = True Positives / Positives

False positive rate: The false-positive rate is basically the probability of falsely rejecting the null hypothesis for a particular test. It is calculated as the ratio between the number of negative events wrongly categorized as positive (false positives) and the total number of actual negative events.

Formula: False Positive Rate = False Positives / Negatives

Check out this comprehensive Data Science Course in India!

7. How is Data Science different from traditional application

programming?

Data Science takes a fundamentally different approach in building systems that provide

value than traditional application development.

In traditional programming paradigms, we used to analyze the input, figure out the

expected output, and write code, which contains rules and statements needed to

transform the provided input into the expected output. As we can imagine, these rules

were not easy to write, especially, for data that even computers had a hard time

understanding, e.g., images, videos, etc.

Data Science shifts this process a little bit. In it, we need access to large volumes of

data that contain the necessary inputs and their mappings to the expected outputs.

Then, we use Data Science algorithms, which use mathematical analysis to generate

rules to map the given inputs to outputs.

This process of rule generation is called training. After training, we use some data that

was set aside before the training phase to test and check the system’s accuracy. The

generated rules are a kind of a black box, and we cannot understand how the inputs are

being transformed into outputs.


However, if the accuracy is good enough, then we can use the system (also called a

model).

As described above, in traditional programming, we had to write the rules to map the

input to the output, but in Data Science, the rules are automatically generated or

learned from the given data. This helped solve some really difficult challenges that were

being faced by several companies.

Interested to learn Data Science skills? Check our Data Science course in Kottayam

Now!

8. Explain the differences between supervised and

unsupervised learning.

Supervised and unsupervised learning are two types of Machine Learning techniques.

They both allow us to build models. However, they are used for solving different kinds of

problems.

Supervised Learning
● Works on the data that contains both the inputs and the expected output, i.e., the labeled data
● Used to create models that can be employed to predict or classify things
● Commonly used supervised learning algorithms: linear regression, decision tree, etc.

Unsupervised Learning
● Works on the data that contains no mappings from input to output, i.e., the unlabeled data
● Used to extract meaningful information out of large volumes of data
● Commonly used unsupervised learning algorithms: k-means clustering, Apriori algorithm, etc.

9. What is the difference between the long format data and

wide format data?

Long Format Data
● Long format data has a column for possible variable types and a column for the values of those variables.
● Each row in the long format represents one time point per subject; as a result, each subject will have many rows of data.
● This data format is most typically used in R analysis and for writing to log files at the end of each experiment.
● A long format contains values that do repeat in the first column.
● Use df.melt() to convert the wide form to the long form.

Wide Format Data
● Wide data has a column for each variable.
● The repeated responses of a subject will be in a single row, with each response in its own column.
● This data format is most widely used in data manipulations and in stats programmes for repeated-measures ANOVAs, and is seldom used in R analysis.
● A wide format contains values that do not repeat in the first column.
● Use df.pivot().reset_index() to convert the long form to the wide form.
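
A small pandas sketch of these two conversions, using a made-up table of repeated test scores per subject:

import pandas as pd

wide = pd.DataFrame({
    "subject": ["A", "B"],
    "test1": [10, 12],
    "test2": [15, 18],
})

# Wide -> long: one row per (subject, test) pair.
long = wide.melt(id_vars="subject", var_name="test", value_name="score")
print(long)

# Long -> wide: one column per test again.
wide_again = long.pivot(index="subject", columns="test", values="score").reset_index()
print(wide_again)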

10. Mention some techniques used for sampling. What is

the main advantage of sampling?

Sampling is defined as the process of selecting a sample from a group of people or from

any particular kind for research purposes. It is one of the most important factors which

decides the accuracy of a research/survey result.

Mainly, there are two types of sampling techniques:

Probability sampling: It involves random selection which makes every element get a

chance to be selected. Probability sampling has various subtypes in it, as mentioned

below:

● Simple Random Sampling

● Stratified Sampling

● Systematic Sampling

● Cluster Sampling

● Multi-stage Sampling

Non-probability sampling: Non-probability sampling follows non-random selection, which means the selection is done based on your ease or any other required criteria. This helps to collect the data easily. The following are the various types of sampling in it:

○ Convenience Sampling

○ Purposive Sampling

○ Quota Sampling

○ Referral/Snowball Sampling

11. What is bias in Data Science?

Bias is a type of error that occurs in a Data Science model because of using an

algorithm that is not strong enough to capture the underlying patterns or trends that

exist in the data. In other words, this error occurs when the data is too complicated for

the algorithm to understand, so it ends up building a model that makes simple

assumptions. This leads to lower accuracy because of underfitting. Algorithms that can

lead to high bias are linear regression, logistic regression, etc.

12. What is dimensionality reduction?

Dimensionality reduction is the process of converting a dataset with a high number of

dimensions (fields) to a dataset with a lower number of dimensions. This is done by

dropping some fields or columns from the dataset. However, this is not done

haphazardly. In this process, the dimensions or fields are dropped only after making

sure that the remaining information will still be enough to succinctly describe similar

information.

13. Why is Python used for Data Cleaning in DS?

Data Scientists have to clean and transform huge data sets into a form that they can

work with. It is important to deal with the redundant data for better results by removing

nonsensical outliers, malformed records, missing values, inconsistent formatting, etc.


Python libraries such as Matplotlib, Pandas, Numpy, Keras, and SciPy are extensively

used for Data cleaning and analysis. These libraries are used to load and clean the data

and do effective analysis. For example, a CSV file named “Student” has information

about the students of an institute like their names, standard, address, phone number,

grades, marks, etc.
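
A minimal sketch of such cleaning steps with Pandas is shown below. The file name "Student.csv" comes from the example above, but the column names ('name', 'standard', 'marks') are only assumptions made for illustration:

import pandas as pd

df = pd.read_csv("Student.csv")

df = df.drop_duplicates()                                   # remove redundant records
df["name"] = df["name"].str.strip().str.title()             # fix inconsistent formatting
df["marks"] = pd.to_numeric(df["marks"], errors="coerce")   # malformed values become NaN
df = df.dropna(subset=["name", "standard"])                 # drop rows missing key fields
df["marks"] = df["marks"].fillna(df["marks"].mean())        # impute the remaining missing marks

print(df.head())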

14. Why is R used in Data Visualization?

R provides the best ecosystem for data analysis and visualization with more than

12,000 packages in Open-source repositories. It has huge community support, which

means you can easily find the solution to your problems on various platforms like

StackOverflow.

It has better data management and supports distributed computing by splitting the

operations between multiple tasks and nodes, which eventually decreases the

complexity and execution time of large datasets.

15. What are the popular libraries used in Data Science?

Below are the popular libraries used for data extraction, cleaning, visualization, and

deploying DS models:

● TensorFlow: Supports parallel computing with impeccable library

management backed by Google.

● SciPy: Mainly used for solving differential equations, multidimensional

programming, data manipulation, and visualization through graphs and

charts.
● Pandas: Used to implement the ETL(Extracting, Transforming, and Loading

the datasets) capabilities in business applications.

● Matplotlib: Being free and open-source, it can be used as a replacement for

MATLAB, which results in better performance and low memory consumption.

● PyTorch: Best for projects which involve Machine Learning algorithms and

Deep Neural Networks.

Interested to learn more about Data Science, check out our Data Science Course in

New York!

16. What is variance in Data Science?

Variance is a type of error that occurs in a Data Science model when the model ends up

being too complex and learns features from data, along with the noise that exists in it.

This kind of error can occur if the algorithm used to train the model has high complexity,

even though the data and the underlying patterns and trends are quite easy to discover.

This makes the model a very sensitive one that performs well on the training dataset but

poorly on the testing dataset, and on any kind of data that the model has not yet seen.

Variance generally leads to poor accuracy in testing and results in overfitting.

17. What is pruning in a decision tree algorithm?

Pruning a decision tree is the process of removing the sections of the tree that are not

necessary or are redundant. Pruning leads to a smaller decision tree, which performs

better and gives higher accuracy and speed.

18. What is entropy in a decision tree algorithm?


In a decision tree algorithm, entropy is the measure of impurity or randomness. The

entropy of a given dataset tells us how pure or impure the values of the dataset are. In

simple terms, it tells us about the variance in the dataset.

For example, suppose we are given a box with 10 blue marbles. Then, the entropy of

the box is 0 as it contains marbles of the same color, i.e., there is no impurity. If we need

to draw a marble from the box, the probability of it being blue will be 1.0. However, if we replace 4 of the blue marbles with 4 red marbles in the box, the box is no longer pure, and the entropy increases because the probability of drawing a blue marble drops to 0.6.
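
More generally, if class i occurs in the dataset with probability pᵢ, entropy is calculated as:

Entropy = − Σ pᵢ × log₂(pᵢ)

For the box with 6 blue and 4 red marbles, this gives −(0.6 × log₂0.6 + 0.4 × log₂0.4) ≈ 0.97.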

19. What is information gain in a decision tree algorithm?

When building a decision tree, at each step, we have to create a node that decides

which feature we should use to split data, i.e., which feature would best separate our

data so that we can make predictions. This decision is made using information gain,

which is a measure of how much entropy is reduced when a particular feature is used to

split the data. The feature that gives the highest information gain is the one that is

chosen to split the data.
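
In formula form, for a split that divides the n records in a parent node into child nodes of sizes nᵢ:

Information Gain = Entropy(parent) − Σ (nᵢ / n) × Entropy(childᵢ)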

Explore this Data Science Course in Delhi and master decision tree algorithm.

20. What is k-fold cross-validation?

In k-fold cross-validation, we divide the dataset into k equal parts. After this, we loop

over the entire dataset k times. In each iteration of the loop, one of the k parts is used

for testing, and the other k − 1 parts are used for training. Using k-fold cross-validation,
each one of the k parts of the dataset ends up being used for training and testing

purposes.

21. Explain how a recommender system works.

A recommender system is a system that many consumer-facing, content-driven, online

platforms employ to generate recommendations for users from a library of available

content. These systems generate recommendations based on what they know about

the users’ tastes from their activities on the platform.

For example, imagine that we have a movie streaming platform, similar to Netflix or

Amazon Prime. If a user has previously watched and liked movies from action and

horror genres, then it means that the user likes watching the movies of these genres. In

that case, it would be better to recommend such movies to this particular user. These

recommendations can also be generated based on what users with a similar taste like

watching.

22. What is a normal distribution?

Data distribution is a visualization tool to analyze how data is spread out or distributed.

Data can be distributed in various ways. For instance, it could be with a bias to the left

or the right, or it could all be jumbled up.

Data may also be distributed around a central value, i.e., mean, median, etc. This kind

of distribution has no bias either to the left or to the right and is in the form of a

bell-shaped curve. This distribution also has its mean equal to the median. This kind of

distribution is called a normal distribution.


23. What is Deep Learning?

Deep Learning is a kind of Machine Learning, in which neural networks are used to

imitate the structure of the human brain, and just like how a brain learns from

information, machines are also made to learn from the information that is provided to

them.

Deep Learning is an advanced version of neural networks to make the machines learn

from data. In Deep Learning, the neural networks comprise many hidden layers (which

is why it is called ‘deep’ learning) that are connected to each other, and the output of the

previous layer is the input of the current layer.

24. What is an RNN (recurrent neural network)?

A recurrent neural network, or RNN for short, is a kind of Machine Learning algorithm

that makes use of the artificial neural network. RNNs are used to find patterns from a

sequence of data, such as time series, stock market, temperature, etc. RNNs are a kind

of feedforward network, in which information from one layer passes to another layer,

and each node in the network performs mathematical operations on the data. These

operations are temporal, i.e., RNNs store contextual information about previous

computations in the network. It is called recurrent because it performs the same

operations on some data every time it is passed. However, the output may be different

based on past computations and their results.

25. Explain selection bias.


Selection bias is the bias that occurs during the sampling of data. This kind of bias

occurs when a sample is not representative of the population, which is going to be

analyzed in a statistical study.

Intermediate Data Science Interview Questions

26. What is the ROC curve?

It stands for Receiver Operating Characteristic. It is basically a plot between a true

positive rate and a false positive rate, and it helps us to find out the right tradeoff

between the true positive rate and the false positive rate for different probability

thresholds of the predicted values. So, the closer the curve to the upper left corner, the

better the model is. In other words, whichever curve has the greater area under it, that would be the better model.

27. What do you understand by a decision tree?

A decision tree is a supervised learning algorithm that is used for both classification and

regression. Hence, in this case, the dependent variable can be both a numerical value

and a categorical value.


Here, each node denotes the test on an attribute, and each edge denotes the outcome

of that attribute, and each leaf node holds the class label. So, in this case, we have a

series of test conditions which give the final decision according to the condition.

Are you interested in learning Data Science from experts? Enroll in our Data Science

Course in Bangalore now!

28. What do you understand by a random forest model?

It combines multiple models together to get the final output or, to be more precise, it

combines multiple decision trees together to get the final output. So, decision trees are

the building blocks of the random forest model.

29. Two candidates, Aman and Mohan appear for a Data

Science Job interview. The probability of Aman cracking

the interview is 1/8 and that of Mohan is 5/12. What is the

probability that at least one of them will crack the

interview?

The probability of Aman getting selected for the interview is 1/8


P(A) = 1/8

The probability of Mohan getting selected for the interview is 5/12

P(B)=5/12

Now, the probability of at least one of them getting selected can be denoted as the union of A and B, which means

P(A U B) =P(A)+ P(B) – (P(A ∩ B)) ………………………(1)

Where P(A ∩ B) stands for the probability of both Aman and Mohan getting selected for

the job.

To calculate the final answer, we first have to find out the value of P(A ∩ B)

So, P(A ∩ B) = P(A) * P(B) = 1/8 * 5/12 = 5/96

Now, put the value of P(A ∩ B) into equation (1):

P(A U B) = P(A) + P(B) – P(A ∩ B) = 1/8 + 5/12 – 5/96

So, the answer will be 47/96.

30. How is Data modeling different from Database design?


Data Modeling: It can be considered as the first step towards the design of a database.

Data modeling creates a conceptual model based on the relationship between various

data models. The process involves moving from the conceptual stage to the logical

model to the physical schema. It involves the systematic method of applying data

modeling techniques.

Database Design: This is the process of designing the database. The database design

creates an output which is a detailed data model of the database. Strictly speaking,

database design includes the detailed logical model of a database but it can also

include physical design choices and storage parameters.

31. What is precision?

Precision: When we are implementing algorithms for the classification of data or the

retrieval of information, precision helps us get a portion of positive class values that are

positively predicted. Basically, it measures the accuracy of correct positive predictions.

Below is the formula to calculate precision:

Precision = True Positives / (True Positives + False Positives)

32. What is a recall?

Recall: It is the proportion of actual positive instances that are correctly predicted as positive. Recall helps us identify the positive instances that were misclassified. We use the below formula to calculate recall:

Recall = True Positives / (True Positives + False Negatives)


33. What is the F1 score and how to calculate it?

The F1 score helps us calculate the harmonic mean of precision and recall, which gives us the test’s accuracy. If F1 = 1, then precision and recall are both perfect. The closer F1 gets to 0, the less accurate precision or recall is (or both are). See below for the formula to calculate the F1 score:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

34. What is a p-value?

P-value is the measure of the statistical importance of an observation. It is the

probability that shows the significance of output to the data. We compute the p-value to

know the test statistics of a model. Typically, it helps us choose whether we can accept

or reject the null hypothesis.

35. Why do we use p-value?

We use the p-value to understand whether the given data really describes the observed

effect or not. We use the below formula to calculate the p-value for the effect ‘E’, given that the null hypothesis ‘H0’ is true:

p-value = P(E | H0 is true)


36. What is the difference between an error and a residual

error?

An error is the difference between the observed values and the true values of a dataset. The residual error, on the other hand, is the difference between the observed values and the predicted values. The reason we use

the residual error to evaluate the performance of an algorithm is that the true values are

never known. Hence, we use the observed values to measure the error using residuals.

It helps us get an accurate estimate of the error.

37. Why do we use the summary function?

The summary function in R gives us the statistics of the implemented algorithm on a

particular dataset. It consists of various objects, variables, data attributes, etc. It

provides summary statistics for individual objects when fed into the function. We use a

summary function when we want information about the values present in the dataset. It

gives us the summary statistics in the following form:


Here, it gives the minimum and maximum values from a specific column of the dataset.

Also, it provides the median, mean, 1st quartile, and 3rd quartile values that help us

understand the values better.

38. How are Data Science and Machine Learning related to

each other?

Data Science and Machine Learning are two terms that are closely related but are often

misunderstood. Both of them deal with data. However, there are some fundamental

distinctions that show us how they are different from each other.

Data Science is a broad field that deals with large volumes of data and allows us to

draw insights out of this voluminous data. The entire process of Data Science takes

care of multiple steps that are involved in drawing insights out of the available data. This

process includes crucial steps such as data gathering, data analysis, data manipulation,

data visualization, etc.

Machine Learning, on the other hand, can be thought of as a sub-field of Data Science.

It also deals with data, but here, we are solely focused on learning how to convert the

processed data into a functional model, which can be used to map inputs to outputs,

e.g., a model that can expect an image as an input and tell us if that image contains a

flower as an output.

In short, Data Science deals with gathering data, processing it, and finally, drawing

insights from it. The field of Data Science that deals with building models using

algorithms is called Machine Learning. Therefore, Machine Learning is an integral part

of Data Science.
39. Explain univariate, bivariate, and multivariate analyses.

When we are dealing with data analysis, we often come across terms such as

univariate, bivariate, and multivariate. Let’s try and understand what these mean.

● Univariate analysis: Univariate analysis involves analyzing data with only

one variable or, in other words, a single column or a vector of the data. This

analysis allows us to understand the data and extract patterns and trends

out of it. Example: Analyzing the weight of a group of people.

● Bivariate analysis: Bivariate analysis involves analyzing the data with exactly

two variables or, in other words, the data can be put into a two-column table.

This kind of analysis allows us to figure out the relationship between the

variables. Example: Analyzing the data that contains temperature and

altitude.

● Multivariate analysis: Multivariate analysis involves analyzing the data with

more than two variables. The number of columns of the data can be anything

more than two. This kind of analysis allows us to figure out the effects of all

other variables (input variables) on a single variable (the output variable).

Example: Analyzing data about house prices, which contains information about the

houses, such as locality, crime rate, area, the number of floors, etc.

40. How can we handle missing data?

To be able to handle missing data, we first need to know the percentage of data missing

in a particular column so that we can choose an appropriate strategy to handle the

situation.
For example, if in a column the majority of the data is missing, then dropping the column

is the best option, unless we have some means to make educated guesses about the

missing values. However, if the amount of missing data is low, then we have several

strategies to fill them up.

One way would be to fill them all up with a default value or a value that has the highest

frequency in that column, such as 0 or 1, etc. This may be useful if the majority of the

data in that column contains these values.

Another way is to fill up the missing values in the column with the mean of all the values

in that column. This technique is usually preferred as the missing values have a higher

chance of being closer to the mean than to the mode.

Finally, if we have a huge dataset and only a few rows have values missing in some columns, then the easiest and fastest way is to drop those rows. Since the dataset is large, dropping a few rows should not be a problem anyway.
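
A small illustrative sketch of these strategies with Pandas (the DataFrame and column names are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 29, np.nan],
    "city":   ["Pune", "Delhi", None, "Delhi", "Delhi"],
    "salary": [50000, 52000, np.nan, 61000, 58000],
})

print(df.isna().mean())                               # fraction of missing values per column

df["age"] = df["age"].fillna(df["age"].mean())        # fill with the column mean
df["city"] = df["city"].fillna(df["city"].mode()[0])  # fill with the most frequent value
df = df.dropna(subset=["salary"])                     # or drop the few incomplete rows

print(df)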

41. What is the benefit of dimensionality reduction?

Dimensionality reduction reduces the dimensions and size of the entire dataset. It drops

unnecessary features while retaining the overall information in the data intact.

Reduction in dimensions leads to faster processing of the data.

The reason why data with high dimensions is considered so difficult to deal with is that it

leads to high time consumption while processing the data and training a model on it.

Reducing dimensions speeds up this process, removes noise, and also leads to better

model accuracy.

42. What is a bias-variance trade-off in Data Science?


When building a model using Data Science or Machine Learning, our goal is to build

one that has low bias and variance. We know that bias and variance are both errors that

occur due to either an overly simplistic model or an overly complicated model.

Therefore, when we are building a model, the goal of getting high accuracy is only going

to be accomplished if we are aware of the tradeoff between bias and variance.

Bias is an error that occurs when a model is too simple to capture the patterns in a

dataset. To reduce bias, we need to make our model more complex. However, although making the model more complex can reduce bias, making it too complex causes it to become overly sensitive to the training data, leading to high variance. So, the tradeoff

between bias and variance is that if we increase the complexity, the bias reduces and

the variance increases, and if we reduce complexity, the bias increases and the

variance reduces. Our goal is to find a point at which our model is complex enough to

give low bias but not so complex to end up having high variance.

43. What is RMSE?

RMSE stands for the root mean square error. It is a measure of accuracy in regression.

RMSE allows us to calculate the magnitude of error produced by a regression model.

The way RMSE is calculated is as follows:

First, we calculate the errors in the predictions made by the regression model. For this,

we calculate the differences between the actual and the predicted values. Then, we

square the errors.

After this step, we calculate the mean of the squared errors, and finally, we take the

square root of the mean of these squared errors. This number is the RMSE, and a

model with a lower value of RMSE is considered to produce lower errors, i.e., the model

will be more accurate.
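
In formula form, for n predictions:

RMSE = sqrt( (1/n) × Σ (actualᵢ − predictedᵢ)² )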


44. What is a kernel function in SVM?

In the SVM algorithm, a kernel function is a special mathematical function. In simple

terms, a kernel function takes data as input and converts it into a required form. This

transformation of the data is based on something called a kernel trick, which is what

gives the kernel function its name. Using the kernel function, we can transform the data

that is not linearly separable (cannot be separated using a straight line) into one that is

linearly separable.

45. How can we select an appropriate value of k in

k-means?

Selecting the correct value of k is an important aspect of k-means clustering. We can

make use of the elbow method to pick the appropriate k value. To do this, we run the

k-means algorithm on a range of values, e.g., 1 to 15. For each value of k, we compute

a score. This score is also called inertia, i.e., the within-cluster sum of squares.

This is calculated as the sum of the squared distances of all points in a cluster from the cluster centre. As k

starts from a low value and goes up to a high value, we start seeing a sharp decrease in

the inertia value. After a certain value of k, in the range, the drop in the inertia value

becomes quite small. This is the value of k that we need to choose for the k-means

clustering algorithm.
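
A short sketch of the elbow method with scikit-learn (the synthetic data and the range 1 to 10 are arbitrary):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit k-means for a range of k values and record the inertia
# (the within-cluster sum of squares) for each.
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, round(km.inertia_, 1))  # choose the k where the drop in inertia flattens out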

46. How can we deal with outliers?

Outliers can be dealt with in several ways. One way is to drop them. We can only drop

the outliers if they have values that are incorrect or extreme. For example, if a dataset
with the weights of babies has a value 98.6-degree Fahrenheit, then it is incorrect. Now,

if the value is 187 kg, then it is an extreme value, which is not useful for our model.

In case the outliers are not that extreme, then we can try:

● A different kind of model. For example, if we were using a linear model, then

we can choose a non-linear model

● Normalizing the data, which will shift the extreme values closer to other data

points

● Using algorithms that are not so affected by outliers, such as random forest,

etc.

47. How to calculate the accuracy of a binary classification

algorithm using its confusion matrix?

In a binary classification algorithm, we have only two labels, which are True and False.

Before we can calculate the accuracy, we need to understand a few key terms:

● True positives: Number of observations correctly classified as True

● True negatives: Number of observations correctly classified as False

● False positives: Number of observations incorrectly classified as True

● False negatives: Number of observations incorrectly classified as False

To calculate the accuracy, we need to divide the sum of the correctly classified

observations by the total number of observations. This can be expressed as follows:

Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)

48. What is ensemble learning?


When we are building models using Data Science and Machine Learning, our goal is to

get a model that can understand the underlying trends in the training data and can

make predictions or classifications with a high level of accuracy.

However, sometimes some datasets are very complex, and it is difficult for one model to

be able to grasp the underlying trends in these datasets. In such situations, we combine

several individual models together to improve performance. This is what is called

ensemble learning.

49. Explain collaborative filtering in recommender systems.

Collaborative filtering is a technique used to build recommender systems. In this

technique, to generate recommendations, we make use of data about the likes and

dislikes of users similar to other users. This similarity is estimated based on several

varying factors, such as age, gender, locality, etc.

If User A, similar to User B, watched and liked a movie, then that movie will be

recommended to User B, and similarly, if User B watched and liked a movie, then that

would be recommended to User A.

In other words, the content of the movie does not matter much. When recommending it

to a user what matters is if other users similar to that particular user liked the content of

the movie or not.

50. Explain content-based filtering in recommender

systems.
Content-based filtering is one of the techniques used to build recommender systems. In

this technique, recommendations are generated by making use of the properties of the

content that a user is interested in.

For example, if a user is watching movies belonging to the action and mystery genre

and giving them good ratings, it is a clear indication that the user likes movies of this

kind. If shown movies of a similar genre as recommendations, there is a higher

probability that the user would like those recommendations as well.

In other words, here, the content of the movie is taken into consideration when

generating recommendations for users.

51. Explain bagging in Data Science.

Bagging is an ensemble learning method. It stands for bootstrap aggregating. In this

technique, we generate some data using the bootstrap method, in which we use an

already existing dataset and generate multiple samples of size N. This bootstrapped

data is then used to train multiple models in parallel, which makes the bagging model

more robust than a simple model.

Once all the models are trained, when it’s time to make a prediction, we make

predictions using all the trained models and then average the result in the case of

regression, and for classification, we choose the result, generated by models, that have

the highest frequency.

52. Explain boosting in Data Science.


Boosting is one of the ensemble learning methods. Unlike bagging, it is not a technique

used to parallelly train our models. In boosting, we create multiple models and

sequentially train them by combining weak models iteratively in a way that training a

new model depends on the models trained before it.

In doing so, we take the patterns learned by a previous model and test them on a

dataset when training the new model. In each iteration, we give more importance to

observations in the dataset that are incorrectly handled or predicted by previous

models. Boosting is useful in reducing bias in models as well.

53. Explain stacking in Data Science.

Just like bagging and boosting, stacking is also an ensemble learning method. In

bagging and boosting, we could only combine weak models that used the same learning

algorithms, e.g., logistic regression. These models are called homogeneous learners.

However, in stacking, we can combine weak models that use different learning

algorithms as well. These learners are called heterogeneous learners. Stacking works

by training multiple (and different) weak models or learners and then using them

together by training another model, called a meta-model, to make predictions based on

the multiple outputs of predictions returned by these multiple weak models.

54. Explain how Machine Learning is different from Deep

Learning.

A field of computer science, Machine Learning is a subfield of Data Science that deals

with using existing data to help systems automatically learn new skills to perform

different tasks without having rules to be explicitly programmed.


Deep Learning, on the other hand, is a field in Machine Learning that deals with building

Machine Learning models using algorithms that try to imitate the process of how the

human brain learns from the information in a system for it to attain new capabilities. In

Deep Learning, we make heavy use of deeply connected neural networks with many

layers.

55. What does the word ‘Naive’ mean in Naive Bayes?

Naive Bayes is a Data Science algorithm. It has the word ‘Bayes’ in it because it is

based on the Bayes theorem, which deals with the probability of an event occurring

given that another event has already occurred.

It has ‘naive’ in it because it makes the assumption that each variable in the dataset is

independent of the other. This kind of assumption is unrealistic for real-world data.

However, even with this assumption, it is very useful for solving a range of complicated

problems, e.g., spam email classification, etc.

56. From the below given ‘diamonds’ dataset, extract only

those rows where the ‘price’ value is greater than 1000 and

the ‘cut’ is ideal.


First, we will load the ggplot2 package:

library(ggplot2)

Next, we will use the dplyr package:

library(dplyr) # dplyr is based on the grammar of data manipulation

To extract those particular records, use the below command:

diamonds %>% filter(price > 1000 & cut == "Ideal") -> diamonds_1000_ideal

57. Make a scatter plot between ‘price’ and ‘carat’ using

ggplot. ‘Price’ should be on the y-axis, ’carat’ should be on

the x-axis, and the ‘color’ of the points should be

determined by ‘cut.’

We will implement the scatter plot using ggplot.

The ggplot is based on the grammar of data visualization, and it helps us stack multiple

layers on top of each other.


So, we will start with the data layer, and on top of the data layer we will stack the

aesthetic layer. Finally, on top of the aesthetic layer we will stack the geometry layer.

Code:

ggplot(data = diamonds, aes(x = carat, y = price, col = cut)) + geom_point()

58. Introduce 25 percent missing values in this ‘iris’ dataset

and impute the ‘Sepal.Length’ column with ‘mean’ and the

‘Petal.Length’ column with ‘median.’

To introduce missing values, we will be using the missForest package:

library(missForest)

Using the prodNA function, we will be introducing 25 percent of missing values:

iris.mis <- prodNA(iris, noNA = 0.25)

For imputing the ‘Sepal.Length’ column with ‘mean’ and the ‘Petal.Length’ column with

‘median,’ we will be using the Hmisc package and the impute function:
library(Hmisc)
iris.mis$Sepal.Length<-with(iris.mis, impute(Sepal.Length,mean))
iris.mis$Petal.Length<-with(iris.mis, impute(Petal.Length,median))

59. Implement simple linear regression in R on this ‘mtcars’

dataset, where the dependent variable is ‘mpg’ and the

independent variable is ‘disp.’

Here, we need to find how ‘mpg’ varies w.r.t. the displacement column (‘disp’).

We need to divide this data into the training dataset and the testing dataset so that the

model does not overfit the data.

So, what happens is when we do not divide the dataset into these two components, it

overfits the dataset. Hence, when we add new data, it fails miserably on that new data.

Therefore, to divide this dataset, we would require the caret package. This caret

package comprises the createDataPartition() function. This function will give the true or

false labels.
Here, we will use the following code:
library(caret)

split_tag<-createDataPartition(mtcars$mpg, p=0.65, list=F)

mtcars[split_tag,]->train

mtcars[-split_tag,]->test

lm(mpg ~ disp, data = train) -> mod_mtcars

predict(mod_mtcars,newdata=test)->pred_mtcars

>head(pred_mtcars)

Explanation:

Parameters of the createDataPartition function: First is the column which determines the

split (it is the mpg column).

Second is the split ratio which is 0.65, i.e., 65 percent of records will have true labels

and 35 percent will have false labels. We will store this in a split_tag object.

Once we have split_tag object ready, from this entire mtcars dataframe, we will select all

those records where the split tag value is true and store those records in the training

set.

Similarly, from the mtcars dataframe, we will select all those record where the split_tag

value is false and store those records in the test set.

So, the split tag will have true values in it, and when we put ‘-’ symbol in front of it,

‘-split_tag’ will contain all of the false labels. We will select all those records and store

them in the test set.


We will go ahead and build a model on top of the training set, and for the simple linear

model we will require the lm function.

lm(mpg ~ disp, data = train) -> mod_mtcars

Now, we have built the model on top of the train set. It’s time to predict the values on top

of the test set. For that, we will use the predict function that takes in two parameters:

first is the model which we have built and second is the dataframe on which we have to

predict values.

Thus, we have to predict values for the test set and then store them in pred_mtcars.

predict(mod_mtcars,newdata=test)->pred_mtcars

Output:

These are the predicted values of mpg for all of these cars.

So, this is how we can build a simple linear model on top of this mtcars dataset.

60. Calculate the RMSE values for the model building.


When we build a regression model, it predicts certain y values associated with the given

x values, but there is always an error associated with this prediction. So, to get an

estimate of the average error in prediction, RMSE is used. Code:


cbind(Actual=test$mpg, predicted=pred_mtcars)->final_data

as.data.frame(final_data)->final_data

error <- (final_data$Actual - final_data$predicted)

cbind(final_data,error)->final_data

sqrt(mean((final_data$error)^2))

Explanation: We have the actual and the predicted values. We will bind both of them

into a single dataframe. For that, we will use the cbind function:

cbind(Actual=test$mpg, predicted=pred_mtcars)->final_data

Our actual values are present in the mpg column from the test set, and our predicted

values are stored in the pred_mtcars object which we have created in the previous

question. Hence, we will create this new column and name the column actual. Similarly,

we will create another column and name it predicted which will have predicted values

and then store the predicted values in the new object which is final_data. After that, we

will convert a matrix into a dataframe. So, we will use the as.data.frame function and

convert this object (predicted values) into a dataframe:

as.data.frame(final_data)->final_data

We will pass this object which is final_data and store the result in final_data again. We

will then calculate the error in prediction for each of the records by subtracting the

predicted values from the actual values:

error<-(final_data$Actual-final_data$predicted)
Then, store this result on a new object and name that object as error. After this, we will

bind this error calculated to the same final_data dataframe:

cbind(final_data,error)->final_data //binding error object to this


final_data

Here, we bind the error object to this final_data, and store this into final_data again.

Calculating RMSE:

sqrt(mean(final_data$error^2))

Output:

[1] 4.334423

Note: The lower the RMSE value, the better the model.

61. Implement simple linear regression in Python on this

‘Boston’ dataset where the dependent variable is ‘medv’

and the independent variable is ‘lstat.’

Simple Linear Regression


import pandas as pd

data=pd.read_csv('Boston.csv')   # loading the Boston dataset

data.head()   # having a glance at the head of this data

data.shape

Let us take out the dependent and the independent variables from the dataset:
data1=data.loc[:,['lstat','medv']]
data1.head()

Visualizing Variables
import matplotlib.pyplot as plt

data1.plot(x='lstat',y='medv',style='o')

plt.xlabel('lstat')

plt.ylabel('medv')

plt.show()

Here, ‘medv’ is basically the median value of the house prices, and we are trying to model this median house price with respect to the ‘lstat’ column.

We will separate the dependent and the independent variable from this entire

dataframe:

data1=data.loc[:,['lstat','medv']]

The only columns we want from all of this record are ‘lstat’ and ‘medv,’ and we need to

store these results in data1.

Now, we would also do a visualization w.r.t to these two columns:


import matplotlib.pyplot as plt

data1.plot(x='lstat',y='medv',style='o')

plt.xlabel('lstat')

plt.ylabel('medv')

plt.show()

Preparing the Data


X=pd.DataFrame(data1['lstat'])

y=pd.DataFrame(data1['medv'])

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

from sklearn.linear_model import LinearRegression

regressor=LinearRegression()

regressor.fit(X_train,y_train)
print(regressor.intercept_)

Output :

34.12654201
print(regressor.coef_)   # this is the slope

Output :

[[-0.913293]]

By now, we have built the model. Now, we have to predict the values on top of the test

set:
y_pred=regressor.predict(X_test)   # use the predict method on the X_test object and store the result in y_pred

Now, let’s have a glance at the rows and columns of the actual values and the predicted

values:

y_pred.shape, y_test.shape

Output :
((102,1),(102,1))

Further, we will go ahead and calculate some metrics so that we can find out the Mean

Absolute Error, Mean Squared Error, and RMSE.


from sklearn import metrics
import numpy as np

print('Mean Absolute Error: ', metrics.mean_absolute_error(y_test, y_pred))

print('Mean Squared Error: ', metrics.mean_squared_error(y_test, y_pred))

print('Root Mean Squared Error: ', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Output:
Mean Absolute Error: 4.692198

Mean Squared Error: 43.9198

Root Mean Squared Error: 6.6270

62. Implement logistic regression on this ‘heart’ dataset in

R where the dependent variable is ‘target’ and the

independent variable is ‘age.’


For loading the dataset, we will use the read.csv function:
read.csv("D:/heart.csv")->heart

str(heart)

In the structure of this dataframe, most of the values are integers. However, since we

are building a logistic regression model on top of this dataset, the final target column is

supposed to be categorical. It cannot be an integer. So, we will go ahead and convert

them into a factor.

Thus, we will use the as.factor function and convert these integer values into categorical

data.

We will pass on heart$target column over here and store the result in heart$target as

follows:

as.factor(heart$target)->heart$target

Now, we will build a logistic regression model and see the different probability values for

the person to have heart disease on the basis of different age values.

To build a logistic regression model, we will use the glm function:

glm(target~age, data=heart, family="binomial")->log_mod1


Here, target~age indicates that the target is the dependent variable and the age is the

independent variable, and we are building this model on top of the dataframe.

family="binomial" means we are basically telling R that this is a logistic regression model, and we will store the result in log_mod1.

We will have a glance at the summary of the model that we have just built:

summary(log_mod1)

We can see the Pr(>|z|) value here, and there are three stars associated with it. This means that we can reject the null hypothesis, which states that there is no relationship between the age and the target columns. In other words, there is a strong relationship between the age column and the target column.

Now, we have other parameters like null deviance and residual deviance. The lower the deviance value, the better the model.


This null deviance basically tells the deviance of the model, i.e., when we don’t have

any independent variable and we are trying to predict the value of the target column

with only the intercept. When that’s the case, the null deviance is 417.64.

Residual deviance is the deviance when we include the independent variables and try to predict the target column. Hence, when we include the independent variable, age, we see that the residual deviance drops: initially, with no independent variables, the null deviance was 417; after we include the age column, the deviance is reduced to 401.

This basically means that there is a strong relationship between the age column and the

target column and that is why the deviance is reduced.

As we have built the model, it’s time to predict some values:


predict(log_mod1, data.frame(age=30), type="response")

predict(log_mod1, data.frame(age=50), type="response")

predict(log_mod1, data.frame(age=29:77), type="response")

Now, we will divide this dataset into train and test sets and build a model on top of the

train set and predict the values on top of the test set:
library(caret)

split_tag<-createDataPartition(heart$target, p=0.70, list=F)

heart[split_tag,]->train

heart[-split_tag,]->test

glm(target~age, data=train, family="binomial")->log_mod2
predict(log_mod2, newdata=test, type="response")->pred_heart

range(pred_heart)

63. Build an ROC curve for the model built

The below code will help us in building the ROC curve:


library(ROCR)

prediction(pred_heart, test$target)-> roc_pred_heart

performance(roc_pred_heart, "tpr", "fpr")->roc_curve

plot(roc_curve, colorize=T)

64. Build a confusion matrix for the model where the

threshold value for the probability of predicted values is

0.6, and also find the accuracy of the model.

Accuracy is calculated as:

Accuracy = (True positives + true negatives)/(True positives+ true negatives + false

positives + false negatives)

To build a confusion matrix in R, we will use the table function:

table(test$target,pred_heart>0.6)

Here, we are setting the probability threshold as 0.6. So, wherever the probability of pred_heart is greater than 0.6, the record is classified as 1, and wherever it is less than 0.6, it is classified as 0.

Then, we calculate the accuracy using the formula above, i.e., the sum of the diagonal entries of this table (the correct predictions) divided by the total number of observations.


65. Build a logistic regression model on the

‘customer_churn’ dataset in Python. The dependent

variable is ‘Churn’ and the independent variable is

‘MonthlyCharges.’ Find the log_loss of the model.

First, we will import pandas and load the customer_churn.csv file:

import pandas as pd
customer_churn=pd.read_csv("customer_churn.csv")

After loading this dataset, we can have a glance at the head of the dataset by using the

following command:

customer_churn.head()

Now, we will separate the dependent and the independent variables into two separate

objects:
x=pd.DataFrame(customer_churn['MonthlyCharges'])

y=customer_churn['Churn']

#Splitting the data into training and testing sets

from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=0.3,
random_state=0)

Now, we will see how to build the model and calculate log_loss.
from sklearn.linear_model import LogisticRegression

l=LogisticRegression()

l.fit(x_train,y_train)

y_pred=l.predict_proba(x_test)

As we are supposed to calculate the log_loss, we will import it from sklearn.metrics:


from sklearn.metrics import log_loss

print(log_loss(y_test,y_pred))   # actual values are in y_test and predicted values are in y_pred

Output:

0.5555020595194167


66. Build a decision tree model on ‘Iris’ dataset where the

dependent variable is ‘Species,’ and all other columns are

independent variables. Find the accuracy of the model

built.
To build a decision tree model, we will be loading the party package:
#party package

library(party)

#splitting the data

library(caret)

split_tag<-createDataPartition(iris$Species, p=0.65, list=F)

iris[split_tag,]->train

iris[-split_tag,]->test

#building model

mytree<-ctree(Species~.,train)

Now we will plot the model

plot(mytree)

#predicting the values

predict(mytree,test,type='response')->mypred

After this, we will build the confusion matrix using the table function and then calculate the accuracy as the sum of its diagonal entries (the correct predictions) divided by the total number of test observations:

table(test$Species, mypred)
67. Build a random forest model on top of this ‘CTG’

dataset, where ‘NSP’ is the dependent variable and all other

columns are independent variables.

We will load the CTG dataset by using read.csv:


data<-read.csv("C:/Users/intellipaat/Downloads/CTG.csv", header=TRUE)

str(data)

Converting the integer type to a factor


data$NSP<-as.factor(data$NSP)

table(data$NSP)

#data partition

set.seed(123)

split_tag<-createDataPartition(data$NSP, p=0.65, list=F)


data[split_tag,]->train

data[-split_tag,]->test

#random forest -1

library(randomForest)

set.seed(222)

rf<-randomForest(NSP~.,data=train)

rf

#prediction

predict(rf,test)->p1

Building the confusion matrix and calculating the accuracy (again, the sum of the diagonal entries of the table divided by the total number of test observations):

table(test$NSP,p1)



68. Write a function to calculate the Euclidean distance

between two points.

The formula for calculating the Euclidean distance between two points (x1, y1) and (x2,

y2) is as follows:

√(((x1 - x2) ^ 2) + ((y1 - y2) ^ 2))

Code for calculating the Euclidean distance is as given below:


def euclidean_distance(P1, P2):
return (((P1[0] - P2[0]) ** 2) + ((P1[1] - P2[1]) ** 2)) ** .5

69. Write code to calculate the root mean square error

(RMSE) given the lists of values as actual and predicted.

To calculate the root mean square error (RMSE), we have to:

1. Calculate the errors, i.e., the differences between the actual and the

predicted values

2. Square each of these errors

3. Calculate the mean of these squared errors

4. Return the square root of the mean

The code in Python for calculating RMSE is given below:


def rmse(actual, predicted):
errors = [abs(actual[i] - predicted[i]) for i in range(0,
len(actual))]
squared_errors = [x ** 2 for x in errors]
mean = sum(squared_errors) / len(squared_errors)
return mean ** .5

70. Mention the different kernel functions that can be used

in SVM.

In SVM, there are four commonly used kernel functions:

● Linear kernel

● Polynomial kernel

● Radial basis kernel

● Sigmoid kernel

71. How to detect if the time series data is stationary?

Time series data is considered stationary when its mean and variance are constant over time. If the mean and variance do not change over a period of time in the dataset, then we can draw the conclusion that, for that period, the data is stationary. A formal statistical check, such as the Augmented Dickey-Fuller (ADF) test, can also be used, as sketched below.
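As a rough sketch (not part of the original answer), the Augmented Dickey-Fuller test from the statsmodels library is one common way to check stationarity in Python; the series below is synthetic and only for illustration:

# a minimal sketch: ADF test on a synthetic random walk (which is non-stationary)
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

series = pd.Series(np.random.randn(200)).cumsum()   # replace with your own time series
adf_stat, p_value = adfuller(series)[:2]
print('ADF statistic:', adf_stat, 'p-value:', p_value)
# a small p-value (e.g., < 0.05) lets us reject the null hypothesis of non-stationarity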

72. Write code to calculate the accuracy of a binary

classification algorithm using its confusion matrix.

We can use the code given below to calculate the accuracy of a binary classification

algorithm:
def accuracy_score(matrix):
true_positives = matrix[0][0]
true_negatives = matrix[1][1]
total_observations = sum(matrix[0]) + sum(matrix[1])
return (true_positives + true_negatives) / total_observations

73. What does root cause analysis mean?

Root cause analysis is the process of figuring out the root causes that lead to certain

faults or failures. A factor is considered to be a root cause if, after eliminating it, a

sequence of operations, leading to a fault, error, or undesirable result, ends up working

correctly. Root cause analysis is a technique that was initially developed and used in the

analysis of industrial accidents, but now, it is used in a wide variety of areas.

74. What is A/B testing?

A/B testing is a kind of statistical hypothesis testing for randomized experiments with

two variables. These variables are represented as A and B. A/B testing is used when

we wish to test a new feature in a product. In the A/B test, we give users two variants of

the product, and we label these variants as A and B.

The A variant can be the product with the new feature added, and the B variant can be

the product without the new feature. After users use these two products, we capture

their ratings for the product.

If the rating of product variant A is statistically significantly higher, then the new feature is considered a useful improvement and is accepted. Otherwise, the new feature is removed from the product.

75. Out of collaborative filtering and content-based

filtering, which one is considered better, and why?

Content-based filtering is considered to be better than collaborative filtering for

generating recommendations. It does not mean that collaborative filtering generates bad

recommendations.

However, as collaborative filtering is based on the likes and dislikes of other users, we cannot rely on it as much. Also, users’ likes and dislikes may change in the future.

For example, there may be a movie that a user likes right now but did not like 10 years

ago. Moreover, users who are similar in some features may not have the same taste in

the kind of content that the platform provides.

In the case of content-based filtering, we make use of users’ own likes and dislikes that

are much more reliable and yield more positive results. This is why platforms such as

Netflix, Amazon Prime, Spotify, etc. make use of content-based filtering for generating

recommendations for their users.

76. In the following confusion matrix, calculate precision

and recall.

Total = 510          Actual P    Actual N

Predicted P          156         11
Predicted N          16          327

The formulae for precision and recall are given below.


Precision:
(True Positive) / (True Positive + False Positive)
Recall:
(True Positive) / (True Positive + False Negative)
Based on the given data, precision and recall are:
Precision: 156 / (156 + 11) = 0.934, i.e., 93.4%
Recall: 156 / (156 + 16) = 0.907, i.e., 90.7%

77. Write a function that when called with a confusion

matrix for a binary classification model returns a dictionary

with its precision and recall.

We can use the below for this purpose:


def calculate_precision_and_recall(matrix):
true_positive = matrix[0][0]
false_positive = matrix[0][1]
false_negative = matrix[1][0]
return {
'precision': (true_positive) / (true_positive + false_positive),
'recall': (true_positive) / (true_positive + false_negative)
}

78. What is reinforcement learning?

Reinforcement learning is a kind of Machine Learning that is concerned with building software agents that perform actions so as to maximize the cumulative reward.
A reward here is used for letting the model know (during training) if a particular action

leads to the attainment of or brings it closer to the goal. For example, if we are creating

an ML model that plays a video game, the reward is going to be either the points

collected during the play or the level reached in it.

Reinforcement learning is used to build these kinds of agents that can make real-world

decisions that should move the model toward the attainment of a clearly defined goal.

79. Explain TF/IDF vectorization.

The expression ‘TF/IDF’ stands for Term Frequency–Inverse Document Frequency. It is

a numerical measure that allows us to determine how important a word is to a document

in a collection of documents called a corpus. TF/IDF is used often in text mining and

information retrieval.
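As an illustrative sketch (the two-document corpus below is made up), scikit-learn's TfidfVectorizer can be used to compute TF/IDF weights:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["data science is fun", "machine learning is part of data science"]   # toy corpus
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)        # rows = documents, columns = terms
print(vectorizer.get_feature_names_out())              # the learned vocabulary
print(tfidf_matrix.toarray())                          # TF-IDF weight of each term in each document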

80. What are the assumptions required for linear

regression?

There are several assumptions required for linear regression. They are as follows:

● The data, which is a sample drawn from a population, used to train the

model should be representative of the population.

● The relationship between independent variables and the mean of dependent

variables is linear.

● The variance of the residual is going to be the same for any value of an

independent variable. It is also represented as X.

● Each observation is independent of all other observations.


● For any value of an independent variable, the dependent variable is

normally distributed.

81. What happens when some of the assumptions required

for linear regression are violated?

These assumptions may be violated lightly (i.e., some minor violations) or strongly (i.e.,

the majority of the data has violations). Both of these violations will have different effects

on a linear regression model.

Strong violations of these assumptions make the results unreliable and essentially invalid. Light violations of these assumptions result in greater bias or variance in the estimates.

Most Commonly Asked Data Scientist


Interview Questions and Answers
Here’s a list of frequently asked basic-level questions at data science interviews:

1. Explain the differences between big data and data


science.
Data science is an interdisciplinary field that looks at analytical aspects of data and
involves statistics, data mining, and machine learning principles. Data scientists use
these principles to obtain accurate predictions from raw data. Big data works with a
large collection of data sets and aims to solve problems pertaining to data management
and handling for informed decision-making.

2. There are missing random values in a data set. How


will you deal with it?
This can be resolved by partitioning the available data into one set with missing values and another with non-missing values, and then using the non-missing part to impute the missing values (for example, with the mean, median, mode, or a simple predictive model).

3. Define fsck.
It is an abbreviation for “file system check.” This command can be used to search for possible errors in the file system.

4. Explain the different techniques used for sampling


data.
There are two major techniques:

● Probability Sampling techniques: Clustered sampling, Simple random sampling,


Stratified sampling.
● Non-Probability Sampling techniques: Quota sampling, Convenience sampling,
snowball sampling

5. Describe the different types of deep learning


modules.
The most common frameworks are:

● Pytorch
● Microsoft Cognitive Toolkit
● TensorFlow
● Caffe
● Chainer
● Keras

6. What is cross-validation?
Cross-validation is a statistical technique used to assess how well a model generalizes: the data is repeatedly split into training and validation folds, and the model’s performance is averaged across the folds. This is helpful for estimating how the model will behave on unknown data.

7. Explain the differences between a test set and a


validation set.
A Test set is used to test and evaluate the trained model's performance. In contrast, a
validation set is part of the training set used for selecting different parameters to avoid
model overfitting.

8. Explain regression data set.


It refers to the data set directory, which contains test data for linear regression. Taking a
set of data (xi,yi) to determine the ideal linear relationship is the simplest type of
regression.

9. How will you explain linear regression to a non-tech


person?
Linear Regression refers to a statistical technique that measures the linear relationship between two variables: a change in one variable is associated with a proportional change in the other, so knowing one variable helps us predict the other.

10. Why is data cleansing important?


Data cleansing allows you to sift through all the data within a database and remove or
update information that is incomplete, incorrect, or irrelevant. It is important as it
improves the data quality.

Recommended Reading: How to Create an Impressive Data Scientist Resume

Popular Data Science Interview Questions and


Answers at FAANG+ Companies
Probability and statistics are widely used throughout the career of a data scientist.
Therefore, these topics are a crucial part of the interview process for Data Scientists at
every company. At FAANG, these topics have a dedicated interview round.

Following are examples of probability and statistics problems that are frequently asked
at FAANG+ companies:

1. The “choose a door” problem


In the problem, you are on a game show, being asked to choose between three doors.
Behind each door, there is either a car or a goat. You choose a door. The host, Monty
Hall, picks one of the other doors, which he knows has a goat behind it, and opens it,
showing you the goat. (You know, by the rules of the game, that Monty will always
reveal a goat.) Monty then asks whether you would like to switch your choice of door to
the other remaining door. Assuming you prefer having a car more than having a goat,
do you choose to switch or not to switch?

1. Switch
2. Won’t switch
3. Can’t conclude

Solution:

Here, we have three possible cases:

(Assume your initial choice is Door 1.)

Door 1    Door 2    Door 3    If you switch the door    If the door is not switched
Goat      Car       Goat      Win                       Lose
Goat      Goat      Car       Win                       Lose
Car       Goat      Goat      Lose                      Win
If you switch the door, you are more likely to win (i.e., with a 2/3 probability)
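A quick Monte Carlo simulation (a sketch added here for illustration, not part of the original solution) confirms the 2/3 win probability for switching:

import random

def monty_hall(switch, trials=100000):
    wins = 0
    for _ in range(trials):
        car = random.randint(0, 2)        # door hiding the car
        choice = random.randint(0, 2)     # contestant's first pick
        # the host opens a goat door that is neither the pick nor the car
        opened = next(d for d in range(3) if d != choice and d != car)
        if switch:
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == car)
    return wins / trials

print(monty_hall(switch=True))    # ~0.667
print(monty_hall(switch=False))   # ~0.333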
2. The “fair coin” problem
A coin was flipped 1000 times, and there were 560 heads. For this scenario, develop
the hypothesis to test whether the coin is fair or not.

Solution:

Let’s assume that the probability of a head in the coin toss is p. We need to test if p is
0.5 or not.

● Null Hypothesis: p = 0.5


● Alternate Hypothesis: p ≠ 0.5

Using the Central Limit Theorem, we can approximate the total number of heads as
normally distributed (since 1000 is a large sample size).

The number of heads X in the n (=1000) trials follows a binomial distribution: the probability of getting x (=560) heads is P(X = x) = C(n, x) * p^x * (1 − p)^(n − x).

So, the expected number of heads if the null hypothesis is true (i.e., p = 0.5) is n*p = 1000*0.5 = 500.

Similarly, the standard deviation of the number of heads is sqrt(n*p*(1 − p)) = sqrt(1000*0.5*0.5) ≈ 15.81.
Now, since we know that number of heads can be approximated as a normal


distribution, we can check how our actual number of heads or sample mean (i.e., 560) is
away from the actual mean or population mean (i.e., 500) considering the null
hypothesis (p=0.5) is true. We can do that by calculating the z-score:

z-score = (observed value − population mean) / standard deviation of the population

For our case: z = (560 − 500) / 15.81 ≈ 3.79
99.73% of a normal distribution lies within 3 standard deviations of the mean, and the z-score shows that the observed count is about 3.79 standard deviations away from the mean. Hence, there is far less than a 1% chance of seeing a result this extreme if the coin were fair, so we reject the null hypothesis and conclude that the coin is biased.
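The same test can be sketched in Python (a two-sided z-test for a proportion; scipy is assumed to be available):

import math
from scipy.stats import norm

n, heads, p0 = 1000, 560, 0.5
expected = n * p0                          # 500
sd = math.sqrt(n * p0 * (1 - p0))          # ~15.81
z = (heads - expected) / sd                # ~3.79
p_value = 2 * norm.sf(abs(z))              # two-sided p-value
print(z, p_value)                          # p-value << 0.05, so we reject the null hypothesis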

3. The elevator problem


Eight people enter an elevator in a building with ten floors. What is the expected number
of stoppings?

Solution:

There is no assumption about where (specific floor) and when (together or separately)
people get on the elevator.

Probability of a person getting off at a specific floor (out of 10) = 1/10

Probability of a person not getting off at a specific floor = 1 - 1/10 = 9/10

Probability that none of the 8 people gets off at a specific floor = (9/10)^8

Probability that the elevator stops at a specific floor = 1 - (9/10)^8

By linearity of expectation, the expected number of stops = 10 × (1 - (9/10)^8) ≈ 5.7
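A short simulation (a sketch, assuming each person independently and uniformly picks one of the 10 floors) agrees with this value:

import random

def expected_stops(people=8, floors=10, trials=100000):
    total = 0
    for _ in range(trials):
        chosen = {random.randint(1, floors) for _ in range(people)}   # distinct floors requested
        total += len(chosen)
    return total / trials

print(expected_stops())   # ~5.7, matching 10 * (1 - (9/10)**8)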

4. The “coin toss” problem


A fair coin is tossed 10 times; given that there were 4 heads in the 10 tosses, what is
the probability that the first toss was heads?

Solution:

Apply Bayes’ Theorem to solve the problem:

P(first toss is heads | 4 heads in 10) = P(4 heads in 10 | first toss is heads) × P(first toss is heads) / P(4 heads in 10)
= [C(9, 3) × (1/2)^9 × (1/2)] / [C(10, 4) × (1/2)^10]
= C(9, 3) / C(10, 4) = 84/210 = 0.4

5. Find the distribution of the sum of two random
numbers.
You have two independent, identical, uniformly distributed random variables x and y
ranging between 0 and 1. What distribution does the sum of these two random
numbers follow? What is the probability that their product is less than 0.5?

Solution: The sum of two independent Uniform(0, 1) random variables is neither uniform nor normal; it follows a triangular distribution on [0, 2], peaking at 1.

A quick way to check if the probability of the product of X(0,1) and Y(0,1) is less than
0.5 is to visualize a 2-dimensional plane. All the points (x,y) within the square [0, 1] x [0,
1] fall in the candidate space.

The case xy = 0.5 traces the curve y = 0.5/x; the area of the region below this curve (within the square) represents the cases for which xy <= 0.5. Since the area of the square is 1, that area equals the sought probability.

The curve intersects the square at (0.5, 1) and (1, 0.5), so the probability is 0.5 + ∫ from 0.5 to 1 of (0.5/x) dx = 0.5 + 0.5·ln(2) ≈ 0.85.
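A small Monte Carlo check (a sketch added for illustration) of the product probability:

import random

trials = 1000000
count = sum(1 for _ in range(trials) if random.random() * random.random() < 0.5)
print(count / trials)   # ~0.85, i.e., 0.5 + 0.5 * ln(2)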


6. Increase the conversion on an e-commerce website
There are a few ideas to increase the conversion on an e-commerce website, such as
enabling multiple-items checkout (currently, users can check out one item at a time),
allowing non-registered users to checkout, changing the size and color of the
“Purchase” button, etc. How do you select which idea to invest in?

Solution:

This is an open-ended question based on A/B Testing. It is a vanilla version of the type.
The decision of which program to invest in depends on the A/B test results we get from
the available options. Please pay close attention to the final goal (improved conversion
at checkout), as this also determines the metrics of interest. To answer such questions,
usually approach in the following order:

1. Identify the metric for tracking


2. Explain how to randomize and what your samples are exactly
3. Construct null and alternative hypotheses
4. Keep the test statistics in mind
5. How to draw conclusions from the test statistic computations
6. Follow-up analysis

7. What are the effects of outliers in linear regression?


How to deal with outliers?
Solution:

Linear regression is sensitive to outliers. Since linear regression minimizes the sum of squared errors across all observations, the fit will shift to accommodate an outlier when one is present, which is what makes the fit sensitive to outliers.

To deal with outliers, one needs to identify whether the outlier is a valid datapoint or not.
If it is due to data collection issues, simply remove the invalid outlier datapoint. If the
datapoint is valid, try to understand how common the valid datapoint is. Data
transformation and fitting a separate model for the outliers might need to be done for
that case.

8. How do you decide if a feature is important in a


linear regression model?
Solution: A t-test can be done for the coefficients of the linear regression model, i.e., testing H0: βj = 0 against H1: βj ≠ 0 using the statistic t = (estimated βj) / SE(βj).
In other words, the T-test will determine whether the jth feature has a statistically
significant non-zero coefficient in the model. Generally, a non-zero coefficient feature is
considered to be important for the model.

Alternatively, Lasso Regression can be used to identify significant features. The ones
with coefficients not sent to zero by the Lasso Regression are considered to be
important.

9. What can be done if data visualization clearly


indicates that the relationship between dependent
variable y and independent variable x is not linear?
Solution: Transform the data so that the relationship becomes approximately linear, for example by applying a log, square-root, or polynomial transformation to x (or to y), or switch to a model that can capture non-linear relationships, such as polynomial regression, splines, or tree-based methods.

10. Is R^2 = 1 good, the larger, the better?


Solution: Not necessarily. A larger R^2 means the model explains more of the variance in the training data, but R^2 = 1 (or a value very close to it) usually signals overfitting or data leakage rather than a genuinely better model. Out-of-sample performance, adjusted R^2, or cross-validation should be used to judge the model.
In the following sections, we’ll cover some more sample interview questions asked at
FAANG+ companies.

Amazon Data Scientist Interview Questions


Being one of the biggest data-driven companies, Amazon is constantly looking for
expert data scientists. If you’re preparing for a data scientist interview at Amazon, the
following are some sample questions you can practice:

1. Create a Python code that can recognize whether entries to a list have common
characters or not.
2. Suppose you have an array of integers. You have been asked to find a certain
element. What is the algorithm you would use, and what is its efficacy?
3. In the case of a long-sorted and short-sorted list, what algorithm would you use
to search the long list for the 4 elements?
4. Tell us about an instance where you applied machine learning to resolve
ambiguous business problems.
5. If you have categorical variables and there are thousands of distinct values, how
will you encode them?
6. Define LSTM. How have you used it?
7. Enumerate the difference between bagging and boosting.
8. How does 1D CNN work?
9. Differentiate between linear regression and a t-test?
10. How will you locate the customer who has the highest total order cost between
2020-02-02 to 2020-05-06? You can assume that every first name in the dataset
is unique.
11. Take us through the steps of the cold-start problem in a recommender system?
12. Discuss the steps of building a forecasting model.
13. How will you create an AB test for a marketing campaign?
14. What are Markov chains?
15. What is root cause analysis?
Recommended Reading: Amazon Data Scientist Salary

Facebook Data Scientist Interview Questions


Facebook is one of the major players in data science and offers great job opportunities
for data scientists. Following are some sample data scientist interview questions for
Facebook interview prep:

1. How do you approach any data analytics-based project?


2. Explain Gradient Descent
3. Why is data cleaning crucial? How do you clean the data?
4. Define Autoencoders.
5. How will you treat missing values during data analysis?
6. How will you optimize the delivery of a million emails?
7. What are Artificial Neural Networks?
8. Describe the different machine learning models.
9. What is the difference between Data Science and Data Analytics?
10. How will you ensure good data visualization?

Recommended Reading: Facebook Data Scientist Salary

Airbnb Data Scientist Interview Questions


Being heavily dependent on tech and data, Airbnb is a great place to work for software
engineers and data scientists. You can practice the following interview questions for
your data scientist interview at Airbnb.

1. If you need to manage a chat thread, which tables and indices do you need in a
SQL DB?
2. How do you propose to measure the effectiveness of the operations team?
3. Explain p-value to a business head.
4. Explain the differences between independent and dependent variables.
5. What is the goal of A/B Testing?
6. Define Prior probability and likelihood?
7. Explain the key differences between supervised and unsupervised learning.
8. What is the difference between “long” and “wide” format data?
9. Explain the utility of a training set.
10. What is Logistic Regression?

Recommended Reading: Data Scientist Salary in the United States


Data Science Interview Questions for Freshers
If you’re a fresher, here are some data science interview questions that you must
prepare for:

1. Explain the differences between data analytics and data science.


2. Can you describe the various techniques used for data sampling?
3. What are the benefits of using data sampling?
4. What are precision and recall in data science?
5. What is the best way to handle missing values in data?
6. Define linear regression. How do you use it in data analysis?
7. What is logistic regression, and how is it different from linear regression?
8. What are the differences between long and wide-format data?
9. List out the differences between supervised learning and unsupervised learning.
10. Enlist the various steps involved in an analytics project.
11. What do you understand by deep learning?
12. What is data cleaning?
13. How does traditional application programming vary from data science?
14. What are the differences between Normalization and Standardization?
15. Define tensors in data science.

Recommended Reading: Data Engineer vs. Data Scientist — Everything You Need to
Know

Data Science Interview Questions for


Experienced Candidates
Experienced candidates applying for data scientist roles at tech companies can expect
the following types of interview questions:

1. How do you handle unbalanced binary classification?


2. Discuss three types of machine learning algorithms.
3. What is a random forest algorithm?
4. Define Cross-Validation.
5. What is bias?
6. What is the CART algorithm for decision trees?
7. Describe the different nodes of a decision tree.
8. Have you used hypothesis testing in machine learning problems?
9. What is ANOVA testing?
10. In the case of imbalance classification, how will you calculate F-measure and
precision?
11. Explain gradient descent with respect to linear models.
12. Why should you use regularization? What are the differences between L1 and L2
regularization?
13. Describe the differences between a box plot and a histogram.
14. What is a confusion matrix?
15. Describe outlier value. How do you treat them?

More Sample Questions for Data Science


Technical Interviews
Here are a few more technical interview questions for practicing for your data scientist
interview:

1. What do you mean by cluster sampling and systematic sampling?


2. Describe the differences between true-positive rate and false-positive rate.
3. What is Naive Bayes? Why is it known as Naive?
4. What do you understand about the “curse of dimensionality”?
5. What is cross-validation in data science?
6. What do you know about cross-validation?
7. How can you select an ideal value of K for K-means clustering?
8. What are the steps of building a random forest model?
9. What is ensemble learning?
10. How will you define clusters in cluster algorithm?

Recommended Reading: 7 Best Data Science Books for Interview Preparation

Behavioral Interview Questions for Data


Scientists
While there will be a heavy focus on your data science knowledge and skills, data
scientist interviews also include behavioral rounds. Following are some behavioral
interview questions you can practice to ace your data scientist interview:

1. Describe a time when you used data for presenting data-driven statistics.
2. Do you think vacations are important? How often do you think one should take a
vacation?
3. Did you ever have two deadlines that you had to meet simultaneously? How did
you manage that?
4. Describe a time when you had a disagreement with a senior over a project. How
did you handle it?
5. How will you handle the situation if you have an insubordinate team member?
6. Why do you want to work as a data scientist with this company?
7. Which is your favorite leadership principle?
8. How do you ensure high productivity levels at work?
9. Have you ever had to explain a technical concept to a non-technical person?
Was it difficult to do so?
10. How do you prioritize your work?

Recommended Reading: Python Data Science Interview Questions

That concludes the comprehensive list of data scientist interview questions. Make sure
you practice these frequently asked questions to prepare yourself for the interview.

FAQs on Data Scientist Interview Questions


1. What type of questions are asked in a data scientist interview?

Data science interview questions are usually based on statistics, coding, probability,
quantitative aptitude, and data science fundamentals.

2. Are coding questions asked at data scientist interviews?

Yes. In addition to core data science questions, you can also expect easy to medium
Leetcode problems or Python-based data manipulation problems. Your knowledge of
SQL will also be tested through coding questions.

3. Are behavioral questions asked at data scientist interviews?

Yes. Behavioral questions help hiring managers understand if you are a good fit for the
role and company culture. You can expect a few behavioral questions during the data
scientist interview.

4. What topics should I prepare to answer data scientist interview questions?

Some domain-specific topics that you must prepare include SQL, probability and
statistics, distributions, hypothesis testing, p-value, statistical significance, A/B testing,
causal impact and inference, and metrics. These will prepare you for data scientist
interview questions.

5. Is having a master’s degree essential to work as a Data Scientist at FAANG?

Based on our research, you can work as a data scientist even though you only have a
bachelor’s degree. You can always upgrade your skills via a data science boot camp.
But for better career prospects, having an advanced degree may be useful.
100+ Data Science Interview
Questions and Answers for 2023
Top 100 Common Data Scientist Interview Questions

and Answers

Common Data Science Interview Questions

1. What is Machine Learning?

Machine Learning comprises two words-machine and learning, which hint towards its
definition - a subdomain in computer science that deals with the application of
mathematical algorithms to identify the trend or pattern in a dataset.

The simplest example is the usage of linear regression (y=mt+c) to predict the output of
a variable y as a function of time. The machine learning model learns the trends in the
dataset by fitting the equation on the dataset and evaluating the best set of values for m
and c. One can then use these equations to predict future values.


2. Quickly differentiate between Machine Learning, Data

Science, and AI.


Basic meaning:

● Machine Learning: A branch of Artificial Intelligence that deals with the usage of simple statistics-inspired algorithms to identify patterns in the dataset.
● Data Science: Refers to the art of using machine learning and deep learning techniques over large data to predict certain outcomes.
● Artificial Intelligence: A term that broadly covers the applications of computer science spanning Robotics, Text Analysis, etc.

3. Out of Python and R, which is your preference for

performing text analysis?

Python is likely to be everyone’s choice for text analysis as it has libraries like the Natural Language Toolkit (NLTK), Gensim, CoreNLP, spaCy, and TextBlob, which are all useful for text analysis.

4. What are Recommender Systems?

Understanding consumer behavior is often the primary goal of many businesses. For
example, consider the case of Amazon. If a user searches for a product category on its
website, the major challenge for Amazon’s backend algorithms is to come up with
suggestions that are likely to motivate the users to make a purchase. And such
algorithms are the heart of recommendation systems or recommender systems. These
systems aim at analyzing customer behavior and evaluating their fondness for different
products. Apart from Amazon, recommender systems are also used by Netflix, Youtube,
Flipkart, etc.

5. Why data cleaning plays a vital role in the analysis?

It is cumbersome to clean data
from multiple sources to transform it into a format that data analysts or scientists can
work with. As the number of data sources increases, the time it takes to clean the data
increases exponentially due to the number of sources and the volume of data generated
in these sources. It might take up to 80% of the time for cleaning data, thus making it a
critical part of the analysis task.

6. Define Collaborative filtering.

The process of filtering is used by most recommender systems to identify patterns or


information by collaborating viewpoints, various data sources, and multiple agents.


7. What is an Eigenvalue and Eigenvector?

Eigenvectors are used for understanding linear transformations. They are the directions
along which a particular linear transformation acts by flipping, compressing, or
stretching. Eigenvalues can be referred to as the strength of the transformation in the
direction of the eigenvector or the factor by which the compression occurs. We usually
calculate the eigenvectors for a correlation or covariance matrix in data analysis.
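For instance (a sketch using NumPy on a small, arbitrary symmetric matrix):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                     # a symmetric 2x2 matrix
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)                             # 3 and 1 (order may vary)
print(eigenvectors)                            # columns are the corresponding eigenvectors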

8. What is Gradient Descent?

Gradient descent is an iterative procedure that minimizes the cost function parametrized
by model parameters. It is an optimization method based on convex function and trims
the parameters iteratively to help the given function attain its local minimum. Gradient
measures the change in parameter with respect to the change in error. Imagine a
blindfolded person on top of a hill and wanting to reach the lower altitude. The simple
technique he can use is to feel the ground in every direction and take a step in the
direction where the ground is descending faster. Here we need the help of the learning
rate which says the size of the step we take to reach the minimum. The learning rate
should be chosen so that it should not be too high or too low. When the selected
learning rate is too high, it tends to bounce back and forth between the convex function
of the gradient descent, and when it is too low, we will reach the minimum very slowly.

9. Differentiate between a multi-label classification

problem and a multi-class classification problem.

● Multi-label Classification: A classification problem where each target variable in the dataset can be labeled with more than one class. For example, a news article can be labeled with more than one topic, say, sports and fashion.

● Multi-Class Classification: A classification problem where each target variable in the dataset can be assigned only one class out of two or more classes. For example, the task of classifying fruit images where each image contains only one fruit.

10. What are the various steps involved in an analytics

project?

● Understand the business problem and convert it into a data analytics problem.
● Use exploratory data analysis techniques to understand the given dataset.
● With the help of feature selection and feature engineering methods, prepare the
training and testing dataset.
● Explore machine learning/deep learning algorithms and use one to build a
training model.
● Feed training dataset to the model and improve the model’s performance by
analyzing various statistical parameters.
● Test the performance of the model using the testing dataset.
● Deploy the model, if needed, and monitor the model performance.

11. What is the difference between feature selection and

feature engineering methods?

● Feature Selection: Feature selection methods are used to obtain a subset of variables from the dataset that is required to build a model that best fits the trends in the dataset. Examples: Intrinsic Methods (rule- and tree-based algorithms, MARS models, etc.), Filter Methods, Wrapper Methods (Recursive Feature Elimination, Genetic Algorithms, etc.).

● Feature Engineering: Feature engineering methods are used to create new features from the given dataset using the existing variables. These methods allow the model to better fit complicated trends in the dataset. Examples: Imputation, Discretization, Categorical Encoding, etc.

12. What do you know about MLOps tools? Have you ever

used them in a machine learning project?

MLOps tools are the tools that are used to produce and monitor the enterprise-grade
deployment of machine learning models. Examples of such tools are MLflow,
Pachyderm, Kubeflow, etc.
In case you haven’t worked on an MLOps project, try this MLOps project by Goku
Mohandas on Github or this MLOps Project on GCP using Kubeflow for Model
Deployment by ProjectPro.

Data Science Technical Interview Questions

13. What do you understand by logistic regression?

Explain one of its use-cases.

Logistic regression is one of the most popular machine learning models used for solving
a binary classification problem, that is, a problem where the output can take any one of
the two possible values. Its equation is given by

Y = 1 / (1 + e^(-(a + bX)))
Where X represents the feature variable, a,b are the coefficients, and Y is the target
variable. Usually, if the value of Y is greater than some threshold value, the input
variable is labeled with class A. Otherwise, it is labeled with class B.

14. How are univariate, bivariate, and multivariate

analyses different from each other?

● Univariate Analysis: When only one variable is being analyzed, through graphs like pie charts, the analysis is called univariate.
● Bivariate Analysis: When trends in two variables are compared using graphs like scatter plots, the analysis is of the bivariate type.
● Multivariate Analysis: When more than two variables are considered for analysis to understand their correlations, the analysis is termed multivariate.

15. What is K-means?

K-means clustering algorithm is an unsupervised machine learning algorithm that


classifies a dataset with n observations into k clusters. Each observation is labeled to
the cluster with the nearest mean.

16. How will you find the right K for K-means?

To find the optimal value for k, one can use the elbow method or the silhouette method.
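A sketch of the elbow method with scikit-learn (the data here is a toy blob dataset; in practice X would be your own feature matrix):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)   # toy data with 4 true clusters
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)          # within-cluster sum of squares
print(inertias)                           # look for the 'elbow' where the drop flattens (around k = 4 here)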

17. What do you understand by long and wide data

formats?

In wide data format, you will find a column for each variable in the dataset. On the other
hand, in a long format, the dataset has a column for specific variable types & a column
for the values of those variables.
For example,

(Example tables of the wide and long data formats are not reproduced here. Image source: Mason John on Quora.)
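For illustration (a sketch with made-up columns), pandas can reshape the same data from wide to long format using melt:

import pandas as pd

wide = pd.DataFrame({'name': ['A', 'B'],
                     'score_2021': [85, 90],
                     'score_2022': [88, 95]})                     # wide: one column per year
long = wide.melt(id_vars='name', var_name='year', value_name='score')
print(long)                                                        # long: one row per (name, year) pair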

18. What do you understand by feature vectors?

Feature vectors are the set of variables containing values describing each observation’s
characteristics in a dataset. These vectors serve as input vectors to a machine learning
model.

19. How does the use of dropout work as a regulariser for

deep neural networks?

Dropout is a regularisation method for deep neural networks that effectively trains many different network architectures on a given dataset. During training, a random subset of the units (nodes) in a layer is temporarily dropped from the network. This introduces noise by compelling the remaining nodes within a layer to probabilistically take on more or less responsibility for the input values. Thus, dropout makes the neural network model more robust, because no unit can rely too heavily on the presence of particular units in prior layers.

20. How beneficial is dropout regularisation in deep

learning models? Does it speed up or slow down the

training process, and why?

The dropout regularisation method mostly proves beneficial for cases where the dataset
is small, and a deep neural network is likely to overfit during training. The computational
factor has to be considered for large datasets, which may outweigh the benefit of
dropout regularisation.

The dropout regularisation method involves randomly dropping units (not entire layers) from the network during training. Because of the noise it introduces, the model usually needs more epochs to converge, so dropout tends to slow down the training process rather than speed it up.
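As a sketch (using the Keras API; the layer sizes and dropout rate below are arbitrary), dropout is simply added as a layer between dense layers:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),                  # randomly drops 50% of the units during training only
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()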

21. How will you explain logistic regression to an

economist, physician-scientist, and biologist?

Logistic regression is one of the simplest machine learning algorithms. It is used to


predict the relationship between a categorical dependent variable and two or more
independent variables. The mathematical formula is given by

Y = 1 / (1 + e^(-(a + bX)))
Where X is the independent variable, a,b are the coefficients, and Y is the dependent
variable that can take categorical values.

22. What is the benefit of batch normalization?


● The model is less sensitive to hyperparameter tuning.
● High learning rates become acceptable, which results in faster training of the
model.
● Weight initialization becomes an easy task.
● Using different non-linear activation functions becomes feasible.
● Deep neural networks are simplified because of batch normalization.

It introduces mild regularisation in the network.

23. What is multicollinearity, and how can you overcome

it?

A single dependent variable depends on several independent variables in a multiple


regression model. When these independent variables are deduced to possess high
correlations with each other, the model is considered to reflect multicollinearity.

One can overcome multicollinearity in their model by removing a few highly correlated
variables from the regression equation.

24. What do you understand by the trade-off between bias

and variance in Machine Learning? What is its

significance?

The expected test MSE (Mean Squared Error) for a given value x0 can always be decomposed into the sum of three fundamental quantities: the variance of f̂(x0), the squared bias of f̂(x0), and the variance of the error term e. That is,

E[(y0 − f̂(x0))^2] = Var(f̂(x0)) + [Bias(f̂(x0))]^2 + Var(e)

Here, E[(y0 − f̂(x0))^2] denotes the expected test MSE, i.e., the average test MSE one would obtain by repeatedly estimating f using a large number of training sets and testing each at x0. Also, f̂(x0) is the output of the fitted ML model for the input x0, and e is the irreducible error, i.e., the deviation of the observed value y0 from the true function value at x0.
The equation above suggests that we need to select a statistical learning method that
simultaneously achieves low variance and low bias to minimize the expected test error.
A good statistical learning method's good test set performance requires low variance
and low squared bias. This is referred to as a trade-off because it is easy to obtain a
method with extremely low bias but high variance (for instance, by drawing a curve that
passes through every single training observation) or a method with a very low variance

but high bias (by fitting a horizontal line to the data). The challenge lies in finding a
method for which both the variance and the squared bias are low.

25. What do you understand by interpolating and

extrapolating the given data?

Interpolating the data means one is estimating the values in between two known values
of a variable from the dataset. On the other hand, extrapolating the data means one is
estimating the values that lie outside the range of a variable.

26. Do gradient descent methods always converge to the

same point?

No, gradient descent methods do not always converge to the same point because they
converge to a local minimum or a local optima point in some cases. It depends a lot on
the data one is dealing with and the initial values of the learning parameter.

27. What is the difference between Supervised Learning

and Unsupervised Learning?

Supervised Learning:
● If an algorithm learns something from the training data so that the knowledge can be applied to the test data, then it is referred to as Supervised Learning.
● It is majorly used to make predictions for a dependent variable.
● Classification and Regression are examples of Supervised Learning.

Unsupervised Learning:
● If the algorithm does not learn anything beforehand because there is no response variable or training data, it is referred to as Unsupervised Learning.
● It is primarily used to perform analysis and group similar data points together.
● Clustering and dimensionality reduction are examples of Unsupervised Learning.

28. What is Regularization and what kind of problems

does regularization solve?

Regularization is basically a technique that is used to push or encourage the


coefficients of the machine learning model towards zero to reduce the over-fitting
problem. The general idea of regularization is to penalize complicated models by adding
an additional penalty to the loss function in order to generate a larger loss. In this way,
we can discourage the model from learning too many details and the model is much
more general.
There are two ways of assigning the additional penalty term to the loss function giving
rise to two types of regularization techniques. They are

● L2 Regularization
● L1 Regularization

In L2 Regularization, the penalty term is the sum of squares of the magnitude of the
model coefficients while in L1 Regularization, it is the sum of absolute values of the
model coefficients.
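A sketch with scikit-learn on toy data (alpha controls the penalty strength; the coefficient values below are illustrative):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X @ np.array([1.5, 0.0, 2.0, 0.0, -1.0]) + 0.1 * rng.randn(100)

print(Ridge(alpha=1.0).fit(X, y).coef_)   # L2: coefficients shrunk towards zero
print(Lasso(alpha=0.1).fit(X, y).coef_)   # L1: some coefficients set exactly to zero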

29. How can you overcome Overfitting?


We can overcome overfitting using one or more of the following techniques
1. Simplifying the model: We can reduce overfitting by reducing the complexity of the model. We can either remove layers or reduce the number of neurons in the case of a deep learning model, or prefer a lower-order polynomial model in the case of regression.

2. Use Regularization: Regularization is the common technique used to remove the


complexity of the model by adding a penalty to the loss function. There are two
regularization techniques namely L1 and L2. L1 penalizes the sum of absolute values of
weight whereas L2 penalizes the sum of square values of weight. When data is too
complex to be modeled, the L2 technique is preferred and L1 is better if the data to be
modeled is quite simple. However, L2 is more commonly preferred.
3. Data Augmentation: Data augmentation is nothing but creating more data samples
using the existing set of data. For example, in the case of a convolutional neural
network, producing new images by flipping, rotation, scaling, changing brightness of the
existing set of images helps in increasing the dataset size and reducing overfitting.
4. Early Stopping: Early stopping is a regularization technique that identifies the point
from where the training data leads to generalization error and begins to overfit. The
algorithm stops training the model at that point.
5. Feature reduction: If we have a small number of data samples with a large number of
features, we can prevent overfitting by selecting only the most important features. We
can use various techniques for this such as F-test, Forward elimination, and Backward
elimination.
6. Dropouts: In the case of neural networks, we can also randomly deactivate a
proportion of neurons in each layer. This technique is called dropout and it is a form of
regularization. However, when we use the dropout technique, we have to train the data
for more epochs.

30. Differentiate between Batch Gradient Descent,

Mini-Batch Gradient Descent, and Stochastic Gradient

Descent.

Gradient descent is one of the most popular machine learning and deep learning
optimization algorithms used to update a learning model's parameters. There are 3
variants of gradient descent.
Batch Gradient Descent: Computation is carried out on the entire dataset in batch
gradient descent.
Stochastic Gradient Descent: Computation is carried over only one training sample in
stochastic gradient descent.
Mini Batch Gradient Descent: A small number/batch of training samples is used for
computation in mini-batch gradient descent.
For example, if a dataset has 1000 data points, then batch GD will compute the gradient on all 1000 points for each update, stochastic GD will use only a single sample per update, and mini-batch GD will use a batch of, say, 100 data points to update the parameters.
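A minimal NumPy sketch of the three variants on a one-parameter linear regression (the learning rate, batch size, and synthetic data are illustrative choices) might look like this:

# A minimal numpy sketch contrasting the three gradient descent variants
# on simple linear regression (learning rate and batch size are illustrative).
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 1)
y = 3 * X[:, 0] + rng.randn(1000) * 0.1
lr = 0.1

def gradient(Xb, yb, w):
    # Derivative of the mean squared error for the model y_hat = w * x
    return 2 * np.mean((w * Xb[:, 0] - yb) * Xb[:, 0])

# Batch GD: one update per epoch, computed on the full dataset
w_batch = 0.0
for epoch in range(20):
    w_batch -= lr * gradient(X, y, w_batch)

# Mini-batch GD: several updates per epoch, each on a batch of 100 samples
w_mini = 0.0
for epoch in range(20):
    order = rng.permutation(len(X))
    for start in range(0, len(X), 100):
        batch = order[start:start + 100]
        w_mini -= lr * gradient(X[batch], y[batch], w_mini)

# Stochastic GD: one update per individual training sample
w_sgd = 0.0
for i in rng.permutation(len(X)):
    w_sgd -= lr * gradient(X[i:i + 1], y[i:i + 1], w_sgd)

# All three estimates should end up close to the true slope of 3
print(round(w_batch, 3), round(w_mini, 3), round(w_sgd, 3))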

Data Science Statistics Interview Questions

31. How can you make data normal using Box-Cox

transformation?

The Box-Cox transformation is a method of normalizing data, named after the two statisticians who introduced it, George Box and David Cox. Each data point, X, is transformed as X^a, where a represents the power to which each data point is raised. The Box-Cox procedure searches over values of a, typically in the range -5 to +5, until the optimal value of a that best normalizes the data is identified.
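SciPy provides an implementation that searches for the optimal exponent automatically; the sketch below uses an exponentially distributed (right-skewed) sample purely as an illustration:

# A minimal sketch of the Box-Cox transformation using scipy
# (the exponential sample is only an illustration of skewed, positive data).
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
skewed = rng.exponential(scale=2.0, size=1000)   # right-skewed, strictly positive data

# boxcox returns the transformed data and the optimal lambda (the 'a' above)
transformed, best_lambda = stats.boxcox(skewed)

print("Optimal lambda:", round(best_lambda, 3))
print("Skewness before:", round(stats.skew(skewed), 3))
print("Skewness after:", round(stats.skew(transformed), 3))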

32. What does P-value signify about the statistical data?

In statistics, the p-value is used to test the significance of a null hypothesis. A p-value lower than 0.05 suggests that there is less than a 5% chance of observing results at least this extreme if the null hypothesis were true, so the null hypothesis is rejected. On the other hand, a higher p-value, say 0.8, suggests that the observed data is quite consistent with the null hypothesis, so it cannot be rejected.
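As a small illustration (with a synthetic sample and the usual 0.05 threshold), a one-sample t-test in SciPy returns a p-value that can be compared against the significance level:

# A minimal sketch of reading a p-value from a one-sample t-test
# (the synthetic sample and the 0.05 threshold are illustrative).
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
sample = rng.normal(loc=5.3, scale=1.0, size=50)

# Null hypothesis: the population mean equals 5.0
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

print("p-value:", round(p_value, 4))
if p_value < 0.05:
    print("Reject the null hypothesis at the 5% significance level.")
else:
    print("Fail to reject the null hypothesis.")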

33. Why do we use A/B Testing?

A/B testing is a technique for comparing user experience. It involves serving two different versions of a product to two groups of users and analyzing which version is likely to outperform the other. The testing is also used to understand user preferences.
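A minimal sketch of how an A/B test might be analyzed with a two-proportion z-test follows; the conversion counts are made-up numbers and the test choice is just one common option:

# A minimal sketch of analyzing an A/B test with a two-proportion z-test
# (conversion counts below are made-up numbers for illustration only).
from math import sqrt
from scipy.stats import norm

conversions_a, visitors_a = 200, 5000    # version A
conversions_b, visitors_b = 250, 5000    # version B

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)

# Standard two-proportion z statistic under the pooled null hypothesis
se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))            # two-sided p-value

print(f"z = {z:.3f}, p-value = {p_value:.4f}")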

34. What is the standard normal distribution?


The standard normal distribution is a special kind of normal distribution in statistics that has a mean of zero and a standard deviation equal to one. The graph of a standard normal distribution looks like the famous bell curve with zero at its center. The distribution is symmetric about the origin and asymptotic, that is, its tails approach the horizontal axis without ever touching it.

35. What is the difference between squared error and

absolute error?

Squared Error: The squared error is the square of the difference between the value of a quantity, x, and its inferred value, x'. It is represented as (x - x')².

Absolute Error: As the name suggests, the absolute error is the absolute value (modulus) of the difference between the value of a quantity, x, and its inferred value, x'. It is represented as |x - x'|.

In data science, mean squared error is more popular for understanding the deviation of
the inferred values from the actual values as it gives relatively more weight to the highly
deviated points and gives a continuous derivative which is useful for analysis.
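The difference is easy to see numerically; in the toy example below (arbitrary values), a single badly predicted point dominates the mean squared error but not the mean absolute error:

# A minimal sketch contrasting mean squared error and mean absolute error
# (the actual/predicted values are toy numbers for illustration).
import numpy as np

actual = np.array([3.0, 5.0, 2.5, 7.0, 10.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0, 4.0])    # last point is badly off

errors = actual - predicted
mse = np.mean(errors ** 2)       # squaring penalizes the large deviation heavily
mae = np.mean(np.abs(errors))    # absolute error weights all deviations linearly

print(f"MSE = {mse:.2f}, MAE = {mae:.2f}")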

36. What is the difference between skewed and uniform

distribution?

A skewed distribution is a distribution where the values in the dataset are not symmetric about the center and the distribution curve leans towards one side. A uniform distribution, on the other hand, is a symmetric distribution where the probability of occurrence is the same for every point in a given range of values in the dataset.

37. What do you understand by Recall and Precision?


For explaining Recall and Precision, it is best to consider an example of a confusion
matrix.

Predicted \ Actual         Cancer Patient    Not a Cancer Patient

Cancer Patient                   30                  12

Not a Cancer Patient             10                  28

Assume that the confusion matrix mentioned above represents the results of the
classification problem of cancer detection. It is easy to conclude the following:

True Positives: No. of patients who have cancer and were correctly predicted as having cancer = 30

True Negatives: No. of patients who do not have cancer and were correctly predicted as not having cancer = 28

False Positives: No. of patients who do not have cancer but the model predicted otherwise = 12

False Negatives: No. of patients who have cancer but the model predicted otherwise = 10

For such problem,

Recall = True Positives / (True Positives + False Negatives) = 30/40 = 0.75

The formula for recall clearly suggests that it estimates the ability of a model to correctly identify true positives, that is, the patients who actually have cancer. To understand it better, take a careful look at the denominator, which is simply the total number of patients who have cancerous cells. Thus, a recall value of 0.75 suggests that the model was able to correctly identify 75% of the patients who have cancer.

On the other hand,

Precision = True Positives / (True Positives + False Positives) = 30/42 = 0.71

The formula for precision suggests that it reflects how reliably the model's positive predictions are actual positives, that is, true positives relative to false positives. Thus, the value 0.71 suggests that whenever the model predicts that a patient has cancer, the chance of that prediction being correct is 71%.
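The same numbers can be verified with a few lines of Python (using the counts from the confusion matrix above):

# Recall and precision computed from the confusion-matrix counts in the answer above.
tp, fp, fn, tn = 30, 12, 10, 28

recall = tp / (tp + fn)        # ability to find all actual cancer patients
precision = tp / (tp + fp)     # reliability of a positive prediction

print(f"Recall = {recall:.2f}, Precision = {precision:.2f}")   # 0.75 and 0.71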

38. What is the curse of dimensionality?

High dimensional data refers to data that has a large number of features. The
dimension of data is the number of features or attributes in the data. The problems
arising while working with high dimensional data are referred to as the curse of
dimensionality. It basically means that the error increases as the number of features in the data increases. Theoretically, more information can be stored in high-dimensional data, but in practice it does not help much because such data tends to have higher noise and redundancy. It is hard to design algorithms for high-dimensional data, and for many algorithms the running time grows very quickly, sometimes exponentially, with the dimension of the data.

39. What is the use of the R-squared value?

The R-squared value compares the variation of the data points around a fitted curve with the variation of those points with respect to the horizontal line that passes through their average value. It can be understood with the help of the formula

R² = [Var(mean) - Var(model)] / Var(mean)

A fitted model is expected to fit the data better than the average line, so the variation around the model is expected to be smaller than the variation around that line. Thus, if R-squared has a value of 0.92, it suggests that the model fits the data points better than the average line, as there is 92% less variation around it. It also indicates a strong relationship between the feature and the target value. However, if the R-squared value is low, it suggests that the relationship is weak and the two variables are largely independent of each other.
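A small sketch of this formula on toy data (the x/y values and the straight-line model are illustrative) could look like this:

# A minimal sketch of the R-squared formula above on toy data
# (the x/y values and the straight-line model are illustrative).
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit a simple straight-line model y_hat = a*x + b
a, b = np.polyfit(x, y, deg=1)
y_hat = a * x + b

var_mean = np.sum((y - y.mean()) ** 2)    # variation around the average line
var_model = np.sum((y - y_hat) ** 2)      # variation around the fitted model

r_squared = (var_mean - var_model) / var_mean
print(f"R-squared = {r_squared:.3f}")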

Data Science Probability Interview Questions

40. What do you understand by Hypothesis in the context

of Machine Learning?
In machine learning, a hypothesis is the candidate mathematical function that an algorithm uses to approximate the relationship between the features and the target variable.

41. How will you tackle an exploding gradient problem?

One can avoid exploding gradients by carefully configuring the network: sticking to a small learning rate, scaling the target variables, and using a standard loss function. Another approach for tackling exploding gradients is gradient scaling or gradient clipping, which rescales the error gradient before it is propagated back through the network so that the weight updates stay bounded.
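The sketch below shows gradient clipping in a PyTorch-style training step; the model architecture, data, and max_norm value are illustrative placeholders, not a prescribed setup.

# A minimal sketch of gradient clipping in a PyTorch-style training step
# (the model, data, and max_norm value are illustrative placeholders).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

X = torch.randn(64, 10)
y = torch.randn(64, 1)

optimizer.zero_grad()
loss = loss_fn(model(X), y)
loss.backward()

# Rescale the gradients so that their combined norm never exceeds 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()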

42. Is Naïve Bayes bad? If yes, under what aspects.

Naïve Bayes is a machine learning algorithm based on Bayes' Theorem and is used for solving classification problems. It rests on two assumptions: first, that each feature/attribute in the dataset is independent of the others, and second, that each feature carries equal importance. These assumptions turn out to be its main disadvantage: in real-life scenarios features are rarely fully independent, as there is almost always some dependence among them. Another disadvantage of the algorithm is the 'zero-frequency problem', where the model assigns zero probability to feature values in the test dataset that were not present in the training dataset (this is usually mitigated with Laplace smoothing).

43. How would you develop a model to identify

plagiarism?

Follow the steps below for developing a model that identifies plagiarism:

● Tokenise the document.


● Use the NLTK library in Python for the removal of stopwords from data.
● Create LDA or SDA of the document and then use the GenSim library to identify
the most relevant words, line by line.
● Use Google Search API to search for those words.

44. Explain the central limit theorem.


The central limit theorem states that if you collect a large number of samples from a population, the distribution of their sample means will approach a normal (bell-shaped) distribution, irrespective of the distribution the underlying population follows.
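A small simulation makes this concrete; in the sketch below (the exponential population and the sample sizes are arbitrary choices), the sample means are far less skewed than the population they come from:

# A minimal simulation sketch of the central limit theorem
# (the exponential population and sample sizes are illustrative choices).
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)

# Draw 2000 samples of size 50 from a clearly non-normal (exponential) population
sample_means = [rng.exponential(scale=2.0, size=50).mean() for _ in range(2000)]

# The distribution of the sample means should look roughly normal:
# its skewness is close to 0 even though the population is heavily skewed.
print("Skewness of the population draws:", round(stats.skew(rng.exponential(2.0, 10000)), 2))
print("Skewness of the sample means:   ", round(stats.skew(sample_means), 2))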

45. What is the relevance of the central limit theorem to a

class of freshmen in the social sciences who hardly have

any knowledge about statistics?

The most important consequence of the central limit theorem is that it explains why so many quantities observed in nature approximately follow the normal distribution curve. It allows experts from various fields like statistics, physics, mathematics, and computer science to assume that the data they are looking at obeys the famous bell curve.

46. Given a dataset, show me how Euclidean Distance

works in three dimensions.

The formula for evaluating the Euclidean distance in three dimensions between two points defined by coordinates (x1, y1, z1) and (x2, y2, z2) is given by

Distance = √[(x1 - x2)² + (y1 - y2)² + (z1 - z2)²]

It simply represents the length of a line that connects the two points in a
three-dimensional space.
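For completeness, the same computation in Python (with two arbitrary example points):

# A minimal sketch of the 3-D Euclidean distance formula above
# (the two points are arbitrary example coordinates).
import math

p1 = (1.0, 2.0, 3.0)
p2 = (4.0, 6.0, 8.0)

distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))
print(distance)                 # 7.071...

# Equivalent one-liner available in Python 3.8+
print(math.dist(p1, p2))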

47. In experimental design, is it necessary to do

randomization? If yes, why?

Yes, it is necessary to use randomization while designing experiments. Through randomization, we try to eliminate bias as much as possible. The main purpose of randomization is that it automatically controls for all lurking variables. Experiments with randomization establish a clearer causal relationship between explanatory variables and response variables by giving the experimenter control over the explanatory variables.

Data Science Coding Interview Questions

48. What will be the output of the following R programming

code?

var2<- c("I","Love,"ProjectPro")

var2

It will give an error (an "unexpected symbol" parse error), because the quotes are unbalanced: R parses "Love," as a string and then encounters ProjectPro outside of any quotes, so the expression cannot be parsed and var2 is never created.

49. Find the First Unique Character in a String.

def frstuniquechar(strng: str) -> int:
    # Work in lowercase so that 'H' and 'h' count as the same character
    strng = strng.lower()

    # Dictionary that will hold each unique letter and its count
    c = {}

    # Count the occurrences of every letter in the string
    for letter in strng:
        # If the letter is not yet in the dictionary, add it with a count of 1
        if letter not in c:
            c[letter] = 1
        # If the letter is already in the dictionary, increment its count
        else:
            c[letter] += 1

    # Walk through the string again, in its original order
    for i in range(len(strng)):
        # The first letter whose count is 1 is the first unique character
        if c[strng[i]] == 1:
            # Return its index position
            return i

    # No unique character was found
    return -1

# Test cases
for s in ['Hello', 'Hello ProjectPro!', 'Thank you for visiting.']:
    print(f"Index: {frstuniquechar(strng=s)}")

50. Write the code to calculate the Factorial of a number

using Recursion.

def fact(num):
    # Factorial is undefined for negative numbers
    if num < 0:
        return -1
    # Base cases: 0! = 1 and 1! = 1
    if num == 0:
        return 1
    if num == 1:
        return num
    # Recursive case
    return num * fact(num - 1)

# Test cases
for num in [1, 3, 5, 6, 8, -10]:
    print(f"{num}! = {fact(num=num)}")

Statistics Interview Questions for Data Science


1. Out of L1 and L2 regularizations, which one causes parameter sparsity and why?
2. List the differences between Bayesian Estimate and Maximum Likelihood
Estimation (MLE).
3. Differentiate between Cluster and Systematic Sampling?
4. How will you prevent overfitting when creating a statistical model?

Python Data Science Interview Questions


1. Explain the range function.
2. How can you freeze an already built machine learning model for later use? What command would you use?
3. Differentiate between func and func().
4. Write the command to import a decision tree classification algorithm using
sklearn library.
5. What do you understand by pickling in Python?

Suggested Answers by Data Scientists for

Open-Ended Data Science Interview Questions

1. How can you ensure that you don’t analyze


something that ends up producing meaningless
results?
Start by checking whether the model chosen is correct. Work from the point where you did univariate or bivariate analysis, analyzed the distribution of the data and the correlation of variables, and built the linear model. Linear regression has an inherent requirement that the data and the errors in the data should be normally distributed. If they are not, then we cannot use linear regression. This is an inductive approach to find out whether an analysis using linear regression will yield meaningless results or not.

Another way is to train and test data sets by sampling them multiple times. Predict on all
those datasets to determine whether the resultant models are similar and are
performing well.

By looking at the p-value, the R-squared values, and the fit of the function, and by analyzing how the treatment of missing values could have affected the results, data scientists can judge whether an analysis will produce meaningless results.

- Gaganpreet Singh, Data Scientist

So, there you have over 120 data science interview questions, with answers for most of them. These are some of the more common interview questions around data, statistics, and data science that data scientists can expect to be asked. We will come up with more questions specific to languages such as Python and R in subsequent articles, and fulfill our goal of providing a 120 data science interview questions PDF with answers to our readers.

3 Secrets to becoming a Great Enterprise Data Scientist

• Keep on adding technical skills to your data scientist’s toolbox.


• Improve your scientific acumen

• Learn the language of business as the insights from a data scientist help in
reshaping the entire organization.

The most important tip for nailing a data science interview is to be confident with your answers without bluffing. If you are well-versed in a particular technology, whether it is Python, R, Hadoop, Spark, or any other big data technology, ensure that you can back this up; if you are not strong in a particular area, do not mention it unless asked about it. The above list of data scientist job interview questions is not an exhaustive one, and every company has a different approach to interviewing data scientists. However, we do hope that the above data science technical interview questions elucidate the data science interview process and provide an understanding of the type of data scientist job interview questions asked when companies are hiring data people.

We request industry experts and data scientists to chime in with their suggestions in the comments for the open-ended data science interview questions, to help students understand the best way to approach the interviewer and nail the interview.
