Data Science and Machine Learning - Interview Questions
Questions
Data Analytics: tools include data mining, data modelling, database management, and data analysis.
Data Science: tools include Machine Learning, Hadoop, Java, Python, software development, etc.
Here's a list of the most popular data science interview questions on technical concepts that you can expect to face, and how to frame your answers.
1. Randomly select 'k' features from a total of 'm' features where k << m
2. Among the 'k' features, calculate the node D using the best split point
3. Split the node into daughter nodes using the best split
4. Repeat steps two and three until leaf nodes are finalized
5. Build forest by repeating steps one to four for 'n' times to create 'n' number of
trees
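If the interviewer asks for code, a minimal scikit-learn sketch is shown below; the synthetic dataset, feature counts, and hyperparameters are illustrative assumptions, not part of the original answer.
# Hedged sketch: random forest on synthetic data (all values are illustrative)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
X, y = make_classification(n_samples=500, n_features=20, random_state=42)  # 'm' = 20 features in total
model = RandomForestClassifier(n_estimators=100,     # 'n' trees in the forest
                               max_features="sqrt",  # 'k' features considered at each split
                               random_state=42)
model.fit(X, y)                                      # builds the 'n' trees on bootstrapped samples
print(model.score(X, y))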
1. Keep the model simple—take fewer variables into account, thereby removing
some of the noise in the training data
2. Use cross-validation techniques, such as k folds cross-validation
3. Use regularization techniques, such as LASSO, that penalize certain model
parameters if they're likely to cause overfitting
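As a rough illustration of points 2 and 3, one possible scikit-learn sketch (with an assumed synthetic regression dataset and an illustrative alpha value) is:
# Hedged sketch: k-fold cross-validation of a LASSO model on synthetic data
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
X, y = make_regression(n_samples=200, n_features=30, noise=10, random_state=0)
lasso = Lasso(alpha=1.0)                      # the L1 penalty shrinks some coefficients toward zero
scores = cross_val_score(lasso, X, y, cv=5)   # 5-fold cross-validation scores
print(scores.mean())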
Univariate data contains only one variable. The purpose of the univariate analysis is to
describe the data and find patterns that exist within it.
Example data (a single variable): 167.3, 170, 174.2, 178, 180
The patterns can be studied by drawing conclusions using mean, median, mode,
dispersion or range, minimum, maximum, etc.
Bivariate
Bivariate data involves two different variables. The analysis of this type of data deals
with causes and relationships and the analysis is done to determine the relationship
between the two variables.
Temperature   Sales
20            2,000
25            2,100
26            2,300
28            2,400
30            2,600
36            3,100
Here, it is visible from the table that temperature and sales are directly proportional to each other: the hotter the temperature, the better the sales.
Multivariate
Example house data, where the price in the last column depends on the other variables:
2     0   900     $400,000
3     2   1,100   $600,000
3.5   5   1,500   $900,000
4     3   2,100   $1,200,000
The patterns can be studied by drawing conclusions using mean, median, and mode,
dispersion or range, minimum, maximum, etc. You can start describing the data and
using it to guess what the price of the house will be.
Filter Methods
This involves:
The best analogy for selecting features is "bad data in, bad answer out." When we're
limiting or selecting the features, it's all about cleaning up the data coming in.
Wrapper Methods
This involves:
● Forward Selection: We test one feature at a time and keep adding them until
we get a good fit
● Backward Selection: We test all the features and start removing them to see
what works better
● Recursive Feature Elimination: Recursively looks through all the different
features and how they pair together
Wrapper methods are very labor-intensive, and high-end computers are needed if a lot of
data analysis is performed with the wrapper method.
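For illustration, a minimal Recursive Feature Elimination sketch with scikit-learn might look like the following; the dataset and the choice of estimator are assumptions made for the example.
# Hedged sketch: RFE keeping 5 of 10 synthetic features
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=300, n_features=10, random_state=1)
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X, y)                # recursively removes the weakest feature at each step
print(selector.support_)          # True for the features that were kept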
If the data set is large, we can just simply remove the rows with missing data values. It
is the quickest way; we use the rest of the data to predict the values.
For smaller data sets, we can substitute missing values with the mean or average of the rest of the data using a pandas DataFrame in Python. There are different ways to do so; for example, the column means from df.mean() can be passed to df.fillna() to fill the missing values.
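A small pandas sketch of this idea is shown below; the DataFrame and its columns are made up purely for illustration.
# Hedged sketch: filling missing values with the column means
import numpy as np
import pandas as pd
df = pd.DataFrame({"age": [25, np.nan, 31, 29],
                   "salary": [50000, 62000, np.nan, 58000]})
df_filled = df.fillna(df.mean())   # each NaN is replaced by its column's mean
print(df_filled)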
10. For the given points, how will you calculate the
Euclidean distance in Python?
plot1 = [1,3]
plot2 = [2,5]
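One straightforward way to answer, assuming plain Python with the standard library, is to apply the distance formula directly:
# Hedged sketch: Euclidean distance between plot1 and plot2
import math
plot1 = [1, 3]
plot2 = [2, 5]
euclidean_distance = math.sqrt((plot1[0] - plot2[0]) ** 2 + (plot1[1] - plot2[1]) ** 2)
print(euclidean_distance)   # sqrt(1 + 4) = 2.236...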
Check out Simplilearn's video on "Data Science Interview Questions," curated by industry experts to help you prepare for an interview.
This reduction helps in compressing data and reducing storage space. It also reduces
computation time as fewer dimensions lead to less computing. It removes redundant
features; for example, there's no point in storing a value in two different units (meters
and inches).
12. How will you calculate eigenvalues and eigenvectors
of the following 3x3 matrix?
-2 -4 2
-2 1 2
4 2 5
Expanding the determinant of (A − λI):
−λ³ + 4λ² + 27λ − 90 = 0, i.e., λ³ − 4λ² − 27λ + 90 = 0
Checking λ = 3: 3³ − 4 × 3² − 27 × 3 + 90 = 27 − 36 − 81 + 90 = 0
Hence, (λ − 3) is a factor, and λ = 3 is an eigenvalue.
To find the corresponding eigenvector for λ = 3, set X = 1:
−5 − 4Y + 2Z = 0
−2 − 2Y + 2Z = 0
Subtracting the two equations gives 3 + 2Y = 0, so
Y = −(3/2)
Z = −(1/2)
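To check the hand calculation, one could also compute the eigenvalues and eigenvectors numerically with NumPy; this snippet is an add-on for verification, not part of the original answer.
# Hedged sketch: verifying the eigen-decomposition with NumPy
import numpy as np
A = np.array([[-2, -4, 2],
              [-2,  1, 2],
              [ 4,  2, 5]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # the roots of the characteristic polynomial: 3, -5, and 6 (in some order)
print(eigenvectors)   # columns are the corresponding eigenvectors (up to scaling)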
Monitor
Evaluate
Evaluation metrics of the current model are calculated to determine if a new algorithm
is needed.
Compare
The new models are compared to each other to determine which model performs the
best.
Rebuild
Collaborative Filtering
As an example, Last.fm recommends tracks that other users with similar interests play
often. This is also commonly seen on Amazon after making a purchase; customers may
notice the following message accompanied by product recommendations: "Users who
bought this also bought…"
Content-based Filtering
As an example: Pandora uses the properties of a song to recommend music with similar
properties. Here, we look at content, instead of looking at who else is listening to music.
The within-cluster sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid.
This indicates strong evidence against the null hypothesis; so you reject the null
hypothesis.
This indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.
Example: height of an adult = abc ft. This cannot be true, as the height cannot be a
string value. In this case, outliers can be removed.
If the outliers have extreme values, they can be removed. For example, if all the data
points are clustered between zero to 10, but one point lies at 100, then we can remove
this point.
● Try a different model. Data detected as outliers by linear models can be fit by
nonlinear models. Therefore, be sure you are choosing the correct model.
● Try normalizing the data. This way, the extreme data points are pulled to a
similar range.
● You can use algorithms that are less affected by outliers; an example would
be random forests.
In the second graph, the waves get bigger, which means it is non-stationary and the
variance is changing with time.
You can see the values for total data, actual values, and predicted values.
= 609 / 650
= 0.93
= 262 / 277
= 0.94
= 262 / 288
= 0.90
22. 'People who bought this also bought…'
recommendations seen on Amazon are a result of which
algorithm?
The recommendation engine is accomplished with collaborative filtering. Collaborative
filtering explains the behavior of other users and their purchase history in terms of
ratings, selection, etc.
The engine makes predictions on what might interest a person based on the
preferences of other users. In this algorithm, item features are unknown.
For example, a sales page shows that a certain number of people buy a new phone and
also buy tempered glass at the same time. Next time, when a person buys a phone, he
or she may see a recommendation to buy tempered glass as well.
23. Write a basic SQL query that lists all orders with
customer information.
Usually, we have order tables and customer tables that contain the following columns:
● Order Table: OrderId, CustomerId, OrderNumber, TotalAmount
● Customer Table: Id, FirstName, LastName, City, Country
The SQL query is:
SELECT OrderNumber, TotalAmount, FirstName, LastName, City, Country
FROM Order
JOIN Customer
ON Order.CustomerId = Customer.Id
Hence, to evaluate model performance, we should use Sensitivity (True Positive Rate), Specificity (True Negative Rate), and the F-measure to determine the class-wise performance of the classifier.
25. Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables?
● K-means clustering
● Linear regression
● K-NN (k-nearest neighbor)
● Decision trees
The K nearest neighbor algorithm can be used because it can compute the nearest
neighbor and if it doesn't have a value, it just computes the nearest neighbor based on
all the other features.
When you're dealing with K-means clustering or linear regression, you need to do that in
your pre-processing, otherwise, they'll crash. Decision trees also have the same
problem, although there is some variance.
26. Below are the eight actual values of the target variable
in the train file. What is the entropy of the target variable?
[0, 0, 0, 1, 1, 1, 1, 1]
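Using the formula entropy = −Σ p·log2(p), the target has three 0s and five 1s, so the entropy is −(3/8)·log2(3/8) − (5/8)·log2(5/8) ≈ 0.954. A quick Python check (an illustrative sketch, not part of the original answer):
# Hedged sketch: computing the entropy of the target variable
import math
values = [0, 0, 0, 1, 1, 1, 1, 1]
p0 = values.count(0) / len(values)   # 3/8
p1 = values.count(1) / len(values)   # 5/8
entropy = -(p0 * math.log2(p0) + p1 * math.log2(p1))
print(entropy)   # about 0.954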
1. Logistic Regression
2. Linear Regression
3. K-means clustering
4. Apriori algorithm
As we are looking to group people together specifically by four different similarities, it indicates the value of k. Therefore, K-means clustering (option 3) is the most appropriate algorithm for this study.
The True Positive Rate (TPR) is calculated as the ratio of the True Positives (TP) to the sum of the True Positives (TP) and False Negatives (FN):
TPR = TP / (TP + FN)
● The False Positive Rate (FPR) defines the probability that an actual negative result will be shown as a positive one, i.e., the probability that a model will generate a false alarm.
The False Positive Rate (FPR) is calculated as the ratio of the False Positives (FP) to the sum of the True Negatives (TN) and False Positives (FP).
The formula for the same is stated below:
FPR = FP / (TN + FP)
The False Positive Rate (FPR) is calculated by taking the ratio between False Positives
and the total number of negative samples, and the True Positive Rate (TPR) is
calculated by taking the ratio between True Positives and the total number of positive
samples.
In order to construct the ROC curve, the TPR and FPR values are plotted at multiple threshold values. The area under the ROC curve ranges between 0 and 1. A completely random model, which is represented by a straight diagonal line, has an area of 0.5. The amount of deviation the ROC has from this straight line denotes the efficiency of the model.
The image above denotes a ROC curve example.
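For completeness, a minimal scikit-learn sketch of the ROC values and the AUC is given below; the synthetic dataset and classifier choice are assumptions for illustration only.
# Hedged sketch: TPR/FPR at multiple thresholds and the area under the ROC curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]           # predicted probabilities for the positive class
fpr, tpr, thresholds = roc_curve(y_test, scores)   # FPR and TPR at multiple threshold values
print(roc_auc_score(y_test, scores))               # AUC: 0.5 for a random model, 1.0 for a perfect one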
WIDE FORMAT DATA: In the Wide Format Data, the data’s repeated responses will be in
a single row, and each response can be recorded in separate columns.
NAME HEIGHT
RAMA 182
SITA 160
● TensorFlow
● Pandas
● NumPy
● SciPy
● Scrapy
● Librosa
● Matplotlib
The goal of cross-validation is to set aside part of the data to test the model during the training phase (i.e., a validation data set) in order to limit problems like overfitting and gain insight into how the model will generalize to an independent data set.
65. What are the types of biases that can occur during
sampling?
1. Selection bias
2. Undercoverage bias
3. Survivorship bias
This exhaustive list is sure to strengthen your preparation for data science interview
questions.
Some of the popular machine learning algorithms which are low on the bias scale are -
Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Decision Trees.
The following things are observed regarding some of the popular machine learning
algorithms -
● The Support Vector Machine (SVM) algorithm has high variance and low bias. In order to change the trade-off, we can adjust the parameter C, which influences the margin violations allowed in the training dataset; a smaller C allows more violations, which increases bias and decreases variance.
● Similarly, the K-Nearest Neighbors (KNN) Machine Learning algorithm has high variance and low bias. To change the trade-off of this algorithm, we can increase the number of prediction-influencing neighbors by increasing the K value, thus increasing the model bias and reducing its variance.
The below diagram explains a step-by-step model of the Markov Chains whose output
depends on their current state.
A perfect example of Markov Chains is a word-recommendation system. In this system, the model recognizes and recommends the next word based only on the immediately preceding word and nothing before that. The Markov Chain is trained on earlier text (the training dataset) and accordingly generates recommendations for the current text based on the previous word.
Boxplots and histograms are both used to visualize the distribution of data. Boxplots are more often used when comparing several datasets; compared to histograms, they take less space and contain fewer details. Histograms are used to know and understand the probability distribution underlying a dataset.
The diagram above denotes a boxplot of a dataset.
72. What does NLP stand for?
NLP is short for Natural Language Processing. It deals with the study of how computers learn from massive amounts of textual data through programming. A few popular examples of NLP are stemming, sentiment analysis, tokenization, removal of stop words, etc.
Standardization vs. Normalization
Cracking a data science interview is no walk in the park. It requires in-depth knowledge and expertise in various topics. Furthermore, the projects that you have worked on can significantly boost your potential in a lot of interviews. In order to help you with your interviews, we have compiled a set of questions for you to relate to. Since data science is an extensive field, there are no limitations on the type of questions that can be asked. With that being said, you can answer each of these questions depending on
the projects you have worked on and the industries you have been in. Try to answer
each one of these sample questions and then share your answer with us through the
comments.
Pro Tip: No matter how basic a question may seem, always try to view it from a
technical perspective and use each question to demonstrate your unique technical skills
and abilities.
77. Which according to you is the most important skill that makes a good data
scientist?
80. How do you usually prefer working on a project - individually, small team, or large
team?
81. Based on your experience in the industry, tell me about your top 5 predictions for the
next 10 years.
82. What are some unique skills that you can bring to the team as a data scientist?
83. Were you always in the data science field? If not, what made you change your career
path and how did you upgrade your skills?
84. If we give you a random data set, how will you figure out whether it suits the
business needs or not?
85. Given a chance, if you could pick a career other than being a data scientist, what
would you choose?
86. Given the constant change in the data science field, how quickly can you adapt to
new technologies?
87. Have you ever been in a conflict with your colleagues regarding different strategies
to go about a project? How were you able to resolve it?
88. Can you break down an algorithm you have used on a recent project?
89. What tools did you use in your last project and why?
90. Think of the last technical problem that you solved. If you had no limitations with the
project’s budget, what would be the first thing you would do to solve the same problem?
91. When you are assigned multiple projects at the same time, how best do you
organize your time?
92. Tell me about a time when your project didn’t go according to plan and what you
learned from it.
93. Have you ever created an original algorithm? How did you go about doing that and
for what purpose?
94. What is your most favored strategy to clean a big data set and why?
Let's start with some commonly asked machine learning interview questions and
answers.
Supervised Learning
Unsupervised Learning
In unsupervised learning, we don't have labeled data. A model can identify patterns,
anomalies, and relationships in the input data.
Reinforcement Learning
Using reinforcement learning, the model can learn based on the rewards it received for
its previous action.
When a model is trained on the training data, it may show nearly 100 percent accuracy (technically, a very small loss). But when the test data is used, there may be errors and low efficiency. This condition is known as overfitting.
● Regularization. It involves a cost term for the features involved with the
objective function
● Making a simple model. With lesser variables and parameters, the variance
can be reduced
● Cross-validation methods like k-folds can also be used
● If some model parameters are likely to cause overfitting, techniques for
regularization like LASSO can be used that penalize these parameters
Training set:
● The training set is examples given to the model to analyze and learn
● 70% of the total data is typically taken as the training dataset
● This is labeled data used to train the model
Test set:
● The test set is used to test the accuracy of the hypothesis generated by the model
● The remaining 30% is taken as the testing dataset
● We test without labeled data and then verify results with labels
Consider a case where you have labeled data for 1,000 records. One way to train the
model is to expose all 1,000 records during the training process. Then you take a small
set of the same data to test the model, which would give good results in this case.
But, this is not an accurate way of testing. So, we set aside a portion of that data called
the ‘test set’ before starting the training process. The remaining data is called the
‘training set’ that we use for training the model. The training set passes through the
model multiple times until the accuracy is high, and errors are minimized.
Now, we pass the test data to check if the model can accurately predict the values and
determine if training is effective. If you get errors, you either need to change your model
or retrain it with more data.
Regarding the question of how to split the data into a training set and test set, there is
no fixed rule, and the ratio can vary based on individual preferences.
● isnull() and dropna() will help to find the columns/rows with missing data and drop them
● fillna() will replace the missing values with a placeholder value
5. How Can You Choose a Classifier Based on a Training
Set Data Size?
When the training set is small, a model that has high bias and low variance seems to work better because it is less likely to overfit. For example, Naive Bayes works best when the training set is small. When the training set is large, models with low bias and high variance tend to perform better, as they work fine with complex relationships.
Here,
Total No = 3+9 = 12
For a model to be accurate, the values across the diagonals should be high. The total
sum of all the values in the matrix equals the total observations in the test data set.
= (12+9) / 25
= 21 / 25
= 84%
False negatives are those cases that wrongly get classified as False but are True.
In the term ‘False Positive,’ the word ‘Positive’ refers to the ‘Yes’ row of the predicted
value in the confusion matrix. The complete term indicates that the system has
predicted it as a positive, but the actual value is negative.
So, looking at the confusion matrix, we get:
False-positive = 3
True positive = 12
Similarly, in the term ‘False Negative,’ the word ‘Negative’ refers to the ‘No’ row of the
predicted value in the confusion matrix. And the complete term indicates that the
system has predicted it as negative, but the actual value is positive.
False Negative = 1
True Negative = 9
● Model Building
Choose a suitable algorithm for the model and train it according to the
requirement
● Model Testing
Check the accuracy of the model through the test data
● Applying the Model
Make the required changes after testing and use the final model for real-time
projects
Here, it’s important to remember that once in a while, the model needs to be checked to
make sure it’s working correctly. It should be modified to make sure that it is up-to-date.
One of the primary differences between machine learning and deep learning is that
feature engineering is done manually in machine learning. In the case of deep learning,
the model consisting of neural networks will automatically determine which features to
use (and which not to use).
This is a commonly asked question in both Machine Learning interviews and Deep Learning interviews.
10. What Are the Differences Between Machine Learning
and Deep Learning?
Learn more: Difference Between AI,ML and Deep Learning
In the case of semi-supervised learning, the training data contains a small amount of
labeled data and a large amount of unlabeled data.
13. What Are Unsupervised Machine Learning Techniques?
There are two techniques used in unsupervised learning: clustering and association.
Clustering
Clustering problems involve data to be divided into subsets. These subsets, also called
clusters, contain data that are similar to each other. Different clusters reveal different
details about the objects, unlike classification or regression.
Association
In an association problem, we identify patterns of associations between different
variables or items.
For example, an e-commerce website can suggest other items for you to buy, based on
the prior purchases that you have made, spending habits, items in your wishlist, other
customers’ purchase habits, and so on.
K-means vs. KNN
The algorithm assumes that the presence of one feature of a class is not related to the
presence of any other feature (absolute independence of features), given the class
variable.
For instance, a fruit may be considered to be a cherry if it is red in color and round in
shape, regardless of other features. This assumption may or may not be right (as an
apple also matches the description).
With reinforced learning, we don’t have to deal with this problem as the learning agent
learns by playing the game. It will make a move (decision), check if it’s the right move
(feedback), and keep the outcomes in memory for the next step it takes (learning).
There is a reward for every correct decision the system takes and punishment for the
wrong one.
● Predicting yes or no
● Estimating gender
● Breed of an animal
● Type of color
Underfitting: High bias can cause an algorithm to miss the relevant relations between
features and target outputs.
Variance
Variance refers to the amount the target model will change when trained with different
training data. For a good model, the variance should be minimized.
Overfitting: High variance can cause an algorithm to model the random noise in the
training data rather than the intended outputs.
Necessarily, if you make the model more complex and add more variables, you’ll lose
bias but gain variance. To get the optimally-reduced amount of error, you’ll have to trade
off bias and variance. Neither high bias nor high variance is desired.
High bias and low variance algorithms train models that are consistent, but inaccurate
on average.
High variance and low bias algorithms train models that are accurate but inconsistent.
27. Define Precision and Recall.
Precision
Precision is the ratio of the number of events you correctly recall to the total number of events you recall (a mix of correct and wrong recalls).
Recall
Recall is the ratio of the number of events you can correctly recall to the total number of actual events.
● Starting at the leaves, each node is replaced with its most popular class
● If the prediction accuracy is not affected, the change is kept
● There is an advantage of simplicity and speed
The output of logistic regression is either a 0 or 1 with a threshold value of generally 0.5.
Any value above 0.5 is considered as 1, and any point below 0.5 is considered as 0.
In K nearest neighbors, K can be an integer greater than 1. So, for every new data point,
we want to classify, we compute to which neighboring group it is closest.
Let us classify an object using the following example. Consider there are three clusters:
● Football
● Basketball
● Tennis ball
Let the new data point to be classified is a black ball. We use KNN to classify it. Assume
K = 5 (initially).
When multiple classes are involved, we prefer the majority. Here the majority is with the
tennis ball, so the new data point is assigned to this cluster.
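A minimal scikit-learn sketch of this idea, using an assumed synthetic three-class dataset in place of the ball example, is:
# Hedged sketch: KNN classification with K = 5
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
X, y = make_classification(n_samples=150, n_classes=3, n_informative=4, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)   # K = 5 neighbors
knn.fit(X, y)
print(knn.predict(X[:1]))                   # class decided by the majority among the 5 nearest neighbors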
Now that you have gone through these machine learning interview questions, you must
have got an idea of your strengths and weaknesses in this domain.
F1 = 2 * (P * R) / (P + R)
The F1 score is one when both Precision and Recall scores are one.
Covariance: Covariance tells us the direction of the linear relationship between two
random variables. It can take any value between - ∞ and + ∞.
Example: A Random Forest with 100 trees can provide much better results than using
just one decision tree.
41. What is Cross-Validation?
Cross-Validation in Machine Learning is a statistical resampling technique that uses
different parts of the dataset to train and test a machine learning algorithm on different
iterations. The aim of cross-validation is to test the model’s ability to predict a new set
of data that was not used to train the model. Cross-validation avoids the overfitting of
data.
K-Fold Cross Validation is the most popular resampling technique that divides the whole
dataset into K sets of equal sizes.
Gini Impurity: Splitting the nodes of a decision tree using Gini Impurity is followed when
the target variable is categorical.
● Multivariate normality
● No auto-correlation
● Homoscedasticity
● Linear relationship
● No or little multicollinearity
Looking forward to a successful career in AI and Machine learning. Enrol in our Artificial
Intelligence Course in collaboration with Caltech University now.
Become Part of the Machine Learning Talent Pool
With technology ramping up, jobs in the field of data science and AI will continue to be
in demand. Candidates who upgrade their skills and become well-versed in these
emerging technologies can find many job opportunities with impressive salaries.
Looking forward to becoming a Machine Learning Engineer? Enroll in Simplilearn's AI
and ML Course and get certified today. Based on your experience level, you may be
asked to demonstrate your skills in machine learning, additionally, but this depends
mostly on the role you’re pursuing. These machine learning interview questions and
answers will prepare you to clear your interview on the first attempt!
Apart from the above mentioned interview questions, it is also important to have a fair
understanding of frequently asked Data Science interview questions.
Considering this trend, Simplilearn offers AI and Machine Learning certification course
to help you gain a firm hold of machine learning concepts. This course is well-suited for
those at the intermediate level, including:
● Analytics managers
● Business analysts
● Information architects
● Developers looking to become data scientists
● Graduates seeking a career in data science and machine learning
Facing the machine learning interview questions would become much easier after you
complete this course.
Top Data Science Interview Questions And
Answers
Data Science is among the leading and most popular technologies in the world today.
Major organizations are hiring professionals in this field, and the demand for them currently outstrips the supply of skilled candidates. This Data Science Interview preparation blog includes the most frequently asked questions in Data Science job interviews. Here is a list of these questions:
Q9. What is the difference between the long format data and wide format data?
Q10. Mention some techniques used for sampling. What is the main advantage of
sampling?
Data Science is a field of computer science that explicitly deals with turning data into
information and extracting meaningful insights out of it. The reason why Data Science is
so popular is that the kind of insights it allows us to draw from the available data has led to some major innovations in several products and companies. Using these insights, we are able to determine the taste of a particular customer, the likelihood of a product succeeding in a particular market, and so on.
Check out Our Data Science Course in Kolkata and become a certified Data Scientist!
The goal of data analytics is to illustrate the precise details of retrieved insights. The goal of data science is to discover meaningful insights from massive datasets and derive the best possible solutions to resolve business issues.
Become an expert Data Scientist. Enroll now in our PG program in Data Science.
Linear regression helps in understanding the linear relationship between the dependent and the independent variables. It is a supervised algorithm which helps in finding the linear relationship between two variables. One is the predictor or the independent variable, and the other is the response or the dependent variable. In Linear Regression, we try to understand how the dependent variable changes w.r.t. the independent variable. If there is only one independent variable, then it is called simple linear regression, and if there is more than one independent variable, then it is known as multiple linear regression.
Logistic regression is a classification algorithm that can be used when the dependent variable is binary. Let's take an example: here, we are trying to determine whether it will rain or not on the basis of temperature and humidity. Temperature and humidity are the independent variables, and rain would be the binary dependent variable. So, the logistic regression algorithm actually produces an S-shaped curve.
Now, let us look at another scenario: Let’s suppose that x-axis represents the runs
scored by Virat Kohli and the y-axis represents the probability of the team India winning
the match. From this graph, we can say that if Virat Kohli scores more than 50 runs,
then there is a greater probability for team India to win the match. Similarly, if he scores
less than 50 runs then the probability of team India winning the match is less than 50
percent.
So, basically in logistic regression, the Y value lies within the range of 0 and 1. This is how logistic regression works.
The confusion matrix is a table that is used to estimate the performance of a model. It
tabulates the actual values and the predicted values in a 2×2 matrix.
True Positive (d): This denotes all of those records where the actual values are true and
the predicted values are also true. So, these denote all of the true positives. False
Negative (c): This denotes all of those records where the actual values are true, but the
predicted values are false. False Positive (b): In this, the actual values are false, but the
predicted values are true. True Negative (a): Here, the actual values are false and the
predicted values are also false. So, if you want to get the correct values, then they would basically be represented by all of the true positives and the true negatives. This is how the confusion matrix works.
What do you understand by the true-positive rate and the false-positive rate?
True positive rate: In Machine Learning, the true-positive rate, which is also referred to as sensitivity or recall, is used to measure the percentage of actual positives that are correctly identified.
False positive rate: The false-positive rate is basically the probability of falsely rejecting the null hypothesis for a particular test. It is calculated as the ratio between the number of negative events wrongly categorized as positive (false positives) and the total number of actual negative events: FPR = FP / (FP + TN).
How is Data Science different from traditional application programming?
Data Science takes a fundamentally different approach from traditional programming in building systems that provide value from data.
In traditional programming paradigms, we used to analyze the input, figure out the
expected output, and write code, which contains rules and statements needed to
transform the provided input into the expected output. As we can imagine, these rules
were not easy to write, especially for data that even computers had a hard time understanding, such as images and videos.
Data Science shifts this process a little bit. In it, we need access to large volumes of
data that contain the necessary inputs and their mappings to the expected outputs.
Then, we use Data Science algorithms, which use mathematical analysis to generate rules that map the given inputs to the expected outputs.
This process of rule generation is called training. After training, we use some data that
was set aside before the training phase to test and check the system’s accuracy. The
generated rules are a kind of a black box, and we cannot understand how the inputs are being transformed into outputs. If the accuracy is satisfactory, we can use the trained system (the model).
As described above, in traditional programming, we had to write the rules to map the
input to the output, but in Data Science, the rules are automatically generated or
learned from the given data. This helped solve some really difficult challenges that were very hard to address with hand-written rules.
Interested to learn Data Science skills? Check our Data Science course in Kottayam
Now!
What is the difference between supervised learning and unsupervised learning?
Supervised and unsupervised learning are two types of Machine Learning techniques.
They both allow us to build models. However, they are used for solving different kinds of
problems.
Supervised learning:
● Works on data that contains both the inputs and the expected output, i.e., labeled data
● Used to create models that can be employed to predict or classify things
● Commonly used algorithms: linear regression, decision tree, etc.
Unsupervised learning:
● Works on data that contains no mappings from input to output, i.e., unlabeled data
● Used to extract meaningful information out of large volumes of data
● Commonly used algorithms: K-means clustering, Apriori algorithm, etc.
Long format data:
● Has a column for possible variable types and a column for the values of those variables
● Each row represents one time point per subject; as a result, each subject will have many rows of data
● Most typically used in R analysis and for writing to log files at the end of each experiment
● Contains values that do repeat in the first column
● Use df.melt() to convert wide form to long form
Wide format data:
● Has a column for each variable
● The repeated responses of a subject will be in a single row, with each response in its own column
● Most widely used in data manipulations and in stats programs for repeated-measures ANOVAs; seldom used in R analysis
● Contains values that do not repeat in the first column
● Use df.pivot().reset_index() to convert long form into wide form
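A small pandas sketch of converting between the two formats, reusing the NAME/HEIGHT example above plus an assumed WEIGHT column, is shown below.
# Hedged sketch: wide <-> long conversion with pandas
import pandas as pd
wide = pd.DataFrame({"NAME": ["RAMA", "SITA"], "HEIGHT": [182, 160], "WEIGHT": [68, 55]})
long = wide.melt(id_vars="NAME", var_name="variable", value_name="value")                   # wide -> long
back_to_wide = long.pivot(index="NAME", columns="variable", values="value").reset_index()   # long -> wide
print(long)
print(back_to_wide)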
Sampling is defined as the process of selecting a sample from a group of people or from any particular kind of population for research purposes. It is one of the most important factors that determines the accuracy of a research or survey result.
Probability sampling: It involves random selection, which gives every element a chance of being selected. The main probability sampling methods are listed below:
● Stratified sampling
● Systematic sampling
● Cluster Sampling
● Multi-stage Sampling
Non-probability sampling: It involves non-random selection, which means the selection is done based on your ease or any other required criteria. This helps to collect the data easily. The following are the main non-probability sampling methods:
○ Convenience Sampling
○ Purposive Sampling
○ Quota Sampling
Bias is a type of error that occurs in a Data Science model because of using an algorithm that is not strong enough to capture the underlying patterns or trends that exist in the data. In other words, this error occurs when the data is too complicated for the algorithm to understand, so it ends up building a model that makes simple assumptions. This leads to lower accuracy because of underfitting. Algorithms that can lead to high bias are linear regression, logistic regression, etc.
Dimensionality reduction is the process of dropping some fields or columns from the dataset. However, this is not done
haphazardly. In this process, the dimensions or fields are dropped only after making
sure that the remaining information will still be enough to succinctly describe similar
information.
Data Scientists have to clean and transform the huge data sets in a form that they can
work with. It is important to deal with the redundant data for better results by removing it. Python libraries (e.g., Pandas and NumPy) are used for data cleaning and analysis. These libraries are used to load and clean the data and do effective analysis. For example, a CSV file named "Student" has information about the students of an institute like their names, standard, address, phone number, etc.
R provides one of the best ecosystems for data analysis and visualization, with a very large number of packages available in open-source repositories. It also has huge community support, which means you can easily find the solution to your problems on various platforms like StackOverflow.
It has better data management and supports distributed computing by splitting the operations between multiple tasks and nodes, which eventually decreases the processing time.
Below are the popular libraries used for data extraction, cleaning, visualization, and
deploying DS models:
● Matplotlib: Used for plotting graphs and charts
● Pandas: Used to implement the ETL (Extracting, Transforming, and Loading) operations on data
● PyTorch: Best for projects which involve Machine Learning algorithms and deep neural networks
Interested to learn more about Data Science, check out our Data Science Course in
New York!
Variance is a type of error that occurs in a Data Science model when the model ends up
being too complex and learns features from data, along with the noise that exists in it.
This kind of error can occur if the algorithm used to train the model has high complexity,
even though the data and the underlying patterns and trends are quite easy to discover.
This makes the model a very sensitive one that performs well on the training dataset but
poorly on the testing dataset, and on any kind of data that the model has not yet seen.
Pruning a decision tree is the process of removing the sections of the tree that are not
necessary or are redundant. Pruning leads to a smaller decision tree, which performs better and gives higher accuracy and speed.
The entropy of a given dataset tells us how pure or impure the values of the dataset are. In simple terms, it tells us about the variance in the dataset.
For example, suppose we are given a box with 10 blue marbles. Then, the entropy of the box is 0 as it contains marbles of the same color, i.e., there is no impurity. If we need to draw a marble from the box, the probability of it being blue will be 1.0. However, if we replace 4 of the blue marbles with 4 red marbles in the box, then the entropy increases to about 0.97.
How is the best feature chosen to split the data in a decision tree algorithm?
When building a decision tree, at each step, we have to create a node that decides
which feature we should use to split data, i.e., which feature would best separate our
data so that we can make predictions. This decision is made using information gain,
which is a measure of how much entropy is reduced when a particular feature is used to
split the data. The feature that gives the highest information gain is the one that is chosen to split the data at that node.
Explore this Data Science Course in Delhi and master decision tree algorithm.
In k-fold cross-validation, we divide the dataset into k equal parts. After this, we loop
over the entire dataset k times. In each iteration of the loop, one of the k parts is used
for testing, and the other k − 1 parts are used for training. Using k-fold cross-validation,
each one of the k parts of the dataset ends up being used for training and testing
purposes.
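As an illustration, a minimal k-fold cross-validation sketch with scikit-learn (the dataset and classifier are assumptions made for the example) could be:
# Hedged sketch: 5-fold cross-validation
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
X, y = load_iris(return_X_y=True)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)                       # k = 5 equal parts
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)   # one score per fold
print(scores, scores.mean())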
A recommender system predicts how a user would rate or prefer particular content. These systems generate recommendations based on what they know about the users and their preferences.
For example, imagine that we have a movie streaming platform, similar to Netflix or
Amazon Prime. If a user has previously watched and liked movies from action and
horror genres, then it means that the user likes watching the movies of these genres. In
that case, it would be better to recommend such movies to this particular user. These
recommendations can also be generated based on what users with a similar taste like
watching.
Data distribution is a visualization tool to analyze how data is spread out or distributed.
Data can be distributed in various ways. For instance, it could be with a bias to the left or to the right, or it could all be jumbled up.
Data may also be distributed around a central value, i.e., mean, median, etc. This kind of distribution has no bias either to the left or to the right and is in the form of a bell-shaped curve. This distribution also has its mean equal to the median. This kind of distribution is called a normal distribution.
Deep Learning is a kind of Machine Learning, in which neural networks are used to
imitate the structure of the human brain, and just like how a brain learns from
information, machines are also made to learn from the information that is provided to
them.
Deep Learning is an advanced version of neural networks to make the machines learn
from data. In Deep Learning, the neural networks comprise many hidden layers (which
is why it is called ‘deep’ learning) that are connected to each other, and the output of the previous layer serves as the input to the next layer.
A recurrent neural network, or RNN for short, is a kind of Machine Learning algorithm
that makes use of the artificial neural network. RNNs are used to find patterns from a
sequence of data, such as time series, stock market, temperature, etc. RNNs are a kind
of feedforward network, in which information from one layer passes to another layer,
and each node in the network performs mathematical operations on the data. These
operations are temporal, i.e., RNNs store contextual information about previous
operations on some data every time it is passed. However, the output may be different each time, depending on the results of previous computations.
The ROC curve is a plot between the true positive rate and the false positive rate, and it helps us to find out the right tradeoff
between the true positive rate and the false positive rate for different probability
thresholds of the predicted values. So, the closer the curve to the upper left corner, the
better the model is. In other words, whichever curve has greater area under it that would
be the better model. You can see this in the below graph:
A decision tree is a supervised learning algorithm that is used for both classification and
regression. Hence, in this case, the dependent variable can be both a numerical value and a categorical value. In a decision tree, each internal node represents a test on an attribute, each branch represents the outcome of that attribute test, and each leaf node holds the class label. So, in this case, we have a
series of test conditions which give the final decision according to the condition.
Are you interested in learning Data Science from experts? Enroll in our Data Science course now!
It combines multiple models together to get the final output or, to be more precise, it
combines multiple decision trees together to get the final output. So, decision trees are the building blocks of the random forest model.
Suppose the probability of Aman getting selected for the job is 1/8 and the probability of Mohan getting selected is 5/12. What is the probability that at least one of them gets selected for the interview?
P(A) = 1/8
P(B) = 5/12
Now, the probability of at least one of them getting selected can be denoted as the union of the two events:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
where P(A ∩ B) stands for the probability of both Aman and Mohan getting selected for the job.
To calculate the final answer, we first have to find out the value of P(A ∩ B). Treating the two selections as independent:
P(A ∩ B) = 1/8 × 5/12 = 5/96
Therefore, P(A ∪ B) = 1/8 + 5/12 − 5/96 = 12/96 + 40/96 − 5/96 = 47/96
Data modeling creates a conceptual model based on the relationship between various
data models. The process involves moving from the conceptual stage to the logical
model to the physical schema. It involves the systematic method of applying data
modeling techniques.
Database Design: This is the process of designing the database. The database design
creates an output which is a detailed data model of the database. Strictly speaking, database design includes the detailed logical model of a database, but it can also include physical design choices and storage parameters.
Precision: When we are implementing algorithms for the classification of data or the
retrieval of information, precision helps us get the portion of positive class values that are correctly predicted as positive.
Recall: It is the set of all positive predictions out of the total number of positive instances. Recall helps us identify the misclassified positive predictions. We use the formula Recall = TP / (TP + FN) to calculate it.
The F1 score helps us calculate the harmonic mean of precision and recall, which gives us the test's accuracy. If F1 = 1, then precision and recall are accurate. If F1 < 1 or equal to 0, then precision or recall is less accurate, or they are completely inaccurate. See the formula below:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
The p-value is a measure of statistical significance: it is the probability that shows the significance of the output with respect to the data. We compute the p-value to know the test statistics of a model. Typically, it helps us choose whether we can accept or reject the null hypothesis.
We use the p-value to understand whether the given data really describes the observed effect or not. The p-value for an effect 'E' is the probability of observing that effect given that the null hypothesis 'H0' is true, i.e., p = P(E | H0).
What is the difference between an error and a residual error?
An error occurs in values while the prediction gives us the difference between the
observed values and the true values of a dataset. Whereas, the residual error is the
difference between the observed values and the predicted values. The reason we use
the residual error to evaluate the performance of an algorithm is that the true values are
never known. Hence, we use the observed values to measure the error using residuals.
The summary function provides summary statistics for individual objects when they are fed into it. We use a summary function when we want information about the values present in the dataset. It gives the summary statistics of the data. Also, it provides the median, mean, 1st quartile, and 3rd quartile values that help us understand the distribution of the data.
How are Data Science and Machine Learning different from each other?
Data Science and Machine Learning are two terms that are closely related but are often
misunderstood. Both of them deal with data. However, there are some fundamental
distinctions that show us how they are different from each other.
Data Science is a broad field that deals with large volumes of data and allows us to
draw insights out of this voluminous data. The entire process of Data Science takes
care of multiple steps that are involved in drawing insights out of the available data. This
process includes crucial steps such as data gathering, data analysis, data manipulation,
Machine Learning, on the other hand, can be thought of as a sub-field of Data Science.
It also deals with data, but here, we are solely focused on learning how to convert the
processed data into a functional model, which can be used to map inputs to outputs,
e.g., a model that can expect an image as an input and tell us if that image contains a
flower as an output.
In short, Data Science deals with gathering data, processing it, and finally, drawing
insights from it. The part of Data Science that deals with building models from this processed data, using algorithms, is Machine Learning, which is therefore an integral part of Data Science.
39. Explain univariate, bivariate, and multivariate analyses.
When we are dealing with data analysis, we often come across terms such as
univariate, bivariate, and multivariate. Let’s try and understand what these mean.
● Univariate analysis: Univariate analysis involves analyzing data with only one variable or, in other words, a single column or a vector of the data. This analysis allows us to understand the data and extract patterns and trends out of it. Example: Analyzing the heights of a group of people.
● Bivariate analysis: Bivariate analysis involves analyzing the data with exactly two variables or, in other words, the data can be put into a two-column table. This kind of analysis allows us to figure out the relationship between the two variables. Example: Analyzing data that contains temperature and altitude.
● Multivariate analysis: Multivariate analysis involves analyzing the data with more than two variables. The number of columns of the data can be anything more than two. This kind of analysis allows us to figure out the effects of all the other variables on a single variable. Example: Analyzing data about house prices, which contains information about the houses, such as locality, crime rate, area, the number of floors, etc.
To be able to handle missing data, we first need to know the percentage of data missing in a particular column, so that we can choose an appropriate strategy to handle the situation.
For example, if in a column the majority of the data is missing, then dropping the column is the best option, unless we have some means to make educated guesses about the missing values. However, if the amount of missing data is low, then we have several strategies to fill them up.
One way would be to fill them all up with a default value or a value that has the highest frequency in that column, such as 0 or 1, etc. This may be useful if the majority of the data in that column contains these values.
Another way is to fill up the missing values in the column with the mean of all the values in that column. This technique is usually preferred as the missing values have a higher chance of being closer to the mean than to any other value.
Finally, if we have a huge dataset and only a few rows have values missing in some columns, then the easiest and fastest way is to drop those rows. Since the dataset is large, dropping a few rows should not cause any problem.
Dimensionality reduction reduces the dimensions and size of the entire dataset. It drops
unnecessary features while retaining the overall information in the data intact.
The reason why data with high dimensions is considered so difficult to deal with is that it
leads to high time consumption while processing the data and training a model on it.
Reducing dimensions speeds up this process, removes noise, and also leads to better
model accuracy.
An ideal model is one that has low bias and low variance. We know that bias and variance are both errors that reduce the accuracy of a model. Therefore, when we are building a model, the goal of getting high accuracy is only going to be achieved if we are aware of the tradeoff between bias and variance.
Bias is an error that occurs when a model is too simple to capture the patterns in a
dataset. To reduce bias, we need to make our model more complex. Although making
the model more complex can lead to reducing bias, and if we make the model too
complex, it may end up becoming too rigid, leading to high variance. So, the tradeoff
between bias and variance is that if we increase the complexity, the bias reduces and
the variance increases, and if we reduce complexity, the bias increases and the
variance reduces. Our goal is to find a point at which our model is complex enough to
give low bias but not so complex to end up having high variance.
RMSE stands for the root mean square error. It is a measure of accuracy in regression.
First, we calculate the errors in the predictions made by the regression model. For this,
we calculate the differences between the actual and the predicted values. Then, we square these differences.
After this step, we calculate the mean of the squared errors, and finally, we take the square root of this mean. This number is the RMSE; a model with a lower value of RMSE is considered to produce lower errors, i.e., the model will be more accurate.
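As a quick numeric illustration (with made-up actual and predicted values), RMSE can be computed as follows:
# Hedged sketch: RMSE = sqrt(mean of squared errors)
import numpy as np
actual = np.array([3.0, 5.0, 7.5, 10.0])        # illustrative values only
predicted = np.array([2.5, 5.5, 7.0, 11.0])
errors = actual - predicted
rmse = np.sqrt(np.mean(errors ** 2))            # square the errors, average them, take the root
print(rmse)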
A kernel function is used to transform data in algorithms such as SVMs. In simple terms, a kernel function takes data as input and converts it into a required form. This
transformation of the data is based on something called a kernel trick, which is what
gives the kernel function its name. Using the kernel function, we can transform the data
that is not linearly separable (cannot be separated using a straight line) into one that is
linearly separable.
How do we choose an appropriate value of k in k-means?
We make use of the elbow method to pick the appropriate k value. To do this, we run the
k-means algorithm on a range of values, e.g., 1 to 15. For each value of k, we compute
an average score. This score is also called inertia or the inter-cluster variance.
This is calculated as the sum of squares of the distances of all values in a cluster. As k
starts from a low value and goes up to a high value, we start seeing a sharp decrease in
the inertia value. After a certain value of k, in the range, the drop in the inertia value
becomes quite small. This is the value of k that we need to choose for the k-means
clustering algorithm.
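A minimal sketch of the elbow method with scikit-learn, on an assumed synthetic dataset, is given below; one would normally plot the inertia values and look for the "elbow".
# Hedged sketch: inertia (within-cluster sum of squares) for k = 1..15
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
inertias = []
for k in range(1, 16):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)      # sum of squared distances to the nearest cluster centre
print(inertias)                       # the drop flattens out after the appropriate k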
Outliers can be dealt with in several ways. One way is to drop them. We can only drop
the outliers if they have values that are incorrect or extreme. For example, if a dataset
with the weights of babies has a value 98.6-degree Fahrenheit, then it is incorrect. Now,
if the value is 187 kg, then it is an extreme value, which is not useful for our model.
In case the outliers are not that extreme, then we can try:
● A different kind of model. For example, if we were using a linear model, then
● Normalizing the data, which will shift the extreme values closer to other data
points
● Using algorithms that are not so affected by outliers, such as random forest,
etc.
In a binary classification algorithm, we have only two labels, which are True and False.
Before we can calculate the accuracy, we need to understand a few key terms:
To calculate the accuracy, we need to divide the number of correctly classified points (the true positives and true negatives) by the total number of points.
When building a model, the goal is to get a model that can understand the underlying trends in the training data and can make predictions or classifications with high accuracy. However, sometimes some datasets are very complex, and it is difficult for one model to be able to grasp the underlying trends in these datasets. In such situations, we combine several individual models to improve performance; this technique is called ensemble learning.
Collaborative filtering is a technique used to build recommender systems. In this technique, to generate recommendations, we make use of data about the likes and dislikes of users similar to other users. This similarity is estimated based on several varying factors.
If User A, similar to User B, watched and liked a movie, then that movie will be recommended to User B, and similarly, if User B watched and liked a movie, then that movie will be recommended to User A.
In other words, the content of the movie does not matter much. When recommending it to a user, what matters is whether other users similar to that particular user liked the content of the movie or not. Collaborative filtering is one of the popular techniques used to build recommender systems.
Content-based filtering is one of the techniques used to build recommender systems. In
this technique, recommendations are generated by making use of the properties of the content that a user is interested in.
For example, if a user is watching movies belonging to the action and mystery genres and giving them good ratings, it is a clear indication that the user likes movies of this kind, so movies with similar properties can be recommended to the user.
In other words, here, the content of the movie is taken into consideration when generating the recommendations.
Bagging is an ensemble learning technique. In this technique, we generate some data using the bootstrap method, in which we use an already existing dataset and generate multiple samples of size N. This bootstrapped data is then used to train multiple models in parallel, which makes the bagging model more robust than a simple model.
Once all the models are trained, when it's time to make a prediction, we make predictions using all the trained models and then average the result in the case of regression, and for classification, we choose the result generated by the models that has the highest frequency.
Boosting is another ensemble learning method. Unlike in bagging, where bootstrapped samples are used to parallelly train our models, in boosting, we create multiple models and sequentially train them by combining weak models iteratively in a way that training a new model depends on the models trained before it.
In doing so, we take the patterns learned by a previous model and test them on a dataset when training the new model. In each iteration, we give more importance to the observations in the dataset that were incorrectly handled or predicted by the previous models.
Just like bagging and boosting, stacking is also an ensemble learning method. In
bagging and boosting, we could only combine weak models that used the same learning
algorithms, e.g., logistic regression. These models are called homogeneous learners.
However, in stacking, we can combine weak models that use different learning
algorithms as well. These learners are called heterogeneous learners. Stacking works
by training multiple (and different) weak models or learners and then using them together to combine their predictions.
How is Machine Learning different from Deep Learning?
A field of computer science, Machine Learning is a subfield of Data Science that deals with using existing data to help systems automatically learn new skills to perform different tasks.
Deep Learning, on the other hand, is a part of Machine Learning in which we build Machine Learning models using algorithms that try to imitate the process of how the
human brain learns from the information in a system for it to attain new capabilities. In
Deep Learning, we make heavy use of deeply connected neural networks with many
layers.
Naive Bayes is a Data Science algorithm. It has the word ‘Bayes’ in it because it is
based on the Bayes theorem, which deals with the probability of an event occurring given that another related event has already occurred.
It has ‘naive’ in it because it makes the assumption that each variable in the dataset is independent of the other. This kind of assumption is unrealistic for real-world data. However, even with this assumption, it is very useful for solving a range of complicated problems.
those rows where the ‘price’ value is greater than 1000 and
library(ggplot2)
determined by ‘cut.’
The ggplot is based on the grammar of data visualization, and it helps us stack multiple layers on top of one another. First, we put the data layer; on top of the data layer, we stack the aesthetic layer. Finally, on top of the aesthetic layer we will stack the geometry layer.
Code:
library(missForest)
iris.mis<-prodNA(iris,noNA=0.25)
For imputing the ‘Sepal.Length’ column with ‘mean’ and the ‘Petal.Length’ column with
‘median,’ we will be using the Hmisc package and the impute function:
library(Hmisc)
iris.mis$Sepal.Length<-with(iris.mis, impute(Sepal.Length,mean))
iris.mis$Petal.Length<-with(iris.mis, impute(Petal.Length,median))
Here, we need to find how ‘mpg’ varies w.r.t displacement of the column.
We need to divide this data into the training dataset and the testing dataset so that the
So, what happens is when we do not divide the dataset into these two components, it
overfits the dataset. Hence, when we add new data, it fails miserably on that new data.
Therefore, to divide this dataset, we would require the caret package. This caret
package comprises the createDataPartition() function. This function will give the true or
false labels.
Here, we will use the following code:
library(caret)
createDataPartition(mtcars$mpg, p=0.65, list=FALSE)->split_tag
mtcars[split_tag,]->train
mtcars[-split_tag,]->test
lm(mpg~disp, data=train)->mod_mtcars
predict(mod_mtcars, newdata=test)->pred_mtcars
head(pred_mtcars)
Explanation:
Parameters of the createDataPartition function: the first is the column which determines the split (here, the mpg column). The second is the split ratio, which is 0.65, i.e., 65 percent of the records will have true labels and 35 percent will have false labels. We will store this in a split_tag object.
Once we have split_tag object ready, from this entire mtcars dataframe, we will select all
those records where the split tag value is true and store those records in the training
set.
Similarly, from the mtcars dataframe, we will select all those records where the split_tag value is false and store them in the test set.
So, the split tag will have true values in it, and when we put the ‘-’ symbol in front of it, ‘-split_tag’ will contain all of the false labels. We will select all those records and store them in the test set.
lm(mpg~disp, data=train)->mod_mtcars
Now, we have built the model on top of the train set. It’s time to predict the values on top
of the test set. For that, we will use the predict function that takes in two parameters:
first is the model which we have built and second is the dataframe on which we have to
predict values.
Thus, we have to predict values for the test set and then store them in pred_mtcars.
predict(mod_mtcars,newdata=test)->pred_mtcars
Output:
These are the predicted values of mpg for all of these cars.
So, this is how we can build a simple linear model on top of this mtcars dataset.
The model predicts y values from the x values, but there is always an error associated with this prediction. So, to get an estimate of this error, we calculate the RMSE:
as.data.frame(final_data)->final_data
error<-(final_data$Actual-final_data$Prediction)
cbind(final_data,error)->final_data
sqrt(mean(final_data$error^2))
Explanation: We have the actual and the predicted values. We will bind both of them
into a single dataframe. For that, we will use the cbind function:
cbind(Actual=test$mpg, predicted=pred_mtcars)->final_data
Our actual values are present in the mpg column from the test set, and our predicted
values are stored in the pred_mtcars object which we have created in the previous
question. Hence, we will create this new column and name the column actual. Similarly,
we will create another column and name it predicted which will have predicted values
and then store the predicted values in the new object which is final_data. After that, we
will convert a matrix into a dataframe. So, we will use the as.data.frame function and
as.data.frame(final_data)->final_data
We pass in this final_data object and store the result in final_data again. We then calculate the error in prediction for each of the records by subtracting the predicted values from the actual values:
error <- (final_data$Actual - final_data$Predicted)
We store this result in a new object named error. After this, we bind the error object to final_data and store the result in final_data again:
cbind(final_data, error) -> final_data
Calculating RMSE:
sqrt(mean(final_data$error^2))
Output:
[1] 4.334423
Note: The lower the value of RMSE, the better the model. R and Python are two of the most widely used languages for building such models. Next, we can build a similar linear regression model in Python. First, check the shape of the data:
data.shape
Let us take out the dependent and the independent variables from the dataset:
data1 = data.loc[:, ['lstat', 'medv']]
data1.head()
Visualizing Variables
import matplotlib.pyplot as plt
data1.plot(x='lstat', y='medv', style='o')
plt.xlabel('lstat')
plt.ylabel('medv')
plt.show()
Here, ‘medv’ is basically the median values of the price of the houses, and we are trying
to find out the median values of the price of the houses with respect to the 'lstat' column.
We will separate the dependent and the independent variable from this entire
dataframe:
data1=data.loc[:,[‘lstat’,’medv’]]
The only columns we want from all of these records are 'lstat' and 'medv,' and we need to plot them to visualize the relationship:
data1.plot(x='lstat', y='medv', style='o')
plt.xlabel('lstat')
plt.ylabel('medv')
plt.show()
X = pd.DataFrame(data1['lstat'])
y = pd.DataFrame(data1['medv'])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
print(regressor.intercept_)
Output :
34.12654201
print(regressor.coef_)  # this is the slope
Output :
[[-0.913293]]
By now, we have built the model. Now, we have to predict the values on top of the test
set:
y_pred = regressor.predict(X_test)  # pass the X_test object to the predict function and store the result in y_pred
Now, let’s have a glance at the rows and columns of the actual values and the predicted
values:
y_pred.shape, y_test.shape
Output :
((102,1),(102,1))
Further, we will go ahead and calculate some metrics so that we can find out the Mean Absolute Error:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
Output:
Mean Absolute Error: 4.692198
str(heart)
In the structure of this dataframe, most of the values are integers. However, since we are building a logistic regression model on top of this dataset, the final target column should be categorical.
Thus, we will use the as.factor function and convert these integer values into categorical data.
We will pass the heart$target column over here and store the result in heart$target as follows:
as.factor(heart$target)->heart$target
Now, we will build a logistic regression model and see the different probability values for the person to have heart disease on the basis of different age values:
glm(target ~ age, data = heart, family = "binomial") -> log_mod1
Here, 'target' is the dependent variable and 'age' is the independent variable, and we are building this model on top of the heart dataframe. family = "binomial" tells R that this is a logistic regression model.
We will have a glance at the summary of the model that we have just built:
summary(log_mod1)
We can see the Pr value here, and there are three stars associated with this Pr value. This basically means that we can reject the null hypothesis, which states that there is no relationship between the age and the target columns. Since we have three stars over here, the null hypothesis can be rejected: there is a strong relationship between the age and the target columns.
Now, we have other parameters like null deviance and residual deviance. The lower the deviance, the better the model. The null deviance is the deviance when we do not include any independent variable and try to predict the value of the target column with only the intercept. When that's the case, the null deviance is 417.64.
Residual deviance is the deviance when we include the independent variables and try to predict the target column. Hence, when we include the independent variable, which is age, we see that the residual deviance drops below the null deviance of 417.64.
This basically means that there is a strong relationship between the age column and the target column.
Now, we will divide this dataset into train and test sets and build a model on top of the
train set and predict the values on top of the test set:
library(caret)
# split ratio below is an assumption; the original value was not shown
createDataPartition(heart$target, p = 0.7, list = FALSE) -> split_tag
heart[split_tag,] -> train
heart[-split_tag,] -> test
glm(target ~ age, data = train, family = "binomial") -> log_mod2
predict(log_mod2, newdata = test, type = "response") -> pred_heart
range(pred_heart)
# ROC curve built with the ROCR package (the prediction/performance calls are reconstructed)
library(ROCR)
prediction(pred_heart, test$target) -> roc_pred
performance(roc_pred, "tpr", "fpr") -> roc_curve
plot(roc_curve, colorize = T)
Graph:
64. Build a confusion matrix for the model where the probability threshold for the predicted values is 0.6
table(test$target, pred_heart > 0.6)
Here, we are setting the probability threshold as 0.6. So, wherever the probability in pred_heart is greater than 0.6, the record is classified as 1 (having heart disease), and wherever it is less than 0.6, it is classified as 0.
First, we will load the pandas dataframe and the customer_churn.csv file:
customer_churn = pd.read_csv("customer_churn.csv")
After loading this dataset, we can have a glance at the head of the dataset by using the
following command:
customer_churn.head()
Now, we will separate the dependent and the independent variables into two separate
objects:
x = pd.DataFrame(customer_churn['MonthlyCharges'])
y = customer_churn['Churn']
Now, we will see how to build the model and calculate log_loss:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
# split ratio below is an assumption; the original value was not shown
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
l = LogisticRegression()
l.fit(x_train, y_train)
y_pred = l.predict_proba(x_test)
log_loss(y_test, y_pred)
Output:
0.5555020595194167
66. Build a decision tree model on top of the 'iris' dataset and plot the model that is built.
To build a decision tree model, we will be loading the party package:
#party package
library(party)
library(caret)
# split ratio below is an assumption; the original value was not shown
createDataPartition(iris$Species, p = 0.7, list = FALSE) -> split_tag
iris[split_tag,] -> train
iris[-split_tag,] -> test
#building model
mytree<-ctree(Species~.,train)
plot(mytree)
Model:
predict(mytree,test,type=’response’)->mypred
After this, we will build the confusion matrix and then calculate the accuracy using the table function:
table(test$Species, mypred)
67. Build a random forest model on top of this 'CTG' dataset, where 'NSP' is the target column
str(data)
table(data$NSP)
#data partition
set.seed(123)
createDataPartition(data$NSP, p = 0.7, list = FALSE) -> split_tag  # split ratio is an assumption
data[split_tag,] -> train
data[-split_tag,] -> test
#random forest -1
library(randomForest)
set.seed(222)
rf<-randomForest(NSP~.,data=train)
rf
#prediction
predict(rf,test)->p1
table(test$NSP,p1)
The formula for calculating the Euclidean distance between two points (x1, y1) and (x2, y2) is as follows:
distance = sqrt((x2 - x1)^2 + (y2 - y1)^2)
To calculate the error of a regression model:
1. Calculate the errors, i.e., the differences between the actual and the predicted values
2. Square the errors and take their mean; the square root of this mean gives the RMSE, one of the most widely used error metrics in Machine Learning.
The following kernel functions can be used in SVM:
● Linear kernel
● Polynomial kernel
● Sigmoid kernel
Time series data is considered stationary when its mean and variance are constant over time.
If the variance or mean does not change over a period of time in the dataset, then we
can draw the conclusion that, for that period, the data is stationary.
We can use the code given below to calculate the accuracy of a binary classification
algorithm:
def accuracy_score(matrix):
    true_positives = matrix[0][0]
    true_negatives = matrix[1][1]
    total_observations = sum(matrix[0]) + sum(matrix[1])
    return (true_positives + true_negatives) / total_observations
Root cause analysis is the process of figuring out the root causes that lead to certain faults or failures. A factor is considered to be a root cause if, after eliminating it, the fault no longer occurs and the process or system works correctly. Root cause analysis is a technique that was initially developed and used in the analysis of industrial accidents, but it is now used across a wide variety of domains.
A/B testing is a kind of statistical hypothesis testing for randomized experiments with
two variables. These variables are represented as A and B. A/B testing is used when
we wish to test a new feature in a product. In the A/B test, we give users two variants of the product.
The A variant can be the product with the new feature added, and the B variant can be the product without the new feature. After users use these two products, we capture their ratings (or another success metric) for the two variants.
If the rating of product variant A is statistically and significantly higher, then the new feature is considered an improvement and useful and is accepted. Otherwise, the new feature is removed from the product.
75. Out of collaborative filtering and content-based filtering, which one is better for generating recommendations, and why?
Content-based filtering is generally considered more reliable for generating recommendations. It does not mean that collaborative filtering generates bad recommendations.
However, as collaborative filtering is based on the likes and dislikes of other users we
cannot rely on it much. Also, users’ likes and dislikes may change in the future.
For example, there may be a movie that a user likes right now but did not like 10 years
ago. Moreover, users who are similar in some features may not have the same taste in every kind of content.
In the case of content-based filtering, we make use of users’ own likes and dislikes that
are much more reliable and yield more positive results. This is why platforms such as
Netflix, Amazon Prime, Spotify, etc. make use of content-based filtering for generating recommendations for their users.
The confusion matrix below can be used to calculate precision and recall:
              Predicted: P    Predicted: N
Actual: P          156              11
Actual: N           16             327
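As a minimal sketch (not part of the original answer), and assuming the rows above are the actual classes and the columns are the predicted classes, precision and recall can be computed as:
# counts read off the confusion matrix above
tp, fn = 156, 11   # actual P row
fp, tn = 16, 327   # actual N row

precision = tp / (tp + fp)   # 156 / 172
recall = tp / (tp + fn)      # 156 / 167

print(f"Precision: {precision:.3f}")  # ~0.907
print(f"Recall: {recall:.3f}")        # ~0.934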
Reinforcement learning is the area of machine learning concerned with building software agents that perform actions to attain the maximum cumulative reward. A reward here is used for letting the model know (during training) if a particular action leads to the attainment of, or brings it closer to, the goal. For example, if we are creating an ML model that plays a video game, the reward is going to be either the points collected during the game or the level reached in it.
Reinforcement learning is used to build these kinds of agents that can make real-world
decisions that should move the model toward the attainment of a clearly defined goal.
TF/IDF (term frequency-inverse document frequency) is a numerical measure of how important a word is to a document in a collection of documents called a corpus. TF/IDF is used often in text mining and information retrieval.
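A minimal sketch of computing TF/IDF weights with scikit-learn's TfidfVectorizer on a made-up corpus, assuming a recent scikit-learn version:
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical documents)
corpus = [
    "data science is fun",
    "machine learning is a part of data science",
    "text mining uses tf idf",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# Each row is a document, each column a term, each value a TF-IDF weight
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))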
What are the assumptions required for linear regression?
There are several assumptions required for linear regression. They are as follows:
● The data, which is a sample drawn from a population, used to train the model is representative of the population.
● The relationship between the independent variables and the mean of the dependent variable is linear.
● The variance of the residuals is going to be the same for any value of an independent variable (homoscedasticity).
● The residuals of the model are normally distributed.
These assumptions may be violated lightly (i.e., some minor violations) or strongly (i.e., the majority of the data has violations). Both kinds of violation will have different effects on the results of the model.
Strong violations of these assumptions make the results entirely redundant. Light
violations of these assumptions make the results have greater bias or variance.
3. Define fsck.
It is an abbreviation for "file system check." This command can be used for searching for possible errors in the file system.
● PyTorch
● Microsoft Cognitive Toolkit
● TensorFlow
● Caffe
● Chainer
● Keras
6. What is cross-validation?
Cross-validation is a statistical technique used to estimate how well a model generalizes to unknown data. The data is split into several folds, and the model is repeatedly trained on some folds and evaluated on the held-out fold; the results are then averaged.
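For illustration, a minimal sketch of 5-fold cross-validation with scikit-learn, using the built-in iris dataset:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: each fold is used once for validation while the model
# trains on the remaining four folds
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())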
Following are examples of probability and statistics problems that are frequently asked
at FAANG+ companies:
1. The Monty Hall problem
You are on a game show with three doors. Behind one door is a car; behind the other two are goats. You pick a door, and the host, who knows what is behind each door, opens one of the remaining doors to reveal a goat and offers you the chance to switch. Should you:
1. Switch
2. Won't switch
3. Can't conclude
Solution:
Door 1    Door 2    Door 3    If you switch the door    If the door is not switched
Goat      Car       Goat      Win                       Lose
Goat      Goat      Car       Win                       Lose
Car       Goat      Goat      Lose                      Win
If you switch the door, you are more likely to win (i.e., with a 2/3 probability)
2. The “fair coin” problem
A coin was flipped 1000 times, and there were 560 heads. For this scenario, develop
the hypothesis to test whether the coin is fair or not.
Solution:
Let’s assume that the probability of a head in the coin toss is p. We need to test if p is
0.5 or not.
Using the Central Limit Theorem, we can approximate the total number of heads as
normally distributed (since 1000 is a large sample size).
Now, the number of heads x (=560) in the n (=1000) trials follows a Binomial(n, p) distribution, which we approximate by a normal distribution with mean n*p and standard deviation sqrt(n*p*(1-p)).
So, the expected number of heads if the null hypothesis is true (i.e., p = 0.5) = n*p = 1000*0.5 = 500
Similarly, the standard deviation = sqrt(1000*0.5*0.5) ≈ 15.81, so the z-score = (560 - 500)/15.81 ≈ 3.79.
99.73% of the normal distribution lies within 3 standard deviations of the mean.
And the z-score is showing that the number is around 3.79 standard deviation away
from the mean. Hence, we can say that there is a less than 1% chance that the coin is
unbiased, and we reject the null hypothesis. Hence, the coin is biased.
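A minimal sketch of the same z-test calculation in Python:
import math

n, heads, p0 = 1000, 560, 0.5

# Normal approximation to Binomial(n, p0) under the null hypothesis
expected = n * p0                          # 500
std_dev = math.sqrt(n * p0 * (1 - p0))     # ~15.81

z = (heads - expected) / std_dev
print(f"z-score: {z:.2f}")  # ~3.79, far outside +/- 3 standard deviations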
Solution:
There is no assumption about where (specific floor) and when (together or separately)
people get on the elevator.
Solution:
A quick way to find the probability that the product of X ~ Uniform(0, 1) and Y ~ Uniform(0, 1) is less than 0.5 is to visualize a two-dimensional plane. All the points (x, y) within the square [0, 1] x [0, 1] fall in the candidate space.
The case xy = 0.5 traces the curve y = 0.5/x, and the area under that curve (capped at y = 1) represents the cases for which xy <= 0.5. Since the area of the square is 1, that area is the sought probability:
P(XY < 0.5) = 0.5 + integral from x = 0.5 to 1 of (0.5/x) dx = 0.5 + 0.5*ln(2) ≈ 0.85
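A quick Monte Carlo check (a sketch, not part of the original solution) confirms the same value:
import random

trials = 1_000_000
hits = sum(1 for _ in range(trials)
           if random.random() * random.random() < 0.5)

print(hits / trials)  # ~0.85, matching 0.5 + 0.5*ln(2)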
Solution:
This is an open-ended question based on A/B Testing. It is a vanilla version of the type.
The decision of which program to invest in depends on the A/B test results we get from
the available options. Please pay close attention to the final goal (improved conversion
at checkout), as this also determines the metrics of interest. To answer such questions, one usually proceeds in a structured order: stating the hypothesis and the success metric, designing and running the experiment, and then analyzing the results.
Linear regression is sensitive to outliers. Since linear regression minimizes the sum of squared errors across all observations, when an outlier is present, the fit shifts to accommodate it; this is what makes the linear regression fit sensitive to outliers.
To deal with outliers, one needs to identify whether the outlier is a valid datapoint or not.
If it is due to data collection issues, simply remove the invalid outlier datapoint. If the
datapoint is valid, try to understand how common the valid datapoint is. Data
transformation and fitting a separate model for the outliers might need to be done for
that case.
In other words, the T-test will determine whether the jth feature has a statistically
significant non-zero coefficient in the model. Generally, a non-zero coefficient feature is
considered to be important for the model.
Alternatively, Lasso Regression can be used to identify significant features. The ones
with coefficients not sent to zero by the Lasso Regression are considered to be
important.
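As an illustrative sketch (not part of the original answer), both approaches can be tried on synthetic data, assuming statsmodels and scikit-learn are available:
import numpy as np
import statsmodels.api as sm
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic data (hypothetical): only two of the five features are informative
X, y = make_regression(n_samples=200, n_features=5, n_informative=2,
                       noise=10.0, random_state=0)

# T-tests on the coefficients via OLS: a small p-value suggests a
# statistically significant non-zero coefficient
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.pvalues)

# Lasso: features whose coefficients are not driven to zero are kept
lasso = LassoCV(cv=5).fit(X, y)
print(np.round(lasso.coef_, 3))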
1. Create a Python code that can recognize whether entries to a list have common
characters or not.
2. Suppose you have an array of integers. You have been asked to find a certain
element. What is the algorithm you would use, and what is its efficacy?
3. In the case of a long-sorted and short-sorted list, what algorithm would you use
to search the long list for the 4 elements?
4. Tell us about an instance where you applied machine learning to resolve
ambiguous business problems.
5. If you have categorical variables and there are thousands of distinct values, how
will you encode them?
6. Define LSTM. How have you used it?
7. Enumerate the difference between bagging and boosting.
8. How does 1D CNN work?
9. Differentiate between linear regression and a t-test?
10. How will you locate the customer who has the highest total order cost between
2020-02-02 to 2020-05-06? You can assume that every first name in the dataset
is unique.
11. Take us through the steps of the cold-start problem in a recommender system?
12. Discuss the steps of building a forecasting model.
13. How will you create an AB test for a marketing campaign?
14. What are Markov chains?
15. What is root cause analysis?
1. If you need to manage a chat thread, which tables and indices do you need in a
SQL DB?
2. How do you propose to measure the effectiveness of the operations team?
3. Explain p-value to a business head.
4. Explain the differences between independent and dependent variables.
5. What is the goal of A/B Testing?
6. Define Prior probability and likelihood?
7. Explain the key differences between supervised and unsupervised learning.
8. What is the difference between “long” and “wide” format data?
9. Explain the utility of a training set.
10. What is Logistic Regression?
1. Describe a time when you used data for presenting data-driven statistics.
2. Do you think vacations are important? How often do you think one should take a
vacation?
3. Did you ever have two deadlines that you had to meet simultaneously? How did
you manage that?
4. Describe a time when you had a disagreement with a senior over a project. How
did you handle it?
5. How will you handle the situation if you have an insubordinate team member?
6. Why do you want to work as a data scientist with this company?
7. Which is your favorite leadership principle?
8. How do you ensure high productivity levels at work?
9. Have you ever had to explain a technical concept to a non-technical person?
Was it difficult to do so?
10. How do you prioritize your work?
That concludes the comprehensive list of data scientist interview questions. Make sure
you practice these frequently asked questions to prepare yourself for the interview.
Data science interview questions are usually based on statistics, coding, probability,
quantitative aptitude, and data science fundamentals.
Yes. In addition to core data science questions, you can also expect easy to medium
Leetcode problems or Python-based data manipulation problems. Your knowledge of
SQL will also be tested through coding questions.
Yes. Behavioral questions help hiring managers understand if you are a good fit for the
role and company culture. You can expect a few behavioral questions during the data
scientist interview.
Some domain-specific topics that you must prepare include SQL, probability and
statistics, distributions, hypothesis testing, p-value, statistical significance, A/B testing,
causal impact and inference, and metrics. These will prepare you for data scientist
interview questions.
Based on our research, you can work as a data scientist even though you only have a
bachelor’s degree. You can always upgrade your skills via a data science boot camp.
But for better career prospects, having an advanced degree may be useful.
100+ Data Science Interview Questions and Answers for 2023
Machine Learning comprises two words, machine and learning, which hint towards its definition: a subdomain of computer science that deals with the application of mathematical algorithms to identify the trend or pattern in a dataset.
The simplest example is the usage of linear regression (y=mt+c) to predict the output of
a variable y as a function of time. The machine learning model learns the trends in the
dataset by fitting the equation on the dataset and evaluating the best set of values for m
and c. One can then use these equations to predict future values.
Python is likely to be everyone's choice for text analysis as it has libraries like Natural Language Toolkit (NLTK), Gensim, CoreNLP, SpaCy, and TextBlob that are useful for text analysis.
Understanding consumer behavior is often the primary goal of many businesses. For
example, consider the case of Amazon. If a user searches for a product category on its
website, the major challenge for Amazon’s backend algorithms is to come up with
suggestions that are likely to motivate the users to make a purchase. And such
algorithms are the heart of recommendation systems or recommender systems. These
systems aim at analyzing customer behavior and evaluating their fondness for different
products. Apart from Amazon, recommender systems are also used by Netflix, Youtube,
Flipkart, etc.
It is cumbersome to clean data
from multiple sources to transform it into a format that data analysts or scientists can
work with. As the number of data sources increases, the time it takes to clean the data
increases exponentially due to the number of sources and the volume of data generated
in these sources. It might take up to 80% of the time for cleaning data, thus making it a
critical part of the analysis task.
Eigenvectors are used for understanding linear transformations. They are the directions
along which a particular linear transformation acts by flipping, compressing, or
stretching. Eigenvalues can be referred to as the strength of the transformation in the
direction of the eigenvector or the factor by which the compression occurs. We usually
calculate the eigenvectors for a correlation or covariance matrix in data analysis.
Gradient descent is an iterative procedure that minimizes the cost function parametrized
by model parameters. It is an optimization method based on convex functions that iteratively adjusts the parameters to help the given function attain its local minimum. The gradient measures the change in error with respect to a change in the parameters. Imagine a blindfolded person on top of a hill who wants to reach a lower altitude. The simple
technique he can use is to feel the ground in every direction and take a step in the
direction where the ground is descending faster. Here we need the help of the learning
rate which says the size of the step we take to reach the minimum. The learning rate
should be chosen so that it should not be too high or too low. When the selected
learning rate is too high, it tends to bounce back and forth between the convex function
of the gradient descent, and when it is too low, we will reach the minimum very slowly.
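As an illustration (not from the original text), a minimal gradient descent sketch minimizing the one-dimensional cost function f(x) = (x - 3)^2:
def gradient_descent(start, learning_rate=0.1, steps=100):
    x = start
    for _ in range(steps):
        grad = 2 * (x - 3)            # gradient of the cost function
        x = x - learning_rate * grad  # step in the direction of steepest descent
    return x

print(gradient_descent(start=10.0))  # converges close to the minimum at x = 3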
What are the steps involved in a data science or machine learning project?
● Understand the business problem and convert it into a data analytics problem.
● Use exploratory data analysis techniques to understand the given dataset.
● With the help of feature selection and feature engineering methods, prepare the
training and testing dataset.
● Explore machine learning/deep learning algorithms and use one to build a
training model.
● Feed training dataset to the model and improve the model’s performance by
analyzing various statistical parameters.
● Test the performance of the model using the testing dataset.
● Deploy the model, if needed, and monitor the model performance.
Feature selection methods are the methods that are used to obtain a subset of variables from the dataset that are required to build a model that best fits the trends in the dataset.
Feature engineering methods are the methods that are used to create new features from the given dataset using the existing variables. These methods allow the model to better fit complicated trends in the dataset.
12. What do you know about MLOps tools? Have you ever
MLOps tools are the tools that are used to produce and monitor the enterprise-grade
deployment of machine learning models. Examples of such tools are MLflow,
Pachyderm, Kubeflow, etc.
In case you haven’t worked on an MLOps project, try this MLOps project by Goku
Mohandas on Github or this MLOps Project on GCP using Kubeflow for Model
Deployment by ProjectPro.
Logistic regression is one of the most popular machine learning models used for solving
a binary classification problem, that is, a problem where the output can take any one of
the two possible values. Its equation is given by
Y = 1 / (1 + e^(-(a + bX)))
Where X represents the feature variable, a,b are the coefficients, and Y is the target
variable. Usually, if the value of Y is greater than some threshold value, the input
variable is labeled with class A. Otherwise, it is labeled with class B.
To find the optimal value for k, one can use the elbow method or the silhouette method.
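A minimal sketch (illustrative only, on synthetic data) of the elbow method with scikit-learn's KMeans:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit k-means for several values of k and record the inertia (within-cluster
# sum of squares); the "elbow" of the curve suggests a reasonable k
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 10)]

plt.plot(range(1, 10), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()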
What is the difference between wide and long data formats?
In wide data format, you will find a column for each variable in the dataset. On the other
hand, in a long format, the dataset has a column for specific variable types & a column
for the values of those variables.
For example,
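a minimal pandas sketch (with made-up data, not taken from the original article) shows the same records in both formats:
import pandas as pd

# Wide format: one column per measured variable
wide = pd.DataFrame({
    "name": ["Asha", "Ben"],
    "height": [165, 180],
    "weight": [60, 82],
})

# Long format: one column identifies the variable, another holds its value
long = wide.melt(id_vars="name", var_name="variable", value_name="value")
print(long)

# Converting back from long to wide
print(long.pivot(index="name", columns="variable", values="value"))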
Feature vectors are the set of variables containing values describing each observation’s
characteristics in a dataset. These vectors serve as input vectors to a machine learning
model.
Dropout is a regularisation method used for deep neural networks to train different
neural networks architectures on a given dataset. When the neural network is trained on
a dataset, a few layers of the architecture are randomly dropped out of the network.
This method introduces noise in the network by compelling nodes within a layer to probabilistically take on more or less responsibility for the input values. Thus, dropout makes the neural network model more robust by preventing units in one layer from co-adapting too strongly to the outputs of prior layers.
The dropout regularisation method mostly proves beneficial for cases where the dataset
is small, and a deep neural network is likely to overfit during training. The computational
factor has to be considered for large datasets, which may outweigh the benefit of
dropout regularisation.
In short, the dropout regularisation method randomly removes units (and their connections) from the neural network during training, which reduces overfitting.
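As a sketch (assuming TensorFlow/Keras is installed; the layer sizes and dropout rate are illustrative), dropout is added between layers like this:
from tensorflow import keras
from tensorflow.keras import layers

# During training, 20% of each dropout layer's inputs are randomly zeroed out
model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()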
The logistic regression equation can be written as Y = 1 / (1 + e^(-(a + bX))), where X is the independent variable, a and b are the coefficients, and Y is the dependent variable that can take categorical values.
What is multicollinearity, and how can you overcome it?
One can overcome multicollinearity in their model by removing a few highly correlated
variables from the regression equation.
What is the bias-variance trade-off, and what is its significance?
The expected value of the test MSE (mean squared error), for a given value x0, can always be decomposed into the sum of three fundamental quantities: the variance of f̂(x0), the squared bias of f̂(x0), and the variance of the error term ε. That is,
E[(y0 − f̂(x0))^2] = Var(f̂(x0)) + [Bias(f̂(x0))]^2 + Var(ε)
Here, E[(y0 − f̂(x0))^2] defines the expected test MSE and refers to the average test MSE that one would obtain if we repeatedly estimated f using a large number of training sets and tested each at x0. Also, f̂(x0) refers to the output of the fitted ML model for a given input x0, and ε is the irreducible error, i.e., the deviation of the observed value y0 from the true value f(x0).
The equation above suggests that we need to select a statistical learning method that
simultaneously achieves low variance and low bias to minimize the expected test error.
A good statistical learning method's good test set performance requires low variance
and low squared bias. This is referred to as a trade-off because it is easy to obtain a
method with extremely low bias but high variance (for instance, by drawing a curve that
passes through every single training observation) or a method with a very low variance
but high bias (by fitting a horizontal line to the data). The challenge lies in finding a
method for which both the variance and the squared bias are low.
Interpolating the data means one is estimating the values in between two known values
of a variable from the dataset. On the other hand, extrapolating the data means one is
estimating the values that lie outside the range of a variable.
Do gradient descent methods always converge to the same point?
No, gradient descent methods do not always converge to the same point because they
converge to a local minimum or a local optima point in some cases. It depends a lot on
the data one is dealing with and the initial values of the learning parameter.
The two most common types of regularization are:
● L2 Regularization
● L1 Regularization
In L2 Regularization, the penalty term is the sum of squares of the magnitude of the
model coefficients while in L1 Regularization, it is the sum of absolute values of the
model coefficients.
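A minimal sketch (not part of the original answer) of both penalties with scikit-learn on synthetic data; note how the L1 penalty tends to drive some coefficients to exactly zero:
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)

# L2 regularization: penalty on the sum of squared coefficients
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 regularization: penalty on the sum of absolute coefficients
lasso = Lasso(alpha=1.0).fit(X, y)

print("Ridge coefficients:", ridge.coef_.round(2))
print("Lasso coefficients:", lasso.coef_.round(2))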
Explain the different variants of Gradient Descent.
Gradient descent is one of the most popular machine learning and deep learning
optimization algorithms used to update a learning model's parameters. There are 3
variants of gradient descent.
Batch Gradient Descent: Computation is carried out on the entire dataset in batch
gradient descent.
Stochastic Gradient Descent: Computation is carried over only one training sample in
stochastic gradient descent.
Mini Batch Gradient Descent: A small number/batch of training samples is used for
computation in mini-batch gradient descent.
For example, if a dataset has 1000 data points, then batch GD, will train on all the 1000
data points, Stochastic GD will train on only a single sample and the mini-batch GD will
consider a batch size of say100 data points and update the parameters.
transformation?
In statistics, the p-value is used to test the significance of a null hypothesis. A p-value lower than 0.05 suggests that there is less than a 5% chance of observing results at least this extreme if the null hypothesis is true, so the null hypothesis is rejected. On the other hand, a higher p-value, say 0.8, suggests that the null hypothesis cannot be rejected, as such results would be quite likely to occur by chance under the null hypothesis.
A/B Testing is a technique for understanding user experience. It involves serving a user
with two different product versions to analyze which version is likely to outperform the
other. The testing is also used to understand user preferences.
What is the difference between squared error and absolute error?
The squared error is the square of the difference between the value of a quantity x and its inferred value x'. It is represented as (x - x')^2.
The absolute error, as the name suggests, refers to the modulus of the difference between the value of a quantity x and its inferred value x'. It is represented as |x - x'|.
In data science, mean squared error is more popular for understanding the deviation of
the inferred values from the actual values as it gives relatively more weight to the highly
deviated points and gives a continuous derivative which is useful for analysis.
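For example, a small sketch (hypothetical numbers) computing both metrics with scikit-learn:
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

# MSE penalizes large deviations more heavily because the errors are squared
print("MSE:", mean_squared_error(actual, predicted))
print("MAE:", mean_absolute_error(actual, predicted))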
What is the difference between a skewed distribution and a uniform distribution?
A skewed distribution is a distribution where the values in the dataset are not
normalized and the distribution curve is inclined towards one side. A uniform distribution
on the other hand is a symmetric distribution where the probability of occurrence of each point is the same for a given range of values in the dataset.
                          Actual: Cancer    Actual: Not Cancer
Predicted: Cancer Patient        30                 12
Predicted: Not a Cancer Patient  10                 28
Assume that the confusion matrix mentioned above represents the results of the classification problem of cancer detection. It is easy to conclude the following:
False Positives, No. of patients that do not have cancer but the model predicted otherwise = 12
False Negatives, No. of patients that have cancer but the model predicted otherwise = 10
Recall = True Positives / (True Positives + False Negatives) = 30/40 = 0.75
The formula for recall clearly suggests that it estimates the ability of a model to correctly
identify true positives, that is, the patients who are infected with cancer. To understand it
better, take a careful look at the denominator which is nothing but the total number of
people possessing cancerous cells. Thus, a recall value of 0.75 suggests that the model
was able to correctly identify 75% of the patients that have cancer. On the other hand,
Precision = True Positives / (True Positives + False Positives) = 30/42 = 0.71
The formula for Precision suggests that it reflects how many times the model is
successful in deducing True positives wrt the false positives. Thus, the number 0.71
suggests that whenever the model predicts a patient has cancer, the chances of making
a correct prediction are 71%.
High dimensional data refers to data that has a large number of features. The
dimension of data is the number of features or attributes in the data. The problems
arising while working with high dimensional data are referred to as the curse of
dimensionality. It basically means that error increases as the number of features
increases in data. Theoretically, more information can be stored in high-dimensional
data, but practically, it does not help as it can have higher noise and redundancy. It is
hard to design algorithms for high-dimensional data. Also, the running time increases
exponentially with the dimension of data.
The r-squared value compares the variation of a fitted curve around a set of data points with the variation of those points around the line that passes through their average value. It can be understood with the help of the formula
R² = 1 − (variation around the fitted model) / (variation around the mean)
It is obvious that the model is likely to fit better than the average line. So, the variation
for the model is likely to be less than the variation for the line. Thus, if the r-square has
a value of 0.92, it suggests that the model fits the data points better than the line as
there is 92% less variation. It also shows that there is a strong correlation between the
feature and target value. However, if the r-squared value is less, it suggests that the
correlation is weak and the two variables are quite independent of each other.
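A small sketch (hypothetical numbers) of computing r-squared with scikit-learn:
from sklearn.metrics import r2_score

actual = [3.0, 5.0, 2.5, 7.0]
predicted = [2.8, 4.9, 2.9, 6.6]

# r2_score compares the model's errors with the errors of simply
# predicting the mean of the actual values
print(r2_score(actual, predicted))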
What is a hypothesis in the context of Machine Learning?
In machine learning, a hypothesis represents a mathematical function that an algorithm
uses to represent the relationship between the target variable and features.
By sticking to a small learning rate, scaled target variables, and a standard loss function, one can carefully configure the network of a model and avoid exploding gradients.
Another approach for tackling exploding gradients is using gradient scaling or gradient
clipping to change the error before it is propagated back through the network. This
change in error allows rescaling of weights.
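As an illustrative sketch (assuming TensorFlow/Keras is installed), gradient clipping can be configured directly on a Keras optimizer:
from tensorflow import keras

# clipnorm rescales the gradient vector if its L2 norm exceeds 1.0;
# clipvalue caps each gradient element at +/- 0.5
opt_norm = keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
opt_value = keras.optimizers.SGD(learning_rate=0.01, clipvalue=0.5)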
Naïve Bayes is a machine learning algorithm based on the Bayes Theorem. This is
used for solving classification problems. It is based on two assumptions, first, each
feature/attribute present in the dataset is independent of another, and second, each
feature carries equal importance. But this assumption of Naïve Bayes turns out to be
disadvantageous. As it assumes that the features are independent of each other, but in
real-life scenarios, this assumption cannot be true as there is always some dependence
present in the given set of features. Another disadvantage of this algorithm is the
‘zero-frequency problem’ where the model assigns value zero for those features in the
test dataset that were not present in the training dataset.
How will you develop a model that identifies plagiarism?
A common approach for developing a model that identifies plagiarism is to represent the documents as numerical vectors and then measure how similar those vectors are.
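For instance, a minimal sketch (the documents and the TF-IDF plus cosine similarity choice are illustrative assumptions):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two hypothetical documents to compare
doc_a = "Data science combines statistics and programming."
doc_b = "Statistics and programming are combined in data science."

tfidf = TfidfVectorizer().fit_transform([doc_a, doc_b])
similarity = cosine_similarity(tfidf[0], tfidf[1])[0][0]

# A similarity close to 1 suggests the texts may be plagiarised
print(f"Cosine similarity: {similarity:.2f}")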
The most important consequence of the central limit theorem is that it reveals how
nature likes to obey the normal distribution curve. It allows experts from various fields
like statistics, physics, mathematics, computer sciences, etc. to assume that the data
they are looking at obeys the famous bell curve.
The formula for evaluating the Euclidean distance in three dimensions between two points defined by coordinates (x1, y1, z1) and (x2, y2, z2) is simply given by
d = sqrt((x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2)
It simply represents the length of a line that connects the two points in a
three-dimensional space.
code?
var2 <- c("I", "Love", "ProjectPro")
var2
def frstuniquechar(strng):
    # Lowercase
    strng = strng.lower()
    # Here is a dictionary that will contain each unique letter and its counts
    c = {}
    for letter in strng:
        # If we can't find the letter in the dictionary, add it and set the count to 1
        if letter not in c:
            c[letter] = 1
        # Otherwise, add 1 to the count
        else:
            c[letter] += 1
    # Return the index of the first letter that appears exactly once
    for i in range(len(strng)):
        if c[strng[i]] == 1:
            return i
    return -1

# Test cases
s = "ProjectPro"  # hypothetical test string; the original was not shown
print(f"Index: {frstuniquechar(strng=s)}")
Write a program that calculates the factorial of a number using recursion.
def fact(num):
    # Extreme cases
    if num < 0: return -1
    if num == 0: return 1
    if num == 1:
        return num
    else:
        # Recursion used
        return num * fact(num - 1)

# Test cases
num = 5  # hypothetical input; the original was not shown
print(f"{num}! = {fact(num=num)}")
Another way is to train and test data sets by sampling them multiple times. Predict on all
those datasets to determine whether the resultant models are similar and are
performing well.
By looking at the p-value, the r-squared values, and the fit of the function, and by analyzing how the treatment of missing values could have affected the results, data scientists can assess whether an analysis will produce meaningless results.
So, there you have over 120 data science interview questions, with answers for most of them. These are some of the more common interview questions for data scientists around data, statistics, and data science that can be asked in interviews. We will come up with more questions, specific to languages such as Python and R, in subsequent articles, and fulfill our goal of providing 120 data science interview questions with answers to our readers.
• Learn the language of business as the insights from a data scientist help in
reshaping the entire organization.
The most important tip to nail a data science interview is to be confident with your answers without bluffing. If you are well versed in a particular technology, whether it is Python, R, Hadoop, Spark, or any other big data technology, ensure that you can back this up; if you are not strong in a particular area, do not mention it unless asked about it. The above list of data scientist job interview questions is not an exhaustive one. Every company has a different approach to interviewing data scientists. However, we do hope that the above data science technical interview questions elucidate the data science interview process and provide an understanding of the type of data scientist job interview questions asked when companies are hiring data people.