Book 2.0 - Python


Prerequisite: Basic Python

Please refer to the Python documentation pages any time you are not sure about a function.
StackOverflow and GeeksforGeeks are your best friends!

Copyright © 2021 by Kanika Tayal and Anubhav Dubey


Kanika Tayal and Anubhav Dubey assert the moral right to be identified
as the authors of this book.

Visit www.campussecorporate.com/index for Data Science Interview Questions, Corporate Experiences


Dedicated To
Every covid warrior fighting relentlessly on the frontlines, in research, and in administration against the greatest crisis of our lives.
Those who have lost a loved one.
The ones who did their best to save lives by arranging for oxygen,
beds, food.
And
Our families without whom we are nothing
Our teachers who have given us the light to walk our paths
Our friends whose questions became the motivation for this book



Preface 2.0
We hope you are doing well.

Why the second edition? Over the past year, while using the book to revise certain concepts for placement interviews, we identified a few gaps. While the material made sense to someone from our batch, it could cause difficulties for those just starting out. Keeping this in mind, we identified and addressed the pain points, making the book more coherent and comprehensive.

Unlike the previous version, which was hosted via Google Drive, we are using Gumroad. This ensures that readers will have access to any future updates without any re-registration.

For any queries or suggestions, feel free to reach out to us via mail or LinkedIn.

Stay safe!

Kanika Tayal
Anubhav Dubey



First Edition Preface
“Ours is not a better way, it is merely another way.”
Neale Donald Walsch

When we started working on our first project, we were lost and lacked direction. We reached out to our seniors and searched the web for ways to do a project, both of which helped us immensely in getting started. However, we realized that the transition from mostly theoretical knowledge to research and application can be a daunting one if there has been no prior attempt.

Since epistemology is best left for the branch of philosophy, we, for now, deal only
with the application part.

This book is an attempt to provide a basic framework towards approaching the first
project in Data Analysis. The pages that follow are neither theoretically deep nor
exhaustive in methods; however, they are intended to bridge the theory and
applications, especially the statistical techniques.

Data Science is an ever-growing and ever-evolving field. Hence, any journey into
this wonderful realm can never fully cover all its aspects. However, it is important
to begin because like all journeys, this will be exhilarating, to say the least, and we
believe that this book will be a good starting point.

To great beginnings
Kanika Tayal
Anubhav Dubey



Format
The book is majorly divided into two sections: Warm-up and Statistical
Analysis of Data.

The warm-up section covers the basic concepts that are required for understanding the analysis. These are short definitions intended for revision. If you are not familiar with these concepts, we recommend reading up on them before proceeding further.

From our experience, these basic concepts are very important from
an interview perspective. You can know lots of advanced stuff but
command over basics is a must.

The Statistical Analysis section aims to explain the steps of Exploratory Data Analysis along with building a Linear Regression Model. All the steps have been illustrated with the help of the Carseats dataset (you can download it using this link). At each step, an attempt has been made to describe the inferences from the output.

Follow along to start your journey of doing an awesome project!



TABLE OF CONTENTS
Warm-Up 9
Overview of Linear Regression 18
Summary of the Statistical Analysis 21
1. Dataset 22
1.1. Loading the dataset 23
2. Univariate Analysis 26
2.1. Distribution of variables 29
2.2. Inference about the mean 38
2.3. Impact of Categorical Variables 43
3. Bivariate Analysis 54
3.1. Correlation Analysis 54
3.2. Bivariate Linear Regression Analysis 63
3.3. Regression Diagnostics 69
3.4. RMSE 79
3.5. Prediction of the response variable 82
3.6. Confidence and Prediction Interval 85
4. Multivariate Correlation and Regression 89
4.1. Multivariate Correlation Analysis 90
4.2 Multiple Regression Analysis 93
4.3 Comparing Models using adjusted R², AIC, and ANOVA 99
4.4. Regression Diagnostics 106
4.5. Stepwise Regression 111
4.5.1 Regression Diagnostics of Best Model 120
4.6 Multicollinearity 124
4.6.1 Final Model and Regression Diagnostics 127
4.7. Parallel Slopes Model 130
4.7.1. Diagnostics of Parallel Slopes Model 135
4.8. Interactions 140
Conclusion 143



A meme for you!

Source: Link



Warm-Up
1. Measures of Central Tendency

A measure of central tendency is a single value that attempts to describe a dataset by identifying the central position within that data. The three most common measures of central tendency are the mean, median, and mode.

a) Arithmetic Mean – The sum of all observations divided by the number of observations.
b) Median – The value that divides sorted numerical data into two equal halves.
c) Mode – The value that appears most often in the data.
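As a quick illustration, the three measures can be computed with Python's built-in statistics module (the sample list below is made up):

```python
import statistics

data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # sum / count = 30 / 6 = 5
median = statistics.median(data)  # average of the two middle values: (3 + 5) / 2 = 4.0
mode = statistics.mode(data)      # the most frequent value: 3
```

In pandas, the equivalents are the mean(), median(), and mode() methods on a Series.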

2. Partition Values

These are the values that divide the data into a number of equal parts.
Commonly used partition values are Quartiles, Deciles, and Percentiles.

a) Quartiles – Three points that divide the data into 4 equal parts.
b) Deciles – Nine points that divide the data into 10 equal parts.
c) Percentiles – Ninety-nine points that divide the data into 100 equal
parts.
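Any partition value can be computed with NumPy's percentile function; quartiles are just the 25th, 50th, and 75th percentiles (the data below is made up):

```python
import numpy as np

data = np.arange(1, 101)  # the values 1..100

# Quartiles: three points that cut the data into 4 equal parts
q1, q2, q3 = np.percentile(data, [25, 50, 75])

# Deciles: nine points (the 10th, 20th, ..., 90th percentiles)
deciles = np.percentile(data, np.arange(10, 100, 10))

print(q1, q2, q3)  # 25.75 50.5 75.25 (with the default linear interpolation)
```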



3. Measures of Dispersion

A measure of dispersion is useful in gauging the extent of spread in the given data. Commonly used measures of dispersion are given below.

a) Range – The difference between the maximum and minimum values of the data.
b) Quartile Deviation – The difference between the third and first quartiles, divided by 2.
c) Standard Deviation – The positive square root of the arithmetic mean of the squared deviations of the given values from their arithmetic mean.
d) Variance – The square of the standard deviation.
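These measures can be sketched in a few lines of NumPy (the data is made up and chosen to give round numbers):

```python
import numpy as np

data = np.array([4, 8, 6, 5, 3, 9, 7])  # arithmetic mean = 6

range_ = data.max() - data.min()         # Range: 9 - 3 = 6

q1, q3 = np.percentile(data, [25, 75])
quartile_dev = (q3 - q1) / 2             # Quartile Deviation: (7.5 - 4.5) / 2 = 1.5

std = data.std()                         # population standard deviation = 2.0
var = data.var()                         # variance = std**2 = 4.0
```

Note that NumPy's std() and var() default to the population formula (ddof=0), whereas pandas' describe() reports the sample standard deviation (ddof=1).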

4. Skewness and Kurtosis

a) Skewness allows us to measure the symmetry, or the lack of it, of given data. If the density curve of the data is shifted to the left or to the right, it is said to be skewed. Data can be symmetric, positively skewed, or negatively skewed.

Symmetric distribution: Mean = Median = Mode
Positively (right) skewed distribution: Mean > Median > Mode
Negatively (left) skewed distribution: Mean < Median < Mode



b) Kurtosis gives us an insight into the “peakedness” or “flatness” of the
frequency curve of the given data. The data can be leptokurtic,
mesokurtic (normal) or platykurtic.
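Both measures are available in scipy.stats; a small sketch on simulated data (the variable names are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
symmetric = rng.normal(size=10_000)          # normal data: skew near 0, mesokurtic
right_skewed = rng.exponential(size=10_000)  # long right tail

print(stats.skew(symmetric))      # close to 0
print(stats.skew(right_skewed))   # clearly positive
print(stats.kurtosis(symmetric))  # excess kurtosis, close to 0 for normal data
```

By default, stats.kurtosis() reports excess kurtosis (fisher=True), so a mesokurtic (normal) shape scores around 0, a leptokurtic shape scores positive, and a platykurtic shape scores negative.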

5. Normal Distribution



A normal distribution is a continuous probability distribution. A random
variable X is said to follow a normal distribution if the pdf of the
distribution is given by
f(x) = (1 / (σ√(2π))) e^(−(x − µ)² / (2σ²))

where µ is the population mean and σ is the population standard deviation.

The following form is used to denote that X follows a normal distribution with mean µ and standard deviation σ: X ~ N(µ, σ²)

If X ~ N(0, 1), then X is said to follow a standard normal distribution.

A log-normal (or lognormal) distribution is a continuous
probability distribution of a random variable whose logarithm is
normally distributed. Thus, if the random variable X
is log-normally distributed, then Y = ln(X) has a normal distribution.
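The pdf written out above can be checked directly against scipy, and the log-normal relationship verified on simulated data (the values of µ, σ, and x below are arbitrary):

```python
import numpy as np
from scipy import stats

mu, sigma, x = 1.5, 2.0, 0.7

# The normal pdf written out term by term...
manual = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# ...matches scipy's implementation (loc = mean, scale = standard deviation)
print(np.isclose(manual, stats.norm.pdf(x, loc=mu, scale=sigma)))  # True

# Log-normal: if X is log-normally distributed, then ln(X) is normal
x_lognorm = np.random.default_rng(1).lognormal(mean=0.0, sigma=1.0, size=10_000)
y = np.log(x_lognorm)  # these values are draws from N(0, 1)
print(np.mean(y), np.std(y))  # close to 0 and 1
```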

6. Sample vs. Population

A population is an entire group that you want to draw conclusions about, whereas a sample is a smaller group (subset) within the population.
For larger and more dispersed populations, it is often difficult or impossible to collect data about every population unit. Hence, a sample is selected to make inferences about the population.



7. Chi-square Distribution

Degrees of freedom (d.f./ d.o.f.) – These are the number of values that
can independently vary in statistical analysis, that is, the number of
values left after accounting for every constraint involved in an analysis.

A random variable Y is said to follow a chi-square distribution with n d.o.f., denoted by χ²(n), if it can be written as the sum of squares of n independent standard normal variables.

i.e., if Z₁, ..., Zₖ are independent standard normal random variables, then the sum of their squares

Q = Z₁² + Z₂² + ... + Zₖ² ~ χ² distribution with k d.o.f.
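This construction is easy to check by simulation (a sketch; the choice of k and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
k = 5
z = rng.standard_normal(size=(100_000, k))  # k independent standard normals per row
q = (z ** 2).sum(axis=1)                    # each row: Z1^2 + ... + Zk^2

# A chi-square(k) variable has mean k and variance 2k
print(q.mean())  # close to 5
print(q.var())   # close to 10
```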

8. t-distribution (Studentised t-distribution)

The ratio of an independent standard normal variate to the square root of a chi-square variate divided by its degrees of freedom follows a t-distribution.
i.e., if Z has a standard normal distribution, V has a χ² distribution with k degrees of freedom, and Z and V are independent, then

t = Z / √(V/k) ~ t-distribution with k d.o.f.



9. F-distribution

The ratio of two independent chi-square variates, each divided by its degrees of freedom, follows an F-distribution.
i.e., if U and V are independent chi-square random variables with r1 and r2 degrees of freedom, respectively, then

F = (U/r1) / (V/r2) ~ F(r1, r2)

10. Errors in testing

a) A Type I error, also known as a false positive, occurs when we incorrectly reject a true null hypothesis. This means that we report that our findings are significant when in fact they have occurred by chance. Its probability is denoted by α.

b) A Type II error, also known as a false negative, occurs when we fail to reject a null hypothesis that is false. Here we conclude there is no significant effect when in fact there is one. Its probability is denoted by β.



11. Critical Value (Tabulated Value)

In hypothesis testing, a critical value is a point on the distribution of the test statistic that is compared to the test statistic to determine whether to reject the null hypothesis.



12. p-value

In statistics, the p-value is the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct. The p-value is the smallest level of significance at which the null hypothesis would be rejected. A smaller p-value means that there is stronger evidence in favor of the alternative hypothesis.



13. Do not Worry!



Overview of Linear Regression
Linear Regression: The technique of modeling a linear relationship
between a response (dependent) variable and one or more explanatory
(independent) variables is called linear regression.

The equation of a multiple linear regression is given by:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε

where Y is the dependent variable, the Xᵢ (i = 1 to k) are independent variables, and ε is the error term.

The model in matrix form is given by:


Y = Xβ + ε

where Y is the n×1 vector of values of the response variable, X is the n×(k+1) design matrix, β is the (k+1)×1 vector of model parameters, and ε is the n×1 vector of errors.

Assumptions:
There are four assumptions associated with a linear regression model.

1. Normality of residuals: The errors ε are independent and identically distributed N(0, σ²) variates.
2. Linearity: The relationship between X and Y is linear.



3. Homoscedasticity: The variance of the error is constant for any value of X.
4. Independence: The explanatory variables are independent of each other (no multicollinearity).

Estimation of model parameters:

The (k+1) parameters β0, β1, …, βk are estimated using the method of Ordinary Least Squares (OLS), that is, by minimizing the error sum of squares. The formula for estimating the parameters is:

β̂ = (X′X)⁻¹X′Y

t-test for significance of parameters:

Let β̂0, β̂1, …, β̂k be the estimated parameters. A hypothesis test for the significance of an independent variable is to be performed. Then,

The test hypothesis is:

Null Hypothesis, H0: βi = 0 vs.
Alternate Hypothesis, H1: βi ≠ 0

Test Statistic:

t = β̂i / se(β̂i) ~ t(n−(k+1))

where se(β̂i) is the standard error of the estimated parameter.



Test Criteria: If |t| > t(α/2), then reject H0 at level of significance α and conclude that the variable Xi is significant for the regression. Otherwise, do not reject H0.

Confidence Interval:

The 100(1−α)% C.I. for βi is given by β̂i ± t(α/2) se(β̂i)



Summary of the Statistical Analysis
In this book, you will perform data analysis on the Carseats dataset. (You can download the dataset using this link.) The statistical analysis begins with a univariate analysis of the variables, followed by a bivariate analysis to study the correlations, and ultimately aims to fit a multiple linear regression model for Sales. In the end, you will compare models using various techniques to find the most suitable model.

The steps followed in this book are mentioned below:


1. Load the libraries for analysis and read the dataset.
2. Univariate analysis – exploring the descriptive statistics of variables
using methods of summarization and visualization.
3. Bivariate analysis – to find associations between two variables and
measure the significance of such associations.
3.1 Simple Linear regression along with regression diagnostics
4. Multivariate Analysis – to find an association between one response
and more than one explanatory variable
4.1 Multiple Linear regression along with regression diagnostics
4.2 Comparing models using adjusted R², AIC, and ANOVA
4.3 Stepwise regression
4.4 Parallel slopes model
4.5 Using Interaction terms for modeling



1. Dataset
The Carseats dataset has been used. (Download link)

The dataset is simulated; it is described in the lines that follow.

It is in a tabular format with 400 observations on the following 11 variables:

● Sales - Unit sales (in thousands) at each location
● CompPrice - Price charged by a competitor at each location
● Income - Community income level (in thousands of dollars)
● Advertising - Local advertising budget for the company at each
location (in thousands of dollars)
● Population - Population size in the region (in thousands)
● Price - Price company charges for car seats at each site
● ShelveLoc - A factor with levels Bad, Good, and Medium
indicating the quality of the shelving location for the car seats at
each site
● Age - Average age of the local population
● Education - Education level at each location
● Urban - A factor with levels No and Yes to indicate whether the
store is in an urban or rural location
● US – A factor with levels No and Yes to indicate whether the store
is in the US or not



1.1. Loading the dataset
Before loading the dataset, you should import the libraries with aliases.
In Python, the import keyword allows you to use packages and modules.
Further, you can use the from keyword to import a particular function
from a library.
For convenience, the common packages have been imported before
starting the analysis.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns
import statsmodels.api as sm
import sklearn.metrics as sk

Task 1: Store the dataset in another variable and examine its structure

After downloading the dataset (download link), you can use the
read_csv() function of Pandas to read a CSV file. Make sure that the file
is in the same directory otherwise you will have to specify the complete
path.

prodata = pd.read_csv("dataset.csv")



You can view the first 5 rows of the DataFrame using the head()
function. To view a specific number of rows you can specify it in the
function parenthesis. For example, prodata.head(10) will display the first
10 rows of the prodata DataFrame.

prodata.head()

To view a concise summary of a DataFrame, the info() function is very handy. It displays all the columns along with their data types.

The output also displays the count of non-null observations in each column. This will help you find out whether any column has missing observations.

prodata.info()



Q1. Explain the structure of the dataset.

Ans. The Carseats dataset contains 400 observations on eight numeric and three object variables. These object variables are categorical in nature. Each column represents a variable and each row corresponds to an observation on all these variables.

Q2. What are the levels of the factor variables?

Ans. Further analysis is required to know more about the levels of the
factor variables.



2. Univariate Analysis
a. What – As the name suggests, Univariate (uni – one) analysis is
concerned with the analysis of a single variable.
b. Why – It is useful to describe the variable in question by
summarizing it and trying to find some patterns indicative of its
nature.
c. How – The steps of this analysis have been described in this
section.

Task 2: Summarise the data

The describe() function returns an eight-point descriptive summary for each numeric variable of a DataFrame: the count, mean, standard deviation, minimum, quartiles, and maximum. The function, by default, excludes the character columns.

You can use the include parameter to specifically display the summary of
a character (object) variable.

The descriptive statistics returned by the describe() function help in understanding the distribution of a continuous variable. The following cases are possible:
● If median > mean, then you may conclude that the data is negatively skewed.


● If mean > median, then you may conclude that the data is positively
skewed.
● If median and mean are very close, then you may conclude that the
data is symmetric.

prodata.describe()
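The mean-versus-median heuristic can be illustrated on simulated data (the column names below are made up, not from the Carseats dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "symmetric": rng.normal(loc=5, size=1000),
    "right_skewed": rng.exponential(size=1000),
})

desc = df.describe()
# For right-skewed data, the mean is pulled above the median (the 50% row)
print(desc.loc["mean", "right_skewed"] > desc.loc["50%", "right_skewed"])  # True
```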

Q3. Is there any evidence that the distributions of Sales and Price
are somewhat symmetric?

Ans. As seen above, the mean and median values for these two variables
are almost equivalent, indicating that their distributions are somewhat
symmetric.

Note: The median value is given by the 50% row.



To view a summary of the character variables, you can use the following
command.

prodata.describe(include='object')

Notice that the variable ShelveLoc has 3 levels. On the other hand, US
and Urban have 2 levels each. To view the categories of each of these
variables you can use the unique() function as shown below.

prodata['ShelveLoc'].unique()

This displays that ShelveLoc has 3 categories: ‘Bad’, ‘Good’, and ‘Medium’. Likewise, you can check that US and Urban both have ‘Yes’ and ‘No’ as categories.



2.1. Distribution of variables
Task 3: Visualise the distributions of the variables Sales, Price,
Advertising, and Income with a stem-and-leaf plot and
histogram.

A stem-and-leaf plot is a technique that helps in analyzing the distribution of quantitative data in a graphical format.

The idea is to make a special table where each data value is split into a
"stem" (the first digit or digits) and a "leaf" (usually the last digit). For
example, 15 is split into 1 (stem) and 5 (leaf).

To create stem and leaf plots in Python, you can use the stem_graphic()
function from the stemgraphic package. Additionally, the scale argument
can be used to alter the number of stems in a plot.

Note: You can install the stemgraphic package using

pip install stemgraphic

import stemgraphic
stemgraphic.stem_graphic(prodata['Sales'],
scale=1)



stemgraphic.stem_graphic(prodata['Price'],
scale=10)



stemgraphic.stem_graphic(prodata['Income'],
scale=5)

stemgraphic.stem_graphic(prodata['Advertising'],
scale=2)



You can observe that for Advertising, the results do not look neat. Thus,
histograms provide an edge over stem-and-leaf plots as they are more
interpretable.

Histograms provide a visual interpretation of numerical data by indicating the number of data points that lie within a range of values. These ranges of values are called classes or bins. The frequency of the data that falls in each class is depicted by the use of a bar. The higher the bar, the greater the frequency of data values in that bin.

The hist() function from Pandas allows you to plot the histogram for a
particular variable of a DataFrame. The aesthetics of the plot can be
edited by color, edgecolor, and grid parameters of the hist() function.

You can use the subplots(nrows, ncols) function to put multiple graphs in a single plot by setting some graphical parameters. This function is from the pyplot submodule of matplotlib, which is usually imported with the alias ‘plt’. It returns a tuple containing the figure and axes object(s).
For instance, nrows = 2 and ncols = 3 generate a matrix of 2 rows and 3
columns and can be used to plot 6 graphs on one window.
To control the size of the obtained figure, you can use the figsize
argument of the subplots function.

Histograms of Sales, Price, Advertising, and Income have been plotted in a 2×2 grid.



fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10,5))

prodata.hist(column='Sales', ax=axes[0,0], color='white', edgecolor='black', grid=False)
prodata.hist(column='Price', ax=axes[0,1], color='white', edgecolor='black', grid=False)
prodata.hist(column='Advertising', ax=axes[1,0], color='white', edgecolor='black', grid=False)
prodata.hist(column='Income', ax=axes[1,1], color='white', edgecolor='black', grid=False)
plt.show()



Q4. Describe the skew and kurtosis of the distribution of Sales,
Price, Advertising, and Income.

Ans. For each of the given variables, the following observations can be made:
● Sales: the distribution appears to be symmetric and leptokurtic.
● Price: the distribution again seems to be symmetric and leptokurtic.
● Advertising: the distribution is right-skewed and leptokurtic.
● Income: the distribution seems to be roughly uniform.

Note: This graphical plotting confirms the symmetry in the distributions of the Sales and Price variables as observed from the descriptive statistics during Task 2.



Since Sales and Price appear to be normally distributed, it is logical to check the assumption of normality using Q-Q plots.

Task 4: Create Q-Q plots to observe the normality/log-normality of the Sales and Price variates.
A Quantile-Quantile plot (Q-Q plot) is a scatterplot created by plotting
two sets of quantiles against one another. If both sets of quantiles came
from the same distribution, you should see the points forming a roughly
straight line.

The y-coordinates of a Q-Q plot are the sample quantiles and the
x-coordinates are the theoretical quantiles from the normal distribution.
It is a method to check for normality in a dataset through visualization.
A reference line in a Q-Q plot that passes through the first and third
quartiles of the sample and the theoretical normal sample is also plotted.
This line is usually called the Q-Q Line.

The probplot() function from the scipy.stats module can be used to create Q-Q plots in Python. By default, this function fits the normal distribution, i.e., the dist parameter is “norm” and the fit parameter is True.

To create multiple plots in a single window, you can also use the subplot() function, as illustrated below. It adds a subplot to the current figure at the specified grid position. It is similar to the subplots() function; however, unlike subplots(), it adds one subplot at a time.



plt.figure(figsize=(10,10))

i) Sales

plt.subplot(2,2,1)
stats.probplot(prodata['Sales'], dist="norm",
plot=plt, fit = True)
plt.title('Normal Q-Q plot of Sales')

plt.subplot(2,2,2)
stats.probplot(np.log(prodata['Sales']),
dist="norm", plot=plt, fit = True)
plt.title('Normal Q-Q plot of log(Sales)')

ii)Price

plt.subplot(2,2,3)
stats.probplot(prodata['Price'], dist="norm",
plot=plt, fit = True)
plt.title('Normal Q-Q plot of Price')

plt.subplot(2,2,4)
stats.probplot(np.log(prodata['Price']),
dist="norm", plot=plt, fit = True)
plt.title('Normal Q-Q plot of log(Price)')



plt.show()



Q5. Does the distribution of Sales and Price appear to be normal or
log-normal?

Ans. As evidenced by the plots above, the scatter of sample against theoretical quantiles fits the reference line better for the normal distribution. Hence, both Sales and Price seem to have a normal distribution.



2.2. Inference about the mean
In this segment, you will try to make inferences about the population means of Sales and Price. To do so, the one-sample t-test is appropriate.
Note: Price and Sales are normally distributed; hence the normality assumption of the t-test is satisfied.

A one-sample t-test is a statistical procedure used to determine whether a sample of observations could have been generated by a process with a specific population mean.

Hypothesis:
Suppose you want to test the hypothesis that a random sample xi (i =
1,2,..,n) has been drawn from a normal population with a specified mean
μ0. Then,

The null hypothesis, H0: μ = μ0 vs.
Alternate hypothesis, H1: μ ≠ μ0

Test Statistic:

t = (X̄ − µ0) / (s/√n) ~ t(n−1)

where n is the number of observations, X̄ is the sample mean, and s is the sample standard deviation.
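The statistic can be computed by hand and checked against scipy's ttest_1samp (a sketch on a simulated sample; µ0 and the sample parameters are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=7.5, scale=2.0, size=50)
mu0 = 7

# t = (sample mean - mu0) / (s / sqrt(n)), with s the sample standard deviation
t_manual = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))

t_scipy, p = stats.ttest_1samp(x, mu0)
print(np.isclose(t_manual, t_scipy))  # True
```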



Test Criteria:
If | t | > t(α/2) then, reject the null hypothesis of μ = μ0. Otherwise, do not
reject H0.

Use the ttest_1samp() function from the scipy.stats module in Python to apply a one-sample t-test. The syntax is ttest_1samp(x, popmean), where x is an array-like structure of sample values and popmean is the mean specified by the null hypothesis.

Task 5: (i) Apply t-test for the null hypothesis that the true mean
of Sales is 7.

x = prodata['Sales']
np.mean(x)

7.496325

t-test to test μ = 7 with a conf.level = 0.95

t,p = stats.ttest_1samp(x,7)
print('t-Statistic = %.4f, p-value = %.4f' % (t,
p))

t-Statistic = 3.5149, p-value = 0.0005



The output returns the value of the test statistic and the p-value.

To find the confidence interval of the sample mean, use the t.interval() function from the scipy.stats module. It returns the lower and upper limits of the confidence interval. (In recent SciPy versions, the first parameter of t.interval() is named confidence rather than alpha.)

(ll,ul) = stats.t.interval(alpha=0.95,
df=len(x)-1, loc=np.mean(x), scale=stats.sem(x))
print("CI for Sales = (%.4f,%.4f)" % (ll,ul))

CI for Sales = (7.2187,7.7739)

Q6. What is the estimated population mean and its 95% confidence
interval?

Ans. The estimated population mean is 7.496 and the 95% CI is (7.219,
7.774).

Q7. What is the probability of committing a Type I error?

Ans. The probability of committing a Type I error is α = 0.05. That is, with only a 5% chance of being wrong, an assertion can be made that the true mean of Sales lies between 7.22 and 7.77.



Q8. Explain the result obtained by the t-test.

Ans. Since the p-value is less than 0.05, reject the null hypothesis and
conclude that the true Sales mean is not equal to 7.

(ii) Apply t-test for the null hypothesis that the true mean of
Price is 115.

y = prodata['Price']
np.mean(y)

115.795

t-test to test μ= 115 with a conf.level = 0.95

t,p = stats.ttest_1samp(y,115)
print('t-Statistic = %.4f, p-value = %.4f' % (t,
p))

t-Statistic = 0.6715, p-value = 0.5023

Can you make a conclusion about the t-test for the true mean of Price
now?



2.3. Impact of Categorical Variables
You will now attempt to separate the data based on one of the factor
variables and then analyze each of the subsets obtained.

Task 6: Subset the dataset based on whether the store is located in the US or not.

To subset the data, use boolean indexing with the [] operator along with the equality (==) or inequality (!=) comparison operators.

Now, two new DataFrames, proUSno and proUSyes have been created
for the non-US and US data respectively.

proUSno = prodata[prodata['US'] == 'No']    # storing data where US column is No

proUSyes = prodata[prodata['US'] != 'No']   # storing data where US column is Yes

As a next step, the describe() function is applied to the two DataFrames
to check if there are significant differences between them. You can also
plot histograms or boxplots for visualizing the differences in
distributions.

proUSno.describe()

proUSyes.describe()

Q9. Is there any difference between the advertising budgets inside and outside the US?

Ans. According to the output obtained above, there seems to be a
considerable difference in the advertising budgets inside and outside the US.

Task 7: Making side-by-side boxplots to visualize the difference
between distributions of the subsetted datasets.

Before performing the t-test, it is good to have some intuition about the
distribution of the Sales and Price inside and outside the US. Therefore, a
side-by-side boxplot is used to compare the distribution of a continuous
variable as subsetted by a categorical variable.

The boxplot() function from the seaborn library can be used in Python to
create side-by-side boxplots as illustrated in the code below:

plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
sns.boxplot(x=prodata['US'], y=prodata['Sales'])

plt.subplot(1,2,2)
sns.boxplot(x=prodata['US'], y=prodata['Price'])
plt.show()

In these boxplots, you can observe that Sales are slightly higher inside
the US. However, the distribution of Price does not seem to vary much
inside and outside the US.

Note: You can also use the boxplot function from Pandas to make a
similar boxplot.
prodata.boxplot(['Sales'],by='US')
prodata.boxplot(['Price'],by='US')

Q10. Is there any difference between Price, Sales inside and outside
the US?

Ans. There appears to be no difference between Prices inside and outside
the US. But to check if the difference observed in Sales is statistically
significant or not, an independent sample t-test has to be performed.

Task 8: Perform the two-sample t-test (also known as
independent sample t-test) for equality of means to determine
the significance of the difference in the Price, Sales inside and
outside the US.

Let us first try to understand an independent sample t-test.

Two sample t-tests are used to determine if two population means are
equal. Here, conducting a t-test for unpaired data is appropriate.

Let us suppose there are two independent samples X1i (i = 1 to n1) and
X2i (i = 1 to n2) drawn from normal populations.

Test hypothesis:
The null hypothesis, H0: μ1 = μ2 vs.
Alternate hypothesis, H1: μ1 ≠ μ2

Test Statistic:

1) Unequal variances,

$$ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \;\sim\; t_{\nu} \quad \text{with } d.f.\ \nu = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}} $$

This test is also known as Welch’s t-test.

2) Equal variances,

$$ t = \frac{\bar{X}_1 - \bar{X}_2}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \;\sim\; t_{(n_1 + n_2 - 2)} $$

where the pooled variance is

$$ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} $$

n1 and n2 are the sample sizes, X̄1 and X̄2 are the sample means, and s1² and
s2² are the sample variances.

Test Criteria:
If | t | > t(α/2) then reject the null hypothesis of no difference (in other
words, the equality of means of the two independent samples). Otherwise,
do not reject H0.

To apply the independent sample t-test in Python, the ttest_ind()
function is used. This function performs a two-sided test for the null
hypothesis that 2 independent samples have identical average (expected)
values. This test assumes that the populations have equal variances by
default.

Additionally, you can also specify the alternative hypothesis using the
‘alternative’ argument.

ttest_ind() also returns the value of the test statistic along with the
p-value.

t,p = stats.ttest_ind(proUSno['Sales'], proUSyes['Sales'], equal_var=False)
print("t-statistic for independent t-test on Sales = %.4f and p-value = %.4f" % (t,p))

t-statistic for independent t-test on Sales = -3.6956 and p-value = 0.0003

For Sales, since the p-value is less than 0.05, reject the null hypothesis of
no difference and conclude that sales differ significantly inside and
outside the US.

t,p = stats.ttest_ind(proUSno['Price'], proUSyes['Price'], equal_var=False)
print("t-statistic for independent t-test on Price = %.4f and p-value = %.4f" % (t,p))

t-statistic for independent t-test on Price = -1.1164 and p-value = 0.2653

For Price, since the p-value is greater than 0.05, do not reject the
hypothesis and conclude that prices do not differ significantly inside and
outside the US in the given sample.
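As a quick numerical check of Welch’s formula, the following sketch compares scipy’s statistic with one computed directly from the formula. The two samples here are synthetic stand-ins, not the Carseats data:

```python
import numpy as np
from scipy import stats

# Two synthetic independent samples (illustrative only, not the Carseats data)
rng = np.random.default_rng(0)
x1 = rng.normal(7.9, 2.0, size=50)
x2 = rng.normal(7.0, 2.5, size=60)

# Welch's t-test via scipy
t_scipy, p_scipy = stats.ttest_ind(x1, x2, equal_var=False)

# The same statistic computed directly from the formula
n1, n2 = len(x1), len(x2)
v1, v2 = x1.var(ddof=1), x2.var(ddof=1)
t_manual = (x1.mean() - x2.mean()) / np.sqrt(v1 / n1 + v2 / n2)

print(t_scipy, t_manual)  # the two values agree
```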

Task 9: Performing analysis by subsetting using ShelveLoc.

A similar analysis can be performed for other factor variables like
ShelveLoc and Urban.

Rather than repeating the entire analysis, making side-by-side boxplots
will give you a summary at a glance.

plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
sns.boxplot(x=prodata['ShelveLoc'], y=prodata['Sales'])

plt.subplot(1,2,2)
sns.boxplot(x=prodata['ShelveLoc'], y=prodata['Price'])
plt.show()

Q11. Do Sales seem to differ across the three levels of ShelveLoc?

Ans. Sales seem to be impacted by ShelveLoc. The median of Sales is
highest for Good, followed by Medium and Bad respectively.
However, to test for the statistical significance of the difference between
means of Sales across the three levels of ShelveLoc, an F-test has to be
performed.

Task 10: Perform an F-test to check for the homogeneity of Sales for the three levels of ShelveLoc.

F-test for equality of several means (ANOVA) is used to determine
whether there are any statistically significant differences between the
means of three or more independent (unrelated) groups.
It consists of partitioning the total variation in an experiment into
different sources.

Let yij be the jth observation in the ith class where i = 1,2,...,k and j =
1,2,...,ni. Here, the homogeneity of population means for the k classes is
to be tested.

The hypothesis is
Null hypothesis, H0: μ1 = μ2 = … = μk = μ vs.
Alternate hypothesis, H1: μi ≠ μj for at least one i and j

Test Statistic:

ANOVA Table

Source of    Degrees of    Sum of       Mean Sum of               F-Statistic
Variation    Freedom       Squares      Squares
Factor       k-1           SS(Factor)   MSF = SS(Factor)/(k-1)    F = MSF/MSE ~ F(k-1, n-k)
Error        n-k           SS(Error)    MSE = SS(Error)/(n-k)
Total        n-1           TSS

Test Criteria:
Reject H0 at α% level of significance if p-value < α. Otherwise, do not
reject H0.

In Python, the f_oneway() function from the scipy.stats module can be
used to perform one-way ANOVA. It tests the null hypothesis that two or
more groups have the same population mean. The test is applied to
samples from two or more groups.

F,p = stats.f_oneway(
prodata['Sales'][prodata['ShelveLoc']=='Bad'],
prodata['Sales'][prodata['ShelveLoc']=='Good'],
prodata['Sales'][prodata['ShelveLoc']=='Medium']
)

print('F-Statistic = %.4f, p-value = %.4f' % (F, p))

F-Statistic = 92.2299, p-value = 0.0000

You can see that the p-value for the F-statistic is almost zero (i.e. p-value
< 0.05). Thus, reject the null hypothesis and conclude that the mean
population Sales are not homogenous across levels of ShelveLoc.
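To connect f_oneway() back to the ANOVA table above, here is a sketch that rebuilds the F-statistic from the sum-of-squares decomposition. The three groups are synthetic stand-ins for the ShelveLoc levels:

```python
import numpy as np
from scipy import stats

# Three synthetic groups standing in for the ShelveLoc levels
rng = np.random.default_rng(1)
groups = [rng.normal(mu, 1.0, size=30) for mu in (5.0, 7.0, 10.0)]

F_scipy, p_scipy = stats.f_oneway(*groups)

# Manual decomposition following the ANOVA table
all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()
k, n = len(groups), len(all_obs)

ss_factor = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # SS(Factor)
ss_error = sum(((g - g.mean()) ** 2).sum() for g in groups)             # SS(Error)

msf = ss_factor / (k - 1)   # mean square for the factor
mse = ss_error / (n - k)    # mean square for the error
F_manual = msf / mse
```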

The analysis performed so far has given you an understanding of the
data. You have gained insights into what might be useful while
modeling, say the Price and ShelveLoc variables. Hence, these variables
have been used for further analysis of Sales in the upcoming sections of
the book.

3. Bivariate Analysis
a. What – As the name suggests, Bivariate (bi – two) analysis is
concerned with the analysis of the relationship of two variables.
b. Why – It is useful in finding associations or differences between
two variables and measuring the significance of such associations
or differences.
c. How – The steps of this analysis have been described in this
section.

3.1. Correlation Analysis


So far, you have performed a univariate analysis of variables. In this
segment, you will gain an understanding of the relationships between
pairs of variables present in the Carseats datasets. This step will help a
lot during the later stages of developing the model.

Task 11: Visualise the correlation among a few pairs of the variables in the dataset.

Make the scatter plots of all the quantitative variables in the Carseats
dataset using the scatter_matrix() function from the plotting module of
Pandas.

You can also use the pairplot() function from the seaborn library to
make the same scatter plots.

from pandas.plotting import scatter_matrix

scatter_matrix(prodata[["Sales","CompPrice","Income","Advertising","Price","Population","Age"]],
               figsize=(10,10))
plt.show()

In the above plot, a negative correlation between Sales and Price can be
observed. Whereas, Sales and Advertising appear to be slightly
positively correlated. You will explore these associations further
individually.

i) Sales and Price

The correlation plot above suggested a negative relationship between
Sales and Price. Thus, to observe this clearly, a scatter plot is made
between the two variables of interest.

A scatter plot uses dots to represent values for two different numeric
variables. The position of each dot on the horizontal and vertical axis
indicates values for an individual data point. Scatter plots are used to
observe relationships between variables.

A scatter plot along with a regression line can be plotted by using the
regplot() function which is defined in the seaborn library. As an
argument, you have to pass the x and y variables along with the optional
data argument.

sns.regplot(y='Sales', x='Price', data=prodata)
plt.show()

Q12. Describe the relationship in words.

Ans. The relationship between Price and Sales is negative. That is, at a
higher price, sales tend to be lower. The line of best fit also has a
negative slope in the above plot.

ii) Sales and Advertising

A scatter plot between Sales and Advertising budgets can be plotted similarly.

sns.regplot(y='Sales', x='Advertising', data=prodata)
plt.show()

Q13. Describe the relationship in words.

Ans. There is a mildly positive association between Advertising and
Sales. The scattered points suggest that the linear relationship is not
strong.

Task 12: Perform bivariate correlation analysis for the pairs for
which the visualization was done.

Pearson’s Correlation Coefficient (or Pearson product-moment
correlation) is a measure of the strength of a linear association between
two variables.

It ranges from -1 to 1. A “0” means there is no linear relationship between
the variables at all, while -1 or 1 means that there is a perfect negative or
positive correlation respectively.

t-test for testing the significance of the population correlation coefficient

A hypothesis test of the "significance of the correlation coefficient" is
performed to decide whether the linear relationship in the sample data is
strong enough to make inferences about the relationship in the
population.

The sample data is used to compute the sample correlation coefficient
(r). Whereas, ρ denotes the unknown population correlation coefficient.

Test Hypothesis:

Null Hypothesis, H0: ρ = 0 vs.
Alternate Hypothesis, H1: ρ ≠ 0

Test Statistic:

$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \;\sim\; t_{(n-2)} $$

where n is the number of observations in each variable.

Test Criteria:

If p-value < α, then reject the null hypothesis at α% level of significance
for the given data and conclude that the population correlation
coefficient is statistically significant. Otherwise, we fail to reject the null
hypothesis.
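The test statistic above can be computed by hand and its p-value compared against the one pearsonr() returns. A small sketch on synthetic data (the variables here are illustrative, not Sales and Price):

```python
import numpy as np
from scipy import stats

# Two synthetic, linearly related variables
rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)

r, p_scipy = stats.pearsonr(x, y)

# t = r * sqrt(n-2) / sqrt(1 - r^2), referred to a t(n-2) distribution
n = len(x)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p_manual = 2 * stats.t.sf(abs(t), df=n - 2)  # two-sided p-value
```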

In Python, the corr() function from the Pandas library calculates the
correlation coefficient. You can compute different types of correlation
coefficients by specifying the “method” argument of the corr() function.
By default, it calculates the Pearson correlation coefficient.

To conduct the t-test for significance of the correlation, the pearsonr()
function from the scipy library is used. The observations corresponding
to the two datasets are passed as two Pandas Series arguments.

i) Sales and Price

print(prodata[['Sales','Price']].corr().round(5))

Sales Price
Sales 1.00000 -0.44495
Price -0.44495 1.00000

As shown above by the scatter plot between Sales and Price, the
correlation coefficient also suggests a mildly negative correlation. As a
next step, to test the significance of the population correlation
coefficient, the pearsonr() function has been applied to Sales and Price.

c,p = stats.pearsonr(prodata['Sales'], prodata['Price'])
print('p-value = %f' % p)

p-value = 0.000000

Q14. Analyze the result of the correlation test.

Ans. The correlation is certainly different from zero since the p-value is
almost zero. Thus, there is a significant correlation between Sales and
Price.

ii) Sales and Advertising

c,p = stats.pearsonr(prodata['Sales'], prodata['Advertising'])
print('correlation coefficient between Sales and Advertising = %f \n p-value = %f' % (c,p))

correlation coefficient between Sales and Advertising = 0.269507
p-value = 0.000000

In this case, the p-value below 0.05 supports the observation about some
positive correlation between Advertising and Sales.

3.2. Bivariate Linear Regression
Analysis
Regression Analysis is the mathematical measure of the underlying
relationship between two or more variables.

When you study a variable (dependent) in terms of another variable
(independent) through a linear relationship between them, it is called
Bivariate Linear Regression Analysis.
If two or more independent variables are used, then it is called Multiple
Linear Regression Analysis.

In non-deterministic models (where randomness is involved), regression
is always an approximation. Hence, the presence of errors is
unavoidable. However, by minimizing the sum of squares of errors,
you find the best equation that can explain the dependent variable in
terms of the independent variables by obtaining the closest estimates to
the intercept and coefficients of the independent variables.

Coefficient of determination (R2) measures the proportion of variation
in the dependent variable that can be predicted from the set of
independent variables in a regression equation. It varies from 0 to 1. A
value closer to 1 indicates that the variability in the dependent variable
can be explained well by the independent variables, whereas a value
closer to 0 implies that most of the variance results from chance causes
or the absence of some other explanatory variable in the model.

$$ R^2 = 1 - \frac{SSE}{TSS} $$

where SSE (Sum of Squares due to Error) is the variability left
unexplained by the model and TSS (Total Sum of Squares) is the total
variability in the dependent variable.

$$ TSS = \sum_{i=1}^{n} \left(Y_i - \bar{Y}\right)^2 \quad \text{and} \quad SSE = \sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2 $$

However, the addition of an independent variable almost always leads to
an increase in the value of R2, that is, R2 is a non-decreasing function of
the number of regressors. Therefore, adjusted R2 is used, which takes
into account the number of regressors in the model. Both SSE and TSS
are divided by their degrees of freedom. Adjusted R2 is a more reliable
measure of goodness of fit of a linear regression model.

$$ \text{adjusted } R^2 = \bar{R}^2 = 1 - \frac{SSE/(n-(k+1))}{TSS/(n-1)} $$

where k is the number of explanatory variables in the model.

Note: SSE has n-(k+1) degrees of freedom since 1 intercept and k slope
coefficients have been estimated in the model.
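The two measures can be checked against each other with a short sketch. Here y and y_hat are synthetic stand-ins for observed and fitted values, with k = 2 regressors assumed:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 100, 2
y = rng.normal(size=n)                      # stand-in observations
y_hat = y + rng.normal(scale=0.5, size=n)   # stand-in fitted values

sse = ((y - y_hat) ** 2).sum()
tss = ((y - y.mean()) ** 2).sum()

r2 = 1 - sse / tss
adj_r2 = 1 - (sse / (n - (k + 1))) / (tss / (n - 1))

# Equivalent closed form: adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1)
adj_r2_alt = 1 - (1 - r2) * (n - 1) / (n - k - 1)
```

Note that the adjusted value is always below the plain R2 whenever k > 0 and R2 < 1.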

The steps of Bivariate Linear Regression Analysis have been explained
by regressing Sales on Price.
In Python, there are two major ways to fit a linear regression model:

1) By using the statsmodels library - The OLS() function in the
   statsmodels library can be used to fit linear regression models. The
   advantage of using this library is that the summary() method can be
   used to display the details of the fitted model.
   However, a small caveat with this library is that you will have to
   explicitly specify the constant term in your model by adding a
   column of 1’s in your X matrix (design matrix). A function,
   add_constant(), can save the effort of adding the column of 1’s
   manually.
2) By using the scikit-learn library - The LinearRegression() function
   from the linear_model module can be used to fit linear regression
   models. It has an additional parameter ‘fit_intercept’ which is True
   by default. Thus, it is easier to fit models using this library.
   But it does not have a summary method to display the details of a
   model.

Due to the advantage of the summary() method, this book uses the
statsmodels library over sklearn for fitting.

Task 13: Fitting a bivariate linear regression model

For fitting a bivariate linear regression model with Sales as the
dependent variable and Price as the independent variable, the OLS()
function has been used. The steps are as follows:
1. Firstly, you create X and Y variables and then add a constant term.
2. Then, you use the OLS() function and specify the dependent
variable and the independent variables.
3. Lastly, you use the fit() method to fit a linear regression model.

On applying the summary() method on the model object, a table of
coefficients along with R2, adjusted R2, and AIC values is returned.
The table of coefficients also includes the p-values for the t-test of the
significance of individual regression coefficients. This table will help
you in identifying an insignificant variable (if any).

X = prodata['Price']
Y = prodata['Sales']
X = sm.add_constant(X)

model = sm.OLS(Y, X).fit()

print(model.summary())

Note: The table will look a bit different with the print command.

Q15. What is the best predictive equation for Sales given Price?
What is its interpretation?

Ans. Sales = 13.6419 - 0.0531*Price
For a unit increase in Price, Sales are expected to decrease by
approximately 53 units.

Note: Sales in Carseats dataset is given in thousands.

Q16. How much variability in Sales is explained by Price?

Ans. The adjusted R2 of the above model is 0.196. Hence, 19.6% of the
variability in Sales is explained by Price.

3.3. Regression Diagnostics
In the overview, the assumptions of a linear model have been stated.
Thus, the model obtained is the best fit to the data subject to the
condition that these assumptions hold. Therefore, certain techniques to
check for these assumptions have been elucidated below.

Task 14: Plot the actual values along with the fitted regression
line

A perfect model would predict the values of the dependent variable
without any error. However, since a perfect model is rarely attainable,
the fitted model minimizes the errors instead.

Additionally, according to one of the assumptions, the expected value of
the error terms should be zero. In such a scenario, the visualization of the
original values with the fitted regression line is expected to have an
almost equal scatter of values on both sides of the line throughout its
range.

Simply having a linear fit is not sufficient. This assumption must be
verified. Otherwise, the results can be misleading.

For example, consider the famous Anscombe's quartet. It contains four
data sets that have nearly identical simple descriptive statistics, yet have
very different distributions and appear very different when visualized.
However, the same linear fit is obtained in all four cases.

You can see that only the first graph satisfies the assumption.

For the second plot, a quadratic fit seems to be more appropriate than a
linear fit whereas in the third and fourth plot, due to the influence of a
single point, a much better model could not be fitted to the datasets.

Hence, verification of the model assumptions is very important.
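The point can be checked numerically. The sketch below uses the published x and y values of the first two Anscombe datasets and shows that both yield essentially the same fitted line (slope ≈ 0.5, intercept ≈ 3):

```python
import numpy as np

# Published values for Anscombe's datasets I and II (they share the same x)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

# Least-squares fit for each dataset
slope1, intercept1 = np.polyfit(x, y1, 1)
slope2, intercept2 = np.polyfit(x, y2, 1)

print(round(slope1, 3), round(slope2, 3))  # near-identical slopes
```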

To verify the assumption of linearity between the dependent variable (y)
and independent variable (x), a scatter plot along with the fitted line is
plotted. You can also add line segments from each data point to the line
of best fit.

In Python, a combination of the regplot() function from seaborn and the
plot() function from the matplotlib library will create the desired plot.

The plot([x0,x1],[y0,y1]) function draws a line between pairs of points
whose coordinates are given by (x0,y0) and (x1,y1). Hence, you will use it
to plot lines between the pairs of points (Price, Sales) and (Price,
predicted Sales).

The fitted values of Sales can be calculated using the predict() function.

The Python code is as shown below:

fig1, ax1 = plt.subplots()
sns.regplot(x='Price', y='Sales', data=prodata, ax=ax1, ci=None, marker='.')
ax1.plot([prodata['Price'], prodata['Price']],
         [prodata['Sales'], model.predict()],
         linestyle='--', color='black')
plt.show()

Q17. What should be the ideal relationship between the predicted
and the observed values?

Ans. They should be identical, that is, they should fall on the 1:1 line. Of
course, they are not equal because of the presence of errors in model
fitting. In any case, they should be symmetric about a 1:1 line (i.e. the
length of the residual segments should be approximately equal above and
below the line) throughout the range.

Task 15: Plot Residuals vs. Predicted with 3,2,1 sigma limits (to
test the presence of heteroscedasticity).

Our assumption regarding the constant variance of error terms
(homoscedasticity) may not hold in the original data.

To check for this, a plot of residuals against the predicted values is made.

If the scatter is sufficiently random, that is, no pattern can be identified in
the plot, then we can say that heteroscedasticity is not significant.

Consider the examples shown above. Apart from the first one, all the
other plots have some pattern and hence show signs of the presence of
heteroscedasticity or non-constant variance of the error terms.
Therefore, a model showing these kinds of outputs on the Residuals vs.
Fitted values plot violates the assumption of homoscedasticity.

A plot of residuals vs fitted values is created using the scatterplot()
function of the seaborn library. A for loop is then used to mark the ±3,
±2, ±1 sigma lines on the obtained plot using the axhline() function of
the pyplot module of the matplotlib library.

Y_pred = model.predict(X)
residuals = Y - Y_pred
sd_red = np.std(residuals)
a = [-3,-2,-1,0,1,2,3]
b = ['r','g','b','k','b','g','r']
sns.scatterplot(x=Y_pred, y=residuals)
plt.xlabel('Predicted Sales')
plt.ylabel('Residuals')
for i,j in zip(a,b):
    plt.axhline(i*sd_red, color=j)

std() function from the numpy library computes the standard deviation of
the array of residuals. To compute the residuals, subtract the fitted values
of Sales from the actual Sales.
The color argument of axhline specifies the color of the plotted line.

Q18. What should be the ideal plot of residuals vs fitted values? Do
we see this expected relation? What can you say about the presence
of heteroscedasticity?

Ans. All the residuals should ideally fall on the horizontal line with a
y-intercept as 0. However, they don’t fall on that line due to the presence
of error. But in any case, they should be symmetric about this line
throughout the range and have the same degree of spread.
There is no visible pattern in the plot and the spread seems to be random.
This confirms that there is no heteroscedasticity in the data.

Note: The points on the upper and lower half of the plot are almost
equal. You can also verify it yourself.
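One way to verify such a claim is to count the residuals on either side of zero. A sketch with synthetic residuals (in practice, substitute the residuals from the fitted model):

```python
import numpy as np

rng = np.random.default_rng(5)
residuals = rng.normal(size=400)    # stand-in for the model residuals

above = int((residuals > 0).sum())  # points in the upper half of the plot
below = int((residuals < 0).sum())  # points in the lower half of the plot
print(above, below)                 # roughly balanced for symmetric residuals
```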

Task 16: Visualise the distribution of residuals (To test for the
normality of error terms).

The assumption of the normality of the error terms should reflect in their
visualization. Therefore, a histogram or a stem-and-leaf plot of residuals
should show bars that can be approximated to a bell-shaped curve.

i) Stem and Leaf plot

stemgraphic.stem_graphic(residuals, scale=1)
plt.show()

ii) Histogram

plt.hist(residuals, color='white', edgecolor='black')  # by default, the number of bins is 10
plt.xlabel('Residuals')
plt.ylabel('Count')
plt.title('Histogram of residuals')
plt.show()

Note: The distplot() function from the seaborn library can also be used
to plot the histogram along with a fitted density curve.
sns.distplot(residuals)

Q19. What can be said about the distribution of residuals?

Ans. Based on the above-plotted histogram, the residuals are symmetric
about zero and seem to be normally distributed.

Task 17: Create a Q-Q Plot of the residuals

Along with the histogram, the QQ Plot of the residuals should show most
of the points lying on or very close to the QQ line throughout the range.

stats.probplot(residuals, dist="norm", plot=plt, fit=True)
plt.title('Normal Q-Q plot of Residuals')
plt.show()

Q20. Do the residuals follow the normal distribution? Where are the
discrepancies, if any?

Ans. For most of the part, the residuals follow the theoretical normal
distribution well: they are very close to the normal line. However, there
is a slight deviation towards the left end. This deviation does not seem
significant for the analysis and can be ignored.

3.4. RMSE
Root Mean Squared Error (RMSE) is the standard deviation of the
residuals. RMSE is a measure of how spread out the residuals are. In
other words, it tells us how concentrated the data is around the line of
best fit.

RMSE is always non-negative, and a value of 0 (rarely achieved in
practice) would indicate a perfect fit for the data. In general, a lower
RMSE is better than a higher one.

However, comparisons across models fitted to differently scaled
dependent variables using RMSE would be invalid because the measure
depends on the scale of the numbers used. It is computed by the
following formula:

$$ RMSE = \sqrt{\frac{\sum_{i=1}^{n} \hat{u}_i^2}{n}} $$

where $\hat{u}_i = y_i - \hat{y}_i$ (i = 1,2,...,n) denotes the ith residual and n is the
number of observations.

Task 18: Find the values of RMSE and coefficient of determination for both models.

To compute the RMSE of the model of Sales on Price, use the
mean_squared_error() function from the metrics module of scikit-learn.

The parameters are y_true and y_pred, which take the true and predicted
values of the dependent variable respectively. Additionally, there is a
parameter ‘squared’ which returns the MSE value if True and the RMSE
value if False.

rmse = sk.mean_squared_error(Y, Y_pred, squared=False)
print('RMSE' + ' = ' + str(rmse.round(5)))

RMSE = 2.52599

You would have noticed before that the summary() function prints the
coefficient of determination (R2) value among other things. Another
method of extracting it is by using the r2_score() function from the
metrics module of scikit learn. This function takes two parameters y_true
and y_pred.

r2 = sk.r2_score(Y, Y_pred)
print('R2' + ' = ' + str(r2.round(5)))

R2 = 0.19798

Q21. What is the difference between RMSE and R2 values?

Ans. Two models with differently scaled dependent variables cannot be
compared using RMSE, as it depends upon the units of the dependent
variable (y). Hence, a lower RMSE need not necessarily imply that one
model is better than the other.

However, R2 can overcome this drawback of RMSE and can be used to
compare different models having the same number of independent
variables.

3.5. Prediction of the response
variable
The whole idea behind modeling is to be able to predict the unknown
from the known with a particular margin of error.

In this section, the previously created model has been used to predict the
dependent variable using two different methods.

Task 19: Predict the values for Sales for the mean value of Price.

a) Method of Regression Equation

Simply substitute the value of the independent variable(s) in the
regression equation obtained before to find the value of the response
variable. Though this works well for a single value, you would need to
apply a loop if you have many values to predict.

mean_p = np.mean(X['Price'])
print(mean_p)

115.795

Predicted Sales when Price = 115.795

The estimated coefficients can be extracted by applying the params
attribute on the model object. To find the coefficients of the model of
Sales on Price, use the following command.

mp = model.params
print(mp)

const 13.641915
Price -0.053073
dtype: float64

Now once you know the coefficients, you can simply replace them in the
model equation to find the fitted value of the dependent variable.

predsales1 = mp['const'] + mean_p*mp['Price']
print(predsales1.round(5))

7.49633

b) Using the predict function

The predict() function is a more convenient way of finding the predicted
values. The illustration of the Python code is given below.

Predicted Sales when Price = 115.795

p = [1, 115.795]
predsales2 = model.predict(p)
print(predsales2.round(5))

[7.49633]

You will have to add a constant term manually while using the predict
function with a model fitted from statsmodels library.
The output returns the fitted value of the dependent variable.

You can observe that the fitted value comes out to be the same from both
methods.

3.6. Confidence and Prediction
Interval
A confidence interval is a range of values associated with a
population parameter for a given confidence level. For example, a
confidence interval having a 95% confidence level is a range of values
that you can be 95% certain contains the true mean of the population.

Let X1, X2, …, Xn be a random sample from N(μ, σ²) where σ² is
unknown. Let X̄ denote the sample mean and s the sample standard
deviation. Then,

P( −t(α/2) < (X̄ − μ)/(s/√n) < t(α/2) ) = 1 − α

Hence, 100(1-α)% CI for population mean (μ) is given by

( X̄ − t(α/2)·s/√n ,  X̄ + t(α/2)·s/√n )

where t(α/2) is the upper α/2 point of the t-distribution with n−1
degrees of freedom.
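As a quick numerical illustration of this interval, here is a self-contained sketch on a synthetic sample (unrelated to the Carseats analysis); the sample parameters are made up:

```python
import numpy as np
from scipy import stats

# Synthetic sample to illustrate the t-based confidence interval formula.
rng = np.random.default_rng(0)
x = rng.normal(loc=10, scale=2, size=50)

n = len(x)
xbar = x.mean()
s = x.std(ddof=1)                       # sample standard deviation
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)

# 100(1 - alpha)% CI for the population mean
lower = xbar - t_crit * s / np.sqrt(n)
upper = xbar + t_crit * s / np.sqrt(n)
print((round(lower, 3), round(upper, 3)))
```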

Task 20: Make a prediction DataFrame for the unique values of
Price and then compute Confidence and Prediction Intervals.

A prediction interval is a range of values that is likely to contain the
value of a single new observation given specified settings of
the predictors. For example, for a 95% prediction interval of (5,10), you
can be 95% confident that the next new observation will fall within this
range.

A prediction interval is always wider than the corresponding confidence
interval since it takes into account the uncertainty associated with a
future observation.

A combination of the get_prediction() and summary_frame() methods
can be used to return a DataFrame containing the confidence and
prediction intervals.

result = model.get_prediction().summary_frame()
result.head()

The mean_ci_upper and mean_ci_lower columns of the result
DataFrame are the upper and lower confidence limits respectively,
whereas the obs_ci_upper and obs_ci_lower columns are the upper and
lower prediction limits.

Task 21: Graphical Visualisation of Confidence and Prediction
Intervals

In this section, you will find out that the prediction interval is wider than
the confidence interval by visualization. For doing so, plot the lines of
the confidence interval and the prediction interval along with the original
data points to compare.

A simple regplot() command along with the plot() command will help
you in creating this graph as shown below.

fig1, ax1 = plt.subplots()
sns.regplot(x='Price', y='Sales', data=prodata,
            ax=ax1, ci=False, marker='.', color='black')

ax1.plot(prodata['Price'], result['mean_ci_lower'], color='red')
ax1.plot(prodata['Price'], result['mean_ci_upper'], color='red')
ax1.plot(prodata['Price'], result['obs_ci_lower'], color='b')
ax1.plot(prodata['Price'], result['obs_ci_upper'], color='b')
plt.show()

You can see that the prediction interval (blue line) is much wider than the
confidence interval (red line).

4. Multivariate Correlation and
Regression
a. What – As the name suggests, Multivariate (multi – more than
two) analysis is concerned with the analysis of the relationship of
more than two variables.
b. Why – It is useful in finding an association between one response
and more than one explanatory variable
c. How – The steps of this analysis have been described in this
section.

To fit a multiple linear regression model, you have to figure out the
possible explanatory variables that can impact the response variable. To
do so, again start by analyzing the correlations between the variables of
the Carseats dataset.

4.1. Multivariate Correlation Analysis
Task 22: Calculate the correlation between all the variables.

To find the pairwise simple correlations, you can use the corr() function
from the Pandas library.
It returns a matrix of Pearson’s correlation coefficient between the
variables.

prodata.corr()

You can observe that the diagonal values of the matrix are 1. This is
because the correlation of a variable with itself is 1. The off-diagonal
entries range between -1 and +1.

Another great tool to study correlations is by using a heatmap. A
heatmap can be created by using the heatmap() function from the
seaborn library as depicted below.

sns.heatmap(prodata.corr(), annot=True)
plt.show()

annot=True marks (annotates) the correlation values on the heatmap.

Q25. Describe the output in words.

Ans.
● The sales variable has a moderate negative correlation with Price, a
low positive correlation with Advertising and Income, and a very
low positive correlation with CompPrice.

● Price has a very low positive and negative correlation with
Advertising and Income respectively. It also shows a moderately
positive correlation with CompPrice.
● Advertising has a very low positive and negative correlation with
Income and CompPrice respectively.
● Income has a very low negative correlation with CompPrice.

4.2 Multiple Regression Analysis
Congratulations on making it this far!

This is the most important step of the analysis. Almost always you will
work on more than two variables for modeling. It is crucial to understand
the steps that follow.

Task 23: Null, simple, and multiple regression models.

A null model contains no predictors and only an intercept term.


A simple regression model contains one independent variable while a
multiple regression model contains more than one independent
variable.

In this section, you will compare these models to observe if there is any
improvement in terms of their capacity to explain the dependent variable.

To do so, three methods have been used, namely, comparison of
adjusted R2, comparison of AIC values, and ANOVA.

Begin by fitting the required regression models and then proceed to find
more information about each of them using the summary() function.

Sales as the dependent variable

a) Null Model

A null model has no predictors; it contains only an intercept, and the
fitted intercept is the mean of Y. To fit a null model, pass an array of
ones as the independent variable.

X1 = np.ones(400)
null = sm.OLS(Y, X1).fit()
null.summary()

Note: The null model does not return a meaningful R2 or adjusted R2
value as it has no independent variable. Hence, the total sum of squares
is equal to the error sum of squares.

Thus, R2 = 1 − SSE/TSS = 0

b) Simple Regression Model (Sales ~ Price)

You have already fitted a simple linear regression model for Sales on
Price in the previous section of the book. Hence, you can use the
summary() function on the previously created model.

model.summary()

Simple Regression Model (Sales ~ Advertising)

You must also have noticed that there exists some correlation between
Sales and Advertising. Hence, for comparison, you can fit another
simple linear regression model with Sales as the dependent variable and
Advertising as the independent variable.

X2 = prodata[['Advertising']]
X2 = sm.add_constant(X2)

modelS_Ad = sm.OLS(Y, X2).fit()
modelS_Ad.summary()

c) Multiple Regression Model

Now fit a Multiple Linear Regression Model using both Price and
Advertising as the independent variables. It seems logical that the
advertising budget can influence sales.

X3 = prodata[['Price', 'Advertising']]
X3 = sm.add_constant(X3)

model_mult = sm.OLS(Y, X3).fit()
model_mult.summary()

You can observe in the table of parameter estimates that both Price and
Advertising have p-values < 0.05. This suggests that both the variables
are significant for the regression.

Q26. How much of the total variability of the predictand is explained
by each of the models? Give the predictive equations, rounded to
three decimals.

Ans. a) Based on the R2 values of the models, conclude the following:
● 19.8% variability of Sales has been explained by Price
● 7.26% variability of Sales has been explained by the Advertising
budget
● 28.19% variability of Sales is explained by a multiple regression
model of Price and Advertising

b) The predictive equations of the fitted models are given below:

Null: Sales = 7.496

A null model always predicts the sample mean of the response variable.

Simple: Sales = 13.642 - 0.053*Price
Simple: Sales = 6.737 + 0.114*Advertising
Multiple: Sales = 13.003 - 0.054*Price + 0.123*Advertising

4.3 Comparing Models using
adjusted R2, AIC, and ANOVA
1) Comparing regression models with the adjusted R2

You have already learnt the concept of adjusted R2 in the previous
sections of this book. You will now use it to compare the fitted models.

To extract the adjusted R2 value, use the rsquared_adj attribute of the
fitted statsmodels model object.

print(null.rsquared_adj.round(5))
print(model.rsquared_adj.round(5))
print(modelS_Ad.rsquared_adj.round(5))
print(model_mult.rsquared_adj.round(5))

0.0
0.19597
0.0703
0.27824

The multiple linear regression model of Sales on Price and Advertising
has a higher adjusted R2 as compared to the other simple models. This
suggests an improvement in the model fit.

Q27. What is the effect of adding a variable to the simple model of
Sales on Price?

Ans. Adding the Advertising variable to the linear regression of
Sales ~ Price increased the adjusted R2 from 0.196 to 0.278.

2) Comparing Regression Models with AIC

AIC is another method to compare the goodness of fit of different
models. The Akaike information criterion (AIC) is a technique based on
in-sample fit to estimate the likelihood of a model to predict/estimate
future values. A good model is the one that has the minimum AIC among
the candidate models; a lower AIC value indicates a better fit.

Let k be the number of estimated parameters in the model and let L̂ be
the maximum value of the likelihood function for the model. Then the
AIC value of the model is given by

AIC = 2k − 2 ln(L̂)

The formula for AIC is a linear function of the number of independent
variables used in the model. Thus, it penalizes every additional
independent variable by increasing the value of AIC linearly.

To apply AIC in practice, start with a set of candidate models and then
find the models’ corresponding AIC values. Select, from among the
candidate models, the model that minimizes the information loss.

The aic attribute of a fitted statsmodels model can be used to extract the
AIC of the models.

For the four candidate models, you can find the AIC as shown below.

print(null.aic.round(5))
print(model.aic.round(5))
print(modelS_Ad.aic.round(5))
print(model_mult.aic.round(5))

1966.70562
1880.45635
1938.54287
1838.27177

Q28. Which model is the best according to AIC?

Ans. Observe that the multiple linear regression model of Sales on both
Price and Advertising has the lowest AIC among the set of candidate
models and hence it is the best.

3) Comparing regression models with ANOVA

ANOVA is used to compare two models by using an F-test.
To perform it in Python, you have to pass the models to be compared as
two separate arguments.

Before performing the ANOVA test, you need to understand the theory
associated with this test.

ANOVA to compare two regression models

Let us say that two models are to be compared, a full model and a
reduced model. The full model contains all the variables of the reduced
model, plus additional ones.

For example, in multiple regression:

Reduced model: y = β0 + β1x1 + · · · + βkxk + ε and
Full model: y = β0 + β1x1 + · · · + βkxk + βk+1xk+1 + · · · + βpxp + ε

The test examines whether the full model adds explanatory value over
the reduced model, i.e. whether the full model is better than the reduced
model.

Mathematically, the hypothesis is

H0: βk+1 = · · · = βp = 0 vs.
H1: βi ≠ 0 for at least one i (i = k+1, …, p)

Test Statistic:

F = [(SSE_reduced − SSE_full) / (p − k)] / [SSE_full / (n − p − 1)] ~ F(p−k, n−p−1)

Test Criterion:
Reject H0 at significance level α if the p-value < α. Otherwise, do not
reject H0.

To perform the F-test to compare two models, use the anova_lm()
function. This function is defined in the anova submodule of the stats
module of the statsmodels library.

Comparing (Sales on Price and Advertising) and (Sales on Price)

from statsmodels.stats.anova import anova_lm

anova1 = anova_lm(model, model_mult)
anova1

The p-value < 0.05 for the ANOVA test. Hence, we reject the null
hypothesis and conclude that the multiple regression model of Sales is
better than the simple model of Sales on Price.

print('Reduction in RSS = ' +
      str((anova1['ssr'][0] - anova1['ssr'][1]) / anova1['ssr'][0]))

Reduction in RSS = 0.10457915441004215

Q29. How does the RSS of the multiple regression model compare
with the simple model of Sales on Price?

Ans. The simple model has one more residual degree of freedom as it
has one less independent variable. A 10.5% reduction in the Residual
Sum of Squares is seen for the multiple model as compared to the
simple model.

Comparing (Sales on Price and Advertising) and (Sales on Advertising)

anova2 = anova_lm(modelS_Ad, model_mult)
anova2

The p-value < 0.05 for the ANOVA test. Hence, reject the null
hypothesis and conclude that the multiple regression model of Sales is
better than the simple model of Sales on Advertising.

print('Reduction in RSS = ' +
      str((anova2['ssr'][0] - anova2['ssr'][1]) / anova2['ssr'][0]))

Reduction in RSS = 0.22560852645992305

You can observe a 22.56% reduction in Residual Sum of Squares as
compared to a simple model of Sales ~ Advertising.

4.4. Regression Diagnostics
During the previous analysis, the multiple regression model proved to be
the best for the given dataset. Now, in this segment, an attempt has been
made to check if the assumptions made are correct or not.

To plot the diagnostics of a multiple regression model, 3 plots have
been made.
1) To check for the normality of residuals, you can make a Q-Q plot
of the residuals.
2) To assess the fit of the model, plot fitted vs. observed values of the
response variable and compare it with a line having intercept 0 and
slope 1.
3) To check for the assumption of homoscedasticity (constant
variance), plot residuals vs. fitted values and check if the scatter is
random or not.

Task 24: Analyse the regression diagnostics for the best model.

For Sales on Price and Advertising

1) Q-Q plot of residuals

As depicted in a previous task, you can create a Q-Q Plot to check for
the normality of residuals. If the points lie close to the Q-Q Line, then it
is reasonable to assume that the data is normally distributed.

pred_mult = model_mult.predict(X3)
resid_mult = Y - pred_mult

stats.probplot(resid_mult, dist="norm", plot=plt)
plt.title('Q-Q plot of residuals')
plt.show()

It seems safe to assume that the residuals are normally distributed as the
points are lying close to the Q-Q Line.

2) Fitted vs Actual Sales

A scatter plot of Fitted vs. Actual values along with a line having slope 1
can be used to assess the goodness of fit of a model. A good fit will
imply that the points lie close to the reference line.

sns.scatterplot(x=pred_mult, y=Y, color='black')
plt.xlabel('Fitted Sales')
plt.ylabel('Sales')
plt.plot(Y, Y)
plt.title('Observed vs Fitted')
plt.show()

Points seem to have a lot of deviation from the reference line. Thus, the
model is not performing up to the mark.

3) Residuals vs Fitted Sales

As discussed previously, this plot is made to check for the presence of
heteroscedasticity. As long as the points seem to scatter randomly, it is
safe to assume that the errors are homoscedastic.

sns.scatterplot(x=pred_mult, y=resid_mult, color='black')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.axhline(0)
plt.title('Residuals vs Fitted')
plt.show()

The points appear to be scattered randomly. You can see that there is no
visible pattern in the plot.

Q30. Are the residuals normally distributed? What can you
conclude from these diagnostic plots?

Ans.
● The distribution seems to be normal; the slight deviation
towards the two ends can be ignored.
● In the Observed vs. Fitted plot, the scatter around the upward
sloping fitted line is too large for the fit to be considered good.
● There is no pattern in the residuals vs. fitted values plot. Hence,
you can conclude that the errors are homoscedastic.

4.5. Stepwise Regression
During the last segment, you would have noticed that the multiple linear
regression with two independent variables (Price and Advertising) is not
sufficient to explain the dependent variable (Sales).

The dataset, however, has other variables that might be useful in fitting a
linear regression which have not been explored yet.

While dealing with datasets, it is important to find the variables that are
most important in predicting the outcome variable. It is necessary due to
the computational costs as well as the bias-variance trade-off.

There are many approaches to decide which variables should be
retained in the model. One such method of variable selection has been
discussed in this segment.

Task 25: Apply backward stepwise regression on the full model.

In backward stepwise regression, you start with a full model, that is, a
model that uses all the independent variables, say k, to predict the
dependent variable. Then, at each step, a variable is removed based on
some criterion (for example, AIC, or the magnitude of the fitted
coefficients, which is what the RFE algorithm used below relies on). A
model is fitted again using the remaining k-1 independent variables, and
the procedure is repeated until removing any of the remaining variables
would lose information (or until the desired number of variables is
reached).

Applying backward stepwise regression manually can be a tedious task.
Fortunately, in Python, the RFE() function can be used to perform
backward stepwise regression with only a few lines of code. RFE is
defined in the feature_selection module of the sklearn library.

However, before applying backward stepwise regression, you will have
to prepare the data for modelling.

1) Data Preparation

To prepare the data, you will have to convert the categorical variables to
numeric variables. This is done through one-hot encoding.

When a categorical variable does not have a natural ordering, then using
integer encoding (labeling the categories as 1,2,3, and so on) is not
sufficient. In fact, using this encoding may result in poor performance or
unexpected results.

In one-hot encoding, a new binary variable is added for each unique
category of the variable. To do this encoding in Python, you can directly
use the get_dummies() function on a DataFrame.

data = pd.get_dummies(prodata, drop_first=True)
data.head()

Note: The drop_first argument is used to avoid redundant variables. For
a variable with n categories, (n-1) encoded variables can sufficiently
hold all the information.
For example, to represent the 3 categories of the ShelveLoc variable, 2
encoded variables are enough; you can deduce the value of the third.
For the first row, you can see that both ShelveLoc_Good and
ShelveLoc_Medium are 0, suggesting that this observation has a ‘Bad’
ShelveLoc.
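The drop_first behaviour is easy to see on a toy frame (the values below are hypothetical, only the column name mirrors the Carseats variable):

```python
import pandas as pd

# Toy frame illustrating one-hot encoding with drop_first=True.
df = pd.DataFrame({'ShelveLoc': ['Bad', 'Good', 'Medium', 'Bad']})
encoded = pd.get_dummies(df, drop_first=True)

# The first category ('Bad') is dropped: a row of all zeros means 'Bad'.
print(encoded)
```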

Now, create X and Y for fitting

Y = data['Sales']
X = data.drop('Sales', axis=1)

You can use the drop() function from Pandas to keep all the variables of
‘data’ except the ‘Sales’ variable in X.

2) Performing Backward Stepwise Regression

RFE stands for Recursive Feature Elimination.

The documentation on sklearn describes RFE as follows:

“The goal of the RFE() function is to select features by recursively
considering smaller and smaller sets of features. First, the estimator is
trained on the initial set of features and the importance of each feature
is obtained either through any specific attribute (such as coef_ or
feature_importances_). Then, the least important features are pruned
from the current set of features. That procedure is recursively repeated
on the pruned set until the desired number of features to select is
eventually reached.”

You will also need the LinearRegression() function for fitting a linear
regression model. This will be used as a parameter within the RFE
function.

The implementation of backward stepwise regression is given below. In
this case, 8 variables have been selected out of the 11 variables
available.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X, Y)
rfe = RFE(lm, n_features_to_select=8, verbose=1)
rfe = rfe.fit(X, Y)

Fitting estimator with 11 features.
Fitting estimator with 10 features.
Fitting estimator with 9 features.

The fitted RFE object has the support_ and ranking_ attributes.
support_ is a boolean array: it is True if a variable is included in the top
8 and False otherwise.
ranking_ contains the feature ranking, such that ranking_[i] corresponds
to the ranking position of the i-th feature.

list(zip(X.columns,rfe.support_,rfe.ranking_))

[('CompPrice', True, 1),
('Income', False, 3),
('Advertising', True, 1),
('Population', False, 4),
('Price', True, 1),
('Age', True, 1),
('Education', False, 2),
('ShelveLoc_Good', True, 1),
('ShelveLoc_Medium', True, 1),
('Urban_Yes', True, 1),
('US_Yes', True, 1)]

The best model for Sales (as given by the backward stepwise
regression) contains the following variables:

X.columns[rfe.support_]

Index(['CompPrice', 'Advertising', 'Price',
'Age', 'Urban_Yes', 'US_Yes', 'ShelveLoc_Good',
'ShelveLoc_Medium'], dtype='object')

Now, subset the X DataFrame to contain the desired variables only and
use it to fit the linear regression model.

X_final = X[X.columns[rfe.support_]]
lmbest = sm.OLS(Y, sm.add_constant(X_final)).fit()

Use the summary() function to find the table of parameter estimates, R2,
and adjusted R2.

lmbest.summary()

The output of the summary() function confirms that US and Urban are
not significant in explaining Sales. Thus, we remove them from the set
of independent variables.

X_final = X_final.drop(['US_Yes', 'Urban_Yes'], axis=1)
lmbest = sm.OLS(Y, sm.add_constant(X_final)).fit()
lmbest.summary()

Q31. After performing Recursive Feature Elimination, what is the
best model of Sales? How much variability is explained by this
model?

Ans. Sales ~ CompPrice + Advertising + Price + Age +
ShelveLoc_Good + ShelveLoc_Medium

84.6% of the variability is explained by the model, as given by the
adjusted R2 value.

Q32. Are all the explanatory variables in the best model for Sales
significant?

Ans. Since the p-values for the t-tests of significance of all the
regression coefficients are almost zero, we conclude that all the
explanatory variables in the best model are significant.

4.5.1 Regression Diagnostics of Best
Model
As depicted in the previous sections, again make the three plots to check
for the assumptions of linear regression.

By this time, we expect you to remember these graphs by heart! :P

For Sales ~ CompPrice + Advertising + Price + Age + ShelveLoc_Good
+ ShelveLoc_Medium

pred = lmbest.predict()
resid = Y - pred

1) Q-Q plot of residuals

stats.probplot(resid, dist="norm", plot=plt)
plt.title('Q-Q plot of residuals')
plt.show()

2) Fitted vs Actual Sales

sns.scatterplot(x=pred, y=Y, color='black')
plt.xlabel('Fitted Sales')
plt.ylabel('Sales')
plt.plot(Y, Y)
plt.title('Observed vs Fitted')
plt.show()

Remember the last Observed vs Fitted plot?
The above figure suggests that the best model is indeed performing
better than the model with two independent variables. The points in
this graph lie very close to the reference line, validating the better
performance.

3) Residuals vs Fitted Sales

sns.scatterplot(x=pred, y=resid, color='black')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.axhline(0)
plt.title('Residuals vs Fitted')
plt.show()

There is no visible pattern in the plot, thus, you can conclude that errors
appear to be homoscedastic.

4.6 Multicollinearity
Task 26: Diagnosing Multicollinearity

Multicollinearity refers to a situation in which two or more explanatory
(independent) variables in a multiple regression model are highly
linearly related.
The following are the consequences of multicollinearity:
● It is difficult to obtain estimates of the parameters in a model.
● If obtained, the estimates of regression coefficients have large
standard errors.
● Hence, you almost always get insignificant t-ratios of regression
coefficients.

VIF approach to detect Multicollinearity:

VIF or Variance Inflation Factor is defined as

VIF_j = 1 / (1 − R_j²)

where R_j² is the multiple correlation coefficient of the jth explanatory
variable on all the remaining explanatory variables.

In general, a VIF of 5 or 10 and above indicates a multicollinearity
problem.

In Python, you can use the variance_inflation_factor() function from
the statsmodels library to check for the presence of multicollinearity.
Import the function from statsmodels.stats.outliers_influence and then
use it as depicted below.

from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i)
                  for i in range(X.shape[1])]
    return vif

calc_vif(X_final)

The variable CompPrice is multicollinear with the remaining variables.
Thus, we drop it from the independent variables and again check for
multicollinearity.

X_final = X_final.drop('CompPrice', axis=1)
calc_vif(X_final)

Q33. According to the VIF >10 criteria, which variables are highly
correlated with the others?

Ans. Using VIF > 10, the variables do not appear to be highly collinear.
Thus, you can assume that there is no severe multicollinearity in the
data.

4.6.1 Final Model and Regression
Diagnostics
The final model is given by
Sales ~ Advertising + Price + Age + ShelveLoc_Good +
ShelveLoc_Medium

final_model = sm.OLS(Y, sm.add_constant(X_final)).fit()
final_model.summary()

We believe that you can make the diagnostic plots now. Thus, the code
has not been provided for it. However, the obtained plots have been
added for your reference.

4.7. Parallel Slopes Model
Task 27: Combine a continuous and a discrete explanatory
variable to predict Sales.

A regression having additive effects of a numerical and a categorical
predictor is termed a parallel slopes model. In parallel regression, the
lines of best fit, corresponding to distinct classes of the categorical
variable, have the same slope but different intercepts, i.e. there is only
one regression line, which is displaced up or down for each class of the
categorical predictor. Hence, the name parallel slopes.

A simple linear regression formula looks like y ~ x, where y is the name
of the response variable, and x is the name of the explanatory variable.
Here, you can simply extend this formula to include multiple
explanatory variables.

A parallel slopes model has the form y ~ x + z, where z is a categorical
explanatory variable and x is a numerical explanatory variable.

While performing the univariate analysis, it was observed that Sales is
significantly impacted by ShelveLoc. Thus, to illustrate, ShelveLoc has
been used as the categorical explanatory variable in the parallel slopes
analysis.

Three models have been fitted for comparison: two simple models, with
Price and ShelveLoc respectively as the independent variable, and a
multiple regression model containing both Price and ShelveLoc.

a) Simple model (Sales on Price)

The previously fitted model object of Sales on Price has been called with
the summary() method.

model.summary()

b) Simple model (Sales on ShelveLoc)

A simple model of Sales on ShelveLoc is fitted using the OLS()
function. Use one-hot encoding and add a constant term to the X
DataFrame before fitting the model.

X4 = pd.get_dummies(prodata['ShelveLoc'], drop_first=True)
modelS_Sh = sm.OLS(Y, sm.add_constant(X4)).fit()
modelS_Sh.summary()

Adjusted R2 for this model is 31.4%, which is greater than that of the
model with Price as the independent variable.

c) Combined model (Sales on Price and ShelveLoc)

X5 = pd.concat([prodata['Price'], X4], axis=1)
modelS_PSh = sm.OLS(Y, sm.add_constant(X5)).fit()
modelS_PSh.summary()

Q34. Which model has the highest adjusted R2?

Ans. The combined model has the highest adjusted R2. On adding
ShelveLoc to the model of Sales on Price, the adjusted R2 increases from
0.196 to 0.539.

Q35. Is ShelveLoc significant in the combined model?

Ans. Yes. Since the p-values of the ShelveLoc dummy variables are almost
zero in the combined model, ShelveLoc contributes significantly to
explaining the dependent variable Sales.

4.7.1. Diagnostics of the Parallel Slopes Model
Task 28: Visualise the residuals of the combined model and
check for normality.

pred_parallel = modelS_PSh.predict()
resid_parallel = Y - pred_parallel

resid_parallel.hist(color='white', edgecolor='black', grid=False)
plt.xlabel('Residuals of Parallel slopes model')
plt.ylabel('Count')
plt.title('Histogram of residuals')
plt.show()

A histogram is one way to check the normality of residuals. However,
the shape of a histogram often changes when the number of bins is
altered. Thus, a Q-Q plot is a more reliable method of checking the
normality of residuals.

stats.probplot(resid_parallel, dist="norm", plot=plt, fit=True)
plt.title('Normal Q-Q plot of residuals')
plt.show()

Q36. Do the residuals appear to be normally distributed?

Ans. Based on the histogram and Q-Q plot, the residuals appear to have a
normal distribution.
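A formal test can complement the plots. The sketch below applies the Shapiro-Wilk test from scipy.stats; the synthetic resid array is a stand-in for the book's resid_parallel.

```python
# Sketch: Shapiro-Wilk normality test on residuals (synthetic stand-in data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
resid = rng.normal(0, 1.96, size=400)  # stand-in for resid_parallel

stat, pval = stats.shapiro(resid)
print(f"W = {stat:.3f}, p-value = {pval:.3f}")
# A p-value above 0.05 would be consistent with normally distributed residuals.
```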

Task 29: Use a discrete variable to plot parallel lines of regression

In this task, the data points of Sales against Price have been plotted
along with the lines of best fit determined by the parallel slopes model.

To make this plot, use the hue parameter of the scatterplot() function to
color the points based on ShelveLoc.

fig, ax = plt.subplots()
sns.scatterplot(x='Price', y='Sales', hue='ShelveLoc', data=prodata, ax=ax)
ax.scatter(prodata['Price'], pred_parallel, marker='.', color='black')
plt.show()

The regression lines (black dots) are parallel to each other in this case.

Note: Using the lmplot() function directly with the hue parameter will
not give you the parallel slopes model.
sns.lmplot(x='Price', y='Sales', hue='ShelveLoc', data=prodata)
lmplot() will fit three separate regressions of Sales on Price, one for
each level of ShelveLoc, so both the slopes and the intercepts can differ.

Q37. Do the lines of best fit appear to be different across ShelveLoc?

Ans. Yes, the intercepts of the lines are different, though the slope is
the same.

4.8. Interactions
In regression, an interaction effect exists when the effect of an
independent variable on a dependent variable changes, depending on the
value(s) of one or more other independent variables. In a regression
equation, an interaction effect is represented as the product of two or
more independent variables.

For example, here is a typical regression equation without an interaction:

ŷ = β0 + β1X1 + β2X2

where ŷ is the predicted value of the dependent variable, X1 and X2 are
independent variables, and β0, β1, and β2 are regression coefficients.

And here is the same regression equation with an interaction:

ŷ = β0 + β1X1 + β2X2 + β3X1X2

Here, β3 is a regression coefficient and X1X2 is the interaction term.
Because the model includes the product X1X2, the effect of X1 on ŷ is
β1 + β3X2, i.e. it changes with the value of X2. The interaction between
X1 and X2 is called a two-way interaction because it is the interaction
between two independent variables.

Task 30: Use interactions to model response

from statsmodels.formula.api import ols

prodata1 = pd.get_dummies(prodata, drop_first=True)
int_model = ols(formula='Sales ~ Price + ShelveLoc_Good + ShelveLoc_Medium + Price:ShelveLoc_Good + Price:ShelveLoc_Medium', data=prodata1).fit()
int_model.summary()

Observe that the p-values corresponding to the coefficient estimates of
the interaction terms are greater than 0.05. Hence, you can conclude that
the interaction of Price and ShelveLoc does not play a significant role
in explaining Sales in this model.
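Beyond inspecting individual p-values, the additive and interaction models can be compared formally with a partial F-test via anova_lm. The sketch below uses synthetic data with hypothetical column names (price, good, sales), not the book's dataset.

```python
# Sketch: partial F-test comparing an additive model to one with an interaction.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(7)
df = pd.DataFrame({
    'price': rng.normal(100, 15, 300),
    'good': rng.integers(0, 2, 300),
})
# The true model here is additive, so the interaction should be insignificant
df['sales'] = 10 - 0.04 * df['price'] + 3 * df['good'] + rng.normal(0, 1, 300)

additive = smf.ols('sales ~ price + good', data=df).fit()
interact = smf.ols('sales ~ price * good', data=df).fit()  # adds price:good
table = anova_lm(additive, interact)
print(table)  # a large Pr(>F) suggests the interaction adds little
```

This is the same ANOVA-based model comparison used earlier in the book, applied to nested models with and without the interaction term.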

Q38. How much of the variation in Sales is explained by this model?
Is it better than the additive model?

Ans. The adjusted R2 is 0.539, which is almost the same as that of the
additive model, so the interaction terms add little explanatory power.

Conclusion
Throughout the course of this book, you have learnt some amazing
statistical techniques to analyze a dataset.
The key takeaways are as follows:

1. Univariate analysis to understand the distribution of variables
2. Bivariate analysis to find out the relationship between two variables
3. Visualization techniques and their interpretation
4. Fitting linear models and comparing them using adjusted R2, AIC, and ANOVA
5. Checking the diagnostics of a linear regression model
6. The parallel slopes model and fitting models with interaction terms

This brings you to the end of this book. We hope that you have learnt
something from it. We surely did while making it!

Congratulations on completing this book!
