Book 2.0 - Python
Why a second edition? Over the past year, while using the book to revise concepts for placement interviews, we noticed a few gaps. While those sections may have made sense to someone from our batch, they could cause difficulties for readers who are just starting out. Keeping this in mind, we reworked the pain points to make the book more coherent and comprehensive.
Unlike the previous version, which was hosted via Google Drive, this edition is distributed through Gumroad. This ensures that readers will have access to any future updates without re-registering.
For any queries or suggestions, feel free to reach out to us via mail or LinkedIn.
Stay safe!
Kanika Tayal
Anubhav Dubey
When we started working on our first project, we were lost and lacked direction. We reached out to our seniors and searched the web for ways to approach a project, both of which helped us immensely in getting started. However, we realized that the transition from knowledge based mostly in theory to research and application can be daunting if there has been no prior attempt.
Since epistemology is best left for the branch of philosophy, we, for now, deal only
with the application part.
This book is an attempt to provide a basic framework towards approaching the first
project in Data Analysis. The pages that follow are neither theoretically deep nor
exhaustive in methods; however, they are intended to bridge the theory and
applications, especially the statistical techniques.
Data Science is an ever-growing and ever-evolving field. Hence, any journey into
this wonderful realm can never fully cover all its aspects. However, it is important
to begin because like all journeys, this will be exhilarating, to say the least, and we
believe that this book will be a good starting point.
To great beginnings
Kanika Tayal
Anubhav Dubey
The warm-up section covers the basic concepts required for understanding the analysis. These are short definitions intended for revision. If you are not familiar with these concepts, we recommend reading up on them before proceeding further.
From our experience, these basic concepts are very important from
an interview perspective. You can know lots of advanced stuff but
command over basics is a must.
2. Partition Values
These are the values that divide the data into a number of equal parts.
Commonly used partition values are Quartiles, Deciles, and Percentiles.
a) Quartiles – Three points that divide the data into 4 equal parts.
b) Deciles – Nine points that divide the data into 10 equal parts.
c) Percentiles – Ninety-nine points that divide the data into 100 equal
parts.
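As a quick illustration, a minimal sketch of computing these partition values with NumPy's percentile function (the array values below is hypothetical):

import numpy as np

values = np.array([2, 4, 4, 5, 7, 8, 9, 11, 12, 13, 15, 18])

# Quartiles: the 25th, 50th and 75th percentiles
print(np.percentile(values, [25, 50, 75]))

# Deciles: the 10th, 20th, ..., 90th percentiles
print(np.percentile(values, np.arange(10, 100, 10)))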
5. Normal Distribution
A log-normal (or lognormal) distribution is a continuous
probability distribution of a random variable whose logarithm is
normally distributed. Thus, if the random variable X
is log-normally distributed, then Y = ln(X) has a normal distribution.
Degrees of freedom (d.f./ d.o.f.) – These are the number of values that
can independently vary in statistical analysis, that is, the number of
values left after accounting for every constraint involved in an analysis.
i.e., if Z1, ..., Zk are independent standard normal random variables, then the sum of their squares

$Q = \sum_{i=1}^{k} Z_i^2 \sim \chi^2$ distribution with $k$ d.o.f.
The ratio of an independent standard normal variate and the square root
of a chi-square variate divided by its degrees of freedom follows
t-distribution.
i.e., if Z has a standard normal distribution, V has a $\chi^2$ distribution with k degrees of freedom, and Z and V are independent, then

$t = \frac{Z}{\sqrt{V/k}} \sim t$ distribution with $k$ d.o.f.
The model can be written in matrix form as Y = Xβ + ε, where Y (n×1) is the vector of values of the response variable, X (n×(k+1)) is the design matrix, β ((k+1)×1) is the vector of model parameters, and ε (n×1) is the vector of random errors.
Assumptions:
There are four assumptions associated with a linear regression model: linearity of the relationship between the response and the predictors, independence of the errors, homoscedasticity (constant error variance), and normality of the errors.
The least-squares estimate of the parameter vector is

$\hat{\beta} = (X'X)^{-1} X'Y$

Let $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ be the estimated parameters. A hypothesis test for the significance of an independent variable is to be performed. Then,

Test Statistic:

$t = \frac{\hat{\beta}_i}{se(\hat{\beta}_i)} \sim t(n-(k+1))$

where $se(\hat{\beta}_i)$ is the standard error of the estimated parameter.
Confidence Interval:
A 100(1 − α)% confidence interval for $\beta_i$ is $\hat{\beta}_i \pm t_{\alpha/2,\, n-(k+1)} \cdot se(\hat{\beta}_i)$.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns
import statsmodels.api as sm
import sklearn.metrics as sk
After downloading the dataset (download link), you can use the read_csv() function of Pandas to read a CSV file. Make sure that the file is in the same directory as your script; otherwise, you will have to specify the complete path.
prodata = pd.read_csv("dataset.csv")
prodata.head()
prodata.info()
Ans. Further analysis is required to know more about the levels of the
factor variables.
You can use the include parameter to specifically display the summary of
a character (object) variable.
prodata.describe()
Q3. Is there any evidence that the distributions of Sales and Price
are somewhat symmetric?
Ans. As seen above, the mean and median values for these two variables
are almost equivalent, indicating that their distributions are somewhat
symmetric.
prodata.describe(include = 'object')
Notice that the variable ShelveLoc has 3 levels. On the other hand, US
and Urban have 2 levels each. To view the categories of each of these
variables you can use the unique() function as shown below.
prodata['ShelveLoc'].unique()
The idea is to make a special table where each data value is split into a
"stem" (the first digit or digits) and a "leaf" (usually the last digit). For
example, 15 is split into 1 (stem) and 5 (leaf).
To create stem and leaf plots in Python, you can use the stem_graphic()
function from the stemgraphic package. Additionally, the scale argument
can be used to alter the number of stems in a plot.
import stemgraphic
stemgraphic.stem_graphic(prodata['Sales'], scale=1)
stemgraphic.stem_graphic(prodata['Advertising'], scale=2)
The hist() function from Pandas allows you to plot the histogram for a
particular variable of a DataFrame. The aesthetics of the plot can be
edited by color, edgecolor, and grid parameters of the hist() function.
prodata.hist(column = 'Advertising', ax = axes[1,0], color = 'white', edgecolor = 'black', grid = False)
Ans. For each of the given variables, the following observations can be made:
● Sales: the distribution appears symmetric and leptokurtic.
● Price: the distribution again seems symmetric and leptokurtic.
● Advertising: the distribution is right-skewed and leptokurtic.
● Income: the distribution seems roughly uniform.
The y-coordinates of a Q-Q plot are the sample quantiles and the
x-coordinates are the theoretical quantiles from the normal distribution.
It is a method to check for normality in a dataset through visualization.
A reference line that passes through the first and third quartiles of the sample and of the theoretical normal distribution is also plotted. This line is usually called the Q-Q line.
To create multiple plots in a single window, you can also use the subplot() function as illustrated below. It adds a subplot to the current figure at the specified grid position. It is similar to the subplots() function; however, unlike subplots(), it adds one subplot at a time.
i) Sales
plt.subplot(2,2,1)
stats.probplot(prodata['Sales'], dist="norm", plot=plt, fit = True)
plt.title('Normal Q-Q plot of Sales')
plt.subplot(2,2,2)
stats.probplot(np.log(prodata['Sales']), dist="norm", plot=plt, fit = True)
plt.title('Normal Q-Q plot of log(Sales)')
ii) Price
plt.subplot(2,2,3)
stats.probplot(prodata['Price'], dist="norm", plot=plt, fit = True)
plt.title('Normal Q-Q plot of Price')
plt.subplot(2,2,4)
stats.probplot(np.log(prodata['Price']), dist="norm", plot=plt, fit = True)
plt.title('Normal Q-Q plot of log(Price)')
Ans. As evidenced by the plots above, the sample quantiles fit the reference line better for the original variables than for their log transforms. Hence, both Sales and Price seem to follow a normal distribution.
Hypothesis:
Suppose you want to test the hypothesis that a random sample xi (i =
1,2,..,n) has been drawn from a normal population with a specified mean
μ0. Then,
Test Statistic:

$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} \sim t(n-1)$
Task 5: (i) Apply t-test for the null hypothesis that the true mean
of Sales is 7.
x = prodata['Sales']
np.mean(x)
7.496325
t,p = stats.ttest_1samp(x,7)
print('t-Statistic = %.4f, p-value = %.4f' % (t, p))
(ll,ul) = stats.t.interval(alpha=0.95, df=len(x)-1, loc=np.mean(x), scale=stats.sem(x))
print("CI for Sales = (%.4f,%.4f)" % (ll,ul))
Q6. What is the estimated population mean and its 95% confidence
interval?
Ans. The estimated population mean is 7.496 and the 95% CI is (7.219,
7.774).
Ans. Since the p-value is less than 0.05, reject the null hypothesis and
conclude that the true Sales mean is not equal to 7.
(ii) Apply t-test for the null hypothesis that the true mean of
Price is 115.
y = prodata['Price']
np.mean(y)
115.795
t,p = stats.ttest_1samp(y,115)
print('t-Statistic = %.4f, p-value = %.4f' % (t, p))
Can you make a conclusion about the t-test for the true mean of Price
now?
To subset the data, use the indexing operator [] along with the equality comparison operator (==).
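A minimal sketch of this subsetting, assuming boolean indexing on the US column of prodata:

proUSno = prodata[prodata['US'] == 'No']
proUSyes = prodata[prodata['US'] == 'Yes']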
Now, two new DataFrames, proUSno and proUSyes have been created
for the non-US and US data respectively.
proUSno.describe()
Before performing the t-test, it is good to have some intuition about the
distribution of the Sales and Price inside and outside the US. Therefore, a
side-by-side boxplot is used to compare the distribution of a continuous
variable as subsetted by a categorical variable.
The boxplot() function from the seaborn library can be used in Python to
create side-by-side boxplots as illustrated in the code below:
plt.figure(figsize = (10,5))
plt.subplot(1,2,1)
sns.boxplot(x = prodata['US'], y = prodata['Sales'])
plt.subplot(1,2,2)
sns.boxplot(x = prodata['US'], y = prodata['Price'])
plt.show()
Note: You can also use the boxplot function from Pandas to make a
similar boxplot.
prodata.boxplot(['Sales'],by='US')
prodata.boxplot(['Price'],by='US')
Q10. Is there any difference between Price, Sales inside and outside
the US?
Two sample t-tests are used to determine if two population means are
equal. Here, conducting a t-test for unpaired data is appropriate.
Let us suppose there are two independent samples X1i(i = 1 to n1) and
X2i(i = 1 to n2) from a normal population.
Test hypothesis:
The null hypothesis, H0: μ1 = μ2 vs.
Alternate hypothesis, H1: μ1 ≠ μ2
Test Statistic:
1) Unequal variances:

$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$ ~ t distribution with $d.f. = \dfrac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}}$
This test is also known as Welch’s t-test.
2) Equal variances:

$t = \frac{\bar{X}_1 - \bar{X}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t(n_1 + n_2 - 2)$

where $s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}$, n1 and n2 are the sample sizes, $\bar{X}_1$ and $\bar{X}_2$ are the sample means, and $s_1^2$ and $s_2^2$ are the sample variances.
Test Criteria:
If |t| > t(α/2), reject the null hypothesis of no difference (that is, of equality of the means of the two independent samples). Otherwise, do not reject H0.
Additionally, you can also specify the alternative hypothesis using the
‘alternative’ argument.
ttest_ind() also returns the value of the test statistic along with the
p-value.
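A minimal sketch of the call for Sales, mirroring the Price test shown below:

t,p = stats.ttest_ind(proUSno['Sales'], proUSyes['Sales'], equal_var = False)
print("t-statistic for independent t-test on Sales = %.4f and p-value = %.4f" % (t,p))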
For Sales, since the p-value is less than 0.05, reject the null hypothesis of
no difference and conclude that sales differ significantly inside and
outside the US.
t,p = stats.ttest_ind(proUSno['Price'], proUSyes['Price'], equal_var = False)
print("t-statistic for independent t-test on Price = %.4f and p-value = %.4f" % (t,p))
For Price, since the p-value is greater than 0.05, do not reject the
hypothesis and conclude that prices do not differ significantly inside and
outside the US in the given sample.
plt.figure(figsize = (10,5))
plt.subplot(1,2,1)
sns.boxplot(x = prodata['ShelveLoc'], y = prodata['Sales'])
plt.subplot(1,2,2)
sns.boxplot(x = prodata['ShelveLoc'], y = prodata['Price'])
plt.show()
Let yij be the jth observation in the ith class where i = 1,2,...,k and j =
1,2,...,ni. Here, the homogeneity of population means for the k classes is
to be tested.
The hypothesis is
Null hypothesis, H0: μ1 = μ2 = … = μk = μ vs.
Alternate hypothesis, H1: μi ≠ μj for at least one pair i ≠ j
Test Statistic:

$F = \frac{SSB/(k-1)}{SSW/(n-k)} \sim F(k-1,\, n-k)$

where SSB is the between-class sum of squares, SSW is the within-class sum of squares, and n is the total number of observations.
Test Criteria:
Reject H0 at α% level of significance if p-value < α. Otherwise, do not
reject H0.
F,p = stats.f_oneway(
prodata['Sales'][prodata['ShelveLoc']=='Bad'],
prodata['Sales'][prodata['ShelveLoc']=='Good'],
prodata['Sales'][prodata['ShelveLoc']=='Medium']
)
You can see that the p-value for the F-statistic is almost zero (i.e. p-value
< 0.05). Thus, reject the null hypothesis and conclude that the mean
population Sales are not homogenous across levels of ShelveLoc.
Make the scatter plots of all the quantitative variables in the Carseats
dataset using the scatter_matrix() function from the plotting module of
Pandas.
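A minimal sketch of the call (the figsize value is only a suggestion):

pd.plotting.scatter_matrix(prodata, figsize = (10, 10))
plt.show()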
A scatter plot uses dots to represent values for two different numeric
variables. The position of each dot on the horizontal and vertical axis
indicates values for an individual data point. Scatter plots are used to
observe relationships between variables.
A scatter plot along with a regression line can be plotted by using the
regplot() function which is defined in the seaborn library. As an
argument, you have to pass the x and y variables along with the optional
data argument.
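A minimal sketch using regplot() for Sales against Price:

sns.regplot(x = 'Price', y = 'Sales', data = prodata, marker = '.')
plt.show()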
Ans. The relationship between Price and Sales is negative. That is, at a
higher price, sales tend to be lower. The line of best fit also has a
negative slope in the above plot.
Task 12: Perform bivariate correlation analysis for the pairs for
which the visualization was done.
Test Hypothesis:
Null hypothesis, H0: ρ = 0 (no linear correlation) vs. Alternate hypothesis, H1: ρ ≠ 0
Test Statistic:

$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t(n-2)$
where n is the number of observations in each variable.
Test Criteria:
Reject H0 at α% level of significance if p-value < α. Otherwise, do not reject H0.
In Python, the corr() function from the Pandas library calculates the
correlation coefficient. You can compute different types of correlation
coefficients by specifying the “method” argument of the corr() function.
By default, it calculates the Pearson correlation coefficient.
print(prodata[['Sales','Price']].corr().round(5))
Sales Price
Sales 1.00000 -0.44495
Price -0.44495 1.00000
As shown above by the scatter plot between Sales and Price, the correlation coefficient also suggests a moderate negative correlation. As a next step, to test the significance of the population correlation coefficient, the pearsonr() function is applied to Sales and Price.
c,p = stats.pearsonr(prodata['Sales'], prodata['Price'])
print('p-value = %f' % p)
p-value = 0.000000
Ans. The correlation is certainly different from zero since the p-value is
almost zero. Thus, there is a significant correlation between Sales and
Price.
c,p = stats.pearsonr(prodata['Sales'], prodata['Advertising'])
print('correlation coefficient between Sales and Advertising = %f \n p-value = %f' % (c,p))
In this case, the p-value below 0.05 supports the observation about some
positive correlation between Advertising and Sales.
$R^2 = 1 - \frac{SSE}{TSS}$

where $TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$ and $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$

$\text{adjusted } R^2 = \bar{R}^2 = 1 - \frac{SSE/(n-(k+1))}{TSS/(n-1)}$
Note: SSE has n-(k+1) degrees of freedom since 1 intercept and k slope
coefficients have been estimated in the model.
Because of the convenience of the summary() method, this book uses the statsmodels library rather than sklearn for model fitting.
X = prodata['Price']
Y = prodata['Sales']
X = sm.add_constant(X)
model = sm.OLS(Y, X).fit()
print(model.summary())
Q15. What is the best predictive equation for Sales given Price?
What is its interpretation?
Ans. The adjusted R2 of the above model is 0.196. Hence, 19.6% of the
variability in Sales is explained by Price.
Task 14: Plot the actual values along with the fitted regression
line
For the second plot, a quadratic fit seems to be more appropriate than a
linear fit whereas in the third and fourth plot, due to the influence of a
single point, a much better model could not be fitted to the datasets.
The fitted values of Sales can be calculated using the predict() function.
fig1,ax1 = plt.subplots()
sns.regplot(x = 'Price', y = 'Sales', data = prodata, ax = ax1, ci = None, marker = '.')
ax1.plot([prodata['Price'], prodata['Price']], [prodata['Sales'], model.predict()], linestyle='--', color = 'black')
plt.show()
Ans. They should be identical, that is, they should fall on the 1:1 line. Of
course, they are not equal because of the presence of errors in model
fitting. In any case, they should be symmetric about a 1:1 line (i.e. the
length of the residual segments should be approximately equal above and
below the line) throughout the range.
Task 15: Plot Residuals vs. Predicted with 3,2,1 sigma limits (to
test the presence of heteroscedasticity).
Consider the examples shown above. Apart from the first one, all the
other plots have some pattern and hence show signs of the presence of
heteroscedasticity or non-constant variance of the error terms.
Therefore, a model showing these kinds of outputs on the Residuals vs.
Fitted values plot violates the assumption of homoscedasticity.
Y_pred = model.predict(X)
residuals = Y - Y_pred
sd_red=np.std(residuals)
a=[-3,-2,-1,0,1,2,3]
b=['r','g','b','k','b','g','r']
sns.scatterplot(x = Y_pred, y = residuals)
plt.xlabel('Predicted Sales')
plt.ylabel('Residuals')
for i,j in zip(a,b):
    plt.axhline(i*sd_red, color=j)
The std() function from the NumPy library computes the standard deviation of the array of residuals. To compute the residuals, subtract the fitted values of Sales from the actual Sales.
The color argument of axhline() specifies the color of the plotted line.
Ans. All the residuals should ideally fall on the horizontal line with y-intercept 0. However, they do not fall on that line because of the presence of error. In any case, they should be symmetric about this line throughout the range and have the same degree of spread.
There is no visible pattern in the plot and the spread seems to be random.
This confirms that there is no heteroscedasticity in the data.
Note: The points on the upper and lower half of the plot are almost
equal. You can also verify it yourself.
Task 16: Visualise the distribution of residuals (To test for the
normality of error terms).
The assumption of the normality of the error terms should reflect in their
visualization. Therefore, a histogram or a stem-and-leaf plot of residuals
should show bars that can be approximated to a bell-shaped curve.
stemgraphic.stem_graphic(residuals, scale=1)
plt.show()
Along with the histogram, the QQ Plot of the residuals should show most
of the points lying on or very close to the QQ line throughout the range.
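A minimal sketch using probplot() on the residuals computed earlier:

stats.probplot(residuals, dist="norm", plot=plt, fit = True)
plt.title('Normal Q-Q plot of residuals')
plt.show()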
Q20. Do the residuals follow the normal distribution? Where are the
discrepancies, if any?
Ans. For the most part, the residuals follow the theoretical normal distribution well: they lie very close to the normal line. However, there is a slight deviation towards the left end. This deviation does not seem significant for the analysis and can be ignored.
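A minimal sketch of the RMSE computation with NumPy, assuming Y and Y_pred from the earlier model:

rmse = np.sqrt(np.mean((Y - Y_pred)**2))
print('RMSE = ' + str(rmse.round(5)))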
RMSE = 2.52599
You would have noticed before that the summary() function prints the
coefficient of determination (R2) value among other things. Another
method of extracting it is by using the r2_score() function from the
metrics module of scikit learn. This function takes two parameters y_true
and y_pred.
r2 = sk.r2_score(Y, Y_pred)
print('R2' + ' = ' + str(r2.round(5)))
R2 = 0.19798
In this section, the previously created model has been used to predict the
dependent variable using two different methods.
Task 19: Predict the values for Sales for the mean value of Price.
mean_p = np.mean(X['Price'])
print(mean_p)
115.795
mp = model.params
print(mp)
const 13.641915
Price -0.053073
dtype: float64
Now once you know the coefficients, you can simply replace them in the
model equation to find the fitted value of the dependent variable.
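A minimal sketch of this substitution, using the extracted coefficients and the mean Price computed earlier (the name predsales1 is only illustrative):

predsales1 = mp['const'] + mp['Price'] * mean_p
print(predsales1.round(5))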
7.49633
p = [1, 115.795]
predsales2 = model.predict(p)
print(predsales2.round(5))
[7.49633]
You will have to add a constant term manually while using the predict() function with a model fitted using the statsmodels library.
The output returns the fitted value of the dependent variable.
You can observe that the fitted value comes out to be the same from both
methods.
Then,

$P\left(-t_{\alpha/2} < \frac{\bar{X} - \mu}{s/\sqrt{n}} < t_{\alpha/2}\right) = 1 - \alpha$

which gives the confidence interval

$\left(\bar{X} - t_{\alpha/2}\,\frac{s}{\sqrt{n}},\; \bar{X} + t_{\alpha/2}\,\frac{s}{\sqrt{n}}\right)$
result = model.get_prediction().summary_frame()
result.head()
In this section, you will find out that the prediction interval is wider than
the confidence interval by visualization. For doing so, plot the lines of
the confidence interval and the prediction interval along with the original
data points to compare.
A simple regplot() command along with the plot() command will help
you in creating this graph as shown below.
fig1,ax1 = plt.subplots()
sns.regplot(x = 'Price', y = 'Sales', data = prodata, ax = ax1, ci = None, marker = '.', color = 'black')
ax1.plot(prodata['Price'], result['mean_ci_lower'], color = 'red')
ax1.plot(prodata['Price'], result['mean_ci_upper'], color = 'red')
ax1.plot(prodata['Price'], result['obs_ci_lower'], color = 'b')
ax1.plot(prodata['Price'], result['obs_ci_upper'], color = 'b')
You can see that the prediction interval (blue line) is much wider than the
confidence interval (red line).
To fit a multiple linear regression model, you have to figure out the possible explanatory variables that can impact the response variable. To do so, again start by analyzing the correlations between the variables of the Carseats dataset.
To find the pairwise simple correlations, you can use the corr() function
from the Pandas library.
It returns a matrix of Pearson’s correlation coefficient between the
variables.
prodata.corr(numeric_only = True)
You can observe that the diagonal values of the matrix are 1. This is
because the correlation of a variable with itself is 1. The off-diagonal
entries range between -1 and +1.
Ans.
● The sales variable has a moderate negative correlation with Price, a
low positive correlation with Advertising and Income, and a very
low positive correlation with CompPrice.
This is the most important step of the analysis. You will almost always work with more than two variables when modeling, so it is crucial to understand the steps that follow.
In this section, you will compare these models to observe if there is any
improvement in terms of their capacity to explain the dependent variable.
Begin by fitting the required regression models and then proceed to find
more information about each of them using the summary() function.
a) Null Model
X1 = np.ones(400)
null = sm.OLS(Y, X1).fit()
null.summary()
You have already fitted a simple linear regression model for Sales on Price in the previous section of the book. Hence, you can use the summary() function on the previously created model.
model.summary()
You also must have noticed that there exists some correlation between
Sales and Advertising as well. Hence, to compare, you can fit another
simple linear regression model with Sales as the dependent variable and
Advertising as the independent variable.
X2 = prodata[['Advertising']]
X2 = sm.add_constant(X2)
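A minimal sketch of the fitting call, using the model name modelS_Ad that appears later:

modelS_Ad = sm.OLS(Y, X2).fit()
modelS_Ad.summary()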
Now fit a Multiple Linear Regression Model using both Price and
Advertising as the independent variables. It seems logical that the
advertising budget can influence sales.
X3 = prodata[['Price', 'Advertising']]
X3 = sm.add_constant(X3)
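A minimal sketch of the fitting call, using the model name model_mult that appears later:

model_mult = sm.OLS(Y, X3).fit()
model_mult.summary()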
To extract the adjusted R2 value, use the rsquared_adj attribute of the fitted statsmodels results object.
print(null.rsquared_adj.round(5))
print(model.rsquared_adj.round(5))
print(modelS_Ad.rsquared_adj.round(5))
print(model_mult.rsquared_adj.round(5))
0.0
0.19597
0.0703
0.27824
To apply AIC in practice, start with a set of candidate models and then
find the models’ corresponding AIC values. Select, from among the
candidate models, the model that minimizes the information loss.
For the four candidate models, you can find the AIC as shown below.
print(null.aic.round(5))
print(model.aic.round(5))
print(modelS_Ad.aic.round(5))
print(model_mult.aic.round(5))
1966.70562
1880.45635
1938.54287
1838.27177
Ans. Observe that the multiple linear regression model of Sales on both
Price and Advertising has the lowest AIC among the set of candidate
models and hence it is the best.
Let us say that two models are to be compared: a full model and a reduced model, where the full model has more variables than the reduced model.
The null hypothesis is that the reduced model is adequate, i.e. the extra variables in the full model add no explanatory value; the alternative is that the full model is better than the reduced model.
Test Statistic:

$F = \frac{(SSE_{reduced} - SSE_{full})/(p-k)}{SSE_{full}/(n-p-1)} \sim F(p-k,\; n-p-1)$

where p and k are the numbers of predictors in the full and reduced models respectively.
Test Criteria:
Reject H0 at α% level of significance if p-value < α. Otherwise, do not
reject H0.
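A minimal sketch of this comparison, using anova_lm() from statsmodels to compare the simple model of Sales on Price with the multiple model:

from statsmodels.stats.anova import anova_lm
print(anova_lm(model, model_mult))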
The p-value < 0.05 for the ANOVA test. Hence, we reject the null
hypothesis and conclude that the multiple regression model of Sales is
better than the simple model of Sales on Price.
Ans. The simple model has one higher degree of freedom as it has one
less independent variable in the model. A 10.5% reduction is seen in the
Residual Sum of Squares as compared to the simple model.
The p-value < 0.05 for the ANOVA test. Hence, reject the null
hypothesis and conclude that the multiple regression model of Sales is
better than the simple model of Sales on Advertising.
Task 24: Analyse the regression diagnostics for the best model.
pred_mult = model_mult.predict(X3)
resid_mult = Y - pred_mult
It seems safe to assume that the residuals are normally distributed as the
points are lying close to the Q-Q Line.
Points seem to have a lot of deviation from the reference line. Thus, the
model is not performing up to the mark.
sns.scatterplot(x = pred_mult, y = resid_mult, color = 'black')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.axhline(0)
plt.title('Residuals vs Fitted')
plt.show()
The points appear to be scattered randomly. You can see that there is no
visible pattern in the plot.
The dataset, however, has other variables that might be useful in fitting a
linear regression which have not been explored yet.
While dealing with datasets, it is important to find the variables that are
most important in predicting the outcome variable. It is necessary due to
the computational costs as well as the bias-variance trade-off.
In backward stepwise regression, you start with a full model, that is, a model that uses all the independent variables, say k, to predict the dependent variable. Then, at each step, a variable is removed based on some criterion (here, the algorithm uses AIC). The model is fitted again using the remaining k-1 independent variables, and the procedure is repeated until removing any further variable no longer improves the criterion.
1) Data Preparation
To prepare the data, you will have to convert the categorical variables to
integer variables. This is done through one-hot encoding.
When a categorical variable does not have a natural ordering, then using
integer encoding (labeling the categories as 1,2,3, and so on) is not
sufficient. In fact, using this encoding may result in poor performance or
unexpected results.
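A minimal sketch of this encoding with get_dummies() from Pandas, assuming the resulting DataFrame is the 'data' object used below:

data = pd.get_dummies(prodata, drop_first = True)
data.head()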
Y = data['Sales']
X = data.drop('Sales', axis=1)
You can use the drop() function from Pandas to keep all the variables of
‘data’ except the ‘Sales’ variable in X.
You will also need the LinearRegression() function for fitting a linear
regression model. This will be used as a parameter within the RFE
function.
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

lm = LinearRegression()
lm.fit(X, Y)
rfe = RFE(lm, n_features_to_select=8, verbose=1)
rfe = rfe.fit(X, Y)
list(zip(X.columns,rfe.support_,rfe.ranking_))
X.columns[rfe.support_]
Now, subset the X DataFrame to contain the desired variables only and
use it to fit the linear regression model.
X_final = X[X.columns[rfe.support_]]
lmbest = sm.OLS(Y, sm.add_constant(X_final)).fit()
Use the summary() function to find the table of parameter estimates, R2,
and adjusted R2.
lmbest.summary()
X_final = X_final.drop(['US_Yes','Urban_Yes'], axis=1)
Q32. Are all the explanatory variables in the best model for Sales
significant?
Ans. Since the p-values for the t-tests of significance of all the regression coefficients are almost zero, we conclude that all the explanatory variables in the best model are significant.
pred = lmbest.predict()
resid = Y - pred
VIF, or Variance Inflation Factor, is defined as $VIF_j = \frac{1}{1 - R_j^2}$, where $R_j^2$ is the coefficient of determination obtained by regressing the j-th explanatory variable on all the other explanatory variables.
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return(vif)

calc_vif(X_final)
Q33. According to the VIF > 10 criterion, which variables are highly correlated with the others?
final_model = sm.OLS(Y, sm.add_constant(X_final)).fit()
final_model.summary()
The previously fitted model object of Sales on Price has been called with
the summary() method.
model.summary()
X4 = pd.get_dummies(prodata['ShelveLoc'], drop_first = True)
modelS_Sh = sm.OLS(Y, sm.add_constant(X4)).fit()
modelS_Sh.summary()
Adjusted R2 for this model is 31.4%, which is greater than that of the model with Price as the independent variable.
X5 = pd.concat([prodata['Price'],X4],axis=1)
modelS_PSh=sm.OLS(Y, sm.add_constant(X5)).fit()
modelS_PSh.summary()
Ans. Yes. Since the p-value of ShelveLoc is almost zero in the combined model, it contributes significantly to explaining the dependent variable Sales.
pred_parallel = modelS_PSh.predict()
resid_parallel = Y - pred_parallel
stats.probplot(resid_parallel, dist="norm", plot=plt, fit = True)
plt.title('Normal Q-Q plot of residuals')
plt.show()
Ans. Based on the histogram and Q-Q plot, the residuals appear to have a
normal distribution.
In this task, the data points of Sales and Price have been plotted along with the lines of best fit as determined by the parallel slopes model. To make this plot, use the hue parameter of the scatterplot() function to color the points based on ShelveLoc, as sketched below.
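A minimal sketch consistent with this description (colored points with the parallel fitted values overlaid as black dots):

sns.scatterplot(x = 'Price', y = 'Sales', hue = 'ShelveLoc', data = prodata)
plt.scatter(prodata['Price'], pred_parallel, color = 'black', s = 10)
plt.show()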
The regression lines (black dots) are parallel to each other in this case.
Note: Using lmplot() function directly with the hue parameter will not
give you the parallel slopes model.
sns.lmplot(x = 'Price', y = 'Sales', hue = 'ShelveLoc', data = prodata)
The lmplot function will fit 3 different regressions of Sales on Price as
subsetted by the ShelveLoc variable.
Ans. Yes, the intercepts of the lines are different though the slope is the
same.
Ans. Adjusted R2 = 0.539, which is almost the same as for the additive model.
This brings you to the end of this book. We hope that you have learnt
something from it. We surely did while making it!