Assumption of Linear Regression

The document loads advertising data and fits a linear regression model to predict Sales from TV and Radio advertising spend. It checks that the linear regression assumptions of linearity, no multicollinearity, normality of residuals, homoscedasticity, and no autocorrelation are met by plotting the data and residuals. TV and Radio are found to have a linear relationship with Sales, while Newspaper does not and is dropped. The model achieves a training score of about 0.91.

Uploaded by

Kagade Ajinkya
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
17 views

Assumption of Linear Regression

The document loads advertising data and performs linear regression to predict sales based on TV and radio data. It checks that the linear regression assumptions of linearity, no multicollinearity, normality of residuals, homoscedasticity, and no autocorrelation are met by plotting the data and residuals. TV and radio are found to have a linear relationship with sales while newspaper does not and is dropped. The model achieves a training score of 0.90.

Uploaded by

Kagade Ajinkya
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 6

In [1]: import pandas as pd
        import matplotlib.pyplot as plt
        from sklearn.model_selection import train_test_split
        from sklearn.linear_model import LinearRegression
        import numpy as np

In [2]: # loading dataset
        df = pd.read_csv('advertising.csv')

In [3]: df.head()

Out[3]:       TV  Radio  Newspaper  Sales
        0  230.1   37.8       69.2   22.1
        1   44.5   39.3       45.1   10.4
        2   17.2   45.9       69.3    9.3
        3  151.5   41.3       58.5   18.5
        4  180.8   10.8       58.4   12.9

In [4]: feature = list(df.describe().columns)
        feature.remove('Sales')
        print(feature)

['TV', 'Radio', 'Newspaper']

Assumption 1:
Linearity: Linear Regression assumes a linear relationship between the independent variables and the
dependent variable. It assumes that the relationship can be represented by a straight line, allowing us to
estimate the impact of each independent variable on the outcome.

In [5]: for i in feature:
            fig = plt.figure(figsize=(9, 5))
            ax = fig.gca()
            plt.scatter(df[i], df['Sales'])
            plt.title('Linear relationship between ' + i + ' and Sales')
            plt.xlabel(i)
            plt.ylabel('Sales')
TV and Radio show a linear relationship with Sales; Newspaper does not show any clear relationship with Sales.
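To back this visual check with a number, one can also look at each channel's correlation with Sales. The cell below is a minimal sketch and is not part of the original notebook:

In [ ]: # hypothetical follow-up cell: Pearson correlation of each channel with Sales
        print(df[['TV', 'Radio', 'Newspaper']].corrwith(df['Sales']))

A correlation near zero for Newspaper would support dropping it in the next step.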
In [6]: df = df.drop(columns=['Newspaper'])
        df.head()

Out[6]:       TV  Radio  Sales
        0  230.1   37.8   22.1
        1   44.5   39.3   10.4
        2   17.2   45.9    9.3
        3  151.5   41.3   18.5
        4  180.8   10.8   12.9

Assumption 2:
No Multicollinearity: Linear Regression assumes that there is little or no multicollinearity among the
independent variables. Multicollinearity occurs when the independent variables are highly correlated with
each other, which can lead to unstable coefficient estimates and difficulty in interpreting the model.

In [11]: import seaborn as sns
         sns.heatmap(df.corr(), annot=True)

Out[11]: <AxesSubplot:>

A correlation coefficient between independent variables in the 0.9 to 1.0 range indicates very highly correlated variables. To avoid highly correlated variables in the prediction, we can use feature engineering or drop one of them.
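A heatmap only shows pairwise correlations; the variance inflation factor (VIF) also catches multicollinearity involving several variables at once. The cell below is a sketch that assumes statsmodels is installed; it is not part of the original notebook:

In [ ]: import statsmodels.api as sm
        from statsmodels.stats.outliers_influence import variance_inflation_factor

        # add an intercept column so each VIF is computed against a fitted constant
        X_vif = sm.add_constant(df[['TV', 'Radio']])
        for i, col in enumerate(X_vif.columns):
            print(col, variance_inflation_factor(X_vif.values, i))

A VIF below roughly 5 is usually read as little multicollinearity.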

In [25]: X = df[['TV', 'Radio']]
         y = df['Sales']

In [42]: from sklearn.preprocessing import StandardScaler
         sc = StandardScaler()
         X = sc.fit_transform(X)

In [43]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [44]: model = LinearRegression()

In [45]: model.fit(X_train,y_train)

Out[45]: LinearRegression()

In [46]: model.score(X_train, y_train)

Out[46]: 0.906590009997456
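The notebook reports only the training R²; a quick sanity check would be to score the held-out test split as well. This extra cell is a sketch, and its value is not shown in the original:

In [ ]: # R^2 on the 20% test split held out earlier
        print(model.score(X_test, y_test))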

In [68]: y_pred = model.predict(X_train)

In [69]: y_pred

Out[69]: array([12.12910171,  9.15580944, 15.03241037, 16.30334926, 17.14716657,
        13.30141363,  3.7442173 , 12.23433166, 15.75030475,  8.72053764,
10.63080662, 19.47315218, 18.40761899, 15.28187734, 9.97767471,
8.18442538, 21.51466181, 14.16377258, 16.31913548, 8.76173868,
15.3232895 , 12.43110582, 13.7323925 , 14.17416248, 18.32683299,
19.18210765, 20.26787047, 17.41350364, 9.2948777 , 11.7162453 ,
19.75212705, 9.88650856, 20.77707152, 23.23212847, 10.12298739,
17.1702549 , 19.57200672, 18.45956026, 16.89979032, 18.48460831,
17.06097604, 8.87711452, 9.92151758, 5.37423437, 3.61268846,
16.62992832, 12.67714289, 18.08966325, 11.70414944, 12.64627113,
13.80162459, 7.02617728, 16.56492853, 9.82454417, 8.10140123,
15.71810356, 24.8236722 , 10.89223692, 21.2456741 , 13.77916502,
10.67543603, 8.42066842, 12.45095892, 20.57350278, 10.46540505,
14.60394292, 16.38952182, 17.142417 , 13.17250923, 17.35076974,
21.17219997, 8.21351412, 16.14984219, 15.14382412, 8.77536534,
13.75492091, 16.41353838, 9.57141305, 14.27633084, 18.08106614,
20.96133734, 9.02853088, 20.25085962, 20.72493711, 13.69127828,
4.48797341, 17.75774028, 11.93958855, 11.03831089, 23.7750009 ,
11.91393641, 18.88783611, 20.84960921, 8.02434976, 5.39836394,
14.35422219, 15.62305239, 4.5174207 , 14.96247916, 17.19408806,
6.93837735, 17.39652874, 16.69270639, 12.76255569, 7.83850076,
12.60407148, 14.47316562, 14.87158322, 21.42869884, 18.14787514,
8.63502004, 11.83397385, 23.20856705, 10.08213515, 19.27559207,
20.0987164 , 9.87376597, 22.32356514, 7.48494988, 19.31724002,
15.56832949, 9.97766649, 11.37395041, 11.08808285, 6.52165542,
19.90457643, 7.57124521, 19.24819132, 17.67664966, 23.34299052,
9.21761664, 17.11020605, 10.26623555, 9.61843934, 13.12688122,
12.50992234, 18.57548627, 10.58632465, 13.87907726, 15.33802624,
14.05423996, 14.42682203, 18.39651198, 13.51559161, 12.73286286,
20.46049761, 22.01778386, 9.53746995, 11.86002719, 17.78800126,
15.80482629, 23.38295048, 14.5151901 , 12.35335522, 14.67363835,
11.98308167, 4.51747309, 6.50865906, 21.73203085, 7.74763102])

Assumption 3:
Normality: Linear Regression assumes that the residuals follow a normal distribution. This assumption
ensures the accuracy of statistical inference and hypothesis testing. Deviations from normality may lead to
biased estimates and incorrect statistical inferences.

In [70]: residual = y_train - y_pred
         mean_residual = np.mean(residual)

In [71]: sns.kdeplot(residual)

Out[71]: <AxesSubplot:xlabel='Sales', ylabel='Density'>
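Beyond eyeballing the density plot, a formal test such as Shapiro-Wilk can back up the normality check. The cell below is a sketch that assumes scipy is available; it is not part of the original notebook:

In [ ]: from scipy import stats

        # Shapiro-Wilk test: a p-value above 0.05 means the residuals
        # are consistent with a normal distribution
        stat, p_value = stats.shapiro(residual)
        print(stat, p_value)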

Assumption 4:
Homoscedasticity: Homoscedasticity assumes that the variance of the error term is constant across all
levels of the independent variables. In simpler terms, it means that the spread of the residuals remains the
same across the predicted values. Departure from this assumption may indicate heteroscedasticity, which
can affect the model's reliability.

In [73]: plt.scatter(y_pred, residual)

Out[73]: <matplotlib.collections.PathCollection at 0x1e20a8a6260>
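The residuals-versus-fitted scatter can be supplemented with a formal heteroscedasticity test such as Breusch-Pagan. The cell below is a sketch assuming statsmodels, which the original notebook does not use:

In [ ]: import statsmodels.api as sm
        from statsmodels.stats.diagnostic import het_breuschpagan

        # Breusch-Pagan regresses squared residuals on the predictors;
        # a p-value above 0.05 is consistent with constant variance
        lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(residual, sm.add_constant(X_train))
        print(lm_pvalue, f_pvalue)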

Assumption 5:
No Autocorrelation of error: The residuals in the linear regression model are assumed to be independently
and identically distributed. This implies that each error term is independent and unrelated to the other error
terms.

In [87]: plt.figure(figsize=(10, 5))
         p = sns.lineplot(x=y_pred, y=residual, marker='o', color='blue')
         plt.xlabel('y_pred/predicted values')
         plt.ylabel('Residuals')
         plt.ylim(-10, 10)
         plt.xlim(0, 26)
         p = plt.title('Residuals vs fitted values plot for autocorrelation check')
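A common numeric companion to this plot is the Durbin-Watson statistic, which ranges from 0 to 4, with values near 2 indicating no first-order autocorrelation. The cell below is a sketch assuming statsmodels is installed:

In [ ]: from statsmodels.stats.stattools import durbin_watson

        # values near 2 indicate little or no first-order autocorrelation
        print(durbin_watson(residual))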

