4/18/24, 6:43 AM Linear Regression With Python - Housing data Analysis
Linear Regression with Python
My neighbor is a real estate agent and wants some help predicting housing prices for
regions in India. It would be great if I could create a model for her that takes
a few features of a house and returns an estimate of what the house
would sell for.
She has asked whether I could help her out with my new data science skills. I said yes, and
decided that Linear Regression might be a good way to solve this problem!
My neighbor then gave me some information about a bunch of houses in regions of
India; it is all in the data set INDIA_Housing.csv.
The data contains the following columns:
'Avg. Area Income': Average income of residents of the city the house is located in
'Avg. Area House Age': Average age of houses in the same city
'Avg. Area Number of Rooms': Average number of rooms for houses in the same city
'Avg. Area Number of Bedrooms': Average number of bedrooms for houses in the same city
'Area Population': Population of the city the house is located in
'Price': Price that the house sold at
'Address': Address of the house
Let's get started!
Check out the data
In [1]: ## Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]: INDIAhousing = pd.read_csv("INDIA_Housing.csv")
INDIAhousing.head()
localhost:8888/nbconvert/html/Linear Regression With Python - Housing data Analysis .ipynb?download=false 1/18
Out[2]:
   Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  Avg. Area Number of Bedrooms  Area Population         Price  Address
0      79545.458574             5.682861                   7.009188                          4.09     23086.800503  1.059034e+06  208 Michael Ferry Apt. 674\nLaurabury, NE 3701..
1      79248.642455             6.002900                   6.730821                          3.09     40173.072174  1.505891e+06  188 Johnson Views Suite 079\nLake Kathleen, CA...
2      61287.067179             5.865890                   8.512727                          5.13     36882.159400  1.058988e+06  9127 Elizabeth Stravenue\nDanieltown, WI 06482..
3      63345.240046             7.188236                   5.586729                          3.26     34310.242831  1.260617e+06  USS Barnett\nFPO AP 44820
4      59982.197226             5.040555                   7.839388                          4.23     26354.109472  6.309435e+05  USNS Raymond\nFPO AE 09386
In [3]: INDIAhousing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Avg. Area Income 5000 non-null float64
1 Avg. Area House Age 5000 non-null float64
2 Avg. Area Number of Rooms 5000 non-null float64
3 Avg. Area Number of Bedrooms 5000 non-null float64
4 Area Population 5000 non-null float64
5 Price 5000 non-null float64
6 Address 5000 non-null object
dtypes: float64(6), object(1)
memory usage: 273.6+ KB
In [4]: INDIAhousing.describe()
Out[4]:
       Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  Avg. Area Number of Bedrooms  Area Population         Price
count       5000.000000          5000.000000                5000.000000                   5000.000000      5000.000000  5.000000e+03
mean       68583.108984             5.977222                   6.987792                      3.981330     36163.516039  1.232073e+06
std        10657.991214             0.991456                   1.005833                      1.234137      9925.650114  3.531176e+05
min        17796.631190             2.644304                   3.236194                      2.000000       172.610686  1.593866e+04
25%        61480.562388             5.322283                   6.299250                      3.140000     29403.928702  9.975771e+05
50%        68804.286404             5.970429                   7.002902                      4.050000     36199.406689  1.232669e+06
75%        75783.338666             6.650808                   7.665871                      4.490000     42861.290769  1.471210e+06
max       107701.748378             9.519088                  10.759588                      6.500000     69621.713378  2.469066e+06
In [5]: INDIAhousing.columns
Out[5]: Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
      dtype='object')
Exploratory Data Analysis for House Price Prediction
In [6]: sns.pairplot(INDIAhousing)
Out[6]: <seaborn.axisgrid.PairGrid at 0x21385c9ef50>
In [7]: sns.histplot(INDIAhousing['Price'], kde=True, stat='density')
Out[7]: <Axes: xlabel='Price', ylabel='Density'>
In [8]: sns.histplot(INDIAhousing['Area Population'], kde=True, stat='density')
Out[8]: <Axes: xlabel='Area Population', ylabel='Density'>
In [9]: sns.histplot(INDIAhousing['Avg. Area Income'], kde=True, stat='density')
Out[9]: <Axes: xlabel='Avg. Area Income', ylabel='Density'>
In [10]: sns.histplot(INDIAhousing['Avg. Area House Age'], kde=True, stat='density')
Out[10]: <Axes: xlabel='Avg. Area House Age', ylabel='Density'>
In [11]: sns.histplot(INDIAhousing['Avg. Area Number of Rooms'], kde=True, stat='density')
Out[11]: <Axes: xlabel='Avg. Area Number of Rooms', ylabel='Density'>
In [12]: sns.histplot(INDIAhousing['Avg. Area Number of Bedrooms'], kde=True, stat='density')
Out[12]: <Axes: xlabel='Avg. Area Number of Bedrooms', ylabel='Density'>
In [15]: sns.histplot(INDIAhousing['Price'])
Out[15]: <Axes: xlabel='Price', ylabel='Count'>
In [16]: sns.histplot(INDIAhousing['Avg. Area Income'])
Out[16]: <Axes: xlabel='Avg. Area Income', ylabel='Count'>
In [17]: sns.histplot(INDIAhousing['Avg. Area House Age'])
Out[17]: <Axes: xlabel='Avg. Area House Age', ylabel='Count'>
In [18]: sns.histplot(INDIAhousing['Avg. Area Number of Rooms'])
Out[18]: <Axes: xlabel='Avg. Area Number of Rooms', ylabel='Count'>
In [19]: sns.histplot(INDIAhousing['Avg. Area Number of Bedrooms'])
Out[19]: <Axes: xlabel='Avg. Area Number of Bedrooms', ylabel='Count'>
In [20]: sns.histplot(INDIAhousing['Area Population'])
Out[20]: <Axes: xlabel='Area Population', ylabel='Count'>
In [21]: INDIAhousing_numeric = INDIAhousing.drop(columns=['Address'])
In [22]: sns.heatmap(INDIAhousing_numeric.corr(), annot=True)
Out[22]: <Axes: >
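To make the correlation step more concrete, here is a minimal sketch on made-up data. The `toy` frame, its column names, and every number in it are assumptions for illustration, not values from INDIA_Housing.csv. It keeps only the numeric columns with `select_dtypes` (an alternative to dropping 'Address' by name) and ranks the features by their correlation with the price column, which is the same information the heatmap shows in its 'Price' row.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (all values are invented).
rng = np.random.default_rng(0)
n = 500
income = rng.normal(68000, 10000, n)
rooms = rng.normal(7, 1, n)
price = 20 * income + 120000 * rooms + rng.normal(0, 50000, n)

toy = pd.DataFrame({
    "income": income,
    "rooms": rooms,
    "price": price,
    "address": ["n/a"] * n,   # a text column, like 'Address'
})

# Keep numeric columns only, so the text column is excluded from corr().
numeric = toy.select_dtypes(include="number")

# Pearson correlation of each feature with the target, strongest first.
corr_with_price = (
    numeric.corr()["price"].drop("price").sort_values(ascending=False)
)
print(corr_with_price)
```

The same pattern should work on the real frame by swapping `toy` for `INDIAhousing` and "price" for 'Price'.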
Training a Linear Regression Model
In [23]: X = INDIAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
                  'Avg. Area Number of Bedrooms', 'Area Population']]
Y = INDIAhousing['Price']
Split Data into Train and Test Sets
In [24]: from sklearn.model_selection import train_test_split
In [25]: X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_sta
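As a rough illustration of what this split does, the sketch below applies `train_test_split` with `test_size=0.4` to placeholder arrays of the same size as the housing data (5000 rows, 5 features). The arrays `X_demo` and `y_demo` are made up, and `random_state=0` is an assumed seed chosen only to make the sketch repeatable; it is not necessarily the value used in the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the housing features and target.
X_demo = np.arange(5000 * 5).reshape(5000, 5)
y_demo = np.arange(5000)

# 40% of the rows go to the test set, the remaining 60% to the training set.
# random_state=0 is an arbitrary seed assumed for this sketch.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0
)

print(X_tr.shape, X_te.shape)  # (3000, 5) (2000, 5)
```

Fixing the seed means the same rows land in the test set on every run, so the evaluation numbers stay comparable between runs.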
Creating and Training the Linear Regression Model
In [26]: from sklearn.linear_model import LinearRegression
In [27]: lm = LinearRegression()
In [28]: lm.fit(X_train,Y_train)
Out[28]: LinearRegression()
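For intuition about what `fit` computes, here is a small sketch on synthetic data (the arrays, the "true" coefficients, and the noise level are all assumptions for illustration). It fits `LinearRegression` and then solves the same ordinary least squares problem directly with `np.linalg.lstsq`, using a column of ones for the intercept; the two sets of estimates should agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 4 + 2*x0 - 1*x1 + 0.5*x2 + small noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = 4.0 + X_demo @ true_coef + rng.normal(scale=0.1, size=200)

# Fit with scikit-learn.
model = LinearRegression().fit(X_demo, y_demo)

# Solve the same least squares problem by hand:
# prepend a column of ones so the first parameter plays the role of the intercept.
A = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print("sklearn:", model.intercept_, model.coef_)
print("lstsq:  ", beta[0], beta[1:])
```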
Linear Regression Model Evaluation
In [29]: print(lm.intercept_)
-2640159.79685267
In [30]: coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df
Out[30]: Coefficient
Avg. Area Income 21.528276
Avg. Area House Age 164883.282027
Avg. Area Number of Rooms 122368.678027
Avg. Area Number of Bedrooms 2233.801864
Area Population 15.150420
Interpreting the coefficients:
Holding all other features fixed, a 1 unit increase in Avg. Area Income is associated with an
increase of $21.53 in Price.
Holding all other features fixed, a 1 unit increase in Avg. Area House Age is associated with
an increase of $164883.28 in Price.
Holding all other features fixed, a 1 unit increase in Avg. Area Number of Rooms is
associated with an increase of $122368.68 in Price.
Holding all other features fixed, a 1 unit increase in Avg. Area Number of Bedrooms is
associated with an increase of $2233.80 in Price.
Holding all other features fixed, a 1 unit increase in Area Population is associated with an
increase of $15.15 in Price.
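Does this make sense? Probably not, because I made up this data. If you want real data to
repeat this sort of analysis, note that the Boston dataset (load_boston) was removed from
scikit-learn in version 1.2; the California housing dataset that ships with scikit-learn is
one alternative:
from sklearn.datasets import fetch_california_housing
california = fetch_california_housing(as_frame=True)
print(california.DESCR)
california_df = california.frame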
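The "holding all other features fixed" reading of the coefficients can be checked with a little arithmetic. The sketch below plugs the intercept and coefficients printed above into the linear formula by hand. The "typical" house built from the column means in the describe() output and the helper name `predict_price` are assumptions made for this illustration only.

```python
import numpy as np

# Intercept and coefficients as printed by the fitted model above.
intercept = -2640159.79685267
coef = np.array([
    21.528276,       # Avg. Area Income
    164883.282027,   # Avg. Area House Age
    122368.678027,   # Avg. Area Number of Rooms
    2233.801864,     # Avg. Area Number of Bedrooms
    15.150420,       # Area Population
])

def predict_price(features):
    """Linear prediction: intercept + sum(coefficient * feature)."""
    return intercept + float(np.dot(coef, features))

# Hypothetical house whose features equal the column means from describe().
base = np.array([68583.108984, 5.977222, 6.987792, 3.981330, 36163.516039])

# Same house, but one unit (one year) older, everything else unchanged.
older = base.copy()
older[1] += 1.0

diff = predict_price(older) - predict_price(base)
print(predict_price(base))
print(diff)  # equals the House Age coefficient, up to floating point error
```

Because the model is linear, the difference does not depend on which starting house is chosen; it is always the coefficient of the feature that changed.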
Predictions from our Model
Let's grab predictions off our test set and see how well it did!
In [31]: predictions = lm.predict(X_test)
In [32]: plt.scatter(Y_test,predictions)
Out[32]: <matplotlib.collections.PathCollection at 0x2138b335750>
Residual Histogram
In [33]: sns.histplot((Y_test-predictions), bins=50, kde=True, stat='density');
In [34]: sns.histplot((Y_test-predictions),bins=50);
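What to look for in a residual histogram can be shown with a small sketch on synthetic data (the arrays, coefficients, and noise level below are assumptions for illustration). When the linear model is a good description of the data, the residuals are centered on zero and their spread is close to the noise level; on the training data, a least squares fit with an intercept gives residuals whose mean is zero up to rounding.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a linear model with unit-variance noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 2))
y_demo = 10.0 + X_demo @ np.array([3.0, -2.0]) + rng.normal(scale=1.0, size=1000)

model = LinearRegression().fit(X_demo, y_demo)

# Residual = actual value minus predicted value.
residuals = y_demo - model.predict(X_demo)

print("mean of residuals:", residuals.mean())
print("std of residuals: ", residuals.std())
```

A histogram of `residuals` that is strongly skewed or has several peaks would suggest that a plain linear model is missing some structure in the data.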
Regression Evaluation Metrics
In [35]: from sklearn import metrics
In [36]: print('MAE:', metrics.mean_absolute_error(Y_test, predictions))
print('MSE:', metrics.mean_squared_error(Y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(Y_test, predictions)))
MAE: 82288.22251914942
MSE: approximately 1.0461e+10 (the square of the RMSE below)
RMSE: 102278.82922290884
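To show how the three metrics relate, here is a small sketch that computes them by hand on four made-up values (`y_true` and `y_pred` are illustrative, not taken from the housing data) and compares the results with scikit-learn. MAE averages the absolute errors, MSE averages the squared errors, and RMSE is the square root of MSE, which puts it back in the same units as the target.

```python
import numpy as np
from sklearn import metrics

# Tiny made-up example: four actual values and four predictions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred            # [0.5, -0.5, 0.0, -1.0]

mae = np.mean(np.abs(errors))       # (0.5 + 0.5 + 0 + 1) / 4 = 0.5
mse = np.mean(errors ** 2)          # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
rmse = np.sqrt(mse)                 # square root of the MSE

print("MAE: ", mae, metrics.mean_absolute_error(y_true, y_pred))
print("MSE: ", mse, metrics.mean_squared_error(y_true, y_pred))
print("RMSE:", rmse)
```

Because the errors are squared before averaging, MSE and RMSE penalize large misses more heavily than MAE does, so RMSE is never smaller than MAE.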
Thank You