Linear Regression Models
Creating a model to predict the Housing prices
based on existing features
In [1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]: %matplotlib inline
In [3]: hs=pd.read_csv('USA_Housing.csv')
hs
Out[3]: Avg. Area
Avg. Area Avg. Area Avg. Area
Area
House Number of Number of Price Address
Income Population
Age Rooms Bedrooms
208 Michael Ferry Apt.
0 79545.458574 5.682861 7.009188 4.09 23086.800503 1.059034e+06 674\nLaurabury, NE
3701...
188 Johnson Views Suite
1 79248.642455 6.002900 6.730821 3.09 40173.072174 1.505891e+06
079\nLake Kathleen, CA...
9127 Elizabeth
2 61287.067179 5.865890 8.512727 5.13 36882.159400 1.058988e+06 Stravenue\nDanieltown,
WI 06482...
USS Barnett\nFPO AP
3 63345.240046 7.188236 5.586729 3.26 34310.242831 1.260617e+06
44820
USNS Raymond\nFPO AE
4 59982.197226 5.040555 7.839388 4.23 26354.109472 6.309435e+05
09386
... ... ... ... ... ... ... ...
USNS Williams\nFPO AP
4995 60567.944140 7.830362 6.137356 3.46 22837.361035 1.060194e+06
30153-7653
PSC 9258, Box
4996 78491.275435 6.999135 6.576763 4.02 25616.115489 1.482618e+06 8489\nAPO AA 42991-
3352
4215 Tracy Garden Suite
4997 63390.686886 7.250591 4.805081 2.13 33266.145490 1.030730e+06 076\nJoshualand, VA
01...
USS Wallace\nFPO AE
4998 68001.331235 5.534388 7.130144 5.44 42625.620156 1.198657e+06
73316
37778 George Ridges
4999 65510.581804 5.992305 6.792336 4.07 46501.283803 1.298950e+06 Apt. 509\nEast Holly, NV
2...
5000 rows × 7 columns
In [4]: hs.info() #gives total number of columns, total number of entries and type of data ty
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Avg. Area Income 5000 non-null float64
1 Avg. Area House Age 5000 non-null float64
2 Avg. Area Number of Rooms 5000 non-null float64
3 Avg. Area Number of Bedrooms 5000 non-null float64
4 Area Population 5000 non-null float64
5 Price 5000 non-null float64
6 Address 5000 non-null object
dtypes: float64(6), object(1)
memory usage: 273.6+ KB
In [5]: hs.describe() #to get the statistical information about the dataframe
Out[5]: Avg. Area Avg. Area House Avg. Area Number Avg. Area Number of Area
Price
Income Age of Rooms Bedrooms Population
count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5.000000e+03
mean 68583.108984 5.977222 6.987792 3.981330 36163.516039 1.232073e+06
std 10657.991214 0.991456 1.005833 1.234137 9925.650114 3.531176e+05
min 17796.631190 2.644304 3.236194 2.000000 172.610686 1.593866e+04
25% 61480.562388 5.322283 6.299250 3.140000 29403.928702 9.975771e+05
50% 68804.286404 5.970429 7.002902 4.050000 36199.406689 1.232669e+06
75% 75783.338666 6.650808 7.665871 4.490000 42861.290769 1.471210e+06
max 107701.748378 9.519088 10.759588 6.500000 69621.713378 2.469066e+06
In [6]: #in above address column is not available as it has string values.
In [7]: hs.columns #gives number of columns in the given dataframe.
Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
Out[7]:
'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
dtype='object')
In [8]: sns.pairplot(hs)
<seaborn.axisgrid.PairGrid at 0x21970b9e790>
Out[8]:
In [9]: #in above histogram looks like normally distributed except number of bedrooms which are 4
#around it.
In [10]: #to check distribution of the price
sns.histplot(hs['Price'])
<AxesSubplot:xlabel='Price', ylabel='Count'>
Out[10]:
In [11]: #to check the corelation between two variables use heatmap
sns.heatmap(hs.corr())
#you can see the daignol of corelation and it shows relation is perfectly ploted with so
<AxesSubplot:>
Out[11]:
In [12]: sns.heatmap(hs.corr(),annot=True)
<AxesSubplot:>
Out[12]:
In [13]: #to check the corelation between two variables
hs.corr()
Out[13]: Avg. Area Avg. Area Avg. Area Number Avg. Area Number Area
Price
Income House Age of Rooms of Bedrooms Population
Avg. Area Income 1.000000 -0.002007 -0.011032 0.019788 -0.016234 0.639734
Avg. Area House
-0.002007 1.000000 -0.009428 0.006149 -0.018743 0.452543
Age
Avg. Area Number
-0.011032 -0.009428 1.000000 0.462695 0.002040 0.335664
of Rooms
Avg. Area Number
0.019788 0.006149 0.462695 1.000000 -0.022168 0.171071
of Bedrooms
Area Population -0.016234 -0.018743 0.002040 -0.022168 1.000000 0.408556
Price 0.639734 0.452543 0.335664 0.171071 0.408556 1.000000
In [14]: # for training a linear regression model split the data into X axis that features to tra
# in this case target variable is Price column wr we are predciting the price
#here we wont deal with Address column as it has text information but in NLP we will.
hs.columns # to grab the columns
Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
Out[14]:
'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
dtype='object')
In [15]: #take feature variables
X=hs[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
'Avg. Area Number of Bedrooms', 'Area Population']]
In [16]: #take target variable that we trying to predict
y=hs['Price']
Splitting the data into training set and testing set(to
test the model that we have trained)
In [17]: #now do Train Test split in the data
#we need to import scikitlearn for splitting the data
from sklearn.model_selection import train_test_split
In [18]: x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=10
#tuple unpacking to grab training set and testing set
#test size is percentage of test data you want to allocate for testing our model that is
#random state means data is allocated randomly.
In [19]: from sklearn.linear_model import LinearRegression
#to import linear regression functions
In [20]: #we need to create instance for the linear regression model
lm=LinearRegression()
In [21]: lm.fit(x_train,y_train) # at first i will fit my data into model for training the data
#use shift+tab to get the broiler code
LinearRegression()
Out[21]:
evaluate our model while checking coeffcients
In [22]: #grab the intercept while calling lm
print(lm.intercept_)
-2640159.79685267
In [23]: #grab the coeffcient for each features each of this coef relates to columns
lm.coef_
array([2.15282755e+01, 1.64883282e+05, 1.22368678e+05, 2.23380186e+03,
Out[23]:
1.51504200e+01])
In [24]: X.columns
Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
Out[24]:
'Avg. Area Number of Bedrooms', 'Area Population'],
dtype='object')
In [25]: x_train.columns
Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
Out[25]:
'Avg. Area Number of Bedrooms', 'Area Population'],
dtype='object')
In [26]: #create a dataframe for better understanding on the coefficients
cdf=pd.DataFrame(lm.coef_,X.columns,columns=['Coeff'])
In [27]: cdf
Out[27]: Coeff
Avg. Area Income 21.528276
Avg. Area House Age 164883.282027
Avg. Area Number of Rooms 122368.678027
Avg. Area Number of Bedrooms 2233.801864
Area Population 15.150420
In [28]: #coefficient shows that when you hold all other features fixed and if there is one unit
#is assoicated with the increase of 21.528$ in the house price.and rest all the same.
Predicting the test set
In [37]: predictions=lm.predict(x_test)
In [38]: predictions #predicted prices for the house
array([1260960.70567627, 827588.7556033 , 1742421.24254342, ...,
Out[38]:
372191.40626916, 1365217.15140897, 1914519.5417888 ])
In [39]: y_test #actual vlaues of the house
1718 1.251689e+06
Out[39]:
2511 8.730483e+05
345 1.696978e+06
2521 1.063964e+06
54 9.487883e+05
...
1776 1.489520e+06
4269 7.777336e+05
1661 1.515271e+05
2410 1.343824e+06
2302 1.906025e+06
Name: Price, Length: 2000, dtype: float64
In [40]: #to analyse above these values draw a scatter plot using both variables
plt.scatter(y_test,predictions)
<matplotlib.collections.PathCollection at 0x2197496b880>
Out[40]:
In [ ]: #it is pretty good bcs the predicted value and actual value fits into the straight line
In [41]: #create a histogram of distribution for residuals.residuals are the diffrence between ac
#predicted values (predictions).
sns.distplot((y_test-predictions))
C:\Users\Sai Shri\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarni
ng: `distplot` is a deprecated function and will be removed in a future version. Please
adapt your code to use either `displot` (a figure-level function with similar flexibilit
y) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='Price', ylabel='Density'>
Out[41]:
In [42]: #the curve shows residuals are normally distributed it means the model selected is a per
#if it is not normaly distributed and showing weired behaviour then look back the data a
#is good choice for the data or not for the dataset
Regression Evaluation Matrics
Mean Absolute Error
Mean sqaured Error
Root Mean sqaured Error
In [43]: from sklearn import metrics
In [44]: metrics.mean_absolute_error(y_test,predictions)
82288.22251914942
Out[44]:
In [46]: metrics.mean_squared_error(y_test,predictions)
10460958907.208948
Out[46]:
In [47]: np.sqrt(metrics.mean_squared_error(y_test,predictions))
102278.82922290883
Out[47]: