6 - Train - Test - Split - Ipynb - Colaboratory
We have a dataset containing prices of used BMW cars. We are going to analyze this dataset
and build a prediction function that predicts a price from the mileage and age of the car.
We will use sklearn's train_test_split method to split the data into training and testing sets.
import pandas as pd
df = pd.read_csv("carprices.csv")
df.head()
Mileage Age(yrs) Sell Price($)
0 69000 6 18000
1 35000 3 34000
2 57000 5 26100
3 22500 2 40000
4 46000 4 31500
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(df['Mileage'],df['Sell Price($)'])
<matplotlib.collections.PathCollection at 0x2882746dd30>
plt.scatter(df['Age(yrs)'],df['Sell Price($)'])
<matplotlib.collections.PathCollection at 0x28826e06240>
Looking at the two scatter plots above, a linear regression model makes sense, as we can
clearly see a linear relationship between our dependent variable (i.e. Sell Price) and our
independent variables (i.e. car age and car mileage).
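To put a number on what the scatter plots suggest, we can look at the Pearson correlation between each feature and the price. The snippet below is a minimal sketch that uses only the five rows shown by df.head() above (the name `sample` is ours, and the full carprices.csv is not reproduced here), so the values are only indicative.

```python
import pandas as pd

# The five rows printed by df.head(); not the full dataset.
sample = pd.DataFrame({
    "Mileage": [69000, 35000, 57000, 22500, 46000],
    "Age(yrs)": [6, 3, 5, 2, 4],
    "Sell Price($)": [18000, 34000, 26100, 40000, 31500],
})

# Pearson correlation of every column with the sell price.
corr = sample.corr()["Sell Price($)"]
print(corr)
```

A correlation close to -1 for both Mileage and Age(yrs) would agree with the downward-sloping lines in the plots: older cars with more miles sell for less.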
The approach we are going to use here is to split the available data into two sets: a training
set, used to fit the model, and a test set, used to evaluate it.
The reason we don't test on the training set is that our model has already seen those
samples, so predicting on them can give a misleadingly optimistic impression of its accuracy.
It is like asking the same questions in the exam paper that you taught the students in class.
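The effect is easy to see with a model that can memorize its training data. The sketch below is an illustration on synthetic data, not on the car prices: it swaps in a DecisionTreeRegressor (not the linear model used in this notebook) because a fully grown tree reproduces its training samples exactly. All names and numbers here are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy straight line (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 2.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unrestricted tree memorizes every training sample.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = tree.score(X_train, y_train)  # scored on samples it has seen
test_r2 = tree.score(X_test, y_test)     # scored on unseen samples
print("train R^2:", train_r2)
print("test R^2:", test_r2)
```

The training score looks perfect only because the model is being asked the questions it has already been given the answers to; the test score is the honest one.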
X = df[['Mileage','Age(yrs)']]
y = df['Sell Price($)']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3)
X_train
Mileage Age(yrs)
11 79000 7
17 69000 5
10 83000 7
1 35000 3
0 69000 6
8 91000 8
7 72000 6
16 28000 2
6 52000 5
4 46000 4
19 52000 5
2 57000 5
5 59000 5
15 25400 3
X_test
Mileage Age(yrs)
3 22500 2
12 59000 5
14 82450 7
13 58780 4
9 67000 6
18 87600 8
y_train
11 19500
17 19700
10 18700
1 34000
0 18000
8 12000
7 19300
16 35500
6 32000
4 31500
19 28200
2 26100
5 26750
15 35000
y_test
3 40000
12 26000
14 19400
13 27500
9 22000
18 12800
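With 20 rows and test_size=0.3, the split puts 6 rows in the test set and 14 in the training set, which matches the outputs above. A small sketch with placeholder arrays (the values are stand-ins, not the car data) confirms the sizes and that no row lands in both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the notebook's dataset.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)  # y doubles as a row id here

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

print(len(X_train), len(X_test))

# Rows shared by both sets; should be empty.
overlap = set(y_train) & set(y_test)
print(overlap)
```

Because no random_state is passed, which rows end up in each set changes from run to run; only the sizes stay fixed.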
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
clf.fit(X_train, y_train)
X_test
Mileage Age(yrs)
3 22500 2
12 59000 5
14 82450 7
13 58780 4
9 67000 6
18 87600 8
clf.predict(X_test)
22602.44614295, 15559.98266172])
y_test
3 40000
12 26000
14 19400
13 27500
9 22000
18 12800
clf.score(X_test, y_test)
0.92713129118963111
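For a LinearRegression model, score returns the coefficient of determination R^2 of the predictions, not a classification-style accuracy. The sketch below checks that against sklearn.metrics.r2_score. The rows are reconstructed from the X_train, X_test, y_train and y_test outputs printed in this notebook, which is an assumption on our part since carprices.csv itself is not reproduced here, and the test rows are picked by the indices shown in y_test above rather than by a new random split.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# (Mileage, Age, Sell Price) per row index 0..19, reconstructed from the
# outputs printed in this notebook (assumption, not read from the CSV).
rows = [
    (69000, 6, 18000), (35000, 3, 34000), (57000, 5, 26100), (22500, 2, 40000),
    (46000, 4, 31500), (59000, 5, 26750), (52000, 5, 32000), (72000, 6, 19300),
    (91000, 8, 12000), (67000, 6, 22000), (83000, 7, 18700), (79000, 7, 19500),
    (59000, 5, 26000), (58780, 4, 27500), (82450, 7, 19400), (25400, 3, 35000),
    (28000, 2, 35500), (69000, 5, 19700), (87600, 8, 12800), (52000, 5, 28200),
]
features = ["Mileage", "Age(yrs)"]
target = "Sell Price($)"
df = pd.DataFrame(rows, columns=features + [target])

# Reuse the test indices from the split shown above instead of re-splitting.
test_idx = [3, 12, 14, 13, 9, 18]
train = df.drop(index=test_idx)
test = df.loc[test_idx]

model = LinearRegression().fit(train[features], train[target])

score = model.score(test[features], test[target])
manual = r2_score(test[target], model.predict(test[features]))
print(score, manual)
```

The two printed numbers should agree. An R^2 of 1.0 would mean perfect predictions, while 0 would mean the model does no better than always predicting the mean price.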
random_state argument
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=10)
X_test
Mileage Age(yrs)
7 72000 6
10 83000 7
5 59000 5
6 52000 5
3 22500 2
18 87600 8
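Passing random_state fixes the seed of the shuffle, so the same rows are selected every time the cell is run. A minimal sketch on placeholder row ids (not the car data) shows two calls with the same seed producing the same test set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

row_ids = np.arange(20)  # stand-in for the 20 rows of the dataset

# Same seed twice: the split is repeated exactly.
_, test_a = train_test_split(row_ids, test_size=0.3, random_state=10)
_, test_b = train_test_split(row_ids, test_size=0.3, random_state=10)

# No seed: the selected rows can differ on every call.
_, test_c = train_test_split(row_ids, test_size=0.3)

print(test_a)
print(test_b)
print(test_c)
```

A fixed seed is useful when we want scores to be comparable between runs, since otherwise part of any change in clf.score comes from the split itself rather than from the model.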