
6 - Train - Test - Split - Ipynb - Colaboratory

The document describes using a dataset containing prices of used BMW cars to build a linear regression model to predict price based on mileage and age. The data is split into training and test sets using train_test_split. Scatter plots show linear relationships between price and the input variables. A linear regression model is fit on the training set and used to make predictions on the test set, achieving a score of 0.927. The random_state argument is also demonstrated.

Uploaded by

duryodhan sahoo

Training And Testing Available Data

We have a dataset containing prices of used BMW cars. We will analyze this dataset
and build a prediction function that takes the mileage and age of a car as input and
predicts its sale price. We will use sklearn's train_test_split function to split the
data into training and testing sets.

import pandas as pd
df = pd.read_csv("carprices.csv")
df.head()

   Mileage  Age(yrs)  Sell Price($)
0    69000         6          18000
1    35000         3          34000
2    57000         5          26100
3    22500         2          40000
4    46000         4          31500

import matplotlib.pyplot as plt
%matplotlib inline

Car Mileage Vs Sell Price ($)

plt.scatter(df['Mileage'],df['Sell Price($)'])

<matplotlib.collections.PathCollection at 0x2882746dd30>

Car Age Vs Sell Price ($)

plt.scatter(df['Age(yrs)'],df['Sell Price($)'])

<matplotlib.collections.PathCollection at 0x28826e06240>

Looking at the two scatter plots above, a linear regression model makes sense, as we can
clearly see a linear relationship between our dependent variable (Sell Price) and our
independent variables (car age and car mileage).
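As a rough numeric check of what the plots suggest, the sketch below computes the Pearson correlation between each input and the price. It uses only the five rows printed by df.head() above, not the full carprices.csv, so the numbers are illustrative only. The helper name pearson is made up for this example.

```python
import math

# First five rows of the dataset, copied from the df.head() output above.
mileage = [69000, 35000, 57000, 22500, 46000]
age = [6, 3, 5, 2, 4]
price = [18000, 34000, 26100, 40000, 31500]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r_mileage = pearson(mileage, price)
r_age = pearson(age, price)
print(r_mileage, r_age)
```

On these five rows both values come out close to -1, meaning the price falls almost linearly as mileage or age goes up.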

The approach we are going to use here is to split the available data into two sets:

1. Training: We will train our model on this subset.

2. Testing: We will use this subset to make predictions with the trained model and compare them against the known prices.

We don't test on the training set because the model has already seen those samples, so
predictions on them can give a misleadingly optimistic impression of the model's accuracy.
It is like setting an exam with exactly the same questions you taught the students in
class.

X = df[['Mileage','Age(yrs)']]

y = df['Sell Price($)']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3) 
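Conceptually, a split with test_size=0.3 shuffles the rows and holds out 30% of them for testing. The sketch below illustrates that idea in plain Python. It is not sklearn's implementation: the function name simple_train_test_split, the seed parameter, and the choice to round the test size up are all assumptions made for this example.

```python
import math
import random

def simple_train_test_split(rows, test_size=0.3, seed=None):
    """Shuffle a copy of rows and hold out a test_size fraction for testing."""
    rng = random.Random(seed)      # seed=None gives a different shuffle each run
    shuffled = list(rows)          # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)   # assumed rounding rule
    return shuffled[n_test:], shuffled[:n_test]     # (train, test)

row_labels = list(range(20))       # stand-ins for the 20 row indices 0..19
train, test = simple_train_test_split(row_labels, test_size=0.3)
print(len(train), len(test))
```

With 20 rows this gives 14 training rows and 6 test rows, the same sizes as the X_train and X_test outputs in this notebook.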

X_train

    Mileage  Age(yrs)
11    79000         7
17    69000         5
10    83000         7
1     35000         3
0     69000         6
8     91000         8
7     72000         6
16    28000         2
6     52000         5
4     46000         4
19    52000         5
2     57000         5
5     59000         5
15    25400         3

X_test

    Mileage  Age(yrs)
3     22500         2
12    59000         5
14    82450         7
13    58780         4
9     67000         6
18    87600         8

y_train

11 19500

17 19700

10 18700

1 34000

0 18000

8 12000

7 19300

16 35500

6 32000

4 31500

19 28200

2 26100

5 26750

15 35000

Name: Sell Price($), dtype: int64

y_test

3 40000

12 26000

14 19400

13 27500

9 22000

18 12800

Name: Sell Price($), dtype: int64

Let's run the linear regression model now.

from sklearn.linear_model import LinearRegression

clf = LinearRegression()

clf.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

X_test

Mileage Age(yrs)

3 22500 2

12 59000 5

14 82450 7

13 58780 4

9 67000 6

18 87600 8

clf.predict(X_test)

array([ 38166.23426912, 25092.95646646, 16773.29470749, 24096.93956163,
        22602.44614295, 15559.98266172])

y_test

3 40000

12 26000

14 19400

13 27500

9 22000

18 12800

Name: Sell Price($), dtype: int64

clf.score(X_test, y_test)

0.92713129118963111
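For a regressor such as LinearRegression, score returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. As a sanity check, the sketch below recomputes it by hand from the y_test values and the predictions printed above. The variable names are chosen for this example only.

```python
# Actual test prices and model predictions, copied from the outputs above.
actual = [40000, 26000, 19400, 27500, 22000, 12800]
predicted = [38166.23426912, 25092.95646646, 16773.29470749,
             24096.93956163, 22602.44614295, 15559.98266172]

mean_actual = sum(actual) / len(actual)

# Residual sum of squares: error left over after using the model.
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
# Total sum of squares: error of always predicting the mean test price.
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r2 = 1 - ss_res / ss_tot
print(round(r2, 4))
```

The result should agree with the score reported by clf.score(X_test, y_test), up to the rounding of the printed predictions.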

random_state argument

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=10)

X_test

Mileage Age(yrs)

7 72000 6

10 83000 7

5 59000 5

6 52000 5

3 22500 2

18 87600 8
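Passing random_state fixes the seed of the shuffle used by train_test_split, so the same rows end up in the test set on every run. Without it, as in the first call in this notebook, each run produces a different split. The effect of a fixed seed can be illustrated with Python's own random module. This is only an analogy, not sklearn's internals, so the rows it picks will not match the X_test shown above; pick_test_rows is a hypothetical helper.

```python
import random

row_labels = list(range(20))   # stand-ins for row indices 0..19

def pick_test_rows(labels, n_test, seed=None):
    """Return n_test labels taken from the front of an optionally seeded shuffle."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_test]

first = pick_test_rows(row_labels, 6, seed=10)
second = pick_test_rows(row_labels, 6, seed=10)
print(first == second)   # same seed, same shuffle, same test rows
```

Fixing the seed this way is useful when results need to be compared across runs, since the score depends on which rows happen to land in the test set.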
