Chapter 2
Regression
Supervised Learning with scikit-learn
George Boorman
Core Curriculum Manager, DataCamp
Predicting blood glucose levels
import pandas as pd
diabetes_df = pd.read_csv("diabetes.csv")
print(diabetes_df.head())
print(y.shape, X_bmi.shape)
(752,) (752,)
X_bmi = X_bmi.reshape(-1, 1)
print(X_bmi.shape)
(752, 1)
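The reshape step above matters because scikit-learn expects a 2-D feature array. A minimal sketch of the workflow, using synthetic stand-in data in place of diabetes.csv (the shapes mirror the slide's 752 samples):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for the BMI feature and glucose target
rng = np.random.default_rng(42)
X_bmi = rng.uniform(18, 40, size=752)           # 1-D array, shape (752,)
y = 90 + 2.5 * X_bmi + rng.normal(0, 5, 752)    # target with a linear signal

# scikit-learn requires a 2-D feature array, hence the reshape
X_bmi = X_bmi.reshape(-1, 1)                    # shape (752, 1)

reg = LinearRegression()
reg.fit(X_bmi, y)
predictions = reg.predict(X_bmi)
print(predictions.shape)
```

Passing the 1-D array directly to `fit` would raise a ValueError; `reshape(-1, 1)` turns it into a single-column matrix.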
Regression mechanics
y = ax + b
Simple linear regression uses one feature
y = target
x = single feature
a, b = parameters/coefficients of the model - slope, intercept
How do we choose a and b?
Define an error function for any given line
y = a1 x1 + a2 x2 + a3 x3 + ... + an xn + b
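"Defining an error function" here means choosing the coefficients that minimize the sum of squared errors (ordinary least squares). A sketch on synthetic data, showing that the closed-form least-squares solution and scikit-learn's `LinearRegression` agree (the coefficient values are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # three features: x1, x2, x3
true_a = np.array([1.5, -2.0, 0.5])
y = X @ true_a + 4.0 + rng.normal(0, 0.1, 200)   # b = 4.0, plus noise

# Closed-form OLS: minimize the sum of squared errors directly
X_aug = np.hstack([X, np.ones((200, 1))])        # extra column of ones for b
coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(coef[:3], coef[3])                          # a1..a3 and b

# scikit-learn finds the same minimizer
reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)
```

Both approaches recover approximately the same a's and b because they minimize the same squared-error loss.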
[Scatter plots contrasting a high R² fit with a low R² fit]
R² on the test set:
0.356302876407827
RMSE = √MSE
24.028109426907236
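The two metrics above can be computed as follows; this is a sketch on synthetic data, since the original diabetes features are not shown here (column count and coefficients are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(752, 4))
y = X @ np.array([3.0, -1.0, 2.0, 0.5]) + rng.normal(0, 4, 752)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

r_squared = reg.score(X_test, y_test)                # R² on the test set
rmse = np.sqrt(mean_squared_error(y_test, y_pred))   # RMSE = √MSE
print(r_squared, rmse)
```

RMSE is reported in the same units as the target, which makes it easier to interpret than MSE.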
Cross-validation motivation
Model performance depends on how we split up the data
A single train/test split may not be representative of the model's ability to generalize to unseen data
Solution: Cross-validation!
10 folds = 10-fold CV
k folds = k-fold CV
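The outputs below assume `cv_results` holds per-fold scores from scikit-learn's `cross_val_score`. A sketch with synthetic data (the fold count and data shapes are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(752, 4))
y = X @ np.array([3.0, -1.0, 2.0, 0.5]) + rng.normal(0, 2, 752)

# 6 folds = 6-fold CV; each fold takes a turn as the held-out test set
kf = KFold(n_splits=6, shuffle=True, random_state=42)
reg = LinearRegression()
cv_results = cross_val_score(reg, X, y, cv=kf)   # one R² score per fold

print(np.mean(cv_results), np.std(cv_results))
# 95% confidence interval from the fold scores
print(np.quantile(cv_results, [0.025, 0.975]))
```

Shuffling before splitting avoids folds that reflect any ordering in the data.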
print(np.mean(cv_results), np.std(cv_results))
0.7418682216666667 0.023330243960652888
print(np.quantile(cv_results, [0.025, 0.975]))
array([0.7054865, 0.76874702])
Why regularize?
Recall: Linear regression minimizes a loss function
It chooses a coefficient, a, for each feature variable, plus the intercept, b
Large coefficients can lead to overfitting, so we penalize them: this is regularization
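One common form is ridge regression, which adds a penalty proportional to the sum of squared coefficients, scaled by `alpha`. A sketch on synthetic data with one informative feature among many (the alpha grid is illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 10))
y = X[:, 0] * 2.0 + rng.normal(0, 1, 300)   # only one feature carries signal

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scores = {}
for alpha in [0.1, 1.0, 10.0, 100.0]:
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)  # alpha sets penalty strength
    scores[alpha] = ridge.score(X_test, y_test)
    print(alpha, scores[alpha])
```

Larger `alpha` shrinks coefficients more aggressively; `alpha=0` would recover plain linear regression, while a very large `alpha` underfits.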