Scikitold
Scikitold
Scikitold
mrdbourke add old version of scikit-learn notebook (just in case) 4eb2a4c · 3 years ago
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 1/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
NOTE 🚫
This notebook is deprecated, it uses a dataset from Scikit-Learn that has been
removed (the Boston housing dataset) in version 1.0.
It's long but it's called quick because of how vast the Scikit-Learn library is.
Covering everything requires a full-blown documentation, of which, if you ever get
stuck, you should read.
It's built on top on NumPy (Python library for numerical computing) and Matplotlib
(Python library for data visualization).
Why Scikit-Learn?
Although the field of machine learning is vast, the main goal is finding patterns
within data and then using those patterns to make predictions.
And there are certain categories which a majority of problems fall into.
If you're trying to create a machine learning model to predict the price of houses
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 2/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
If you re trying to create a machine learning model to predict the price of houses
given their characteristics, you're working on a regression problem (predicting a
number).
Once you know what kind of problem you're working on, there are also similar
steps you'll take for each. Steps like splitting the data into different sets, one for
your machine learning algorithms to learn on and another to test them on.
Choosing a machine learning model and then evaluating whether or not your
model has learned anything.
Scikit-Learn offers Python implementations for doing all of these kinds of tasks.
Saving you having to build them from scratch.
Note: all of the steps in this notebook are focused on supervised learning (having
data and labels).
After going through it, you'll have the base knolwedge of Scikit-Learn you need to
keep moving forward.
1. Try it - Since Scikit-Learn has been designed with usability in mind, your first
step should be to use what you know and try figure out the answer to your
own question (getting it wrong is part of the process). If in doubt, run your
code.
2. Press SHIFT+TAB - See you can the docstring of a function (information on
what the function does) by pressing SHIFT + TAB inside it. Doing this is a
good habit to develop It'll improve your research skills and give you a better
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 3/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
good habit to develop. It ll improve your research skills and give you a better
understanding of the library.
3. Search for it - If trying it on your own doesn't work, since someone else has
probably tried to do something similar, try searching for your problem. You'll
likely end up in 1 of 2 places:
Scikit-Learn documentation/user guide - the most extensive resource
you'll find for Scikit-Learn information.
Stack Overflow - this is the developers Q&A hub, it's full of questions and
answers of different problems across a wide range of software
development topics and chances are, there's one related to your problem.
The next steps here are to read through the documentation, check the examples
and see if they line up to the problem you're trying to solve. If they do, rewrite the
code to suit your needs, run it, and see what the outcomes are.
4. Ask for help - If you've been through the above 3 steps and you're still stuck,
you might want to ask your question on Stack Overflow. Be as specific as
possible and provide details on what you've tried.
Remember, you don't have to learn all of the functions off by heart to begin with.
Start by answering that question and then practicing finding the code which does it.
In [1]:
# Standard imports
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Once we've seen an end-to-end workflow, we'll dive into each step a little deeper.
Note: Since Scikit-Learn is such a vast library, capable of tackling many problems,
the workflow we're using is only one example of how you can use it.
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 4/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
In [2]:
import pandas as pd
heart_disease = pd.read_csv('../data/heart-disease.csv')
heart_disease.head()
Out[2]: age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca
Here, each row is a different patient and all columns except target are different
patient characteristics. target indicates whether the patient has heart disease
( target = 1) or not ( target = 0).
In [3]:
# Create X (all the feature columns)
X = heart_disease.drop("target", axis=1)
In [4]:
X.head()
Out[4]: age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca
In [5]:
y.head(), y.value_counts()
Out[5]: (0 1
1 1
2 1
3 1
4 1
Name: target, dtype: int64,
1 165
0 138
Name: target, dtype: int64)
In [6]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
Hyperparameters are like knobs on an oven you can tune to cook your favourite
dish.
In [7]:
# We'll use a Random Forest
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
In [8]:
# We'll leave the hyperparameters as default to begin with...
clf.get_params()
If there are labels (supervised learning), the model tries to work out the relationship
between the data and the labels.
If there are no labels (unsupervised learning), the model tries to find patterns and
group similar samples together.
In [9]:
clf.fit(X_train, y_train)
Out[9]: RandomForestClassifier()
Once our model instance is trained, you can use the predict() method to
predict a target value given a set of features. In other words, use the model, along
with some unlabelled data to predict the label.
Note, data you predict on has to be in the same shape as data you trained on.
In [10]:
# This doesn't work... incorrect shapes
y_label = clf.predict(np.array([0, 2, 3, 4]))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/t /i k l 163644/893218067 i d l
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 6/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
/tmp/ipykernel_163644/893218067.py in <module>
1 # This doesn't work... incorrect shapes
----> 2 y_label = clf.predict(np.array([0, 2, 3, 4]))
~/code/zero-to-mastery-ml/env/lib/python3.9/site-packages/sklearn/ensemble/_
forest.py in predict(self, X)
628 The predicted classes.
629 """
--> 630 proba = self.predict_proba(X)
631
632 if self.n_outputs_ == 1:
~/code/zero-to-mastery-ml/env/lib/python3.9/site-packages/sklearn/ensemble/_
forest.py in predict_proba(self, X)
672 check_is_fitted(self)
673 # Check data
--> 674 X = self._validate_X_predict(X)
675
676 # Assign chunk of trees to jobs
~/code/zero-to-mastery-ml/env/lib/python3.9/site-packages/sklearn/ensemble/_
forest.py in _validate_X_predict(self, X)
420 check_is_fitted(self)
421
--> 422 return self.estimators_[0]._validate_X_predict(X, check_inpu
t=True)
423
424 @property
~/code/zero-to-mastery-ml/env/lib/python3.9/site-packages/sklearn/tree/_clas
ses.py in _validate_X_predict(self, X, check_input)
405 """Validate the training data on predict (probabilities)."""
406 if check_input:
--> 407 X = self._validate_data(X, dtype=DTYPE, accept_sparse="c
sr",
408 reset=False)
409 if issparse(X) and (X.indices.dtype != np.intc or
~/code/zero-to-mastery-ml/env/lib/python3.9/site-packages/sklearn/base.py in
_validate_data(self, X, y, reset, validate_separately, **check_params)
419 out = X
420 elif isinstance(y, str) and y == 'no_validation':
--> 421 X = check_array(X, **check_params)
422 out = X
423 else:
~/code/zero-to-mastery-ml/env/lib/python3.9/site-packages/sklearn/utils/vali
dation.py in inner_f(*args, **kwargs)
61 extra_args = len(args) - len(all_args)
62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
64
65 # extra_args > 0
~/code/zero-to-mastery-ml/env/lib/python3.9/site-packages/sklearn/utils/vali
dation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, o
rder, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensur
e_min_features, estimator)
692 # If input is 1D raise error
693 if array.ndim == 1:
--> 694 raise ValueError(
695 "Expected 2D array, got 1D array instead:\narray
={}.\n"
696 "Reshape your data either using array.reshape(-
1, 1) if "
In [11]:
# In order to predict a label, data has to be in the same shape as X_train
X_test.head()
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 7/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
Out[11]: age sex cp trestbps chol fbs restecg thalach exang oldpeak slope
In [12]:
# Use the model to make a prediction on the test data (further evaluation)
y_preds = clf.predict(X_test)
Each model or estimator has a built-in score method. This method compares how
well the model was able to learn the patterns between the features and labels. In
other words, it returns how accurate your model is.
In [13]:
# Evaluate the model on the training set
clf.score(X_train, y_train)
Out[13]: 1.0
In [14]:
# Evaluate the model on the test set
clf.score(X_test, y_test)
Out[14]: 0.8552631578947368
There are also a number of other evaluation methods we can use for our models.
In [15]:
from sklearn.metrics import classification_report, confusion_matrix, accur
print(classification_report(y_test, y_preds))
accuracy 0.86 76
macro avg 0.86 0.84 0.85 76
weighted avg 0.86 0.86 0.85 76
In [16]:
conf_mat = confusion_matrix(y_test, y_preds)
conf_mat
In [17]:
accuracy_score(y_test, y_preds)
Out[17]: 0.8552631578947368
5. Experiment to improve
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 8/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
Once you've got a baseline model, like we have here, it's important to remember,
this is often not the final model you'll use.
The next step in the workflow is to try and improve upon your baseline model.
And to do this, there's two ways to look at it. From a model perspective and from a
data perspective.
From a model perspective this may involve things such as using a more complex
model or tuning your models hyperparameters.
From a data perspective, this may involve collecting more data or better quality
data so your existing model has more of a chance to learn the patterns within.
If you're already working on an existing dataset, it's often easier try a series of
model perspective experiments first and then turn to data perspective experiments
if you aren't getting the results you're looking for.
Different models you use will have different hyperparameters you can tune. For the
case of our model, the RandomForestClassifier() , we'll start trying different
values for n_estimators .
In [18]:
# Try different numbers of estimators (trees)... (no cross-validation)
np.random.seed(42)
for i in range(10, 100, 10):
print(f"Trying model with {i} estimators...")
model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
print(f"Model accuracy on test set: {model.score(X_test, y_test) * 100
print("")
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 9/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
Trying model with 80 estimators...
Model accuracy on test set: 86.8421052631579%
In [19]:
from sklearn.model_selection import cross_val_score
# With cross-validation
np.random.seed(42)
for i in range(10, 100, 10):
print(f"Trying model with {i} estimators...")
model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
print(f"Model accuracy on test set: {model.score(X_test, y_test) * 100
print(f"Cross-validation score: {np.mean(cross_val_score(model, X, y,
print("")
In [20]:
# Another way to do it with GridSearchCV...
np.random.seed(42)
from sklearn.model_selection import GridSearchCV
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 10/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
In [21]:
# Set the model to be the best estimator
clf = grid.best_estimator_
clf
Out[21]: RandomForestClassifier(n_estimators=80)
In [22]:
# Fit the best model
clf = clf.fit(X_train, y_train)
In [23]:
# Find the best model scores
clf.score(X_test, y_test)
Out[23]: 0.8552631578947368
This may come in the form of a teammate or colleague trying to replicate and
validate your results or through a customer using your model as part of a service or
application you offer.
Saving a model also allows you to reuse it later without having to go through
retraining it. Which is helpful, especially when your training times start to increase.
You can save a scikit-learn model using Python's in-built pickle module.
In [24]:
import pickle
In [25]:
# Load a saved model and make a prediction
loaded_model = pickle.load(open("random_forest_model_1.pkl", "rb"))
loaded_model.score(X_test, y_test)
Out[25]: 0.8421052631578947
In [26]:
# Splitting the data into X & y
heart_disease.head()
Out[26]: age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 11/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
In [27]:
X = heart_disease.drop('target', axis=1)
X
Out[27]: age sex cp trestbps chol fbs restecg thalach exang oldpeak slope
... ... ... ... ... ... ... ... ... ... ... ...
In [28]:
y = heart_disease['target']
y
Out[28]: 0 1
1 1
2 1
3 1
4 1
..
298 0
299 0
300 0
301 0
302 0
Name: target, Length: 303, dtype: int64
In [29]:
# Splitting the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.2) # you ca
In [30]:
# 80% of data is being used for the test set
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 12/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
X.shape[0] * 0.8
Out[30]: 242.4
In [31]:
# Import car-sales-extended.csv
car_sales = pd.read_csv("../data/car-sales-extended.csv")
car_sales
In [32]:
car_sales.dtypes
In [33]:
# Split into X & y and train/test
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]
In [34]:
# Try to predict with random forest on price column (doesn't work)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-34-ecb56ad8f06d> in <module>
3
4 model = RandomForestRegressor()
----> 5 model.fit(X_train, y_train)
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 13/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
6 model.score(X_test, y_test)
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklea
rn/ensemble/_forest.py in fit(self, X, y, sample_weight)
293 """
294 # Validate or convert input data
--> 295 X = check_array(X, accept_sparse="csc", dtype=DTYPE)
296 y = check_array(y, accept_sparse='csc', ensure_2d=False, dty
pe=None)
297 if sample_weight is not None:
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklea
rn/utils/validation.py in check_array(array, accept_sparse, accept_large_spa
rse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_s
amples, ensure_min_features, warn_on_dtype, estimator)
529 array = array.astype(dtype, casting="unsafe", co
py=False)
530 else:
--> 531 array = np.asarray(array, order=order, dtype=dty
pe)
532 except ComplexWarning:
533 raise ValueError("Complex data not supported\n"
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/nump
y/core/_asarray.py in asarray(a, dtype, order)
83
84 """
---> 85 return array(a, dtype, copy=False, order=order)
86
87
In [ ]:
# Turn the categories (Make and Colour) into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
In [ ]:
transformed_X[0]
In [ ]:
X.iloc[0]
In [ ]:
# Another way... using pandas and pd.get_dummies()
car_sales.head()
In [ ]:
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies
In [ ]:
# Have to convert doors to object for dummies to work on it...
car_sales["Doors"] = car_sales["Doors"].astype(object)
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies
In [ ]:
# The categorical categories are now either 1 or 0...
X["Make"].value_counts()
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 14/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
In [ ]:
# Let's refit the model
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transformed_X,
y,
test_size=0.2)
model.fit(X_train, y_train)
In [ ]:
model.score(X_test, y_test)
There are two main options when dealing with missing values.
1. Fill them with some given value. For example, you might fill missing values of a
numerical column with the mean of all the other values. The practice of filling
missing values is often referred to as imputation.
2. Remove them. If a row has missing values, you may opt to remove them
completely from your sample completely. However, this potentially results in
using less data to build your model.
Note: Dealing with missing values is a problem to problem issue. And there's often
no best way to do it.
In [35]:
# Import car sales dataframe with missing values
car_sales_missing = pd.read_csv("../data/car-sales-extended-missing-data.c
car_sales_missing
In [36]:
car_sales_missing.isna().sum()
Out[36]: Make 49
Colour 50
Odometer (KM) 50
Doors 50
Price 50
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 15/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
Price 50
dtype: int64
In [37]:
# Let's convert the categorical columns to one hot encoded (code copied fro
# Turn the categories (Make and Colour) into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-37-2a49b486c91e> in <module>
10 categorical_features)],
11 remainder="passthrough")
---> 12 transformed_X = transformer.fit_transform(car_sales_missing)
13 transformed_X
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklea
rn/compose/_column_transformer.py in fit_transform(self, X, y)
516 self._validate_remainder(X)
517
--> 518 result = self._fit_transform(X, y, _fit_transform_one)
519
520 if not result:
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklea
rn/compose/_column_transformer.py in _fit_transform(self, X, y, func, fitte
d)
455 message=self._log_message(name, idx, len(transfo
rmers)))
456 for idx, (name, trans, column, weight) in enumerate(
--> 457 self._iter(fitted=fitted, replace_strings=Tr
ue), 1))
458 except ValueError as e:
459 if "Expected 2D array, got 1D array instead" in str(e):
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/jobli
b/parallel.py in __call__(self, iterable)
1002 # remaining jobs.
1003 self._iterating = False
-> 1004 if self.dispatch_one_batch(iterator):
1005 self._iterating = self._original_iterator is not Non
e
1006
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/jobli
b/parallel.py in dispatch_one_batch(self, iterator)
833 return False
834 else:
--> 835 self._dispatch(tasks)
836 return True
837
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/jobli
b/parallel.py in _dispatch(self, batch)
752 with self._lock:
753 job_idx = len(self._jobs)
--> 754 job = self._backend.apply_async(batch, callback=cb)
755 # A job can complete so quickly than its callback is
756 # called before we get here, causing self._jobs to
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/jobli
b/_parallel_backends.py in apply_async(self, func, callback)
207 def apply_async(self, func, callback=None):
208 """Schedule a func to be run"""
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 16/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
--> 209 result = ImmediateResult(func)
210 if callback:
211 callback(result)
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/jobli
b/_parallel_backends.py in __init__(self, batch)
588 # Don't delay the application, to avoid keeping the input
589 # arguments in memory
--> 590 self.results = batch()
591
592 def get(self):
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/jobli
b/parallel.py in __call__(self)
254 with parallel_backend(self._backend, n_jobs=self._n_jobs):
255 return [func(*args, **kwargs)
--> 256 for func, args, kwargs in self.items]
257
258 def __len__(self):
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/jobli
b/parallel.py in <listcomp>(.0)
254 with parallel_backend(self._backend, n_jobs=self._n_jobs):
255 return [func(*args, **kwargs)
--> 256 for func, args, kwargs in self.items]
257
258 def __len__(self):
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklea
rn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsn
ame, message, **fit_params)
726 with _print_elapsed_time(message_clsname, message):
727 if hasattr(transformer, 'fit_transform'):
--> 728 res = transformer.fit_transform(X, y, **fit_params)
729 else:
730 res = transformer.fit(X, y, **fit_params).transform(X)
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklea
rn/preprocessing/_encoders.py in fit_transform(self, X, y)
370 """
371 self._validate_keywords()
--> 372 return super().fit_transform(X, y)
373
374 def transform(self, X):
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklea
rn/base.py in fit_transform(self, X, y, **fit_params)
569 if y is None:
570 # fit method of arity 1 (unsupervised transformation)
--> 571 return self.fit(X, **fit_params).transform(X)
572 else:
573 # fit method of arity 2 (supervised transformation)
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklea
rn/preprocessing/_encoders.py in fit(self, X, y)
345 """
346 self._validate_keywords()
--> 347 self._fit(X, handle_unknown=self.handle_unknown)
348 self.drop_idx_ = self._compute_drop_idx()
349 return self
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklea
rn/preprocessing/_encoders.py in _fit(self, X, handle_unknown)
72
73 def _fit(self, X, handle_unknown='error'):
---> 74 X_list, n_samples, n_features = self._check_X(X)
75
76 if self.categories != 'auto':
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklea
rn/preprocessing/_encoders.py in _check_X(self, X)
59 Xi = self._get_feature(X, feature_idx=i)
60 Xi = check_array(Xi, ensure_2d=False, dtype=None,
---> 61 force_all_finite=needs_validation)
62 X_columns.append(Xi)
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 17/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
63
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklea
rn/utils/validation.py in check_array(array, accept_sparse, accept_large_spa
rse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_s
amples, ensure_min_features, warn_on_dtype, estimator)
576 if force_all_finite:
577 _assert_all_finite(array,
--> 578 allow_nan=force_all_finite == 'allow-
nan')
579
580 if ensure_min_samples > 0:
~/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-packages/sklea
rn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
63 elif X.dtype == np.dtype('object') and not allow_nan:
64 if _object_dtype_isnan(X).any():
---> 65 raise ValueError("Input contains NaN")
66
67
Ahh... this doesn't work. We'll have to either fill or remove the missing values.
In [38]:
car_sales_missing.isna().sum()
Out[38]: Make 49
Colour 50
Odometer (KM) 50
Doors 50
Price 50
dtype: int64
We could fill Price with the mean, however, since it's the target variable, we don't
want to be introducing too many fake labels.
Note: The practice of filling missing data is called imputation. And it's important to
remember there's no perfect way to fill missing data. The methods we're using are
only one of many. The techniques you use will depend heavily on your dataset. A
good place to look would be searching for "data imputation techniques".
In [39]:
# Fill the "Make" column
car_sales_missing["Make"].fillna("missing", inplace=True)
In [40]:
# Fill the "Colour" column
car_sales_missing["Colour"].fillna("missing", inplace=True)
In [41]:
# Fill the "Odometer (KM)" column
car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)
In [42]:
# Fill the "Doors" column
car_sales_missing["Doors"].fillna(4, inplace=True)
[ ]
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 18/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
In [43]:
# Check our dataframe
car_sales_missing.isna().sum()
Out[43]: Make 0
Colour 0
Odometer (KM) 0
Doors 0
Price 50
dtype: int64
In [44]:
# Remove rows with missing Price labels
car_sales_missing.dropna(inplace=True)
In [45]:
car_sales_missing.isna().sum()
Out[45]: Make 0
Colour 0
Odometer (KM) 0
Doors 0
Price 0
dtype: int64
We've removed the rows with missing Price values, now there's less data but there's
no more missing values.
In [46]:
len(car_sales_missing)
Out[46]: 950
In [47]:
# Now let's one-hot encode the categorical columns (copied from above)
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
In [48]:
transformed_X[0]
And we can use it to fill the missing values in our DataFrame as above.
In [49]:
car_sales_missing.isna().sum()
Out[49]: Make 0
Colour 0
Odometer (KM) 0
Doors 0
Price 0
dtype: int64
Let's reimport it so it has missing values and we can fill them with Scikit-Learn.
In [50]:
# Reimport the DataFrame
car_sales_missing = pd.read_csv("../data/car-sales-extended-missing-data.c
car_sales_missing.isna().sum()
Out[50]: Make 49
Colour 50
Odometer (KM) 50
Doors 50
Price 50
dtype: int64
In [51]:
# Drop the rows with missing in the "Price" column
car_sales_missing.dropna(subset=["Price"], inplace=True)
In [52]:
car_sales_missing.isna().sum()
Out[52]: Make 47
Colour 46
Odometer (KM) 48
Doors 47
Price 0
dtype: int64
In [53]:
# Split into X and y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]
Note: We split data into train & test to perform filling missing values on them
separately.
In [54]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
In [55]:
# Fill categorical values with 'missing' & numerical with mean
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 20/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer = SimpleImputer(strategy="mean")
In [56]:
# Define different column features
categorical_features = ["Make", "Colour"]
door_feature = ["Doors"]
numerical_feature = ["Odometer (KM)"]
In [57]:
imputer = ColumnTransformer([
("cat_imputer", cat_imputer, categorical_features),
("door_imputer", door_imputer, door_feature),
("num_imputer", num_imputer, numerical_feature)])
In [58]:
# Get our transformed data array's back into DataFrame's
car_sales_filled_train = pd.DataFrame(filled_X_train,
columns=["Make", "Colour", "Doors",
car_sales_filled_test = pd.DataFrame(filled_X_test,
columns=["Make", "Colour", "Doors",
Out[58]: Make 0
Colour 0
Doors 0
Odometer (KM) 0
dtype: int64
In [59]:
# Check to see the original... still missing values
car_sales_missing.isna().sum()
Out[59]: Make 47
Colour 46
Odometer (KM) 48
Doors 47
Price 0
dtype: int64
In [60]:
# Now let's one hot encode the features with the same code as before
categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
one_hot,
categorical_features)],
i )
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 21/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
remainder="passthrough")
In [61]:
# Now we've transformed X, let's see if we can fit a model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
Out[61]: 0.21229043336119102
If this looks confusing, don't worry, we've covered a lot of ground very quickly. And
we'll revisit these strategies in a future section in way which makes a lot more
sense.
Most datasets you come across won't be in a form ready to immediately start
using them with machine learning models. And some may take more
preparation than others to get ready to use.
For most machine learning models, your data has to be numerical. This will
involve converting whatever you're working with into numbers. This process is
often referred to as feature engineering or feature encoding.
Some machine learning models aren't compatible with missing data. The
process of filling missing data is referred to as data imputation.
If you know what kind of problem you're working with, one of the next places you
should look at is the Scikit-Learn algorithm cheatsheet.
This cheatsheet gives you a bit of an insight into the algorithm you might want to
use for the problem you're working on.
It's important to remember, you don't have to explicitly know what each algorithm
is doing on the inside to start using them. If you do start to apply different
algorithms but they don't seem to be working, that's when you'd start to look
deeper into each one.
Let's check out the cheatsheet and follow it for some of the problems we're
working on.
You can see it's split into four main categories. Regression, classification, clustering
and dimensionality reduction. Each has their own different purpose but the Scikit-
Learn team has designed the library so the workflows for each are relatively similar.
Let's start with a regression problem. We'll use the Boston housing dataset built
into Scikit-Learn's datasets module.
Since it's in a dictionary, let's turn it into a DataFrame so we can inspect it better.
In [63]:
boston_df = pd.DataFrame(boston["data"], columns=boston["feature_names"])
boston_df["target"] = pd.Series(boston["target"])
boston_df.head()
Out[63]: CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 23/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7
In [64]:
# How many samples?
len(boston_df)
Out[64]: 506
Beautiful, our goal here is to use the feature columns, such as CRIM , which is the
per capita crime rate by town, AGE , the proportion of owner-occupied units built
prior to 1940 and more to predict the target column. Where the target
column is the median house prices.
In essence, each row is a different town in Boston (the data) and we're trying to
build a model to predict the median house price (the label) of a town given a series
of attributes about the town.
Since we have data and labels, this is a supervised learning problem. And since
we're trying to predict a number, it's a regression problem.
Knowing these two things, how do they line up on the Scikit-Learn machine
learning algorithm cheat-sheet?
In [65]:
# Import the Ridge model class from the linear_model module
from sklearn.linear_model import Ridge
Out[65]: 0.6662221670168518
One of the most common and useful ensemble methods is the Random Forest.
Known for its fast training and prediction times and adaptibility to different
problems.
An in-depth discussion of the Random Forest algorithm is beyond the scope of this
notebook but if you're interested in learning more, An Implementation and
Explanation of the Random Forest in Python by Will Koehrsen is a great read.
We can use the exact same workflow as above. Except for changing the model.
In [66]:
# Import the RandomForestRegressor model class from the ensemble module
from sklearn.ensemble import RandomForestRegressor
Out[66]: 0.873969014117403
Woah, we get a boost in score on the test set of almost 0.2 with a change of model.
At first, the diagram can seem confusing. But once you get a little practice applying
different models to different problems, you'll start to pick up which sorts of
algorithms do better with different types of data.
Say you were trying to predict whether or not a patient had heart disease based on
their medical records.
In [67]:
heart_disease = pd.read_csv("../data/heart-disease.csv")
heart_disease.head()
Out[67]: age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca
In [68]:
# How many samples are there?
len(heart_disease)
Out[68]: 303
Similar to the Boston housing dataset, here we want to use all of the available data
to predict the target column (1 for if a patient has heart disease and 0 for if they
don't).
So what do we know?
We've got 303 samples (1 row = 1 sample) and we're trying to predict whether or
not a patient has heart disease.
Because we're trying to predict whether each sample is one thing or another, we've
got a classification problem.
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 26/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
In [69]:
# Import LinearSVC from the svm module
from sklearn.svm import LinearSVC
/Users/daniel/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-pa
ckages/sklearn/svm/_base.py:947: ConvergenceWarning: Liblinear failed to con
verge, increase the number of iterations.
"the number of iterations.", ConvergenceWarning)
Out[69]: 0.47540983606557374
Straight out of the box (with no tuning or improvements) the model scores 47%
accuracy, which with 2 classes (heart disease or not) is as good as guessing.
With this result, we'll go back to our diagram and see what our options are.
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 27/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
Following the path (and skipping a few, don't worry, we'll get to this) we come up
to EnsembleMethods again. Except this time, we'll be looking at ensemble
classifiers instead of regressors.
Let's try.
In [70]:
# Import the RandomForestClassifier model class from the ensemble module
from sklearn.ensemble import RandomForestClassifier
Out[70]: 0.8524590163934426
One thing to remember, is both models are yet to receive any hyperparameter
tuning. Hyperparameter tuning is fancy term for adjusting some settings on a
model to try and make it better. It usually happens once you've found a decent
baseline result you'd like to improve upon.
In this case, we'd probably take the RandomForestClassifier and try and
improve it with hyperparameter tuning (which we'll see later on).
Why?
The first reason is time. Covering every single one would take a fair bit longer than
what we've done here. And the second one is the effectiveness of ensemble
methods.
For this notebook, we're focused on structured data, which is why the Random
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 28/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
Forest has been our model of choice.
If you'd like to learn more about the Random Forest and why it's the war horse of
machine learning, check out these resources:
And since a big part of being a machine learning engineer or data scientist is
experimenting, you might want to try out some of the other models on the cheat-
sheet and see how you go. The more you can reduce the time between
experiments, the better.
Data
Feature variables
Features
Labels
Target variable
Let's revisit the example of using patient data ( X ) to predict whether or not they
have heart disease ( y ).
In [71]:
# Import the RandomForestClassifier model class from the ensemble module
from sklearn.ensemble import RandomForestClassifier
# Call the fit method on the model and pass it training data
clf.fit(X_train, y_train)
Out[71]: 0.8524590163934426
Calling the fit() method will cause the machine learning algorithm to attempt
to find patterns between X and y . Or if there's no y , it'll only find the patterns
within X .
Let's see X .
In [72]:
X.head()
Out[72]: age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca
And y .
In [73]:
y.head()
Out[73]: 0 1
1 1
2 1
3 1
4 1
Name: target, dtype: int64
Passing X and y to fit() will cause the model to go through all of the
examples in X (data) and see what their corresponding y (label) is.
How the model does this is different depending on the model you use.
For now, you could imagine it similar to how you would figure out patterns if you
had enough time.
You'd look at the feature variables, X , the age , sex , chol (cholesterol) and
see what different values led to the labels, y , 1 for heart disease, 0 for not
heart disease.
A machine learning algorithm looks at a dataset, finds patterns, tries to use those
patterns to predict something and corrects itself as best it can with the available
data and labels. It stores these patterns for later use.
A machine learning algorithm uses the patterns its previously learned in a dataset
to make a prediction on some unseen data.
Scikit-Learn enables this in several ways. Two of the most common and useful are
predict() and predict_proba() .
In [74]:
# Use a trained model to make predictions
clf.predict(X_test)
Out[74]: array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])
Given data in the form of X , the predict() function returns labels in the form of
y.
It's standard practice to save these predictions to a variable named something like
y_preds for later comparison to y_test or y_true (usually same as y_test
just another name).
In [75]:
# Compare predictions to truth
y_preds = clf.predict(X_test)
np.mean(y_preds == y_test)
Out[75]: 0.8524590163934426
In [76]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_preds)
Out[76]: 0.8524590163934426
Note: For the predict() function to work, it must be passed X (data) in the
same format the model was trained on. Anything different and it will return an
error.
In [77]:
# Return probabilities rather than labels
clf.predict_proba(X_test[:5])
In [78]:
# Return labels
clf.predict(X_test[:5])
In [79]:
# Find prediction probabilities for 1 sample
clf.predict_proba(X_test[:1])
This output means the sample X_test[:1] , the model is predicting label 0 (index
0) with a probability score of 0.9.
Because the score is over 0.5, when using predict() , a label of 0 is assigned.
In [80]:
# Return the label for 1 sample
clf.predict(X_test[:1])
Out[80]: array([0])
Because our problem is a binary classification task (heart disease or not heart
disease), predicting a label with 0.5 probability every time would be the same as a
coin toss (guessing). Therefore, once the prediction probability of a sample passes
0.5, for a certain label, it's assigned that label.
In [81]:
# Import the RandomForestRegressor model class from the ensemble module
from sklearn.ensemble import RandomForestRegressor
# Make predictions
y_preds = model.predict(X_test)
In [82]:
# Compare the predictions to the truth
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_preds)
Out[82]: 2.1226372549019623
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 32/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
Now we've seen how to get a model how to find patterns in data using the fit()
function and make predictions using what its learned using the predict() and
predict_proba() functions, it's time to evaluate those predictions.
4. Evaluating a model
Once you've trained a model, you'll want a way to measure how trustworthy its
predictions are.
The scoring function you use will also depend on the problem you're working on.
In [83]:
# Import the RandomForestClassifier model class from the ensemble module
from sklearn.ensemble import RandomForestClassifier
# Call the fit method on the model and pass it training data
clf.fit(X_train, y_train);
Once the model has been fit on the training data ( X_train , y_train ), we can
call the score() method on it and evaluate our model on the test data, data the
model has never seen before ( X_test , y_test ).
In [84]:
# Check the score of the model (on the test set)
clf score(X test y test)
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 33/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
clf.score(X_test, y_test)
Out[84]: 0.8524590163934426
You can find this by pressing SHIFT + TAB within the brackets of score() when
called on a model instance.
Behind the scenes, score() makes predictions on X_test using the trained
model and then compares those predictions to the actual labels y_test .
A model which predicts everything 100% correct would receive a score of 1.0 (or
100%).
Our model doesn't get everything correct, but at 85% (0.85 * 100), it's still far better
than guessing.
Let's do the same but with the regression code from above.
In [85]:
# Import the RandomForestRegressor model class from the ensemble module
from sklearn.ensemble import RandomForestRegressor
Due to the consistent design of the Scikit-Learn library, we can call the same
score() method on model .
In [86]:
# Check the score of the model (on the test set)
model.score(X_test, y_test)
Out[86]: 0.873969014117403
Remember, you can find this by pressing SHIFT + TAB within the brackets of
score() when called on a model instance.
The best possible value here is 1.0, this means the model predicts the target
regression values exactly.
Calling the score() method on any model instance and passing it test data is a
good quick way to see how your model is going.
However, when you get further into a problem, it's likely you'll want to start using
more powerful metrics to evaluate your models performance.
As you may have guessed, the scoring parameter you set will be different
depending on the problem you're working on.
We'll see some specific examples of different parameters in a moment but first let's
check out cross_val_score() .
To do so, we'll copy the heart disease classification code from above and then add
another line at the top.
In [87]:
# Import cross_val_score from the model_selection module
from sklearn.model_selection import cross_val_score
# Call the fit method on the model and pass it training data
clf.fit(X_train, y_train);
In [88]:
# Using score()
clf.score(X_test, y_test)
Out[88]: 0.8524590163934426
In [89]:
# Using cross_val_score()
cross_val_score(clf, X, y)
Remember, you can see the parameters of a function using SHIFT + TAB from
within the brackets.
We've dealt with Figure 1.0 before using score(X_test, y_test) . But looking
deeper into this, if a model is trained using the training data or 80% of samples, this
means 20% of samples aren't used for the model to learn anything.
This also means depending on what 80% is used to train on and what 20% is used
to evaluate the model, it may achieve a score which doesn't reflect the entire
dataset. For example, if a lot of easy examples are in the 80% training data, when it
comes to test on the 20%, your model may perform poorly. The same goes for the
reverse.
Figure 2.0 shows 5-fold cross-validation, a method which tries to provide a solution
to:
Instead of training only on 1 training split and evaluating on 1 testing split, 5-fold
cross-validation does it 5 times. On a different split each time, returning a score for
each.
Why 5-fold?
In [90]:
# 5-fold cross-validation
cross_val_score(clf, X, y, cv=5) # cv is equivalent to K
Since we set cv=5 (5-fold cross-validation), we get back 5 different scores instead
of 1.
Taking the mean of this array gives us a more in-depth idea of how our model is
performing by converting the 5 scores into one.
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 36/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
In [91]:
np.random.seed(42)
clf_single_score, clf_cross_val_score
In this case, if you were asked to report the accuracy of your model, even though
it's lower, you'd prefer the cross-validated metric over the non-cross-validated
metric.
Wait?
In [92]:
cross_val_score(clf, X, y, cv=5, scoring=None) # default scoring
When scoring is set to None (by default), it uses the same metric as score()
for whatever model is passed to cross_val_score() .
You can change the evaluation score cross_val_score() uses by changing the
scoring parameter.
And as you might have guessed, different problems call for different evaluation
scores.
1. Accuracy
2. Area under ROC curve
3. Confusion matrix
4. Classification report
Let's have a look at each of these. We'll bring down the classification code from
above to go through some examples.
In [93]:
# Import cross_val_score from the model_selection module
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
np.random.seed(42)
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 37/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
Out[93]: 0.8524590163934426
Accuracy
Accuracy is the default metric for the score() function within each of Scikit-
Learn's classifier models. And it's probably the metric you'll see most often used for
classification problems.
However, we'll see in a second how it may not always be the best metric to use.
In [94]:
# Accuracy as percentage
print(f"Heart Disease Classifier Accuracy: {clf.score(X_test, y_test) * 100
It's usually referred to as AUC for Area Under Curve and the curve they're talking
about is the Receiver Operating Characteristic or ROC for short.
So if hear someone talking about AUC or ROC, they're probably talking about what
follows.
ROC curves are a comparison of true postive rate (tpr) versus false positive rate
(fpr).
For clarity:
Now we know this, let's see one. Scikit-Learn lets you calculate the information
required for a ROC curve using the roc_curve function.
In [95]:
from sklearn.metrics import roc_curve
Out[95]: array([0. , 0. , 0. , 0. , 0. ,
0.03448276, 0.03448276, 0.03448276, 0.03448276, 0.06896552,
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 38/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
0.06896552, 0.10344828, 0.13793103, 0.13793103, 0.17241379,
0.17241379, 0.27586207, 0.4137931 , 0.48275862, 0.55172414,
0.65517241, 0.72413793, 0.72413793, 0.82758621, 1. ])
Looking at these on their own doesn't make much sense. It's much easier to see
their value visually.
Since Scikit-Learn doesn't have a built-in function to plot a ROC curve, quite often,
you'll find a function (or write your own) like the one below.
In [96]:
import matplotlib.pyplot as plt
plot_roc_curve(fpr, tpr)
Looking at the plot for the first time, it might seem a bit confusing.
The main thing to take away here is our model is doing far better than guessing.
A metric you can use to quantify the ROC curve in a single number is AUC (Area
Under Curve). Scikit-Learn implements a function to caculate this called
roc_auc_score() .
The maximum ROC AUC score you can achieve is 1.0 and generally, the closer to
1.0, the better the model.
In [97]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, y_probs)
Out[97]: 0.9304956896551724
The most ideal position for a ROC curve to run along the top left corner of the plot.
This would mean the model predicts only true positives and no false positives And
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 39/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
This would mean the model predicts only true positives and no false positives. And
would result in a ROC AUC score of 1.0.
You can see this by creating a ROC curve using only the y_test labels.
In [98]:
# Plot perfect ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_test)
plot_roc_curve(fpr, tpr)
In [99]:
# Perfect ROC AUC score
roc_auc_score(y_test, y_test)
Out[99]: 1.0
Confusion matrix
The next way to evaluate a classification model is by using a confusion matrix.
A confusion matrix is a quick way to compare the labels a model predicts and the
actual labels it was supposed to predict. In essence, giving you an idea of where the
model is getting confused.
In [100…
from sklearn.metrics import confusion_matrix
y_preds = clf.predict(X_test)
confusion_matrix(y_test, y_preds)
In [101…
pd.crosstab(y_test,
y_preds,
rownames=["Actual Label"],
colnames=["Predicted Label"])
Actual Label
0 24 5
1 4 28
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 40/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
If you've never heard of Seaborn, it's a library which is built on top of Matplotlib. It
contains a bunch of helpful plotting functions.
And if you haven't got Seaborn installed, you can install it into the current
environment using:
In [102…
# import sys
# !conda install --yes --prefix {sys.prefix} seaborn
In [103…
# Plot a confusion matrix with Seaborn
import seaborn as sns
Ahh.. that plot isn't offering much. Let's add some commucation and functionise it.
Note: In the original notebook, the function below had the "True label" as the
x-axis label and the "Predicted label" as the y-axis label. But due to the way
confusion_matrix() outputs values, these should be swapped around. The
code below has been corrected.
In [104…
def plot_conf_mat(conf_mat):
"""
Plots a confusion matrix using Seaborn's heatmap().
"""
fig, ax = plt.subplots(figsize=(3, 3))
ax = sns.heatmap(conf_mat,
annot=True, # Annotate the boxes
cbar=False)
plt.xlabel('Predicted label')
plt.ylabel('True label');
plot_conf_mat(conf_mat)
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 41/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
We've got a bit more information here but... our numbers are looking a little off.
After a little digging, we figure out the version of Matplotlib we're using broke
Seaborn plots.
GitHub issue: Heatmaps are being truncated when using with seaborn
Note: The underlying issue here is the version of Matplotlib I'm using (3.1.1) is
what's causing the error. By the time you read this, a newer, fixed version may be
out.
Since we probably want to make a few confusion matrices, it makes sense to make
a function for plotting them.
In [105…
def plot_conf_mat(conf_mat):
"""
Plots a confusion matrix using Seaborn's heatmap().
"""
fig, ax = plt.subplots(figsize=(3, 3))
ax = sns.heatmap(conf_mat,
annot=True, # Annotate the boxes
cbar=False)
plt.xlabel('Predicted label')
plt.ylabel('True label')
plot_conf_mat(conf_mat)
An ideal confusion matrix no values out of the diagonal. This means all of models
predictions match the actual labels.
In [106…
# Create perfect confusion matrix
perfect_conf_mat = confusion_matrix(y_test, y_test)
plot_conf_mat(perfect_conf_mat)
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 42/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
In [107…
# Returns an error.... (at time of writing)
from sklearn.metrics import plot_confusion_matrix
plot_confusion_matrix(clf, X, y)
Classification report
The final major metric you should consider when evaluating a classification model is
a classification report.
In [108…
from sklearn.metrics import classification_report
print(classification_report(y_test, y_preds))
accuracy 0.85 61
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 43/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
macro avg 0.85 0.85 0.85 61
weighted avg 0.85 0.85 0.85 61
The number of rows will depend on how many different classes there are. But there
will always be three rows labell accuracy, macro avg and weighted avg.
For example, let's say there were 10,000 people. And 1 of them had a disease.
You're asked to build a model to predict who has it.
You build the model and find your model to be 99.99% accurate. Which sounds
great! ...until you realise, all its doing is predicting no one has the disease, in other
words all 10,000 predictions are false.
In this case, you'd want to turn to metrics such as precision, recall and F1 score.
In [109…
# Where precision and recall become valuable
disease_true = np.zeros(10000)
disease_true[0] = 1 # only one case
pd.DataFrame(classification_report(disease_true,
disease_preds,
output_dict=True))
/Users/daniel/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-pa
ckages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Prec
ision and F-score are ill-defined and being set to 0.0 in labels with no pre
dicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
Out[109… 0.0 1.0 accuracy macro avg weighted avg
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 44/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
You can see here, we've got an accuracy of 0.9999 (99.99%), great precision and
recall on class 0.0 but nothing for class 1.0.
To summarize:
Accuracy is a good measure to start with if all classes are balanced (e.g. same
amount of samples which are labelled with 0 or 1)
Precision and recall become more important when classes are imbalanced.
If false positive predictions are worse than false negatives, aim for higher
precision.
If false negative predictions are worse than false positives, aim for higher recall.
Let's see them in action. First, we'll bring down our regression model code again.
In [110…
# Import the RandomForestRegressor model class from the ensemble module
from sklearn.ensemble import RandomForestRegressor
O ' t t i d i d l th d f lt l ti t i i th
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 45/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
Once you've got a trained regression model, the default evaluation metric in the
score() function is R^2.
In [111…
# Calculate the models R^2 score
model.score(X_test, y_test)
Out[111… 0.873969014117403
In [112…
from sklearn.metrics import r2_score
r2_score(y_test, y_test_mean)
Out[112… 0.0
In [113…
r2_score(y_test, y_test)
Out[113… 1.0
For your regression models, you'll want to maximise R^2, whilst minimising MAE
and MSE.
In [114…
# Mean absolute error
from sklearn.metrics import mean_absolute_error
y_preds = model.predict(X_test)
mae = mean_absolute_error(y_test, y_preds)
mae
Out[114… 2.1226372549019623
Our model achieves an MAE of 2.203. This means, on average our models
predictions are 2.203 units away from the actual value.
In [115…
df = pd.DataFrame(data={"actual values": y_test,
"predictions": y_preds})
df
72 22 8 23 467
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 46/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
72 22.8 23.467
86 22.5 20.219
75 21.4 23.898
You can the predictions are slightly different to the actual values.
Depending what problem you're working on, having a difference like we do now,
might be okay. On the flip side, it may also not be okay, meaning the predictions
would have to be closer.
In [116…
fig, ax = plt.subplots()
x = np.arange(0, len(df), 1)
ax.scatter(x, df["actual values"], c='b', label="Acutual Values")
ax.scatter(x, df["predictions"], c='r', label="Predictions")
ax.legend(loc=(1, 0.5));
In [117…
# Mean squared error
from sklearn.metrics import mean_squared_error
Out[117… 9.242328990196082
MSE will always be higher than MAE because is squares the errors rather than only
taking the absolute difference into account.
Now you might be thinking, which regression evaluation metric should you use?
R^2 is similar to accuracy. It gives you a quick indication of how well your
model might be doing. Generally, the closer your R^2 value is to 1.0, the better
the model. But it doesn't really tell exactly how wrong your model is in terms of
how far off each prediction is.
MAE gives a better indication of how far off each of your model's predictions
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 47/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
g y p
are on average.
As for MAE or MSE, because of the way MSE is calculated, squaring the
differences between predicted values and actual values, it amplifies larger
differences. Let's say we're predicting the value of houses (which we are).
Pay more attention to MAE: When being $10,000 off is twice as bad as
being $5,000 off.
Pay more attention to MSE: When being $10,000 off is more than twice as
bad as being $5,000 off.
Note: What we've covered here is only a handful of potential metrics you can use
to evaluate your models. If you're after a complete list, check out the Scikit-Learn
metrics and scoring documentation.
Let's check it out with our classification model and the heart disease dataset.
In [118…
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
np.random.seed(42)
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]
clf = RandomForestClassifier(n_estimators=100)
In [119…
np.random.seed(42)
cv_acc = cross_val_score(clf, X, y, cv=5)
cv_acc
We've seen this before, now we got 5 different accuracy scores on different test
splits of the data.
In [120…
# Cross-validated accuracy
print(f"The cross-validated accuracy is: {np.mean(cv_acc)*100:.2f}%")
We can find the same using the scoring parameter and passing it "accuracy" .
In [121…
np.random.seed(42)
cv_acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"The cross-validated accuracy is: {np.mean(cv_acc)*100:.2f}%")
The same goes for the other metrics we've been using for classification.
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 48/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
Let's try "precision" .
In [122…
np.random.seed(42)
cv_precision = cross_val_score(clf, X, y, cv=5, scoring="precision")
print(f"The cross-validated precision is: {np.mean(cv_precision):.2f}")
In [123…
np.random.seed(42)
cv_recall = cross_val_score(clf, X, y, cv=5, scoring="recall")
print(f"The cross-validated recall is: {np.mean(cv_recall):.2f}")
In [124…
np.random.seed(42)
cv_f1 = cross_val_score(clf, X, y, cv=5, scoring="f1")
print(f"The cross-validated F1 score is: {np.mean(cv_f1):.2f}")
In [125…
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
np.random.seed(42)
X = boston_df.drop("target", axis=1)
y = boston_df["target"]
model = RandomForestRegressor(n_estimators=100)
In [126…
np.random.seed(42)
cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"The cross-validated R^2 score is: {np.mean(cv_r2):.2f}")
In [127…
np.random.seed(42)
cv_mae = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_erro
print(f"The cross-validated MAE score is: {np.mean(cv_mae):.2f}")
"All scorer objects follow the convention that higher return values are
better than lower return values."
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 49/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
In [128…
np.random.seed(42)
cv_mse = cross_val_score(model,
X,
y,
cv=5,
scoring="neg_mean_squared_error")
print(f"The cross-validated MSE score is: {np.mean(cv_mse):.2f}")
Well, we've kind of covered this third way of using evaulation metrics with Scikit-
Learn.
In essence, all of the metrics we've seen previously have their own function in
Scikit-Learn.
Classification functions
For:
In [129…
from sklearn.metrics import accuracy_score, precision_score, recall_score,
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
np.random.seed(42)
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
# Make predictions
y_preds = clf.predict(X_test)
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 50/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
The same goes for the regression problem.
Regression metrics
For:
In [130…
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_er
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
np.random.seed(42)
X = boston_df.drop("target", axis=1)
y = boston_df["target"]
model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)
# Make predictions
y_preds = model.predict(X_test)
Wow. We've covered a lot. But it's worth it. Because evaluating a model's
predictions is paramount in any machine learning project.
There's nothing worse than training a machine learning model and optimizing for
the wrong evaluation metric.
Keep the metrics and evaluation methods we've gone through when training your
future models.
If you're after extra reading, I'd go through the Scikit-Learn documentation for
evaluation metrics.
Now we've seen some different metrics we can use to evaluate a model, let's see
some ways we can improve those metrics.
Two of the main methods to improve baseline metrics are from a data perspective
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 51/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
Two of the main methods to improve baseline metrics are from a data perspective
and a model perspective.
Could we collect more data? In machine learning, more data is generally better,
as it gives a model more opportunities to learn patterns.
Could we improve our data? This could mean filling in misisng values or
finding a better encoding (turning things into numbers) strategy.
Is there a better model we could use? If you've started out with a simple
model, could you use a more complex one? (we saw an example of this when
looking at the Scikit-Learn machine learning map, ensemble methods are
generally considered more complex models)
Could we improve the current model? If the model you're using performs well
straight out of the box, can the hyperparameters be tuned to make it even
better?
Note: Patterns in data are also often referred to as data parameters. The difference
between parameters and hyperparameters is a machine learning model seeks to
find parameters in data on its own, where as, hyperparameters are settings on a
model which a user (you) can adjust.
Since we have two existing datasets, we'll come at exploration from a model
perspective.
In [131…
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
When you instantiate a model like above, you're using the default hyperparameters.
These get printed out when you call the model instance and get_params() .
In [132…
clf.get_params()
The same goes for imporving a machine learning model by hyperparameter tuning.
The default hyperparameters on a machine learning model may find patterns in
data well. But there's a chance a adjusting the hyperparameters may improve a
models performance.
Every machine learning model will have different hyperparameters you can tune.
And it's a good question. It's why we're focused on the Random Forest. Instead of
memorizing all of the hyperparameters for every model, we'll see how it's done with
one. And then knowing these principles, you can apply them to a different model if
needed.
Reading the Scikit-Learn documentation for the Random Forest, you'll find they
suggest trying to change n_estimators (the number of trees in the forest) and
min_samples_split (the minimum number of samples required to split an
internal node).
max_features (the number of features to consider when looking for the best
split)
max_depth (the maximum depth of the tree)
min_samples_leaf (the minimum number of samples required to be at a
leaf node)
If this still sounds like a lot, the good news is, the process we're taking with the
Random Forest and tuning its hyperparameters, can be used for other machine
learning models in Scikit-Learn. The only difference is, with a different model, the
hyperparameters you tune will be different.
To get familar with hyparameter tuning, we'll take our RandomForestClassifier and
adjust its hyperparameters in 3 ways.
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 53/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
1. By hand
2. Randomly with RandomSearchCV
3. Exhaustively with GridSearchCV
Now the process becomes, train a model on the training data, (try to) improve its
hyperparameters on the validation set and evaluate it on the test set.
If our starting dataset contained 100 different patient records labels indicating who
had heart disease and who didn't and we wanted to build a machine learning
model to predict who had heart disease and who didn't, it might look like this:
In [133…
clf.get_params()
max_depth
max_features
min_samples_leaf
min_samples_split
n_estimators
We'll use the same code as before, except this time we'll create a training,
validation and test split.
With the training set containing 70% of the data and the validation and test sets
each containing 15%.
Let's get some baseline results, then we'll tune the model.
And since we're going to be evaluating a few models, let's make an evaluation
function.
In [134…
def evaluate_preds(y_true, y_preds):
"""
Performs evaluation comparison on y_true labels vs. y_pred labels.
"""
accuracy = accuracy_score(y_true, y_preds)
precision = precision_score(y_true, y_preds)
recall = recall_score(y_true, y_preds)
f1 = f1_score(y_true, y_preds)
metric_dict = {"accuracy": round(accuracy, 2),
"precision": round(precision, 2),
"recall": round(recall, 2),
"f1": round(f1, 2)}
print(f"Acc: {accuracy * 100:.2f}%")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 score: {f1:.2f}")
return metric_dict
In [135…
from sklearn.metrics import accuracy_score, precision_score, recall_score,
from sklearn.ensemble import RandomForestClassifier
np.random.seed(42)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
# Make predictions
y_preds = clf.predict(X_valid)
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 55/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
Acc: 82.22%
Precision: 0.81
Recall: 0.88
F1 score: 0.85
Out[135… {'accuracy': 0.82, 'precision': 0.81, 'recall': 0.88, 'f1': 0.85}
In [136…
np.random.seed(42)
# Make predictions
y_preds_2 = clf_2.predict(X_valid)
Acc: 82.22%
Precision: 0.84
Recall: 0.84
F1 score: 0.84
Not bad! Slightly worse precision by slightly better recall and f1.
Wait...
This could take a while if all we're doing is building new models with new
hyperparameters each time.
There is.
In [137…
# Hyperparameter grid RandomizedSearchCV will search over
grid = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
"max_depth": [None, 5, 10, 20, 30],
"max_features": ["auto", "sqrt"],
"min_samples_split": [2, 4, 6],
"min_samples_leaf": [1, 2, 4]}
Made up?
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 56/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
Yes. Not completely pulled out of the air but after reading the Scikit-Learn
documentation on Random Forest's you'll see some of these values have certain
values which usually perform well and certain hyperparameters take strings rather
than integers.
Now we've got the grid setup, Scikit-Learn's RandomizedSearchCV will look at it,
pick a random value from each, instantiate a model with those values and test each
model.
Or...
The best thing? The results we get will be cross-validated (hence the CV in
RandomizedSearchCV ) so we can use train_test_split() .
And since we're going over so many different models, we'll set n_jobs to -1 of
RandomForestClassifier so Scikit-Learn takes advantage of all the cores
(processors) on our computers.
Note: Depending on n_iter (how many models you test), the different values in
the hyperparameter grid, and the power of your computer, running the cell below
may take a while.
In [138…
from sklearn.model_selection import RandomizedSearchCV, train_test_split
np.random.seed(42)
# Setup RandomizedSearchCV
rs_clf = RandomizedSearchCV(estimator=clf,
param_distributions=grid,
n_iter=20, # try 20 models total
cv=5, # 5-fold cross-validation
verbose=2) # print out results
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 57/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
In [139…
# Find the best hyperparameters found by RandomizedSearchCV
rs_clf.best_params_
In [140…
# Make predictions with the best hyperparameters
rs_y_preds = rs_clf.predict(X_test)
Acc: 83.61%
Precision: 0.78
Recall: 0.89
F1 score: 0.83
There's one more way we could try to improve our model's hyperparamters. And
it's with GridSearchCV .
In [141…
grid
And if you remember from before when we did the calculation: max_depth has 4,
max_features has 2, min_samples_leaf has 3, min_samples_split has 3,
n_estimators has 5.
This could take a long time depending on the power of the computer you're using,
the amount of data you have and the complexity of the hyperparamters (usually
higher values means a more complex model).
In our case, the data we're using is relatively small (only ~300 samples).
In [142…
# Another hyperparameter grid similar to rs_clf.best_params_
grid_2 = {'n_estimators': [1200, 1500, 2000],
'max_depth': [None, 5, 10],
'max_features': ['auto', 'sqrt'],
'min_samples_split': [4, 6],
'min_samples_leaf': [1, 2]}
We've created another grid of hyperparameters to search over, this time with less
total.
Now when we run GridSearchCV , passing it our classifier ( clf ), paramter grid
( grid_2 ) and the number of cross-validation folds we'd like to use ( cv ), it'll
create a model with every single combination of hyperparameters, 72 in total, and
check the results.
In [143
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 64/81
7/23/24, 1:26 PM zero-to-mastery-ml/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb at master · mrdbourke/zero-to-m…
In [143…
from sklearn.model_selection import GridSearchCV, train_test_split
np.random.seed(42)
# Setup GridSearchCV
gs_clf = GridSearchCV(estimator=clf,
param_grid=grid_2,
cv=5, # 5-fold cross-validation
verbose=2) # print out progress
https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-OLD.ipynb 81/81