Machine Learning Masterclass
Companion E-Book
EliteDataScience.com
COPYRIGHT NOTICE
Copyright © 2017+ EliteDataScience.com
ALL RIGHTS RESERVED.
This book or parts thereof may not be reproduced in any form, stored in any retrieval system, or transmitted in any form by any means—electronic, mechanical, photocopy, recording, or otherwise—without prior written permission of the publisher, except as provided by United States of America copyright law.
Contents
3 Data Cleaning
  3.1 Unwanted Observations
    3.1.1 Duplicate
    3.1.2 Irrelevant
  3.2 Structural Errors
    3.2.1 Wannabe indicator variables
    3.2.2 Typos and capitalization
    3.2.3 Mislabeled classes
  3.3 Unwanted Outliers
    3.3.1 Innocent until proven guilty
    3.3.2 Violin plots
    3.3.3 Manual check
  3.4 Missing Data
    3.4.1 "Common sense" is not sensible here
    3.4.2 Missing categorical data
    3.4.3 Missing numeric data
4 Feature Engineering
  4.1 Domain Knowledge
    4.1.1 Boolean masks
    4.1.2 Link with Exploratory Analysis
  4.2 Interaction Features
    4.2.1 Examples
  4.3 Sparse Classes
    4.3.1 Similar classes
    4.3.2 Other classes
  4.4 Dummy Variables
    4.4.1 Example: Project 2
    4.4.2 Get dummies
  4.5 Remove Unused
    4.5.1 Unused features
    4.5.2 Redundant features
    4.5.3 Analytical base table
  4.6 Rolling Up
5 Model Training
  5.1 Split Dataset
    5.1.1 Training and test sets
    5.1.2 Cross-validation (intuition)
    5.1.3 10-fold cross-validation
  5.2 Model Pipelines
    5.2.1 Preprocessing
    5.2.2 Standardization
    5.2.3 Preprocessing parameters
    5.2.4 Pipelines and cross-validation
    5.2.5 Pipeline dictionaries
  5.3 Declare Hyperparameters
    5.3.1 Model parameters vs. hyperparameters
    5.3.2 Hyperparameter grids (regression)
    5.3.3 Hyperparameter grids (classification)
    5.3.4 Hyperparameter dictionary
  5.4 Fit and tune models
    5.4.1 Grid search
    5.4.2 Looping through dictionaries
  5.5 Select Winner
    5.5.1 Cross-validated score
    5.5.2 Performance metrics (regression)
    5.5.3 Performance metrics (classification)
    5.5.4 Test performance
    5.5.5 Saving the winning model
6 Project Delivery
  6.1 Confirm Model
    6.1.1 Import model and load ABT
    6.1.2 Recreate training and test sets
    6.1.3 Predict the test set again
  6.2 Pre-Modeling Functions
    6.2.1 Data cleaning function
    6.2.2 Feature engineering function
  6.3 Model Class
    6.3.1 Python classes
    6.3.2 Custom class
  6.4 Model Deployment
    6.4.1 Jupyter notebook
    6.4.2 Executable script
7 Regression Algorithms
  7.1 The "Final Boss" of Machine Learning
    7.1.1 How to find the right amount of complexity
    7.1.2 How overfitting occurs
  7.2 Regularization
    7.2.1 Flaws of linear regression
    7.2.2 The number of features is too damn high!
    7.2.3 Cost functions
    7.2.4 Example: sum of squared errors
    7.2.5 Coefficient penalties
    7.2.6 L1 regularization
    7.2.7 L2 regularization
  7.3 Regularized Regression
    7.3.1 Lasso
    7.3.2 Ridge
    7.3.3 Elastic-Net
  7.4 Ensemble Methods
    7.4.1 Decision trees (kinda)
    7.4.2 Non-linear relationships
    7.4.3 Unconstrained decision trees
    7.4.4 Bagging vs. boosting
  7.5 Tree Ensembles
    7.5.1 Random forests
    7.5.2 Boosted trees
8 Classification Algorithms
  8.1 Binary Classification
    8.1.1 Positive / negative classes
    8.1.2 Class probabilities
  8.2 Noisy Conditional
    8.2.1 Methodology
    8.2.2 Synthetic dataset
  8.3 Logistic Regression
    8.3.1 Linear vs. logistic regression
    8.3.2 Predict vs. predict_proba
  8.4 Regularized Logistic Regression
    8.4.1 Penalty strength
    8.4.2 Penalty type
  8.5 Tree Ensembles
    8.5.1 Random forests
    8.5.2 Boosted trees
9 Clustering Algorithms
  9.1 K-Means
    9.1.1 Intuition
    9.1.2 Euclidean distance
    9.1.3 Number of clusters
  9.2 Feature Sets
    9.2.1 Creating feature sets
    9.2.2 Finding clusters
A Appendix
  A.1 Area Under ROC Curve
    A.1.1 Confusion matrix
    A.1.2 TPR and FPR
    A.1.3 Probability thresholds
    A.1.4 ROC curve
    A.1.5 AUROC
  A.2 Data Wrangling
    A.2.1 Customer-Level feature engineering
    A.2.2 Intermediary levels
    A.2.3 Joining together the ABT
  A.3 Dimensionality Reduction
    A.3.1 The Curse of Dimensionality
    A.3.2 Method 1: Thresholding
    A.3.3 Method 2: Principal Component Analysis
1. The Big Picture
Welcome to the Companion E-Book for The Machine Learning Masterclass by Elite Data Science.
This e-book is meant to be a guide that you can use to review all of the key concepts you’ve learned
throughout the course, and it assumes you’ve already completed at least some of the course.
Therefore, we’re going to jump right into what we consider to be the "heart of machine learning"
(and that’s no exaggeration).
You see, at its core, machine learning is about building models that can represent the world in some
important way. All of the fancy algorithms out there are really just trying to do 1 thing: learn these
representations (a.k.a. "models") from data.
As a result, there’s a concept called model complexity that is incredibly important. It really
determines how "good" a model is. (This is closely related to the bias-variance tradeoff, which you
may have also heard of. However, we’ve found the lens of model complexity to be more intuitive.)
Model complexity is one of the most important concepts for practical machine learning.
But before we can discuss model complexity, let's take a step back and talk about what a "model"
actually is.
Often, the terms "model" and "algorithm" get mixed up, but we want to be explicit about the difference.
In fact, let’s first put machine learning aside and talk about models at an abstract level.
Models are simplified representations of the real world that still accurately capture a desired relationship.
In Project 1, we gave the example of a child who learns that a stove-burner is painful to touch because he once stuck his hand over a candle.
• That connection, "red and bright means painful," is a model.
• It’s a simplified representation; the child didn’t need to know anything about fire, combustion,
thermodynamics, human anatomy, or nerve endings.
• For a model to be useful, it must still accurately capture a useful relationship.
There are different ways to build models, and machine learning is only one of them. Machine
learning builds models by learning patterns from data.
• Therefore, you can think of a model as an "output" of the machine learning process.
• In other words, models are the "lessons" learned from the data.
Another way to understand models is to think of them as mapping functions between input
features and target variables:
ŷ = f(x)
We don’t use a lot of math in this course, but it’s helpful to introduce some basic notation so that we
can have a shared language for illustrating these concepts.
• x is the input feature.
• ŷ is the predicted target variable.
• f() is the mapping function between x and ŷ.
Therefore, the purpose of machine learning is to estimate that mapping function based on the
dataset that you have.
So which type of algorithm should you use to build the model? Which algorithm "settings" should
you tune? How many features should you include?
These will be central questions that we’ll explore in this course, and SPOILER WARNING: the
answers are all tied to model complexity.
Model complexity is a model’s inherent flexibility to represent complex relationships between input
features and target variables.
This definition will become crystal clear in just a few moments when we get to the examples.
One practical challenge in real-world machine learning problems is dealing with noise.
• Think of noise as "randomness" in your data that you can’t control.
• For example, in the Iris dataset, there is natural fluctuation in petal sizes, even within the same
species.
Therefore, one way to judge models is by their ability to separate the signal from that noise.
• Think of the signal as the "true underlying relationship" between input features and target
variables.
• Different machine learning algorithms are simply different ways to estimate that signal.
We’re going to switch over to a toy example so we can really develop a practical intuition behind
model complexity.
1.2.1 Methodology
Remember, a model is only useful if it can accurately approximate the "true state of the world" (i.e.
the "true underlying relationship" between input features and target variables).
Therefore, to study this concept, we're going to create a synthetic dataset where we already know the "true underlying relationship."
1. First, we’re going to use a single input feature, x.
2. Then, we’re going to generate values for the target variable, y, based on a predetermined
mapping function.
3. Next, we’re going to add randomly generated noise to that dataset.
4. Once we've done that, we can try different algorithms on our synthetic dataset.
5. We already know the "true underlying relationship"... it’s the predetermined mapping function.
6. Finally, we can compare how well models of different complexities can separate the signal
from the randomly generated noise.
Let’s dive right in.
First, for our predetermined mapping function, we’ll use the sine wave.
y = sin(x)
However, we’re going to add random noise to it, turning it into a noisy sine wave:
y = sin(x) + ε
import numpy as np

# input feature
x = np.linspace(0, 2 * np.pi, 100)

# noise
np.random.seed(321)
noise = np.random.normal(0, .5, 100)

# target variable
y = np.sin(x) + noise
In the plot, the blue dots represent the synthetic dataset, with x on the x-axis and y = sin(x) + ε on the y-axis.
The smooth black line represents the sine wave f(x) = sin(x) without any noise. This is the "true underlying relationship" between x and y before adding random noise.
Next, we’ll see how well models of varying complexity are able to learn this signal.
Now that we have a synthetic dataset, let’s build some models and see how they do.
To be 100% clear, our goal is to build a model that can predict the target variable y from a given
value for the input feature x.
Let's start with the simplest model possible, which uses the following formula:
ŷ = ȳ
This literally means that we will always predict the average of all y’s in the dataset, no matter what
the value of x is.
• ŷ is the predicted value of y.
• ȳ is the average of all y’s in the dataset.
As you can guess, we call this the mean model because our prediction will always just be the mean
of the y’s in the dataset, regardless of x.
# Build model
pred = np.mean(df.y)
1.3.2 Analysis
To tell if this is a good model, we don’t even need to do any calculations... We can just look at the
plot and think about it intuitively.
import matplotlib.pyplot as plt

# Scatterplot of x and y
plt.scatter(df.x, df.y)
Figure 1.5: These observations will always be predicted poorly using a mean model.
Take a look at Figure 1.5. See those observations in the big red circles?
• Predictions for those values of x will always be poor, no matter how much data you collect!
• This type of model does not have the flexibility to learn the curves and slopes in the
dataset.
In other words, our mean model is underfit to the data. It’s not complex enough.
Think of linear regression as fitting a straight line to the data, but there's an important difference that makes it more complex than the mean model.
• The mean model can only fit horizontal lines.
• A linear regression can fit lines with slopes.
Here’s the formula:
ŷ = β0 + β1 x
• ŷ is the predicted value of y for a given x
• β0 is called the intercept; it's the value of ŷ when x is zero, i.e. where the line crosses the y-axis (more on this later).
• β1 is the coefficient of the input feature x, and it’s the slope of the line.
When you fit (a.k.a. train) a linear regression model, you’re really just estimating the best values for
β0 and β1 based on your dataset.
All it takes is 5 lines of code:
# LinearRegression from Scikit-Learn
from sklearn.linear_model import LinearRegression
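The remaining lines weren't captured above, so here's a minimal sketch of the fit step, assuming the synthetic data lives in a DataFrame df with columns x and y (lm is the variable name used in the print statement below):
# Create and fit the model (Scikit-Learn expects a 2-D array of features, hence the double brackets)
lm = LinearRegression()
lm.fit(df[['x']], df.y)

# Predictions for every x in the dataset
pred = lm.predict(df[['x']])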
After fitting the model, we can easily access the intercept and coefficient.
print(lm.coef_)
# [-0.26773758]
1.4.2 Analysis
Let’s see what we can do to represent the "curvy" relationship. We’ll take a look at polynomial
linear regression.
Polynomial linear regression fits curvy lines to the data. It does so by adding polynomial terms for x,
which are simply x raised to some power.
For example, here's how to create a "second-order" polynomial model:
• First, create x² as another input feature for your model.
• Then, estimate another coefficient, β2, for x².
Your formula now becomes:
ŷ = β0 + β1x + β2x²
This x² term now allows the model to represent a curve. If you want to add even more flexibility, you can add more and more polynomial terms.
For example, here's a "third-order" polynomial model:
ŷ = β0 + β1x + β2x² + β3x³
You can add as many polynomial terms as you want. Each one has another coefficient that must be estimated, and each one increases the complexity of the model.
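One way to do this (not necessarily the exact code from the course) is to let Scikit-Learn generate the polynomial terms for you:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Create x and x^2 as input features, then estimate a coefficient for each
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(df[['x']], df.y)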
1.5.2 Analysis
In the 2nd-order polynomial regression, there’s a slight bend in the line (you may have to squint to
see it).
• Its formula is ŷ = β0 + β1x + β2x²
• It’s moving in the right direction, but the model is still too inflexible.
Finally, just for fun, let's see what happens when you crank model complexity way up. For that, we'll turn to decision trees.
Recall the simple decision tree from the Iris example in Project 1: all it says is that if the petal width is over 1.75, predict 'virginica' for the species; otherwise, predict 'versicolor'.
That seems pretty simple... so why is this considered a complex type of model?
• Well, in that Iris example, we’ve limited the tree to one level (therefore, it only has one split).
• Theoretically, the number of levels (and thus the number of splits) can be infinite.
• Plus, this "branching" mechanism can represent non-linear relationships.
1.6.2 Analysis
Let's fit and plot an unconstrained decision tree on this dataset.
It's "unconstrained" because we won't limit the depth of the tree. We will let it grow to its little heart's content!
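The plotting code isn't shown here, but fitting the tree looks roughly like this (a sketch, not necessarily the exact course code):
from sklearn.tree import DecisionTreeRegressor

# max_depth=None leaves the tree unconstrained, so it keeps splitting until every leaf is pure
tree = DecisionTreeRegressor(max_depth=None, random_state=123)
tree.fit(df[['x']], df.y)
pred = tree.predict(df[['x']])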
The resulting prediction line snakes through almost every single observation, chasing the random noise instead of the underlying signal. In other words, the unconstrained decision tree is badly overfit.
Fortunately, as it turns out, there are ways to take advantage of the complexity/flexibility of
decision trees while reducing the chance of overfitting.
This can be accomplished by combining predictions from 100’s of constrained decision trees into a
single model.
These models are called tree ensembles, and they perform very well in practice.
• These are methods that fit many (usually 100+) decision trees and combine their predictions.
• They tend to perform well across many types of problems.
• We’ll cover them in much more detail later.
2. Exploratory Analysis
We recommend always completing a few essential exploratory analysis steps at the start of any
project.
The purpose is to "get to know" the dataset. Doing so upfront will make the rest of the project much smoother, in 3 main ways.
1. You’ll gain valuable hints for Data Cleaning (which can make or break your models).
2. You’ll think of ideas for Feature Engineering (which can take your model from good to great).
3. You’ll get a "feel" for the dataset, which will help you communicate results and deliver greater
business impact.
However, exploratory analysis for machine learning should be quick, efficient, and decisive... not
long and drawn out!
Don’t skip this step, but don’t get stuck on it either.
You see, there are an infinite number of possible plots, charts, and tables, but you only need a
handful to "get to know" the data well enough to work with it.
We’ll review what those are in this chapter.
• Think of this as a "first date" with the dataset.
• Pareto Principle: We’ll try to use 20% of the effort to learn 80% of what we need.
• You’ll do ad-hoc data exploration later anyway, so you don’t need to be 100% comprehensive
right now.
2.1 Basic Information
The first step to understanding your dataset is to display its basic information.
2.1.1 Shape
# Dataframe dimensions
df.shape
2.1.2 Datatypes
Next, it’s a good idea to confirm the data types of your features.
# Column datatypes
df.dtypes
Finally, display some example observations from the dataset. This will give you a "feel" for the kind
of data in each feature, and it’s a good way to confirm that the data types are indeed all correct.
# First 5 rows
df.head()

# First 10 rows
df.head(10)

# Last 10 rows
df.tail(10)
The purpose of displaying examples from the dataset is not to perform rigorous analysis. Instead, it’s
to get a qualitative "feel" for the dataset.
• Do the columns make sense?
• Do the values in those columns make sense?
• Are the values on the right scale?
• Is missing data going to be a big problem based on a quick eyeball test?
• What types of classes are there in the categorical features?
Often, it can be helpful to display the first 10 rows of data instead of only the first 5. You can do so by passing an argument into the head() function.
Finally, it can also save you a lot of trouble down the road if you get in the habit of looking at the
last 10 rows of data. Specifically, you should look for corrupted data hiding at the very end.
2.2 Distributions of numeric features
One of the most enlightening data exploration tasks is plotting the distributions of your features.
Often, a quick and dirty grid of histograms is enough to understand the distributions.
However, sometimes you may need to see formal summary statistics, which provide information
such as means, standard deviations, and quartiles.
It’s especially useful to confirm the boundaries (min and max) of each feature, just to make sure
there are no glaring errors.
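For example, something along these lines is usually all you need (the figure size is just a suggestion):
import matplotlib.pyplot as plt

# Quick and dirty grid of histograms, one per numeric feature
df.hist(figsize=(14, 14), xrot=-45)
plt.show()

# Formal summary statistics (count, mean, std, min, quartiles, max)
df.describe()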
2.3 Distributions of categorical features
One thing to look out for is sparse classes, which are classes that have a very small number of observations.
Figure 2.3 is from Project 2. As you can see, some of the classes (e.g. ’Concrete Block’, ’Concrete’,
’Block’, ’Wood Shingle’, etc.) have very short bars. These are sparse classes.
They tend to be problematic when building models.
• In the best case, they don't influence the model much.
• In the worst case, they can cause the model to be overfit.
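A quick way to spot sparse classes is to plot the class distribution of a categorical feature, for example with Seaborn (the 'exterior_walls' feature name follows the Project 2 example; adjust it to your dataset):
import seaborn as sns

# Bar plot of class counts for a single categorical feature
sns.countplot(y='exterior_walls', data=df)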
2.4 Segmentations
Segmentations are powerful ways to cut the data to observe the relationship between categorical
features and numeric features.
One of the first segmentations you should try is segmenting the target variable by key categorical
features.
Figure 2.4 is an example from Project 2. As you can see, in general, it looks like single family homes
are more expensive.
2.4.2 Groupby
# Segment by property_type and display the class means
df.groupby('property_type').mean()
In Figure 2.5, we see an example from Project 2. Some questions to consider include:
• On average, which type of property is larger?
• Which type of property has larger lots?
• Which type of property is in areas with more nightlife options/more restaurants/more grocery
stores?
• Do these relationships make intuitive sense, or are any surprising to you?
Finally, you are not limited to only calculating a single metric after performing a groupby. You can
use the .agg() function to calculate a list of different metrics.
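For example, a sketch along these lines (the exact list of metrics is up to you):
# Calculate several metrics at once for each property_type
df.groupby('property_type').agg(['mean', 'median', 'std'])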
2.5 Correlations
Correlations allow you to look at the relationships between numeric features and other numeric
features.
2.5.1 Intuition
Correlation is a value between -1 and 1 that represents how closely values for two separate features
move in unison.
• Positive correlation means that as one feature increases, the other increases. E.g. a child’s age
and her height.
• Negative correlation means that as one feature increases, the other decreases. E.g. hours spent
studying and number of parties attended.
• Correlations near -1 or 1 indicate a strong relationship.
• Those closer to 0 indicate a weak relationship.
• 0 indicates no relationship.
Pandas DataFrames come with a useful function for calculating correlations: .corr()
This creates a square correlation matrix: it has one row and one column for each numeric feature in the dataset, and each cell holds the correlation between the corresponding pair of features.
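For example, the following stores the matrix so it can be reused by the heatmap code below:
# Correlations between all pairs of numeric features
correlations = df.corr()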
Things to look out for include:
• Which features are strongly correlated with the target variable?
• Are there interesting or unexpected strong correlations between other features?
• Again, you’re primarily looking to gain a better intuitive understanding of the data, which will
help you throughout the rest of the project.
2.5.3 Heatmaps
Correlation heatmaps allow you to visualize the correlation grid and make it easier to digest.
When plotting a heatmap of correlations, it’s often helpful to do four things:
1. Change the background to white, so that 0 correlation shows as white.
2. Annotate the cells with their correlation values.
3. Mask the top triangle (less visual noise).
4. Drop the legend (the colorbar on the side).
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Change color scheme so 0 correlation shows as white
sns.set_style("white")

# Mask for the top triangle (this definition is an assumption; the course built the mask earlier)
mask = np.zeros_like(correlations, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Make the figsize 9 x 8
plt.figure(figsize=(9, 8))

# Plot heatmap of correlations (multiplied by 100 so the annotations are easy to read)
sns.heatmap(correlations * 100,
            annot=True,
            fmt='.0f',
            mask=mask,
            cbar=False)
3. Data Cleaning
Data cleaning is one of those things that everyone does but no one really talks about.
Sure, it’s not the "sexiest" part of machine learning. And no, there aren’t hidden tricks and secrets to
uncover.
However, proper data cleaning can make or break your project. Professional data scientists usually
spend a very large portion of their time on this step.
Why? Because of a simple truth in machine learning:
Better data beats fancier algorithms.
In other words... garbage in gets you garbage out. Even if you forget everything else from this
course, please remember this.
In fact, if you have a properly cleaned dataset, even simple algorithms can learn impressive insights
from the data!
• Use this as a "blueprint" for efficient data cleaning.
• Obviously, different types of data will require different types of cleaning.
• However, the systematic approach laid out in this module can always serve as a good starting
point.
3.1 Unwanted Observations
The first step to data cleaning is removing unwanted observations from your dataset.
This includes duplicate observations and irrelevant observations.
3.1.1 Duplicate
Duplicate observations most frequently arise during data collection, such as when you:
• Combine datasets from multiple places
• Scrape data
• Receive data from clients/other departments
# Drop duplicates
df = df.drop_duplicates()
3.1.2 Irrelevant
Irrelevant observations are those that don’t actually fit the specific problem that you’re trying to
solve.
You should always keep your project scope in mind. What types of observations do you care about?
What are you trying to model? How will the model be applied?
# Drop temporary workers
df = df[df.department != 'temp']
This is also a great time to review your charts from the Exploratory Analysis step. You can look at
the distribution charts for categorical features to see if there are any classes that shouldn’t be there.
For example, in Project 3, we dropped temporary workers from the dataset because we only needed
to build a model for permanent, full-time employees.
Checking for irrelevant observations before engineering features can save you many headaches
down the road.
The next bucket under data cleaning involves fixing structural errors.
Structural errors are those that arise during measurement, data transfer, or other types of "poor
housekeeping."
First, check for variables that should actually be binary indicator variables.
These are variables that should be either 0 or 1. However, maybe they were saved under different
logic. The .fillna() function is very handy here.
# Before:
df.basement.unique()
# array([nan, 1.])

# Fill missing values with 0 to turn 'basement' into a true indicator variable
df['basement'] = df.basement.fillna(0)

# After:
df.basement.unique()
# array([0., 1.])
In the example from Project 2, even though NaN represents "missing" values, those are actually
meant to indicate properties without basements.
Therefore, we filled them in with the value 0 to turn ’basement’ into a true indicator variable.
Next, check for typos or inconsistent capitalization. This is mostly a concern for categorical features.
One easy way to check is by displaying the class distributions for your categorical features (which
you’ve probably already done during Exploratory Analysis).
# 'composition' should be 'Composition'
df.roof.replace('composition', 'Composition', inplace=True)

# 'asphalt' should be 'Asphalt'
df.roof.replace('asphalt', 'Asphalt', inplace=True)

# 'shake-shingle' and 'asphalt,shake-shingle' should be 'Shake Shingle'
df.roof.replace(['shake-shingle', 'asphalt,shake-shingle'],
                'Shake Shingle', inplace=True)
These types of errors can be fixed easily with the .replace() function.
• The first argument is the class to replace. As you can see in the third line of code, this can also
be a list of classes.
• The second argument is the new class label to use instead.
• The third argument, inplace=True, tells Pandas to change the original column (instead of
creating a copy).
Finally, check for classes that are labeled as separate classes when they should really be the same.
• e.g. If ’N/A’ and ’Not Applicable’ appear as two separate classes, you should combine them.
• e.g. ’IT’ and ’information_technology’ should be a single class.
# 'information_technology' should be 'IT'
df.department.replace('information_technology', 'IT', inplace=True)
Again, a quick and easy way to check is by plotting class distributions for your categorical features.
Outliers can cause problems with certain types of models. For example, linear regression models are
less robust to outliers than decision tree models.
In general, if you have a legitimate reason to remove an outlier, it will help your model's performance.
However, you need justification for removal. You should never remove an outlier just because it’s a
"big number." That big number could be very informative for your model.
We can’t stress this enough: you must have a good reason for removing an outlier. Good reasons
include:
1. Suspicious measurements that are unlikely to be real data.
2. Outliers that belong in a different population.
3. Outliers that belong to a different problem.
To check for potential outliers, you can use violin plots. Violin plots serve the same purpose as box
plots, but they provide more information.
• A box plot only shows summary statistics such as median and interquartile range.
• A violin plot shows the entire probability distribution of the data. (In case you’re wondering,
you don’t need a statistics degree to interpret them.)
In Figure 3.2, we see an example of a violin plot that shows a potential outlier. The violin plot for
’lot_size’ has a long and skinny tail.
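Plotting one is a one-liner with Seaborn (using the Project 2 'lot_size' feature as the example):
import seaborn as sns

# Violin plot of a single numeric feature
sns.violinplot(x=df.lot_size)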
After identifying potential outliers with violin plots, you should do a manual check by displaying
those observations.
Then, you can use a boolean mask to remove outliers by filtering to only keep wanted observations.
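Here's a minimal sketch of both steps; the 500,000 sqft threshold is purely illustrative, not a value from the course:
# Manually display the suspicious observations
df[df.lot_size > 500000]

# Boolean mask: keep only the observations we trust
df = df[df.lot_size <= 500000]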
Unfortunately, from our experience, the 2 most commonly recommended ways of dealing with
missing data actually suck.
They are:
1. Dropping observations that have missing values
2. Imputing the missing values based on values from other observations
Dropping missing values is sub-optimal because when you drop observations, you drop information.
• The fact that the value was missing may be informative in itself.
• Plus, in the real world, you often need to make predictions on new data even if some of the
features are missing!
Imputing missing values is sub-optimal because the value was originally missing but you filled it in,
which always leads to a loss in information, no matter how sophisticated your imputation method is.
• Again, "missingness" is almost always informative in itself, and you should tell your algo-
rithm if a value was missing.
• Even if you build a model to impute your values, you’re not adding any real information.
You’re just reinforcing the patterns already provided by other features.
Figure 3.3: Missing data is like missing a puzzle piece. If you drop it, that’s like pretending the
puzzle slot isn’t there. If you impute it, that’s like trying to squeeze in a piece from somewhere else
in the puzzle.
In short, you always want to tell your algorithm that a value was missing because missingness is
informative. The two most commonly suggested ways of handling missing data don’t do that.
The best way to handle missing data for categorical features is to simply label them as ’Missing’!
• You’re essentially adding a new class for the feature.
• This tells the algorithm that the value was missing.
• This also gets around the Scikit-Learn technical requirement for no missing values.
Sometimes if you have many categorical features, it may be more convenient to write a loop that
labels missing values in all of them.
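Here's a minimal sketch of such a loop:
# Label missing values in all categorical (object) features as 'Missing'
for column in df.select_dtypes(include=['object']):
    df[column] = df[column].fillna('Missing')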
For missing numeric data, you should flag and fill the values.
1. Flag the observation with an indicator variable of missingness.
2. Then, fill the original missing value with 0 just to meet Scikit-Learn’s technical requirement
of no missing values.
By using this technique of flagging and filling, you are essentially allowing the algorithm to
estimate the optimal constant for missingness, instead of just filling it in with the mean.
When used properly, this technique almost always performs better than imputation methods in
practice.
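As a sketch (using 'lot_size' purely as an example feature):
# 1. Flag missingness with a new indicator variable
df['lot_size_missing'] = df.lot_size.isnull().astype(int)

# 2. Fill the original missing values with 0
df['lot_size'] = df.lot_size.fillna(0)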
4. Feature Engineering
Feature engineering is about creating new input features from your existing ones.
This is often one of the most valuable tasks a data scientist can do to improve model performance,
for 3 big reasons:
1. You can isolate and highlight key information, which helps your algorithms "focus" on what’s
important.
2. You can bring in your own domain expertise.
3. Most importantly, once you understand the "vocabulary" of feature engineering, you can bring
in other people’s domain expertise!
Now, we will tour various types of features that you can engineer, and we will introduce several
heuristics to help spark new ideas.
This is not an exhaustive compendium of all feature engineering because there are limitless possibilities for this step.
Finally, we want to emphasize that this skill will naturally improve as you gain more experience.
• Use this as a "checklist" to seed ideas for feature engineering.
• There are limitless possibilities for this step, and you can get pretty creative.
• This skill will improve naturally as you gain more experience and domain expertise!
4.1 Domain Knowledge
You can often engineer informative features by tapping into your (or others’) expertise about the
domain.
Try to think of specific information you might want to isolate. You have a lot of "creative freedom"
to think of ideas for new features.
Boolean masks allow you to easily create indicator variables based on conditionals. They empower
you to specifically isolate information that you suspect might be important for the model.
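The creation of the two_and_two indicator isn't shown below, but it looks roughly like this (the 'beds' and 'baths' column names are assumptions based on the Project 2 example):
# Indicator variable for properties with exactly two beds and two baths
df['two_and_two'] = ((df.beds == 2) & (df.baths == 2)).astype(int)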
# Display percent of rows where two_and_two == 1
df.two_and_two.mean()
You can also quickly check the proportion of observations that meet the condition.
Of course, we won’t always have a lot of domain knowledge for the problem. In these situations, we
should rely on exploratory analysis to provide us hints.
Figure 4.1 shows a scatterplot from Project 3. From the plot, we can see that there are 3 clusters of
people who’ve left.
1. First, we have people with high ’last_evaluation’ but low ’satisfaction’. Maybe
these people were overqualified, frustrated, or unhappy in some other way.
2. Next, we have people with low ’last_evaluation’ and medium ’satisfaction’. These
were probably underperformers or poor cultural fits.
3. Finally, we have people with high ’last_evaluation’ and high ’satisfaction’. Perhaps
these were overachievers who found better offers elsewhere.
As you can tell, "domain knowledge" is very broad and open-ended. At some point, you’ll get stuck
or exhaust your ideas.
That’s where these next few steps come in. These are a few specific heuristics that can help spark
more ideas.
4.2 Interaction Features
The first of these heuristics is checking to see if you can create any interaction features that make
sense. These are features that represent operations between two or more features.
In some contexts, "interaction terms" must be products of two variables. In our context, interaction
features can be products, sums, or differences between two features.
4.2.1 Examples
In Project 2, we knew the transaction year and the year the property was built. However, the more useful piece of information is what you get by combining the two: the age of the property at the time of the transaction.
In Project 2, we also knew the number of schools nearby and their median quality score. However, we suspected that what's really important is having many school options, but only if they are good.
In Project 4, we knew the number of units per order (Quantity) and the price per unit (UnitPrice), but
what we really needed was the actual amount paid for the order (Sales).
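As a sketch, those three interaction features might be created like this (the 'num_schools' and 'median_school' column names are assumptions):
# Property age at the time of the transaction (Project 2)
df['property_age'] = df.tx_year - df.year_built

# Many school options, but only if they are good (Project 2)
df['school_score'] = df.num_schools * df.median_school

# Actual amount paid for the order (Project 4)
df['Sales'] = df.Quantity * df.UnitPrice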
The next heuristic we’ll consider is grouping sparse classes together (in our categorical features).
Sparse classes are those that have very few total observations. They can be problematic for certain
machine learning algorithms because they can cause models to be overfit.
• There's no formal rule for how many observations each class needs.
• It also depends on the size of your dataset and the number of other features you have.
• As a rule of thumb, we recommend combining classes until each one has at least around 50 observations. As with any rule of thumb, use this as a guideline (not actually as a rule).
To begin, we can group classes that are pretty similar in reality. Again, the .replace() function
comes in handy.
For example, in Project 2, we decided to group ’Wood Siding’, ’Wood Shingle’, and ’Wood’ into a
single class. In fact, we just labeled all of them as ’Wood’.
Next, we can group the remaining sparse classes into a single ’Other’ class, even if there’s already
an ’Other’ class.
For example, in Project 2, we labeled ’Stucco’, ’Other’, ’Asbestos shingle’, ’Concrete Block’, and
’Masonry’ as ’Other’.
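A sketch of both groupings, assuming the feature is called 'exterior_walls':
# Group similar wall materials into a single 'Wood' class
df.exterior_walls.replace(['Wood Siding', 'Wood Shingle'], 'Wood', inplace=True)

# Group the remaining sparse classes into 'Other'
df.exterior_walls.replace(['Stucco', 'Asbestos shingle', 'Concrete Block', 'Masonry'],
                          'Other', inplace=True)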
Figure 4.2 shows the "before" and "after" class distributions for one of the categorical features from
Project 2. After combining sparse classes, we have fewer unique classes, but each one has more
observations.
Often, an eyeball test is enough to decide if you want to group certain classes together.
Scikit-Learn machine learning algorithms cannot directly handle categorical features. Specifically,
they cannot handle text values.
Therefore, we need to create dummy variables for our categorical features.
Dummy variables are a set of binary (0 or 1) features that each represent a single class from a
categorical feature. The information you represent is exactly the same, but this numeric representation
allows you to pass Scikit-Learn’s technical requirement.
In Project 2, after grouping sparse classes for the ’roof’ feature, we were left with 5 classes:
• ’Missing’
• ’Composition Shingle’
• ’Other’
• ’Asphalt’
• ’Shake Shingle’
These translated to 5 dummy variables:
• ’roof_Missing’
• ’roof_Composition Shingle’
• ’roof_Other’
• ’roof_Asphalt’
• ’roof_Shake Shingle’
So if an observation in the dataset had a roof made of ’Composition Shingle’, it would have the
following values for the dummy variables:
• ’roof_Missing’ = 0
• ’roof_Composition Shingle’ = 1
• ’roof_Other’ = 0
• ’roof_Asphalt’ = 0
• ’roof_Shake Shingle’ = 0
Creating a dataframe with dummy features is as easy as calling the pd.get_dummies() function and passing in a list of the features you'd like to create dummy variables for.
# Create new dataframe with dummy features
df = pd.get_dummies(df, columns=['exterior_walls',
                                 'roof',
                                 'property_type'])
4.5 Remove Unused
Unused features are those that don’t make sense to pass into our machine learning algorithms.
Examples include:
• ID columns
• Features that wouldn’t be available at the time of prediction
• Other text descriptions
Redundant features would typically be those that have been replaced by other features that you’ve
added.
For example, in Project 2, since we used ’tx_year’ and ’year_built’ to create the ’property_age’
feature, we ended up removing them.
• Removing ’tx_year’ was also a good idea because we didn’t want our model being overfit
to the transaction year.
• Since we’ll be applying it to future transactions, we wanted the algorithms to focus on learning
patterns from the other features.
Sometimes there’s no clear right or wrong decision for this step, and that’s OK.
Finally, save the new DataFrame that you've cleaned and then augmented through feature engineering.
We prefer to name it the analytical base table because it’s what you’ll be building your models on.
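A sketch of the final clean-up and save (the filename is just a suggestion):
# Drop the redundant features that property_age replaced (axis=1 means "drop columns")
df = df.drop(['tx_year', 'year_built'], axis=1)

# Save the analytical base table for the modeling steps
df.to_csv('analytical_base_table.csv', index=None)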
Not all of the features you engineer need to be winners. In fact, you’ll often find that many of them
don’t improve your model. That’s fine because one highly predictive feature makes up for 10 duds.
The key is choosing machine learning algorithms that can automatically select the best features
among many options (built-in feature selection).
This will allow you to avoid overfitting your model despite providing many input features.
4.6 Rolling Up
For some problems, you’ll need to roll up the data into a higher level of granularity through a process
called data wrangling.
During this process, you’ll aggregate new features. You practiced this in Project 4, and we review it
in Appendix A.2 - Data Wrangling.
5. Model Training
Let’s start with a crucial but sometimes overlooked step: Spending your data.
Think of your data as a limited resource.
• You can spend some of it to train your model (i.e. feed it to the algorithm).
• You can spend some of it to evaluate (test) your model to decide if it’s good or not.
• But you can’t reuse the same data for both!
If you evaluate your model on the same data you used to train it, your model could be very overfit
and you wouldn’t even know it! Remember, a model should be judged on its ability to predict new,
unseen data.
Therefore, you should have separate training and test subsets of your dataset.
Training sets are used to fit and tune your models. Test sets are put aside as "unseen" data to
evaluate your models.
You should always split your data before you begin building models.
• This is the best way to get reliable estimates of your models’ performance.
• After splitting your data, don’t touch your test set until you’re ready to choose your final
model!
Comparing test vs. training performance allows us to avoid overfitting. If the model performs very
well on the training data but poorly on the test data, then it’s overfit.
Splitting our data into training and test sets will help us get a reliable estimate of model performance
on unseen data, which will help us avoid overfitting by picking a generalizable model.
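Here's a minimal sketch using Scikit-Learn's train_test_split; the target name, test size, and random seed are illustrative:
from sklearn.model_selection import train_test_split

# Separate the target variable from the input features ('tx_price' is an assumed target name)
y = df.tx_price
X = df.drop('tx_price', axis=1)

# Hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=1234)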
Next, it’s time to introduce a concept that will help us tune (i.e. pick the best "settings" for) our
models: cross-validation.
Cross-validation is a method for getting a reliable estimate of model performance using only your
training data. By combining this technique and the train/test split, you can tune your models without
using your test set, saving it as truly unseen data.
Cross-validation refers to a family of techniques that serve the same purpose. One way to distinguish
the techniques is by the number of folds you create.
The most common one, 10-fold, breaks your training data into 10 equal parts (a.k.a. folds), essentially
creating 10 miniature train/test splits.
These are the steps for 10-fold cross-validation:
1. Split your data into 10 equal parts, or "folds".
2. Train your model on 9 folds (e.g. the first 9 folds).
3. Evaluate it on the remaining "hold-out" fold (e.g. the last fold).
4. Perform steps (2) and (3) 10 times, each time holding out a different fold.
5. Average the performance across all 10 hold-out folds.
The average performance across the 10 hold-out folds is your final performance estimate, also called
your cross-validated score. Because you created 10 mini train/test splits, this score turns out to be
pretty reliable.
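To make the idea concrete, here's a hedged sketch of a 10-fold cross-validated score for a plain linear regression (not a model we'll actually tune in the projects):
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# 10-fold cross-validation returns one holdout score per fold
scores = cross_val_score(LinearRegression(), X_train, y_train, cv=10)

# The average of the 10 holdout scores is the cross-validated score
print(scores.mean())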
Up to now, you’ve explored the dataset, cleaned it, and engineered new features.
Most importantly, you created an excellent analytical base table that has set us up for success.
5.2.1 Preprocessing
However, sometimes we’ll want to preprocess the training data even more before feeding it into our
algorithms.
For example, we may want to...
• Transform or scale our features.
• Perform automatic feature reduction (e.g. PCA).
• Remove correlated features.
The key is that these types of preprocessing steps should be performed inside the cross-validation
loop.
5.2.2 Standardization
Standardization is the most popular preprocessing technique in machine learning. It transforms all
of your features to the same scale by subtracting means and dividing by standard deviations.
This is a useful preprocessing step because some machine learning algorithms will "overemphasize"
features that are on larger scales. Standardizing a feature is pretty straightforward:
1. First, subtract its mean from each of its observations.
2. Then, divide each of its observations by its standard deviation.
This makes the feature’s distribution centered around zero, with unit variance. To standardize the
entire dataset, you’d simply do this for each feature.
To perform standardization, we needed 2 pieces of information directly from the training set:
1. The means of each feature
2. The standard deviations of each feature
These are called preprocessing parameters because they actually needed to be learned from the
training data.
So if we want to preprocess new data (such as our test set) in the exact same way, we’d need to use
the exact same preprocessing parameters that we learned from the training set (not those from the
test set!).
• Mathematically speaking, this preserves the exact original transformation.
• Practically speaking, if new observations come in 1 at a time, you can’t standardize them
without already having preprocessing parameters to use (i.e. those learned from your training
set).
Now, what you just learned is a sophisticated modeling process that has been battle-tested and proven
in the field. Luckily, you won’t have to implement it from scratch!
Scikit-Learn comes with a handy make_pipeline() function that helps you glue these steps together. (See? Told you Scikit-Learn comes with everything!)
# For standardization
from sklearn.preprocessing import StandardScaler
# For pipelines and the Lasso algorithm used in this example
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Lasso

# Single pipeline
make_pipeline(StandardScaler(), Lasso(random_state=123))
However, we recommend storing all your model pipelines in a single dictionary, just so they’re
better organized.
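For example, a regression pipeline dictionary might look like this (the keys and the exact set of algorithms are illustrative; what matters is that the keys match your hyperparameter dictionary later):
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor

# One pipeline per algorithm, keyed by a short name
pipelines = {
    'lasso': make_pipeline(StandardScaler(), Lasso(random_state=123)),
    'ridge': make_pipeline(StandardScaler(), Ridge(random_state=123)),
    'enet': make_pipeline(StandardScaler(), ElasticNet(random_state=123)),
    'rf': make_pipeline(StandardScaler(), RandomForestRegressor(random_state=123)),
}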
Up to now, we’ve been casually talking about "tuning" models, but now it’s time to treat the topic
more formally.
When we talk of tuning models, we specifically mean tuning hyperparameters.
There are two types of parameters we need to worry about when using machine learning algorithms.
Model parameters are learned attributes that define individual models.
• e.g. regression coefficients
• e.g. decision tree split locations
• They can be learned directly from the training data.
Hyperparameters express "higher-level" structural settings for modeling algorithms.
• e.g. strength of the penalty used in regularized regression
• e.g. the number of trees to include in a random forest
• They are decided before training the model because they cannot be learned from the data.
The key distinction is that model parameters can be learned directly from the training data while
hyperparameters cannot!
Most algorithms have many different options of hyperparameters to tune. Fortunately, for each
algorithm, typically only a few hyperparameters really influence model performance.
You’ll declare the hyperparameters to tune in a dictionary, which is called a "hyperparameter
grid".
See below for recommended values to try.
# Lasso hyperparameters
lasso_hyperparameters = {
    'lasso__alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]
}

# Ridge hyperparameters
ridge_hyperparameters = {
    'ridge__alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]
}

# Elastic-Net hyperparameters
enet_hyperparameters = {
    'elasticnet__alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10],
    'elasticnet__l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
}

# Random forest hyperparameters (regression)
rf_hyperparameters = {
    'randomforestregressor__n_estimators': [100, 200],
    'randomforestregressor__max_features': ['auto', 'sqrt', 0.33],
}
# Logistic regression hyperparameters (L2 penalty)
l2_hyperparameters = {
    'logisticregression__C': np.linspace(1e-3, 1e3, 10),
}
# Random forest hyperparameters (classification)
rf_hyperparameters = {
    'randomforestclassifier__n_estimators': [100, 200],
    'randomforestclassifier__max_features': ['auto', 'sqrt', 0.33]
}
Just as with the model pipelines, we recommend storing all your hyperparameter grids in a single
dictionary to stay organized.
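For example (the keys must match the keys of the pipelines dictionary):
hyperparameters = {
    'lasso': lasso_hyperparameters,
    'ridge': ridge_hyperparameters,
    'enet': enet_hyperparameters,
    'rf': rf_hyperparameters,
}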
Now that we have our pipelines and hyperparameters dictionaries declared, we’re ready to
tune our models with cross-validation.
Remember the cross-validation loop that we discussed earlier? Well, GridSearchCV essentially
performs that entire loop for each combination of values in the hyperparameter grid (i.e. it wraps
another loop around it).
• As its name implies, this class performs cross-validation on a hyperparameter grid.
• It will try each combination of values in the grid.
• It neatly wraps the entire cross-validation process together.
It then calculates cross-validated scores (using performance metrics) for each combination of
hyperparameter values and picks the combination that has the best score.
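For a single pipeline, creating the cross-validation object looks like this (using the Lasso pipeline as an example):
from sklearn.model_selection import GridSearchCV

# Cross-validation object for one pipeline and its hyperparameter grid
model = GridSearchCV(pipelines['lasso'], hyperparameters['lasso'],
                     cv=10, n_jobs=-1)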
# Fit and tune model
model.fit(X_train, y_train)
Because we already set up and organized our pipelines and hyperparameters dictionaries, we
can fit and tune models with all of our algorithms in a single loop.
# Loop through model pipelines,
# tuning each one and saving it to fitted_models
fitted_models = {}
for name, pipeline in pipelines.items():
    # Create cross-validation object
    model = GridSearchCV(pipeline, hyperparameters[name],
                         cv=10, n_jobs=-1)

    # Fit model on X_train, y_train
    model.fit(X_train, y_train)

    # Store model in fitted_models[name]
    fitted_models[name] = model
This step can take a few minutes, depending on the speed and memory of your computer.
Finally, it’s time to evaluate our models and pick the best one.
Regression models are generally more straightforward to evaluate than classification models.
One of the first ways to evaluate your models is by looking at their cross-validated score on the
training set.
These scores represent different metrics depending on your machine learning task.
Holdout R²: For regression, the default scoring metric is the average R² on the holdout folds.
• In rough terms, R² is the "percent of the variation in the target variable that can be explained by the model."
• Because it is the average R² from the holdout folds, higher is almost always better.
• That's all you'll need to know for this course, but there are plenty of additional explanations of this metric that you can find online.
Mean absolute error (MAE): Another metric we looked at was mean absolute error, or MAE.
• Mean absolute error (or MAE) is the average absolute difference between predicted and actual
values for our target variable.
• It has the added benefit that it’s very easily interpretable. That means it’s also easier to
communicate your results with key stakeholders.
To evaluate our models, we also want to see their performance on the test set, and not just their
cross-validated scores. These functions from Scikit-Learn help us do so.
Holdout accuracy: For classification, the default scoring metric is the average accuracy on the
holdout folds.
• Accuracy is simply the percent of observations correctly classified by the model.
• Because it's the average accuracy from the holdout folds, higher is almost always better.
• However, straight accuracy is not always the best way to evaluate a classification model,
especially when you have imbalanced classes.
Area under ROC curve (AUROC): Area under ROC curve is the most reliable metric for classifi-
cation tasks.
• It’s equivalent to the probability that a randomly chosen Positive observation ranks higher (has
a higher predicted probability) than a randomly chosen Negative observation.
• Basically, it’s saying... if you grabbed two observations and exactly one of them was the
positive class and one of them was the negative class, what’s the likelihood that your model
can distinguish the two?
• Therefore, it doesn’t care about imbalanced classes.
• See Appendix A.1 - Area Under ROC Curve for a more detailed explanation.
# Classification metrics
from sklearn.metrics import roc_curve, auc
After importing the performance metric functions, you can write a loop to apply them to the test set.
Performance on the test set is the best indicator of whether a model is generalizable or not.
# Regression metrics
from sklearn.metrics import r2_score, mean_absolute_error

# Test performance
for name, model in fitted_models.items():
    pred = model.predict(X_test)
    print(name)
    print('--------')
    print('R^2:', r2_score(y_test, pred))
    print('MAE:', mean_absolute_error(y_test, pred))
    print()
# Pickle package
import pickle
You’ve come a long way up to this point. You’ve taken this project from a simple dataset all the way
to a high-performing predictive model.
Now, we’ll show you how you can use your model to predict new (raw) data and package your work
together into an executable script that can be called from the command line.
One nice and quick sanity check we can do is confirm that our model was saved correctly.
Here’s what we’ll do:
1. Load the original analytical base table that was used to train the model.
2. Split it into the same training and test sets (with the same random seed).
3. See if we get the same performance on the test set as we got during model training.
Occasionally, if your project is very complex, you might end up saving the wrong object or acciden-
tally overwriting an object before saving it.
This 5-minute sanity check is cheap insurance against those situations.
First, we can import our model and load the same analytical base table used to train the model.
# Load final_model.pkl as model
with open('final_model.pkl', 'rb') as f:
    model = pickle.load(f)
Now that you have the dataset, if you split it into training/test sets using the exact same settings and
random seed as you used during model training, you can perfectly replicate the subsets from before.
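Here is a hedged sketch of recreating that split (the file name, target column, test size, and random seed below are placeholders; use exactly the same settings you used during model training):
# Recreate the original train/test split
# NOTE: file name, target column, test_size, and random_state are placeholders
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('analytical_base_table.csv')
y = df.status
X = df.drop('status', axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234, stratify=df.status)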
Finally, you can use your model to predict X_test again and confirm if you get the same scores for
your performance metrics (which will be different between classification and regression).
# Predict X_test and keep only the positive class probability
pred = model.predict_proba(X_test)
pred = [p[1] for p in pred]

# Print AUROC
from sklearn.metrics import roc_auc_score
print('AUROC:', roc_auc_score(y_test, pred))
Even if new data arrives in the same format as the original raw data, we can't apply our model to
it directly. We'd first need to clean the new data in the same way and engineer the same features.
All we need to do is write a few functions to convert the raw data to the same format as the
analytical base table.
• That means we need to bundle together our data cleaning steps.
• Then we need to bundle together our feature engineering steps.
• We can skip the exploratory analysis steps because we didn’t alter our dataframe then.
The key is to only include steps that altered your dataframe. Here’s the example from Project 3:
# Drop temporary workers
df = df[df.department != 'temp']

# 'information_technology' should be 'IT'
df.department.replace('information_technology', 'IT',
                      inplace=True)

# Create new dataframe with dummy features
df = pd.get_dummies(df, columns=['department', 'salary'])

# Return augmented DataFrame
return df
Next, we packaged these functions together into a single model class. This is a convenient way to
keep all of the logic for a given model in one place.
Remember how when we were training our model, we imported LogisticRegression, RandomForestClassifier,
and GradientBoostingClassifier?
We called them "algorithms," but they are technically Python classes.
Python classes are structures that allow us to group related code, logic, and functions in one place.
Those familiar with object-oriented programming will have recognized this concept.
For example, each of those algorithms has the fit() and predict_proba() functions that allow
you to train and apply models, respectively.
We can construct our own custom Python classes for our models.
• Thankfully, it doesn’t need to be nearly as complex as those other algorithm classes because
we’re not actually using this to train the model.
• Instead, we already have the model saved in a final_model.pkl file.
• We only need to include logic for cleaning data, feature engineering, and predicting new
observations.
class EmployeeRetentionModel:
    def __init__(self, model_location):
        with open(model_location, 'rb') as f:
            self.model = pickle.load(f)

    def predict_proba(self, X_new, clean=True, augment=True):
        # clean_data() and engineer_features() hold the logic shown above
        if clean:
            X_new = self.clean_data(X_new)
        if augment:
            X_new = self.engineer_features(X_new)
        return X_new, self.model.predict_proba(X_new)
We’re going to start with the easiest way to deploy your model: Keep it in Jupyter Notebook.
Not only is this the easiest way, but this is also our recommended way for Data Scientists and
Business Analysts who are personally responsible for maintaining and applying the model.
Keeping the model in Jupyter Notebook has several huge benefits.
• It’s seamless to update your model.
• You can perform ad-hoc data exploration and visualization.
• You can write detailed comments and documentation, with images, bullet lists, etc.
We didn't choose Jupyter Notebook for teaching this course just because it's a great interactive
environment... we also chose it because it's one of the most popular IDEs among professional data
scientists.
If you keep your model in Jupyter Notebook, you can directly use the model class you defined earlier.
# Initialize an instance
retention_model = EmployeeRetentionModel('final_model.pkl')

# Predict raw data
_, pred1 = retention_model.predict_proba(raw_data,
                                         clean=True, augment=True)
However, there will definitely be situations in which Jupyter notebooks are not enough. The most
common scenarios are if you need to integrate your model with other parts of an application or
automate it.
For these use cases, you can simply package your model into an executable script.
You can call these scripts from any command line program. That means you can:
• Host your model on the cloud (like AWS)
• Integrate it as part of a web application (have other parts of the application call the script)
• Or automate it (schedule recurring jobs to run the script)
Implementing each of those use cases is outside the scope of this course, but we’ve shown you how
to actually set up the script, which is the first step for each of those use cases.
We’ve included an example script in the project_files/ directory of the Project 3 Workbook
Bundle.
Once you open it up (in the Jupyter browser or in any text editor), you'll see that it already contains
the library imports and the custom model class you've written, including the data cleaning,
feature engineering, and predict probability functions.
All that’s left is a little bit of logic at the bottom to handle command line requests.
To run the script, you can call it from the command line after navigating to the Workbook Bundle’s
folder.
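As a rough, hypothetical sketch, the command-line logic at the bottom of such a script might look like this (the script name, file names, and argument order are made up for illustration):
# Hypothetical command-line handling, e.g.:
#   python retention_model.py new_employee_data.csv predictions.csv
if __name__ == '__main__':
    import sys
    import pandas as pd

    input_file, output_file = sys.argv[1], sys.argv[2]

    retention_model = EmployeeRetentionModel('final_model.pkl')
    df = pd.read_csv(input_file)
    _, pred = retention_model.predict_proba(df, clean=True, augment=True)
    pd.DataFrame(pred).to_csv(output_file, index=False)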
Before we can get to the algorithms, we need to first cover a few key concepts. Earlier, we introduced
model complexity, even calling it the "heart of machine learning."
Now we’re going to dive deeper into a practical challenge that arises from model complexity:
overfitting.
As it turns out, for practical machine learning, it’s typically better to allow more complexity in
your models, and then implement other safeguards against overfitting.
• The reason is that if your models are too simple, they can’t learn complex patterns no matter
how much data you collect.
• However, if your models are more complex, you can use easy-to-implement tactics to avoid
overfitting. In addition, collecting more data will naturally reduce the chance of overfitting.
And that’s why we consider overfitting (and not underfitting) the "final boss" of machine learning.
Overfitting is a dangerous mistake that causes your model to "memorize" the training data instead
of the true underlying pattern.
• This prevents the model from being generalizable to new, unseen data.
• Many of the practical skills in this course (regularization, ensembling, train/test
splitting, cross-validation, etc.) help you fight overfitting!
• In fact, our entire model training workflow is designed to defeat overfitting.
However, picking the right algorithms can also prevent overfitting! Just to be clear, there are two
main ways models can become too complex, leading to overfitting.
1.) The first is the inherent structure of different algorithms.
• e.g. Decision trees are more complex than linear regression because they can represent
non-linear relationships.
• e.g. Polynomial regression is more complex than simple linear regression because it can
represent curvy relationships.
2.) The second is the number of input features into your model.
• e.g. A linear regression with 1 input feature is less complex than a linear regression with the
same input feature plus 9 others.
• e.g. In a way, third-order polynomial regression is just linear regression plus 2 other features
(x², x³) that were engineered from the original one, x.
Both of these two forms of model complexity can lead to overfitting, and we’ve seen how to deal
with both of them.
7.2 Regularization
Next, we’ll discuss the first "advanced" tactic for improving the performance of our models. It’s
considered pretty "advanced" in academic machine learning courses, but we believe it’s really pretty
easy to understand and implement.
But before we get to it, let’s bring back our friend simple linear regression and drag it through the
mud a little bit.
You see, in practice, simple linear regression models rarely perform well. We actually recommend
skipping them for most machine learning problems.
Their main advantage is that they are easy to interpret and understand. However, our goal is not
to study the data and write a research report. Our goal is to build a model that can make accurate
predictions.
In this regard, simple linear regression suffers from two major flaws:
1. It’s prone to overfit with many input features.
2. It cannot easily express non-linear relationships.
Let’s take a look at how we can address the first flaw.
Remember, one type of over-complexity occurs if the model has too many features relative to the
number of observations.
• For example, imagine a training set with only 100 observations and 100 input features. If you
fit a linear regression model with all of those 100 features, you can perfectly "memorize" the
training set.
• Each coefficient would simply memorize one observation. This model would have perfect
accuracy on the training data, but perform poorly on unseen data. It hasn’t learned the true
underlying patterns; it has only memorized the noise in the training data.
So, how can we reduce the number of features in our model (or dampen their effect)?
The answer is regularization.
The standard cost function for regression tasks is called the sum of squared errors. Again, there’s
not a lot of math in this course, but here’s where a simple formula makes this much easier to explain:
C = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
Therefore, the more accurate your model is, the lower this cost will be. Linear regression algorithms
seek to minimize this function, C, by trying different values for the coefficients.
Overfitting occurs when the cost for a model is low on the training data, but high for new, unseen
data.
7.2.6 L1 regularization
L1 penalizes the absolute size of model coefficients. The new regularized cost function becomes:
C = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{F} |\beta_j|
7.2.7 L2 regularization
Likewise, L2 penalizes the squared size of model coefficients. The new regularized cost function
becomes:
C = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{F} \beta_j^2
As you can see, the regularized cost functions are simply the unregularized cost function plus the L1
or L2 penalty.
As discussed previously, in the context of linear regression, regularization works by adding penalty
terms to the cost function. The algorithms that employ this technique are known as regularized linear
regression.
There are 3 types of regularized linear regression algorithms we’ve used in this course:
1. Lasso
2. Ridge
3. Elastic-Net
They correspond with the amount of L1 and L2 penalty included in the cost function.
Oh and in case you’re wondering, there’s really no definitive "better" type of penalty. It really
depends on the dataset and the problem. That’s why we recommend trying different algorithms that
use a range of penalty mixes (Algorithms are commodities).
7.3.1 Lasso
Lasso, or LASSO, stands for Least Absolute Shrinkage and Selection Operator.
• Lasso regression completely relies on the L1 penalty (absolute size).
• Practically, this leads to coefficients that can be exactly 0 (automatic feature selection).
• Remember, the "strength" of the penalty should be tuned.
• A stronger penalty leads to more coefficients pushed to zero.
7.3.2 Ridge
Ridge stands for Really Intense Dangerous Grapefruit Eating just kidding... it’s just ridge.
• Ridge regression completely relies on the L2 penalty (squared).
• Practically, this leads to smaller coefficients, but it doesn’t force them to 0 (feature shrinkage).
• Remember, the "strength" of the penalty should be tuned.
• A stronger penalty leads to coefficients being pushed closer to zero.
7.3.3 Elastic-Net
Elastic-Net combines the L1 and L2 penalties, and the mix between them (the l1_ratio) is tunable.
By the way, you may be wondering why we bothered importing separate algorithms for Ridge and
Lasso if Elastic-Net combines them.
• It’s because Scikit-Learn’s implementation of ElasticNet can act funky if you set the penalty
ratio to 0 or 1.
• Therefore, you should use the separate Ridge and Lasso classes for those special cases.
• Plus, for some problems, Lasso or Ridge are just as effective while being easier to tune.
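For reference, all three algorithms live in Scikit-Learn's linear_model module:
# Import regularized linear regression algorithms
from sklearn.linear_model import Lasso, Ridge, ElasticNet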
Awesome, we’ve just seen 3 algorithms that can prevent linear regression from overfitting. But if
you remember, linear regression suffers from two main flaws:
1. It’s prone to overfit with many input features.
2. It cannot easily express non-linear relationships.
How can we address the second flaw? Well, we need to move away from linear models to do that.
We need to bring in a new category of algorithms.
Decision trees model data as a "tree" of hierarchical branches. They make branches until they reach
"leaves" that represent predictions. They can be used for both regression and classification.
Intuition: To explain the intuition behind decision trees, we’ll borrow an answer from a related
Quora discussion.
Let’s imagine you are playing a game of Twenty Questions. Your opponent has secretly
chosen a subject, and you must figure out what she chose. At each turn, you may ask a
yes-or-no question, and your opponent must answer truthfully. How do you find out the
secret in the fewest number of questions?
It should be obvious some questions are better than others. For example, asking "Is it a
basketball" as your first question is likely to be unfruitful, whereas asking "Is it alive" is
a bit more useful. Intuitively, we want each question to significantly narrow down the
space of possible secrets, eventually leading to our answer.
That is the basic idea behind decision trees. At each point, we consider a set of questions
that can partition our data set. We choose the question that provides the best split
(often called maximizing information gain), and again find the best questions for the
partitions. We stop once all the points we are considering are of the same class (in the
naive case). Then classifying is easy. Simply grab a point, and chuck him down the tree.
The questions will guide him to his appropriate class.
That explanation was for classification, but their use in regression is analogous.
When predicting a new data point, the tree simply passes it down through the hierarchy (going down
through the branches based on the splitting criteria) until it reaches a leaf.
Due to their branching structure, decision trees have the advantage of being able to model non-
linear relationships.
Let’s take a hypothetical scenario for the real-estate dataset from Project 2.
For example, let’s say that...
• Within single family properties, ’lot_size’ is positively correlated with ’tx_price’. In
other words, larger lots command higher prices for single family homes.
• But within apartments, ’lot_size’ is negatively correlated with ’tx_price’. In other
words, smaller lots command higher prices for apartments (i.e. maybe they are more upscale).
A single linear model would struggle here, because one global coefficient for 'lot_size' can only capture one direction of the relationship. Decision trees, on the other hand, can naturally express it.
• First, they can split on ’property_type’, creating separate branches for single family
properties and apartments.
• Then, within each branch, they can create two separate splits on ’lot_size’, each represent-
ing a different relationship.
This branching mechanism is exactly what leads to higher model complexity.
In fact, if you allow them to grow limitlessly, they can completely "memorize" the training data, just
from creating more and more branches.
As a result, individual unconstrained decision trees are very prone to being overfit.
We already saw this in our toy example with the noisy sine wave from Project 1.
Figure 7.1: Training set predictions from unconstrained decision tree (Project 1)
So, how can we take advantage of the flexibility of decision trees while preventing them from
overfitting the training data?
That’s where tree ensembles come in!
But before talking about tree ensembles, let’s first discuss ensemble methods in general.
Ensembles are machine learning methods for combining multiple individual models into a single
model.
There are a few different methods for ensembling. In this course, we cover two of them:
1. Bagging attempts to reduce the chance of overfitting complex models.
• It trains a large number of "strong" learners in parallel.
• A strong learner is a model that’s allowed to have high complexity.
• It then combines all the strong learners together in order to "smooth out" their predictions.
2. Boosting attempts to improve the predictive flexibility of simple models.
• It trains a large number of "weak" learners in sequence.
• A weak learner is a model that has limited complexity.
• Each one in the sequence focuses on learning from the mistakes of the one before it.
• It then combines all the weak learners into a single strong learner.
While bagging and boosting are both ensemble methods, they approach the problem from opposite
directions.
Bagging uses complex base models and tries to "smooth out" their predictions, while boosting uses
simple base models and tries to "boost" their aggregate complexity.
Ensembling is a general term, but when the base models are decision trees, we call them special
names: random forests and boosted trees!
Random forests train a large number of "strong" decision trees and combine their predictions
through bagging.
In addition, there are two sources of "randomness" for random forests:
1. Each decision tree is only allowed to choose from a random subset of features to split on.
2. Each decision tree is only trained on a random subset of observations (a process called
resampling).
# Import Random Forest
from sklearn.ensemble import RandomForestRegressor
Boosted trees train a sequence of "weak", constrained decision trees and combine their predictions
through boosting.
• Each decision tree is allowed a maximum depth, which should be tuned.
• Each decision tree in the sequence tries to correct the prediction errors of the one before it.
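The corresponding import for Scikit-Learn's boosted trees implementation for regression:
# Import Gradient Boosted Trees (regression)
from sklearn.ensemble import GradientBoostingRegressor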
Next, we’ll dive into a few more key concepts. In particular, we want to introduce you to 4 useful
algorithms for classification tasks:
1. L1-regularized logistic regression
2. L2-regularized logistic regression
3. Random forests
4. Boosted trees
As you might suspect, they are analogous to their regression counterparts, with a few key differences
that make them more appropriate for classification.
In applied machine learning, individual algorithms should be swapped in and out depending on
which performs best for the problem and the dataset.
Therefore, in this module, we will focus on intuition and practical benefits over math and theory.
8.1 Binary Classification
Classification with 2 classes is so common that it gets its own name: binary classification.
In Project 3, we had two possible classes for an employee’s status: ’Left’ and ’Employed’.
However, when we constructed our analytical base table, we converted the target variable from ’Left’
/ ’Employed’ into 1 / 0.
In binary classification,
• 1 (’Left’) is also known as the "positive" class.
• 0 (’Employed’) is also known as the "negative" class.
In other words, the positive class is simply the primary class you’re trying to identify.
For regression tasks, the output of a model is pretty straightforward. In Project 2: Real-Estate
Tycoon, the output was a prediction for the transaction price of the property.
Therefore, you might logically conclude that the output for a classification task should be a prediction
for the status of the employee.
However, in almost every situation, we’d prefer the output to express some level of confidence in
the prediction, instead of only the predicted class.
Therefore, we actually want the output to be class probabilities, instead of just a single class
prediction.
• For binary classification, the predicted probability of the positive class and that of the negative
class will sum to 1.
• For general classification, the predicted probabilities of all the classes will sum to 1.
This will become super clear once we get to the example shortly.
8.2.1 Methodology
Remember, a model is only useful if it can accurately approximate the "true state of the world" (i.e.
the "true underlying relationship" between input features and target variables).
Therefore, we're going to create a synthetic dataset for which we already know the "true underlying
relationship."
1. First, we’re going to use a single input feature, x.
2. Then, we’re going to generate values for the target variable, y, based on a predetermined
mapping function.
3. Next, we’re going to add randomly generated noise to that dataset.
4. Once we've done that, we can try different algorithms on our synthetic dataset.
5. We already know the "true underlying relationship"... it’s the predetermined mapping function.
6. Finally, we can compare how well models of different complexities can separate the signal
from the randomly generated noise.
First, for our predetermined mapping function, we’ll use a conditional function that checks if x > 0.5.
Therefore, the "true underlying relationship" between x and y will be:
If x > 0.5 then y = 1.
If x ≤ 0.5 then y = 0.
However, we’re going to add random noise to it, turning it into a noisy conditional:
If x + ε > 0.5 then y = 1.
If x + ε ≤ 0.5 then y = 0.
# NumPy
import numpy as np

# Input feature
x = np.linspace(0, 1, 100)

# Noise
np.random.seed(555)
noise = np.random.uniform(-0.2, 0.2, 100)

# Target variable
y = ((x + noise) > 0.5).astype(int)
As you can see, observations with x > 0.5 should have y = 1 and those with x ≤ 0.5 should have
y = 0. However, as you get closer to x = 0.5, the noise plays a bigger factor, and you get some
overlap.
Therefore, here’s what we generally want out of a good model:
• If x > 0.5, we want our model to predict higher probability for y = 1.
• If x ≤ 0.5, we want our model to predict higher probability for y = 0.
• If x is closer to 0.5, we want less confidence in the prediction.
• If x is farther from 0.5, we want more confidence in the prediction.
Finally, there are two more important nuances that we need from a good model:
1. We don’t want probability predictions above 1.0 or below 0.0, as those don’t make sense.
2. We want very high confidence by the time we pass 0.3 or 0.7, because that’s when noise is
no longer a factor.
First, let’s discuss logistic regression, which is the classification analog of linear regression.
Remember, linear regression fits a "straight line" with a slope (technically, it’s a straight hyperplane
in higher dimensions, but the concept is similar).
In Figure 8.2, you can see 2 flaws with linear regression for this type of binary classification problem.
1. First, you’re getting predicted probabilities above 1.0 and below 0.0! That just doesn’t make
sense.
2. Second, remember, we want very high confidence by the time we pass 0.3 or 0.7, because
that’s when noise is no longer a factor. We can certainly do much better on this front.
Also in Figure 8.2, you can see how logistic regression differs.
1. First, we're no longer getting predictions that are above 1.0 or below 0.0, so that's nice.
2. However, we're still not getting very high confidence for x < 0.3 and x > 0.7. It kinda looks
like the model is being "too conservative."
We’ll see how we can improve on that second point shortly.
First, it’s important to note that the .predict() function behaves quite differently for classification.
It will give you the predicted classes directly.
However, as mentioned earlier, we almost always prefer to get the class probabilities instead, as
they express a measurement of confidence in the prediction.
# Logistic regression
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y)

# Class predictions
model.predict(X)
Typically, .predict_proba() is the more useful function. It returns the actual class probabilities,
which allow you to calculate certain metrics (such as the AUROC) and set custom thresholds (i.e. if
false positives are more costly than false negatives).
Here’s the full code snippet for fitting and plotting the logistic regression predictions from Figure
8.2.
# Logistic regression
model = LogisticRegression()
model.fit(X, y)

# Predict probabilities
pred = model.predict_proba(X)
Logistic regression has regularized versions that are analogous to those for linear regression.
Regularized logistic regression also works by adding a penalty factor to the cost function.
As in Lasso, Ridge, and Elastic-Net, the strength of regularized logistic regression’s penalty term is
tunable.
However, for logistic regression, it’s not referred to as alpha. Instead, it’s called C, which is the
inverse of the regularization strength.
• Therefore, higher C means a weaker penalty and lower C means a stronger penalty.
• By default, C=1.
In other words, our logistic regression was being regularized by default, which is why it was a bit
conservative.
In Figure 8.3, we see what happens when we make the penalty 4 times stronger (C=0.25) or 4 times
weaker (C=4).
For this synthetic dataset, it actually looks like weaker penalties produce better models.
The key is that this is a hyperparameter that you should tune using the modeling process from
earlier.
Likewise...
• For linear regression, L1 regularization was called Lasso regression.
• For logistic regression, we'll simply call it L1-regularized logistic regression.
We can set the penalty type when we initialize the algorithm, like so:
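Here's a minimal sketch (the solver argument is an assumption, since only some of Scikit-Learn's solvers support the L1 penalty):
# L1-regularized logistic regression
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)

# L2-regularized logistic regression (L2 is the default penalty)
l2_model = LogisticRegression(penalty='l2', C=1.0)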
Even though we set the penalty type as a hyperparameter, for this course, we will treat them as
separate algorithms (just to be consistent).
The same tree ensembles we used for regression can be applied to classification.
They work in nearly the same way, except they expect classes for the target variable.
Random forests train a large number of "strong" decision trees and combine their predictions
through bagging.
In addition, there are two sources of "randomness" for random forests:
1. Each decision tree is only allowed to choose from a random subset of features to split on.
2. Each decision tree is only trained on a random subset of observations (a process called
resampling).
# Import Random Forest CLASSIFIER
from sklearn.ensemble import RandomForestClassifier
Boosted trees train a sequence of "weak", constrained decision trees and combine their predictions
through boosting.
• Each decision tree is allowed a maximum depth, which should be tuned.
• Each decision tree in the sequence tries to correct the prediction errors of the one before it.
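And the corresponding import for classification:
# Import Gradient Boosted Trees CLASSIFIER
from sklearn.ensemble import GradientBoostingClassifier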
For clustering problems, the chosen input features are usually more important than which algorithm
you use.
One reason is that we don’t have a clear performance metric for evaluating the models, so it usually
doesn’t help that much to try many different algorithms.
Furthermore, clustering algorithms typically create clusters based on some similarity score or
"distance" between observations, and those distances are calculated from the input features.
Therefore, we need to make sure we only feed in relevant features for our clusters.
9.1 K-Means
For clustering, however, it’s usually not very fruitful to try multiple algorithms.
• The main reason is that we don’t have labeled data... after all, this is Unsupervised Learning.
• In other words, there are no clear performance metrics that we can calculate based on a
"ground truth" for each observation.
• Typically, the best clustering model is the one that creates clusters that are intuitive and
reasonable in the eyes of the key stakeholder.
Therefore, which algorithm you choose is typically less important than the input features that you
feed into it.
9.1.1 Intuition
# K-Means algorithm
from sklearn.cluster import KMeans
So how are those distances calculated? Well, the default way is to calculate the euclidean distance
based on feature values. This is the ordinary "straight-line" distance between two points.
Imagine you had only two features, x1 and x2 .
• Let’s say you have an observation with values x1 = 5 and x2 = 4.
• And let’s say you had a centroid at x1 = 1 and x2 = 1.
To calculate the euclidean distance between the observation and the centroid, you’d simply take:
\sqrt{(5-1)^2 + (4-1)^2} = \sqrt{4^2 + 3^2} = \sqrt{25} = 5
You might recognize that as simply the Pythagorean theorem from geometry.
• It calculates the straight-line distance between two points on the coordinate plane.
• Those points are defined by your feature values.
• This generalizes to any number of features (i.e. you can calculate straight-line distance
between two points in any-dimensional feature space).
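For instance, here's a minimal NumPy check of the calculation above (np.linalg.norm computes the straight-line distance in any number of dimensions):
# Euclidean distance between the observation and the centroid
import numpy as np
observation = np.array([5, 4])
centroid = np.array([1, 1])
print(np.linalg.norm(observation - centroid))  # 5.0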
Figure 9.3: Straight-line distance can be calculated between two points in any-dimensional feature
space.
There are other forms of distances as well, but they are not as common. They are typically used in
special situations. The default euclidean distance is usually a good starting point.
Because K-Means creates clusters based on distances, and because those distances are calculated
between observations based on their feature values, the features you choose to input into the
algorithm heavily influence the clusters that are created.
In Project 4, we looked at 3 possible feature sets and compared the clusters created from them:
1. Only purchase pattern features ("Base DF")
2. Purchase pattern features + item features chosen by thresholding ("Threshold DF")
3. Purchase pattern features + principal component features from items ("PCA DF")
The easiest way to prepare your feature sets for clustering is to create separate objects for each
feature set you wish to try.
In Project 4, we did so like this:
# Import PCA item features
pca_item_data = pd.read_csv('pca_item_data.csv', index_col=0)
Once you have your feature sets ready, fitting the clustering models is very simple (and the process
is a small subset of the process for Supervised Learning).
1. First, we make the pipeline with standardization.
2. Then, we fit the pipeline to the feature set (a.k.a. the training data).
3. Next, we call .predict() on the same feature set to get the clusters.
4. Finally, we can visualize the clusters.
5. Repeat steps (1) - (4) for each feature set.
Here’s how you’d perform steps (1) - (4) for one feature set:
# K-Means model pipeline
k_means = make_pipeline(StandardScaler(),
                        KMeans(n_clusters=3, random_state=123))

# Fit K-Means pipeline
k_means.fit(base_df)

# Get cluster assignments and attach them to the dataframe
base_df['cluster'] = k_means.predict(base_df)

# Scatterplot, colored by cluster
sns.lmplot(x='total_sales', y='avg_cart_value',
           hue='cluster', data=base_df, fit_reg=False)
The Seaborn library also allows us to visualize the clusters by using the hue= argument to its
.lmplot() function.
Again, it’s worth reminding that there aren’t any clear, widely-accepted performance metrics for
clustering tasks because we don’t have labeled "ground truth."
Therefore, it’s usually more useful to just compare the clusters created by different feature sets, and
then take them to the key stakeholder for feedback and more iteration.
A. Appendix
Area under ROC curve is the most reliable metric for classification tasks.
• It’s equivalent to the probability that a randomly chosen Positive observation ranks higher (has
a higher predicted probability) than a randomly chosen Negative observation.
• Basically, it’s saying... if you grabbed two observations and exactly one of them was the
positive class and one of them was the negative class, what’s the likelihood that your model
can distinguish the two?
• Therefore, it doesn’t care about imbalanced classes.
Before presenting the idea of an ROC curve, we must first discuss what a confusion matrix is.
For binary classification, there are 4 possible outcomes for any given prediction.
1. True positive - Predict 1 when the actual class is 1.
2. False positive - Predict 1 when the actual class is 0.
3. True negative - Predict 0 when the actual class is 0.
4. False negative - Predict 0 when the actual class is 1.
The confusion matrix displays this information for a set of class predictions.
# Import confusion_matrix
from sklearn.metrics import confusion_matrix
The confusion matrix allows you to calculate two important metrics that are both combinations of
elements in the matrix.
True positive rate (TPR), also known as recall, is defined as TP / (TP + FN). In other words, it's the
proportion of all positive observations that are correctly predicted to be positive.
False positive rate (FPR) is defined as FP / (FP + TN). In other words, it's the proportion of all negative
observations that are incorrectly predicted to be positive.
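As a sketch, both rates can be read straight off the confusion matrix (class_pred below is assumed to be a set of hard class predictions, e.g. from .predict()):
# Compute TPR and FPR from the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, class_pred).ravel()
tpr = tp / (tp + fn)  # true positive rate (recall)
fpr = fp / (fp + tn)  # false positive rate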
Obviously, we want TPR to be higher and FPR to be lower... However, they are intertwined in an
important way.
Remember, we can predict a probability for each class using .predict_proba(), instead of the
class directly.
For example:
# Display first 10 predictions
pred[:10]

# [0.030570070257148009,
#  0.004441966482297899,
#  0.007296300193244642,
#  0.088097865803861697,
#  0.071150950128417365,
#  0.48160946301549462,
#  0.12604877174578785,
#  0.6152946894912692,
#  0.72665929094601023,
#  0.13703595544287492]
The default behavior of .predict() for binary classification is to predict 1 (positive class) if the
probability is greater than 0.5. So among those first 10 predictions in the code snippet above, the 8th
and 9th would be classified as 1 and the others would be classified as 0.
In other words, 0.5 is the default threshold.
However, you can theoretically alter that threshold, depending on your goals. Lowering it will make
positive class predictions more likely. Conversely, raising it will make negative class predictions
more likely. The threshold you choose is independent of the model.
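For example, here's a minimal sketch of applying a custom threshold to the predicted probabilities (the 0.3 value is purely illustrative):
# Classify as positive if the predicted probability exceeds a custom threshold
import numpy as np
custom_pred = (np.array(pred) > 0.3).astype(int)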
Here’s where TPR and FPR are intertwined!
• If you lower the threshold, the true positive rate increases, but so does the false positive rate.
• Remember, TPR is the proportion of all actually positive observations that were predicted to
be positive.
• That means a model that always predicts the positive class will have a true positive rate of
100%.
• However, FPR is the proportion of all actually negative observations that were predicted to be
positive.
• Therefore, a model that always predicts the positive class will also have a false positive rate of
100%!
And that finally brings us to the ROC curve!
The ROC curve is a way to visualize the relationship between TPR and FPR for classification
models. It simply plots the true positive rate and false positive rate at different thresholds.
We can create one like so:
# Matplotlib for plotting
import matplotlib.pyplot as plt

# Initialize figure
fig = plt.figure(figsize=(8, 8))
plt.title('Receiver Operating Characteristic')

# Plot ROC curve
plt.plot(fpr, tpr, label='l1')
plt.legend(loc='lower right')

# Diagonal 45 degree line
plt.plot([0, 1], [0, 1], 'k--')
The 45 degree dotted black line represents the ROC curve of a hypothetical model that makes
completely random predictions.
Basically, we want our model’s curve, the blue line, to sit as far above that dotted black line as
possible.
Figure A.2: As the threshold decreases, both FPR and TPR increase.
A.1.5 AUROC
# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, pred)

# Calculate AUROC
print(auc(fpr, tpr))
# 0.901543001458
So the interpretation of that output is that our L1-regularized logistic regression (Project 3) has a
90.15% chance of distinguishing between a positive observation and a negative one.
Often, the most interesting machine learning applications require you to wrangle your data first.
In Project 4, you were given a transaction-level dataset. In other words, each observation in the raw
dataset is for a single transaction - one item, one customer, one purchase.
However, based on the project scope, you needed a customer-level dataset for machine learning.
Remember, the goal was to produce clusters of customers.
Therefore, before applying machine learning models, you needed to first break down and restructure
the dataset.
Specifically, you needed to aggregate transactions by customer and engineer customer-level
features.
• This step blends together exploratory analysis, data cleaning, and feature engineering.
• Here, feature engineering comes from aggregating the transaction-level data.
• You still have a lot of room for creativity in this step.
After cleaning the transaction-level dataset, we rolled it up to the customer level and aggregated
customer-level features.
We wanted 1 customer per row, and we wanted the features to represent information such as:
• Number of unique purchases by the customer
• Average cart value for the customer
• Total sales for the customer
• Etc.
# Roll up sales data
# (named aggregation replaces the deprecated dict-renaming syntax)
sales_data = df.groupby('CustomerID').Sales.agg(
    total_sales='sum',
    avg_product_value='mean')
You won’t always be able to easily roll up to your desired level directly... Sometimes, it will be easier
to create intermediary levels first.
# Reset index
cart.reset_index(inplace=True)
Now you have multiple dataframes that each contain customer-level features.
Next, all you need to do is join them all together.
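A minimal sketch of that join (the dataframe names and join settings are assumptions based on the snippets above; the exact call depends on how your intermediary dataframes are indexed):
# Join customer-level dataframes on CustomerID
# NOTE: dataframe names are placeholders from the snippets above
customer_df = sales_data.join(cart.set_index('CustomerID'), how='inner')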
Next, we’ll introduce a concept that’s especially important for Unsupervised Learning: Dimensional-
ity Reduction.
In Project 4, our client wished to incorporate information about specific item purchases into the
clusters.
For example, our model should be more likely to group together customers who buy similar items.
One logical way to represent this is to create dummy variables for each unique gift and feed those
features into the clustering algorithm, alongside the purchase pattern features. Now, here’s the
problem... the retailer sells over 2000 unique gifts. That’s over 2000 dummy variables.
This runs into a problem called "The Curse of Dimensionality."
Figure A.4: Imagine searching for a penny as the number of dimensions increases.
For our practical purposes, it’s enough to remember that when you have many features (high
dimensionality), it makes clustering especially hard because every observation is "far away" from
each other.
The amount of "space" that a data point could potentially exist in becomes larger and larger, and
clusters become very hard to form.
Remember, in Project 4, this "curse" arose because we tried to create dummy variables for each
unique item in the dataset. This led to over 2500 additional features that were created!
Next, we’ll review the two methods for reducing the number of features in your dataset.
One very simple and straightforward way to reduce the dimensionality of this item data is to set a
threshold for keeping features.
• In Project 4, the rationale is that you might only want to keep popular items.
• For example, let’s say item A was only purchased by 2 customers. Well, the feature for item A
will be 0 for almost all observations, which isn’t very helpful.
• On the other hand, let’s say item B was purchased by 100 customers. The feature for item B
will allow more meaningful comparisons.
For example, this is how we kept item features for only the 20 most popular items.
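The original code isn't reproduced here, but a minimal sketch of that idea could look like this (item_data is assumed to be the customer-by-item dummy dataframe):
# Keep dummy features for only the 20 most frequently purchased items
top_20_items = item_data.sum().sort_values(ascending=False).head(20).index
top_items_data = item_data[top_20_items]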
Figure A.5: First principal component (black) and second principal component (red)
The second method is Principal Component Analysis (PCA), which creates new features (principal
components) that capture the directions of greatest variance in your data. Therefore, if you have
many correlated features, you can use PCA to capture the key axes of variance and drastically
reduce the number of features.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize, then fit and transform item_data
scaler = StandardScaler()
item_data_scaled = scaler.fit_transform(item_data)

# Fit PCA and transform item_data_scaled into principal components
pca = PCA()
PC_items = pca.fit_transform(item_data_scaled)
Please refer to the online lesson and your Companion Workbook for a more detailed explanation of
PCA, along with a walkthrough using a toy example.