
GUIDEBOOK

Machine Learning Basics, Continued:
Building Your First Machine Learning Model

www.dataiku.com
Introduction:
So You Want to Build a
Machine Learning Model?
In the past few years, machine learning (ML) has gone from a discipline restricted to data scientists
and engineers to the mainstream of business and analytics. With advancements in automated
machine learning (AutoML) and collaborative AI and machine learning platforms (like Dataiku),
the use of data — including for predictive modeling — across people of all different job types is
on the rise.

We have designed this guidebook as a means of support and reference for beginner practitioners
who want to get started with building their first predictive ML model. At the end of this practical
guide, readers should be able to understand:

The key considerations when preparing a dataset for machine learning

The value of quick modeling using Dataiku AutoML and how it can be the
basis of your first machine learning model

Concepts for evaluating and tuning a model in Dataiku

Note: While this guidebook is meant to give aspiring and beginner machine learning
practitioners the tools and practical steps for building a machine learning model, it is not
all-encompassing or a step-by-step tutorial. For a more hands-on tutorial on building
machine learning models in Dataiku DSS, we strongly recommend following up on this
guide by enrolling in the free Dataiku Academy Machine Learning Basics course track.

Prerequisite: Key Machine Learning
Concepts & Main Considerations
This section provides a crash-course refresher on some of the basic machine learning concepts
as a primer to building a predictive model.

AI vs. Machine Learning vs. Deep Learning

Before getting into machine learning (ML), let’s take a step back and discuss artificial intelligence
(AI) more broadly. AI is an umbrella term for any computer program that does something smart
that we previously thought only humans could do. This can even include something as simple
as a computer program that uses a set of predefined rules to play checkers, although when we
talk about AI today, we are usually referring to more advanced applications.

Specifically, we're usually talking about machine learning, which means teaching a machine to
learn from experience without explicitly programming it to do so. Usually the “thing” machine
learning is predicting is not explicitly stated in data (for example, predicting whether or not
someone will default on their loan, what kind of music people will like based on songs they’ve
listened to, etc.).

Deep learning, another hot topic, is a subset of machine learning and has further enhanced
the AI boom of the last 10 years. In a nutshell, deep learning is an advanced type of ML that can
handle complex tasks like image and sound recognition.

Figure I-1: AI vs. ML vs. DL.

Machine Learning Model, Machine Learning Algorithm... What is the Difference?

When reading about ML, you have probably seen the expressions “machine learning model” and
“machine learning algorithm” used quasi-interchangeably, which can lead to some confusion.

Put simply, the difference between a model and algorithm is as follows:

• An algorithm is a set of rules to follow in a certain order to solve a problem.

• A model is a computation or a formula that you build by using the algorithm.


A machine learning algorithm will take your data as input, apply its rules, and
produce a trained model as the result; that model can then take new data and output predictions.
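
To make the distinction concrete, here is a minimal sketch in Python using scikit-learn (one of the libraries covered later in this guide); the feature and target values are made up for illustration:

```python
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]   # input feature, e.g., pages visited (made-up values)
y = [10, 20, 30, 40]       # target, e.g., revenue generated

algorithm = LinearRegression()  # the algorithm: a set of rules, not yet trained
model = algorithm.fit(X, y)     # the model: the formula learned from your data

print(model.coef_, model.intercept_)  # the learned parameters of that formula
print(model.predict([[5]]))           # apply the model to new data
```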

Figure I-2: A visual representation of algorithms vs. models.

Types of Machine Learning

Machine learning can be divided into two key subcategories:

• Supervised ML, which uses a set of input variables to predict the value of an output
variable. It uses previous values to predict new values.

• Unsupervised ML, which infers patterns from an unlabeled dataset. Here, you aren’t
trying to predict anything, you’re just trying to understand patterns and groupings in
the data.

Figure I-3: Supervised vs. unsupervised learning.

This guide will focus on supervised machine learning — we will look at historical data to train
a model to learn the relationships between features, or variables, and a target, the thing we’re
trying to predict. This way, when new data comes in, we can use the feature values to make a
good prediction of the target, whose value we do not yet know.

Supervised learning can be further split into regression (predicting numerical values) and
classification (predicting categorical values). Some algorithms can only be used for regression,
others only for classification, and many for both.

Figure I-4: Regression vs. classification problems.

Key Supervised ML Algorithms

Many of the most popular supervised learning algorithms fall into three key categories:

Table I-1: Popular supervised learning algorithms

Linear models use a simple formula to find a best-fit line through a set of data points.

Tree-based models use a series of “if-then” rules to generate predictions from one or more decision trees.

Artificial neural networks are modeled after the way that neurons interact in the human brain to interpret information and solve problems. This is also often referred to as deep learning.

Note that in Section III, Building the Model, Figure 3-2 goes more in depth on the specific types
of algorithms in these three categories (including diving into their individual advantages and
disadvantages).

Now that you have brushed up on some of the key machine learning concepts, let’s dive into the
fundamental steps of building a predictive machine learning model. The following sections will
go through each stage one by one, including key considerations for each stage.

Figure I-5: Key stages of building a machine learning model.

I. Defining the Goal


Defining the business objective of a machine learning project as concretely as possible is key
to ensuring its success, and it’s the first step of any sound data project. In a business setting,
motivating the different actors and obtaining all the resources necessary to get models from
design to production requires a project that addresses a clear organizational need.

If you’re working on a personal project or playing around with a dataset or an API, this step may
seem irrelevant. It’s not — simply downloading a cool open dataset is not enough. In order to
have motivation, direction, and purpose to execute and build a machine learning model from
start to finish, you have to identify a clear objective for what you want to do with the data, the
model, and how it’s going to improve your current processes or performance at a given task (or
unlock new solutions to a given problem that you couldn’t address before).

When it comes to identifying a business problem for your first predictive machine learning
model, you could start by looking at the different types of prediction and thinking about what
exactly it is that you would like to predict. As seen in the previous section, there are two main
types of supervised machine learning:

• Classification — Do you want to predict whether something is one thing or another? Such as whether a customer will churn or not churn? Or whether a patient has heart disease or not? Note that there can be more than two options, or more than two “classes,” that what you’re trying to predict can fall into. Two classes is called binary classification; more than two classes is called multi-class classification. Multi-label is when an item can belong to more than one class.

• Regression — Do you want to predict a specific number of something? Such as how much a house will sell for? Or how many customers will visit your site next month?

Throughout this guide, we will be using a fictional T-shirt company called “Haiku T-shirts” as
an example. You can go ahead and explore the Dataiku T-shirts sample predictive machine
learning project built in Dataiku to follow along. Let’s say that for our T-shirt business, one
crucial pain point is predicting sales and revenue — that is, as a company, we need to be able
to anticipate how many T-shirts we’ll sell each month and how much revenue we can expect to
generate. This is critical for planning and business forecasts as well as for stock optimization.

As this is our first predictive ML model and we want to keep things fairly simple, we will narrow
down the business question even further and focus on predicting how much revenue new
customers will generate this month.

II. Preparing the Dataset for
Machine Learning
Preparing data for machine learning can take up to 80% of the time of an entire data project. In
other words, making sure data is consistent, clean, and usable overall is no small undertaking.
However, having clean data is critical when building machine learning models — garbage in,
garbage out, as the saying goes. This section will cover the basics, from finding the right data to
preliminary analysis and exploration of datasets to feature discovery.

Getting the Data

With a goal in mind, it’s time to start looking for data: the second phase of building a model.
Mixing and merging data from many different data sources can take a data project to the next
level. For example, our fictional T-shirt company trying to predict how much revenue new
customers will generate certainly requires internal data (like sales and CRM data) but it could
also benefit from external data sources such as economic data (are people buying a lot of
goods in general?) and data that represents fashion trends.

There are a few ways to get usable data:

Connect to a database: Ask data or IT teams for the data that’s available.

Use APIs: Application Programming Interfaces (APIs) allow two applications to
talk to each other, and they can be used to access data from tools like Salesforce,
Pipedrive, and a host of other applications on the web or that your organization
might be using. If you’re not an expert coder, Dataiku Plugins provide lots of
possibilities to bring in external data.

Look for open data: The web is full of datasets to enrich what you already have with
additional information. For example, census data could help add average revenue
for the district where a user lives, or OpenStreetMap can provide the number of
coffee shops on a given street. A lot of countries have open data platforms (like
data.gov in the United States). If you're working on a fun project outside of work,
these open data sets are also an incredible resource.

If you’re following along with the Dataiku T-shirt example, you’ll see that this project uses a
few simple datasets: historical CRM data combined with data from users’ interactions on the
company website.

EFFICIENCY TIP

Data science, machine learning, and AI platforms make it much easier to find and connect
to data sources all in one place. For example, Dataiku has more than 40 native data
connectors, from cloud databases to business applications to flat files to on-premises
warehouses and everything in between. That means regardless of data size, shape, or
location, Dataiku makes connecting to data easier.

Figure 2-1: Connecting to a wide range of data sources made easy with Dataiku.

Analyze, Explore, and Clean

Before building a model, it is good practice to carefully understand the dataset(s) at hand.
Spending time properly analyzing, exploring, and cleaning data at this stage not only can ensure
better results, but it also helps avoid serious issues (e.g., using inherently biased or problematic
data that can easily lead to biased or otherwise problematic models).

Start digging to see what data you’re dealing with, and ask questions of business people, the
IT team, or other groups to understand what all the variables mean. From there, you’ll probably
notice that even though you have a country feature, for instance, you’ve got different spellings,
or even missing data. It’s time to look at every one of your columns to make sure the data is
homogeneous and clean.

Keep an eye out for data quality issues, such as missing values or inconsistent data — too many
missing or invalid values mean that those variables won’t have any predictive power for your
model. Plus, many machine learning algorithms are not able to handle rows with missing data.
Depending on the use case, we could impute (or assign) the missing values with a constant, like
zero, or with the median or mean of the column. In other cases, we might choose to drop rows
with missing values entirely.

Missing values are an issue for categorical variables as well. For example, in the T-shirt project,
we could treat the missing values as an additional “Missing” category, or we could impute the
most common of the existing categories. For instance, if the most common T-shirt category
in our dataset is White M Female T-shirts, we may want to impute this value to fill the missing
rows in our T-shirt type column. Another option would be to entirely drop rows with missing values.
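
Here is a minimal sketch of these imputation options using pandas; the column names and values are hypothetical, loosely based on the T-shirt example:

```python
import pandas as pd

df = pd.DataFrame({
    "pages_visited": [3, None, 8, 5],
    "tshirt_category": ["White M Female", None, "White M Female", "Black S Male"],
})

# Numerical: impute with the median (a constant like zero, or the mean, also works)
df["pages_visited"] = df["pages_visited"].fillna(df["pages_visited"].median())

# Categorical, option 1: treat missing values as their own "Missing" category
df["tshirt_category"] = df["tshirt_category"].fillna("Missing")

# Categorical, option 2: impute the most common existing category
# df["tshirt_category"] = df["tshirt_category"].fillna(df["tshirt_category"].mode()[0])

# Option 3: drop rows with missing values entirely
# df = df.dropna()
```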

Again, if you’re following along with the Dataiku T-shirt sample project, you can go to the Flow
and click on the yellow Recipe circles to see the data cleaning and enhancement steps that have
been applied in this case.

Figure 2-2: The Flow in Dataiku, or a visual representation of the data pipeline.

EFFICIENCY TIP

Dataiku removes the pain of the necessary — yet seemingly never-ending — step of data prep.
Visual tools (called Recipes) offer 90+ transformations to cleanse, combine, reshape, and
enrich data via an interactive spreadsheet-like experience.

Figure 2-3: Assess data quality at-a-glance with Columns Quick View in Dataiku.

Figure 2-4: Get a more detailed look at — plus easily address — data quality by analyzing individual columns in Dataiku.

Feature Selection

After exploring and cleaning datasets, the next step is selecting the features you’ll use to train
your model. Also known as an independent variable or a predictor variable, a feature is an
observable quantity, recorded and used by a prediction model — in structured datasets, features
appear as columns. Examples of features from the Dataiku T-shirt sample project include “pages
visited,” “country,” etc. Note that it’s also possible to engineer features by combining them or
adding new information to them.

The main reasons behind using feature selection are that it:

• Reduces complexity. Including only the most relevant features means less complexity,
which is good not only for model explainability, but also for training speed. Less
complexity can also mean better accuracy, as irrelevant features introduce noise.

• Reduces overfitting. Overfitting is when a model has learned too much detail or
noise in the training data that won’t necessarily exist in unseen data. In this case, the
model will appear to perform well on the training data but will perform poorly on
unseen data. This is often referred to as the model not generalizing well — more on
this later.

Some features can negatively impact model performance, so you’ll want to identify and remove
them. Common feature selection methods include:

1. Statistical tests, which can be used to select those features that have the
strongest relationship with the output variable. For example, univariate
analysis is useful for exploring a dataset one variable at a time. This kind
of analysis does not consider relationships between two or more variables
in your dataset. Rather, the goal here is to describe and summarize the
dataset using a single variable. On the other hand, bivariate analysis is
useful for analyzing two variables to determine any existing relationship
between them.

2. Correlation matrices are useful for showing the correlation coefficients (or
degree of relationship) between numeric variables. Correlation states how
the features are related to each other or to the target variable. Correlation
can be positive (an increase in one feature’s value increases the value of the
target variable) or negative (an increase in one feature’s value decreases the
value of the target variable). Correlation matrices allow you to view a visual
table of the pairwise correlations for multiple variables in your dataset, but
you can also use a heatmap, which makes it easy to identify which features
are most related to the target variable. If two features are highly correlated
(either positively or negatively) with each other, you might want to consider
removing one.
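
If you want to try the correlation matrix approach in code rather than a visual tool, here is a minimal sketch in pandas; the columns and values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "pages_visited":   [3, 7, 2, 9, 5],
    "age_first_order": [22, 35, 41, 28, 30],
    "revenue":         [12.0, 48.5, 10.0, 65.0, 30.0],  # the target
})

corr = df.corr()         # pairwise correlation coefficients between variables
print(corr["revenue"])   # how strongly each feature relates to the target

# Optional: render the matrix as a heatmap (requires seaborn and matplotlib)
# import seaborn as sns
# sns.heatmap(corr, annot=True, cmap="coolwarm")
```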

EFFICIENCY TIP

Machine learning platforms offer many features that facilitate the feature selection process. For
example, Dataiku interactive statistics provides a dedicated interface for performing exploratory
data analysis (EDA) on datasets. That means the two methods described above (statistical tests
and correlation matrices) are a breeze with Dataiku, even if you’re not a statistics expert.

Figure 2-5: Univariate analysis for the T-shirt example in Dataiku.

Figure 2-6: Bivariate analysis for the T-shirt example in Dataiku.

Figure 2-7: A correlation matrix in Dataiku.

Feature Handling and Engineering

Feature handling is about applying transformations to features so that they can be better used
by — and positively impact the performance of — your model. Assuming that you have
clean data, feature handling and engineering is where you can make the biggest impact when
creating an ML model — so it’s pretty important!

A feature’s variable type determines the feature handling options during machine learning:

Numerical variables take values that can be added, subtracted, multiplied,
etc. There are times when it may be useful to treat a numerical variable with a
limited number of values as categorical. For instance, if you have two product
categories that are just labeled as “1” and “2,” they would normally be
processed as numerical variables, but it might make more sense to treat them
as classes.

Categorical variables take one value from an enumerated list of values. The goal of
categorical feature handling is to encode the values of a categorical variable
so that they can be treated as numeric.

Text variables are arbitrary blocks of text. If a text variable takes a limited
number of values (for instance, “white T-shirt M,” “black T-shirt S,” etc.), it may
be useful to treat it as categorical.

Feature engineering relates to building new features from the existing dataset or transforming
existing features into more meaningful representations — take, for example, raw dates. Raw
dates are not well understood by machine learning models, so a good strategy would be to parse
those dates, or convert them into numerical features to preserve their ordinal relationship.
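
For instance, here is one way to parse raw dates with pandas; this is a sketch with made-up dates, not a prescribed recipe:

```python
import pandas as pd

df = pd.DataFrame({"order_date": ["2021-01-05", "2021-02-17", "2021-03-02"]})

# Parse the raw strings into proper datetimes...
df["order_date"] = pd.to_datetime(df["order_date"])

# ...then derive numerical features that preserve the ordinal relationship
df["order_year"] = df["order_date"].dt.year
df["order_month"] = df["order_date"].dt.month
df["order_day_of_week"] = df["order_date"].dt.dayofweek
```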

Another example of when feature engineering might be necessary is if there are large absolute
differences between values — in this case, we might want to apply a rescaling technique. Feature
scaling is a method used to normalize the range between the values of numerical features. Why?
Because variables that are measured at different scales do not contribute equally to the model
fitting, and they might end up creating a bias.

Note that some data science, machine learning, and AI platforms (like Dataiku) automatically
add normalization, though it’s still important to understand the basics of what’s happening.
For example, one of the most common rescaling methods is min-max rescaling (also known as
min-max scaling or min-max normalization), which is the simplest method. It consists of scaling
the data to a fixed range — usually 0 to 1. Having this bounded range means smaller standard
deviations, which can suppress the effect of outliers.
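
As a quick illustration, min-max rescaling is a one-liner in scikit-learn; the values below are made up:

```python
from sklearn.preprocessing import MinMaxScaler

# One feature with large absolute differences between values
X = [[10.0], [200.0], [55.0], [980.0]]

scaler = MinMaxScaler()              # scales each feature to the range [0, 1]
X_scaled = scaler.fit_transform(X)   # equivalent to (x - min) / (max - min)
print(X_scaled)
```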

For those that want to dive deeper on the topic of feature scaling, we recommend the
article All About Feature Scaling1 by Baijayanta Roy on Towards Data Science.

Feature engineering looks a little different for categorical variables — for example, one
transformation involves encoding categorical variables as numbers so that they can be
understood by the machine learning algorithm. Dummy-encoding (vectorization) creates a
vector of 0/1 flags of length equal to the number of categories in the categorical variable. You
can choose to drop one of the dummies so that they are not linearly dependent, or — if you’re
using it — let Dataiku decide (in which case the least frequently occurring category is dropped).
There is a limit on the number of dummies, which can be based on a maximum number of
categories, the cumulative proportion of rows accounted for by the most popular categories, or a
minimum number of samples per category. In our T-shirt example, we could apply dummy
encoding so that categories like hats and shoes are encoded as numbers.
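
Here is a minimal dummy-encoding sketch using pandas; note that pandas’ drop_first option drops the first category rather than the least frequently occurring one, as Dataiku does:

```python
import pandas as pd

df = pd.DataFrame({"product_type": ["tshirt", "hat", "shoes", "hat"]})

# One 0/1 flag per category; dropping one dummy keeps the flags from being
# linearly dependent
dummies = pd.get_dummies(df["product_type"], prefix="product", drop_first=True)
print(dummies)
```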

Addressing Data Leakage


While data leakage might not be something beginning practitioners run into right away,
it’s still a good concept to understand. Data leakage happens when a model is trained
on features that won’t be available at prediction time. While generally more of an issue
with complex datasets, it’s nonetheless important to be aware of, as leakage causes
overly optimistic performance on the validation dataset and very poor generalization
performance on real data.

When building training sets, it is essential to check that all features you are using will be
available at prediction time and that none of your features contain the information you
are trying to predict. For example, let’s say you are trying to predict T-shirt sales for a given
day, and you’ve created a feature using the sales amount in the three days leading to this
day. The feature contains your target, unless you make sure that your feature only looks at
the three days prior to the target day.
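
Here is a minimal pandas sketch of that scenario, showing a leaky rolling-window feature next to its leakage-safe version; the sales figures are made up:

```python
import pandas as pd

# One row per day of sales
df = pd.DataFrame({"sales": [5, 8, 6, 9, 12, 7, 10]})

# LEAKY: a 3-day window ending today includes today's sales, i.e., the target
df["sales_last_3_leaky"] = df["sales"].rolling(3).sum()

# SAFE: shift by one day first, so the window covers only the 3 days *prior*
df["sales_last_3"] = df["sales"].shift(1).rolling(3).sum()
```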

1 https://towardsdatascience.com/all-about-feature-scaling-bcc0ad75cb35

III. Building the Model
With a defined business problem and prepared data, it’s finally time to start building your
machine learning model. This section will cover building a simple model using Dataiku AutoML.

Quick Modeling in Dataiku DSS

The first model you build should be a baseline, meaning a model that is straightforward but
with a good chance of providing decent results. Building baseline models is fast and in many
cases, it can get you 90% of the way to where you need to be. In fact, a baseline model may
be enough to solve the problem at hand — with an accuracy of 90%, should you then focus on
getting the accuracy to 95%? Or would it be a better use of time to solve more problems to 90%
accuracy? The answer depends largely on the use case and business context.

AutoML is a tool that automates the process of applying machine learning and can make quick,
baseline modeling simple — even experienced data scientists use AutoML to accelerate their
work. For example, in Dataiku, users can select between the AutoML mode where the tool makes
a lot of optimized choices for you and the Expert Mode where you can control the details of the
model, write your own algorithms in code, or use deep learning models.

Note that in the AutoML mode, we’ll still be able to define the types of algorithms Dataiku will
train. This will let us choose between fast prototypes, interpretable models, or high performing
models with less interpretability.

Designing the Model

With Dataiku AutoML, you’ll still need to make a few decisions when creating your model (even
for fast prototypes):

1. Select a target variable. Since we are going with a supervised learning model, we
need to specify the target variable, or the variable whose values are to be predicted
by our model using the other variables. In other words, it’s what we want to predict.
In the case of our Haiku T-shirts project, the variable we want to predict is the
expected revenue that will be generated by new customers in the coming month.

Figure 3-1: Selecting a target variable using Dataiku AutoML.

2. Select the prediction type. After identifying the target variable, the next step would be
to select the prediction type. As a reminder, the main types of prediction — and how
we might use them for our example — are:

• Regression, or predicting a numerical value. As our objective is to predict the amount of revenue that will be generated by T-shirt sales to new customers, it makes the most sense for our sample model to choose regression as the prediction type. Alternatively, if we were to formulate a slightly different problem and target variable on the same Haiku T-shirts data, we could go with classification instead.

• Classification, meaning predicting a “class,” or an outcome, between several possible options. If, for instance, we want to determine whether new customers are likely to be “high spenders” (which can be defined, for instance, as anyone who has spent more than $50 on our products in one month), we could use two-class classification to predict whether a customer will be a high spender (class 1) or a non-high spender (class 2). Since in this case we have two possible outcomes, this would be two-class classification, but for another problem it might make sense to have more than two outcomes, in which case it would be multi-class classification.

Training on a Subset of the Data

When developing a machine learning model, it is important to be able to evaluate how well it is
able to map inputs to outputs and make accurate predictions. However, if you use data that the
model has already seen (during training for example) to evaluate the performance, then you will
not be able to detect problems like overfitting.

It is a standard in machine learning to first split your training data into a set for training and a
set for testing. There is no rule as to the exact size split to make, but it is sensible to reserve a
larger sample for training — a typical split is 80% training and 20% testing data. It is important
that the data is also randomly split so that you are getting a good representation of the patterns
that exist in the data in both sets.
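
Outside of a visual tool, a random 80/20 split is typically one function call; here is a sketch with placeholder data using scikit-learn:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # placeholder features (100 rows, 3 columns)
y = rng.normal(size=100)        # placeholder target

# Random 80/20 split; fixing random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 80 and 20
```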

EFFICIENCY TIP

Data science and machine learning platforms can do a lot of the heavy lifting for you when it
comes to train/test data. For example, during the model training phase, Dataiku “holds out”
the test set, and the model is only trained on the train set. Once the model is trained, Dataiku
evaluates its performance on the test set. This ensures that the evaluation is done on data that
the model has never seen before.

By default, Dataiku randomly splits the input dataset into a training and a test set, and the
fraction of data used for training can be customized (though again, 80% is a standard fraction
of data to use for training). For more advanced practitioners, there are lots more settings in
Dataiku that can be tweaked for the right train/test set specifications.

Selecting the Algorithm and Hyperparameters

Different algorithms have different strengths and weaknesses, so deciding which one to use
for a model depends largely on your business goals and priorities. Questions you might ask
yourself include:

How accurate does the model need to be? For some use cases, like fraud
detection, accuracy is critical. Others, like — say — a recommendation engine,
don’t need to be as accurate to be successful and provide business value.

Is interpretability important? If people need to understand how any individual
predictor is associated with the response, it might make sense to choose more
interpretable algorithms (e.g., linear regression).

How much does speed matter? Different algorithms will take different amounts
of time to score data, and in some business contexts, time is of the essence
(even just a few fractions of a second — think getting a quote for car insurance).
Decision trees are usually fast as well as accurate, which makes them a good
baseline choice.

Figure 3-2 presents an overview of the various models available, with the goal of presenting
some of the most common algorithms across a wide range of use cases. However, the list is
certainly not exhaustive — for example, there are multiple gradient boosting algorithms (e.g.,
GBM, XGBoost, LightGBM), and other algorithms you may have heard of that aren’t presented
at all (e.g., K-nearest neighbors, or KNN). To go a step further with a more exhaustive list, we
recommend Dataquest’s The 10 Best Machine Learning Algorithms for Data Science Beginners2.

Figure 3-2: A brief summary of the top algorithms used in predictive analysis.

2 https://www.dataquest.io/blog/top-10-machine-learning-algorithms-for-beginners/

It’s also important to understand the concept of libraries, which are sets of routines and
functions that are written in a given language, making it easier to perform complex ML tasks
without rewriting many lines of code. For example, Dataiku AutoML comes with support for four
different machine learning engines:

• In-memory Python (scikit-learn / XGBoost)
• MLlib (Spark) engine
• H2O (Sparkling Water) engine
• Vertica

Beyond choosing an algorithm, one other aspect of building an ML model is tuning
hyperparameters. Hyperparameters are parameters whose values are used to control the learning
process — think of them as the settings of the machine learning model. They matter because
they control the overall behavior of an ML model, and tuning or optimizing hyperparameters can
help your model perform better.

Performing hyperparameter optimization is really about searching for the best possible model
you can find within your time constraints. One way to refine the search space is to study which
hyperparameters are most important and focus on them. For a given machine learning task, it
is likely that changing the values of some hyperparameters will make a much larger difference
to the performance than others. Tuning the value of these hyperparameters can therefore bring
the greatest benefits.

Finding the best combination of hyperparameters is really hard, and uncovering how to do so
could be a topic in and of itself, so it is outside the scope of this guide. For a baseline model, AutoML
helps by quickly homing in on the most promising potential hyperparameter optimizations, and
therefore can help you build better models in a limited amount of time.
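
To see what a hyperparameter search looks like in code, here is a sketch of a grid search with cross-validation in scikit-learn; the data, algorithm, and hyperparameter values are placeholders, not recommendations:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))             # placeholder features
y = X[:, 0] * 2 + rng.normal(size=200)    # placeholder target

# Focus the search on a few impactful hyperparameters; cross-validation picks
# the combination that maximizes the chosen metric
param_grid = {"n_estimators": [100, 300], "max_depth": [5, 10, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="r2",  # the metric being optimized
    cv=5,          # 5-fold cross-validation on the training data
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```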

EFFICIENCY TIP

Dataiku supports several algorithms that can be used to train predictive models. We
recommend trying several different algorithms before deciding on one particular modeling
method — Dataiku AutoML makes side-by-side comparison of multiple algorithms easy.

Figure 3-3: Dataiku AutoML — quick modeling using random forest and
ridge regression for the T-shirt sales project.

For each algorithm that you select, you can ask Dataiku to explore several values for
each hyperparameter. Dataiku will automatically train models for each combination of
hyperparameters and only keep the best one (where “best” means that it maximizes the
metric chosen in the Metric section). To do so, Dataiku resplits the train set and extracts a
“cross-validation” set. It then repeatedly trains on the train set minus the cross-validation set
and verifies how the model performed on the cross-validation set.

In addition to algorithm selection and optimization, Dataiku’s automated machine
learning performs:

• Automatic feature handling, including handling of categorical and text variables,
handling of missing values, scaling, etc.
• Semi-automatic massive feature generation
• Optional feature selection

IV. Evaluating the Model


You’ve connected to data, explored it, cleaned it, and created a quick model. Now what? How do
you know if your model is any good? That’s where tracking and comparing model performance
across different algorithms comes in.

Metrics Evaluation and Optimization

There are several metrics for evaluating machine learning models depending on whether you
are working with a regression model or a classification model. It’s also worth noting that for
most algorithms, you’ll also choose a specific metric to optimize for during the model training.
However, that metric might not be as interpretable as some of the other evaluation metrics for
actually determining how well a model works.

For regression models, you want to look at mean squared error and R-squared (R2):

• Mean squared error is calculated by computing the square of all errors and
averaging them over all observations. The lower this number is, the more accurate
your predictions were.

• R2 (pronounced R-squared) is the percentage of the observed variance from the
mean that is explained (that is, predicted) by your model. R2 typically falls between 0
and 1 (it can be negative for a model that performs worse than simply predicting the
mean), and a higher number is better.
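
Both metrics are one function call in scikit-learn; the actual and predicted values below are made up:

```python
from sklearn.metrics import mean_squared_error, r2_score

y_true = [30.0, 45.0, 12.0, 60.0]   # actual revenue
y_pred = [28.0, 50.0, 15.0, 55.0]   # model predictions

print(mean_squared_error(y_true, y_pred))  # lower is better
print(r2_score(y_true, y_pred))            # closer to 1 is better
```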

For classification models:

The simplest metric for evaluating a model is accuracy. Accuracy is a
common word, but in this case we have a very specific way of calculating it:
accuracy is the percentage of observations that were correctly predicted by
the model. Accuracy is simple to understand but should be interpreted with
caution, in particular when the various classes to predict are unbalanced.

Another metric you might come across is the ROC AUC, which is a
measure of accuracy and stability. A higher ROC AUC generally means you
have a better model.

Logarithmic loss, or log loss, is a metric often used in competitions like
those run by Kaggle, and it is applied when your classification model
outputs not strict classifications (e.g., true and false) but class membership
probabilities (e.g., a 10% chance of being true, a 75% chance of being
true, etc.). Log loss applies heavier penalties to incorrect predictions that
your model made with high confidence.
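
Here is a minimal sketch of these three classification metrics in scikit-learn, with made-up classes and probabilities:

```python
from sklearn.metrics import accuracy_score, roc_auc_score, log_loss

y_true = [1, 0, 1, 1, 0]              # actual classes, e.g., high spender or not
y_prob = [0.9, 0.2, 0.75, 0.4, 0.1]   # predicted probability of class 1
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # strict classifications

print(accuracy_score(y_true, y_pred))  # share of correct predictions
print(roc_auc_score(y_true, y_prob))   # higher generally means a better model
print(log_loss(y_true, y_prob))        # penalizes confident wrong predictions
```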

EFFICIENCY TIP

Data science and machine learning platforms can help evaluate and optimize models quickly.
For example, you can choose the metric that Dataiku will use to evaluate models, and the
models will then be optimized according to the selected metric.

Figure 4-1: In the T-shirt example, the model created is optimized based on R2 score.

Overfitting and Regularization

We’ve mentioned overfitting in some of the previous sections, but at this point, it’s worth coming
back to it in more detail as it can be one of the biggest challenges to building a predictive model.
In a nutshell, when you train your model using the training set, the model learns the underlying
patterns in that training set in order to make predictions.

But the model also learns peculiarities of that data that don’t have any predictive value. And
when those peculiarities start to influence the prediction, we’ll do such a good job at explaining
our training set that the performance on the test set (and on any new data, for that matter) will
suffer. One remedy for overfitting is called regularization, which is basically just the process of
simplifying your model or of making it less specialized.

Figure 4-2: On the right is a visual representation of overfitting, or when a model is too well trained
(compared to the model on the left, which is robust and a good fit, i.e., just right!).

For linear regression, regularization takes the form of L2 and L1 regularization. The mathematics
of these approaches are out of our scope in this guide, but conceptually they’re fairly simple.
Imagine you have a regression model with a bunch of variables and a bunch of coefficients,
in the model y = C1a + C2b + C3c…, where the Cs are coefficients and a, b, and c are variables.
What L2 regularization does is reduce the magnitude of the coefficients, so that the impact of
individual variables is somewhat dulled.

Now, imagine that you have a lot of variables — dozens, hundreds, or even more — with small
but non-zero coefficients. L1 regularization just eliminates a lot of these variables, working
under the assumption that much of what they’re capturing is just noise.

For decision tree models, regularization can be achieved through setting tree depth. A deep
tree — that is, one with a lot of decision nodes — will be complex, and the deeper it is, the more
complex it is. By limiting the depth of a tree, by making it more shallow, we accept losing some
accuracy, but it will be more general.
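
To make regularization concrete, here is a scikit-learn sketch showing L2 (Ridge), L1 (Lasso), and tree-depth limiting on placeholder data; the alpha and max_depth values are illustrative, not recommendations:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # placeholder features
y = X[:, 0] * 3 + rng.normal(size=200)    # target mostly driven by one feature

# L2 (Ridge) shrinks coefficient magnitudes; L1 (Lasso) can zero them out
ridge = Ridge(alpha=1.0).fit(X, y)   # alpha controls regularization strength
lasso = Lasso(alpha=0.1).fit(X, y)
print(ridge.coef_)
print(lasso.coef_)  # expect most of the noisy coefficients at exactly 0

# For trees, limiting depth keeps the model shallower, simpler, and more general
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(tree.get_depth())  # at most 3
```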

V. Interpreting the Model
Model interpretability is the degree to which models — and their outcomes — can be understood
by humans. Model interpretability has become increasingly important over the past few years
for two reasons:

1. For the business, as they start to use the results of ML models for their day-to-day
work and decision making. Sometimes, they need to explain those results to end
customers — imagine, for example, a customer service representative who needs to
explain to a client why they were quoted a certain amount for car insurance.

2. As part of larger risk management strategies, especially as the use of ML and AI by
organizations is becoming increasingly regulated by government entities.

This section will unpack a few key interpretability techniques by which you can evaluate your
first model.

Black-Box vs. White-Box Models

We live in a world of black-box models and white-box models. On one hand, black-box models
have observable input-output relationships but lack clarity around their inner workings (think:
a model that takes customer attributes as inputs and outputs a car insurance rate, without a
discernible “how”). This is typical of deep learning and boosted/random forest models, which
model incredibly complex situations with high non-linearity and interactions between inputs.

On the other hand, white-box models have observable/understandable behaviors, features,
and relationships between the influencing variables and the output predictions (think: linear
regressions and decision trees), but they are often not as performant as black-box models (i.e.,
lower accuracy, but higher explainability).

In the ideal world, every model would be explainable and transparent. In the real world,
however, there is a time and place for both types of models. Not all decisions are the same,
and developing interpretable models is very challenging (and in some cases impossible — for
example, modeling a complex scenario or a high-dimensional space, as in image classification).
Even in less-complex scenarios, black-box models typically outperform white-box counterparts
due to black-box models' ability to capture high non-linearity and interactions between features
(e.g., a multi-layer neural network applied to a churn detection use case).

Despite higher performance, there are several downsides to black-box models:

The first downside is simply the lack of explainability, internally in a firm as
well as externally to customers and regulators seeking explanations for why
a decision was made (look at the case in 2019 of a black-box algorithm that
erroneously cut medical coverage to long-time patients).

The second downside is that there could be a host of unseen problems
impacting the output, including overfitting, spurious correlations, or "garbage
in / garbage out," that are impossible to catch due to the lack of understanding
of the black-box model’s operations.

Black-box models can also be computationally expensive compared to white-box
models (not to mention more expensive in terms of potential reputational harm
if they make poor decisions).

Another downside of not spending enough time understanding the reality
behind the black-box model is that it creates a "comprehension debt" that
must be repaid over time via difficulty sustaining performance, unexpected
effects like people gaming the system, or potential unfairness.

Finally, black-box models can lead to technical debt over time, whereby
the model must be reassessed and retrained more frequently as data drifts,
because it may rely on spurious, non-causal correlations that quickly vanish,
ultimately driving up OPEX costs.

EFFICIENCY TIP

Dataiku helps balance interpretability and accuracy by providing a data science, machine
learning, and AI platform built for best-practice methodology and governance throughout
a model’s entire lifecycle. The following features in Dataiku help users find the right balance
between black- and white-box models:

• Collaboration: Dataiku is an inclusive and collaborative data science and
machine learning platform, democratizing access to data and enabling
enterprises to build their own path to AI. By bringing the right people,
processes, data, and technologies together in a transparent way, strategic
decisions can be better made throughout the model lifecycle, including
tradeoffs between black-box and white-box models, leading to greater
understanding of, and trust in, model outputs.

• Data preparation and analysis: Dataiku offers data lineage (so you know where the
data originated from); easy-to-use data transformation and cleaning (to ensure data
quality); data analysis (to identify outliers and key data about the data); sensitivity
analysis (enhanced in our latest version); business knowledge enrichment
(through plugins and business meaning detection); and many other features.

• Machine learning models: Dataiku offers the freedom to approach the modeling
process in Expert Mode or leverage AutoML to quickly and easily generate models
using a variety of pre-canned white-box and black-box models (including logistic/
ridge/lasso regression, decision trees, random forests, XGBoost, SGD, K-nearest
neighbors, artificial neural networks, deep learning, and more), as well as the
ability to import your own notebook-based custom algorithms.

• Model monitoring: Assess the drift on the data to be scored, and assess when the
model must be retrained by comparing the new data with the model’s original
test set to understand the drift factor, using Drift score and Fugacity.

Interpretation Techniques for Advanced Beginners

If you want to turn your predictive models into a goldmine, you need to know how to interpret
their results, review performance, and retrieve information relative to their training. It can
get complicated, because the error measure to take into account may depend on the business
problem you want to solve. Indeed, to interpret and draw the most profit from your analysis, you
have to understand the basic concepts behind error assessment. And even more importantly,
you have to think about the actions you will take based on your analysis.

This section delves into a few common techniques for model interpretability, how they apply
to our T-shirt example, and how Dataiku can help. Note that this section goes in-depth on
interpretation techniques and is slightly more advanced than the rest of this guide, so if you’re
not ready for it yet, you can always come back to it once you have a few models under your belt.

Partial dependence plots

Partial dependence plots help explain the effect an individual feature has on model predictions.
For example, if you’re trying to predict patients’ hospital readmission rate and compute the
partial dependence plot of the Age feature, you may see that the likelihood of hospital
readmission increases with age.

For all models trained in Python (e.g., scikit-learn, keras, custom models), Dataiku can compute
and display partial dependence plots. The plot shows the dependence of the target on a single
selected feature. The x axis represents the value of the feature, while the y axis displays the
partial dependence. A negative partial dependence represents a negative relationship between
that feature value and the target, while a positive partial dependence represents a positive
relationship.
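
Outside of Dataiku, a similar plot can be computed with scikit-learn; this is a sketch on synthetic data using a hypothetical age_first_order feature, and rendering the plot requires matplotlib:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age_first_order": rng.integers(18, 80, size=300),
    "pages_visited": rng.integers(1, 30, size=300),
})
y = 0.01 * (X["age_first_order"] - 28) ** 2 + rng.normal(size=300)  # toy target

model = RandomForestRegressor(random_state=0).fit(X, y)

# Partial dependence of the model's prediction on a single feature
PartialDependenceDisplay.from_estimator(model, X, ["age_first_order"])
```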

For example, in the figure below from our T-shirt example, there is a negative relationship
between age_first_order and the target for people under 40. The relationship appears to be
roughly parabolic, with a minimum somewhere between ages 25 and 30. After age 40, the
relationship increases slowly, until a precipitous dropoff in old age.

The plot also displays the distribution of the feature, so that you can determine whether there
is sufficient data to interpret the relationship between the feature and target.

Figure 5-1: A partial dependence plot in Dataiku for the Haiku T-shirt example

Subpopulation analysis

Before pushing a model into production, one might first want to investigate whether the
model performs identically across different subpopulations. If the model is better at predicting
outcomes for one group than for another, it can lead to biased outcomes and unintended
consequences when it is put into production.

For regression and binary classification models trained in Python (e.g., scikit-learn, keras, custom
models), Dataiku can compute and display subpopulation analyses. The primary analysis is a
table of various statistics that you can compare across subpopulations, as defined by the values
of the selected column. You need to establish, for your use case, what constitutes “fair.”

For example, the table below shows a subpopulation analysis for the gender column in our
Haiku T-shirt example. The model-predicted probabilities for male and female are close, but
not quite identical. Depending upon the use case, we may decide that this difference is not
significant enough to warrant further investigation.

Figure 5-2: A subpopulation analysis in Dataiku for the Haiku T-shirt example
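
The same basic idea can be sketched outside of Dataiku by grouping a scored dataset and comparing simple statistics per group; the data below is hypothetical:

```python
import pandas as pd

# One row per customer, with the model's outputs alongside the actual outcome
scored = pd.DataFrame({
    "gender": ["M", "F", "F", "M", None, "F", "M", None],
    "proba_true": [0.8, 0.7, 0.3, 0.6, 0.9, 0.5, 0.4, 0.75],
    "actual": [1, 1, 0, 1, 1, 0, 0, 1],
})
scored["predicted"] = (scored["proba_true"] >= 0.5).astype(int)
scored["correct"] = (scored["predicted"] == scored["actual"]).astype(int)

# Average predicted probability and accuracy per subpopulation, keeping rows
# with missing gender as their own group
print(scored.groupby("gender", dropna=False)[["proba_true", "correct"]].mean())
```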

By clicking on a row in the table, Dataiku allows you to see more detailed statistics related to
the subpopulation represented by that row. For example, the figure below shows the expanded
display for rows whose value of gender is missing. The number of rows that are Actually True is
lower for this subpopulation than for males or females. By comparing the % of actual classes
view for this subpopulation versus the overall population, it looks like the model does a
significantly better job of predicting actually true rows with missing gender than otherwise.

The density chart suggests this may be because the class True density curve for missing
gender has a single mode around 0.75 probability. By contrast, the density curve for the overall
population has a second mode just below 0.5 probability.

Figure 5-3: More detailed subpopulation analysis in Dataiku for the Haiku T-shirt example

Individual prediction explanations

Partial dependence plots and subpopulation analyses look at features more broadly, but they
don’t provide insight into the factors behind each specific prediction that a model outputs —
that’s where individual prediction explanations come in.

The explanations are useful for understanding the prediction of an individual row and how
certain features impact it. A proportion of the difference between the row’s prediction and the
average prediction can be attributed to a given feature, using its explanation value. In other
words, you can think of an individual explanation as a set of feature importance values that are
specific to a given prediction.

Dataiku provides the capability to compute individual explanations of predictions for all Visual
ML models that are trained using the Python backend.

Figure 5-4: Individual prediction explanations in Dataiku DSS for the Haiku T-shirt example
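
Outside of Dataiku, one widely used way to compute this kind of explanation is Shapley values, for example via the shap library; here is a sketch on synthetic data, assuming a tree-based model and that shap is installed separately:

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age_first_order": rng.integers(18, 80, size=300),
    "pages_visited": rng.integers(1, 30, size=300),
})
y = X["pages_visited"] * 1.5 + rng.normal(size=300)  # toy target

model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Row 0's values attribute that prediction's deviation from the average
# prediction across the features: per-prediction feature importances
print(dict(zip(X.columns, shap_values[0])))
```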

Conclusion
In this ebook, we have walked through some of the main considerations when building your first
predictive ML model. Noticeably missing is the concept of operationalization, which is critically
important, but could be — and in fact is! — an ebook in and of itself. While outside the scope of
this guidebook, it’s important as a part of your journey into creating machine learning models
to understand what it means.

Operationalization is the process of converting data insights into actual large-scale business
and operational impact. This means bridging the huge gap between the exploratory work of
designing machine learning models and the industrial effort (not to mention precision) required
for deployment within actual production systems and processes. The process includes, but
is not limited to: testing, IT performance tuning, setting up a data monitoring strategy, and
monitoring operations. We highly recommend following up this guide with additional resources
to understand operationalization processes.

Another closing thought is this: one of the biggest mistakes that people make with regard to
machine learning is thinking that once a model is built and goes live, it will continue working as
normal indefinitely. On the contrary, models will actually degrade in quality over time if they’re
not continuously improved and fed new data.

Machine learning modeling is about capturing patterns in a moving and complex world.
Working with live data, one must monitor the performance of the model over time and update
it accordingly.

Ironically, in order to successfully complete your first data project, you need to recognize that
your model will never be fully “complete.” In order for it to remain useful and accurate, you need
to constantly reevaluate it, retrain it, and develop new features. If there’s anything you take away
from these fundamental steps in analytics and data science, it is that a data scientist’s job is
never really done, but that’s what makes working with data all the more fascinating!

Keep Exploring

Now that you’ve seen how a data science, machine learning, and AI platform can make
the job of building models much smoother and more accessible, why not give Dataiku a
try? Get started in less than two minutes with the 14-day free trial, or download Dataiku
Community edition (yours to keep forever!).

Download Now!

Everyday AI,
Extraordinary People
[Product graphic: Elastic Architecture Built for the Cloud, spanning Machine Learning, Visualization, Data Preparation, DataOps, Governance & MLOps, and Applications]

450+ CUSTOMERS | 45,000+ ACTIVE USERS

Dataiku is the world’s leading platform for Everyday AI, systemizing the use of data for
exceptional business results. Organizations that use Dataiku elevate their people (whether
technical and working in code or on the business side and low- or no-code) to extraordinary,
arming them with the ability to make better day-to-day decisions with data.

©2021 dataiku | dataiku.com


GUIDEBOOK
www.dataiku.com
