Build Reliable Machine Learning Pipelines With Continuous Integration


Scenario
As a data scientist, you are responsible for improving the model currently in
production. After spending months fine-tuning the model, you discover one
with greater accuracy than the original.

Excited by your breakthrough, you create a pull request to merge your model into the main branch.

Unfortunately, because of the numerous changes, your team takes over a
week to evaluate and analyze them, which ultimately impedes project
progress.

Furthermore, after deploying the model, you identify unexpected behaviors resulting from code errors, causing the company to lose money.

In retrospect, automating code and model testing after a pull request is submitted would have prevented these problems and saved both time and money.

Continuous Integration (CI) offers an easy solution for this issue.


What is CI?
CI is the practice of continuously merging code changes into a shared repository and automatically testing them. In a machine learning project, CI can be very useful for several reasons:

Catching errors early: CI facilitates the early identification of errors by automatically testing any code changes made, enabling timely problem detection during the development phase.
Ensuring reproducibility: CI helps ensure reproducibility by establishing
clear and consistent testing procedures, making it easier to replicate
machine learning project results.

Faster feedback and decision-making: By providing clear metrics and parameters, CI enables faster feedback and decision-making, freeing up reviewer time for more critical tasks.

This article will show you how to create a CI pipeline for a machine learning project.

Feel free to play with and fork the source code of this article here.
CI Pipeline Overview
The approach to building a CI pipeline for a machine learning project can vary depending on the workflow of each company. In this project, we will build a CI pipeline around one of the most common workflows:

1. Data scientists make changes to the code, creating a new model locally.

2. Data scientists push the new model to remote storage.

3. Data scientists create a pull request for the changes.


4. A CI pipeline is triggered to test the code and model.

5. If changes are approved, they are merged into the main branch.

Let’s illustrate an example based on this workflow.


Build the Workflow
Suppose that, after trying out various processing techniques and ML models, experiment C performs exceptionally well. As a result, we aim to merge its code and model into the main branch.

To accomplish this, we need to perform the following steps:

1. Version the inputs and outputs of the experiment.

2. Upload the model and data to remote storage.


3. Create test files to test the code and model.

4. Create a GitHub workflow.


Now, let’s explore each of these steps in detail.
Version inputs and outputs of an experiment
We will use DVC to version the inputs and outputs of an experiment pipeline, including code, data, and model.

The pipeline is defined based on the file locations in the project.
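
The layout below is a rough sketch inferred from the pipeline stages and parameters shown in this article; the exact contents of the repository may differ:

.
├── .github/workflows/run_test.yaml   # CI workflow (created later in this article)
├── data
│   ├── raw                           # raw data, tracked by DVC
│   └── intermediate                  # processed data produced by the process stage
├── model
│   └── svm.pkl                       # trained model produced by the train stage
├── src
│   ├── process_data.py
│   ├── train.py
│   └── evaluate.py
├── tests                             # test files run by pytest
├── dvclive                           # metrics and params logged during evaluation
├── params.yaml
└── dvc.yaml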


We will describe the stages of the pipeline and the data dependencies
between them in the dvc.yaml file:

stages:
process:
  cmd: python src/process_data.py
  deps:
    - data/raw
    - src/process_data.py
  outs:
    - data/intermediate
train:
  cmd: python src/train.py
  deps:
    - data/intermediate
    - src/train.py
  outs:
    - model/svm.pkl
evaluate:
  cmd: python src/evaluate.py
  deps:
    - model
    - data/intermediate
    - src/evaluate.py
  metrics:
    - dvclive/metrics.json
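
As a point of reference, the process stage's script might look something like the sketch below. This is a hypothetical reconstruction based on the parameters that appear later in dvc.lock (data.raw, data.intermediate, process.feature, process.test_size); the actual script in the repository may differ.

# src/process_data.py - hypothetical sketch of the "process" stage
import yaml
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

# Load the pipeline parameters (the keys match the params.yaml shown in dvc.lock)
with open("params.yaml") as f:
    params = yaml.safe_load(f)

# Read the raw data and separate the target column
df = pd.read_csv(params["data"]["raw"])
X = df.drop(columns=[params["process"]["feature"]])
y = df[params["process"]["feature"]]

# Hold out a test set of the configured size
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=params["process"]["test_size"], random_state=0
)

# Write the four intermediate files consumed by the later stages
out_dir = Path(params["data"]["intermediate"])
out_dir.mkdir(parents=True, exist_ok=True)
X_train.to_csv(out_dir / "X_train.csv", index=False)
X_test.to_csv(out_dir / "X_test.csv", index=False)
y_train.to_csv(out_dir / "y_train.csv", index=False)
y_test.to_csv(out_dir / "y_test.csv", index=False)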

To run an experiment pipeline defined in dvc.yaml, type the following command in your terminal:

dvc exp run


We will get the following output:

'data/raw.dvc' didn't change, skipping
Running stage 'process':
> python src/process_data.py

Running stage 'train':
> python src/train.py
Updating lock file 'dvc.lock'

Running stage 'evaluate':
> python src/evaluate.py
The model's accuracy is 0.65
Updating lock file 'dvc.lock'

Ran experiment(s): drear-cusp
Experiment results have been applied to your workspace.
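
For reference, the evaluate stage's script, which prints this accuracy and writes dvclive/metrics.json, might look something like the hypothetical sketch below (the file names inside data/intermediate are assumptions):

# src/evaluate.py - hypothetical sketch of the "evaluate" stage
import pickle
import pandas as pd
from dvclive import Live

# Load the trained model produced by the train stage
with open("model/svm.pkl", "rb") as f:
    model = pickle.load(f)

# Load the held-out test set (file names are assumptions)
X_test = pd.read_csv("data/intermediate/X_test.csv")
y_test = pd.read_csv("data/intermediate/y_test.csv").squeeze()

accuracy = model.score(X_test, y_test)
print(f"The model's accuracy is {accuracy}")

# DVCLive writes the logged metric to dvclive/metrics.json by default
with Live() as live:
    live.log_metric("accuracy", accuracy)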
The run will automatically generate the dvc.lock file, which stores the exact versions of the data, the code, and the dependencies between them. Pinning these versions ensures that the same experiment can be reproduced in the future.

schema: '2.0'
stages:
process:
  cmd: python src/process_data.py
  deps:
  - path: data/raw
    md5: 84a0e37242f885ea418b9953761d35de.dir
    size: 84199
    nfiles: 2
  - path: src/process_data.py
    md5: 8c10093c63780b397c4b5ebed46c1154
    size: 1157
  params:
    params.yaml:
      data:
        raw: data/raw/winequality-red.csv
        intermediate: data/intermediate
      process:
        feature: quality
        test_size: 0.2
  outs:
  - path: data/intermediate
    md5: 3377ebd11434a04b64fe3ca5cb3cc455.dir
    size: 194875
    nfiles: 4
Upload data and model to a remote storage
DVC makes it easy to upload data files and models produced by the
pipeline stages in the dvc.yaml file to a remote storage location.

Before uploading our files, we will specify the remote storage location in the file .dvc/config:

['remote "read"']
   url = https://winequality-red.s3.amazonaws.com/
['remote "read-write"']
   url = s3://winequality-red/

Make sure to replace the “read-write” remote storage URI with the URI of your own S3 bucket.

Push files to the remote storage location named “read-write”:

dvc push -r read-write


Create tests
We will also write tests that verify the code responsible for processing data, the code for training the model, and the model itself, ensuring that both the code and the model meet our expectations.

View all test files here.
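
For illustration, a model test could simply assert a minimum accuracy on the held-out set. The sketch below is hypothetical; the file names and the 0.6 threshold are assumptions:

# tests/test_model.py - hypothetical example of a model test
import pickle
import pandas as pd

def test_model_accuracy_above_threshold():
    # Load the model produced by the train stage
    with open("model/svm.pkl", "rb") as f:
        model = pickle.load(f)

    # Load the held-out test set (file names are assumptions)
    X_test = pd.read_csv("data/intermediate/X_test.csv")
    y_test = pd.read_csv("data/intermediate/y_test.csv").squeeze()

    # Fail the CI run if accuracy drops below the agreed threshold
    assert model.score(X_test, y_test) >= 0.6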


Create a GitHub workflow
Now comes the exciting part: creating a GitHub workflow to automate the testing of your data and model! If you are not familiar with GitHub workflows, I recommend reading this article for a quick overview.

We will create the workflow called Test code and model in the file .github/workflows/run_test.yaml:

name: Test code and model

on:
  pull_request:
    paths:
      - src/**
      - tests/**
      - params.yaml

jobs:
  test_model:
    name: Test processed code and model
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v2

      - name: Environment setup
        uses: actions/setup-python@v2

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Pull data and model
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc pull -r read-write

      - name: Run tests
        run: pytest

      - name: Evaluate model
        run: dvc exp run evaluate

      - name: Iterative CML setup
        uses: iterative/setup-cml@v1

      - name: Create CML report
        env:
          REPO_TOKEN: ${{ secrets.TOKEN_GITHUB }}
        run: |
          # Add the metrics to the report
          dvc metrics show --show-md >> report.md
          # Add the parameters to the report
          cat dvclive/params.yaml >> report.md
          # Create a report in PR
          cml comment create report.md

The on field specifies that the workflow is triggered by pull request events that change any of the listed paths (src, tests, or params.yaml).

The test_model job includes the following steps:

Checking out the code

Setting up the Python environment

Installing dependencies

Pulling data and models from a remote storage location using DVC

Running tests using pytest

Evaluating the model using DVC experiments

Setting up the Iterative CML (Continuous Machine Learning) environment

Creating a report with metrics and parameters, and commenting on the pull request with the report using CML.
Note that for the job to function properly, it requires the following:

AWS credentials to pull the data and model

GitHub token to comment on the pull request.

To store sensitive information securely in our repository while still allowing GitHub Actions to access it, we will use encrypted secrets, which can be added under the repository settings (Settings > Secrets and variables > Actions).

That’s it! Now let’s try out this project and see if it works as we expected.
Try it Out
Setup
To try out this project, start by creating a new repository using the project template.

Clone the repository to your local machine:

git clone https://github.com/your-username/cicd-mlops-demo

Set up the environment:

# Go to the project directory
cd cicd-mlops-demo

# Create a new branch
git checkout -b experiment

# Install dependencies
pip install -r requirements.txt

Pull data from the remote storage location called “read”:

dvc pull -r read


Create experiments
The GitHub workflow will be triggered if any changes are made to the params.yaml file or to files in the src and tests directories. To illustrate this, we will make a minor change to the params.yaml file: adding 100 to the list of candidate values for the SVM hyperparameter C.

Next, let’s create a new experiment with the change:

dvc exp run


Push the modified data and model to the remote storage location called “read-write”:

dvc push -r read-write

Add, commit, and push changes to the repository:

git add .
git commit -m 'add 100 for C'
git push origin experiment
Create a pull request
Next, create a pull request by clicking the Contribute button.

After creating a pull request in the repository, a GitHub workflow will be triggered to run tests on the code and model.
If all the tests pass, a comment will be added to the pull request, containing
the metrics and parameters of the new experiment.

This information makes it easier for reviewers to understand the changes made to the code and model. As a result, they can quickly evaluate whether the changes meet the expected performance criteria and decide whether to approve the PR for merging into the main branch. How cool is that?
