Build Reliable Machine Learning Pipelines With Continuous Integration


As a data scientist, you are responsible for improving the model currently in
production. After spending months fine-tuning the model, you discover one
with greater accuracy than the original.

Excited by your breakthrough, you create a pull request to merge your
model into the main branch.

Unfortunately, because of the numerous changes, your team takes over a
week to evaluate and analyze them, which ultimately impedes the project's
progress.

Furthermore, after deploying the model, you identify unexpected behaviors
resulting from code errors, causing the company to lose money.

In retrospect, automating code and model testing after submitting a pull
request would have prevented these problems and saved both time and
money.

Continuous Integration (CI) offers an easy solution for this issue.

What is CI?
CI is the practice of continuously merging and testing code changes into a
shared repository. In a machine learning project, CI can be very useful for
several reasons:

Catching errors early: CI automatically tests every code change, so
problems are detected early in the development phase.

Ensuring reproducibility: CI establishes clear and consistent testing
procedures, making it easier to replicate machine learning project results.

Faster feedback and decision-making: By providing clear metrics and
parameters, CI enables faster feedback and decision-making, freeing up
reviewer time for more critical tasks.

This article will show you how to create a CI pipeline for a machine-learning
project.

Feel free to play with and fork the source code of this article here.

CI Pipeline Overview
The approach to building a CI pipeline for a machine-learning project can
vary depending on the workflow of each company. In this project, we will
create one of the most common workflows to build a CI pipeline:

1. Data scientists make changes to the code, creating a new model locally.

2. Data scientists push the new model to remote storage.

3. Data scientists create a pull request for the changes.

4. A CI pipeline is triggered to test the code and model.

5. If changes are approved, they are merged into the main branch.

Let’s illustrate an example based on this workflow.

Build the Workflow
Suppose experiment C performs exceptionally well after trying out various
processing techniques and ML models. As a result, we aim to merge the
code and model into the main branch.

To accomplish this, we need to perform the following steps:

1. Version the inputs and outputs of the experiment.

2. Upload the model and data to remote storage.

3. Create test files to test the code and model.

4. Create a GitHub workflow.

Now, let’s explore each of these steps in detail.
Version inputs and outputs of an experiment
We will use DVC to version the inputs and outputs of an experiment
pipeline, including code, data, and model.

The pipeline is defined based on the file locations in the project:

We will describe the stages of the pipeline and the data dependencies
between them in the dvc.yaml file:

stages:
  process:
    cmd: python src/
    deps:
      - data/raw
      - src/
    outs:
      - data/intermediate
  train:
    cmd: python src/
    deps:
      - data/intermediate
      - src/
    outs:
      - model/svm.pkl
  evaluate:
    cmd: python src/
    deps:
      - model
      - data/intermediate
      - src/
    metrics:
      - dvclive/metrics.json
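
To see how the data dependencies determine the order of the stages, here is a minimal Python sketch. It is only an illustration of the idea, not how DVC is implemented: the STAGES dictionary copies the data paths from the stages above (the script dependencies are left out), and the helper names overlaps and execution_order are hypothetical.

```python
from graphlib import TopologicalSorter
from pathlib import PurePosixPath

# Data paths copied from the pipeline stages; script dependencies are omitted.
STAGES = {
    "process": {"deps": ["data/raw"], "outs": ["data/intermediate"]},
    "train": {"deps": ["data/intermediate"], "outs": ["model/svm.pkl"]},
    "evaluate": {
        "deps": ["model", "data/intermediate"],
        "outs": ["dvclive/metrics.json"],
    },
}


def overlaps(first, second):
    """Return True if the two paths are equal or one contains the other."""
    a, b = PurePosixPath(first), PurePosixPath(second)
    return a == b or a in b.parents or b in a.parents


def execution_order(stages):
    """Order stages so that each one runs after the stages it depends on."""
    graph = {name: set() for name in stages}
    for name, stage in stages.items():
        for other, upstream in stages.items():
            if other == name:
                continue
            # A stage depends on another if it reads something the other writes.
            if any(
                overlaps(dep, out)
                for dep in stage["deps"]
                for out in upstream["outs"]
            ):
                graph[name].add(other)
    return list(TopologicalSorter(graph).static_order())


print(execution_order(STAGES))
```

Because evaluate reads the model directory that train writes into, and train reads the data that process writes, the stages can only run in one order.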

To run the experiment pipeline defined in dvc.yaml, type the following
command in your terminal:

dvc exp run

We will get the following output:

'data/raw.dvc' didn't change, skipping

Running stage 'process':
> python src/

Running stage 'train':

> python src/
Updating lock file 'dvc.lock'

Running stage 'evaluate':

> python src/
The model's accuracy is 0.65
Updating lock file 'dvc.lock'

Ran experiment(s): drear-cusp

Experiment results have been applied to your workspace.
The run will automatically generate the dvc.lock file that stores the exact
versions of the data, code, and dependencies between them. Using the same
versions of the inputs and outputs makes sure that the same experiment can
be reproduced in the future.

schema: '2.0'
  cmd: python src/
  - path: data/raw
    md5: 84a0e37242f885ea418b9953761d35de.dir
    size: 84199
    nfiles: 2
  - path: src/
    md5: 8c10093c63780b397c4b5ebed46c1154
    size: 1157
        raw: data/raw/winequality-red.csv
        intermediate: data/intermediate
        feature: quality
        test_size: 0.2
  - path: data/intermediate
    md5: 3377ebd11434a04b64fe3ca5cb3cc455.dir
    size: 194875
    nfiles: 4
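
The idea behind the recorded md5 values can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not DVC's actual implementation: it hashes single files only, and the helper names file_md5 and changed_dependencies are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_dependencies(recorded_hashes):
    """Return the paths whose current hash differs from the recorded one."""
    return [
        path
        for path, recorded in recorded_hashes.items()
        if file_md5(path) != recorded
    ]


with tempfile.TemporaryDirectory() as workdir:
    script = Path(workdir) / "train.py"
    script.write_text("C = 1\n")
    # Record the hash, as a lock file would after a successful run.
    recorded = {str(script): file_md5(script)}
    print(changed_dependencies(recorded))  # nothing has changed yet

    script.write_text("C = 100\n")
    print(changed_dependencies(recorded))  # the edited file is reported
```

When no recorded hash differs from the current one, a stage has nothing new to do, which is why an unchanged input is skipped in the output above.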
Upload data and model to a remote storage
DVC makes it easy to upload data files and models produced by the
pipeline stages in the dvc.yaml file to a remote storage location.

Before uploading our files, we will specify the remote storage location in the
file .dvc/config :

['remote "read"']
   url =
['remote "read-write"']
   url = s3://winequality-red/

Make sure to replace the “read-write” remote storage URI with the URI of
your own S3 bucket.

Push files to the remote storage location named “read-write”:

dvc push -r read-write

Create tests
We will also write tests that verify the code responsible for processing
data, the code responsible for training the model, and the model itself,
ensuring that the code and model meet our expectations.

View all test files here.
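
To give a feel for what such tests can look like, here is a small sketch in the style pytest collects (functions whose names start with test_). The functions split_rows and accuracy are hypothetical stand-ins written for this example, not the project's real code, and the 0.65 threshold is only borrowed from the accuracy printed earlier.

```python
import random


def split_rows(rows, test_size, seed=0):
    """Hypothetical stand-in for the data-splitting step of the project."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def test_split_respects_test_size():
    train, test = split_rows(range(100), test_size=0.2)
    assert len(train) == 80
    assert len(test) == 20


def test_split_keeps_every_row_once():
    train, test = split_rows(range(100), test_size=0.2)
    assert sorted(train + test) == list(range(100))


def test_accuracy_meets_threshold():
    # Toy predictions; a real test would load the trained model instead.
    labels = [1, 0, 1, 1, 0]
    predictions = [1, 0, 1, 0, 0]
    assert accuracy(predictions, labels) >= 0.65
```

The first two tests check the processing code, while the last one checks model quality against a minimum we are willing to accept.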

Create a GitHub workflow
Now comes the exciting part: creating a GitHub workflow to automate
the testing of your data and model! If you are not familiar with GitHub
workflows, I recommend reading this article for a quick overview.

We will create the workflow called Test code and model in the file
.github/workflows/run_test.yaml :

name: Test code and model

on:
  pull_request:
    paths:
      - src/**
      - tests/**
      - params.yaml

jobs:
  test_model:
    name: Test processed code and model
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v2

      - name: Environment setup
        uses: actions/setup-python@v2

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Pull data and model
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        run: dvc pull -r read-write

      - name: Run tests
        run: pytest

      - name: Evaluate model
        run: dvc exp run evaluate

      - name: Iterative CML setup
        uses: iterative/setup-cml@v1

      - name: Create CML report
        env:
          REPO_TOKEN: ${{ secrets.TOKEN_GITHUB }}
        run: |
          # Add the metrics to the report
          dvc metrics show --show-md >>
          # Add the parameters to the report
          cat dvclive/params.yaml >>
          # Create a report in PR
          cml comment create

The on field specifies that the pipeline is triggered on a pull request event.
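
The workflow also lists the path patterns src/**, tests/**, and params.yaml, so only pull requests touching those files start the pipeline. The sketch below approximates that filter in Python. It is an assumption-laden illustration: fnmatchcase from the standard library does not follow GitHub's exact glob rules, and should_trigger is a hypothetical helper name.

```python
from fnmatch import fnmatchcase

# Patterns copied from the workflow file.
PATH_PATTERNS = ["src/**", "tests/**", "params.yaml"]


def should_trigger(changed_files, patterns=PATH_PATTERNS):
    """Return True if any changed file matches any of the path patterns."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in patterns
    )


print(should_trigger(["params.yaml"]))          # a tracked file changed
print(should_trigger(["README.md"]))            # no tracked file changed
print(should_trigger(["docs/a.md", "src/x.py"]))  # one match is enough
```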

The test_model job includes the following steps:

Checking out the code

Setting up the Python environment

Installing dependencies

Pulling data and models from a remote storage location using DVC

Running tests using pytest

Evaluating the model using DVC experiments

Setting up Iterative's CML (Continuous Machine Learning)

Creating a report with metrics and parameters, and commenting on the
pull request with the report using CML

Note that for the job to function properly, it requires the following:

AWS credentials to pull the data and model

GitHub token to comment on the pull request.

To ensure the secure storage of sensitive information in our repository and
enable GitHub Actions to access it, we will use encrypted secrets.
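
If a secret is missing, the job tends to fail late with a confusing error. One optional safeguard, sketched here as our own suggestion rather than something the project includes, is a small check that runs first and names the missing variables. The variable names come from the workflow; missing_variables is a hypothetical helper.

```python
import os

# Environment variable names used by the workflow steps.
REQUIRED_VARIABLES = ["AWS_ACCESS_KEY_ID", "REPO_TOKEN"]


def missing_variables(environ, required=REQUIRED_VARIABLES):
    """Return the required variables that are unset or empty."""
    return [name for name in required if not environ.get(name)]


# Demonstration with a fake environment instead of the real one.
fake_environ = {"AWS_ACCESS_KEY_ID": "example-value"}
print(missing_variables(fake_environ))

# In a real job, pass os.environ and stop early when the list is not empty.
print(len(missing_variables(os.environ)) <= len(REQUIRED_VARIABLES))
```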

That’s it! Now let’s try out this project and see if it works as we expected.
Try it Out
To try out this project, start with creating a new repository using the project

Clone the repository to your local machine:

git clone

Set up the environment:

# Go to the project directory

cd cicd-mlops-demo

# Create a new branch

git checkout -b experiment

# Install dependencies
pip install -r requirements.txt
Pull data from the remote storage location called “read”:

dvc pull -r read

Create experiments
The GitHub workflow will be triggered if any changes are made to the
params.yaml file or files in the src and tests directories. To illustrate this,
we will make some minor changes to the params.yaml file:

Next, let’s create a new experiment with the change:

dvc exp run

Push the modified data and model to the remote storage called “read-write”:

dvc push -r read-write

Add, commit, and push changes to the repository:

git add .
git commit -m 'add 100 for C'
git push origin experiment
Create a pull request
Next, create a pull request by clicking the Contribute button.

After creating a pull request in the repository, a GitHub workflow will be
triggered to run tests on the code and model.

If all the tests pass, a comment will be added to the pull request, containing
the metrics and parameters of the new experiment.
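
To picture what such a comment contains, here is a sketch that renders metrics and parameters as Markdown tables. It is not how CML or DVC build the report; render_report is a hypothetical helper, and the values (an accuracy of 0.65 and C set to 100) are only illustrative, borrowed from the output and commit message in this article.

```python
def render_table(title, first_column, values):
    """Render a dictionary as a titled two-column Markdown table."""
    lines = [f"## {title}", "", f"| {first_column} | Value |", "| --- | --- |"]
    for name, value in sorted(values.items()):
        lines.append(f"| {name} | {value} |")
    return lines


def render_report(metrics, params):
    """Combine the metrics and parameters tables into one Markdown report."""
    lines = render_table("Metrics", "Metric", metrics)
    lines.append("")
    lines += render_table("Parameters", "Parameter", params)
    return "\n".join(lines)


# Illustrative values only.
print(render_report({"accuracy": 0.65}, {"C": 100}))
```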

This information makes it easier for reviewers to understand the changes
made to the code and model. As a result, they can quickly evaluate whether
the changes meet the expected performance criteria and decide whether to
approve the PR for merging into the main branch. How cool is that?
