Build Reliable Machine Learning Pipelines With Continuous Integration
Scenario
As a data scientist, you are responsible for improving the model currently in
production. After spending months fine-tuning it, you discover a model with
greater accuracy than the original.
This article will show you how to create a CI pipeline for a machine-learning
project.
Feel free to play with and fork the source code of this article here.
CI Pipeline Overview
The approach to building a CI pipeline for a machine-learning project can
vary depending on each company's workflow. In this project, we will build a
CI pipeline around one of the most common workflows:
1. Data scientists make changes to the code, creating a new model locally.
2. Data scientists push the new model and data to remote storage.
3. Data scientists create a pull request for the changes.
4. A CI pipeline is triggered to test the code and the model.
5. If the changes are approved, they are merged into the main branch.
We will use DVC to define the stages that produce the model in the file dvc.yaml:
stages:
  process:
    cmd: python src/process_data.py
    deps:
      - data/raw
      - src/process_data.py
    outs:
      - data/intermediate
  train:
    cmd: python src/train.py
    deps:
      - data/intermediate
      - src/train.py
    outs:
      - model/svm.pkl
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - model
      - data/intermediate
      - src/evaluate.py
    metrics:
      - dvclive/metrics.json
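Run the pipeline (DVC executes only the stages whose dependencies have changed):

dvc repro

After the run, DVC records the dependencies, parameters, and outputs of each
stage in the file dvc.lock. Here is the record for the process stage: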
schema: '2.0'
stages:
  process:
    cmd: python src/process_data.py
    deps:
      - path: data/raw
        md5: 84a0e37242f885ea418b9953761d35de.dir
        size: 84199
        nfiles: 2
      - path: src/process_data.py
        md5: 8c10093c63780b397c4b5ebed46c1154
        size: 1157
    params:
      params.yaml:
        data:
          raw: data/raw/winequality-red.csv
          intermediate: data/intermediate
        process:
          feature: quality
          test_size: 0.2
    outs:
      - path: data/intermediate
        md5: 3377ebd11434a04b64fe3ca5cb3cc455.dir
        size: 194875
        nfiles: 4
Upload data and model to remote storage
DVC makes it easy to upload data files and models produced by the
pipeline stages in the dvc.yaml file to a remote storage location.
Before uploading our files, we will specify the remote storage location in the
file .dvc/config:
['remote "read"']
url = https://winequality-red.s3.amazonaws.com/
['remote "read-write"']
url = s3://winequality-red/
Make sure to replace the URI of the "read-write" remote storage with the URI
of your own S3 bucket.
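With both remotes configured, upload the data and model produced by the
pipeline to the writable remote:

# push the pipeline outputs (data and model) to the "read-write" remote
dvc push -r read-write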
Create a GitHub Workflow
We will create the workflow called Test code and model in the file
.github/workflows/run_test.yaml:
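The exact contents of this file are project-specific; below is a minimal
sketch, assuming Python 3.10, AWS credentials stored in the repository secrets
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, and tests living in a tests
folder:

name: Test code and model

on: pull_request

jobs:
  test_model:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Pull data and model with DVC
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc pull -r read-write

      - name: Run tests
        run: pytest tests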
The on field specifies that the pipeline is triggered on a pull request event.
The steps of the workflow include:
- Pulling data and models from a remote storage location using DVC
- Running tests using pytest (a sample test is sketched below)
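For illustration, a test for the model might read the metrics written by the
evaluate stage and enforce a minimum quality bar. The file path comes from the
dvc.yaml above, while the metric key and threshold are assumptions:

# tests/test_model.py: a rough sketch; the "accuracy" key and 0.7 threshold are assumptions
import json

def test_model_accuracy():
    # dvclive/metrics.json is the metrics file declared in the evaluate stage
    with open("dvclive/metrics.json") as f:
        metrics = json.load(f)
    # fail the CI run if the new model falls below the agreed baseline
    assert metrics["accuracy"] >= 0.7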
That’s it! Now let’s try out this project and see if it works as we expected.
Try it Out
Setup
To try out this project, start by creating a new repository from the project
template, then clone it to your local machine.
# Install dependencies
pip install -r requirements.txt
Pull data from the remote storage location called “read”:
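# pull the versioned data from the "read" remote configured in .dvc/config
dvc pull -r read

Next, create a new branch called experiment and change the code so that it
produces a new model, for example by setting the SVM parameter C to 100.
Rerun the pipeline with dvc repro, upload the new model with
dvc push -r read-write, then commit and push the changes: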
git add .
git commit -m 'add 100 for C'
git push origin experiment
Create a pull request
Next, create a pull request by clicking the Contribute button.