222 add deployment of example get started experiments model #233

Merged
1 change: 1 addition & 0 deletions .devcontainer.json
@@ -1,6 +1,7 @@
 {
   "name": "example-repos-dev",
   "image": "mcr.microsoft.com/devcontainers/python:3.10",
+  "runArgs": ["--ipc=host"],
   "extensions": ["Iterative.dvc", "ms-python.python", "redhat.vscode-yaml"],
   "features": {
     "ghcr.io/devcontainers/features/nvidia-cuda:1": {
70 changes: 70 additions & 0 deletions example-get-started-experiments/code/.github/workflows/deploy-model.yml
@@ -0,0 +1,70 @@
name: Deploy model

on:
  push:
    # When a stage is assigned to a model version in Studio Model Registry
    tags:
      - "results/train=pool-segmentation#*"

  workflow_dispatch:
    inputs:
      version:
        description: 'Manual version name'
        required: true
        type: string

permissions:
  contents: write
  id-token: write

jobs:
  parse:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0
      - name: "Parse GTO tag"
        id: gto
        uses: iterative/gto-action@14723404a00bb0c1e759c02ffcd24279df5815c2
    outputs:
      event: ${{ steps.gto.outputs.event }}
      name: ${{ steps.gto.outputs.name }}
      stage: ${{ steps.gto.outputs.stage }}
      version: ${{ steps.gto.outputs.version }}

  deploy-model:
    needs: parse
    if: ${{ needs.parse.outputs.event == 'assignment' }}
    environment: cloud
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0

      - uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-region: us-east-2
          role-to-assume: ${{ secrets.AWS_ROLE_TO_ASSUME }}
          role-duration-seconds: 43200

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.8'
          cache: 'pip'
          cache-dependency-path: requirements.txt

      - run: pip install -r requirements.txt

      - run: dvc remote add -d --local storage s3://dvc-public/remote/get-started-pools

      - run: |
          MODEL_DATA=$(dvc get --show-url . model.tar.gz)
          python sagemaker/deploy_model.py \
            --name ${{ needs.parse.outputs.name }} \
            --stage ${{ needs.parse.outputs.stage }} \
            --version ${{ needs.parse.outputs.version }} \
            --model_data $MODEL_DATA \
            --role ${{ secrets.AWS_ROLE_TO_ASSUME }}

Contributor commented on the dvc get --show-url line:

@daavoo Doesn't this get the model data for the latest commit instead of the specific version we are trying to deploy? From the DVC docs, when --rev is not specified, "The latest commit (in the default branch) is used by default when this option is not specified."

daavoo (Contributor, Author) replied on Aug 17, 2023:

I assume that phrase in the docs is meant for the scenario where you pass a remote repo URL as the first argument. I think the behavior here is correct, but I am going to double-check.

I think it works because we are using the local repo (.), so the current state of the workspace is used. The workflow uses actions/checkout and runs on git tag creation, so the workspace will be at the revision where the tag was created.
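If you wanted the revision pinned explicitly instead of relying on the checked-out workspace, DVC's Python API offers the same lookup. A minimal sketch, assuming it runs inside the checkout and that GITHUB_REF_NAME holds the GTO tag that triggered the run:

import os

import dvc.api

# Resolve the remote-storage URL of model.tar.gz at an explicit revision,
# rather than at whatever revision happens to be checked out.
url = dvc.api.get_url(
    "model.tar.gz",
    repo=".",
    rev=os.environ["GITHUB_REF_NAME"],  # e.g. the triggering GTO tag
)
print(url)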
@@ -62,7 +62,7 @@ jobs:
     - uses: aws-actions/configure-aws-credentials@v1
       with:
         aws-region: us-east-2
-        role-to-assume: arn:aws:iam::342840881361:role/SandboxUser
+        role-to-assume: ${{ secrets.AWS_ROLE_TO_ASSUME }}
         role-duration-seconds: 43200
     - name: Create Runner
       env:
@@ -90,7 +90,7 @@ jobs:
     - uses: aws-actions/configure-aws-credentials@v1
       with:
         aws-region: us-east-2
-        role-to-assume: arn:aws:iam::342840881361:role/SandboxUser
+        role-to-assume: ${{ secrets.AWS_ROLE_TO_ASSUME }}
         role-duration-seconds: 43200

     - run: pip install -r requirements.txt

Member commented on the role-to-assume change:

do we have to generalize it in this way? (an extra step for us to take care of)

daavoo (Contributor, Author): It was suggested by @jesper7 and @tapadipti to use the role as a secret. I assume it has some security implications, although I am not sure what the danger of leaking a role name is.

Reviewer: @daavoo How did you set it up for https://github.com/iterative/example-get-started-experiments? Should we add it to the readme here?

Member: Note that roles can be public.
15 changes: 5 additions & 10 deletions example-get-started-experiments/code/README.md
@@ -108,16 +108,11 @@ This tag also contains a GitHub Actions workflow that reruns the pipeline if any
[CML](https://cml.dev/) is used in this workflow to provision a cloud-based GPU
machine as well as report model performance results in Pull Requests.

-## Deploying the model
-
-Check out the [PR](https://github.com/iterative/example-get-started-experiments/pulls)
-that adds this model to
-[Iterative Studio Model Registry](https://dvc.org/doc/studio/user-guide/model-registry/what-is-a-model-registry).
-You can [trigger CI/CD](https://dvc.org/doc/studio/user-guide/model-registry/use-models#deploying-and-publishing-models-in-cicd)
-by [registering versions](https://dvc.org/doc/studio/user-guide/model-registry/register-version)
-and [assigning stages](https://dvc.org/doc/studio/user-guide/model-registry/assign-stage)
-in Model Registry, building and publishing Docker images with the model,
-or deploying the model to the cloud.
+## Model Deployment
+
+Check out the [GitHub Workflow](https://github.com/iterative/example-get-started-experiments/blob/main/.github/workflows/deploy-model.yml)
+that uses the [Iterative Studio Model Registry](https://dvc.org/doc/studio/user-guide/model-registry/what-is-a-model-registry)
+to deploy the model to [AWS SageMaker](https://aws.amazon.com/es/sagemaker/) whenever a new [version is registered](https://dvc.org/doc/studio/user-guide/model-registry/register-version).

 ## Project structure
3 changes: 2 additions & 1 deletion example-get-started-experiments/code/requirements.txt
@@ -1,4 +1,5 @@
dvc[s3]>=3.0
dvclive>=2.11.3
fastai
-python-box
+python-box
+sagemaker
50 changes: 50 additions & 0 deletions example-get-started-experiments/code/sagemaker/code/inference.py
@@ -0,0 +1,50 @@
"""
Reference:
https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#id4
"""
import io
import os

import numpy as np
import torch
from PIL import Image
from torchvision.transforms import Compose, Normalize, Resize, ToTensor


def model_fn(model_dir, context):
kwargs = {
"f": os.path.join(model_dir, "code/model.pth")
}
if not torch.cuda.is_available():
kwargs["map_location"] = torch.device("cpu")
model = torch.load(**kwargs)
return model


def input_fn(request_body, request_content_type, context):
if request_content_type:
img_pil = Image.open(io.BytesIO(request_body))
img_transform = Compose([Resize(512), ToTensor(), Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])
img_tensor = img_transform(img_pil).unsqueeze_(0)
return img_tensor
else:
raise ValueError(f"Unsupported request_content_type {request_content_type}")


def predict_fn(input_object, model, context):
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
with torch.no_grad():
result = model(input_object)
return result


def output_fn(prediction_output, content_type):
output = np.array(
prediction_output[:, 1, :] > 0.5, dtype=np.uint8
)
if torch.cuda.is_available():
output = output.cpu()
buffer = io.BytesIO()
np.save(buffer, output)
return buffer.getvalue()
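Because these four handlers are plain functions, they can be smoke-tested locally before the archive is built. A minimal sketch, assuming inference.py is on the import path, model.pth sits at ./code/model.pth relative to the working directory, and sample.png is a hypothetical test image:

import inference  # the handlers above

# model_fn expects <model_dir>/code/model.pth, so "." works when the
# archive layout (code/model.pth) is mirrored in the working directory.
model = inference.model_fn(".", context=None)

with open("sample.png", "rb") as f:  # hypothetical test image
    body = f.read()

tensor = inference.input_fn(body, "application/x-image", context=None)
prediction = inference.predict_fn(tensor, model, context=None)
payload = inference.output_fn(prediction, "application/x-npy")
print(f"{len(payload)} bytes of serialized mask")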
4 changes: 4 additions & 0 deletions example-get-started-experiments/code/sagemaker/code/requirements.txt
@@ -0,0 +1,4 @@
fastai
pillow
torch
torchvision
78 changes: 78 additions & 0 deletions example-get-started-experiments/code/sagemaker/deploy_model.py
@@ -0,0 +1,78 @@
import logging
import re
import sys

from sagemaker.deserializers import JSONDeserializer
from sagemaker.pytorch import PyTorchModel
from sagemaker.serverless import ServerlessInferenceConfig


memory_size = {
    "dev": 4096,
    "staging": 4096,
    "prod": 6144,
    "default": 4096,
}
max_concurrency = {
    "dev": 5,
    "staging": 5,
    "prod": 10,
    "default": 5,
}


def deploy(
    name: str,
    stage: str,
    version: str,
    model_data: str,
    role: str,
):
    sagemaker_logger = logging.getLogger("sagemaker")
    sagemaker_logger.setLevel(logging.DEBUG)
    sagemaker_logger.addHandler(logging.StreamHandler(sys.stdout))

    # SageMaker resource names only allow alphanumeric characters and hyphens.
    composed_name = re.sub(r"[^a-zA-Z0-9\-]", "-", f"{name}-{version}-{stage}")
Comment on lines +35 to +36:

Reviewer: Do you know what possible characters can appear in f"{name}-{version}-{stage}" that would need to be replaced here? I guess it's related to iterative/dvc#9821?

daavoo (Contributor, Author) replied on Aug 14, 2023: This regex is only relevant for SageMaker, and it is enforced by their API. Off the top of my head, the composed name includes "/" and ":" from the name and "." from the version.

Reviewer: I think the rationale for the naming restrictions in GTO was to make it more likely they would be compatible with systems like SageMaker. Not sure it really helps, though. We still hit these restrictions, and users don't know why they can't use these characters.
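To make that concrete, a small illustration of what the sanitization does to a GTO-style name (the input values below are hypothetical):

import re

# Hypothetical inputs: "/" and ":" come from the artifact name, "." from the version.
name, version, stage = "results/train:pool-segmentation", "v1.0.0", "dev"
composed = re.sub(r"[^a-zA-Z0-9\-]", "-", f"{name}-{version}-{stage}")
print(composed)  # results-train-pool-segmentation-v1-0-0-dev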


    model = PyTorchModel(
        name=composed_name,
        model_data=model_data,
        framework_version="1.12",
        py_version="py38",
        role=role,
        env={
            "SAGEMAKER_MODEL_SERVER_TIMEOUT": "3600",
            "TS_MAX_RESPONSE_SIZE": "2000000000",
            "TS_MAX_REQUEST_SIZE": "2000000000",
            "MMS_MAX_RESPONSE_SIZE": "2000000000",
            "MMS_MAX_REQUEST_SIZE": "2000000000",
        },
    )

Contributor commented on the PyTorchModel call:

@daavoo If I created a SageMaker model for a given model version before deploying it in one environment (e.g., dev), does this code recreate the same model before deploying it in another environment (e.g., prod)? Or does it skip recreating it and just return a reference to the old model?

daavoo (Contributor, Author): It currently creates a new model. If we want the latter, we should only add stage to the composed name after the model has been created.

tapadipti (Contributor) replied on Aug 17, 2023: Referring to this msg: you said that the existing model would be re-used, right? Would removing stage from the model name make this happen?

daavoo (Contributor, Author): Yes.

tapadipti (Contributor): OK, I'll create a PR to remove stage from the model name.

Contributor: PR


    return model.deploy(
        initial_instance_count=1,
        deserializer=JSONDeserializer(),
        endpoint_name=composed_name,
        serverless_inference_config=ServerlessInferenceConfig(
            memory_size_in_mb=memory_size[stage],
            max_concurrency=max_concurrency[stage],
        ),
    )

Contributor commented on the model.deploy call:

@daavoo If I am trying to deploy a new model version to an existing stage, does this code create a new endpoint, or does it update the existing endpoint? (I need to confirm this for the blog post.)

daavoo (Contributor, Author): Not sure I understand the first part, but a new endpoint is created from combining the name, version, and stage, so any change to those three things results in a new endpoint.

Contributor: So if I deploy v1 to prod today and v2 to prod tomorrow, will I have two different endpoints with two different endpoint names? Shouldn't the prod endpoint always be the same (just like the Studio prod version is always studio.iterative.ai), so that any clients running inference against the prod endpoint don't have to update their endpoint addresses each time there's a new deployment?

Contributor: The version shouldn't be part of the endpoint name, right?

daavoo (Contributor, Author): I think it is a matter of opinion and depends on the pattern we expect/want to showcase:

A) The endpoint is directly queried by external consumers.
B) There is some app and URL that uses the right endpoint internally without exposing it to external consumers.

studio.iterative.ai is that kind of user-facing app and URL, and it only "points" to the correct software because there is GitOps bumping the internal Studio version (i.e. https://github.com/iterative/itops/pull/2238).

Contributor: So then we should also create a service to delete stale endpoints, right? Else we will end up with a bunch of endpoints over time.

daavoo (Contributor, Author): I would expect that to be configured/handled on the AWS side. Anyhow, as per your suggestions, I will send a new PR creating/updating a single endpoint per stage.
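Following up on that thread, here is a minimal sketch of what single-endpoint-per-stage naming could look like. It illustrates the proposed follow-up, not the code merged in this PR, and the helper below is hypothetical:

import re


def resource_names(name: str, version: str, stage: str):
    def sanitize(s):
        return re.sub(r"[^a-zA-Z0-9\-]", "-", s)

    model_name = sanitize(f"{name}-{version}")   # unique per registered version
    endpoint_name = sanitize(f"{name}-{stage}")  # stable per stage, reused across versions
    return model_name, endpoint_name


print(resource_names("pool-segmentation", "v1.0.0", "prod"))
# ('pool-segmentation-v1-0-0', 'pool-segmentation-prod')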


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Deploy a model to Amazon SageMaker")

    parser.add_argument("--name", type=str, required=True, help="Name of the model")
    parser.add_argument("--stage", type=str, required=True, help="Stage of the model")
    parser.add_argument("--version", type=str, required=True, help="Version of the model")
    parser.add_argument("--model_data", type=str, required=True, help="S3 location of the model data")
    parser.add_argument("--role", type=str, required=True, help="ARN of the IAM role to use")

    args = parser.parse_args()

    deploy(name=args.name, stage=args.stage, version=args.version, model_data=args.model_data, role=args.role)
56 changes: 56 additions & 0 deletions example-get-started-experiments/code/src/endpoint_prediction.py
@@ -0,0 +1,56 @@
from io import BytesIO
from pathlib import Path

import dvc.api
import numpy as np
from PIL import Image
from sagemaker.deserializers import NumpyDeserializer
from sagemaker.pytorch import PyTorchPredictor
from sagemaker.serializers import IdentitySerializer


def paint_mask(mask, color_map={0: (0, 0, 0), 1: (0, 0, 255)}):
    # Turn a 2D class mask into an RGB image using the given color map.
    vis_shape = mask.shape + (3,)
    vis = np.zeros(vis_shape)
    for i, c in color_map.items():
        vis[mask == i] = c
    return Image.fromarray(vis.astype(np.uint8))


def endpoint_prediction(
    img_path: str,
    endpoint_name: str,
    output_path: str = "predictions",
):
    params = dvc.api.params_show()
    img_size = params["train"]["img_size"]
    predictor = PyTorchPredictor(
        endpoint_name, serializer=IdentitySerializer(), deserializer=NumpyDeserializer()
    )
    name = endpoint_name

    output_file = Path(output_path) / name / Path(img_path).name
    output_file.parent.mkdir(exist_ok=True, parents=True)

    # Resize the input to the size used in training and send the raw PNG bytes.
    buffer = BytesIO()
    Image.open(img_path).resize((img_size, img_size)).save(buffer, format="PNG")
    result = predictor.predict(buffer.getvalue())[0]

    # Blend the predicted mask over the original image for visual inspection.
    img_pil = Image.open(img_path)
    overlay_img_pil = Image.blend(
        img_pil.convert("RGBA"),
        paint_mask(result).convert("RGBA").resize(img_pil.size),
        0.5,
    )
    overlay_img_pil.save(str(output_file.with_suffix(".png")))


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Run inference on an image using a SageMaker endpoint")
    parser.add_argument("--img_path", type=str, help="path to the input image")
    parser.add_argument("--endpoint_name", type=str, help="name of the SageMaker endpoint to use")
    parser.add_argument("--output_path", type=str, default="predictions", help="path to save the output predictions")

    args = parser.parse_args()

    endpoint_prediction(args.img_path, args.endpoint_name, args.output_path)
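A usage sketch; both the input image and the endpoint name here are hypothetical, the latter being whatever composed_name deploy_model.py produced for your model name, version, and stage:

from src.endpoint_prediction import endpoint_prediction

endpoint_prediction(
    img_path="data/test_data/sample.jpg",  # hypothetical input image
    endpoint_name="results-train-pool-segmentation-v0-1-0-dev",  # hypothetical
    output_path="predictions",
)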
1 change: 1 addition & 0 deletions example-get-started-experiments/code/src/train.py
@@ -68,6 +68,7 @@ def train():
     models_dir = Path("models")
     models_dir.mkdir(exist_ok=True)
     learn.export(fname=(models_dir / "model.pkl").absolute())

+    torch.save(learn.model, (models_dir / "model.pth").absolute())

Reviewer commented on the added torch.save line:

Is this still needed/useful? Seems odd to have both export and save here.

daavoo (Contributor, Author): I would have to change the evaluate stage to use the .pth format. I was not able to make SageMaker work with the fastai model exported in .pkl format.

Reviewer: Have you tried it? I would say let's do it if it's not that hard.

daavoo (Contributor, Author): I assume it is possible, but I didn't try it, to avoid including other changes.
     live.log_artifact(
         str(models_dir / "model.pkl"),
         type="model",

Contributor commented on the logged artifact:

@daavoo If we logged the tar file of the model, then we wouldn't need to specify the file name model.tar.gz in the deployment script, right? The deployment script would be able to get the file name from the Git tag itself, which means that a single deployment script could be used for deploying several models with different names. Right?

Contributor: Right now, in the blog post, I show that there's an artifact entry for model.pkl (so the model registry shows model.pkl). But the file that is getting deployed is model.tar.gz. We know that they are the same model, but it still looks like a disconnect between the model registry and the deployment script.
14 changes: 8 additions & 6 deletions example-get-started-experiments/generate.sh
@@ -90,11 +90,10 @@ git add .
 tick
 git commit -m "Run notebook and apply best experiment"
 git tag -a "1-notebook-dvclive" -m "Experiment using Notebook"
-gto register results/train:pool-segmentation --version v1.0.0
-gto assign results/train:pool-segmentation --version v1.0.0 --stage dev

 cp -r $HERE/code/src .
+cp -r $HERE/code/sagemaker .
 cp $HERE/code/params.yaml .
 sed -e "s/base_lr: 0.01/base_lr: $BEST_EXP_BASE_LR/" -i".bkp" params.yaml
 rm params.yaml.bkp
@@ -109,14 +108,18 @@ dvc remove models/model.pkl.dvc
 dvc stage add -n train \
   -p base,train \
   -d src/train.py -d data/train_data \
-  -o models/model.pkl \
+  -o models/model.pkl -o models/model.pth \
   python src/train.py

Reviewer commented on the changed outputs: Related to the comment above, do we still need models/model.pkl?

 dvc stage add -n evaluate \
   -p base,evaluate \
   -d src/evaluate.py -d models/model.pkl -d data/test_data \
   python src/evaluate.py

+dvc stage add -n sagemaker \
+  -d models/model.pth -o model.tar.gz \
+  'cp models/model.pth sagemaker/code/model.pth && cd sagemaker && tar -cpzf model.tar.gz code/ && cd .. && mv sagemaker/model.tar.gz . && rm sagemaker/code/model.pth'

Reviewer commented on the sagemaker stage:

Some minor thoughts to try to clean this up a bit. I haven't tested this, but I wonder if we can simplify it with something like the following?

Suggested change:
-'cp models/model.pth sagemaker/code/model.pth && cd sagemaker && tar -cpzf model.tar.gz code/ && cd .. && mv sagemaker/model.tar.gz . && rm sagemaker/code/model.pth'
+'cp models/model.pth sagemaker/out && cp sagemaker/code sagemaker/out && tar -cpzf model.tar.gz sagemaker/out'

Also, I wonder if it would be better to append directly to dvc.yaml and use list syntax for cmd instead of &&?

daavoo (Contributor, Author): I didn't manage to use a list cmd with dvc stage add (which we use in the generate.sh script), but I might be missing something.
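On the list-syntax question: dvc.yaml itself does accept a list for cmd (the commands run in sequence), even though dvc stage add takes a single string. A hypothetical hand-written entry with the same deps and outs as the stage above:

# Hypothetical dvc.yaml entry; assumes the same archive layout as the
# `dvc stage add -n sagemaker` call in generate.sh.
stages:
  sagemaker:
    cmd:
      - cp models/model.pth sagemaker/code/model.pth
      - tar -cpzf model.tar.gz -C sagemaker code
      - rm sagemaker/code/model.pth
    deps:
      - models/model.pth
    outs:
      - model.tar.gz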


git add .
tick
git commit -m "Convert Notebook to dvc.yaml pipeline"
@@ -127,9 +130,8 @@ git add .
tick
git commit -m "Run dvc.yaml pipeline"
git tag -a "2-dvc-pipeline" -m "Experiment using dvc pipeline"
-gto register results/train:pool-segmentation --version v1.0.1
-gto assign results/train:pool-segmentation --version v1.0.0 --stage prod
-gto assign results/train:pool-segmentation --version v1.0.1 --stage dev
+gto register results/train:pool-segmentation --version v0.1.0
+gto assign results/train:pool-segmentation --version v0.1.0 --stage dev
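For context on how these commands connect to the deploy workflow's trigger: gto records registrations and assignments as Git tags, with ":" in the artifact name encoded as "=" (":" is not allowed in Git ref names). If my reading of gto's tag format is right, the commands above produce tags roughly like:

results/train=pool-segmentation@v0.1.0    # registration
results/train=pool-segmentation#dev#1     # stage assignment; matched by "results/train=pool-segmentation#*"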

export GIT_AUTHOR_NAME="David de la Iglesia"
export GIT_AUTHOR_EMAIL="daviddelaiglesiacastro@gmail.com"