-
Notifications
You must be signed in to change notification settings - Fork 13
Update to YOLO. Add Sagemaker deployment #229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update to YOLO. Add Sagemaker deployment #229
Conversation
@daavoo could you point me please to some details / summary or update the description please - e.g. why YOLO (do we want / do we use the callback there?), what else has changed any why, etc. That make it easier for me to review. Thanks! |
from ultralytics import YOLO | ||
|
||
|
||
def add_callbacks(live, yolo): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shcheklein I am not using the built-in callback because:
-
I couldn't find a way to customize arguments (i.e. pass a custom dir, disable report)
It seems the other loggers in ultralytics rely on env vars. -
Most of the image plots didn't feel useful / were redundant with existing plots.
I was planning to send a patch to ultralytics repo to skip some of the images (i.e. the confusion matrix, a big image with all the linear plots, etc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code below looks complicated for the example repo . It should not belong there I think.
Most of the image plots didn't feel useful / were redundant with existing plots.
I was planning to send a patch to ultralytics repo to skip some of the images (i.e. the confusion matrix, a big image with all the linear plots, etc)
I would let users decide on that in the UX / UI (that should handle that case)
I couldn't find a way to customize arguments (i.e. pass a custom dir, disable report)
It seems the other loggers in ultralytics rely on env vars.
If we go with YOLO I think we don't have other options but figure this out - we can't maintain that level of duplication + code below is not simple at all to my taste (fine for an internal callback, but not for the example repo, especially not if we need to put it into notebooks)
- run: dvc pull results/train/artifacts/best.pt.dvc | ||
|
||
- env: | ||
VERSION: ${{steps.clean_version_name.outputs.version}} | ||
run: | | ||
bash sagemaker/bundle_and_upload_model.sh \ | ||
results/train/artifacts/best.pt \ | ||
s3://pool-segmentation/$VERSION/model.tar.gz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is option A
of How to make the bundled model accessible to Sagemaker
.
Explained here : https://www.notion.so/iterative/Sagemaker-Deployment-fce94fa0d81c44f395c10fc32940ab82#735578764d964f9d9c6d7ab9d8c5fa39
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we avoid moving and rebundling the model?
VERSION: ${{steps.clean_version_name.outputs.version}} | ||
run: | | ||
python sagemaker/deploy_model.py \ | ||
--name $VERSION \ | ||
--model_data s3://pool-segmentation/$VERSION/model.tar.gz \ | ||
--role ${{ secrets.AWS_ROLE_TO_ASSUME }} \ | ||
--instance_type 'ml.c4.xlarge' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is option A
of How to create endpoints based on MR events
.
Explained here : https://www.notion.so/iterative/Sagemaker-Deployment-fce94fa0d81c44f395c10fc32940ab82#f3773819ebce4659ad397ed92686d56c
on: | ||
push: | ||
# When a new version is registered in Studio Model Registry | ||
tags: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did not use gto action because it did simplify / add much value for the deployment pattern used
@@ -1,6 +1,7 @@ | |||
{ | |||
"name": "example-repos-dev", | |||
"image": "mcr.microsoft.com/devcontainers/python:3.10", | |||
"runArgs": ["--ipc=host"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could remind please, why it was needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without it, the PyTorch dataloaders ran out of memory
@@ -0,0 +1,46 @@ | |||
import logging |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just standard code suggested by sagemaker docs :
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file could be removed if we create a version of the dataset in YOLO format instead of the current format with png masks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a simple script to showcase how to query the endpoint
thanks for clarifying this. does it make it easier to deploy? I wonder if those changes are related at all to each other? could have we done them one my one? (easier to review, etc) |
I found a working example of YOLO + Sagemaker that I could adapt but I did not find that for fast.ai
I can split into 2 parts. Yolo and then sagemaker |
make sense. Nw, let's keep it as is. |
- name: Set up Python | ||
uses: actions/setup-python@v4 | ||
with: | ||
python-version: '3.8' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn' work with the newer one?
- uses: aws-actions/configure-aws-credentials@v2 | ||
with: | ||
aws-region: us-east-1 | ||
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use openid please?
@@ -11,8 +8,6 @@ This is an auto-generated repository for use in [DVC](https://dvc.org) | |||
This is a Computer Vision (CV) project that solves the problem of segmenting out | |||
swimming pools from satellite images. | |||
|
|||
[Example results](./results/evaluate/plots/images/) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we still try to include some examples?
@@ -1,3 +1,2 @@ | |||
/pool_data | |||
/test_data | |||
/train_data | |||
/yolo_dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it need to be prefixed with yolo_
- is that dataset yolo specific?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is not needed. just to give a name that indicates the format
example-get-started-experiments/code/sagemaker/bundle_and_upload_model.sh
Show resolved
Hide resolved
fire |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does fire and shapely do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fire is to remove the boilerplate of adding argparse. takes function signature and makes it work from CLI arguments
shapely needed to convert the png masks to yolo format, which expect polygons in a .txt
output_path: str = "predictions", | ||
): | ||
|
||
if endpoint_name is not None and model_path is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need two inferences (two python scripts) ... both of them something related to endpoints
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(sagemaker/code/inference.py
) is bundled alongside the model file and uploaded to Sagemaker, it can't be used from a user perspective.
It includes methods following conventions required by Sagemaker.
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu118 | ||
pip install jupyter | ||
yolo settings datasets_dir=/workspaces/example-repos-dev/example-get-started-experiments/build/example-get-started-experiments/data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, how will it look like in docs - we'll we have to explain all this in the get started?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this is only needed during generation inside codespaces because YOLO resolves the paths wrong inside generate.sh
.
For someone cloning the repo it works without running the command
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be it's worth splitting this indeed to unblock it faster. I like that we are trying yolo, it make it more useful for actual customers and scenarios, but it seems we need to do some homework for that to look decent / simple to use and something that can fit into get started docs for the experiments. I would love to hear @dberenbaum opinion as well. May be we should do a leap of faith and just push though it and learn from it and keep doing improvements, etc. Let's decide on this. Duplication and complexity (e.g. custom logger) bothers me in this case.
On the deployment:
- I think we need an app that we could drop an image, so that we can demo it
- Do we keep instances / deployment running 24/7 for the demo purposes? What will be the workflow there? Can it be deployed in a serverless / on-demand mode?
- Good questions on the lifecycle - which endpoint we keep alive, etc. Don't know the answer yet to this.
- What is the lifecycle for code + models - can I change code and deploy with an older version of a model - how is it determined?
Moved the YOLO part to #232 . |
Migrate to YOLO
Make the decision because:
Not only in general, but in our particular case more than 1 customer/prospect has explicitly reported using YOLO but (afaik) we did not hear that from fast.ai.
If we remove the custom callback, it is
YOLO.train
vs ~60 lines of fast.aiSagemaker deployment
Not that I am sure the approach here is the best but at least we could start the discussion.
Create https://www.notion.so/iterative/Sagemaker-Deployment-fce94fa0d81c44f395c10fc32940ab82#f3773819ebce4659ad397ed92686d56c to discuss.
TODO: