222 add deployment of example get started experiments model #233
Conversation
example-get-started-experiments/code/.github/workflows/deploy-model.yml
Added some change suggestions for the deploy GH action. Will add related changes to the deploy-model file.
example-get-started-experiments/code/.github/workflows/deploy-model.yml
example-get-started-experiments/code/.github/workflows/deploy-model.yml
example-get-started-experiments/code/.github/workflows/deploy-model.yml
Added some suggestions for creating different configs and endpoints per stage, and creating a serverless endpoint.
composed_name = re.sub(
    r"[^a-zA-Z0-9\-]", "-", f"{name}-{version}-{stage}")
Do you know what possible characters can appear in f"{name}-{version}-{stage}" that would need to be replaced here? I guess it's related to iterative/dvc#9821?
I guess it's related to iterative/dvc#9821?
This regex is only relevant for SageMaker, and it is enforced by their API. Off the top of my head, the composed name includes / and : from the name, and . from the version.
I think the rationale for the naming restrictions in GTO was to make it more likely they would be compatible with systems like SageMaker. Not sure it really helps, though. We still hit these restrictions, and users don't know why they can't use these characters.
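The sanitization being discussed can be illustrated with a small self-contained sketch (the sample name/version/stage values below are hypothetical, chosen to show the characters mentioned above):

```python
import re

# GTO-style artifact names can contain ":" and "/", and semver versions
# contain "." -- SageMaker resource names only allow [a-zA-Z0-9-], so
# every other character gets replaced with "-".
name, version, stage = "results:train/pool-segmentation", "v0.1.0", "dev"
composed_name = re.sub(r"[^a-zA-Z0-9\-]", "-", f"{name}-{version}-{stage}")
print(composed_name)  # results-train-pool-segmentation-v0-1-0-dev
```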
python src/train.py

dvc stage add -n evaluate \
  -p base,evaluate \
  -d src/evaluate.py -d models/model.pkl -d data/test_data \
  python src/evaluate.py

dvc stage add -n sagemaker \
  -d models/model.pth -o model.tar.gz \
  'cp models/model.pth sagemaker/code/model.pth && cd sagemaker && tar -cpzf model.tar.gz code/ && cd .. && mv sagemaker/model.tar.gz . && rm sagemaker/code/model.pth'
Some minor thoughts to try to clean this up a bit. I haven't tested this, but I wonder if we can simplify it with something like this?

- 'cp models/model.pth sagemaker/code/model.pth && cd sagemaker && tar -cpzf model.tar.gz code/ && cd .. && mv sagemaker/model.tar.gz . && rm sagemaker/code/model.pth'
+ 'cp models/model.pth sagemaker/out && cp sagemaker/code sagemaker/out && tar -cpzf model.tar.gz sagemaker/out'

Also, I wonder if it would be better to append directly to dvc.yaml and use list syntax for cmd instead of &&?
Also, I wonder if it would be better to append directly to dvc.yaml and use list syntax for cmd instead of &&?
I didn't manage to use a list cmd with dvc stage add (which we use in the generate.sh script), but I might be missing something.
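For illustration, an appended dvc.yaml entry using list syntax for cmd might look like this (hand-written sketch, untested; DVC runs each list item as a separate command from the repo root):

```yaml
stages:
  sagemaker:
    cmd:
      - cp models/model.pth sagemaker/code/model.pth
      - cd sagemaker && tar -cpzf model.tar.gz code/
      - mv sagemaker/model.tar.gz .
      - rm sagemaker/code/model.pth
    deps:
      - models/model.pth
    outs:
      - model.tar.gz
```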
It would be nice to address my comments from this PR, but I don't see any actual blockers here. Nice work @daavoo!
Can you try with the Sandbox account?
I assume we don't want to make the actual endpoint public, but rather a simple UI that queries the endpoint.
example-get-started-experiments/code/.github/workflows/deploy-model.yml
@@ -90,7 +90,7 @@ jobs:
       - uses: aws-actions/configure-aws-credentials@v1
         with:
           aws-region: us-east-2
-          role-to-assume: arn:aws:iam::342840881361:role/SandboxUser
+          role-to-assume: ${{ secrets.AWS_ROLE_TO_ASSUME }}
Do we have to generalize it in this way? (It's an extra step for us to take care of.)
It was suggested by @jesper7 and @tapadipti to use the role as a secret. I assume it has some security implications, although I am not sure what the danger of leaking a role name is.
@daavoo How did you set it up for https://github.com/iterative/example-get-started-experiments? Should we add it to the readme here?
Note that roles can be public.
Looks really great! Looks simple and clean.
@daavoo Looks like this PR is close to getting merged. Since this uses one of our official demo repos, we could use this in the blog post instead of the demo-fashion-mnist that I have currently used. wdyt? I can try to replace the example snippets in the blog post to use your snippets. And you might wanna rewrite some of the text. We'll not have a web UI, but that should be ok.

Makes sense to me. I would perhaps also use the opportunity to cut the scope of the post a little by dropping DVC details in favor of pointers to the dvc get-started pages.

Ok. I'll share an updated version of the blog post tomorrow.

Merging as the endpoint is now working. Don't hesitate to open followups.
- run: dvc remote add -d --local storage s3://dvc-public/remote/get-started-pools

- run: |
    MODEL_DATA=$(dvc get --show-url . model.tar.gz)
@daavoo Doesn't this get the model data for the latest commit instead of the specific version we are trying to deploy? From the DVC docs, when --rev is not specified: "The latest commit (in the default branch) is used by default when this option is not specified."
I assume that the phrase in the docs is meant for the scenario where you pass a remote repo URL as the first argument. I think the behavior here is correct, but I am going to double-check. I think it works because we are using the local repo (.), so the current status of the workspace will be used. The workflow uses actions/checkout and runs on Git tag creation, so the workspace will be at the revision where the tag was created.
composed_name = re.sub(
    r"[^a-zA-Z0-9\-]", "-", f"{name}-{version}-{stage}")

model = PyTorchModel(
@daavoo If I created a SageMaker model for a given model version before deploying it in one environment (e.g., dev), does this code recreate the same model before deploying it in another environment (e.g., prod)? Or does it skip recreating it and just return a reference to the old model?
It currently creates a new model. If we want the latter, we should only add stage to the composed name after the model has been created.
Referring to this msg: you said that the existing model would be re-used, right? Would removing stage from the model name make this happen?
Would removing stage from the model name make this happen?
Yes
OK, I'll create a PR to remove stage from the model name.
)

return model.deploy(
@daavoo If I am trying to deploy a new model version to an existing stage, does this code create a new endpoint or does it update the existing endpoint? (I need to confirm this for the blog post)
@daavoo If I am trying to deploy a new model version to an existing stage, does this code create a new endpoint or does it update the existing endpoint? (I need to confirm this for the blog post)
Not sure I understand the first part, but a new endpoint is created by combining the name, version, and stage, so any change to those three things results in a new endpoint.
So if I deploy v1 to prod today, and v2 to prod tomorrow, will I have 2 different endpoints with 2 different endpoint names? Shouldn't the prod endpoint always be the same? (just like studio prod version is always studio.iterative.ai) So that any clients running inference against the prod endpoint don't have to update their endpoint addresses each time there's a new deployment.
The version shouldn't be part of the endpoint name, right?
Shouldn't the prod endpoint always be the same?
I think it is a matter of opinion and depends on the pattern we expect/want to showcase:
A) The endpoint is directly queried by external consumers
B) There is some app&url that uses the right endpoint internally without exposing it to external consumers.
(just like studio prod version is always studio.iterative.ai) So that any clients running inference against the prod endpoint don't have to update their endpoint addresses each time there's a new deployment.
That is the user-facing app&url, and it only "points" to the correct software because there is a GitOps process bumping the internal Studio version (i.e. https://github.com/iterative/itops/pull/2238).
So then we should also create a service to delete stale endpoints, right? Else, we will end up with a bunch of endpoints over time.
So then we should also create a service to delete stale endpoints, right? Else, we will end up with a bunch of endpoints over time.
I would expect that to be configured/handled on the AWS side. Anyhow, as per your suggestions, I will send a new PR creating/updating a single endpoint per stage.
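A minimal sketch of what that follow-up could look like (hypothetical helper: only the stage, not the version, goes into the endpoint name, so clients keep a stable address across deployments):

```python
import re

def endpoint_name(name: str, stage: str) -> str:
    # Hypothetical helper: one endpoint per stage. SageMaker endpoint names
    # only allow [a-zA-Z0-9-], so any other character is replaced with "-".
    return re.sub(r"[^a-zA-Z0-9\-]", "-", f"{name}-{stage}")

print(endpoint_name("pool-segmentation", "prod"))  # pool-segmentation-prod
```

Deploying v2 after v1 would then target the same per-stage endpoint name, so the deployment can update it in place instead of accumulating stale endpoints.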
@@ -68,6 +68,7 @@ def train():
     models_dir = Path("models")
     models_dir.mkdir(exist_ok=True)
     learn.export(fname=(models_dir / "model.pkl").absolute())
+    torch.save(learn.model, (models_dir / "model.pth").absolute())
     live.log_artifact(
@daavoo If we logged the tar file of the model, then we wouldn't need to specify the file name model.tar.gz in the deployment script, right? The deployment script would be able to get the file name from the Git tag itself, which means that a single deployment script could be used for deploying several models with different names. Right?
Right now, in the blog post, I show that there's an artifact entry for model.pkl (so the model registry shows model.pkl). But the file that is getting deployed is model.tar.gz. We know that they are the same model, but it still looks like a disconnect between the model registry and the deployment script.
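For context, the packaging step behind that model.tar.gz can be sketched in plain Python (hypothetical helper and paths; the actual pipeline does this with cp/tar in the dvc stage above):

```python
import tarfile
from pathlib import Path

def package_model(model_path: str, code_dir: str,
                  out: str = "model.tar.gz") -> str:
    # Hypothetical sketch of the packaging step: SageMaker's PyTorch
    # container expects a model.tar.gz whose code/ directory holds the
    # weights plus the inference code.
    with tarfile.open(out, "w:gz") as tar:
        tar.add(model_path, arcname=f"code/{Path(model_path).name}")
        for f in sorted(Path(code_dir).iterdir()):
            tar.add(f, arcname=f"code/{f.name}")
    return out
```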
As discussed in #233 (comment)
Agree with @tapadipti that it makes sense to have one endpoint per stage or per version. Otherwise, I think we kind of miss the point of the registry (you can deploy every update to a new endpoint without it). IMO one endpoint per stage makes the most sense to drive home the value of that field, and I think we should focus on this being a self-contained deployment (you can do deployment without needing a separate engineering team to pick up the new model endpoint).
Add Sagemaker deployment.
https://github.com/iterative/example-get-started-experiments/actions/workflows/deploy-model.yml
https://us-east-2.console.aws.amazon.com/sagemaker/home?region=us-east-2#/endpoints/results-train-pool-segmentation-v0-1-0-dev
export AWS_DEFAULT_REGION=us-east-2
python src/endpoint_predict.py \
  --img_path data/test_data/REGION_1-24_0_1024_0_1024.jpg \
  --endpoint_name results-train-pool-segmentation-v0-1-0-dev