PyTorch Extending Our Containers
By packaging an algorithm in a container, you can bring almost any code to the Amazon SageMaker environment,
regardless of programming language, environment, framework, or dependencies.
Even if there is direct SDK support for your environment or framework, you may want to add functionality or
configure your container environment differently while still utilizing our container on SageMaker.
Some of the reasons to extend a SageMaker deep learning framework container are:
1. Install additional dependencies. (E.g. I want to install a specific Python library that the current SageMaker containers
don't install; see the Dockerfile sketch after this list.)
2. Configure your environment. (E.g. I want to add an environment variable to my container.)
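Both of these can be done with single Dockerfile lines. A minimal sketch (the library name and the variable below are
illustrative only, not part of this example):

# Hypothetical extra dependency and environment variable
RUN pip install scikit-image
ENV MY_CUSTOM_VAR=my-value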
Although it is possible to extend any of our framework containers as a parent image, the example this notebook
covers is currently only intended to work with our PyTorch (0.4.0+) and Chainer (4.1.0+) containers.
This walkthrough shows that it is quite straightforward to extend one of our containers to build your own custom
container for PyTorch or Chainer.
For many data scientists, Docker containers are a new technology. But they are not difficult to use, and they can
significantly simplify the deployment of your software packages.
Docker provides a simple way to package arbitrary code into an image that is totally self-contained. Once you have an
image, you can use Docker to run a container based on that image. Running a container is just like running a program on
the machine except that the container creates a fully self-contained environment for the program to run. Containers are
isolated from each other and from the host environment, so the way your program is set up is the way it runs, no matter
where you run it.
Docker is more powerful than environment managers like conda or virtualenv because (a) it is completely language
independent and (b) it comprises your whole operating environment, including startup commands and environment
variables.
A Docker container is like a virtual machine, but it is much lighter weight. For example, a program running in a container
can start in less than a second and many containers can run simultaneously on the same physical or virtual machine
instance.
Docker uses a simple file called a Dockerfile to specify how the image is assembled. An example is provided below.
You can build your Docker images based on Docker images built by yourself or by others, which can simplify things quite
a bit.
Docker has become very popular in programming and devops communities due to its flexibility and its well-defined
specification of how code can be run in its containers. It is the underpinning of many services built in the past few years,
such as Amazon ECS (https://aws.amazon.com/ecs/).
Amazon SageMaker uses Docker to allow users to train and deploy arbitrary algorithms.
In Amazon SageMaker, Docker containers are invoked in one way for training and in another, slightly different, way for
hosting. The following sections outline how to build containers for the SageMaker environment.
If you specify a program as an ENTRYPOINT in the Dockerfile, that program will be run at startup and its first
argument will be train or serve . The program can then look at that argument and decide what to do. The
original ENTRYPOINT specified within the SageMaker PyTorch container is here (https://github.com/aws/sagemaker-pytorch-container/blob/master/docker/0.4.0/final/Dockerfile.cpu#L18).
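As an illustration only (this is not the actual SageMaker entrypoint, which is linked above), a Python program invoked
this way could dispatch on its first argument like so:

import sys

def train():
    print('running training')  # placeholder for real training logic

def serve():
    print('starting model server')  # placeholder for real serving logic

if __name__ == '__main__':
    mode = sys.argv[1] if len(sys.argv) > 1 else ''
    if mode == 'train':
        train()
    elif mode == 'serve':
        serve()
    else:
        raise ValueError('expected "train" or "serve", got: %r' % mode)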
.
├── Dockerfile
├── build_and_push.sh
└── cifar10
    └── cifar10.py
Dockerfile describes how to build your Docker container image. More details are provided below.
build_and_push.sh is a script that uses the Dockerfile to build your container image and then pushes it to
ECR. We invoke the commands directly later in this notebook, but you can just copy and run the script for your own
algorithms.
cifar10 is the directory which contains our user code to be invoked.
In this simple application, we install only one file in the container. You may need only one, but if you have many
supporting routines, you may wish to install more.
cifar10.py is the program that implements our training algorithm and handles loading our model for inference.
The Dockerfile
The Dockerfile describes the image that we want to build. You can think of it as describing the complete operating
system installation of the system that you want to run. A running Docker container is quite a bit lighter weight than a full
operating system, however, because it takes advantage of the Linux kernel on the host machine for basic operations.
We start from the SageMaker PyTorch image as the base. The base image is an ECR image, so it will have the following
pattern.
{account}.dkr.ecr.{region}.amazonaws.com/sagemaker-{framework}:{framework_version}-{processor_type}-{python_version}
1. account - AWS account ID the ECR image belongs to. Our public deep learning framework images are all under the
520713654638 account.
2. region - The region the ECR image belongs to. Available regions (https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/).
3. framework - The deep learning framework.
4. framework_version - The version of the deep learning framework.
5. processor_type - CPU or GPU.
6. python_version - The supported version of Python.
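For example, following this pattern, the CPU Python 3 image for PyTorch 0.4.0 would look like this (us-west-2 is an
assumption here; substitute your own region):

520713654638.dkr.ecr.us-west-2.amazonaws.com/sagemaker-pytorch:0.4.0-cpu-py3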
Information on supported frameworks and versions can be found in this README (https://github.com/aws/sagemaker-python-sdk).
Next, we add the code that implements our specific algorithm to the container and set up the right environment for it to
run under.
DISCLAIMER: As of now, the two environment variables below are only supported by the
SageMaker Chainer (4.1.0+) and PyTorch (0.4.0+) containers.
1. SAGEMAKER_SUBMIT_DIRECTORY - the directory within the container containing our Python script for training and
inference.
2. SAGEMAKER_PROGRAM - the Python script that should be invoked for training and inference.
ENV PATH="/opt/ml/code:${PATH}"

# /opt/ml and all subdirectories are utilized by SageMaker; we use the /code subdirectory to store our user code.
COPY /cifar10 /opt/ml/code
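The two environment variables described above are then set in the same Dockerfile. A minimal sketch (the values match
the /opt/ml/code directory and the cifar10.py script used in this example):

# Tell the SageMaker PyTorch container where our user code lives
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

# Tell the container which script to invoke for training and inference
ENV SAGEMAKER_PROGRAM cifar10.py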
This code looks for an ECR repository in the account you're using and the current default region (if you're using a
SageMaker notebook instance, this is the region where the notebook instance was created). If the repository doesn't
exist, the script will create it. In addition, since we are using the SageMaker PyTorch image as the base, we will need to
retrieve ECR credentials to pull this public image.
In [ ]: %%sh
# The name of our algorithm
algorithm_name=pytorch-extending-our-containers-cifar10-example
cd container

account=$(aws sts get-caller-identity --query Account --output text)
region=$(aws configure get region)
fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR in order to pull down the SageMaker PyTorch image
$(aws ecr get-login --registry-ids 520713654638 --region ${region} --no-include-email)

# Build the docker image locally with the image name and then push it to ECR
# with the full name.
docker build -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}
$(aws ecr get-login --region ${region} --no-include-email)
docker push ${fullname}
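# Load the CIFAR-10 data; get_train_data_loader and get_test_data_loader are
# helper functions from the notebook's supporting code (not shown in this excerpt).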
trainloader = get_train_data_loader('/tmp/pytorch-example/cifar-10-data')
testloader = get_test_data_loader('/tmp/pytorch-example/cifar-10-data')
In [ ]: # There should be the original tar file along with the extracted data directory.
# (cifar-10-python.tar.gz, cifar-10-batches-py)
! ls /tmp/pytorch-example/cifar-10-data
Let's start with setting up our IAM role. We make use of a helper function within the Python SDK. This function throws an
exception if run outside of a SageMaker notebook instance, as it gets metadata from the notebook instance. If running
outside one, you must provide an IAM role with the proper access stated above in Permissions.
In [ ]: from sagemaker import get_execution_role

role = get_execution_role()
After our training has succeeded, our training algorithm outputs our trained model within the /opt/ml/model directory,
which is used to handle predictions.
We can then call deploy() with an instance_count and instance_type of 1 and local , respectively. This invokes our
PyTorch container with 'serve', which sets up our container to handle prediction requests as defined here
(https://github.com/aws/sagemaker-pytorch-container/blob/master/src/sagemaker_pytorch_container/serving.py#L103).
What is returned is a predictor, which is used to make inferences against our trained model.
We recommend testing and debugging your training algorithm locally first, as this provides quicker iterations and better
debuggability.
In [ ]: # Let's set up our SageMaker notebook instance for local mode.
!/bin/bash ./utils/setup.sh
In [ ]: import os
import subprocess

instance_type = 'local'

try:
    if subprocess.call('nvidia-smi') == 0:
        # Set type to GPU if one is present
        instance_type = 'local_gpu'
except FileNotFoundError:
    # nvidia-smi is not installed, so no GPU is available
    pass
In [ ]: from sagemaker.estimator import Estimator

hyperparameters = {'epochs': 1}

estimator = Estimator(role=role,
                      train_instance_count=1,
                      train_instance_type=instance_type,
                      image_name='pytorch-extending-our-containers-cifar10-example:latest',
                      hyperparameters=hyperparameters)

estimator.fit('file:///tmp/pytorch-example/cifar-10-data')
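The deploy call itself is not shown in this excerpt. A minimal sketch based on the description above (the serializer
imports are from the v1 SageMaker Python SDK that this notebook uses):

In [ ]: import numpy as np
import torch
import torchvision
from sagemaker.predictor import json_serializer, json_deserializer

# Deploy locally: instance_count of 1 and instance_type 'local', as described above
predictor = estimator.deploy(initial_instance_count=1, instance_type=instance_type)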
The response will be tensors containing the probabilities of each image belonging to one of the 10 classes. Based on the
highest probability, we will map that index to the corresponding class in our output. The classes can be referenced from
the CIFAR-10 website (https://www.cs.toronto.edu/~kriz/cifar.html). Since we didn't train the model for very long, we
aren't expecting very accurate results.
In [ ]: # Get a batch of test images; imshow and classes come from the notebook's helper code
dataiter = iter(testloader)
images, labels = next(dataiter)

# print images
imshow(torchvision.utils.make_grid(images))
print('GroundTruth: ', ' '.join('%4s' % classes[labels[j]] for j in range(4)))

predictor.accept = 'application/json'
predictor.content_type = 'application/json'
predictor.serializer = json_serializer
predictor.deserializer = json_deserializer

outputs = predictor.predict(images.numpy())

_, predicted = torch.max(torch.from_numpy(np.array(outputs)), 1)
print('Predicted: ', ' '.join('%4s' % classes[predicted[j]] for j in range(4)))
In [ ]: predictor.delete_endpoint()
In [ ]: import boto3
import sagemaker as sage

# S3 prefix
prefix = 'DEMO-pytorch-cifar10'
sess = sage.Session()
Training on SageMaker
Training a model on SageMaker with the Python SDK is similar to how we trained it locally. This
is done by changing our train_instance_type from local to one of our supported EC2 instance types
(https://aws.amazon.com/sagemaker/pricing/instance-types/).
In addition, we must now specify the ECR image URL, which we just pushed above.
Finally, our local training dataset has to be in Amazon S3 and the S3 URL to our dataset is passed into the fit() call.
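The upload itself is not shown in this excerpt. A minimal sketch using the Session helper (key_prefix reuses the prefix
defined above); the returned data_location S3 URL is what we pass to fit() below:

In [ ]: data_location = sess.upload_data('/tmp/pytorch-example/cifar-10-data', key_prefix=prefix)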
Let's first fetch our ECR image url that corresponds to the image we just built and pushed.
In [ ]: client = boto3.client('sts')
account = client.get_caller_identity()['Account']

my_session = boto3.session.Session()
region = my_session.region_name

algorithm_name = 'pytorch-extending-our-containers-cifar10-example'

ecr_image = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region, algorithm_name)

print(ecr_image)
In [ ]: hyperparameters = {'epochs': 1}
instance_type = 'ml.m4.xlarge'

estimator = Estimator(role=role,
                      train_instance_count=1,
                      train_instance_type=instance_type,
                      image_name=ecr_image,
                      hyperparameters=hyperparameters)

estimator.fit(data_location)
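As in local mode, the deploy call is not shown in this excerpt. A minimal sketch (note that this creates a real hosted
endpoint on the ml.m4.xlarge instance type chosen above):

In [ ]: predictor = estimator.deploy(initial_instance_count=1, instance_type=instance_type)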
In [ ]: # Same prediction flow as in local mode, now against the hosted endpoint
dataiter = iter(testloader)
images, labels = next(dataiter)

# print images
imshow(torchvision.utils.make_grid(images))
print('GroundTruth: ', ' '.join('%4s' % classes[labels[j]] for j in range(4)))

predictor.accept = 'application/json'
predictor.content_type = 'application/json'
predictor.serializer = json_serializer
predictor.deserializer = json_deserializer

outputs = predictor.predict(images.numpy())

_, predicted = torch.max(torch.from_numpy(np.array(outputs)), 1)
print('Predicted: ', ' '.join('%4s' % classes[predicted[j]] for j in range(4)))
Optional cleanup
When you're done with the endpoint, you should clean it up.
All of the training jobs, models and endpoints we created can be viewed through the SageMaker console of your AWS
account.
In [ ]: predictor.delete_endpoint()
Reference
How Amazon SageMaker interacts with your Docker container for training (https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html)
How Amazon SageMaker interacts with your Docker container for inference (https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html)
CIFAR-10 Dataset (https://www.cs.toronto.edu/~kriz/cifar.html)
SageMaker Python SDK (https://github.com/aws/sagemaker-python-sdk)
Dockerfile (https://docs.docker.com/engine/reference/builder/)
scikit-bring-your-own (https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb)
SageMaker PyTorch container (https://github.com/aws/sagemaker-pytorch-container)