
Introduction to Machine Learning Serving and Packaging
@Build Beta

Source code is here!


Serving

2
What is Machine Learning “Serving”?

Machine learning serving is the process of hosting machine-learning models (in the cloud or on premises) and making their functions available via APIs so that applications can incorporate AI into their systems.

3
What do we need when serving a machine learning model?
Apply all the core principles of writing good inference code so that it keeps the good characteristics: correct, low latency, high throughput, efficient, maintainable, secure. Plus:

● Reliable: high uptime, self-healing after crashes, …
● Scalable: inference across multiple machines, clusters that scale with the current load
● Secure: authentication, encryption in transit
● Monitorable: ability to monitor the system, log operational metrics, and catch errors

4
5
Deep dive: FastAPI

● High Performance
● ASGI-based
● Short and Easy
● Automatic Documentation

6
Basic FastAPI app
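The original slide shows a code screenshot that did not survive extraction; a minimal sketch of what a basic FastAPI app might look like (the `/predict` endpoint and the stand-in model call are illustrative, not the original code):

```python
# Minimal FastAPI app sketch; endpoint names and the fake model call are illustrative
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Image Classifier")

class PredictRequest(BaseModel):
    image_url: str

def run_model(image_url: str) -> str:
    # stand-in for real inference (e.g. a resnet18 forward pass)
    return "cat"

@app.get("/health")
def health():
    # simple liveness check used by load balancers / orchestrators
    return {"status": "ok"}

@app.post("/predict")
def predict(request: PredictRequest):
    label = run_model(request.image_url)
    return {"label": label}
```

Run it with `uvicorn main:app --reload` and the endpoints become available over HTTP.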

7
Better with Async
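Again the original screenshot is missing; a sketch of the same endpoint written with `async def`, assuming the I/O-bound work (e.g. calling a remote model server) can be awaited:

```python
# Async FastAPI sketch: handlers yield control to the event loop while waiting on I/O
import asyncio
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    image_url: str

async def run_model_async(image_url: str) -> str:
    # stand-in for awaitable I/O-bound work (remote model server, DB, object store, ...)
    await asyncio.sleep(0.1)
    return "cat"

@app.post("/predict")
async def predict(request: PredictRequest):
    # while this request awaits, the event loop can serve other requests
    label = await run_model_async(request.image_url)
    return {"label": label}
```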

8
Why Async?
● Improved concurrency: handles multiple requests simultaneously without blocking the main thread, ideal for I/O-bound tasks
● Better performance under load: maintains high throughput with fewer resources, suitable for high-CCU APIs
● Scalability: scales with minimal overhead, optimizing resource utilization
● Background tasks and scheduled jobs: allows background task processing, executing work without affecting response time (see the sketch after this list)
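As referenced in the last bullet, a minimal sketch of FastAPI's built-in `BackgroundTasks` (the logging task is illustrative):

```python
# Sketch: run work after the response is sent, without making the client wait
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()

def log_prediction(image_url: str, label: str) -> None:
    # illustrative slow task: write metrics, push to a queue, ...
    with open("predictions.log", "a") as f:
        f.write(f"{image_url} -> {label}\n")

@app.post("/predict")
async def predict(image_url: str, background_tasks: BackgroundTasks):
    label = "cat"  # stand-in for real inference
    background_tasks.add_task(log_prediction, image_url, label)
    # the response returns immediately; log_prediction runs afterwards
    return {"label": label}
```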

9
Automatic API documentation
FastAPI uses Swagger UI and ReDoc to provide automatic documentation for APIs implemented with FastAPI
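With the app running, Swagger UI is served at `/docs` and ReDoc at `/redoc`. Declaring request and response models makes the generated schema richer; a small sketch (model and endpoint names are illustrative):

```python
# Sketch: typed request/response models appear in /docs and /redoc automatically
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Serving demo", version="0.1.0")

class PredictRequest(BaseModel):
    image_url: str

class PredictResponse(BaseModel):
    label: str
    score: float

@app.post("/predict", response_model=PredictResponse, summary="Classify an image")
def predict(request: PredictRequest) -> PredictResponse:
    # stand-in inference; the schema above is what Swagger UI displays
    return PredictResponse(label="cat", score=0.97)
```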

10
Some concepts you’ll need to understand to be able to learn all this
- API? REST API? https://www.youtube.com/watch?v=Q-BpqyOT3a8
- Synchronous vs asynchronous web servers?
- JSON? https://www.w3schools.com/js/js_json_intro.asp
- gRPC? https://www.youtube.com/watch?v=hVrwuMnCtok
- REST vs gRPC? https://www.baeldung.com/rest-vs-grpc

11
Packaging

12
What is Machine Learning packaging?
In order for a machine learning inference server to run, we need many things:

- Dependencies: ML frameworks, inference server, Python runtime, maybe even the operating system, …
- Processing code/logic
- Model weights

Packaging is the process of making sure these things are bundled properly so they can be used reliably in various production environments, not just on your messy development machine anymore.
13
What do we need when packaging a machine learning server?

● Reliable: creates the same bundle every time, automatically
● Portable: runs on various machines, operating systems, cloud servers, and on-premise servers with minimal differences
● Secure: doesn’t leak secret keys or private information that doesn’t belong in production into the server bundle
● Fast and Lightweight: no dead weight; build fast, ship fast, update fast.

14
Dependency Management
● Dependency management is a technique for identifying, resolving, and
patching dependencies in your application’s codebase.
● A dependency manager is a software module that helps integrate external
libraries or packages into your larger application stack.

15
16
Why should we use conda?
Conda packages everything

1. Portability across operating systems: instead of installing Python in three different ways on Linux, macOS, and Windows, you can use the same environment.yml on all three.
2. Reproducibility: it’s possible to pin almost the whole stack, from the Python interpreter upwards.
3. Consistent configuration: you don’t need to install system packages and Python packages in two different ways; (almost) everything can go in one file, the environment.yml.
17
Deep dive: Use conda properly
● Write environment.yaml properly for reproducibility:
https://pythonspeed.com/articles/conda-dependency-management/
● Conda can package stuff outside of python:
○ CUDA, CUDNN
○ Even NodeJS, Go, …
● Don’t forget to use mamba for faster installs:
https://pythonspeed.com/articles/faster-conda-install/

Apply this to the resnet18 example (see the sketch below)
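A minimal environment.yml sketch for the resnet18 example, assuming a PyTorch/torchvision stack; the package names and version pins are illustrative, pin whatever your project actually resolves:

```yaml
# Illustrative environment.yml: pin the interpreter and the direct dependencies
name: resnet18-serving
channels:
  - pytorch
  - conda-forge
dependencies:
  - python=3.10
  - pytorch=2.1
  - torchvision=0.16
  - pip
  - pip:
      - fastapi==0.110.*
      - uvicorn[standard]==0.29.*
```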


18
Weight management
● Weight management is the process of version-controlling your model weights to make sure the right weights are fetched for the right source code.
● How can we store and version our model weights?
○ Google Drive + README: manual, prone to human mistakes, …?
○ Git: slow performance, huge disk and memory usage when dealing with large files, …?
● Note that this can also share the same tooling as data management

19
Deep dive: DVC for weight management

20
Deep dive: DVC for weight management
What? Git for Data & Models

Why? Data and models are big files that don’t fit well in git, so we need a way to leverage git’s excellent versioning features for big files like model weights

How?

● Git stores a small .dvc file that acts as a placeholder for the actual big file. The big file itself can then be stored on a separate storage system like Google Drive or S3
● Shares its CLI design with git, which makes it easier to use: add, push, pull, … (see the command sketch after this list)
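A sketch of what this looks like on the command line (the remote name, folder id, and file paths are illustrative):

```bash
# one-time setup: initialise DVC inside an existing git repo
dvc init
dvc remote add -d storage gdrive://<folder-id>   # or an s3:// bucket

# track a large weight file: DVC writes weights/model.pt.dvc, git tracks only that
dvc add weights/model.pt
git add weights/model.pt.dvc .gitignore
git commit -m "Track model weights with DVC"

# upload the real file to the remote, and fetch it back on another machine
dvc push
dvc pull
```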

21
Deep dive: DVC for weight management

22
Docker
Containerization
What? Why? How?

-> This 100-second video is enough to get us started, watch it

23
Deep dive: Docker for ML Serving
● How to select the CPU base image:
https://pythonspeed.com/articles/base-image-python-docker-images/
● GPU base images: https://hub.docker.com/r/nvidia/cuda
● GPU base runtime: https://github.com/NVIDIA/nvidia-docker
● Remember to check the CUDA compatibility requirements for your ML models (see the Dockerfile sketch below)
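A sketch of starting from a CUDA runtime base image; the tag is illustrative, match it to the CUDA version your framework build expects:

```dockerfile
# Illustrative GPU serving image: CUDA runtime base + Python installed on top
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

# the CUDA base images do not ship a Python runtime, so install one
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /code
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY . .
CMD ["python3", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

At run time the container needs the NVIDIA container runtime from the link above, e.g. `docker run --gpus all …`.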

24
Deep dive: Docker for ML Serving
● Docker ignore properly: .git, .Dockerfile, .pyc, .env, data, debugs, …
● Activate the conda env properly (see the sketch after this list):
https://pythonspeed.com/articles/activate-conda-dockerfile/
● Docker multi-stage build with conda pack:
https://pythonspeed.com/articles/conda-docker-image-size/
● Docker selective copy with weight for better caching:
RUN --mount=target=/ctx rsync -r --exclude='weights' /ctx/ /code/
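A sketch of the conda-activation pattern from the article linked above; the environment name `serving` and the file names are illustrative:

```dockerfile
# Illustrative Dockerfile: create the conda env, then make every later step use it
FROM continuumio/miniconda3

WORKDIR /code
COPY environment.yml .
RUN conda env create -f environment.yml

# run subsequent commands inside the "serving" env
SHELL ["conda", "run", "-n", "serving", "/bin/bash", "-c"]
# sanity check that the env is active (assumes pytorch is in environment.yml)
RUN python -c "import torch; print(torch.__version__)"

COPY . .
ENTRYPOINT ["conda", "run", "--no-capture-output", "-n", "serving", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Pair this with a .dockerignore that excludes .git, *.pyc, .env, data, and debug artifacts so they never reach the build context.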

25
Load Testing

26
What is Load testing?
● A test that verifies your service’s performance, reliability, and scalability

● Should be run before deployment or after significant changes to code or infrastructure
● It helps:
○ Identify performance bottlenecks
○ Optimize resource usage
○ Enhance system scalability

27
Definitions you need to understand!
● Concurrent Users (CCU): the number of users actively interacting with the system at the same time

● Ramp up: the gradual increase in the number of users or load on a system over a specified period to reach a target load
● Response Time: the total time taken from when a user sends a request to when they receive a response
● Requests Per Second (RPS): the number of requests handled by the system each second, measuring system capacity

As a rough rule of thumb (Little’s law), CCU ≈ RPS × average response time; for example, a server sustaining 200 RPS with a 0.5 s average response time is serving roughly 100 concurrent users.

28
Brief introduction: Locust
A friendly Python-based tool for load testing
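A minimal locustfile sketch; the endpoint and payload are illustrative:

```python
# locustfile.py: each simulated user repeatedly calls the prediction endpoint
from locust import HttpUser, task, between

class InferenceUser(HttpUser):
    # simulated users wait 1-3 seconds between requests
    wait_time = between(1, 3)

    @task
    def predict(self):
        self.client.post("/predict", json={"image_url": "https://example.com/cat.jpg"})
```

Run `locust -f locustfile.py --host http://localhost:8000`, then use the web UI to set the number of concurrent users and the ramp-up rate.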

29
Case study: Setup proper packaging for Lite-HRNet code

● Init DVC, with Google Drive as the remote

● Set up the environment with conda
● Build CPU and GPU Docker images
● Test the packaged model (see the command sketch below)
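A hedged sketch of the workflow as shell commands; the image names, tags, file paths, and remote id are illustrative, not the actual Lite-HRNet repo layout:

```bash
# 1. weight management: DVC with a Google Drive remote
dvc init
dvc remote add -d gdrive gdrive://<folder-id>
dvc add weights/litehrnet.pth && dvc push

# 2. environment: create it from the pinned environment.yml
conda env create -f environment.yml

# 3. packaging: build CPU and GPU images from their respective Dockerfiles
docker build -f docker/Dockerfile.cpu -t lite-hrnet:cpu .
docker build -f docker/Dockerfile.gpu -t lite-hrnet:gpu .

# 4. smoke-test the packaged model
docker run --rm -p 8000:8000 lite-hrnet:cpu
curl -X POST http://localhost:8000/predict -H "Content-Type: application/json" \
     -d '{"image_url": "https://example.com/person.jpg"}'
```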

30
Stuff that is not discussed in this lesson
● General Python web servers: Flask, FastAPI, Django
● HTTP, TCP, API, REST, gRPC, networking
● Advanced inference servers: KServe, Triton
● Container orchestration like Kubernetes, service mesh, API gateway, ...
● Scaling the inference server across multiple machines, serverless
● Packaging for low-level boards and edge devices like phones
● Demos with a complex UI or high-performance frontend using a real FE framework

31
Excellent resources to learn more about the topic
● Articles: Production-ready Docker packaging for Python developers
● Build a realtime, high throughput Semantic Face Editing web app using
TorchScript, Triton, Streamlit and K8S

32
Thanks for your attention!

33
