Introduction to Machine Learning Serving and Packaging_v2
What is Machine Learning “Serving”?
What do we need when serving a machine learning model?
Apply all the core principles of writing good inference code so that it retains the
qualities of good software: correct, low latency, high throughput, efficient,
maintainable, secure. Plus:
Deep dive: FastAPI
● High Performance
● ASGI-based
● Short and Easy
● Automatic Documentation
Basic FastAPI app
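A minimal sketch of what a basic FastAPI inference app can look like (the endpoint name, request schema, and fake_model function are illustrative assumptions, not from the slides):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    text: str

def fake_model(text: str) -> dict:
    # stand-in for real model inference
    return {"label": "positive", "score": 0.99}

@app.post("/predict")
def predict(request: PredictRequest):
    # FastAPI parses and validates the JSON body into PredictRequest
    return fake_model(request.text)

# run with: uvicorn main:app --host 0.0.0.0 --port 8000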
Better with Async
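A sketch of the same idea with an async endpoint: awaited I/O (here, a call to a downstream service via httpx) does not block the event loop. The downstream URL and payload shape are illustrative assumptions:

import httpx
from fastapi import FastAPI

app = FastAPI()

@app.post("/predict")
async def predict(payload: dict):
    # awaiting the HTTP call lets FastAPI serve other requests meanwhile
    async with httpx.AsyncClient() as client:
        response = await client.post("http://model-backend:9000/infer", json=payload)
    return response.json()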
Why Async?
● Improved concurrency: handles multiple requests concurrently without
blocking the event loop, ideal for I/O-bound tasks
Automatic API documentation
FastAPI uses Swagger UI and ReDoc to provide automatic, interactive
documentation for APIs implemented with FastAPI, served by default at the
/docs and /redoc endpoints.
Some concepts you’ll need to understand to be able to learn all this
- API? REST API? https://www.youtube.com/watch?v=Q-BpqyOT3a8
- Synchronous vs asynchronous web servers?
- JSON? https://www.w3schools.com/js/js_json_intro.asp
- gRPC? https://www.youtube.com/watch?v=hVrwuMnCtok
- REST vs gRPC? https://www.baeldung.com/rest-vs-grpc
Packaging
What is Machine Learning packaging?
In order for a machine learning inference server to run, we need many things.
Packaging is the process of making sure these things are packaged properly so
they can be used reliably in various production environments, not just on your
messy development machine anymore.
What do we need when packaging a machine learning server?
Dependency Management
● Dependency management is a technique for identifying, resolving, and
patching dependencies in your application’s codebase.
● A dependency manager is a software module that helps integrate external
libraries or packages into your larger application stack.
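For example, pinning exact versions in a requirements file keeps dependency resolution reproducible across machines (the package names and versions below are only illustrative):

# requirements.txt
fastapi==0.110.0
uvicorn[standard]==0.29.0
numpy==1.26.4

pip install -r requirements.txt then installs exactly these versions instead of whatever happens to be latest on a given day.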
Why should we use conda?
Conda packages everything: not only Python libraries, but also the Python
interpreter itself and native (non-Python) dependencies such as compiled libraries.
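As an illustration, a minimal environment.yml sketch (the environment name, channel, and versions are assumptions, not from the slides):

# environment.yml
name: serving
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pip
  - pip:
      - fastapi
      - uvicorn

conda env create -f environment.yml then builds the whole environment, interpreter included, from this one file.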
Deep dive: DVC for weight management
What? Git for data & models.
Why? Data and models are files too big to store in git, so we need a way to leverage
git’s excellent versioning features for big files like model weights.
How?
● Git stores a small .dvc file that acts as a placeholder for the actual big file. The big
file can then be stored on a separate storage system like Google Drive or S3
● DVC shares its CLI design with Git, which makes it easier to use: add, push, pull, …
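A typical workflow sketch with the DVC CLI (the remote name, bucket, and weight path are illustrative assumptions):

dvc init                                  # once, inside an existing git repo
dvc add weights/model.pth                 # writes weights/model.pth.dvc
git add weights/model.pth.dvc .gitignore
git commit -m "Track model weights with DVC"
dvc remote add -d storage s3://my-bucket/dvc-store
dvc push                                  # upload the real file to the remote
# on another machine: git clone ... && dvc pull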
Docker
Containerization
What? Why? How?
Deep dive: Docker for ML Serving
● How to select the CPU base image:
https://pythonspeed.com/articles/base-image-python-docker-images/
● GPU base images: https://hub.docker.com/r/nvidia/cuda
● GPU base runtime: https://github.com/NVIDIA/nvidia-docker
● Remember to check the CUDA compatibility requirement for your ML models
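For instance, a GPU serving image might start from an NVIDIA CUDA runtime base (this particular tag is only an example; pick one that matches your framework’s CUDA requirement):

FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04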
Deep dive: Docker for ML Serving
● Use .dockerignore properly: .git, Dockerfile, *.pyc, .env, data, debug artifacts, …
● Activate conda env properly:
https://pythonspeed.com/articles/activate-conda-dockerfile/
● Docker multi-stage build with conda pack:
https://pythonspeed.com/articles/conda-docker-image-size/
● Selectively copy the build context, excluding weights, for better layer caching:
RUN --mount=target=/ctx rsync -r --exclude='weights' /ctx/ /code/
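Putting several of these tips together, a condensed multi-stage Dockerfile sketch in the spirit of the linked pythonspeed articles (the environment name, file names, and uvicorn command are assumptions):

# --- Stage 1: build and pack the conda environment ---
FROM continuumio/miniconda3 AS build
COPY environment.yml .
RUN conda env create -f environment.yml && \
    conda install -c conda-forge conda-pack && \
    conda-pack -n serving -o /tmp/env.tar.gz && \
    mkdir /venv && tar -xzf /tmp/env.tar.gz -C /venv && \
    /venv/bin/conda-unpack

# --- Stage 2: slim runtime image without conda itself ---
FROM debian:bookworm-slim
COPY --from=build /venv /venv
COPY . /code
WORKDIR /code
SHELL ["/bin/bash", "-c"]
CMD source /venv/bin/activate && uvicorn main:app --host 0.0.0.0 --port 8000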
Load Testing
What is load testing?
● A test that verifies your service’s performance, reliability, and scalability
under realistic or peak load
Definitions you need to understand!
● Concurrent User (CCU): The number of users actively interacting with the
system at the same time
● Ramp up: The gradual increase in the number of users or load on a system
over a specified period to reach a target load
● Response Time: The total time taken from when a user sends a request to
when they receive a response
● Requests Per Second (RPS): The number of requests handled by the
system each second, measuring system throughput and capacity.
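These quantities are related by Little’s law, which is handy for sizing a test: concurrent users ≈ RPS × average response time. For example, sustaining 200 RPS at a 0.5 s average response time means roughly 100 requests are in flight at any moment.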
Brief introduction: Locust
A friendly Python-based tool for load testing
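A minimal locustfile sketch (the /predict endpoint, payload, and wait times are illustrative assumptions matching the earlier FastAPI example):

from locust import HttpUser, task, between

class InferenceUser(HttpUser):
    # each simulated user waits 0.5–2 s between requests
    wait_time = between(0.5, 2)

    @task
    def predict(self):
        self.client.post("/predict", json={"text": "hello"})

Run it with locust -f locustfile.py --host http://localhost:8000 and set the number of users and ramp-up (spawn rate) in the web UI.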
Case study: Set up proper packaging for the Lite-HRNet code
Stuff that is not discussed in this lesson
● General Python web frameworks: Flask, FastAPI, Django
● HTTP, TCP, API, REST, gRPC, networking
● Advanced inference servers: KServe, Triton
● Container orchestration like Kubernetes, Service Mesh, API Gateway, ...
● Scaling the inference server across multiple machines, serverless
● Packaging for low-level boards and edge devices like phones
● Demos with a complex UI or high-performance frontend using a real FE framework
Excellent resources to learn more about the topic
● Articles: Production-ready Docker packaging for Python developers
● Build a realtime, high throughput Semantic Face Editing web app using
TorchScript, Triton, Streamlit and K8S
Thanks for your attention!