
Introduction to Machine Learning Serving and Packaging
@Build Beta

Source code is here!


Serving

2
What is Machine Learning “Serving”?

Machine learning serving is the process of hosting machine-learning models (in the cloud or on premises) and making their functions available via APIs so that applications can incorporate AI into their systems.

3
What do we need when serving a machine learning model?
Apply all the core principles of writing good inference code so that it keeps the good characteristics: correct, low latency, high throughput, efficient, maintainable, secure. Plus:

● Reliable: high uptime, self-healing after crashes, …
● Scalable: inference across multiple machines, clusters that scale with the current load
● Secure: authentication, encryption in transit
● Monitorable: ability to monitor the system, log operational metrics, and catch errors

4
5
Deep dive: FastAPI

● High Performance
● ASGI-based
● Short and Easy
● Automatic Documentation

6
Basic FastAPI app
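The original slide shows a code screenshot that did not survive extraction; a minimal sketch of what a basic FastAPI app might look like (the `/predict` endpoint and the stand-in model call are illustrative, not the original code):

```python
# Minimal FastAPI app sketch; endpoint names and the fake model call are illustrative
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Image Classifier")

class PredictRequest(BaseModel):
    image_url: str

def run_model(image_url: str) -> str:
    # stand-in for real inference (e.g. a resnet18 forward pass)
    return "cat"

@app.get("/health")
def health():
    # simple liveness check used by load balancers / orchestrators
    return {"status": "ok"}

@app.post("/predict")
def predict(request: PredictRequest):
    label = run_model(request.image_url)
    return {"label": label}
```

Run it with `uvicorn main:app --reload` and the endpoints become available over HTTP.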

7
Better with Async
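Again the original screenshot is missing; a sketch of the same endpoint written with `async def`, assuming the I/O-bound work (e.g. calling a remote model server) can be awaited:

```python
# Async FastAPI sketch: handlers yield control to the event loop while waiting on I/O
import asyncio
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    image_url: str

async def run_model_async(image_url: str) -> str:
    # stand-in for awaitable I/O-bound work (remote model server, DB, object store, ...)
    await asyncio.sleep(0.1)
    return "cat"

@app.post("/predict")
async def predict(request: PredictRequest):
    # while this request awaits, the event loop can serve other requests
    label = await run_model_async(request.image_url)
    return {"label": label}
```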

8
Why Async?
● Improved concurrency: handles multiple requests simultaneously without blocking the main thread, ideal for I/O-bound tasks
● Better performance under load: maintains high throughput with fewer resources, suitable for high-CCU APIs
● Scalability: scales with minimal overhead, optimizing resource utilization
● Background tasks and scheduled jobs: allows background task processing, executing work without affecting response time (see the sketch after this list)
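As referenced in the last bullet, a minimal sketch of FastAPI's built-in `BackgroundTasks` (the logging task is illustrative):

```python
# Sketch: run work after the response is sent, without making the client wait
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()

def log_prediction(image_url: str, label: str) -> None:
    # illustrative slow task: write metrics, push to a queue, ...
    with open("predictions.log", "a") as f:
        f.write(f"{image_url} -> {label}\n")

@app.post("/predict")
async def predict(image_url: str, background_tasks: BackgroundTasks):
    label = "cat"  # stand-in for real inference
    background_tasks.add_task(log_prediction, image_url, label)
    # the response returns immediately; log_prediction runs afterwards
    return {"label": label}
```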

9
Automatic API documentation
FastAPI uses Swagger UI and ReDoc to provide automatic documentation for APIs implemented with FastAPI
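With the app running, Swagger UI is served at `/docs` and ReDoc at `/redoc`. Declaring request and response models makes the generated schema richer; a small sketch (model and endpoint names are illustrative):

```python
# Sketch: typed request/response models appear in /docs and /redoc automatically
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Serving demo", version="0.1.0")

class PredictRequest(BaseModel):
    image_url: str

class PredictResponse(BaseModel):
    label: str
    score: float

@app.post("/predict", response_model=PredictResponse, summary="Classify an image")
def predict(request: PredictRequest) -> PredictResponse:
    # stand-in inference; the schema above is what Swagger UI displays
    return PredictResponse(label="cat", score=0.97)
```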

10
Some concepts you’ll need to understand to be able to learn all this
- API? REST API? https://www.youtube.com/watch?v=Q-BpqyOT3a8
- Synchronous vs asynchronous web servers?
- JSON? https://www.w3schools.com/js/js_json_intro.asp
- gRPC? https://www.youtube.com/watch?v=hVrwuMnCtok
- REST vs gRPC? https://www.baeldung.com/rest-vs-grpc

11
Packaging

12
What is Machine Learning packaging?
In order for a machine learning inference server to run, we need many things:

- Dependencies: ML frameworks, inference server, Python runtime, maybe even the operating system, …
- Processing code/logic
- Model weights

Packaging is the process of making sure these things are bundled properly so they can be used reliably in various production environments, not just on your messy development machine anymore.
13
What do we need when packaging a machine learning server?

● Reliable: creates the same bundle every time, automatically
● Portable: runs on various machines, operating systems, cloud servers, and on-premise servers with minimal differences
● Secure: doesn’t leak secret keys or private information that doesn’t belong in production into the server bundle
● Fast and Lightweight: no dead weight; build fast, ship fast, update fast.

14
Dependency Management
● Dependency management is a technique for identifying, resolving, and
patching dependencies in your application’s codebase.
● A dependency manager is a software module that helps integrate external
libraries or packages into your larger application stack.

15
16
Why should we use conda?
Conda packages everything

1. Portability across operating systems: instead of installing Python in three different ways on Linux, macOS, and Windows, you can use the same environment.yml on all three.
2. Reproducibility: it’s possible to pin almost the whole stack, from the Python interpreter upwards.
3. Consistent configuration: you don’t need to install system packages and Python packages in two different ways; (almost) everything can go in one file, the environment.yml.
17
Deep dive: Use conda properly
● Write environment.yaml properly for reproducibility:
https://pythonspeed.com/articles/conda-dependency-management/
● Conda can package stuff outside of python:
○ CUDA, CUDNN
○ Even NodeJS, Go, …
● Don’t forget to use mamba for faster installs:
https://pythonspeed.com/articles/faster-conda-install/

Apply this to the resnet18 example (see the sketch below)
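A minimal environment.yml sketch for the resnet18 example, assuming a PyTorch/torchvision stack; the package names and version pins are illustrative, pin whatever your project actually resolves:

```yaml
# Illustrative environment.yml: pin the interpreter and the direct dependencies
name: resnet18-serving
channels:
  - pytorch
  - conda-forge
dependencies:
  - python=3.10
  - pytorch=2.1
  - torchvision=0.16
  - pip
  - pip:
      - fastapi==0.110.*
      - uvicorn[standard]==0.29.*
```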


18
Weight management
● Weight management is the process of version-controlling your model weights to make sure the right weights are fetched for the right source code.
● How can we store and version our model weights?
○ Google Drive + README: manual, prone to human mistakes, …?
○ Git: slow performance, huge disk and memory usage when dealing with large files, …?
● Note that this can also share the same tooling as data management

19
Deep dive: DVC for weight management

20
Deep dive: DVC for weight management
What? Git for Data & Models

Why? Data and models are big files that don’t fit well in git, so we need a way to leverage git’s excellent versioning features for big files like model weights

How?

● Git stores a small .dvc file that acts as a placeholder for the actual big file. The big file itself can then be stored on a separate storage system like Google Drive or S3
● Shares its CLI design with git, which makes it easier to use: add, push, pull, … (see the command sketch after this list)
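A sketch of what this looks like on the command line (the remote name, folder id, and file paths are illustrative):

```bash
# one-time setup: initialise DVC inside an existing git repo
dvc init
dvc remote add -d storage gdrive://<folder-id>   # or an s3:// bucket

# track a large weight file: DVC writes weights/model.pt.dvc, git tracks only that
dvc add weights/model.pt
git add weights/model.pt.dvc .gitignore
git commit -m "Track model weights with DVC"

# upload the real file to the remote, and fetch it back on another machine
dvc push
dvc pull
```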

21
Deep dive: DVC for weight management

22
Docker
Containerization
What? Why? How?

-> This 100-second video is enough to get us started, watch it

23
Deep dive: Docker for ML Serving
● How to select the CPU base image:
https://pythonspeed.com/articles/base-image-python-docker-images/
● GPU base images: https://hub.docker.com/r/nvidia/cuda
● GPU base runtime: https://github.com/NVIDIA/nvidia-docker
● Remember to check the CUDA compatibility requirements for your ML models (see the Dockerfile sketch below)
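A sketch of starting from a CUDA runtime base image; the tag is illustrative, match it to the CUDA version your framework build expects:

```dockerfile
# Illustrative GPU serving image: CUDA runtime base + Python installed on top
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

# the CUDA base images do not ship a Python runtime, so install one
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /code
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY . .
CMD ["python3", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

At run time the container needs the NVIDIA container runtime from the link above, e.g. `docker run --gpus all …`.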

24
Deep dive: Docker for ML Serving
● Docker ignore properly: .git, .Dockerfile, .pyc, .env, data, debugs, …
● Activate the conda env properly (see the sketch after this list):
https://pythonspeed.com/articles/activate-conda-dockerfile/
● Docker multi-stage build with conda pack:
https://pythonspeed.com/articles/conda-docker-image-size/
● Docker selective copy with weight for better caching:
RUN --mount=target=/ctx rsync -r --exclude='weights' /ctx/ /code/
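A sketch of the conda-activation pattern from the article linked above; the environment name `serving` and the file names are illustrative:

```dockerfile
# Illustrative Dockerfile: create the conda env, then make every later step use it
FROM continuumio/miniconda3

WORKDIR /code
COPY environment.yml .
RUN conda env create -f environment.yml

# run subsequent commands inside the "serving" env
SHELL ["conda", "run", "-n", "serving", "/bin/bash", "-c"]
# sanity check that the env is active (assumes pytorch is in environment.yml)
RUN python -c "import torch; print(torch.__version__)"

COPY . .
ENTRYPOINT ["conda", "run", "--no-capture-output", "-n", "serving", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Pair this with a .dockerignore that excludes .git, *.pyc, .env, data, and debug artifacts so they never reach the build context.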

25
Load Testing

26
What is Load testing?
● A test that verifies your service’s performance, reliability, and scalability

● Should be run before deployment or after significant changes to code or infrastructure
● It helps:
○ Identify performance bottlenecks
○ Optimize resource usage
○ Enhance system scalability

27
Definitions you need to understand!
● Concurrent Users (CCU): the number of users actively interacting with the system at the same time

● Ramp up: the gradual increase in the number of users or load on a system over a specified period to reach a target load
● Response Time: the total time taken from when a user sends a request to when they receive a response
● Requests Per Second (RPS): the number of requests handled by the system each second, measuring system capacity

As a rough rule of thumb (Little’s law), CCU ≈ RPS × average response time; for example, a server sustaining 200 RPS with a 0.5 s average response time is serving roughly 100 concurrent users.

28
Brief introduction: Locust
A friendly Python-based tool for load testing
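A minimal locustfile sketch; the endpoint and payload are illustrative:

```python
# locustfile.py: each simulated user repeatedly calls the prediction endpoint
from locust import HttpUser, task, between

class InferenceUser(HttpUser):
    # simulated users wait 1-3 seconds between requests
    wait_time = between(1, 3)

    @task
    def predict(self):
        self.client.post("/predict", json={"image_url": "https://example.com/cat.jpg"})
```

Run `locust -f locustfile.py --host http://localhost:8000`, then use the web UI to set the number of concurrent users and the ramp-up rate.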

29
Case study: Setup proper packaging for Lite-HRNet code

● Init DVC, with Google Drive as the remote

● Set up the environment with conda
● Build CPU and GPU Docker images
● Test the packaged model (see the command sketch below)
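A hedged sketch of the workflow as shell commands; the image names, tags, file paths, and remote id are illustrative, not the actual Lite-HRNet repo layout:

```bash
# 1. weight management: DVC with a Google Drive remote
dvc init
dvc remote add -d gdrive gdrive://<folder-id>
dvc add weights/litehrnet.pth && dvc push

# 2. environment: create it from the pinned environment.yml
conda env create -f environment.yml

# 3. packaging: build CPU and GPU images from their respective Dockerfiles
docker build -f docker/Dockerfile.cpu -t lite-hrnet:cpu .
docker build -f docker/Dockerfile.gpu -t lite-hrnet:gpu .

# 4. smoke-test the packaged model
docker run --rm -p 8000:8000 lite-hrnet:cpu
curl -X POST http://localhost:8000/predict -H "Content-Type: application/json" \
     -d '{"image_url": "https://example.com/person.jpg"}'
```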

30
Stuff that is not discussed in this lesson
● General Python web servers: Flask, FastAPI, Django
● HTTP, TCP, API, REST, gRPC, networking
● Advanced inference servers: KServe, Triton
● Container orchestration like Kubernetes, service mesh, API gateway, ...
● Scaling the inference server across multiple machines, serverless
● Packaging for low-level boards and edge devices like phones
● Demos with a complex UI or high-performance frontend using a real FE framework

31
Excellent resources to learn more about the topic
● Articles: Production-ready Docker packaging for Python developers
● Build a realtime, high throughput Semantic Face Editing web app using
TorchScript, Triton, Streamlit and K8S

32
Thanks for your attention!

33
