
Commit 6710cfe

Authored by sfc-gh-jrasley, sfc-gh-aqiao, and sfc-gh-mwyatt
Llama 3.1 405b release (Snowflake-Labs#1) (Snowflake-Labs#28)
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
Co-authored-by: Michael Wyatt <michael.wyatt@snowflake.com>
1 parent 05fff7b commit 6710cfe

19 files changed: +2611 -18 lines

README.md

Lines changed: 25 additions & 18 deletions
@@ -1,13 +1,30 @@
[![License Apache 2.0](https://badgen.net/badge/license/apache2.0/blue)](https://github.com/Snowflake-Labs/snowflake-arctic/blob/master/LICENSE)
[![Twitter](https://img.shields.io/twitter/follow/snowflakedb)](https://twitter.com/intent/follow?screen_name=snowflakedb)

-# ❄️ Snowflake Arctic ❄️
+# ❄️ Snowflake AI Research ❄️

## Latest News
+* [07/23/2024] [Snowflake Teams Up with Meta to Host and Optimize New Flagship Model Family in Snowflake Cortex AI](https://www.snowflake.com/blog/meta-llama-enterprise-apps-snowflake-cortex-ai/)
+* [Achieve Low-Latency and High-Throughput Inference with Meta's Llama 3.1 405B using Snowflake’s Optimized AI Stack](https://www.snowflake.com/engineering-blog/optimize-LLMs-with-llama-snowflake-ai-stack/)
+* [Fine-Tune Llama 3.1 405B on a Single Node using Snowflake’s Memory-Optimized AI Stack](https://www.snowflake.com/engineering-blog/fine-tune-llama-single-node-snowflake/)
* [04/24/2024] [Snowflake Arctic: The Best LLM for Enterprise AI — Efficiently Intelligent, Truly Open](https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/)

## Overview

+The Snowflake AI Research team is conducting open, foundational research to advance the field of AI while making enterprise AI easy, efficient, and trusted. This repo contains several artifacts to help efficiently train and run inference for popular LLMs in practice. We released [Arctic](https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/) in April of 2024 and are proud to announce the release of our massive LLM inference and fine-tuning stacks specifically tailored to Llama 3.1 405B.
+
+## Llama 3.1 405B
+
+In collaboration with DeepSpeed, Hugging Face, vLLM, and the broader AI community, we are excited to open-source our inference and fine-tuning stacks optimized for Llama 3.1 405B. For inference we support a massive 128K context window from day one, while enabling real-time inference with up to 3x lower end-to-end latency and 1.4x higher throughput than existing open-source solutions. Please see our blog, [Achieve Low-Latency and High-Throughput Inference with Meta's Llama 3.1 405B using Snowflake’s Optimized AI Stack](https://www.snowflake.com/engineering-blog/optimize-LLMs-with-llama-snowflake-ai-stack/), for a deep dive into all of these innovations. For fine-tuning we support both single-node and multi-node training environments using the latest memory-efficient training techniques, such as parameter-efficient fine-tuning, FP8 quantization, ZeRO-3-inspired sharding, and targeted parameter offloading (when necessary). Please see our blog, [Fine-Tune Llama 3.1 405B on a Single Node using Snowflake’s Memory-Optimized AI Stack](https://www.snowflake.com/engineering-blog/fine-tune-llama-single-node-snowflake/), for a deep dive into how we did this.
+
+### Getting started
+
+* [Inference deployment and benchmarks with vLLM](inference/llama3.1)
+* [Fine-Tuning Support for Llama 3.1 405B](training/llama3.1)
+
+## Arctic
+
At Snowflake, we see a consistent pattern in AI needs and use cases from our enterprise customers. Enterprises want to use LLMs to build conversational SQL data copilots, code copilots and RAG chat bots. From a metrics perspective, this translates to LLMs that excel at SQL, code, complex instruction following and the ability to produce grounded answers. We capture these abilities into a single metric we call enterprise intelligence by taking an average of Coding (HumanEval+ and MBPP+), SQL Generation (Spider), and Instruction following (IFEval).

<p align="center">
@@ -28,38 +45,28 @@ The Snowflake AI Research Team is thrilled to introduce Snowflake Arctic, a top-

* Truly Open: Apache 2.0 license provides ungated access to weights and code. In addition, we are also open sourcing all of our data recipes and research insights.

-## Getting Started
+### Getting Started

-### Inference API Providers 🚀
+**Inference API Providers**
Access Arctic via your model garden or catalog of choice including AWS, NVIDIA AI Catalog, Replicate, Lamini, Perplexity, and Together AI over the coming days.

-### Model Weights 🤗
+**Model Weights**
The best way to get up and running with Arctic is through Hugging Face. We have uploaded both the Base and Instruct model variants to the Hugging Face hub:

* [Snowflake/snowflake-arctic-base](https://huggingface.co/Snowflake/snowflake-arctic-base)
* [Snowflake/snowflake-arctic-instruct](https://huggingface.co/Snowflake/snowflake-arctic-instruct)

-### Inference
+**Inference**

We provide two different tutorials on standing up Arctic for inference:

-* [Basic Hugging Face setup](inference/)
-* [vLLM Deployment](inference/vllm/)
+* [Basic Hugging Face setup](inference/arctic)
+* [vLLM Deployment](inference/arctic/vllm/)

-## Cookbooks/Tutorials
+**Cookbooks/Tutorials**

We believe in a thriving research community, and we are committed to sharing our insights as we build the Arctic family of models, to advance research and reduce the cost of LLM training and inference for everyone. Please check out our [on-going cookbook releases](https://www.snowflake.com/en/data-cloud/arctic/cookbook/) where we will dive deeper into several areas crucial for training models like Arctic.

* [Exploring Mixture of Experts (MoE)](https://medium.com/snowflake/snowflake-arctic-cookbook-series-exploring-mixture-of-experts-moe-c7d6b8f14d16)
* [Building an Efficient Training System for Arctic](https://medium.com/snowflake/snowflake-arctic-cookbook-series-building-an-efficient-training-system-for-arctic-6658b9bdfcae)
* [Arctic’s Approach to Data](https://medium.com/snowflake/snowflake-arctic-cookbook-series-arctics-approach-to-data-b81a8a0958bd)
-* More coming soon..
-
-## Coming Soon
-
-Continue to watch this space we plan to frequently add new things here including:
-
-* Fine-tuning tutorials
-* Further improvements to inference performance
-* HFQuantizer support for DeepSpeed's FP Quantization
-* Upstreaming Arctic support for both transformers and vLLM
4 files renamed without changes.

inference/llama3.1/README.md

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
# Getting Started with vLLM + Llama 3.1 405B

This tutorial covers how to use Llama 3.1 405B with vLLM and what performance you should expect when running it. We are actively working with the vLLM community to upstream Llama 3.1 405B support, but until then please use the repos detailed below.

Hardware assumptions: this tutorial uses two 8xH100 instances (e.g., [p5.48xlarge](https://aws.amazon.com/ec2/instance-types/p5/)), but similar hardware should provide similar results.

## Detailed Installation and Benchmarking Instructions

For the steps going forward we highly recommend using `hf_transfer` when downloading any of the Llama 3.1 405B checkpoints from Hugging Face to get the best throughput. On an AWS instance the checkpoint downloads in about 20-30 minutes. In vLLM, `hf_transfer` should be enabled by default if the package is installed (https://github.com/vllm-project/vllm/pull/3817).
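
If you prefer to pre-download the checkpoint rather than letting vLLM fetch it on first launch, a hypothetical pre-fetch looks like this (our illustration, not part of the original steps; it assumes you have accepted Meta's license for the gated repo and authenticated with `huggingface-cli login`):

```bash
# Hypothetical pre-download with hf_transfer enabled for maximum throughput.
pip install "huggingface_hub[hf_transfer]"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download meta-llama/Meta-Llama-3.1-405B
```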

## Step 1: Install Dependencies

We recommend setting up a virtual environment to keep your dependencies isolated and avoid potential conflicts.

On each node:

```bash
# We recommend setting up a virtual environment for this.
virtualenv llama3-venv
source llama3-venv/bin/activate

# Faster checkpoint download speed (quoted so the brackets survive shells like zsh).
pip install "huggingface_hub[hf_transfer]"

# Install vLLM from Snowflake-Labs. This may take several (5-10) minutes.
pip install git+https://github.com/Snowflake-Labs/vllm.git@llama3-staging-rebase

# Install DeepSpeed from Snowflake-Labs.
pip install git+https://github.com/Snowflake-Labs/DeepSpeed.git@add-fp8-gemm
```
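
To confirm both forks installed cleanly, a quick sanity check (our suggestion, not part of the original instructions):

```bash
# Both imports should succeed and print version strings.
python -c "import vllm, deepspeed; print(vllm.__version__, deepspeed.__version__)"
```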

## Step 2: Run Online Benchmarks (PP=2)

The benchmark spans both nodes with pipeline parallelism (PP=2), so first start a Ray cluster across them.

On node 1:

```bash
# Start the Ray head node; the second node will connect to it on port 6379.
ray start --head
```

On node 2:

```bash
# Join the cluster from the second node; replace <node 1 ip> with the head node's address.
ray start --address <node 1 ip>:6379
```
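
Before launching the benchmark, it can help to verify that both nodes registered with the cluster (our suggestion, not part of the original steps):

```bash
# Should report resources from both nodes, e.g. 16 GPUs in total.
ray status
```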

On node 1:

```bash
# The benchmark script depends on this package.
pip install dataclasses_json
```

```bash
# Run the online benchmark: tensor parallelism (8) within each node,
# pipeline parallelism (2) across the two nodes.
python benchmark_trace.py \
    --backend vllm \
    --trace synth-1k.jsonl \
    --arrival-rate 1 \
    --model meta-llama/Meta-Llama-3.1-405B \
    --pipeline-parallel-size 2 \
    --tensor-parallel-size 8 \
    --enable-chunked-prefill \
    --max-num-seqs 64 \
    --max-num-batched-tokens 512 \
    --gpu-memory-utilization 0.95 \
    --use-allgather-pipeline-comm \
    --disable-log-requests
```
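
Beyond the benchmark harness, the same setup can serve the model interactively. A hypothetical sketch using vLLM's OpenAI-compatible server (our illustration; whether this fork exposes pipeline parallelism through the server entrypoint is an assumption, and the flags simply mirror the benchmark above):

```bash
# Hypothetical: serve Llama 3.1 405B with the same parallelism layout.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-405B \
    --pipeline-parallel-size 2 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.95
```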
