
Commit 6710cfe

Authored by sfc-gh-jrasley, sfc-gh-aqiao, and sfc-gh-mwyatt
Llama 3.1 405b release (Snowflake-Labs#1) (Snowflake-Labs#28)
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
Co-authored-by: Michael Wyatt <michael.wyatt@snowflake.com>
1 parent 05fff7b commit 6710cfe

19 files changed: +2611 -18 lines

README.md

Lines changed: 25 additions & 18 deletions
@@ -1,13 +1,30 @@
[![License Apache 2.0](https://badgen.net/badge/license/apache2.0/blue)](https://github.com/Snowflake-Labs/snowflake-arctic/blob/master/LICENSE)
[![Twitter](https://img.shields.io/twitter/follow/snowflakedb)](https://twitter.com/intent/follow?screen_name=snowflakedb)

-# ❄️ Snowflake Arctic ❄️
+# ❄️ Snowflake AI Research ❄️

## Latest News
+* [07/23/2024] [Snowflake Teams Up with Meta to Host and Optimize New Flagship Model Family in Snowflake Cortex AI](https://www.snowflake.com/blog/meta-llama-enterprise-apps-snowflake-cortex-ai/)
+* [Achieve Low-Latency and High-Throughput Inference with Meta's Llama 3.1 405B using Snowflake’s Optimized AI Stack](https://www.snowflake.com/engineering-blog/optimize-LLMs-with-llama-snowflake-ai-stack/)
+* [Fine-Tune Llama 3.1 405B on a Single Node using Snowflake’s Memory-Optimized AI Stack](https://www.snowflake.com/engineering-blog/fine-tune-llama-single-node-snowflake/)
* [04/24/2024] [Snowflake Arctic: The Best LLM for Enterprise AI — Efficiently Intelligent, Truly Open](https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/)

## Overview

+The Snowflake AI Research team is conducting open, foundational research to advance the field of AI while making enterprise AI easy, efficient, and trusted. This repo contains several artifacts to help efficiently train and run inference for popular LLMs in practice. We released [Arctic](https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/) in April of 2024 and are proud to announce the release of our massive LLM inference and fine-tuning stacks specifically tailored to Llama 3.1 405B.
+
+## Llama 3.1 405B
+
+In collaboration with DeepSpeed, Hugging Face, vLLM, and the broader AI community, we are excited to open-source our inference and fine-tuning stacks optimized for Llama 3.1 405B. For inference we support a massive 128K context window from day one, while enabling real-time inference with up to 3x lower end-to-end latency and 1.4x higher throughput than existing open-source solutions. Please see our blog, [Achieve Low-Latency and High-Throughput Inference with Meta's Llama 3.1 405B using Snowflake’s Optimized AI Stack](https://www.snowflake.com/engineering-blog/optimize-LLMs-with-llama-snowflake-ai-stack/), for a deep dive into all of these innovations. For fine-tuning we support both single-node and multi-node training environments using the latest memory-efficient training techniques, such as parameter-efficient fine-tuning, FP8 quantization, ZeRO-3-inspired sharding, and targeted parameter offloading (when necessary). Please see our blog, [Fine-Tune Llama 3.1 405B on a Single Node using Snowflake’s Memory-Optimized AI Stack](https://www.snowflake.com/engineering-blog/fine-tune-llama-single-node-snowflake/), for a deep dive into how we did this.
+
+### Getting started
+
+* [Inference deployment and benchmarks with vLLM](inference/llama3.1)
+* [Fine-Tuning Support for Llama 3.1 405B](training/llama3.1)
+
+## Arctic
+
At Snowflake, we see a consistent pattern in AI needs and use cases from our enterprise customers. Enterprises want to use LLMs to build conversational SQL data copilots, code copilots and RAG chat bots. From a metrics perspective, this translates to LLMs that excel at SQL, code, complex instruction following and the ability to produce grounded answers. We capture these abilities into a single metric we call enterprise intelligence by taking an average of Coding (HumanEval+ and MBPP+), SQL Generation (Spider), and Instruction following (IFEval).

<p align="center">
@@ -28,38 +45,28 @@ The Snowflake AI Research Team is thrilled to introduce Snowflake Arctic, a top-

* Truly Open: Apache 2.0 license provides ungated access to weights and code. In addition, we are also open sourcing all of our data recipes and research insights.

-## Getting Started
+### Getting Started

-### Inference API Providers 🚀
+**Inference API Providers**
Access Arctic via your model garden or catalog of choice including AWS, NVIDIA AI Catalog, Replicate, Lamini, Perplexity, and Together AI over the coming days.

-### Model Weights 🤗
+**Model Weights**
The best way to get up and running with Arctic is through Hugging Face. We have uploaded both the Base and Instruct model variants to the Hugging Face hub:

* [Snowflake/snowflake-arctic-base](https://huggingface.co/Snowflake/snowflake-arctic-base)
* [Snowflake/snowflake-arctic-instruct](https://huggingface.co/Snowflake/snowflake-arctic-instruct)

-### Inference
+**Inference**

We provide two different tutorials on standing up Arctic for inference:

-* [Basic Hugging Face setup](inference/)
-* [vLLM Deployment](inference/vllm/)
+* [Basic Hugging Face setup](inference/arctic)
+* [vLLM Deployment](inference/arctic/vllm/)

-## Cookbooks/Tutorials
+**Cookbooks/Tutorials**

We believe in a thriving research community, and we are committed to sharing our insights as we build the Arctic family of models, to advance research and reduce the cost of LLM training and inference for everyone. Please check out our [on-going cookbook releases](https://www.snowflake.com/en/data-cloud/arctic/cookbook/) where we will dive deeper into several areas crucial for training models like Arctic.

* [Exploring Mixture of Experts (MoE)](https://medium.com/snowflake/snowflake-arctic-cookbook-series-exploring-mixture-of-experts-moe-c7d6b8f14d16)
* [Building an Efficient Training System for Arctic](https://medium.com/snowflake/snowflake-arctic-cookbook-series-building-an-efficient-training-system-for-arctic-6658b9bdfcae)
* [Arctic’s Approach to Data](https://medium.com/snowflake/snowflake-arctic-cookbook-series-arctics-approach-to-data-b81a8a0958bd)
-* More coming soon..
-
-## Coming Soon
-
-Continue to watch this space we plan to frequently add new things here including:
-
-* Fine-tuning tutorials
-* Further improvements to inference performance
-* HFQuantizer support for DeepSpeed's FP Quantization
-* Upstreaming Arctic support for both transformers and vLLM
4 files renamed without changes.

inference/llama3.1/README.md

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
# Getting Started with vLLM + Llama 3.1 405B

This tutorial covers how to use Llama 3.1 405B with vLLM and what performance you should expect when running it. We are actively working with the vLLM community to upstream Llama 3.1 405B support, but until then please use the repos detailed below.

Hardware assumptions: this tutorial uses two 8xH100 instances (e.g., [p5.48xlarge](https://aws.amazon.com/ec2/instance-types/p5/)), but similar hardware should provide similar results.

## Detailed Installation and Benchmarking Instructions

For the steps going forward we highly recommend using `hf_transfer` when downloading any of the Llama 3.1 405B checkpoints from Hugging Face to get the best throughput. On an AWS instance the checkpoint downloads in about 20-30 minutes. In vLLM, `hf_transfer` should be enabled by default if the package is installed (https://github.com/vllm-project/vllm/pull/3817).
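
If you prefer to pre-download the checkpoint rather than letting vLLM fetch it on first launch, a hypothetical pre-fetch looks like this (our illustration, not part of the original steps; it assumes you have accepted Meta's license for the gated repo and authenticated with `huggingface-cli login`):

```bash
# Hypothetical pre-download with hf_transfer enabled for maximum throughput.
pip install "huggingface_hub[hf_transfer]"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download meta-llama/Meta-Llama-3.1-405B
```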

## Step 1: Install Dependencies

We recommend setting up a virtual environment to keep your dependencies isolated and avoid potential conflicts.

On each node:

```bash
# We recommend setting up a virtual environment for this.
virtualenv llama3-venv
source llama3-venv/bin/activate

# Faster checkpoint download speed (quoted so the brackets survive shells like zsh).
pip install "huggingface_hub[hf_transfer]"

# Install vLLM from Snowflake-Labs. This may take several (5-10) minutes.
pip install git+https://github.com/Snowflake-Labs/vllm.git@llama3-staging-rebase

# Install DeepSpeed from Snowflake-Labs.
pip install git+https://github.com/Snowflake-Labs/DeepSpeed.git@add-fp8-gemm
```
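
To confirm both forks installed cleanly, a quick sanity check (our suggestion, not part of the original instructions):

```bash
# Both imports should succeed and print version strings.
python -c "import vllm, deepspeed; print(vllm.__version__, deepspeed.__version__)"
```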

## Step 2: Run Online Benchmarks (PP=2)

The benchmark spans both nodes with pipeline parallelism (PP=2), so first start a Ray cluster across them.

On node 1:

```bash
# Start the Ray head node; the second node will connect to it on port 6379.
ray start --head
```

On node 2:

```bash
# Join the cluster from the second node; replace <node 1 ip> with the head node's address.
ray start --address <node 1 ip>:6379
```
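
Before launching the benchmark, it can help to verify that both nodes registered with the cluster (our suggestion, not part of the original steps):

```bash
# Should report resources from both nodes, e.g. 16 GPUs in total.
ray status
```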

On node 1:

```bash
# The benchmark script depends on this package.
pip install dataclasses_json
```

```bash
# Run the online benchmark: tensor parallelism (8) within each node,
# pipeline parallelism (2) across the two nodes.
python benchmark_trace.py \
    --backend vllm \
    --trace synth-1k.jsonl \
    --arrival-rate 1 \
    --model meta-llama/Meta-Llama-3.1-405B \
    --pipeline-parallel-size 2 \
    --tensor-parallel-size 8 \
    --enable-chunked-prefill \
    --max-num-seqs 64 \
    --max-num-batched-tokens 512 \
    --gpu-memory-utilization 0.95 \
    --use-allgather-pipeline-comm \
    --disable-log-requests
```
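
Beyond the benchmark harness, the same setup can serve the model interactively. A hypothetical sketch using vLLM's OpenAI-compatible server (our illustration; whether this fork exposes pipeline parallelism through the server entrypoint is an assumption, and the flags simply mirror the benchmark above):

```bash
# Hypothetical: serve Llama 3.1 405B with the same parallelism layout.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-405B \
    --pipeline-parallel-size 2 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.95
```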
