
Commit 8089223

go back to h3 (#1490)
1 parent 7fa019f commit 8089223

1 file changed, +12 -24 lines changed

pgml-cms/blog/serverless-llms-are-dead-long-live-serverless-llms.md

Lines changed: 12 additions & 24 deletions
@@ -26,14 +26,12 @@ LLMs are large by definition. Llama 3’s mid-range 70B model requires ~140GB ju

 GPU RAM is in very high demand, which has driven up costs and reduced availability. Most applications do not sustain on the order of 100 concurrent interactive chatbot sessions, or 1000 embedding requests per second to make dedicated GPUs cost-effective. Even if they do generate that workload, they need to deliver significant financial benefits to be cost-effective.

-**Serverless is not the answer**
-
+### Serverless is not the answer
 Serverless applications typically work because the application code required to execute requests is relatively small, and can be launched, cached and replicated relatively quickly. You can not load 140GB of model weights from disk into GPU RAM within the timespan of reasonable serverless request timeout. [Startups have tried, and failed](https://www.banana.dev/blog/sunset).

 We tried this approach originally as well. Any model you used would be cached on your connection. After the first request warmed up the connection things were great, but that first request could time out – perpetually, never succeeding. Infinitely re-loading models for little if any actual usage is not a good use of scarce resources.

-**Hosted service APIs are not the answer**
-
+### Hosted service APIs are not the answer
 If you can’t load models on-demand, and individual users can’t afford to pay for the RAM to leave the models resident long term, the next best thing is to share the cost of the models RAM between many users. APIs like OpenAI and Fireworks.ai achieve cost-effective hosting, because large numbers of users are sharing the weights across their aggregate requests, so they only need to pay for their portion of the compute used, rather than the RAM. If you only use a model for a fraction of the GPU capacity (hundreds of concurrent chats or thousands of embeddings per second), you only need to pay for a fraction of the cost. This is great.

 That problem is that APIs do not live in your datacenter. They are managed by some other company.
@@ -43,8 +41,7 @@ That problem is that APIs do not live in your datacenter. They are managed by so
 - You have no control over how far away their datacenter is, and they operate with generalized transports like HTTP and JSON, rather than more efficient protocols used for low latency high bandwidth applications. _AI applications are relatively high bandwidth_. This makes APIs relatively high latency, often by an order of magnitude or two.
 - Sending data over the open internet introduces additional reliability issues. Events relatively unrelated to you or even your provider will cause additional slowdowns and failures in your application.

-**Dedicated hosting is not the answer (for most)**
-
+### Dedicated hosting is not the answer (for most)
 You may avoid many of the pitfalls of traditional Serverless deployments or APIs, but you’re back to paying full price for GPU RAM, so you’ll need to be operating at scale, with a large team to support this option. There are some additional pitfalls to hosting LLMs that many teams will re-discover, but they can be overcome.

 - LLMs need to be either baked into the container (hundred GB container images break most existing CI/CD pipelines), or they need to be downloaded on startup (downloading hundreds of gigabytes at app boot has its own issues). You will put your k8s configuration and docker knowledge through its paces getting GPU hardware, drivers and compilers aligned.
@@ -67,51 +64,42 @@ Because we’ve curated the best in class models, they will always be instantly

 Your application can instantly burst usage to massive scale without a second thought, other than the aforementioned cost of GPU usage. Financial costs are now the limiting factor, but we have an additional new lever to optimize costs even further.

-**Multi-tenant continuous batching**
-
+### Multi-tenant continuous batching
 It’s not just loading the model weights into GPU RAM the first time that’s expensive. Streaming those weights from GPU RAM to the CUDA cores for each request is actually the bottleneck for most LLM applications. Continuous batching allows us to reuse a single layer of weights for multiple different queries at the same time, further reducing costs, without significantly impacting overall latency. Thanks to vLLM team for [this impressive breakthrough](https://arxiv.org/abs/2309.06180) in performance.

-**Simplified pricing**
-
+### Simplified pricing
 Compared to using a host of services to provide comparable functionality, our pricing is significantly simpler. We charge for:

 Storage: $0.25 per gigabyte per month. Including text, vector, JSON, binary and relational data formats as well as all index types.
 Compute: $7.50 per hour for requests. Including LLM, embeddings, NLP & ML models, analytical, relational and vector ANN queries. Query time is measured per request, to the nanosecond.

 No fixed costs. We’ll even give you $100 free credit to test this functionality with your own data. Check out our [pricing](/pricing) to estimate your own workload and compare to alternative architectures.

-**Custom & fine-tuned models**
-
+### Custom & fine-tuned models
 There is a myriad number of specialized models available for use with PostgresML. We strive for compatibility with anything you can download from Hugging Face. You can also fine tune models using PostgresML, or upload your own variants with a private Hugging Face access key. These models are not shared, so they are billed based on the cost of the required GPU RAM to serve them, for as long as they are loaded for your engine.

 This also gives you the option to avoid being forced into an undesirable update cadence. We take breaking changes seriously, including new model versions that have their own unpredictable behaviors, but also want to simplify long term management and the upgrade path when new model versions are inevitably released.

-**Support is included**
-
+### Support is included
 We’re here to help you optimize your workloads to get the most out of this architecture. In addition to support, we’ve built [an SDK](/docs/api/client-sdk/) that encapsulates core use cases like RAG that make it easy to get started building your own chat experience, with combined, LLM, embedding, ANN and keyword search all in one place. This is just the beginning.

-**It’s easier than ever to get started**
-
+### It’s easier than ever to get started
 You can create and scale your AI engine in minutes. You no longer need to do any initial capacity planning, because you’ll have burst access to multiple GPUs whenever you need. We’ll autoscale both compute and storage as you use it. Just give it a name, and we’ll give you a connection string to get started building your AI application.

 <figure><img src=".gitbook/assets/create_new_engine.png" alt=""><figcaption></figcaption></figure>

-**Instant autoscaling**
-
+### Instant autoscaling
 You’ll experience instant and near limitless scale, automatically. Our serverless plan dynamically adjusts to your application's needs, ensuring it can handle peak loads without the need for over provisioning. Whether you’re handling a sudden spike in traffic or scaling down during off-peak hours, we’ll adapt in real-time.

-**Significant cost savings**
-
+### Significant cost savings
 <figure><img src=".gitbook/assets/price_vs.png" alt=""><figcaption>Try out our <a href="/pricing">cost calculator</a> to learn more about how we help you save</figcaption></figure>

 Our new pricing is designed to minimize costs, you’ll save 42% on vector database costs alone if you’re using Pinecone. Additionally, you’ll only pay for what you use, with no up-front costs.

-**Unmatched performance**
-
+### Unmatched performance
 Our serverless engines are not just about convenience; it's about performance too. When it comes to retrieval-augmented generation (RAG) chatbots, PostgresML is **4x faster than HuggingFace and Pinecone**. For embedding generation, we are **10x faster than OpenAI**. This means you can deliver faster, more responsive applications to your users.

-**Dedicated instances available in every major cloud**
-
+### Dedicated instances available in every major cloud
 In addition to pay as you go serverless usage, PostgresML also offers managed databases inside your Virtual Private Cloud in AWS, Azure and GCP. Enterprise customers operating at scale can have complete control and guaranteed data privacy. You’ll retain ultimate control of network security policies and hardware resources allocated. You can configure a private engine with as much scale and any models you need through our admin console, while using your own negotiated pricing agreements with the hosting cloud vendor.

 ## Get started with the AI infrastructure of the future today
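
The serverless section of the diff claims that 140GB of weights cannot be loaded within a reasonable serverless request timeout. That follows from simple transfer arithmetic; here is a rough sketch with assumed (not measured) bandwidth figures:

```python
# Back-of-the-envelope load times for ~140GB of Llama 3 70B weights.
# Bandwidth figures below are illustrative assumptions, not benchmarks.
WEIGHTS_GB = 140

sources_gb_per_s = {
    "object storage over a 10 Gbit link": 1.25,
    "local NVMe SSD": 5.0,
    "host RAM over PCIe 4.0 x16": 25.0,
}

for source, bandwidth in sources_gb_per_s.items():
    seconds = WEIGHTS_GB / bandwidth
    print(f"{source}: ~{seconds:.0f}s to fill GPU RAM")

# Only the weights-already-in-host-RAM case is quick; the realistic cold-start
# path (network or disk) takes tens of seconds to minutes, which is well past
# a typical serverless request timeout.
```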

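The multi-tenant continuous batching section argues that streaming weights from GPU RAM to the CUDA cores is the per-request bottleneck. A toy, memory-bandwidth-only model (assumed numbers, not a description of vLLM's actual scheduler) shows why sharing one pass over the weights across many queries multiplies throughput:

```python
# Toy model: each decoding step must stream all weights through the GPU's
# memory system once, whether it serves 1 query or 64. Numbers are assumptions.
WEIGHTS_GB = 140          # assumed: Llama 3 70B at 16-bit precision
HBM_GB_PER_S = 2000.0     # assumed: aggregate GPU memory bandwidth

step_seconds = WEIGHTS_GB / HBM_GB_PER_S   # time to stream the weights once

for batch_size in (1, 8, 64):
    tokens_per_second = batch_size / step_seconds
    print(f"batch={batch_size:<3} -> ~{tokens_per_second:,.0f} tokens/sec total")

# batch=1 -> ~14 tokens/sec, batch=64 -> ~914 tokens/sec: the same weight
# traffic serves every query in the batch, which is the saving that
# multi-tenant continuous batching captures across users.
```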
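The simplified pricing section quotes $0.25 per gigabyte-month of storage and $7.50 per hour of per-request compute. A hedged sketch of how those rates turn into a monthly estimate, with an entirely assumed workload:

```python
# Rates are taken from the post; the workload figures are illustrative assumptions.
STORAGE_PER_GB_MONTH = 0.25   # $/GB-month: text, vectors, JSON, indexes, etc.
COMPUTE_PER_HOUR = 7.50       # $/hour of metered request time

stored_gb = 50                  # assumed data + index footprint
requests_per_month = 2_000_000  # assumed request volume
avg_request_seconds = 0.05      # assumed 50ms of metered time per request

compute_hours = requests_per_month * avg_request_seconds / 3600
monthly_cost = stored_gb * STORAGE_PER_GB_MONTH + compute_hours * COMPUTE_PER_HOUR

print(f"compute: {compute_hours:.1f} hours")    # ~27.8 hours
print(f"estimated bill: ${monthly_cost:,.2f}")  # ~$220.83, with no fixed costs
```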

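The getting-started section says a new engine hands you a Postgres connection string. A minimal sketch of using it from Python, assuming the string is exported as DATABASE_URL and the psycopg driver is installed; the embedding model name is illustrative and the pgml.embed() call should be checked against the PostgresML docs:

```python
import os
import psycopg  # assumed: psycopg 3, e.g. `pip install "psycopg[binary]"`

# Connect with the connection string from the console (assumed env var name).
with psycopg.connect(os.environ["DATABASE_URL"]) as conn:
    row = conn.execute(
        "SELECT pgml.embed(%s, %s) AS embedding",
        ("intfloat/e5-small-v2", "Serverless LLMs are dead, long live serverless LLMs"),
    ).fetchone()
    print(row[0])  # the embedding vector, generated next to your data
```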