Commit 59989ab

Chat blog preview fixes (#929)
1 parent 031e738 commit 59989ab

File tree: 1 file changed (+19, -21)

pgml-dashboard/content/blog/pgml-chat-a-command-line-tool-for-deploying-low-latency-knowledge-based-chatbots-part-I.md

Lines changed: 19 additions & 21 deletions
@@ -15,7 +15,7 @@ image_alt: "pgml-chat: A command-line tool for deploying low-latency knowledge-b
 </div>
 
 
-# Introduction
+## Introduction
 Chatbots powered by large language models like GPT-4 seem amazingly smart at first. They can have conversations on almost any topic. But chatbots have a huge blind spot - no long-term memory. Ask them about current events from last week or topics related to your specific business, and they just draw a blank.
 
 To be truly useful for real applications, chatbots need fast access to knowledge - almost like human memory. Without quick recall, conversations become frustratingly slow and limited. It's like chatting with someone suffering from short-term memory loss.
@@ -34,10 +34,10 @@ We need a better foundational solution tailored specifically for chatbots - one
 
 In this blog series, we will explore PostgresML to do just that. In the first part, we will talk about deploying a chatbot using the `pgml-chat` command line tool built on top of PostgresML. We will compare PostgresML query performance with a combination of Hugging Face and Pinecone. In the second part, we will show how `pgml-chat` works under the hood and focus on achieving low latencies.
 
-# Steps to build a chatbot on your own data
+## Steps to build a chatbot on your own data
 Similar to building and deploying machine learning models, building a chatbot involves steps that are both offline and online. The offline steps are compute-intensive and need to be done periodically when the data changes or the chatbot's performance has deteriorated. The online steps are fast and need to be done in real time. Below, we describe the steps in detail.
 
-## 1. Building the Knowledge Base
+### 1. Building the Knowledge Base
 
 This offline setup lays the foundation for your chatbot's intelligence. It involves:
 
@@ -48,31 +48,31 @@ This offline setup lays the foundation for your chatbot's intelligence. It invol
 
 This knowledge base setup powers the contextual understanding for your chatbot. It's compute-intensive but only needs to be periodically updated as your domain knowledge evolves.
 
-## 2. Connecting to Conversational AI
+### 2. Connecting to Conversational AI
 
 With its knowledge base in place, the chatbot now links to models that allow natural conversations:
 
 1. Based on users' questions, querying the indexed chunks to rapidly pull the most relevant passages.
 2. Passing those passages to a model like GPT-3 to generate conversational responses.
 3. Orchestrating the query, retrieval and generation flow to enable real-time chat.
 
-## 3. Evaluating and Fine-tuning the chatbot
+### 3. Evaluating and Fine-tuning the chatbot
 
 The chatbot needs to be evaluated and fine-tuned before it can be deployed to the real world. This involves:
 
 1. Experimenting with different prompts and selecting the one that generates the best responses for a suite of questions.
 2. Evaluating the chatbot's performance on a test set of questions by comparing the chatbot's responses to the ground-truth responses.
 3. If the performance is not satisfactory, going back to step 1 and generating embeddings using a different model, because the embeddings are the foundation of the chatbot's ability to retrieve the most relevant passage from the knowledge base.
 
-## 4. Connecting to the Real World
+### 4. Connecting to the Real World
 
 Finally, the chatbot needs to be deployed to the real world. This involves:
 
 1. Identifying the interface that the users will interact with. This can be Slack, Discord, Teams or your own custom chat platform. Once identified, get the API keys for the interface.
 2. Hosting a chatbot service that can serve multiple users.
 3. Integrating the chatbot service with the interface so that it can receive and respond to messages.
 
-# pgml-chat
+## pgml-chat
 `pgml-chat` is a command line tool that allows you to do the following:
 - Build a knowledge base that involves:
   - Ingesting documents into the database
@@ -84,7 +84,7 @@ Finally, the chatbot needs to be deployed to the real world. This involves:
 - Provides a chat interface at the command line to evaluate your setup
 - Runs Slack or Discord chat services so that your users can interact with your chatbot.
 
-## Getting Started
+### Getting Started
 
 Before you begin, make sure you have the following:
 
@@ -136,7 +136,7 @@ DISCORD_BOT_TOKEN=<DISCORD_BOT_TOKEN> # Discord bot token to run Discord chat se
 
 !!!
 
-## Usage
+### Usage
 You can get help on the command line interface by running:
 
 
@@ -161,7 +161,7 @@ optional arguments:
 
 !!!
 
-## 1. Building the Knowledge Base
+### 1. Building the Knowledge Base
 In this step, we ingest documents, chunk them, generate embeddings, and index those embeddings for fast querying.
 
 
@@ -207,7 +207,7 @@ In the current version, we only support markdown files. We will be adding suppor
 
 **LOG_LEVEL** will set the log level for the application. The default is `ERROR`. You can set it to `DEBUG` to see more detailed logs.
 
-## 2. Connecting to Conversational AI
+### 2. Connecting to Conversational AI
 Here we will show how to experiment with prompts for the chat completion model to generate responses. We will use OpenAI `gpt-3.5-turbo` for chat completion. You need an [OpenAI API key](https://platform.openai.com/account/api-keys) to run this step.
 
 You can provide the bot with a name and a style of response using the `SYSTEM_PROMPT` and `BASE_PROMPT` environment variables. The bot will then generate a response based on the user's question, the context from vector search, and the prompt. For the bot we built for PostgresML, we used the following system prompt. You can change the name of the bot, its location, and the topics it will answer questions about. A sketch of how these pieces fit together follows the prompt examples below.
@@ -236,11 +236,10 @@ BASE_PROMPT="Given relevant parts of a document and a question, create a final a
 
 !!!
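To make the generation step concrete, here is a minimal Python sketch of how the retrieved context and the user's question can be formatted into these prompts and sent to `gpt-3.5-turbo`. It assumes the pre-1.0 `openai` Python package (current at the time of this post) and assumes `BASE_PROMPT` contains `{context}` and `{question}` placeholders; this is an illustration, not `pgml-chat`'s actual code.

```python
# Illustrative sketch, not pgml-chat's actual code. Assumes the pre-1.0 openai
# package, with SYSTEM_PROMPT and BASE_PROMPT set as described above and
# BASE_PROMPT containing {context} and {question} placeholders (an assumption).
import os

import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def answer(question: str, context: str) -> str:
    # Fill the retrieved passages and the user's question into the base prompt.
    user_prompt = os.environ["BASE_PROMPT"].format(context=context, question=question)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": os.environ["SYSTEM_PROMPT"]},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response["choices"][0]["message"]["content"]

# Example: `context` would come from the vector search built in step 1.
print(answer("What is PostgresML?", "PostgresML is an ML extension for PostgreSQL..."))
```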
 
-## 3. Evaluating and Fine-tuning chatbot
+### 3. Evaluating and Fine-tuning chatbot
 Here we will show how to evaluate the chatbot's performance using the `cli` chat interface. This step helps you experiment with different prompts without spinning up a chat service. You can set the log level to ERROR to suppress the logs from pgml-chat and the OpenAI chat completion service.
 
 
-
 !!! code_block
 
 ```bash
@@ -273,10 +272,10 @@ If the responses are not acceptable, then increase the LOG_LEVEL to check for th
 
 You can change the embeddings model using the environment variable `MODEL` in the `.env` file. Some models like `hkunlp/instructor-xl` also take an instruction to generate embeddings. You can change the instruction using the environment variable `MODEL_PARAMS`. You can also change the instruction for query embeddings using the environment variable `QUERY_PARAMS`.
 
-## 4. Connecting to the Real World
+### 4. Connecting to the Real World
 Once you are comfortable with the chatbot's performance, it is ready for connecting to the real world. Here we will show how to run the chatbot as a Slack or Discord service. You need to create a Slack or Discord app and get the bot token and app token to run the chat service. Under the hood we use the [`slack-bolt`](https://slack.dev/bolt-python/concepts) and [`discord.py`](https://discordpy.readthedocs.io/en/stable/) libraries to run the chat services.
 
-### Slack
+#### Slack
 
 You need SLACK_BOT_TOKEN and SLACK_APP_TOKEN to run the chatbot on Slack. You can get these tokens by creating a Slack app. Follow the instructions [here](https://slack.dev/bolt-python/tutorial/getting-started) to create a Slack app. Include the following environment variables in your .env file:
 
@@ -305,9 +304,8 @@ Once the Slack app is running, you can interact with the chatbot on Slack as sho
 ![Slack Chatbot](/dashboard/static/images/blog/slack_screenshot.png)
 
 
-### Discord
+#### Discord
 
-**Setup**
 You need DISCORD_BOT_TOKEN to run the chatbot on Discord. You can get this token by creating a Discord app. Follow the instructions [here](https://discordpy.readthedocs.io/en/stable/discord.html) to create a Discord app. Include the following environment variables in your .env file:
 
 ```bash
@@ -327,7 +325,7 @@ Once the Discord app is running, you can interact with the chatbot on Discord as
 
 ![Discord Chatbot](/dashboard/static/images/blog/discord_screenshot.png)
 
-## PostgresML vs. Hugging Face + Pinecone
+### PostgresML vs. Hugging Face + Pinecone
 To evaluate query latency, we performed an experiment with 10,000 Wikipedia documents from the SQuAD dataset. Embeddings were generated using the intfloat/e5-large model.
 
 For PostgresML, we used a GPU-powered serverless database running on NVIDIA A10G GPUs with the client in the us-west-2 region. For HuggingFace, we used their inference API endpoint running on NVIDIA A10G GPUs in the us-east-1 region and a client in the same region. Pinecone was used as the vector search index for the HuggingFace embeddings.
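For reference, the PostgresML side of such a measurement can be a single SQL round trip that embeds the query and searches the index together. The sketch below illustrates that pattern, reusing the hypothetical `chunks` table from the earlier sketch; it is not the actual benchmark harness (which used `intfloat/e5-large`).

```python
# Illustrative latency probe, not the actual benchmark harness. Assumes the
# hypothetical chunks table and DATABASE_URL from the earlier sketch.
import os
import time

import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])
cur = conn.cursor()

start = time.monotonic()
cur.execute(
    """
    WITH query AS (
        -- e5-family models expect a "query: " prefix on search queries
        SELECT pgml.embed('intfloat/e5-small', %s)::vector AS embedding
    )
    SELECT chunks.chunk, chunks.embedding <=> query.embedding AS distance
    FROM chunks, query
    ORDER BY distance
    LIMIT 5
    """,
    ("query: How do I deploy a chatbot?",),
)
results = cur.fetchall()
print(f"top-5 retrieval in {(time.monotonic() - start) * 1000:.1f} ms")
```

Because embedding generation and vector search run inside one query, the client pays a single network round trip instead of one call to a model-hosting API plus another to a separate vector database.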
@@ -336,13 +334,13 @@ By keeping the document dataset, model, and hardware constant, we aimed to evalu
 
 ![pgml_vs_hf_pinecone_query](/dashboard/static/images/blog/pgml_vs_hf_pinecone_query.jpg)
 
-Our experiments found that PostgresML outperformed HuggingFace + Pinecone in query latency by 6x. Mean latency was 59ms for PostgresML and 233ms for HuggingFace + Pinecone. Query latency was averaged across 100 queries to account for any outliers. This ~4x improvement in mean latency can be attributed to PostgresML's tight integration of embedding generation, indexing, and querying within the database running on NVIDIA A10G GPUs.
+Our experiments found that PostgresML outperformed HuggingFace + Pinecone in query latency by ~4x. Mean latency was 59ms for PostgresML and 233ms for HuggingFace + Pinecone. Query latency was averaged across 100 queries to account for any outliers. This ~4x improvement in mean latency can be attributed to PostgresML's tight integration of embedding generation, indexing, and querying within the database running on NVIDIA A10G GPUs.
 
 For applications like chatbots that require low-latency access to knowledge, PostgresML provides superior performance over combining multiple services. The serverless architecture also provides predictable pricing and scales seamlessly with usage.
 
-# Conclusions
+## Conclusions
 In this post, we announced PostgresML Chatbot Builder - an open source tool that makes it easy to build knowledge-based chatbots. We discussed the effort required to integrate various components like ingestion, embedding generation, indexing, etc., and how PostgresML Chatbot Builder automates this end-to-end workflow.
 
-We also presented some initial benchmark results comparing PostgresML and HuggingFace + Pinecone for query latency using the SQuAD dataset. PostgresML provided up to 6x lower latency thanks to its tight integration and optimizations.
+We also presented some initial benchmark results comparing PostgresML and HuggingFace + Pinecone for query latency using the SQuAD dataset. PostgresML provided up to ~4x lower latency thanks to its tight integration and optimizations.
 
 Stay tuned for part 2 of this benchmarking blog post, where we will present more comprehensive results evaluating performance for generating embeddings with different models and batch sizes. We will also share additional query latency benchmarks with more document collections.
