# pgml-chat: A command-line tool for deploying low-latency knowledge-based chatbots - Part I

## Introduction
Chatbots powered by large language models like GPT-4 seem amazingly smart at first. They can have conversations on almost any topic. But chatbots have a huge blind spot - no long-term memory. Ask them about current events from last week or topics related to your specific business, and they just draw a blank.

To be truly useful for real applications, chatbots need fast access to knowledge - almost like human memory. Without quick recall, conversations become frustratingly slow and limited. It's like chatting with someone suffering from short-term memory loss.

In this blog series, we will explore PostgresML to do just that. In the first part, we will talk about deploying a chatbot using the `pgml-chat` command-line tool built on top of PostgresML. We will compare PostgresML query performance with a combination of Hugging Face and Pinecone. In the second part, we will show how `pgml-chat` works under the hood and focus on achieving low latencies.
## Steps to build a chatbot on your own data
Similar to building and deploying machine learning models, building a chatbot involves steps that are both offline and online. The offline steps are compute-intensive and need to be done periodically, when the data changes or the chatbot's performance has deteriorated. The online steps are fast and need to be done in real time. Below, we describe the steps in detail.
### 1. Building the Knowledge Base
This offline setup lays the foundation for your chatbot's intelligence. It involves:

- Ingesting documents into the database
- Chunking documents into passages
- Generating embeddings for each chunk
- Indexing the embeddings for fast similarity search

This knowledge base setup powers the contextual understanding for your chatbot. It's compute-intensive, but only needs to be periodically updated as your domain knowledge evolves.
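To make this concrete, here is a minimal sketch of the offline pipeline against a PostgresML database, using the in-database `pgml.embed()` function and a pgvector column. The `chunks` table, the fixed-size chunking, and the connection string are illustrative assumptions, not the exact schema `pgml-chat` creates.

!!! code_block

```python
import psycopg2

# Assumed connection string for a PostgresML database.
DATABASE_URL = "postgres://user:pass@host:5432/pgml"

def chunk(text: str, size: int = 1000) -> list[str]:
    # Naive fixed-size chunking; real pipelines usually split on
    # sentence or section boundaries instead.
    return [text[i : i + size] for i in range(0, len(text), size)]

conn = psycopg2.connect(DATABASE_URL)
with conn, conn.cursor() as cur:
    # Illustrative schema: one row per chunk. pgvector is available on
    # PostgresML; intfloat/e5-large produces 1024-dimensional vectors.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            id        serial PRIMARY KEY,
            body      text NOT NULL,
            embedding vector(1024)
        )
    """)
    document = open("docs/guide.md").read()
    for passage in chunk(document):
        # pgml.embed() runs the model on the database's GPUs, so only
        # raw text crosses the network.
        cur.execute(
            "INSERT INTO chunks (body, embedding) "
            "VALUES (%s, pgml.embed('intfloat/e5-large', %s)::vector)",
            (passage, passage),
        )
    # Approximate nearest-neighbor index for fast similarity search.
    cur.execute(
        "CREATE INDEX IF NOT EXISTS chunks_embedding_idx "
        "ON chunks USING ivfflat (embedding vector_cosine_ops)"
    )
```

!!!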
### 2. Connecting to Conversational AI
With its knowledge base in place, the chatbot now links to models that enable natural conversations:

1. Based on a user's question, querying the indexed chunks to rapidly pull the most relevant passages (sketched below).
2. Passing those passages to a model like GPT-3 to generate conversational responses.
3. Orchestrating the query, retrieval and generation flow to enable real-time chat.
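As a sketch of step 1, retrieval can be a single SQL round trip: embed the question in-database and let pgvector rank the stored chunks. This reuses the hypothetical `chunks` table from the earlier sketch.

!!! code_block

```python
import os
import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])

def top_passages(question: str, k: int = 3) -> list[str]:
    # <=> is pgvector's cosine distance operator; the question is
    # embedded in-database with the same model used at ingest time.
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT body
            FROM chunks
            ORDER BY embedding <=> pgml.embed('intfloat/e5-large', %s)::vector
            LIMIT %s
            """,
            (question, k),
        )
        return [row[0] for row in cur.fetchall()]
```

!!!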
### 3. Evaluating and Fine-tuning the chatbot
The chatbot needs to be evaluated and fine-tuned before it can be deployed to the real world. This involves:

1. Experimenting with different prompts and selecting the one that generates the best responses for a suite of questions.
2. Evaluating the chatbot's performance on a test set of questions by comparing the chatbot's responses to the ground-truth responses (a simple scoring loop is sketched below).
3. If the performance is not satisfactory, going back to step 1 and generating embeddings using a different model, since the embeddings are the foundation of the chatbot's ability to retrieve the most relevant passages from the knowledge base.
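A minimal evaluation loop might look like the following. The lexical similarity score is a naive stand-in (embedding cosine similarity is a common upgrade), and the test set is obviously hypothetical.

!!! code_block

```python
from difflib import SequenceMatcher

# Hypothetical test set: question -> ground-truth answer.
TEST_SET = {
    "How do I install pgml-chat?": "Install it from PyPI with pip install pgml-chat.",
    "Which chat interfaces are supported?": "The cli, Slack and Discord interfaces.",
}

def score(candidate: str, truth: str) -> float:
    # Naive lexical similarity in [0, 1]; swap in embedding cosine
    # similarity for a more faithful semantic comparison.
    return SequenceMatcher(None, candidate.lower(), truth.lower()).ratio()

def evaluate(chatbot) -> float:
    # chatbot is any callable mapping a question string to an answer string.
    scores = [score(chatbot(q), truth) for q, truth in TEST_SET.items()]
    return sum(scores) / len(scores)
```

!!!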
### 4. Connecting to the Real World
Finally, the chatbot needs to be deployed to the real world. This involves:

1. Identifying the interface that your users will interact with. This can be Slack, Discord, Teams or your own custom chat platform. Once identified, get the API keys for that interface.
2. Hosting a chatbot service that can serve multiple users.
3. Integrating the chatbot service with the interface so that it can receive and respond to messages.
## pgml-chat
`pgml-chat` is a command line tool that allows you to do the following:

- Build a knowledge base that involves:
  - Ingesting documents into the database
  - Chunking documents into passages
  - Generating embeddings for the chunks
  - Indexing the embeddings for fast query
- Provide a chat interface at the command line to evaluate your setup
- Run Slack or Discord chat services so that your users can interact with your chatbot
### Getting Started
Before you begin, make sure you have the following:
### Usage
You can get help on the command line interface by running `pgml-chat --help`.
### 1. Building the Knowledge Base
In this step, we ingest documents, chunk them, generate embeddings, and index those embeddings for fast querying.

In the current version, we only support markdown files.

**LOG_LEVEL** sets the log level for the application. The default is `ERROR`. You can set it to `DEBUG` to see more detailed logs.
### 2. Connecting to Conversational AI
Here we will show how to experiment with prompts for the chat completion model to generate responses. We will use OpenAI `gpt-3.5-turbo` for chat completion. You need an [OpenAI API key](https://platform.openai.com/account/api-keys) to run this step.

You can provide the bot with a name and style of response using the `SYSTEM_PROMPT` and `BASE_PROMPT` environment variables. The bot will then generate a response based on the user's question, the context from vector search, and the prompt. For the bot we built for PostgresML, we used the following system prompt. You can change the name of the bot, its location, and the topics it will answer questions about.
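To sketch how these prompts drive a completion, here is a minimal assembly step. The `{context}` and `{question}` placeholder names and the pre-1.0 `openai` API style are assumptions for illustration, not the exact template `pgml-chat` uses.

!!! code_block

```python
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]
system_prompt = os.environ["SYSTEM_PROMPT"]
base_prompt = os.environ["BASE_PROMPT"]  # assumed to contain {context} and {question}

def complete(question: str, context: str) -> str:
    # Retrieved passages are spliced into the base prompt, while the
    # system prompt fixes the bot's name and style.
    reply = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": base_prompt.format(context=context, question=question)},
        ],
    )
    return reply["choices"][0]["message"]["content"]
```

!!!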
### 3. Evaluating and Fine-tuning the chatbot
Here we will show how to evaluate the chatbot's performance using the `cli` chat interface. This step will help you experiment with different prompts without spinning up a chat service. You can set the log level to ERROR to suppress the logs from pgml-chat and the OpenAI chat completion service.

You can change the embeddings model using the environment variable `MODEL` in the `.env` file. Some models, like `hkunlp/instructor-xl`, also take an instruction to generate embeddings. You can change that instruction using the environment variable `MODEL_PARAMS`, and the instruction for query embeddings using the environment variable `QUERY_PARAMS`.
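For instruction-tuned models, the instruction travels with each embedding call. Below is a rough sketch of how document and query instructions might differ, using `pgml.embed()` with a JSON kwargs argument; the `instruction` key and the instruction strings are assumptions based on the instructor model family.

!!! code_block

```python
import os
import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])

with conn.cursor() as cur:
    # Documents and queries get different instructions, mirroring the
    # MODEL_PARAMS and QUERY_PARAMS environment variables.
    cur.execute(
        "SELECT pgml.embed('hkunlp/instructor-xl', %s, "
        "'{\"instruction\": \"Represent the document for retrieval: \"}')",
        ("PostgresML unifies ML and Postgres.",),
    )
    doc_embedding = cur.fetchone()[0]

    cur.execute(
        "SELECT pgml.embed('hkunlp/instructor-xl', %s, "
        "'{\"instruction\": \"Represent the question for retrieving supporting documents: \"}')",
        ("What is PostgresML?",),
    )
    query_embedding = cur.fetchone()[0]
```

!!!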
### 4. Connecting to the Real World
Once you are comfortable with the chatbot's performance, it is ready to connect to the real world. Here we will show how to run the chatbot as a Slack or Discord service. You need to create a Slack or Discord app and get the bot token and app token to run the chat service. Under the hood we use the [`slack-bolt`](https://slack.dev/bolt-python/concepts) and [`discord.py`](https://discordpy.readthedocs.io/en/stable/) libraries to run the chat services.
#### Slack
You need SLACK_BOT_TOKEN and SLACK_APP_TOKEN to run the chatbot on Slack. You can get these tokens by creating a Slack app; follow the instructions [here](https://slack.dev/bolt-python/tutorial/getting-started) to do so. Include the following environment variables in your .env file:
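```bash
SLACK_BOT_TOKEN=<SLACK_BOT_TOKEN> # Slack bot token to run Slack chat service
SLACK_APP_TOKEN=<SLACK_APP_TOKEN> # Slack app token to run Slack chat service
```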
#### Discord
You need DISCORD_BOT_TOKEN to run the chatbot on Discord. You can get this token by creating a Discord app. Follow the instructions [here](https://discordpy.readthedocs.io/en/stable/discord.html) to create a Discord app. Include the following environment variables in your .env file:
```bash
DISCORD_BOT_TOKEN=<DISCORD_BOT_TOKEN> # Discord bot token to run Discord chat service
```
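For a feel of what the `discord.py` integration involves, here is a minimal standalone bot. `generate_answer` is a hypothetical stand-in for the retrieval-and-completion step, not an actual pgml-chat function.

!!! code_block

```python
import os
import discord

intents = discord.Intents.default()
intents.message_content = True  # required to read message text
client = discord.Client(intents=intents)

def generate_answer(question: str) -> str:
    # Hypothetical stand-in: this is where vector search over the
    # knowledge base and the chat-completion call would happen.
    return f"You asked: {question}"

@client.event
async def on_message(message: discord.Message):
    if message.author == client.user:  # ignore our own messages
        return
    await message.channel.send(generate_answer(message.content))

client.run(os.environ["DISCORD_BOT_TOKEN"])
```

!!!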
## Benchmarks
To evaluate query latency, we performed an experiment with 10,000 Wikipedia documents from the SQuAD dataset. Embeddings were generated using the intfloat/e5-large model.

For PostgresML, we used a GPU-powered serverless database running on NVIDIA A10G GPUs, with the client in the us-west-2 region. For HuggingFace, we used their inference API endpoint running on NVIDIA A10G GPUs in the us-east-1 region, with the client in the same us-east-1 region. Pinecone was used as the vector search index for the HuggingFace embeddings. By keeping the document dataset, model, and hardware constant, we aimed to evaluate the two stacks on an equal footing.

Our experiments found that PostgresML outperformed HuggingFace + Pinecone in query latency by ~4x. Mean latency was 59ms for PostgresML and 233ms for HuggingFace + Pinecone. Query latency was averaged across 100 queries to account for any outliers. This ~4x improvement in mean latency can be attributed to PostgresML's tight integration of embedding generation, indexing, and querying within the database, running on NVIDIA A10G GPUs.
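For reference, mean latency over repeated queries can be measured with a simple timing loop like the sketch below; this illustrates the method, not the exact harness used for the numbers above.

!!! code_block

```python
import os
import statistics
import time
import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])
QUERY = """
    SELECT body
    FROM chunks
    ORDER BY embedding <=> pgml.embed('intfloat/e5-large', %s)::vector
    LIMIT 3
"""

latencies = []
with conn.cursor() as cur:
    for _ in range(100):  # average over 100 queries to smooth out outliers
        start = time.perf_counter()
        cur.execute(QUERY, ("When was the Normandy invasion?",))
        cur.fetchall()
        latencies.append((time.perf_counter() - start) * 1000)

print(f"mean query latency: {statistics.mean(latencies):.1f} ms")
```

!!!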
For applications like chatbots that require low latency access to knowledge, PostgresML provides superior performance over combining multiple services. The serverless architecture also provides predictable pricing and scales seamlessly with usage.
## Conclusions
In this post, we announced PostgresML Chatbot Builder - an open source tool that makes it easy to build knowledge-based chatbots. We discussed the effort required to integrate various components like ingestion, embedding generation and indexing, and how PostgresML Chatbot Builder automates this end-to-end workflow.

We also presented some initial benchmark results comparing PostgresML and HuggingFace + Pinecone for query latency using the SQuAD dataset. PostgresML provided up to ~4x lower latency thanks to its tight integration and optimizations.

Stay tuned for part 2 of this blog post, where we will present more comprehensive results evaluating performance for generating embeddings with different models and batch sizes. We will also share additional query latency benchmarks with more document collections.