Text2sql tool - e2e evals and fine-tuning #967
base: main
Conversation
### Creating a reasoning dataset from the TRAIN dataset
The jump from the previous section to creating a reasoning dataset is a bit unclear to me. Why are we doing this step? What is the value?
Good Q! I edited the README to make it clearer. Thanks @varunfb for all the great feedback.
BTW, I've done fine-tuning with the reasoning dataset and am running the eval script (each inference takes longer). Will update the README (and remove the first item in Next Steps if needed) after the eval is done. Plan to merge the PR next week.
If we have done the FT by adding reasoning to the SQL dataset, we should keep it, but we need to better frame it and provide more context on why and how we did it. I am guessing synthetic data was part of it. @jeffxtang
cc @init27 please review the narrative.
> If we have done the FT by adding reasoning to the SQL dataset, we should keep it, but we need to better frame it and provide more context on why and how we did it. I am guessing synthetic data was part of it. @jeffxtang
The why and how are stated here: "The goal is to see if we can improve the accuracy of the fine-tuned model by adding the reasoning info in the dataset," followed by the "Creating a reasoning dataset from the TRAIN dataset" section. The dataset was generated synthetically. I thought about using the synthetic-data-kit but felt it would be overkill; porting this script to use the data kit would be more than welcome, though! @init27
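For reviewers who want a concrete picture, here is a minimal sketch of how such a reasoning dataset can be generated synthetically. The endpoint, model name, field names, and file paths below are all hypothetical, and the actual script in this PR may work differently:

```python
# Illustrative sketch, not the exact script in this PR: for each
# (question, gold SQL) pair in the TRAIN set, ask a model to write the
# step-by-step reasoning that leads from the question to the SQL, then
# store question + reasoning + SQL as one fine-tuning example.
# Assumes an OpenAI-compatible endpoint and examples with
# "schema", "question", and "sql" fields (field names are hypothetical).
import json
from openai import OpenAI

client = OpenAI()  # base_url/api_key for Llama API or Together would go here

PROMPT = (
    "Given the database schema, the question, and the correct SQL below, "
    "explain step by step how to derive the SQL from the question.\n\n"
    "Schema: {schema}\nQuestion: {question}\nSQL: {sql}"
)

def add_reasoning(example: dict) -> dict:
    """Attach synthetic chain-of-thought reasoning to one TRAIN example."""
    resp = client.chat.completions.create(
        model="llama-3.3-70b",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT.format(**example)}],
    )
    return {**example, "reasoning": resp.choices[0].message.content}

with open("train.json") as f:  # hypothetical path to the TRAIN examples
    dataset = [add_reasoning(ex) for ex in json.load(f)]

with open("train_with_reasoning.json", "w") as f:
    json.dump(dataset, f, indent=2)
```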
Llama 3.1 8B quantized model: 14.02% (original) -> 37.16% (fine-tuned)
## Quick Start on Evaluating Llama on Text2SQL
| Fine-tuning Combination     | Accuracy                      |
|-----------------------------|-------------------------------|
| Non-Quantized, CoT, PEFT    | 43.35%                        |
Suggested change:
```diff
- | Non-Quantized, CoT, PEFT | 43.35% |
+ | Llama 3.1 8B, CoT, LoRA  | 43.35% |
```
@jeffxtang we need to specify the method; PEFT is the broader methodology.
| Non-Quantized, No CoT, PEFT | 39.31%                        |
| Quantized, No CoT, PEFT     | 39.31%                        |
| Non-Quantized, No CoT, FFT  | 36.31% (38.27% for 10 epochs) |
| Quantized, CoT, FFT         | N/A                           |
Let's remove the one without numbers.
| Non-Quantized, CoT, FFT     | 42.44% (43.87% for 10 epochs) |
| Non-Quantized, No CoT, PEFT | 39.31%                        |
| Quantized, No CoT, PEFT     | 39.31%                        |
| Non-Quantized, No CoT, FFT  | 36.31% (38.27% for 10 epochs) |
Are we trying to show the perf improvement as we increased the epochs?
## SFT with the BIRD TRAIN dataset (No Reasoning)
We'll first use the BIRD TRAIN dataset to prepare for supervised fine-tuning with no reasoning info in the dataset.
We need to clarify what "create_sft_dataset.py" does and whether this is a preprocessing step on the BIRD dataset.
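For context while that clarification lands, here is a minimal sketch of the kind of preprocessing one would expect at this step. Field names follow BIRD's train.json ("question", "evidence", "SQL"), but the chat format and file names are assumptions, and the actual create_sft_dataset.py may differ:

```python
# Illustrative sketch: turn BIRD TRAIN (question, SQL) pairs into
# chat-format SFT records. Not necessarily what create_sft_dataset.py emits.
import json

SYSTEM = "You are a text-to-SQL assistant. Return only the SQL query."

def to_sft_record(ex: dict) -> dict:
    """Convert one BIRD TRAIN example into a single SFT chat record."""
    user = f"Question: {ex['question']}\nEvidence: {ex.get('evidence', '')}"
    return {
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user},
            {"role": "assistant", "content": ex["SQL"]},
        ]
    }

with open("train.json") as f:  # hypothetical path to BIRD TRAIN
    records = [to_sft_record(ex) for ex in json.load(f)]

with open("text2sql_sft_dataset.json", "w") as f:
    json.dump(records, f, indent=2)
```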
Overall we need to add more context on the methodology, for both the non-CoT and CoT approaches.
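To make that distinction concrete, a sketch of the two assistant targets follows; the exact wording and markup are assumptions, not what the PR's scripts necessarily emit:

```python
# Illustrative contrast between the two training targets (hypothetical format).

# No CoT: the model is trained to emit the SQL alone.
no_cot_target = "SELECT name FROM singer WHERE age > 30"

# CoT: synthetic reasoning precedes the same SQL, so the fine-tuned
# model learns to reason before answering.
cot_target = (
    "Reasoning: the question asks for singer names filtered by age, so "
    "select `name` from `singer` with a WHERE clause on `age`.\n"
    "SQL: SELECT name FROM singer WHERE age > 30"
)
```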
Below are the results of the Llama models we have evaluated on the BIRD DEV dataset:
| Model | Llama API Accuracy | Together Accuracy | |
Let's keep one of them; it's just the infra for hosting the models, and we would not want to compare the two, as that would not send the right message in this context. We should provide a way for users to hook up different APIs to the harness here.
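One possible shape for that hook: since both Llama API and Together.ai expose OpenAI-compatible chat endpoints, the harness could take a base URL and model name instead of hard-coding a provider. A sketch (function names, env var, and parameters are illustrative):

```python
# Sketch of a provider-agnostic inference hook for the eval harness.
import os
from openai import OpenAI

def make_client(base_url: str) -> OpenAI:
    """One client covers any OpenAI-compatible provider via base_url."""
    return OpenAI(base_url=base_url, api_key=os.environ["API_KEY"])

def generate_sql(client: OpenAI, model: str, prompt: str) -> str:
    """Send one text2sql prompt and return the model's SQL answer."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

# e.g. make_client("https://api.together.xyz/v1") with a Together model id,
# or the Llama API base URL with a Llama model id.
```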
Sorry, hit approve by mistake; please refer to the inline comments.
…ctoring for text2sql_eval; minimum eval packages for eval requirements; merge peft script to make vllm happy
…ing in 2-step eval
What does this PR do?
The text2sql tool contains the scripts for evaluating Llama (original and fine-tuned) models on Text2SQL tasks using the popular BIRD dataset, generating fine-tuning datasets, and fine-tuning Llama 3.1 8B with the datasets.
We have significantly simplified the original eval scripts from the BIRD repo for Llama models hosted via Meta's Llama API or Together.ai, so you can quickly evaluate, in three simple steps, how well different Llama models perform on the Text2SQL task.
We have also provided end-to-end scripts for generating datasets and fine-tuning a quantized Llama 3.1 8B model to gain a 165% accuracy improvement over the original model.
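For clarity on the arithmetic, the 165% figure is the relative gain over the original model's accuracy reported above (14.02% to 37.16%):

```python
# Relative improvement of the fine-tuned quantized Llama 3.1 8B model,
# using the accuracies reported in this PR.
original, fine_tuned = 14.02, 37.16
print(f"{(fine_tuned - original) / original:.0%}")  # 165%
```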
Fixes # (issue)
Feature/Issue validation/testing
Please describe the tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.
Test A
Logs for Test A
Test B
Logs for Test B
Before submitting
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Thanks for contributing 🎉!