Text2sql tool - e2e evals and fine-tuning #967


Open: wants to merge 72 commits into base `main`

Conversation

jeffxtang (Contributor)

What does this PR do?

The text2sql tool contains the scripts for evaluating Llama (original and fine-tuned) models on Text2SQL tasks using the popular BIRD dataset, generating fine-tuning datasets, and fine-tuning Llama 3.1 8B with the datasets.

We have significantly simplified the original eval scripts from the BIRD repo for Llama models hosted via Meta's Llama API or Together.ai, so you can quickly evaluate, in just a few steps, how well different Llama models perform on the Text2SQL task.

We have also provided end-to-end scripts for generating datasets and fine-tuning a quantized Llama 3.1 8B model to gain a 165% accuracy improvement over the original model.
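To make the evaluation concrete: BIRD-style benchmarks typically score a prediction by execution accuracy, i.e. a predicted query counts as correct when it returns the same result set as the gold query. Below is a minimal sketch of that metric; the function names are illustrative and are not the actual scripts shipped in this PR.

```python
# Hedged sketch of BIRD-style execution accuracy: run predicted and gold SQL
# against the same SQLite database and compare their result sets.
# Comparing as sets is a simplification (it ignores duplicate rows and order).
import sqlite3

def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    """Return True when both queries produce the same rows on db_path."""
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as wrong
    finally:
        conn.close()
    return set(pred_rows) == set(gold_rows)

def accuracy(db_path: str, pairs) -> float:
    """pairs: iterable of (predicted_sql, gold_sql) tuples."""
    pairs = list(pairs)
    hits = sum(execution_match(db_path, p, g) for p, g in pairs)
    return hits / len(pairs) if pairs else 0.0
```

The reported improvement (e.g. 14.02% to 37.16% for the quantized 8B model) would be the output of a metric of this shape computed over the BIRD DEV set.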

Fixes # (issue)

Feature/Issue validation/testing

Please describe the tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • Test A
    Logs for Test A

  • Test B
    Logs for Test B

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Thanks for contributing 🎉!

heyjustinai

This comment was marked as outdated.

### Creating a reasoning dataset from the TRAIN dataset
Contributor:

The jump from the previous section to creating a reasoning dataset is a bit unclear to me. Why are we doing this step? What is the value?

Contributor Author:

Good Q! I edited the README to make it clearer. Thanks @varunfb for all the great feedback.

Contributor Author:

BTW, I've done fine-tuning with the reasoning dataset and am running the eval script (each inference takes longer). I'll update the README (and remove the first item in Next Steps if needed) after the eval is done. Plan to merge the PR next week.

Contributor:

If we have done the FT by adding reasoning to the SQL dataset, we should keep it, but we need to better frame it and provide more context on why and how we did it. I am guessing synthetic data was part of it. @jeffxtang

Contributor:

cc @init27 please review the narrative.

Contributor Author:

> If we have done the FT with adding reasoning to the sql dataset, we should keep it but we need to better frame and provide more context why and how we have done it. I am guessing synthetic data was part of it. @jeffxtang

The why and how are stated here: "The goal is to see if we can improve the accuracy of the fine-tuned model by adding the reasoning info in the dataset," followed by "Creating a reasoning dataset from the TRAIN dataset." The dataset was generated synthetically; I thought about using the synthetic-data-kit but felt it was overkill. But porting this script to use the data kit would be more than welcome! @init27
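The synthetic generation described above can be sketched roughly as follows: a teacher model is prompted with the question, schema, and gold SQL, and asked to produce the step-by-step reasoning that is then stored alongside the original record. The `call_llm` hook and field names here are placeholders, not APIs from this repo.

```python
# Hedged sketch: build one CoT training record by asking a teacher model to
# explain (step by step) why the gold SQL answers the question.
def build_reasoning_record(question: str, schema: str, gold_sql: str, call_llm):
    """call_llm: any callable mapping a prompt string to a completion string."""
    prompt = (
        "Database schema:\n" + schema + "\n\n"
        "Question: " + question + "\n"
        "Gold SQL: " + gold_sql + "\n\n"
        "Explain step by step how the gold SQL answers the question."
    )
    reasoning = call_llm(prompt)  # synthetic reasoning from the teacher model
    return {
        "question": question,
        "schema": schema,
        "reasoning": reasoning,
        "sql": gold_sql,
    }
```

During fine-tuning, the `reasoning` field would be placed before the SQL in the assistant turn so the student model learns to reason before answering.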


Llama 3.1 8B quantized model: 14.02% (original) -> 37.16% (fine-tuned)

## Quick Start on Evaluating Llama on Text2SQL




| Fine-tuning Combination | Accuracy |
|-----------------------------|-------------------------------|
| Non-Quantized, CoT, PEFT | 43.35% |
Contributor:

Suggested change:

```diff
-| Non-Quantized, CoT, PEFT | 43.35% |
+| Llama 3.1 8B, CoT, LORA | 43.35% |
```

Contributor:

@jeffxtang we need to specify the method; PEFT is the broader methodology.

| Non-Quantized, No CoT, PEFT | 39.31% |
| Quantized, No CoT, PEFT | 39.31% |
| Non-Quantized, No CoT, FFT | 36.31% (38.27% for 10 epochs) |
| Quantized, CoT, FFT | N/A |
Contributor:

Let's remove the one without numbers.

| Non-Quantized, CoT, FFT | 42.44% (43.87% for 10 epochs) |
| Non-Quantized, No CoT, PEFT | 39.31% |
| Quantized, No CoT, PEFT | 39.31% |
| Non-Quantized, No CoT, FFT | 36.31% (38.27% for 10 epochs) |
Contributor:

Are we trying to show the perf improvement as we increase the epochs?


## SFT with the BIRD TRAIN dataset (No Reasoning)

We'll first use the BIRD TRAIN dataset to prepare for supervised fine-tuning with no reasoning info in the dataset.
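The preprocessing step described above amounts to converting each BIRD TRAIN record (question, schema, gold SQL) into a chat-style training example. The sketch below shows one plausible shape for that conversion; the field names and JSONL format are assumptions for illustration, not the PR's actual script.

```python
# Hedged sketch: turn BIRD TRAIN records into SFT examples (no reasoning),
# one JSON object per line in the common "messages" chat format.
import json

def to_sft_example(record: dict) -> dict:
    """record is assumed to carry 'schema', 'question', and 'sql' keys."""
    user = (
        "Given the schema:\n" + record["schema"]
        + "\n\nWrite a SQL query to answer: " + record["question"]
    )
    return {
        "messages": [
            {"role": "user", "content": user},
            {"role": "assistant", "content": record["sql"]},
        ]
    }

def write_sft_dataset(records, path: str) -> None:
    """Write one JSON-encoded example per line (JSONL)."""
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(to_sft_example(r)) + "\n")
```

A CoT variant of the same conversion would simply prepend the synthetic reasoning to the assistant turn.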
Contributor:

We need to clarify what `create_sft_dataset.py` does and whether it is a preprocessing step on the BIRD dataset.

Contributor:

Overall we need to add more context on the methodology, for both the non-CoT and CoT approaches.


Below are the results of the Llama models we have evaluated on the BIRD DEV dataset:

| Model | Llama API Accuracy | Together Accuracy |
Contributor:

Let's keep one of them; it's just the infra for hosting the models, and we would not want to compare the two since that will not send the right message in this context. We should instead provide a way for users to hook up different APIs to the harness here.
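One way to make the harness provider-agnostic, as suggested above, is a tiny backend interface that any hosted-model API (Llama API, Together.ai, or others) can implement. The class and method names below are illustrative, not part of this PR.

```python
# Hedged sketch: a pluggable inference backend for the eval harness.
# Any provider can be wired in by implementing generate_sql(prompt) -> str.
from typing import Iterable, List, Protocol

class InferenceBackend(Protocol):
    def generate_sql(self, prompt: str) -> str: ...

class EchoBackend:
    """Trivial stand-in backend, useful for testing the harness wiring."""
    def generate_sql(self, prompt: str) -> str:
        return "SELECT 1"

def run_eval(backend: InferenceBackend, prompts: Iterable[str]) -> List[str]:
    """Send every eval prompt through the chosen backend."""
    return [backend.generate_sql(p) for p in prompts]
```

With this shape, swapping providers is a one-line change at the call site rather than a fork of the eval scripts.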

@HamidShojanazeri HamidShojanazeri self-requested a review July 16, 2025 22:48
HamidShojanazeri (Contributor) left a comment:

Sorry, I hit approve by mistake; please refer to the inline comments.
