Text2sql tool - e2e evals and fine-tuning #967


Open: wants to merge 72 commits into base `main`

Conversation

jeffxtang (Contributor)

What does this PR do?

The text2sql tool contains the scripts for evaluating Llama (original and fine-tuned) models on Text2SQL tasks using the popular BIRD dataset, generating fine-tuning datasets, and fine-tuning Llama 3.1 8B with the datasets.

We have significantly simplified the original eval scripts from the BIRD repo for Llama models hosted via Meta's Llama API or Together.ai, so you can quickly evaluate, in just a few steps, how well different Llama models perform on the Text2SQL task.

We have also provided end-to-end scripts for generating datasets and fine-tuning a quantized Llama 3.1 8B model to gain a 165% accuracy improvement over the original model.
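To make the evaluation concrete: BIRD-style benchmarks typically score a prediction by execution accuracy, i.e. a predicted query counts as correct when it returns the same result set as the gold query. Below is a minimal sketch of that metric; the function names are illustrative and are not the actual scripts shipped in this PR.

```python
# Hedged sketch of BIRD-style execution accuracy: run predicted and gold SQL
# against the same SQLite database and compare their result sets.
# Comparing as sets is a simplification (it ignores duplicate rows and order).
import sqlite3

def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    """Return True when both queries produce the same rows on db_path."""
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as wrong
    finally:
        conn.close()
    return set(pred_rows) == set(gold_rows)

def accuracy(db_path: str, pairs) -> float:
    """pairs: iterable of (predicted_sql, gold_sql) tuples."""
    pairs = list(pairs)
    hits = sum(execution_match(db_path, p, g) for p, g in pairs)
    return hits / len(pairs) if pairs else 0.0
```

The reported improvement (e.g. 14.02% to 37.16% for the quantized 8B model) would be the output of a metric of this shape computed over the BIRD DEV set.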

Fixes # (issue)

Feature/Issue validation/testing

Please describe the tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • Test A
    Logs for Test A

  • Test B
    Logs for Test B

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Thanks for contributing 🎉!

heyjustinai

This comment was marked as outdated.

### Creating a reasoning dataset from the TRAIN dataset
Contributor:

The jump from the previous section to creating a reasoning dataset is a bit unclear to me. Why are we doing this step? What is the value?

Contributor Author:

Good Q! I edited the README to make it clearer. Thanks @varunfb for all the great feedback.

Contributor Author:

BTW, I've done fine-tuning with the reasoning dataset and am running the eval script (each inference takes longer). I'll update the README (and remove the first item in Next Steps if needed) after the eval is done. Plan to merge the PR next week.

Contributor:

If we have done the FT by adding reasoning to the SQL dataset, we should keep it, but we need to better frame it and provide more context on why and how we did it. I am guessing synthetic data was part of it. @jeffxtang

Contributor:

cc @init27 please review the narrative.

Contributor Author:

> If we have done the FT with adding reasoning to the sql dataset, we should keep it but we need to better frame and provide more context why and how we have done it. I am guessing synthetic data was part of it. @jeffxtang

The why and how are stated here: "The goal is to see if we can improve the accuracy of the fine-tuned model by adding the reasoning info in the dataset," followed by "Creating a reasoning dataset from the TRAIN dataset." The dataset was generated synthetically; I thought about using the synthetic-data-kit but felt it was overkill. But porting this script to use the data kit would be more than welcome! @init27
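The synthetic generation described above can be sketched roughly as follows: a teacher model is prompted with the question, schema, and gold SQL, and asked to produce the step-by-step reasoning that is then stored alongside the original record. The `call_llm` hook and field names here are placeholders, not APIs from this repo.

```python
# Hedged sketch: build one CoT training record by asking a teacher model to
# explain (step by step) why the gold SQL answers the question.
def build_reasoning_record(question: str, schema: str, gold_sql: str, call_llm):
    """call_llm: any callable mapping a prompt string to a completion string."""
    prompt = (
        "Database schema:\n" + schema + "\n\n"
        "Question: " + question + "\n"
        "Gold SQL: " + gold_sql + "\n\n"
        "Explain step by step how the gold SQL answers the question."
    )
    reasoning = call_llm(prompt)  # synthetic reasoning from the teacher model
    return {
        "question": question,
        "schema": schema,
        "reasoning": reasoning,
        "sql": gold_sql,
    }
```

During fine-tuning, the `reasoning` field would be placed before the SQL in the assistant turn so the student model learns to reason before answering.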


Llama 3.1 8B quantized model: 14.02% (original) -> 37.16% (fine-tuned)

## Quick Start on Evaluating Llama on Text2SQL




| Fine-tuning Combination | Accuracy |
|-----------------------------|-------------------------------|
| Non-Quantized, CoT, PEFT | 43.35% |
Contributor:

Suggested change:

```diff
-| Non-Quantized, CoT, PEFT | 43.35% |
+| Llama 3.1 8B, CoT, LORA | 43.35% |
```

Contributor:

@jeffxtang we need to specify the method; PEFT is the broader methodology.

| Non-Quantized, No CoT, PEFT | 39.31% |
| Quantized, No CoT, PEFT | 39.31% |
| Non-Quantized, No CoT, FFT | 36.31% (38.27% for 10 epochs) |
| Quantized, CoT, FFT | N/A |
Contributor:

Let's remove the one without numbers.

| Non-Quantized, CoT, FFT | 42.44% (43.87% for 10 epochs) |
| Non-Quantized, No CoT, PEFT | 39.31% |
| Quantized, No CoT, PEFT | 39.31% |
| Non-Quantized, No CoT, FFT | 36.31% (38.27% for 10 epochs) |
Contributor:

Are we trying to show the perf improvement as we increase the epochs?


## SFT with the BIRD TRAIN dataset (No Reasoning)

We'll first use the BIRD TRAIN dataset to prepare for supervised fine-tuning with no reasoning info in the dataset.
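The preprocessing step described above amounts to converting each BIRD TRAIN record (question, schema, gold SQL) into a chat-style training example. The sketch below shows one plausible shape for that conversion; the field names and JSONL format are assumptions for illustration, not the PR's actual script.

```python
# Hedged sketch: turn BIRD TRAIN records into SFT examples (no reasoning),
# one JSON object per line in the common "messages" chat format.
import json

def to_sft_example(record: dict) -> dict:
    """record is assumed to carry 'schema', 'question', and 'sql' keys."""
    user = (
        "Given the schema:\n" + record["schema"]
        + "\n\nWrite a SQL query to answer: " + record["question"]
    )
    return {
        "messages": [
            {"role": "user", "content": user},
            {"role": "assistant", "content": record["sql"]},
        ]
    }

def write_sft_dataset(records, path: str) -> None:
    """Write one JSON-encoded example per line (JSONL)."""
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(to_sft_example(r)) + "\n")
```

A CoT variant of the same conversion would simply prepend the synthetic reasoning to the assistant turn.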
Contributor:

We need to clarify what `create_sft_dataset.py` does and whether it is a preprocessing step on the BIRD dataset.

Contributor:

Overall we need to add more context on the methodology, for both the non-CoT and CoT approaches.


Below are the results of the Llama models we have evaluated on the BIRD DEV dataset:

| Model | Llama API Accuracy | Together Accuracy |
Contributor:

Let's keep one of them; it's just the infra for hosting the models, and we would not want to compare the two since that will not send the right message in this context. We should instead provide a way for users to hook up different APIs to the harness here.
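One way to make the harness provider-agnostic, as suggested above, is a tiny backend interface that any hosted-model API (Llama API, Together.ai, or others) can implement. The class and method names below are illustrative, not part of this PR.

```python
# Hedged sketch: a pluggable inference backend for the eval harness.
# Any provider can be wired in by implementing generate_sql(prompt) -> str.
from typing import Iterable, List, Protocol

class InferenceBackend(Protocol):
    def generate_sql(self, prompt: str) -> str: ...

class EchoBackend:
    """Trivial stand-in backend, useful for testing the harness wiring."""
    def generate_sql(self, prompt: str) -> str:
        return "SELECT 1"

def run_eval(backend: InferenceBackend, prompts: Iterable[str]) -> List[str]:
    """Send every eval prompt through the chosen backend."""
    return [backend.generate_sql(p) for p in prompts]
```

With this shape, swapping providers is a one-line change at the call site rather than a fork of the eval scripts.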

@HamidShojanazeri HamidShojanazeri self-requested a review July 16, 2025 22:48
HamidShojanazeri (Contributor) left a comment:

Sorry, I hit approve by mistake; please refer to the inline comments.
