FeatEng

FeatEng is a benchmark for LLMs targeting one of the most knowledge-intensive tasks in data science: writing feature engineering code, which requires domain knowledge in addition to a deep understanding of the underlying problem and data structure. Unlike existing methods, it can cheaply and efficiently assess a broad range of LLM capabilities.

Evaluation Setup

The model is provided with a dataset description in a prompt and asked to generate code that transforms the dataset. The evaluation score is derived from the improvement achieved by an XGBoost model fit on the transformed dataset compared to one fit on the original data.

[Figure: components of the LLM's prompt in FeatEng]
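
To make the scoring concrete, the sketch below shows the general idea under simplifying assumptions: it compares the hold-out accuracy of an XGBoost classifier fit on the transformed features against one fit on the raw features. The function and column names are placeholders, the data is assumed to be a numeric pandas DataFrame with a classification target, and the actual FeatEng metric implemented in the package may differ in its splits, model settings, and aggregation.

import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def improvement(df, target_col, transform):
    # df: numeric pandas DataFrame; transform: the LLM-generated feature
    # engineering function (takes and returns a DataFrame). Both are placeholders.
    X_raw, y = df.drop(columns=[target_col]), df[target_col]
    X_new = transform(X_raw.copy())

    def fit_eval(X):
        # Hold-out accuracy of an XGBoost classifier on the given feature set.
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
        model = xgb.XGBClassifier(n_estimators=200, random_state=0)
        model.fit(X_tr, y_tr)
        return accuracy_score(y_te, model.predict(X_te))

    return fit_eval(X_new) - fit_eval(X_raw)   # > 0 means the transform helped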

Usage

(1) Install the package with pip, e.g.:

pip install --upgrade "feateng @ git+https://github.com/FeatEng/FeatEng"

(2) Run the evaluation command. The most straightforward invocation follows this pattern:

feateng.evaluate --model [model_name]        \
                 --backend [hf|openai|vllm]  \
                 --greedy

That said, we recommend sampling several outputs and averaging their scores. This can be achieved by passing --temperature and --n_samples instead of --greedy.

Examples

For example, to obtain gpt-4o-mini scores comparable to those reported in the paper, run:

OPENAI_API_KEY="sk-..." feateng.evaluate --model "gpt-4o-mini-2024-07-18"  \
                                         --backend "openai"                \
                                         --temperature 1                   \
                                         --n_samples 3

For Hugging Face models, run e.g.:

feateng.evaluate --model "mistralai/Codestral-22B-v0.1"    \
                 --backend "hf"                            \
                 --temperature 1                           \
                 --n_samples 3                             \
                 --attn_implementation "flash_attention_2"

Or, preferably, use vLLM (requires installing feateng with vllm extras), e.g.:

feateng.evaluate --model "meta-llama/Llama-3.1-70B-Instruct"  \
                 --backend "vllm"                             \
                 --temperature 1                              \
                 --tp 4

Hints

(1) Because FeatEng prompts are ~8k tokens long, vLLM with automatic prefix caching offers significantly better performance; see the sketch after these hints.

(2) We recommend downloading the datasets explicitly before running the evals, especially with many parallel executions (--parallel) or in deployable eval jobs:

huggingface-cli download FeatEng/Data --repo-type dataset
huggingface-cli download FeatEng/Benchmark --repo-type dataset
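
Regarding hint (1): if you drive vLLM directly through its Python API rather than through the feateng CLI, automatic prefix caching can be switched on when constructing the engine, as sketched below. Whether feateng's vllm backend already enables or forwards this option is not documented here; the model name and sampling settings are examples only.

from vllm import LLM, SamplingParams

# Illustrative only: enable vLLM's automatic prefix caching so the KV cache
# for repeated prompt prefixes is reused across requests.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example model; adjust to your setup
    tensor_parallel_size=4,                     # match your GPU count
    enable_prefix_caching=True,
)
outputs = llm.generate(["<prompt>"], SamplingParams(temperature=1.0, max_tokens=512))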

Implementation details

We rely heavily on the EvalPlus suite, which we extended to support FeatEng.
