FeatEng

FeatEng is a benchmark for LLMs targeting one of the most knowledge-intensive tasks in data science: writing feature engineering code, which requires domain knowledge in addition to a deep understanding of the underlying problem and data structure. Unlike existing methods, it can cheaply and efficiently assess a broad range of LLM capabilities.

Evaluation Setup

The model is provided with a dataset description in a prompt and asked to generate code that transforms the dataset. The evaluation score is derived from the improvement achieved by an XGBoost model fit on the transformed dataset compared to one fit on the original data.

[Figure: components of the LLM's prompt in FeatEng]
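
To make the scoring concrete, the sketch below shows the general idea under simplifying assumptions: it compares the hold-out accuracy of an XGBoost classifier fit on the transformed features against one fit on the raw features. The function and column names are placeholders, the data is assumed to be a numeric pandas DataFrame with a classification target, and the actual FeatEng metric implemented in the package may differ in its splits, model settings, and aggregation.

import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def improvement(df, target_col, transform):
    # df: numeric pandas DataFrame; transform: the LLM-generated feature
    # engineering function (takes and returns a DataFrame). Both are placeholders.
    X_raw, y = df.drop(columns=[target_col]), df[target_col]
    X_new = transform(X_raw.copy())

    def fit_eval(X):
        # Hold-out accuracy of an XGBoost classifier on the given feature set.
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
        model = xgb.XGBClassifier(n_estimators=200, random_state=0)
        model.fit(X_tr, y_tr)
        return accuracy_score(y_te, model.predict(X_te))

    return fit_eval(X_new) - fit_eval(X_raw)   # > 0 means the transform helped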

Usage

(1) Install the package with pip, e.g.:

pip install --upgrade "feateng @ git+https://github.com/FeatEng/FeatEng"

(2) Run the evaluation command. The most straightforward invocation follows this pattern:

feateng.evaluate --model [model_name]        \
                 --backend [hf|openai|vllm]  \
                 --greedy

That said, we recommend sampling several outputs and averaging their scores. This can be achieved by passing --temperature and --n_samples instead of --greedy.

Examples

For example, to obtain gpt-4o-mini scores comparable to those reported in the paper, run:

OPENAI_API_KEY="sk-..." feateng.evaluate --model "gpt-4o-mini-2024-07-18"  \
                                         --backend "openai"                \
                                         --temperature 1                   \
                                         --n_samples 3

For Hugging Face models, run e.g.:

feateng.evaluate --model "mistralai/Codestral-22B-v0.1"    \
                 --backend "hf"                            \
                 --temperature 1                           \
                 --n_samples 3                             \
                 --attn_implementation "flash_attention_2"

Or, preferably, use vLLM (requires installing feateng with vllm extras), e.g.:

feateng.evaluate --model "meta-llama/Llama-3.1-70B-Instruct"  \
                 --backend "vllm"                             \
                 --temperature 1                              \
                 --tp 4

Hints

(1) Because FeatEng prompts are ~8k tokens long, vLLM with automatic prefix caching offers significantly better performance; see the sketch after these hints.

(2) We recommend downloading the datasets explicitly before running the evals, especially with many parallel executions (--parallel) or in deployable eval jobs:

huggingface-cli download FeatEng/Data --repo-type dataset
huggingface-cli download FeatEng/Benchmark --repo-type dataset
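
Regarding hint (1): if you drive vLLM directly through its Python API rather than through the feateng CLI, automatic prefix caching can be switched on when constructing the engine, as sketched below. Whether feateng's vllm backend already enables or forwards this option is not documented here; the model name and sampling settings are examples only.

from vllm import LLM, SamplingParams

# Illustrative only: enable vLLM's automatic prefix caching so the KV cache
# for repeated prompt prefixes is reused across requests.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example model; adjust to your setup
    tensor_parallel_size=4,                     # match your GPU count
    enable_prefix_caching=True,
)
outputs = llm.generate(["<prompt>"], SamplingParams(temperature=1.0, max_tokens=512))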

Implementation details

We rely heavily on the EvalPlus suite, which we extended to support FeatEng.
