Beta — Development Ongoing
Contact: yuancheng@comp.nus.edu.sg
OSS-Bench automatically constructs large-scale, live evaluation tasks from real-world open-source software. It replaces individual functions in mature projects with LLM-generated code and evaluates them using three key metrics:
- Compilability — Does the code compile?
- Functional Correctness — Does it pass the project’s test suite?
- Memory Safety — Are there sanitizer-reported errors?
Failed compilations, test-suite violations, and sanitizer alerts serve as reliable ground truth for identifying problematic code.
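For intuition, here is a minimal sketch (not OSS-Bench's actual implementation) of how the three signals combine into a per-function verdict:

```python
from dataclasses import dataclass

# Illustrative only: in OSS-Bench these signals come from real builds,
# the project's test suite, and sanitizer runs.
@dataclass
class FunctionResult:
    compiled: bool          # Compilability
    tests_passed: bool      # Functional Correctness
    sanitizer_clean: bool   # Memory Safety

def verdict(result: FunctionResult) -> str:
    if not result.compiled:
        return "compile-fail"
    if not result.tests_passed:
        return "test-fail"
    if not result.sanitizer_clean:
        return "memory-unsafe"
    return "pass"
```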
Requirements:

- Operating System: Ubuntu (tested on 20.04+)
- Docker: required to instantiate the OSS environment and run evaluations

  ```bash
  sudo apt update
  sudo apt install docker.io
  ```

- Python 3.8+ with:

  ```bash
  pip install tqdm
  ```

  (`sqlite3` ships with the Python standard library, so it does not need to be installed via pip.)
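To sanity-check the Python side of the setup, a quick import test:

```python
# Confirms the Python dependencies used by the OSS-Bench scripts import cleanly.
import sqlite3
import tqdm

print("sqlite3 (stdlib), SQLite version:", sqlite3.sqlite_version)
print("tqdm version:", tqdm.__version__)
```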
Pull the prebuilt Docker images for your target OSS:

```bash
# For PHP
docker pull 0599jiangyc/flowfusion4llm:latest

# For SQLite
docker pull 0599jiangyc/sqlite4llm:latest
```
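To confirm both pulls succeeded, you can inspect the local images; this is just a small wrapper around the standard `docker image inspect` command:

```python
import subprocess

# Raises CalledProcessError if an image is missing from the local Docker cache.
for image in ("0599jiangyc/flowfusion4llm:latest",
              "0599jiangyc/sqlite4llm:latest"):
    subprocess.run(["docker", "image", "inspect", image],
                   check=True, capture_output=True)
    print("found:", image)
```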
We have pre-extracted C functions from the php-src repository at this commit.
These are stored in `./data/php-src/function.db` with the following schema:

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER PRIMARY KEY AUTOINCREMENT | 1, 2, 3, ... |
| `function_index` | TEXT, UNIQUE | e.g., `./php-src/main/output.c:77:20` |
| `filepath` | TEXT | e.g., `./php-src/main/output.c` |
| `token_number` | INT | word count (e.g., 10) |
| `original_function` | TEXT | the original function code |
| `optimized_function` | TEXT | initially `-`, to be filled with LLM output |
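To peek at the extracted functions, you can open the database with Python's built-in `sqlite3` module. The table name is not documented above, so the snippet lists the tables first; `functions` in the sample query is an assumption, adjust accordingly:

```python
import sqlite3

conn = sqlite3.connect("./data/php-src/function.db")
cur = conn.cursor()

# Discover the actual table name(s) first.
tables = [row[0] for row in
          cur.execute("SELECT name FROM sqlite_master WHERE type='table'")]
print("tables:", tables)

# Sample a few rows (replace `functions` with the table name printed above).
for row in cur.execute(
        "SELECT id, function_index, token_number FROM functions LIMIT 3"):
    print(row)

conn.close()
```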
- The default prompt is defined in `./prompt.py`.
- Use `./llm.py` to generate LLM outputs via the Ollama platform.
- Alternatively, use your own method (a sketch follows this list):
  1. Create a new folder: `./data/php-src/{model-name}`
  2. Copy the database:

     ```bash
     cp ./data/php-src/function.db ./data/php-src/{model-name}/function.db
     ```

  3. Populate the `optimized_function` field in the copied DB with your LLM outputs.
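A minimal sketch of the last step, assuming the table is named `functions` (verify against the real schema) and using a hypothetical `generate_with_your_llm` placeholder for your own model call:

```python
import sqlite3

def generate_with_your_llm(source: str) -> str:
    """Hypothetical placeholder: replace with your actual LLM inference call."""
    raise NotImplementedError

# {model-name} is spelled my-model here for illustration.
db = sqlite3.connect("./data/php-src/my-model/function.db")
rows = db.execute("SELECT id, original_function FROM functions").fetchall()
for func_id, original in rows:
    optimized = generate_with_your_llm(original)
    db.execute("UPDATE functions SET optimized_function = ? WHERE id = ?",
               (optimized, func_id))
db.commit()
db.close()
```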
Run the compilability check for all generated functions:

```bash
python3 main.py \
    --model gpt-o1-seed0 \
    --oss php-src \
    --linear-execution
```

Replace `gpt-o1-seed0` with your model folder name in `./data/php-src`.

This step may take several hours. (TODO: Add parallel execution support)

Output includes:

- `invalid_functions`
- `linear_compile_fail_logs`
- `fuzzresults` (optional; present if sanitizer alerts were triggered)
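For a rough failure count before full scoring, something like the following may help; it assumes `linear_compile_fail_logs` is a directory holding one log per failed function, which is a guess about the on-disk layout:

```python
from pathlib import Path

# Assumption: one log file per compile failure; adjust if the layout differs.
logs = Path("./data/php-src/gpt-o1-seed0/linear_compile_fail_logs")
if logs.is_dir():
    count = sum(1 for p in logs.iterdir() if p.is_file())
    print(f"compile failures: {count}")
else:
    print("no compile-failure logs found at", logs)
```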
Step 1: In one terminal (or tmux session), run dataset generation:

```bash
python3 main.py \
    --model gpt-o1-seed0 \
    --oss php-src \
    --dataset-generation
```

This creates the following in `./data/php-src/{model-name}/`:

- `dataset.db`
- a `patches/` directory

Step 2: In a second terminal, start the test execution:

```bash
python3 main.py \
    --model gpt-o1-seed0 \
    --oss php-src \
    --test
```

This produces `testlog` in `./data/php-src/{model-name}/`.
Run the scoring script to summarize results:

```bash
python3 score.py --model gpt-o1-seed0
```
Happy benchmarking! 🚀