Beta — Development Ongoing
Contact: yuancheng@comp.nus.edu.sg
OSS-Bench automatically constructs large-scale, live evaluation tasks from real-world open-source software. It replaces individual functions in mature projects with LLM-generated code and evaluates them using three key metrics:
- Compilability — Does the code compile?
- Functional Correctness — Does it pass the project’s test suite?
- Memory Safety — Are there sanitizer-reported errors?
Failed compilations, test-suite violations, and sanitizer alerts serve as reliable ground truth for identifying problematic code.
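For intuition, here is a minimal sketch (not OSS-Bench's actual implementation) of how the three signals combine into a per-function verdict:

```python
from dataclasses import dataclass

# Illustrative only: in OSS-Bench these signals come from real builds,
# the project's test suite, and sanitizer runs.
@dataclass
class FunctionResult:
    compiled: bool          # Compilability
    tests_passed: bool      # Functional Correctness
    sanitizer_clean: bool   # Memory Safety

def verdict(result: FunctionResult) -> str:
    if not result.compiled:
        return "compile-fail"
    if not result.tests_passed:
        return "test-fail"
    if not result.sanitizer_clean:
        return "memory-unsafe"
    return "pass"
```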
Requirements:

- Operating System: Ubuntu (tested on 20.04+)
- Docker: required to instantiate the OSS environment and run evaluations

  ```bash
  sudo apt update
  sudo apt install docker.io
  ```

- Python 3.8+ with:

  ```bash
  pip install tqdm
  ```

  (`sqlite3` ships with the Python standard library, so it does not need to be installed via pip.)
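To sanity-check the Python side of the setup, a quick import test:

```python
# Confirms the Python dependencies used by the OSS-Bench scripts import cleanly.
import sqlite3
import tqdm

print("sqlite3 (stdlib), SQLite version:", sqlite3.sqlite_version)
print("tqdm version:", tqdm.__version__)
```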
Pull the prebuilt Docker images for your target OSS:

```bash
# For PHP
docker pull 0599jiangyc/flowfusion4llm:latest

# For SQLite
docker pull 0599jiangyc/sqlite4llm:latest
```
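To confirm both pulls succeeded, you can inspect the local images; this is just a small wrapper around the standard `docker image inspect` command:

```python
import subprocess

# Raises CalledProcessError if an image is missing from the local Docker cache.
for image in ("0599jiangyc/flowfusion4llm:latest",
              "0599jiangyc/sqlite4llm:latest"):
    subprocess.run(["docker", "image", "inspect", image],
                   check=True, capture_output=True)
    print("found:", image)
```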
We have pre-extracted C functions from the php-src repository at this commit.
These are stored in `./data/php-src/function.db` with the following schema:

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER PRIMARY KEY AUTOINCREMENT | 1, 2, 3, ... |
| `function_index` | TEXT, UNIQUE | e.g., `./php-src/main/output.c:77:20` |
| `filepath` | TEXT | e.g., `./php-src/main/output.c` |
| `token_number` | INT | word count (e.g., 10) |
| `original_function` | TEXT | the original function code |
| `optimized_function` | TEXT | initially `-`, to be filled with LLM output |
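To peek at the extracted functions, you can open the database with Python's built-in `sqlite3` module. The table name is not documented above, so the snippet lists the tables first; `functions` in the sample query is an assumption, adjust accordingly:

```python
import sqlite3

conn = sqlite3.connect("./data/php-src/function.db")
cur = conn.cursor()

# Discover the actual table name(s) first.
tables = [row[0] for row in
          cur.execute("SELECT name FROM sqlite_master WHERE type='table'")]
print("tables:", tables)

# Sample a few rows (replace `functions` with the table name printed above).
for row in cur.execute(
        "SELECT id, function_index, token_number FROM functions LIMIT 3"):
    print(row)

conn.close()
```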
- The default prompt is defined in `./prompt.py`.
- Use `./llm.py` to generate LLM outputs via the Ollama platform.
- Alternatively, use your own method (a sketch follows this list):
  1. Create a new folder: `./data/php-src/{model-name}`
  2. Copy the database:

     ```bash
     cp ./data/php-src/function.db ./data/php-src/{model-name}/function.db
     ```

  3. Populate the `optimized_function` field in the copied DB with your LLM outputs.
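A minimal sketch of the last step, assuming the table is named `functions` (verify against the real schema) and using a hypothetical `generate_with_your_llm` placeholder for your own model call:

```python
import sqlite3

def generate_with_your_llm(source: str) -> str:
    """Hypothetical placeholder: replace with your actual LLM inference call."""
    raise NotImplementedError

# {model-name} is spelled my-model here for illustration.
db = sqlite3.connect("./data/php-src/my-model/function.db")
rows = db.execute("SELECT id, original_function FROM functions").fetchall()
for func_id, original in rows:
    optimized = generate_with_your_llm(original)
    db.execute("UPDATE functions SET optimized_function = ? WHERE id = ?",
               (optimized, func_id))
db.commit()
db.close()
```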
Run the compilability check for all generated functions:

```bash
python3 main.py \
    --model gpt-o1-seed0 \
    --oss php-src \
    --linear-execution
```

Replace `gpt-o1-seed0` with your model folder name in `./data/php-src`.

This step may take several hours. (TODO: Add parallel execution support)

Output includes:

- `invalid_functions`
- `linear_compile_fail_logs`
- `fuzzresults` (optional; present if sanitizer alerts were triggered)
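For a rough failure count before full scoring, something like the following may help; it assumes `linear_compile_fail_logs` is a directory holding one log per failed function, which is a guess about the on-disk layout:

```python
from pathlib import Path

# Assumption: one log file per compile failure; adjust if the layout differs.
logs = Path("./data/php-src/gpt-o1-seed0/linear_compile_fail_logs")
if logs.is_dir():
    count = sum(1 for p in logs.iterdir() if p.is_file())
    print(f"compile failures: {count}")
else:
    print("no compile-failure logs found at", logs)
```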
Step 1: In one terminal (or tmux session), run dataset generation:

```bash
python3 main.py \
    --model gpt-o1-seed0 \
    --oss php-src \
    --dataset-generation
```

This creates the following in `./data/php-src/{model-name}/`:

- `dataset.db`
- a `patches/` directory

Step 2: In a second terminal, start the test execution:

```bash
python3 main.py \
    --model gpt-o1-seed0 \
    --oss php-src \
    --test
```

This produces `testlog` in `./data/php-src/{model-name}/`.
Run the scoring script to summarize results:

```bash
python3 score.py --model gpt-o1-seed0
```
Happy benchmarking! 🚀