Description
Hi, thank you very much for your excellent work and contributions to this domain!
According to the CODERAG-BENCH paper, a reproducible execution environment is provided for RepoEval. The file code-rag-bench/tree/main/generation/eval/tasks/custom_metrics/repoeval_task_id2tests.json specifies the mapping between each task and its corresponding test files for evaluation.
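For context, that mapping file has the general shape sketched below; the entry is reconstructed from the details in this issue, and the real file may list additional tests per task:

{
  "deepmind_tracr/6": ["tracr/rasp/rasp_test.py"]
}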
However, after setting up the environment following the instructions in the README.md, I noticed that for some tasks under deepmind_tracr (e.g., deepmind_tracr/6), the listed test files include tracr/rasp/rasp_test.py. To verify the evaluation behavior, I replaced the target function with a failing statement (assert 1 == 0), as is done at line 250 of generation/eval/tasks/custom_metrics/repoeval_execution.py, and ran:
pytest tracr/rasp/rasp_test.py -v
Surprisingly, all test cases passed. Here’s a small excerpt from the output:
tracr/rasp/rasp_test.py::AggregateTest::test_aggregate_on_size_2_inputs3 PASSED [ 99%]
tracr/rasp/rasp_test.py::AggregateTest::test_aggregate_on_size_2_inputs4 PASSED [ 99%]
tracr/rasp/rasp_test.py::AggregateTest::test_aggregate_on_size_2_inputs5 PASSED [ 99%]
tracr/rasp/rasp_test.py::RaspProgramTest::test_has_prev PASSED [100%]
======================================================================= 5353 passed in 5.61s ========================================================================
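For reference, here is a minimal sketch of the check I ran, assuming the failure has already been injected into the target function (the script itself is illustrative and not part of the benchmark):

# Sketch of the failure-injection sanity check described above.
# Step 1 (manual): replace the body of the task's target function with
#     assert 1 == 0
# Step 2: rerun the mapped test file and expect at least one failure.
# TEST_FILE comes from repoeval_task_id2tests.json for deepmind_tracr/6.
import subprocess

TEST_FILE = "tracr/rasp/rasp_test.py"

result = subprocess.run(["pytest", TEST_FILE, "-v"], capture_output=True, text=True)
print(result.stdout[-1000:])  # tail of the pytest report

# If the mapping were correct, the injected failure would make pytest
# exit nonzero; in my run it exited 0 with all 5353 tests passing.
assert result.returncode != 0, "mapped tests never exercised the injected failure"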
The same behavior also appears in the trajectories of the OpenHands agent and SWE-agent. This suggests a potential mismatch between the task and its corresponding test file, since the injected failure did not cause any test failures.
Could you kindly verify whether the evaluation environment and the task-to-test mapping are correctly set up? Any clarification would be greatly appreciated.