
Potential mismatch between task and test files in repoeval_task_id2tests.json for deepmind_tarcr #9

Description

@ZimaBlue307

Hi, thank you very much for your excellent work and contributions in this domain!

According to the CODERAG-BENCH paper, a reproducible execution environment is provided for RepoEval. The file code-rag-bench/tree/main/generation/eval/tasks/custom_metrics/repoeval_task_id2tests.json specifies the mapping between each task and its corresponding test files for evaluation.
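For reference, this is roughly how I inspect that mapping; the assumption that the JSON is a flat dict of task id → list of test file paths is mine, inferred from the file name, not from the repository docs:

```python
import json

# Hypothetical sketch: load the task-to-test mapping shipped with CodeRAG-Bench
# and look up the entry discussed below. Path relative to the repository root.
mapping_path = "generation/eval/tasks/custom_metrics/repoeval_task_id2tests.json"

with open(mapping_path) as f:
    task_id2tests = json.load(f)

# Prints the test files listed for the task in question (assumed key format).
print(task_id2tests.get("deepmind_tarcr/6"))
```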

However, after setting up the environment following the instructions in the README.md, I noticed that for some tasks under deepmind_tarcr (e.g., deepmind_tarcr/6), the listed test files include tracr/rasp/rasp_test.py. To verify the evaluation behavior, I replaced the target function's body with a failing statement (assert 1 == 0), the same failure injection performed in generation/eval/tasks/custom_metrics/repoeval_execution.py at line 250, and ran:

pytest tracr/rasp/rasp_test.py -v

Surprisingly, all test cases passed. Here’s a small excerpt from the output:

tracr/rasp/rasp_test.py::AggregateTest::test_aggregate_on_size_2_inputs3 PASSED                                                                               [ 99%]
tracr/rasp/rasp_test.py::AggregateTest::test_aggregate_on_size_2_inputs4 PASSED                                                                               [ 99%]
tracr/rasp/rasp_test.py::AggregateTest::test_aggregate_on_size_2_inputs5 PASSED                                                                               [ 99%]
tracr/rasp/rasp_test.py::RaspProgramTest::test_has_prev PASSED                                                                                                [100%]

======================================================================= 5353 passed in 5.61s ========================================================================

The same behavior also shows up in the trajectories of the OpenHands agent and SWE-agent. This suggests a potential mismatch between the task and its listed test file, since the injected failure did not cause a single test to fail.
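To make the sanity check easy to repeat, here is a minimal sketch of how one might automate it; it assumes the target function has already been replaced with `assert 1 == 0` and simply checks pytest's exit code. The file path is the one listed in the mapping, and none of this reproduces the actual logic of repoeval_execution.py:

```python
import subprocess
import sys

# Sanity check (assumes the failure has already been injected into the target
# function): if the listed test file really exercises that function, pytest
# should now exit with a non-zero status.
test_file = "tracr/rasp/rasp_test.py"

result = subprocess.run(
    [sys.executable, "-m", "pytest", test_file, "-v"],
    capture_output=True,
    text=True,
)

if result.returncode == 0:
    # The surprising outcome reported above: every test still passes even
    # though the target function can no longer execute successfully.
    print("All tests passed despite the injected failure -- possible mismatch.")
else:
    print("Tests failed as expected after the injected failure.")
```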

Could you kindly verify whether the evaluation environment and the task-to-test mapping are correctly set up? Any clarification would be greatly appreciated.
