Description
Hi, thank you very much for your excellent work and contributions to this domain!
According to the CODERAG-BENCH paper, a reproducible execution environment is provided for RepoEval. The file code-rag-bench/tree/main/generation/eval/tasks/custom_metrics/repoeval_task_id2tests.json specifies the mapping between each task and its corresponding test files for evaluation.
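For context, that mapping file has the general shape sketched below; the entry is reconstructed from the details in this issue, and the real file may list additional tests per task:

{
  "deepmind_tracr/6": ["tracr/rasp/rasp_test.py"]
}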
However, after setting up the environment following the instructions in the README.md, I noticed that for some tasks under deepmind_tracr (e.g., deepmind_tracr/6), the listed test files include tracr/rasp/rasp_test.py. To verify the evaluation behavior, I replaced the target function with a failing statement (assert 1 == 0), as is done at line 250 of generation/eval/tasks/custom_metrics/repoeval_execution.py, and ran:
pytest tracr/rasp/rasp_test.py -v
Surprisingly, all test cases passed. Here’s a small excerpt from the output:
tracr/rasp/rasp_test.py::AggregateTest::test_aggregate_on_size_2_inputs3 PASSED [ 99%]
tracr/rasp/rasp_test.py::AggregateTest::test_aggregate_on_size_2_inputs4 PASSED [ 99%]
tracr/rasp/rasp_test.py::AggregateTest::test_aggregate_on_size_2_inputs5 PASSED [ 99%]
tracr/rasp/rasp_test.py::RaspProgramTest::test_has_prev PASSED [100%]
======================================================================= 5353 passed in 5.61s ========================================================================
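For reference, here is a minimal sketch of the check I ran, assuming the failure has already been injected into the target function (the script itself is illustrative and not part of the benchmark):

# Sketch of the failure-injection sanity check described above.
# Step 1 (manual): replace the body of the task's target function with
#     assert 1 == 0
# Step 2: rerun the mapped test file and expect at least one failure.
# TEST_FILE comes from repoeval_task_id2tests.json for deepmind_tracr/6.
import subprocess

TEST_FILE = "tracr/rasp/rasp_test.py"

result = subprocess.run(["pytest", TEST_FILE, "-v"], capture_output=True, text=True)
print(result.stdout[-1000:])  # tail of the pytest report

# If the mapping were correct, the injected failure would make pytest
# exit nonzero; in my run it exited 0 with all 5353 tests passing.
assert result.returncode != 0, "mapped tests never exercised the injected failure"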
The same behavior also appears in the trajectories of the OpenHands agent and SWE-agent. This suggests a potential mismatch between the task and its corresponding test file, since the injected failure did not cause any test failures.
Could you kindly verify whether the evaluation environment and the task-to-test mapping are correctly set up? Any clarification would be greatly appreciated.