Issue Reproducing Retrieval Performance

Hi there,

Thank you for the excellent work and for publishing the code base.

I am attempting to reproduce the retrieval performance of BGE-base as shown in Table 3 but have encountered some issues.

### Data Missing:
With `data_name = 'live_code_bench'`, I get the following error:
```
 File "/anaconda/envs/code-rag/lib/python3.10/site-packages/datasets/load.py", line 1858, in dataset_module_factory
    raise DatasetNotFoundError(f"Dataset '{path}' doesn't exist on the Hub or cannot be accessed.") from e
datasets.exceptions.DatasetNotFoundError: Dataset 'project-x/programming_solutions' doesn't exist on the Hub or cannot be accessed
```
### How to reproduce
To reproduce with BGE-base, I run `./run_st_models.sh BAAI/bge-base-en bge-base 16`.
For `data_name = "RepoEval"`, I get the errors below:

`ValueError: File datasets/repoeval_retrieval_data/api/corpus.jsonl not present! Please provide accurate file.`

`ValueError: File datasets/repoeval_retrieval_data/function/corpus.jsonl not present! Please provide accurate file.`

Running `python -m create/repoeval_repo.py` generated a folder named `repoeval_deepmind_tracr`.

For `data_name = "SWE-bench-Lite"`, running `python -m create/swebench_repo.py` generated many folders named `swe-bench-lite_{}_{}-{$number}`, but I still get:

`ValueError: File datasets/swe-bench-lite/corpus.jsonl not present! Please provide accurate file.`

I suspect I am processing RepoEval and SWE-bench-Lite incorrectly. Could you please tell me the correct way to process the data?
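
To make the mismatch easier to pin down, here is a minimal sketch of a check for the files the loader asks for. The paths are copied from the error messages above; the helper name `missing_files` and the assumption that paths are relative to the directory the script is run from are mine, not taken from the repository.

```python
from pathlib import Path

# Paths copied verbatim from the ValueError messages above.
EXPECTED_FILES = [
    "datasets/repoeval_retrieval_data/api/corpus.jsonl",
    "datasets/repoeval_retrieval_data/function/corpus.jsonl",
    "datasets/swe-bench-lite/corpus.jsonl",
]


def missing_files(root="."):
    """Return the expected files that do not exist under `root`.

    Hypothetical helper for illustration; not part of the repository.
    """
    base = Path(root)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]


if __name__ == "__main__":
    # Assumes the script is run from the same directory as run_st_models.sh.
    for rel in missing_files():
        print(f"missing: {rel}")
```

If all three paths are reported as missing after the `create/` scripts have run, the generated folders are probably being written under different names or locations than the loader expects.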

### Performance Mismatch
For ODEX, I get NDCG@10 = 12.29, but you report 22.0, which is a large gap.
For DS-1000, I get NDCG@10 = 18.43, but you report 10.8.
Could you please advise on what might cause these differences?
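
For clarity about what is being compared, below is a minimal sketch of the standard NDCG@k definition (linear gain, log2 discount). This is an assumption about the metric: the repository's evaluation code may use a different gain function or library, and the function name `ndcg_at_k` and its inputs are hypothetical. The numbers above appear to be this quantity averaged over queries and scaled by 100.

```python
import math


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """NDCG@k for one query.

    ranked_doc_ids: retrieved document ids, best first.
    relevance: dict mapping doc id -> graded relevance (missing ids count as 0).
    Hypothetical helper; standard definition with linear gain.
    """
    # Discounted cumulative gain of the actual ranking (rank 0 -> log2(2) = 1).
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_doc_ids[:k])
    )
    # DCG of the ideal ranking: relevant documents sorted by relevance.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

For example, `ndcg_at_k(["b", "a"], {"a": 1})` places the only relevant document at rank 2 and gives `1 / log2(3)`, while ranking it first gives `1.0`.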
