Description
Hi there,
Thank you for the excellent work and for publishing the code base.
I am attempting to reproduce the retrieval performance of BGE-base as shown in Table 3 but have encountered some issues.
Data Missing:
With data_name = 'live_code_bench', loading the dataset fails with:
File "/anaconda/envs/code-rag/lib/python3.10/site-packages/datasets/load.py", line 1858, in dataset_module_factory
raise DatasetNotFoundError(f"Dataset '{path}' doesn't exist on the Hub or cannot be accessed.") from e
datasets.exceptions.DatasetNotFoundError: Dataset 'project-x/programming_solutions' doesn't exist on the Hub or cannot be accessed
How to reproduce
To reproduce with BGE-base, I run ./run_st_models.sh BAAI/bge-base-en bge-base 16
For data_name = "RepoEval", I get the following errors:
ValueError: File datasets/repoeval_retrieval_data/api/corpus.jsonl not present! Please provide accurate file.
ValueError: File datasets/repoeval_retrieval_data/function/corpus.jsonl not present! Please provide accurate file.
Running python -m create/repoeval_repo.py generated a folder named repoeval_deepmind_tracr.
For data_name = "SWE-bench-Lite", running python -m create/swebench_repo.py generated many folders named swe-bench-lite_{}_{}-{$number}, and I get the following error:
ValueError: File datasets/swe-bench-lite/corpus.jsonl not present! Please provide accurate file.
I suspect I am processing RepoEval and SWE-bench-Lite incorrectly. Could you please tell me how to process the data?
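For reference, here is a minimal sketch of how the missing files can be checked before launching the run. It assumes the paths from the ValueError messages above are relative to the directory the script is run from; missing_corpus_files is a hypothetical helper written for this issue, not something from the repository.

```python
from pathlib import Path

# Paths copied verbatim from the ValueError messages in this issue.
EXPECTED_CORPUS_FILES = [
    "datasets/repoeval_retrieval_data/api/corpus.jsonl",
    "datasets/repoeval_retrieval_data/function/corpus.jsonl",
    "datasets/swe-bench-lite/corpus.jsonl",
]


def missing_corpus_files(root=".", expected=EXPECTED_CORPUS_FILES):
    """Return the expected corpus files that do not exist under root.

    Hypothetical helper: it only checks for file presence and does not
    validate the contents of any corpus.jsonl.
    """
    root_path = Path(root)
    return [p for p in expected if not (root_path / p).is_file()]


if __name__ == "__main__":
    for path in missing_corpus_files():
        print(f"missing: {path}")
```

On my setup all three paths would be reported as missing, which matches the errors above.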
Performance Mismatch
For ODEX, I get NDCG@10 = 12.29, but you report 22.0, which is a large gap.
For DS-1000, I get NDCG@10 = 18.43, but you report 10.8.
Could you please guide me on this?
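To make sure we are comparing the same quantity, this is a minimal sketch of NDCG@k as I understand it, using the common linear-gain formulation. It is an assumption that the repository's evaluation uses this variant, and that the numbers above are this value multiplied by 100; dcg_at_k and ndcg_at_k are illustrative names, not functions from the code base.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """NDCG@k for one query.

    ranked_relevances: relevance labels of the retrieved documents, in ranked order.
    all_relevances: relevance labels of every judged document for the query,
        used to build the ideal ranking.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # One relevant document, retrieved at rank 2 instead of rank 1.
    print(ndcg_at_k([0, 1, 0], [1], k=10))
```

If the reported numbers are computed differently (for example, a different gain function or a different corpus per query), that alone could explain part of the gap.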