Bug: Required Dataset for Livecodebench Retrieval project-x/programming is Unavailable #12

### **Description**

Hello, I am trying to run the retrieval script to generate the index and `qrels` file for the `live_code_bench` task, following the project's documentation.

The process fails when the script attempts to download the `project-x/programming` dataset from the Hugging Face Hub. The download raises a `DatasetNotFoundError`, and the message reports that the dataset is gated and requires authentication, so it is not accessible in the way the documentation assumes. This makes it impossible to run the retrieval evaluation as described.

### **Steps to Reproduce**

1.  Clone the repository.
2.  Install dependencies.
3.  Run the retrieval creation script (e.g., `retrieval/create/live_code_bench.py`). The script fails when it tries to download the default corpus.

### **Error Message**

```
Traceback (most recent call last):
  File "/root/code-rag-bench/retrieval/create/live_code_bench.py", line 65, in <module>
    main()
  File "/root/code-rag-bench/retrieval/create/live_code_bench.py", line 36, in main
    docs = get_corpus(args.corpus_name, args.cache_dir)
  File "/root/code-rag-bench/retrieval/create/live_code_bench.py", line 18, in get_corpus
    dataset = load_dataset(hf_name, cache_dir=cache_dir)["train"]
  File "/home/vipuser/anaconda3/envs/crag/lib/python3.10/site-packages/datasets/load.py", line 2062, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/vipuser/anaconda3/envs/crag/lib/python3.10/site-packages/datasets/load.py", line 1782, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/home/vipuser/anaconda3/envs/crag/lib/python3.10/site-packages/datasets/load.py", line 1652, in dataset_module_factory
    raise e1 from None
  File "/home/vipuser/anaconda3/envs/crag/lib/python3.10/site-packages/datasets/load.py", line 1636, in dataset_module_factory
    raise DatasetNotFoundError(message) from e
datasets.exceptions.DatasetNotFoundError: Dataset 'project-x/programming' is a gated dataset on the Hub. You must be authenticated to access it.
```
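Because the message mentions gating rather than a missing repository, one thing worth checking is whether an authenticated request gets any further. The following is a minimal sketch of that check, not the repository's code: `is_gated_error` and `load_corpus` are hypothetical helper names, the token is assumed to be stored in the `HF_TOKEN` environment variable, and the `token` argument assumes a recent `datasets` release. Whether access would actually be granted is unknown.

```python
import os
from typing import Optional


def is_gated_error(message: str) -> bool:
    """Return True if a Hub error message reports a gated dataset."""
    return "gated dataset" in message.lower()


def load_corpus(hf_name: str, cache_dir: Optional[str] = None):
    """Load the "train" split, passing a token from HF_TOKEN if one is set."""
    # Imported lazily so is_gated_error works without `datasets` installed.
    from datasets import load_dataset
    from datasets.exceptions import DatasetNotFoundError

    token = os.environ.get("HF_TOKEN")  # assumption: token lives in HF_TOKEN
    try:
        return load_dataset(hf_name, cache_dir=cache_dir, token=token)["train"]
    except DatasetNotFoundError as err:
        if is_gated_error(str(err)):
            # Gated: access has to be requested on the Hub before a token helps.
            raise SystemExit(f"{hf_name} is gated; request access and set HF_TOKEN.") from err
        raise
```

Calling `load_corpus("project-x/programming")` without an approved account should end in the same gated-dataset message as above, which would confirm that the blocker is access rather than the script itself.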


### **Analysis**

The `qrels` generation script appears to be designed to match `query-id` from the `livecodebench/code_generation` dataset with a corresponding `doc_id` in the corpus. This means that a generic replacement corpus (like `code_search_net`) will not work for this specific script.

It seems that `project-x/programming` was the specific required corpus containing the ground-truth documents needed to create the relevance file (`qrels`).
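To illustrate why a generic corpus cannot stand in, here is a minimal sketch of ID-based `qrels` generation. It is not the repository's implementation: `build_qrels` and `write_qrels` are hypothetical names, the rule that a query's ground-truth document carries the same ID as the query is an assumption, and the three-column TSV layout (`query-id`, `corpus-id`, `score`) is assumed as well.

```python
import csv
from typing import Iterable, List, Tuple

QrelRow = Tuple[str, str, int]


def build_qrels(query_ids: Iterable[str], corpus_doc_ids: Iterable[str]) -> List[QrelRow]:
    """Pair each query with the corpus document that shares its ID."""
    available = set(corpus_doc_ids)
    # Assumption: the ground-truth document has the same ID as the query.
    return [(qid, qid, 1) for qid in query_ids if qid in available]


def write_qrels(rows: List[QrelRow], path: str) -> None:
    """Write relevance judgments as a tab-separated file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["query-id", "corpus-id", "score"])
        writer.writerows(rows)
```

With a corpus whose IDs come from an unrelated dataset, `build_qrels` returns an empty list, so the resulting `test.tsv` would contain only the header and the evaluation would have nothing to score against.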

### **Request**

Could you please provide a working solution for this? For example:

  * Could you make the original `project-x/programming` dataset available again?
  * Alternatively, could you provide the pre-generated `qrels` file (`test.tsv`) and the corresponding search index for the `live_code_bench` task?
  * If there is a new official workflow, could you please provide updated instructions?

Thank you for your help!
