Bug: Required Dataset for Livecodebench Retrieval project-x/programming is Unavailable #12

@begineri

Description

Hello, I am trying to run the retrieval script to generate the index and qrels file for the live_code_bench task, following the project's documentation.

The process fails because the script attempts to download the project-x/programming dataset from the Hugging Face Hub. The dataset is not accessible: the download raises a DatasetNotFoundError whose message describes it as a gated dataset that requires authentication. This makes it impossible to run the retrieval evaluation as described.

Steps to Reproduce

  1. Clone the repository.
  2. Install dependencies.
  3. Run the retrieval creation script (e.g., retrieval/create/live_code_bench.py). The script fails when it tries to download the default corpus.

Error Message

Traceback (most recent call last):
  File "/root/code-rag-bench/retrieval/create/live_code_bench.py", line 65, in <module>
    main()
  File "/root/code-rag-bench/retrieval/create/live_code_bench.py", line 36, in main
    docs = get_corpus(args.corpus_name, args.cache_dir)
  File "/root/code-rag-bench/retrieval/create/live_code_bench.py", line 18, in get_corpus
    dataset = load_dataset(hf_name, cache_dir=cache_dir)["train"]
  File "/home/vipuser/anaconda3/envs/crag/lib/python3.10/site-packages/datasets/load.py", line 2062, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/vipuser/anaconda3/envs/crag/lib/python3.10/site-packages/datasets/load.py", line 1782, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/home/vipuser/anaconda3/envs/crag/lib/python3.10/site-packages/datasets/load.py", line 1652, in dataset_module_factory
    raise e1 from None
  File "/home/vipuser/anaconda3/envs/crag/lib/python3.10/site-packages/datasets/load.py", line 1636, in dataset_module_factory
    raise DatasetNotFoundError(message) from e
datasets.exceptions.DatasetNotFoundError: Dataset 'project-x/programming' is a gated dataset on the Hub. You must be authenticated to access it.

Analysis

The qrels generation script appears to match each query-id from the livecodebench/code_generation dataset with a corresponding doc_id in the corpus. A generic replacement corpus (such as code_search_net) would therefore not work with this script, since it would not contain those documents.

It seems that project-x/programming was the specific required corpus containing the ground-truth documents needed to create the relevance file (qrels).
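
To illustrate why a generic corpus cannot stand in, here is a minimal sketch of the matching step as I understand it. It is not the repository's actual code: the function names, the record keys ("query-id", "doc_id"), the TSV column names, and the toy data are all assumptions made for illustration.

```python
import csv
import io


def build_qrels(queries, corpus_ids):
    """Pair each query with its ground-truth document, if the corpus has it.

    queries: iterable of dicts with the hypothetical keys "query-id" and "doc_id".
    corpus_ids: set of document ids present in the corpus.
    Returns (rows, missing): matched qrels rows and the query ids with no match.
    """
    rows, missing = [], []
    for q in queries:
        if q["doc_id"] in corpus_ids:
            rows.append((q["query-id"], q["doc_id"], 1))
        else:
            missing.append(q["query-id"])
    return rows, missing


def write_qrels_tsv(rows, handle):
    """Write qrels rows as tab-separated text (column names are assumed)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["query-id", "corpus-id", "score"])
    writer.writerows(rows)


# Toy data, invented for this example.
queries = [
    {"query-id": "q1", "doc_id": "d1"},
    {"query-id": "q2", "doc_id": "d9"},
]
original_corpus = {"d1", "d9"}   # stands in for the required corpus
generic_corpus = {"x1", "x2"}    # stands in for an unrelated corpus

rows, missing = build_qrels(queries, original_corpus)
buffer = io.StringIO()
write_qrels_tsv(rows, buffer)
print(buffer.getvalue())

# With an unrelated corpus, no query finds its document, so qrels would be empty.
rows_generic, missing_generic = build_qrels(queries, generic_corpus)
print(len(rows_generic), missing_generic)
```

In this sketch, every query matches when the corpus holds the expected ids, and none match when it does not, which is the situation a replacement corpus would create.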

Request

Could you please provide a working solution for this? For example:

  • Could you make the original project-x/programming dataset available again?
  • Alternatively, could you provide the pre-generated qrels file (test.tsv) and the corresponding search index for the live_code_bench task?
  • If there is a new official workflow, could you please provide updated instructions?

Thank you for your help!
