Hi. I really appreciate your helpful contribution to this field.
I have always wondered why code RAG lacked a well-composed benchmark, and your work resolves that concern.
While reproducing the results reported in the paper, especially on LiveCodeBench (LCB), which is my core interest, I nearly reproduced the pass@1 without retrieval:
I got 48.90 pass@1 with gpt-4o-mini, versus the 43.8 pass@1 you report with gpt-4o.
So I am happy with this result.
But when I tried RAG with BM25 using the same generative model, I got a result of only 0.4 pass@1.
I wonder if there is any hidden recipe needed to achieve the 35.5 reported in Table 6.
(Although that number was reported with gpt-3.5-turbo, I would expect my result to land somewhere around 40.)
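For reference, here is how I am computing pass@1; this is a sketch of the standard unbiased pass@k estimator (with pass@1 reducing to the fraction of correct samples), which I am assuming matches the metric used in the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n generated samples, c of which pass.

    pass@k = 1 - C(n - c, k) / C(n, k)
    For k = 1 this reduces to c / n.
    """
    if n - c < k:
        # Fewer failing samples than draws: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 3 passing samples out of 10 gives pass@1 = 0.3
print(pass_at_k(10, 3, 1))
```

If your pass@1 is computed differently, that might explain part of the gap, but it would not account for a drop from ~48 to 0.4.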