Reproduction issue #6

@ArtemisDicoTiar

Description

Hi, I really appreciate your helpful contribution to this field.
I have always wondered why code RAG lacked a well-composed benchmark, and your work resolves that concern.

While reproducing the results reported in the paper, especially on LiveCodeBench (LCB), which is my core interest, I came close to reproducing the no-retrieval pass@1:
I got 48.90 (pass@1) with gpt-4o-mini, versus the 43.8 (pass@1) you report with gpt-4o.
So I am happy with this result.
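For reference, by pass@1 I mean the standard unbiased estimator from the HumanEval paper (Chen et al., 2021); a minimal Python sketch, where `n` is the number of samples per problem and `c` of them pass:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    computed as a numerically stable product."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```

With a single sample per problem (n = 1, k = 1) this reduces to the fraction of problems solved, averaged over the benchmark.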

But when I tried RAG with BM25, I got a pass@1 of only 0.4 with the same generative model.
I wonder if there is any hidden recipe needed to reach the 35.5 reported in Table 6.
(Although that number was obtained with gpt-3.5-turbo, I would expect my result to hover somewhere around 40.)
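In case it helps pinpoint the gap, this is roughly how I set up BM25 retrieval: a minimal sketch using the `rank_bm25` package with placeholder data; in my actual run the corpus, tokenizer, and top-k follow the repo's defaults.

```python
from rank_bm25 import BM25Okapi

# Toy stand-ins for the real retrieval corpus and an LCB problem
# (placeholders; my run uses the document pool shipped with the repo).
docs = [
    "def binary_search(arr, x): ...",
    "The heapq module provides an implementation of the heap queue.",
    "itertools.permutations returns successive r-length permutations.",
]
problem = "Implement binary search over a sorted array."

# Whitespace tokenization for both corpus and query.
tokenized_docs = [d.lower().split() for d in docs]
bm25 = BM25Okapi(tokenized_docs)

# Retrieve the top documents and prepend them to the generation prompt.
top_docs = bm25.get_top_n(problem.lower().split(), docs, n=2)
prompt = "\n\n".join(top_docs) + "\n\n" + problem
```

If the intended prompt format differs (e.g., document ordering, separators, or instructions around the retrieved context), that could explain the drop, so any pointers would be appreciated.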
