Hi. I really appreciate your helpful contribution to this field.
I have always wondered why code RAG lacked a well-composed benchmark, and your work resolves that concern.
While reproducing the results reported in the paper, especially on LiveCodeBench (LCB), which is my core interest, I nearly reproduced the pass@1 without retrieval:
I got 48.90 pass@1 with gpt-4o-mini, versus the 43.8 pass@1 you report with gpt-4o.
So I am happy with this result.
But when I tried RAG with BM25 using the same generative model, I got a result of only 0.4 pass@1.
I wonder if there is any hidden recipe needed to achieve the 35.5 reported in Table 6.
(Although that number was reported with gpt-3.5-turbo, I would expect my result to land somewhere around 40.)
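For reference, here is how I am computing pass@1; this is a sketch of the standard unbiased pass@k estimator (with pass@1 reducing to the fraction of correct samples), which I am assuming matches the metric used in the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n generated samples, c of which pass.

    pass@k = 1 - C(n - c, k) / C(n, k)
    For k = 1 this reduces to c / n.
    """
    if n - c < k:
        # Fewer failing samples than draws: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 3 passing samples out of 10 gives pass@1 = 0.3
print(pass_at_k(10, 3, 1))
```

If your pass@1 is computed differently, that might explain part of the gap, but it would not account for a drop from ~48 to 0.4.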