Commit c08277e

update readme
1 parent c9a8fd9 commit c08277e

README.md

Lines changed: 58 additions & 24 deletions
@@ -1,6 +1,6 @@
# FIRST: Faster Improved Listwise Reranking with Single Token Decoding

This repository contains the code for the paper [FIRST: Faster Improved Listwise Reranking with Single Token Decoding](https://arxiv.org/pdf/2406.15657) and the reranker code for the paper [CoRNStack: High-Quality Contrastive Data for Better Code Ranking](https://arxiv.org/abs/2412.01007).

FIRST is a novel listwise LLM reranking approach that leverages the output logits of the first generated identifier to directly obtain a ranked ordering of the input candidates. FIRST incorporates a learning-to-rank loss during training, prioritizing ranking accuracy for the more relevant passages.
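
In practice, single-token decoding means one forward pass is enough: the model's next-token distribution over the candidate identifiers (A, B, C, ...) already induces the ranking. The snippet below is a conceptual sketch of that idea, not the repository's inference code; the prompt template, identifier format, and tokenization shown here are illustrative assumptions.

```python
# Conceptual sketch of single-token-decoding reranking (illustrative, not the repo's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "rryisthebest/First_Model"  # FIRST checkpoint linked later in this README
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# A toy listwise prompt; the actual prompt template used for training/inference may differ.
prompt = (
    "Rank the following passages by relevance to the query.\n"
    "Query: what causes ocean tides\n"
    "[A] Tides are caused by the gravitational pull of the moon and sun.\n"
    "[B] The stock market closed higher today.\n"
    "[C] Most coastlines see two high tides and two low tides per day.\n"
    "Ranking:"
)
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the first generated identifier

labels = ["A", "B", "C"]
label_token_ids = [tokenizer.encode(l, add_special_tokens=False)[0] for l in labels]
scores = {l: next_token_logits[tid].item() for l, tid in zip(labels, label_token_ids)}
ranking = sorted(labels, key=scores.get, reverse=True)
print(ranking)  # full candidate ordering from a single decoding step
```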

@@ -22,59 +22,93 @@ pip install beir
Before running the scripts below, do
```
export REPO_DIR=<path to the llm-reranker directory>
```

## 1. Retrieval
### 1a. Text Retrieval
We use [contriever](https://github.com/facebookresearch/contriever) as the underlying retrieval model. The precomputed query and passage embeddings for BEIR are available [here](https://huggingface.co/datasets/rryisthebest/Contreiever_BEIR_Embeddings).

**Note:** If you do not wish to run the retrieval yourself, the retrieval results are provided [here](https://drive.google.com/drive/folders/1eMiqwiTVwJy_Zcss7LQF9hQ1aeTFMZUm?usp=sharing) and you can jump directly to [Reranking](#2-reranking).

To run the contriever retrieval using the precomputed encodings:

```
bash bash/run_1st_retrieval.sh <Path to folder with BEIR encodings>
```
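
Conceptually, retrieval with precomputed encodings reduces to an inner-product search between query and passage embeddings. The sketch below illustrates only that step; the script's actual file formats, batching, and output layout are not shown, and the arrays and IDs here are stand-in assumptions.

```python
# Illustrative sketch of dense retrieval over precomputed embeddings (not the repo script).
import numpy as np

# Stand-ins for loaded contriever encodings and their aligned ID lists.
passage_emb = np.random.randn(10_000, 768).astype(np.float32)
query_emb = np.random.randn(5, 768).astype(np.float32)
passage_ids = [f"doc{i}" for i in range(passage_emb.shape[0])]
query_ids = [f"q{i}" for i in range(query_emb.shape[0])]

k = 100
scores = query_emb @ passage_emb.T          # contriever scores query/passage pairs by inner product
topk = np.argsort(-scores, axis=1)[:, :k]   # highest-scoring k passages per query

# Build a run in the {query_id: {doc_id: score}} shape typically passed on to reranking/eval.
run = {
    query_ids[qi]: {passage_ids[di]: float(scores[qi, di]) for di in topk[qi]}
    for qi in range(len(query_ids))
}
```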

To get the retrieval scores, run:

```
bash bash/run_eval.sh rank
```

### 1b. Code Retrieval
**Note:** If you do not wish to run the code retrieval yourself, the code retrieval results are provided [here](https://drive.google.com/drive/folders/1GYI4g7mTVOhsttwDSioOISBZe_KiFEFt?usp=sharing) and you can jump directly to [Reranking](#2-reranking).

To get the code retrieval scores, run:

```
bash bash/run_eval.sh rank code
```

## 2. Reranking
### 2a. Baseline Text Cross-encoder Reranking

To run the baseline text cross-encoder reranking, run:
```
bash bash/run_rerank.sh
```
### 2b. FIRST LLM Reranking - Text

To convert the retrieval results to input for text LLM reranking, run:

```
bash bash/run_convert_results.sh text
```

We provide the trained FIRST reranker [here](https://huggingface.co/rryisthebest/First_Model).

To run the FIRST reranking, set RERANK_TYPE="text" in bash/run_rerank_llm.sh and run:

```
bash bash/run_rerank_llm.sh
```

To evaluate the reranking performance, run:

```
bash bash/run_eval.sh rerank text
```
**Note:** Set the --suffix flag to "llm_FIRST_alpha" to evaluate the FIRST reranker, or to "ce" for the cross-encoder reranker.
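
If you want to score a run outside the provided wrapper, standard BEIR-style metrics can be computed with pytrec_eval (the library commonly used for BEIR evaluation). This is a minimal standalone sketch with toy inputs, not the repository's evaluation script:

```python
# Minimal sketch: scoring a reranked run with pytrec_eval (pip install pytrec_eval).
import pytrec_eval

# Toy inputs; in practice load the BEIR qrels and your reranker's output run.
qrels = {"q1": {"doc1": 1, "doc2": 0, "doc3": 2}}
run = {"q1": {"doc1": 12.3, "doc2": 3.1, "doc3": 14.9}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.10", "recall.100"})
per_query = evaluator.evaluate(run)

ndcg10 = sum(m["ndcg_cut_10"] for m in per_query.values()) / len(per_query)
print(f"nDCG@10: {ndcg10:.4f}")
```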

### 2c. CodeRanker - Code Reranking
**Note:** CodeRanker currently does not support logit and alpha inference.

To convert the code retrieval results to input for code LLM reranking, run:

```
bash bash/run_convert_results.sh code
```

We provide the trained FIRST reranker [here](https://huggingface.co/rryisthebest/First_Model).

To run the CodeRanker reranking, set RERANK_TYPE="code" and CODE_PROMPT_TYPE="docstring" (CodeSearchNet) or "github_issue" (SWE-bench) in bash/run_rerank_llm.sh and run:

```
bash bash/run_rerank_llm.sh
```

To evaluate the reranking performance, run:

```
bash bash/run_eval.sh rerank code
```

## 3. Model Training
**Note:** Below is the training code for FIRST. We are still working on releasing the training code for the CodeRanker.

We provide the data and scripts to train the LLM reranker yourself if you wish to do so.
### 3a. Training Dataset
The converted training dataset (alphabetic IDs) is available on [HF](https://huggingface.co/datasets/rryisthebest/rank_zephyr_training_data_alpha). The standard numeric training dataset can be found [here](https://huggingface.co/datasets/castorini/rank_zephyr_training_data).

@@ -88,38 +122,38 @@ We support three training objectives:

To train the model, run:
```
bash bash/run_train.sh
```
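
FIRST's training combines language modeling with a learning-to-rank objective that weights mistakes on the more relevant passages more heavily (see the intro above). As a rough illustration of that idea only, and not the repository's exact loss, a weighted RankNet-style pairwise loss over the first-identifier logits could look like this:

```python
# Hedged sketch of a weighted pairwise (RankNet-style) learning-to-rank loss over
# candidate-identifier logits; see the FIRST paper for the exact training objective.
import torch

def weighted_ranknet_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: (num_candidates,) first-token scores for identifiers A, B, C, ...
    labels: (num_candidates,) relevance labels (higher = more relevant)."""
    diff = logits.unsqueeze(1) - logits.unsqueeze(0)                  # s_i - s_j for all pairs
    target = (labels.unsqueeze(1) > labels.unsqueeze(0)).float()      # 1 where i should outrank j
    # Weight each pair by the relevance of the better item, so errors near the top
    # of the ranking cost more than errors among low-relevance candidates.
    weight = labels.unsqueeze(1).clamp(min=0).float()
    pair_loss = torch.nn.functional.binary_cross_entropy_with_logits(
        diff, target, reduction="none"
    )
    return (weight * target * pair_loss).sum() / (weight * target).sum().clamp(min=1e-6)

# Toy usage: three candidates, the second is the most relevant.
loss = weighted_ranknet_loss(torch.tensor([1.2, 0.3, -0.5]), torch.tensor([1.0, 2.0, 0.0]))
print(loss.item())
```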

To train a gated model, log in to Hugging Face and get an access token at huggingface.co/settings/tokens.
```
huggingface-cli login
```

## 4. Relevance Feedback (not relevant for CodeRanker)
We also provide scripts to use the LLM reranker for a downstream task, such as relevance feedback. [Inference-time relevance feedback](https://arxiv.org/pdf/2305.11744) uses the reranker's output to distill the retriever's query embedding to improve recall.
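
Concretely, the distillation step nudges the dense query embedding so that the retriever's scores over the top candidates better match the reranker's scores, and the updated embedding is then used for a second retrieval. The sketch below is one simple KL-based instantiation of that idea under assumed inputs (torch tensors for embeddings and reranker scores); the repository's distillation script may differ.

```python
# Conceptual sketch of inference-time relevance feedback (not the repo's exact objective):
# pull the query embedding toward the candidates the reranker scored highly.
import torch

def distill_query_embedding(q: torch.Tensor,
                            cand_emb: torch.Tensor,
                            rerank_scores: torch.Tensor,
                            steps: int = 50,
                            lr: float = 0.01) -> torch.Tensor:
    """q: (d,) original query embedding; cand_emb: (k, d) top-k passage embeddings;
    rerank_scores: (k,) scores from the LLM or cross-encoder reranker."""
    target = torch.softmax(rerank_scores, dim=0)           # reranker's distribution over candidates
    q_new = q.clone().requires_grad_(True)
    opt = torch.optim.Adam([q_new], lr=lr)
    for _ in range(steps):
        retriever_logits = cand_emb @ q_new                # dot-product retrieval scores
        pred = torch.log_softmax(retriever_logits, dim=0)
        loss = torch.nn.functional.kl_div(pred, target, reduction="sum")  # match the reranker
        opt.zero_grad()
        loss.backward()
        opt.step()
    return q_new.detach()                                   # embedding for the 2nd retrieval step
```
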
### 4a. Dataset preparation for relevance feedback
To prepare dataset(s) for relevance feedback, run:
```
bash bash/run_prepare_distill.sh <Path to folder with BEIR encodings>
```
### 4b. Distillation (Relevance Feedback Step, not relevant for CodeRanker)
You can run distillation with the cross-encoder, the LLM reranker, or both sequentially.
To perform the relevance feedback distillation step, run:
```
bash bash/run_distill.sh
```
This step creates new query embeddings after distillation.

### 4c. 2nd Retrieval (not relevant for CodeRanker)
To perform the retrieval step with the new query embeddings after distillation, run:
```
bash bash/run_2nd_retrieval.sh <Path to folder with BEIR encodings>
```

### 4d. Relevance feedback evaluation (not relevant for CodeRanker)
To evaluate the 2nd retrieval step, run:
```
bash bash/run_eval.sh rank_refit
```

## Citation
