Commit 300203a

Merge pull request gangiswag#1 from rryisthebest/main

Add Initial Code (Training + Inference + Instructions)

2 parents efdd17f + 699c4a1 commit 300203a

67 files changed (+5888, -1 lines)

.gitignore

Lines changed: 13 additions & 0 deletions
datasets/*
data/*
scripts/__pycache__/*
scripts/utils/__pycache__/*
models/*
scripts/*.ipynb
wandb/*
logs/*
qrels/*
outputs/*
scripts/latency_test.py
scripts/logits_reranking_test.py
temp/*

README.md

Lines changed: 84 additions & 1 deletion

# FIRST: Faster Improved Listwise Reranking with Single Token Decoding

Relevance Feedback code will be released shortly after!

## Installation

You need to install the tevatron library (original source [here](https://github.com/texttron/tevatron)), which provides the framework for retrieval.

```
conda create --name {your env name} python=3.9.18
conda activate {your env name}
cd tevatron
pip install --editable .
pip install beir
```

You also need to install the vLLM library (instructions [here](https://docs.vllm.ai/en/latest/getting_started/installation.html)), which provides optimized LLM generation.

Before running, do:
```
export REPO_DIR=<path to this directory e.g. /shared/nas/data/m1/revanth3/exp/prf/ai2_data/workspace/repo/llm-reranker>
```
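
As a quick, optional sanity check (not part of the repo), you can confirm that the key packages import and that `REPO_DIR` is set before running the bash scripts:

```
import importlib.util
import os

# Check that the libraries the pipeline depends on are importable
for pkg in ("tevatron", "beir", "vllm"):
    assert importlib.util.find_spec(pkg), f"{pkg} is not installed"

# The bash scripts resolve all paths relative to REPO_DIR
assert os.environ.get("REPO_DIR"), "export REPO_DIR before running the scripts"
print("environment OK")
```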

## 1. Retrieval

Please download the precomputed BEIR encodings stored at (Link will be added shortly).

Run the baseline Contriever retrieval using the precomputed encodings:

```
bash bash/beir/run_1st_retrieval.sh <Path of precomputed BEIR encodings>
```

To get the baseline Contriever scores and preprocess the datasets, run:

```
bash bash/beir/run_eval.sh rank
```
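
The retrieval step writes one `rank.tsv` per dataset under `outputs/beir/`. A minimal sketch for loading it downstream, assuming the tab-separated `qid, docid, score` layout that tevatron's `--save_text` flag commonly produces:

```
from collections import defaultdict

def load_ranking(path, top_k=100):
    # Group rows by query and keep the top_k highest-scoring passages
    runs = defaultdict(list)
    with open(path) as f:
        for line in f:
            qid, docid, score = line.strip().split("\t")
            runs[qid].append((docid, float(score)))
    return {q: sorted(hits, key=lambda x: -x[1])[:top_k] for q, hits in runs.items()}

runs = load_ranking("outputs/beir/trec-covid/rank.tsv")
```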

## 2. Reranking

### 2a. Baseline Cross-encoder reranking

The cross-encoder reranking config is at `{REPO_DIR}/bash/beir/run_rerank_CE.sh`.

To run the baseline cross-encoder reranking:

```
bash bash/beir/run_rerank_CE.sh
```

### 2b. LLM Reranking

The LLM results preparation config is at `{REPO_DIR}/bash/beir/run_convert_results.sh`.

To prepare the retrieval results for LLM reranking, run:

```
bash bash/beir/run_convert_results.sh
```

The LLM reranking config is at `{REPO_DIR}/bash/beir/run_rerank_llm.sh`.

To run the LLM reranking:

```
bash bash/beir/run_rerank_llm.sh
```
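
FIRST's core idea is to read the ranking off the logits of the first decoded identifier token instead of generating the full permutation. A minimal sketch of that scoring step, assuming alphabetic identifiers that map to single tokens; the repo's `scripts/rerank_llm.py` (and its batched vLLM path) will differ in detail:

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "rryisthebest/First_Model"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def rank_window(prompt, num_passages):
    # Logits for the very first output token, given the listwise prompt
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    # Score each passage by the logit of its identifier token (A, B, C, ...)
    ids = [tok.convert_tokens_to_ids(chr(ord("A") + i)) for i in range(num_passages)]
    scores = logits[ids]
    return sorted(range(num_passages), key=lambda i: -scores[i].item())
```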

The evaluation config is at `{REPO_DIR}/bash/beir/run_eval.sh`.

To verify that ranking performance has improved after reranking, run:

```
bash bash/beir/run_eval.sh rerank
```

Set the `--suffix` flag to `llm_FIRST_alpha` for FIRST LLM evaluation or `ce` for the cross-encoder reranker.
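
Under the hood, evaluation on BEIR boils down to scoring the run against the qrels. A minimal sketch with BEIR's evaluator (the dict layouts follow BEIR's conventions; `scripts/eval.py` itself may organize this differently):

```
from beir.retrieval.evaluation import EvaluateRetrieval

qrels = {"q1": {"d1": 2, "d2": 0}}          # {qid: {docid: graded relevance}}
results = {"q1": {"d1": 13.2, "d2": 11.7}}  # {qid: {docid: reranker score}}

ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, results, [1, 10])
print(ndcg)  # e.g. {'NDCG@1': 1.0, 'NDCG@10': 1.0}
```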

## 3. Model Training

### 3a. Training Dataset

The converted training dataset (alphabetic IDs) is on [HF](https://huggingface.co/datasets/rryisthebest/rank_zephyr_training_data_alpha). The standard numeric training dataset can be found [here](https://huggingface.co/datasets/castorini/rank_zephyr_training_data).

### 3b. Training

We support three training objectives (a sketch of the combined loss follows this list):

- **Ranking**: uses a learning-to-rank algorithm over the logits of the highest-ranked passage ID.
- **Generation**: follows the principles of causal language modeling, focusing on permutation generation.
- **Combined**: a novel weighted objective, introduced in our paper, that integrates both the ranking and generation principles; this is the setting applied to the FIRST model.
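
A minimal sketch of what a combined objective can look like, assuming a simple weighted sum; `alpha` and the cross-entropy stand-ins are illustrative, not the exact formulation from the paper:

```
import torch
import torch.nn.functional as F

def combined_loss(rank_logits, rank_target, lm_logits, lm_targets, alpha=0.5):
    # Learning-to-rank term over the first-step passage-ID logits
    ranking = F.cross_entropy(rank_logits, rank_target)
    # Causal-LM term over the tokens of the generated permutation
    generation = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)), lm_targets.view(-1))
    return alpha * ranking + (1 - alpha) * generation

# Toy shapes: batch of 4, 20 candidate IDs, 60 output tokens, 32k vocab
loss = combined_loss(torch.randn(4, 20), torch.randint(20, (4,)),
                     torch.randn(4, 60, 32000), torch.randint(32000, (4, 60)))
```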

Training and accelerate configs are at `{REPO_DIR}/bash/run_train.sh` and `{REPO_DIR}/train_configs/accel_config.yaml`, respectively.

To train the model, run:

```
bash bash/run_train.sh
```

To train a gated model, log in to Hugging Face with an access token from huggingface.co/settings/tokens:

```
huggingface-cli login
```

bash/beir/run_1st_retrieval.sh

Lines changed: 34 additions & 0 deletions

#!/bin/bash

# Ensure the input directory is provided
if [ -z "$1" ]; then
    echo "Usage: $0 <input_directory>"
    exit 1
fi

input_dir="$1"

# Create necessary directories
output_dir="${REPO_DIR}/outputs/beir"
data_dir="${REPO_DIR}/datasets/beir"

mkdir -p "$output_dir" "$data_dir"

# Datasets to process
datasets=('trec-covid') # 'climate-fever' 'dbpedia-entity' 'fever' 'fiqa' 'hotpotqa' 'msmarco' 'nfcorpus' 'nq' 'scidocs' 'scifact' 'trec-covid'

# Iterate over datasets
for dataset in "${datasets[@]}"; do
    echo "Processing dataset: ${dataset}"

    dataset_output_dir="${output_dir}/${dataset}"
    mkdir -p "$dataset_output_dir"

    python -m tevatron.faiss_retriever \
        --query_reps "${input_dir}/${dataset}/original_query/qry.pt" \
        --passage_reps "${input_dir}/${dataset}/original_corpus/*.pt" \
        --depth 1000 \
        --batch_size -1 \
        --save_text \
        --save_ranking_to "${dataset_output_dir}/rank.tsv"
done
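
For context, the `tevatron.faiss_retriever` call above performs an exact inner-product search of the query encodings against the precomputed passage encodings. A minimal sketch of that operation with FAISS (shapes and names are illustrative, not tevatron's internals):

```
import faiss
import numpy as np

passage_reps = np.random.rand(10000, 768).astype("float32")  # stand-in corpus encodings
query_reps = np.random.rand(5, 768).astype("float32")        # stand-in query encodings

index = faiss.IndexFlatIP(passage_reps.shape[1])  # exact inner-product index
index.add(passage_reps)
scores, indices = index.search(query_reps, 1000)  # depth 1000, as in the script
```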

bash/beir/run_convert_results.sh

Lines changed: 23 additions & 0 deletions

#!/bin/bash

data_dir=${REPO_DIR}/datasets/beir/
output_dir=${REPO_DIR}/outputs/beir/

# List of datasets to process
datasets=('trec-covid') # 'climate-fever' 'fever' 'hotpotqa' 'msmarco' 'nfcorpus' 'nq' 'fiqa' 'scidocs' 'scifact' 'dbpedia-entity' 'trec-covid'

# Iterate over datasets and process each one
# (use a loop variable distinct from the array name to avoid shadowing it)
for dataset in "${datasets[@]}"; do
    echo "Processing dataset: ${dataset}"

    # Execute the conversion script with error handling
    if python "${REPO_DIR}/scripts/convert_results.py" \
        --dataset "${dataset}" \
        --output_dir "${output_dir}" \
        --data_type "beir" \
        --data_dir "${data_dir}" \
        --top_k 100; then
        echo "Successfully processed ${dataset}"
    else
        echo "Failed to process ${dataset}" >&2
        exit 1
    fi
done
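
`scripts/convert_results.py` is not shown in this commit view; conceptually, this step joins the `(qid, docid, score)` run with the BEIR queries and corpus so the reranker sees text rather than IDs. A hedged sketch with an assumed JSON layout:

```
import json

queries = {"q1": "what is listwise reranking?"}                           # stand-in BEIR queries
corpus = {"d1": {"title": "FIRST", "text": "Single token decoding ..."}}  # stand-in corpus
run = {"q1": [("d1", 13.2)]}                                              # stand-in retrieval run

converted = [
    {
        "query": queries[qid],
        "hits": [
            {"docid": pid, "score": score,
             "content": corpus[pid]["title"] + " " + corpus[pid]["text"]}
            for pid, score in hits[:100]  # --top_k 100
        ],
    }
    for qid, hits in run.items()
]

with open("rerank_input.json", "w") as f:
    json.dump(converted, f)
```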

bash/beir/run_eval.sh

Lines changed: 34 additions & 0 deletions

#!/bin/bash

# Check if eval_type argument is provided
if [ -z "$1" ]; then
    echo "Usage: $0 <eval_type>"
    exit 1
fi

EVAL_TYPE=$1
DATA_DIR="${REPO_DIR}/datasets/beir/"
OUTPUT_DIR="${REPO_DIR}/outputs/beir/"

# List of datasets to process
DATASETS=('trec-covid') # 'climate-fever' 'fever' 'hotpotqa' 'msmarco' 'nfcorpus' 'nq' 'fiqa' 'scidocs' 'scifact' 'dbpedia-entity' 'trec-covid'

# Iterate over datasets and process each one
for DATASET in "${DATASETS[@]}"; do
    echo "Evaluating dataset: ${DATASET}"

    # Execute the evaluation script
    # suffix: ce -> cross-encoder reranker | llm_FIRST_alpha -> FIRST model
    if python "${REPO_DIR}/scripts/eval.py" \
        --dataset "${DATASET}" \
        --output_path "${OUTPUT_DIR}" \
        --data_type "beir" \
        --suffix "llm_FIRST_alpha" \
        --eval_type "${EVAL_TYPE}" \
        --data_dir "${DATA_DIR}"; then
        echo "Successfully evaluated ${DATASET}"
    else
        echo "Failed to evaluate ${DATASET}" >&2
        exit 1
    fi
done

bash/beir/run_rerank_CE.sh

Lines changed: 26 additions & 0 deletions

#!/bin/bash

# Set directories
DATA_DIR="${REPO_DIR}/datasets/beir/"
OUTPUT_DIR="${REPO_DIR}/outputs/beir/"

# List of datasets to rerank
DATASETS=('trec-covid') # 'climate-fever' 'fever' 'hotpotqa' 'msmarco' 'nfcorpus' 'nq' 'fiqa' 'scidocs' 'scifact' 'dbpedia-entity'

# Iterate over datasets and rerank each one
for DATASET in "${DATASETS[@]}"; do
    echo "Reranking dataset: ${DATASET}"

    # Execute the rerank script with error handling
    if python "${REPO_DIR}/scripts/rerank_CE.py" \
        --dataset "${DATASET}" \
        --output_dir "${OUTPUT_DIR}" \
        --data_dir "${DATA_DIR}" \
        --data_type "beir" \
        --top_k 100; then
        echo "Successfully reranked ${DATASET} with CE reranker"
    else
        echo "Failed to rerank ${DATASET}" >&2
        exit 1
    fi
done
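
`scripts/rerank_CE.py` is likewise not shown here; a minimal sketch of cross-encoder reranking with sentence-transformers (the checkpoint name is an assumption, not necessarily what the script loads):

```
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)

def rerank(query, passages, top_k=100):
    # Score each (query, passage) pair jointly, then sort by score
    scores = model.predict([(query, p) for p in passages[:top_k]])
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return [passages[i] for i in order]
```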

bash/beir/run_rerank_llm.sh

Lines changed: 37 additions & 0 deletions

#!/bin/bash

# Set directories and model
DATA_DIR="${REPO_DIR}/datasets/beir/"
OUTPUT_DIR="${REPO_DIR}/outputs/beir/"
MODEL_IN_USE="rryisthebest/First_Model"

# Configuration flags
USE_LOGITS=1 # Whether to use FIRST single-token logit decoding
USE_ALPHA=1  # Whether to use alphabetic identifiers

# List of datasets to rerank
DATASETS=('dbpedia-entity') # 'climate-fever' 'fever' 'hotpotqa' 'msmarco' 'nfcorpus' 'nq' 'fiqa' 'scidocs' 'scifact' 'trec-covid'

# Iterate over datasets and rerank each one
for DATASET in "${DATASETS[@]}"; do
    echo "Reranking dataset: ${DATASET}"

    # Execute the rerank script with error handling
    if python "${REPO_DIR}/scripts/rerank_llm.py" \
        --model "${MODEL_IN_USE}" \
        --dataset "${DATASET}" \
        --output_dir "${OUTPUT_DIR}" \
        --data_type "beir" \
        --data_dir "${DATA_DIR}" \
        --use_logits "${USE_LOGITS}" \
        --use_alpha "${USE_ALPHA}" \
        --llm_top_k 100 \
        --window_size 20 \
        --step_size 10 \
        --do_batched 1; then
        echo "Successfully reranked ${DATASET} with LLM reranker"
    else
        echo "Failed to rerank ${DATASET} with LLM reranker" >&2
        exit 1
    fi
done
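
The `--window_size 20` and `--step_size 10` flags imply the usual sliding-window strategy for listwise reranking beyond the model's context: rerank overlapping windows from the bottom of the candidate list upward so relevant passages can rise past window boundaries. A minimal sketch (the repo's batched implementation will differ):

```
def sliding_window_rerank(candidates, rerank_window, window_size=20, step_size=10):
    # Walk windows from the tail of the list toward the head
    end = len(candidates)
    while True:
        start = max(0, end - window_size)
        candidates[start:end] = rerank_window(candidates[start:end])
        if start == 0:
            return candidates
        end -= step_size

# Any callable that reorders a window works here, e.g. the single-token
# logit ranker sketched in the README section above (toy scorer shown)
reranked = sliding_window_rerank(list(range(100)), lambda window: sorted(window))
```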

bash/beir/run_train.sh

Lines changed: 27 additions & 0 deletions

#!/bin/bash

# Define model, dataset paths, and output directory
BASE_MODEL="HuggingFaceH4/zephyr-7b-beta"
TRAIN_DATA_PATH="rryisthebest/rank_zephyr_training_data_alpha" # Train dataset --> Hugging Face dataset or local dataset
EVAL_DATA_PATH="rryisthebest/evaluation_data_alpha"            # Eval dataset --> Hugging Face dataset or local dataset
OUTPUT_DIR="${REPO_DIR}/models/ranking/FIRST_Model"            # Directory to save the trained model
BEIR_DATA_DIR="${REPO_DIR}/datasets/beir/"

# Launch training with DeepSpeed configuration
accelerate launch --config_file "${REPO_DIR}/train_configs/accel_config_deepspeed.yaml" "${REPO_DIR}/scripts/train_ranking.py" \
    --model_name_or_path "${BASE_MODEL}" \
    --train_dataset_path "${TRAIN_DATA_PATH}" \
    --eval_dataset_path "${EVAL_DATA_PATH}" \
    --beir_data_path "${BEIR_DATA_DIR}" \
    --per_device_eval_batch_size 1 \
    --num_train_epochs 3 \
    --seed 42 \
    --per_device_train_batch_size 2 \
    --eval_steps 400 \
    --gradient_checkpointing \
    --gradient_accumulation_steps 16 \
    --lr_scheduler_type cosine \
    --num_warmup_steps 50 \
    --output_dir "${OUTPUT_DIR}" \
    --noisy_embedding_alpha 5 \
    --objective combined
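
The `--noisy_embedding_alpha 5` flag likely enables NEFTune-style embedding noise during training: uniform noise added to the input embeddings, scaled by `alpha / sqrt(seq_len * hidden_dim)`. A minimal sketch of that transform (the exact hook in `scripts/train_ranking.py` may differ):

```
import torch

def add_embedding_noise(embeds, alpha=5.0):
    # Scale follows NEFTune: alpha / sqrt(sequence_length * hidden_dim)
    seq_len, hidden = embeds.shape[-2], embeds.shape[-1]
    scale = alpha / (seq_len * hidden) ** 0.5
    return embeds + torch.empty_like(embeds).uniform_(-scale, scale)

# Example: noise a batch of token embeddings (batch=2, seq=128, dim=4096)
noisy = add_embedding_noise(torch.randn(2, 128, 4096))
```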
