TRL/Original DPO

Experiments comparing the TRL implementation of DPO with the original implementation.

Created on October 10 | Last edited on October 24

[Plots, each with training step on the x-axis: eval/rewards/accuracies, loss/train, rewards_eval/margins, train/loss, train/rewards/margins, eval/rewards/margins. The eval/rewards/accuracies, train/loss, train/rewards/margins and eval/rewards/margins panels show only the first 10 runs.]
[Table: run set (38 runs)]
Results: As the graphs and the table above show, the original implementation with fp32 and two models performs better than TRL + LoRA + bf16, and the difference is quite large. I will try more LoRA parameter settings in the hope of improving the quality of the model.
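For readers unfamiliar with the plotted metrics, here is a minimal sketch of how the loss, reward margins and reward accuracies are usually defined in DPO. It assumes the panels above follow the standard definitions; the function name `dpo_metrics`, its arguments and the default `beta` are hypothetical and are not taken from either implementation compared here.

```python
import math


def dpo_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Illustrative DPO metrics from per-example sequence log-probabilities.

    Each argument is a list of summed log-probs, one entry per preference pair.
    `beta` is the DPO temperature; 0.1 is only a placeholder value.
    """
    losses, margins, correct = [], [], 0
    for pc, pr, rc, rr in zip(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
        # Implicit rewards: scaled log-ratio of the policy to the reference model.
        chosen_reward = beta * (pc - rc)
        rejected_reward = beta * (pr - rr)
        margin = chosen_reward - rejected_reward
        # DPO loss is -log(sigmoid(margin)), written as a numerically stable softplus.
        loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
        losses.append(loss)
        margins.append(margin)
        # A pair counts as correct when the chosen response gets the higher reward.
        correct += chosen_reward > rejected_reward
    n = len(losses)
    return {
        "loss": sum(losses) / n,
        "rewards/margins": sum(margins) / n,
        "rewards/accuracies": correct / n,
    }
```

Under these definitions, a higher reward margin and accuracy on the evaluation set means the policy separates chosen from rejected responses more clearly relative to the reference model.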
