TRL/Original DPO

Experiments comparing the TRL implementation of DPO with the original implementation.
Created on October 10 | Last edited on October 24

[Figure: six charts with Step on the x-axis; chart titles and legends were not preserved. Run set: 38 runs, first 10 shown.]


Results:

As the graphs and table above show, the original implementation (fp32, two models) performs better than TRL + LoRA + bf16, and the difference is quite large. I will try more LoRA parameter settings in the hope of improving the quality of the model.
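
One way to organize the follow-up LoRA runs is a small grid over the adapter hyperparameters. The sketch below only enumerates candidate settings; it does not train anything. The key names `r`, `lora_alpha`, and `lora_dropout` follow the usual LoRA configuration fields, while the helper `lora_grid` and every value in the search space are hypothetical placeholders, not settings taken from these experiments.

```python
from itertools import product

# Hypothetical search space: none of these values come from the report.
SEARCH_SPACE = {
    "r": [8, 16, 64],             # adapter rank
    "lora_alpha": [16, 32],       # scaling factor
    "lora_dropout": [0.0, 0.05],  # dropout on the adapter input
}


def lora_grid(space):
    """Return one dict per combination of the values in `space`."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


configs = lora_grid(SEARCH_SPACE)
for cfg in configs:
    # Each dict could be passed to whatever launches a single training run.
    print(cfg)
```

Logging each dict alongside its run would make it easier to tell which LoRA setting, if any, closes the gap to the fp32 baseline.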