Automatically selecting optimal quantisation types - Help needed to test #15576

EAddario started this conversation in Show and tell
Hi all! Reaching out to the community for help in testing a new feature.
PR #15550 (in draft, pending testing) introduces a new option, `--target-bpw`, which implements an optimised quant-type selection algorithm to automatically determine per-tensor quantisation types that achieve a target bits-per-weight (bpw) with minimal estimated quality loss. Based on limited testing, this approach seems to produce, on average, better-quality models than naive quantisation (i.e. simply running standard `llama-quantize` with no further optimisations).

Although these results are very encouraging, more testing with different model architectures and sizes is needed before categorically concluding that the new functionality consistently yields higher-quality models.
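To make the idea concrete, here is a rough sketch of one way a bpw-targeted, per-tensor type selection could work. This is NOT the actual PR #15550 implementation; the tensor names, bpw figures, and error estimates below are made up purely for illustration:

```python
# Conceptual sketch (not PR #15550's algorithm): start every tensor at its
# cheapest candidate type, then greedily upgrade whichever tensor gives the
# largest estimated-error reduction per extra bit spent, until the next
# upgrade would overshoot the bpw target.

# Hypothetical (bpw, estimated_error) options per tensor, cheapest first.
CANDIDATES = {
    "blk.0.attn_q": [(4.5, 0.040), (5.5, 0.015), (6.5, 0.005)],
    "blk.0.ffn_up": [(4.5, 0.080), (5.5, 0.030), (6.5, 0.010)],
    "output":       [(6.5, 0.020), (8.0, 0.002)],
}
# Hypothetical element counts per tensor.
WEIGHTS = {"blk.0.attn_q": 4.0e6, "blk.0.ffn_up": 11.0e6, "output": 50.0e6}

def select_types(target_bpw: float):
    choice = {name: 0 for name in CANDIDATES}  # index into each candidate list
    total_w = sum(WEIGHTS.values())

    def bpw():
        # Overall bpw = weighted average of each tensor's chosen bpw.
        return sum(CANDIDATES[n][choice[n]][0] * WEIGHTS[n] for n in choice) / total_w

    while True:
        # Pick the upgrade with the best error reduction per extra bit.
        best, best_gain = None, 0.0
        for n, i in choice.items():
            opts = CANDIDATES[n]
            if i + 1 < len(opts):
                d_err = opts[i][1] - opts[i + 1][1]
                d_bits = (opts[i + 1][0] - opts[i][0]) * WEIGHTS[n]
                gain = d_err / d_bits
                if gain > best_gain:
                    best, best_gain = n, gain
        if best is None:
            break  # no upgrades left
        choice[best] += 1
        if bpw() > target_bpw:  # greedy simplification: stop at first overshoot
            choice[best] -= 1
            break
    return {n: CANDIDATES[n][i] for n, i in choice.items()}, bpw()
```

The real implementation has to weigh far more candidate types and use imatrix-derived error estimates, but the budget-vs-error trade-off is the same shape of problem.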
The testing protocol I'm using is:

1. Quantise the model naively for a selection of types (*):
   `llama-quantize --imatrix imatrix-with-activations.gguf LLM-Model-F16.gguf Naive-Quantized-<TYPE>.gguf <type>`
2. Determine the bpw of each naively quantised model (e.g. `python llama.cpp/gguf-py/gguf/scripts/gguf_dump.py --markdown Naive-Quantized-<TYPE>.gguf`)
3. Re-quantise the F16 model, setting `--target-bpw` to the corresponding bpw values (e.g. `llama-quantize --imatrix imatrix-with-activations.gguf --target-bpw <naive bpw> LLM-Model-F16.gguf BPW-Quantized-<TYPE>.gguf <type>`)
4. Compute perplexity and KL divergence for both variants against the F16 logits:
   `llama-perplexity -m <Naive|BPW>-Quantized-<TYPE>.gguf -f calibration_dataset.txt --kl-divergence-base LLM-Model-F16.logits --kl-divergence`
(*) this particular list is just a personal preference. The new function can/should be used with any supported type.
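As a sanity check on step 2, a model's overall bpw can also be estimated directly from its file size and parameter count. A minimal sketch (the result slightly overestimates the true bpw, since GGUF metadata and tokenizer data are counted too):

```python
import os

def estimate_bpw(gguf_path: str, n_params: int) -> float:
    """Rough overall bits-per-weight: file size in bits over parameter count.

    Slight overestimate, because the GGUF header, metadata, and tokenizer
    data are included in the file size.
    """
    return os.path.getsize(gguf_path) * 8 / n_params
```

For example, a ~1B-parameter model whose quantised file is ~700 MB sits around 5.6 bpw.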
I have tested two small but representative models: Llama-3.2-1B ("classic" transformer architecture) and Huihui-MoE-1.2B-A0.6B (a typical Mixture of Experts), and got the following results:
(PPL: the smaller the better; 𝜌PPL: the higher the better; KLD: the smaller the better; In bold: best quality)
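For context on how these metrics relate: PPL is the exponentiated mean negative log-likelihood of the observed tokens, and KLD measures how far the quantised model's token distribution drifts from the F16 baseline. An illustrative sketch (not llama-perplexity's actual implementation):

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(mean negative log-likelihood of the observed tokens)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def kl_divergence(p, q):
    """KL(p || q) between two token probability distributions (same support)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical distributions give a KLD of 0; quantisation error shows up as
# KLD > 0 even when perplexity barely moves, which is why KLD is the more
# sensitive quality signal here.
```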
Llama-3.2-1B results:
Huihui-MoE-1.2B-A0.6B results:
If you'd like to lend a hand, I'd be grateful if you could try it and share your results, ideally in a similar format to the above (it helps with collating and consolidating), along with comments/feedback and especially bug reports. Please feel free to test any quantisation types you'd like.
IMPORTANT NOTE!

A new imatrix format (GGUF) is required for the function to work, and if activations are included in the imatrix file, the estimation process will be much more accurate. However, at the time of writing, such a file can only be generated using #14891 with the `--activation-statistics` and `--output-format gguf` options:

`llama-imatrix -m LLM-Model-F16.gguf -f calibration-data.txt --activation-statistics --output-format gguf -o imatrix-with-activations.gguf`