Automatically selecting optimal quantisation types - Help needed to test #15576

EAddario started this conversation in Show and tell
Hi all! Reaching out to the community for help in testing a new feature.
PR #15550 (in draft, pending testing) introduces a new option, `--target-bpw`, which implements an optimised quant-type selection algorithm to automatically determine per-tensor quantisation types that achieve a target bits-per-weight (bpw) with minimal estimated quality loss. Based on limited testing, this approach seems to produce, on average, better-quality models than naive quantisation (i.e. simply running standard `llama-quantize` with no further optimisations).

Although these results are very encouraging, more testing with different model architectures and sizes is needed before categorically concluding that the new functionality consistently yields higher-quality models.
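To make the idea concrete, here is a rough sketch of one way a bpw-targeted, per-tensor type selection could work. This is NOT the actual PR #15550 implementation; the tensor names, bpw figures, and error estimates below are made up purely for illustration:

```python
# Conceptual sketch (not PR #15550's algorithm): start every tensor at its
# cheapest candidate type, then greedily upgrade whichever tensor gives the
# largest estimated-error reduction per extra bit spent, until the next
# upgrade would overshoot the bpw target.

# Hypothetical (bpw, estimated_error) options per tensor, cheapest first.
CANDIDATES = {
    "blk.0.attn_q": [(4.5, 0.040), (5.5, 0.015), (6.5, 0.005)],
    "blk.0.ffn_up": [(4.5, 0.080), (5.5, 0.030), (6.5, 0.010)],
    "output":       [(6.5, 0.020), (8.0, 0.002)],
}
# Hypothetical element counts per tensor.
WEIGHTS = {"blk.0.attn_q": 4.0e6, "blk.0.ffn_up": 11.0e6, "output": 50.0e6}

def select_types(target_bpw: float):
    choice = {name: 0 for name in CANDIDATES}  # index into each candidate list
    total_w = sum(WEIGHTS.values())

    def bpw():
        # Overall bpw = weighted average of each tensor's chosen bpw.
        return sum(CANDIDATES[n][choice[n]][0] * WEIGHTS[n] for n in choice) / total_w

    while True:
        # Pick the upgrade with the best error reduction per extra bit.
        best, best_gain = None, 0.0
        for n, i in choice.items():
            opts = CANDIDATES[n]
            if i + 1 < len(opts):
                d_err = opts[i][1] - opts[i + 1][1]
                d_bits = (opts[i + 1][0] - opts[i][0]) * WEIGHTS[n]
                gain = d_err / d_bits
                if gain > best_gain:
                    best, best_gain = n, gain
        if best is None:
            break  # no upgrades left
        choice[best] += 1
        if bpw() > target_bpw:  # greedy simplification: stop at first overshoot
            choice[best] -= 1
            break
    return {n: CANDIDATES[n][i] for n, i in choice.items()}, bpw()
```

The real implementation has to weigh far more candidate types and use imatrix-derived error estimates, but the budget-vs-error trade-off is the same shape of problem.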
The testing protocol I'm using is:

1. Quantise the model naively for a selection of types (*):
   `llama-quantize --imatrix imatrix-with-activations.gguf LLM-Model-F16.gguf Naive-Quantized-<TYPE>.gguf <type>`
2. Determine the bpw of each naively quantised model (e.g. `python llama.cpp/gguf-py/gguf/scripts/gguf_dump.py --markdown Naive-Quantized-<TYPE>.gguf`)
3. Re-quantise the F16 model, setting `--target-bpw` to the corresponding bpw values (e.g. `llama-quantize --imatrix imatrix-with-activations.gguf --target-bpw <naive bpw> LLM-Model-F16.gguf BPW-Quantized-<TYPE>.gguf <type>`)
4. Compute perplexity and KL divergence for both variants against the F16 logits:
   `llama-perplexity -m <Naive|BPW>-Quantized-<TYPE>.gguf -f calibration_dataset.txt --kl-divergence-base LLM-Model-F16.logits --kl-divergence`
(*) this particular list is just a personal preference. The new function can/should be used with any supported type.
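As a sanity check on step 2, a model's overall bpw can also be estimated directly from its file size and parameter count. A minimal sketch (the result slightly overestimates the true bpw, since GGUF metadata and tokenizer data are counted too):

```python
import os

def estimate_bpw(gguf_path: str, n_params: int) -> float:
    """Rough overall bits-per-weight: file size in bits over parameter count.

    Slight overestimate, because the GGUF header, metadata, and tokenizer
    data are included in the file size.
    """
    return os.path.getsize(gguf_path) * 8 / n_params
```

For example, a ~1B-parameter model whose quantised file is ~700 MB sits around 5.6 bpw.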
I have tested two small but representative models: Llama-3.2-1B ("classic" transformer architecture) and Huihui-MoE-1.2B-A0.6B (a typical Mixture of Experts), and got the following results:
(PPL: the smaller the better; 𝜌PPL: the higher the better; KLD: the smaller the better; In bold: best quality)
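For context on how these metrics relate: PPL is the exponentiated mean negative log-likelihood of the observed tokens, and KLD measures how far the quantised model's token distribution drifts from the F16 baseline. An illustrative sketch (not llama-perplexity's actual implementation):

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(mean negative log-likelihood of the observed tokens)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def kl_divergence(p, q):
    """KL(p || q) between two token probability distributions (same support)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical distributions give a KLD of 0; quantisation error shows up as
# KLD > 0 even when perplexity barely moves, which is why KLD is the more
# sensitive quality signal here.
```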
Llama-3.2-1B results:
Huihui-MoE-1.2B-A0.6B results:
If you'd like to lend a hand, I'd be grateful if you could try it and share your results, ideally in a similar format to the above (it helps with collating and consolidating), along with comments/feedback and especially bug reports. Please feel free to test any quantisation types you'd like.
IMPORTANT NOTE!

A new imatrix format (GGUF) is required for the function to work, and if activations are included in the imatrix file, the estimation process will be much more accurate. However, at the time of writing, such a file can only be generated using #14891 with the `--activation-statistics` and `--output-format gguf` options:

`llama-imatrix -m LLM-Model-F16.gguf -f calibration-data.txt --activation-statistics --output-format gguf -o imatrix-with-activations.gguf`