Quantization improvements #295
Conversation
Tested with q4_0 and q3_K (pure, imatrix), and the improvement is quite significant.
Nice! Here are plots of your improved algorithms: [plots omitted]
Very interesting approach with the gradient.
To be honest I don't understand these plots. I know yellow is good and blue is bad, and there is a lot of blue, so they must be pretty bad?
No, the plots of your algorithms are not bad. Blue is simply the color of the max error. I did also include the values for the min, mean, and max cosine similarities of the plots. If an algorithm had a very big error in one spot, everything else would be yellow. This means the colors can't really be compared directly. What can be gleaned from these plots is whether the algorithms have spots where a transition between representable values is very harsh, which can indicate either instability in the algorithm or non-idealness. In this case, the modifications you propose here do improve how the plots look (for
And what are the two coordinates of the plot? I understand it is a projection, but what is it that is being projected?
That would be the standard way to approach an optimization problem, no?
The horizontal coordinate is …
The vectors tested have the form …
The script which I'm using is https://github.com/compilade/rounding-experiments/blob/main/equirectangular.py, although I have some local modifications to make it use other rounding algorithms, which are defined in …
Sure. Being standard doesn't mean it's not interesting. You have made the gradients explicit, which I appreciate. And if your gradient search and my cumulative search (once the range is reduced) are equivalent (or close enough), that in itself is interesting, since I did not explicitly use gradients. I really like when different approaches end up being equivalent (or close enough) because this makes them easier to understand, explain and generalize to other cases (notably, my approach might be harder to adapt to grid-restricted i-quants). (If they are not equivalent, this is still very cool, even if this is technically using a standard approach) I will compare the speed and perplexity of narrower cumulative search with this once I have some spare time, since I do think reducing the searched range will greatly improve the speed of my (currently quite slow) proposed algorithms.
Was this needed for some quants of DSL to function? I ran into issues with a pure iq4_k_r4 quant of the new DeepSeek V3 0324 (as my first mix of this finetune was noticeably slower than my first and fastest mix of R1). The pure quant ran at about the same speed as that R1 mix (I think it should have been a bit faster than it is, and the speed loss may be from #259, since I did not convert this model myself and grabbed a conversion that was done with mainline), but it was not functional (I forgot to test perplexity before unloading it), either giving a few incomprehensible tokens or going straight to an EOS token in my brief usage. Comparing the quant logs for both, the only tensors that differed in the functional R1 mix were the following 5:
These two tensors were not in my new mix as mentioned above, being computed (
The new pure V3 had all three of the above set to iq4_k_r4. Also for reference, the full tensor breakdown of both mixes. R1 (fast and functional):
Pure mix of V3_0324:
Do you think that setting output.weight to iq6_k and leaving the rest completely pure would work? When I do make this next quant I might end up converting the model myself to see if #259 was costing me performance (even if I won't be comparing the exact same mix, I think it would still answer that question).
#259 creates
Not sure about this one.
Yes, I experimented with some quant mixes with those at Q8_0 before to see how much impact they had on PPL (but I never isolated the effects, as the change in PPL was too minor and the TG impact too large for my preferences).
Yes, it is unfortunately very sensitive to that. I even considered #259 before I downloaded this preconverted model, but decided to try it anyway.
I'll test attn_output.weight set to iq6_k and report back when I get a chance (I will first have to download and convert the model so that I can also test #259).
This was also outputting gibberish. It seems both are important.
It is now more than a year since I added the imatrix to `llama.cpp`. I think we can say that imatrix based quantization is now the standard. Hence, I believe it is no longer necessary to make quantization robust against failure modes that can be triggered when quantizing without an imatrix.

Based on this consideration, this PR adds improved versions of `make_qx_quants`, used to quantize `Q4_0, Q5_0, Q6_0, Q3_K, Q6_K`, and of `quantize_row_iq4_nl_impl`, used to quantize `IQ4_NL` and `IQ4_XS`.

The following table shows PPL comparisons between the main branch, this PR, and PR 12557 in mainline `llama.cpp` for LLaMA-v1-7B[1] (L1-7B in the table), LLaMA-v2-7B[1] (L2-7B), Mistral-7B[1] (M-7B), LLaMA-3.1-8B-Instruct (L3-8B), and DeepSeek-V2-Lite (DSL). Context is always 512 tokens. Also given are the quantization times (Q-time for short in the table) in seconds on a Ryzen-7950X CPU. Tested is "pure" quantization (i.e., using the `--pure` option of `llama-quantize`) with token embeddings and output tensor set to `Q8_0`. The quantization command line is

[1] Why use such ancient models? The LLaMA-v1 models were the basis for k-quants development. I-quants were developed using LLaMA-v1, LLaMA-v2 and Mistral-7B. In my experience, if a quantization technique does well on all 3 of these, it is (almost) guaranteed to do well on any other model out there.
[2] I have this model on an old HDD. In this case quantization time is dominated by the time needed to read the data from the HDD. I could have copied the model to the SSD drive, but I think the timings for the other models give enough indication of the relative performance of the various quantization techniques.

[3] This quantization type is not available in mainline `llama.cpp`.

[4] Some of the tensor row sizes are not divisible by the k- and i-quants super-block size of 256. In mainline `llama.cpp` the quantization fails in that case when using `--pure`. I have changed `ik_llama.cpp` to use the fallback quantization type in that case in PR #294.

[5] PR 12557 does not change `Q6_K` quantization.

Some background
Quantization involves a mixed-integer optimization problem, which is hard to solve in general. But in the case of block-wise quantization, where each block is quantized independently, and hence one has to deal with just 16 or 32 variables, an exact solution is feasible without very long computation times. However, the experience with the LLaMA-v1 series of models collected while developing k-quants showed that the exact solution can often lead to disastrous results in observed quantization quality (e.g., a much higher perplexity or lower HellaSwag score). Hence, k-quants and later i-quants used heuristics to search for a solution only within a carefully tuned range of scales around the round-to-nearest (RTN) value. When I added the imatrix, the hope was that one could discard the heuristics and use the exact solution instead. But even with an imatrix, it was possible to arrive at a catastrophic failure (see, e.g., the results of the main branch for `Q4_0` and `Q5_0`. To avoid such failures, when quantizing without `--pure`, a different quantization type is used for the `ffn_down` tensors in the first few layers). In addition, quantizations were often prepared without an imatrix, so the quantization technique had to be made robust for this use case as well. Hence, the heuristics remained.

In PR 12557 in mainline `llama.cpp` @compilade uses a (nearly) exhaustive search for optimality, with correspondingly very long quantization times. One can arrive at about the same result much quicker as follows. To minimize the weighted-mean-square-error (WMSE) between the original model weights $x_i$ and the integer quants $q_i$, one needs to maximize

$$F = \frac{\left(\sum_i w_i\, x_i\, q_i\right)^2}{\sum_i w_i\, q_i^2}$$

where the $w_i$ are importances given by, e.g., an imatrix (but can also be defined in some different way when no imatrix is available), and the summation is over the elements of a quantization block. The above equation is for a "Type-0" quantization where the quantized model weight $\tilde{x}_i$ is given by $\tilde{x}_i = d\, q_i$, and where $d$ is the float block scale. The block scale that minimizes the WMSE is given by

$$d = \frac{\sum_i w_i\, x_i\, q_i}{\sum_i w_i\, q_i^2}$$
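To see why maximizing $F$ minimizes the WMSE, expand the weighted error

$$\sum_i w_i (x_i - d\, q_i)^2 = \sum_i w_i x_i^2 - 2 d \sum_i w_i x_i q_i + d^2 \sum_i w_i q_i^2,$$

which, evaluated at the optimal scale $d$ above, becomes $\sum_i w_i x_i^2 - F$. The first term does not depend on the quants, so minimizing the WMSE is equivalent to maximizing $F$.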
The gradient $g_j$ of $F$ with respect to the integer quant $q_j$ is given by

$$g_j = \frac{\partial F}{\partial q_j} = 2\, d\, w_j \left(x_j - d\, q_j\right)$$

If we take a step along the gradient (we are maximizing $F$, so we need to go along the gradient), the quant with the maximum $|g_j|$ will be the first to change to the next integer value ($q_j + \Delta_j$, where $\Delta_j = 1$ if $g_j > 0$ and $-1$ otherwise). Hence we can compute the new value of $F$ by just adding $w_j x_j \Delta_j$ to the numerator sum $\sum_i w_i x_i q_i$ (which then gets squared) and $w_j (2 q_j \Delta_j + 1)$ to the denominator $\sum_i w_i q_i^2$. If the new value of $F$ is greater than the previous highest value, we accept the change, set $q_j \to q_j + \Delta_j$, compute the new optimum scale $d$, and repeat the previous steps. If the new value of $F$ is lower than the previous highest $F$, we break out of the iteration. This is very similar to the exact solution technique, except that there one doesn't check just the quant with the maximum gradient: one adds all possible steps along the gradient that change a quant to its next integer value while the quants remain within the allowed range, sorts the steps in increasing order, goes over the steps updating one quant at a time, computes the updated $F$, and picks the step that resulted in the maximum value of $F$. Because of that, this kind of "first order" approximation is much faster than the exhaustive search, as can be seen in the above table by comparing quantization run times between this PR and @compilade's PR 12557, while achieving effectively the same quantization accuracy as measured by PPL.
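To make the iteration concrete, here is a minimal, self-contained C++ sketch of the scheme described above. It is not the actual `make_qx_quants` implementation from this PR: the block size, the quant range `[-8, 7]`, the initial round-to-nearest guess, and the iteration cap are illustrative assumptions.

```cpp
// Sketch of the gradient-guided block quantization search described above.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Quantize one block x[0..n) with importances w[0..n) to integers in
// [qmin, qmax]. Returns the float block scale d; the quants are written to q.
static float quantize_block_sketch(int n, const float * x, const float * w,
                                   int qmin, int qmax, int * q) {
    float amax = 0.f;
    for (int i = 0; i < n; ++i) amax = std::max(amax, std::fabs(x[i]));
    if (amax == 0.f) { std::fill(q, q + n, 0); return 0.f; }
    // Initial round-to-nearest guess with scale amax/qmax (an assumption).
    const float d0 = amax / qmax;
    double A = 0, B = 0;                    // A = sum w*x*q, B = sum w*q^2
    for (int i = 0; i < n; ++i) {
        q[i] = std::max(qmin, std::min(qmax, (int)std::lround(x[i] / d0)));
        A += (double)w[i] * x[i] * q[i];
        B += (double)w[i] * q[i] * q[i];
    }
    if (B <= 0) return 0.f;
    double F = A * A / B, d = A / B;
    for (int iter = 0; iter < 8 * n; ++iter) {  // safety cap on iterations
        // Find the quant with the largest |gradient| that can still move.
        int jbest = -1, delta = 0; double gbest = 0;
        for (int j = 0; j < n; ++j) {
            const double g = 2 * d * w[j] * (x[j] - d * q[j]);
            const int dj = g > 0 ? 1 : -1;
            if (q[j] + dj < qmin || q[j] + dj > qmax) continue;
            if (std::fabs(g) > gbest) { gbest = std::fabs(g); jbest = j; delta = dj; }
        }
        if (jbest < 0) break;
        // The step adds w*x*delta to the numerator sum and
        // w*(2*q*delta + 1) to the denominator sum.
        const double Anew = A + (double)w[jbest] * x[jbest] * delta;
        const double Bnew = B + (double)w[jbest] * (2 * q[jbest] * delta + 1);
        if (Bnew <= 0) break;
        const double Fnew = Anew * Anew / Bnew;
        if (Fnew <= F) break;                   // no further improvement
        q[jbest] += delta; A = Anew; B = Bnew; F = Fnew; d = A / B;
    }
    return (float)d;
}

int main() {
    // Toy usage: a single block of 32 values with uniform importances.
    const int n = 32;
    std::vector<float> x(n), w(n, 1.f);
    for (int i = 0; i < n; ++i) x[i] = std::sin(0.37f * i);
    std::vector<int> q(n);
    const float d = quantize_block_sketch(n, x.data(), w.data(), -8, 7, q.data());
    std::printf("block scale d = %g\n", d);
}
```

Note that accepting a step only requires updating the two running sums, so each iteration costs a single pass over the block to find the largest gradient, rather than a full re-quantization of the block.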
Extending the above algorithm to the non-linear quants `IQ4_XS` and `IQ4_NL` is trivial. One just needs to replace $q_i$ with $T(q_i)$ in the above equations, where $T(q_i)$ is the non-linear mapping function (lookup table), i.e., we have $\tilde{x}_i = d\, T(q_i)$.
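Spelling this substitution out in the same notation as above, the objective and the optimal block scale for the non-linear case become

$$F = \frac{\left(\sum_i w_i\, x_i\, T(q_i)\right)^2}{\sum_i w_i\, T(q_i)^2}, \qquad d = \frac{\sum_i w_i\, x_i\, T(q_i)}{\sum_i w_i\, T(q_i)^2},$$

and a step $q_j \to q_j + \Delta_j$ now adds $w_j\, x_j \left(T(q_j + \Delta_j) - T(q_j)\right)$ to the numerator sum and $w_j \left(T(q_j + \Delta_j)^2 - T(q_j)^2\right)$ to the denominator sum.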