Quantization of transformer state for matrix-vector products potentially causes numerical accuracy issues #4755

Closed
@cafaxo

I noticed that the CPU backend sometimes generates a very odd token at the beginning of the output.
Example (CPU, zero temperature):

./main -m llama-2-7b.Q4_K_M.gguf --samplers "temp" --temp 0 -p "The Julia programming language." --no-penalize-nl --top-k 0 --top-p 1.0 --min-p 0.0 --repeat-penalty 1.0 -ngl 0 --mlock
[...]
 The Julia programming language. surely, the Julia programming language is a

Example (GPU, zero temperature):

./main -m llama-2-7b.Q4_K_M.gguf --samplers "temp" --temp 0 -p "The Julia programming language." --no-penalize-nl --top-k 0 --top-p 1.0 --min-p 0.0 --repeat-penalty 1.0
[...]
 The Julia programming language.
Julia is a high-level, [...]

The "surely," token is nonsense. The different output is caused by the following difference between the CPU and GPU backends:
Consider a matrix-vector product A*x where A is quantized (e.g. q4) and x is not quantized (e.g. float32).
The CPU backend quantizes x to q8 and then computes the product using an optimized vecdot(q4, q8) routine.
The GPU backend dequantizes A to float32 and then computes the product using float32.
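To make the difference concrete, here is a minimal sketch of the two paths for a single row of A. It is not the ggml code: the block size of 32 and the absmax/127 scale are assumptions modeled on a Q8_0-style format, the row of A is treated as already dequantized so that only the effect of quantizing x is visible, and Python floats are doubles rather than float32. The names `quantize_q8`, `a_row` and `dot` are made up for this illustration.

```python
import random

BLOCK = 32  # assumed block size, modeled on a Q8_0-style format


def quantize_q8(x):
    """Round-trip x through per-block 8-bit quantization with absmax scaling."""
    out = []
    for i in range(0, len(x), BLOCK):
        block = x[i:i + BLOCK]
        amax = max(abs(v) for v in block)
        scale = amax / 127.0 if amax > 0 else 1.0
        # Each value is stored as an integer in [-127, 127] times the block scale.
        out.extend(round(v / scale) * scale for v in block)
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


random.seed(0)
n = 4096
a_row = [random.gauss(0.0, 1.0) for _ in range(n)]  # one dequantized row of A
x = [random.gauss(0.0, 1.0) for _ in range(n)]      # activation vector

exact = dot(a_row, x)                # "GPU-style": x stays in floating point
approx = dot(a_row, quantize_q8(x))  # "CPU-style": x is quantized to 8 bits first

print(f"float x: {exact:.6f}  q8 x: {approx:.6f}  abs diff: {abs(exact - approx):.6f}")
```

The per-element rounding error is bounded by half the block scale, so the gap between the two results depends on how large the biggest entry of each block is relative to the rest.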

I tried to find out why this kind of nonsense token is generated right at the beginning. I have a suspicion, but no definite answer:
Transformer states can be quite sparse at the initial tokens. Example state in the third layer resulting from the first token:
image
These spikes amplify the quantization errors in the corresponding blocks:
image

Transformer states seem to become less sparse / more diffuse over time. I am not completely sure that these spikes are really the cause, but the q8 quantization in the CPU backend does appear to hurt the numerical accuracy of matrix-vector products.
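The suspected mechanism can be sketched on a single block. The example below assumes the same Q8_0-style scheme (32 values per block, scale = absmax/127) and uses made-up values, not actual transformer states. It compares the worst rounding error on the non-spike entries of a smooth block against the same block with one large outlier.

```python
BLOCK = 32  # assumed block size, modeled on a Q8_0-style format


def q8_roundtrip(block):
    """Quantize one block to 8-bit integers with absmax scaling, then dequantize."""
    amax = max(abs(v) for v in block)
    scale = amax / 127.0 if amax > 0 else 1.0
    return [round(v / scale) * scale for v in block]


def max_error_excluding(block, skip):
    """Largest absolute round-trip error over all entries except index `skip`."""
    deq = q8_roundtrip(block)
    return max(abs(a - b) for i, (a, b) in enumerate(zip(block, deq)) if i != skip)


smooth = [0.01 * (i + 1) for i in range(BLOCK)]  # values 0.01 .. 0.32, no outlier
spiky = list(smooth)
spiky[0] = 100.0                                 # same block with one large spike

err_smooth = max_error_excluding(smooth, 0)
err_spiky = max_error_excluding(spiky, 0)

print(f"max error without spike: {err_smooth:.6f}")
print(f"max error with spike:    {err_spiky:.6f}")
```

With the spike present, the block scale becomes so coarse that every other entry in the block rounds to zero, so their contribution to the dot product is lost entirely. Whether this is what actually produces the odd first token is not established by this sketch.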
