Description
I noticed that the CPU backend sometimes generates a very odd token at the beginning of the output.
Example (CPU, zero temperature):
./main -m llama-2-7b.Q4_K_M.gguf --samplers "temp" --temp 0 -p "The Julia programming language." --no-penalize-nl --top-k 0 --top-p 1.0 --min-p 0.0 --repeat-penalty 1.0 -ngl 0 --mlock
[...]
The Julia programming language. surely, the Julia programming language is a
Example (GPU, zero temperature):
./main -m llama-2-7b.Q4_K_M.gguf --samplers "temp" --temp 0 -p "The Julia programming language." --no-penalize-nl --top-k 0 --top-p 1.0 --min-p 0.0 --repeat-penalty 1.0
[...]
The Julia programming language.
Julia is a high-level, [...]
The "surely," token is nonsense. The different output is caused by the following difference between the CPU and GPU backends:
Consider a matrix-vector product A*x, where A is quantized (e.g. q4) and x is not (e.g. float32).
The CPU backend quantizes x to q8 and then computes the product with an optimized vecdot(q4, q8) routine.
The GPU backend instead dequantizes A to float32 and computes the product entirely in float32.
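To make the difference concrete, here is a minimal sketch of the two paths, assuming a simplified Q8_0-style activation quantization (blocks of 32 values, per-block absmax scale). The actual llama.cpp kernels differ in many details; this only illustrates where the extra rounding enters on the CPU path.

```python
import numpy as np

BLOCK = 32  # Q8_0 block size in ggml

def quantize_q8_0(x):
    """Round each 32-value block of x to int8 with a per-block absmax scale."""
    x = x.reshape(-1, BLOCK)
    d = np.abs(x).max(axis=1, keepdims=True) / 127.0  # per-block scale
    d[d == 0] = 1.0
    q = np.clip(np.round(x / d), -128, 127).astype(np.int8)
    return q, d

def dequantize_q8_0(q, d):
    return (q.astype(np.float32) * d).reshape(-1)

rng = np.random.default_rng(0)
A = rng.standard_normal((4096, 4096)).astype(np.float32)  # stands in for the (dequantized) q4 weights
x = rng.standard_normal(4096).astype(np.float32)          # activation vector

# GPU-like path: weights dequantized to float32, product computed in float32
y_gpu = A @ x

# CPU-like path: the activations are additionally rounded to q8 before the dot product
q, d = quantize_q8_0(x)
y_cpu = A @ dequantize_q8_0(q, d)

print("max abs difference between the two paths:", np.abs(y_gpu - y_cpu).max())
```

The point of the sketch is only that the CPU path introduces one extra rounding step on the activations that the GPU path does not have.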
I tried to find out why this kind of nonsense token is generated right at the beginning. I have a suspicion, but no definite answer:
Transformer states can be quite sparse for the initial tokens. Example state in the third layer, resulting from the first token:
These spikes amplify the quantization errors in the corresponding blocks:
Transformer states seem to become less sparse / more diffuse over time. I am not completely sure that these spikes are really the cause, but the q8 quantization of the activations in the CPU backend clearly hurts the numerical accuracy of these matrix-vector products.
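As a rough illustration of the suspected mechanism, here is a sketch of how a single spike inflates the rounding error of its block under the same Q8_0-style absmax scheme as above: the spike drives up the per-block scale, so every other value in that block is rounded on a much coarser grid. This is only a toy demonstration, not a measurement on real model states.

```python
import numpy as np

BLOCK = 32

def q8_0_roundtrip(x):
    """Quantize a single block to int8 with an absmax scale and dequantize again."""
    d = np.abs(x).max() / 127.0
    return np.round(x / d) * d

rng = np.random.default_rng(0)
block = rng.standard_normal(BLOCK).astype(np.float32)

flat  = block.copy()
spiky = block.copy()
spiky[0] = 50.0  # outlier of the kind visible in the early-layer state

for name, b in [("flat", flat), ("spiky", spiky)]:
    err = np.abs(q8_0_roundtrip(b) - b)
    # ignore the spike itself; measure the damage to the remaining values
    print(f"{name:5s} block: mean abs error of the other values = {err[1:].mean():.4f}")
```

In the spiky case the quantization error of the non-spike values grows by roughly the ratio of the spike to the block's original absmax, which matches the intuition that sparse, spiky states are the worst case for per-block q8 quantization.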