Description
I noticed that the CPU backend sometimes generates a very odd token at the beginning of the output.
Example (CPU, zero temperature):
./main -m llama-2-7b.Q4_K_M.gguf --samplers "temp" --temp 0 -p "The Julia programming language." --no-penalize-nl --top-k 0 --top-p 1.0 --min-p 0.0 --repeat-penalty 1.0 -ngl 0 --mlock
[...]
The Julia programming language. surely, the Julia programming language is a
Example (GPU, zero temperature):
./main -m llama-2-7b.Q4_K_M.gguf --samplers "temp" --temp 0 -p "The Julia programming language." --no-penalize-nl --top-k 0 --top-p 1.0 --min-p 0.0 --repeat-penalty 1.0
[...]
The Julia programming language.
Julia is a high-level, [...]
The "surely," token is nonsense. The different output is caused by the following difference between the CPU and GPU backends:
Consider a matrix-vector product A*x, where A is quantized (e.g. q4) and x is not (e.g. float32).
The CPU backend quantizes x to q8 and then computes the product with an optimized vecdot(q4, q8) routine.
The GPU backend instead dequantizes A to float32 and computes the product entirely in float32.
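To make the difference concrete, here is a minimal sketch of the two paths, assuming a simplified Q8_0-style activation quantization (blocks of 32 values, per-block absmax scale). The actual llama.cpp kernels differ in many details; this only illustrates where the extra rounding enters on the CPU path.

```python
import numpy as np

BLOCK = 32  # Q8_0 block size in ggml

def quantize_q8_0(x):
    """Round each 32-value block of x to int8 with a per-block absmax scale."""
    x = x.reshape(-1, BLOCK)
    d = np.abs(x).max(axis=1, keepdims=True) / 127.0  # per-block scale
    d[d == 0] = 1.0
    q = np.clip(np.round(x / d), -128, 127).astype(np.int8)
    return q, d

def dequantize_q8_0(q, d):
    return (q.astype(np.float32) * d).reshape(-1)

rng = np.random.default_rng(0)
A = rng.standard_normal((4096, 4096)).astype(np.float32)  # stands in for the (dequantized) q4 weights
x = rng.standard_normal(4096).astype(np.float32)          # activation vector

# GPU-like path: weights dequantized to float32, product computed in float32
y_gpu = A @ x

# CPU-like path: the activations are additionally rounded to q8 before the dot product
q, d = quantize_q8_0(x)
y_cpu = A @ dequantize_q8_0(q, d)

print("max abs difference between the two paths:", np.abs(y_gpu - y_cpu).max())
```

The point of the sketch is only that the CPU path introduces one extra rounding step on the activations that the GPU path does not have.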
I tried to find out why this kind of nonsense token is generated right at the beginning. I have a suspicion, but no definite answer:
Transformer states can be quite sparse for the initial tokens. Example state in the third layer, resulting from the first token:
These spikes amplify the quantization errors in the corresponding blocks:
Transformer states seem to become less sparse / more diffuse over time. I am not completely sure that these spikes are really the cause, but the q8 quantization of the activations in the CPU backend clearly hurts the numerical accuracy of these matrix-vector products.
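As a rough illustration of the suspected mechanism, here is a sketch of how a single spike inflates the rounding error of its block under the same Q8_0-style absmax scheme as above: the spike drives up the per-block scale, so every other value in that block is rounded on a much coarser grid. This is only a toy demonstration, not a measurement on real model states.

```python
import numpy as np

BLOCK = 32

def q8_0_roundtrip(x):
    """Quantize a single block to int8 with an absmax scale and dequantize again."""
    d = np.abs(x).max() / 127.0
    return np.round(x / d) * d

rng = np.random.default_rng(0)
block = rng.standard_normal(BLOCK).astype(np.float32)

flat  = block.copy()
spiky = block.copy()
spiky[0] = 50.0  # outlier of the kind visible in the early-layer state

for name, b in [("flat", flat), ("spiky", spiky)]:
    err = np.abs(q8_0_roundtrip(b) - b)
    # ignore the spike itself; measure the damage to the remaining values
    print(f"{name:5s} block: mean abs error of the other values = {err[1:].mean():.4f}")
```

In the spiky case the quantization error of the non-spike values grows by roughly the ratio of the spike to the block's original absmax, which matches the intuition that sparse, spiky states are the worst case for per-block q8 quantization.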