Description
I noticed that using cuBLAS with the F16 model does not give any benefit compared to non-BLAS, CPU-only mode:
```
# with cuBLAS
$ ▶ make clean && LLAMA_CUBLAS=1 make -j && time ./perplexity -m ./models/7B/ggml-model-f16.bin -f ./build/wiki.test.raw --no-mmap -t 12
....
system_info: n_threads = 12 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity : calculating perplexity over 655 chunks, batch_size=512
14.33 seconds per pass - ETA 2 hours 36 minutes
[1]4.2336,^C^C
```
```
# without BLAS
$ ▶ make clean && make -j && time ./perplexity -m ./models/7B/ggml-model-f16.bin -f ./build/wiki.test.raw --no-mmap -t 12
...
system_info: n_threads = 12 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
perplexity : calculating perplexity over 655 chunks, batch_size=512
13.75 seconds per pass - ETA 2 hours 30 minutes
[1]4.2335,^C
```
System:
- GeForce GTX 1660
- AMD Ryzen 9 5950X
In contrast, when using a quantized model, the cuBLAS run is significantly faster.
Is this expected?
I was hoping to have some performance improvement for F16 as well.
Maybe the data transfer is so slow for F16 that it defeats the purpose of offloading to the GPU?
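To make that hypothesis concrete, here is a back-of-envelope sketch in C. All of the numbers are assumptions rather than measurements: a 4096 x 4096 F16 weight matrix per mat-mul (roughly 7B-sized), the batch_size=512 from the perplexity run, ~12 GB/s effective PCIe 3.0 x16 bandwidth, and ~10 TFLOPS of F16 throughput for the GTX 1660:

```c
// Rough check of the "transfer-dominated" hypothesis.
// Every constant below is an assumption, not a measurement.
#include <stdio.h>

int main(void) {
    const double m = 4096, k = 4096, n = 512;  // assumed GEMM shape (weights x batch)
    const double pcie_bps  = 12e9;             // assumed effective PCIe 3.0 x16 bandwidth, bytes/s
    const double gpu_flops = 10e12;            // assumed peak F16 throughput of a GTX 1660, FLOP/s

    const double weight_bytes = m * k * 2;       // F16 weights copied host -> device each mat-mul
    const double gemm_flops   = 2.0 * m * k * n; // multiply-adds for one GEMM

    const double t_copy = weight_bytes / pcie_bps;
    const double t_gemm = gemm_flops / gpu_flops;

    printf("copy weights : %.2f ms\n", t_copy * 1e3);
    printf("run the GEMM : %.2f ms\n", t_gemm * 1e3);
    printf("copy/compute : %.1fx\n", t_copy / t_gemm);
    return 0;
}
```

With those assumed numbers, just copying the F16 weights over PCIe (~2.8 ms) already takes longer than the GEMM itself (~1.7 ms), which would be consistent with the GPU offload showing no net benefit.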
I noticed this after porting the latest ggml to whisper.cpp, where we use F16 precision, and was surprised that cuBLAS does not bring any improvement.
For example, some time ago I tried using NVBLAS in whisper.cpp and it did bring some decent improvements: ggml-org/whisper.cpp#220 (comment)
The NVBLAS code change was very trivial: ggml-org/whisper.cpp#239
What could NVBLAS be doing better in this case?
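For reference, here is a minimal sketch of the kind of call that makes NVBLAS work as a drop-in. This is my paraphrase, not the actual whisper.cpp patch: as I understand it, ggml already funnels large matrix multiplications through a standard `cblas_sgemm` call after converting the F16 data to F32, and NVBLAS simply intercepts that sgemm at link/preload time and runs it via cuBLAS, so no CUDA-specific code is needed in ggml itself:

```c
// Sketch only: the shapes/transpose flags mirror the usual BLAS fast path
// in ggml, not the exact change from whisper.cpp#239.
#include <cblas.h>

// C (m x n) = A (m x k) * B^T, with B stored as (n x k); all buffers are F32
// (the F16 weights are converted to F32 before the call).
// When the binary is linked against -- or LD_PRELOAD-ed with -- libnvblas.so
// (with an nvblas.conf pointing at a CPU BLAS fallback), this call is
// intercepted and executed on the GPU.
static void mul_mat_blas(const float * A, const float * B, float * C,
                         int m, int n, int k) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                m, n, k,
                1.0f, A, k,
                      B, k,
                0.0f, C, n);
}
```

Usage would be something along the lines of `LD_PRELOAD=/usr/lib/libnvblas.so ./perplexity ...`, with `NVBLAS_CONFIG_FILE` pointing at a valid nvblas.conf.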