No cuBLAS performance gain for F16 #1249

@ggerganov

I noticed that using cuBLAS with the F16 model does not give any benefit compared to non-BLAS CPU-only mode:

# with cuBLAS
$ ▶ make clean && LLAMA_CUBLAS=1 make -j && time ./perplexity -m ./models/7B/ggml-model-f16.bin -f ./build/wiki.test.raw --no-mmap -t 12
....
system_info: n_threads = 12 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
14.33 seconds per pass - ETA 2 hours 36 minutes
[1]4.2336,^C^C

# without BLAS
$ ▶ make clean && make -j && time ./perplexity -m ./models/7B/ggml-model-f16.bin -f ./build/wiki.test.raw --no-mmap -t 12
...
system_info: n_threads = 12 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
13.75 seconds per pass - ETA 2 hours 30 minutes
[1]4.2335,^C
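
To put the two runs side by side, here is a small sketch that recomputes the ETAs from the logged seconds-per-pass and chunk count, and expresses the gap as a percentage. The helper names (`format_eta`, `relative_slowdown`) are made up for illustration and are not part of llama.cpp.

```python
# Figures copied from the two logs above.
N_CHUNKS = 655
SEC_PER_PASS_CUBLAS = 14.33
SEC_PER_PASS_CPU = 13.75


def format_eta(sec_per_pass: float, n_chunks: int) -> str:
    """Total run time as 'H hours M minutes', truncated to whole minutes."""
    total = int(sec_per_pass * n_chunks)
    hours, rem = divmod(total, 3600)
    return f"{hours} hours {rem // 60} minutes"


def relative_slowdown(slow: float, fast: float) -> float:
    """Fractional extra time of `slow` relative to `fast`."""
    return (slow - fast) / fast


if __name__ == "__main__":
    print("cuBLAS ETA:", format_eta(SEC_PER_PASS_CUBLAS, N_CHUNKS))
    print("CPU ETA:   ", format_eta(SEC_PER_PASS_CPU, N_CHUNKS))
    gap = relative_slowdown(SEC_PER_PASS_CUBLAS, SEC_PER_PASS_CPU)
    print(f"cuBLAS is {gap * 100:.1f}% slower per pass")
```

With the logged numbers, the cuBLAS build comes out about 4% slower per pass than the CPU-only build, which matches the 2 h 36 min vs. 2 h 30 min ETAs.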

System:

  • GeForce GTX 1660
  • AMD Ryzen 9 5950X

In contrast, when using a quantized model, the cuBLAS run is significantly faster.

Is this expected?
I was hoping to have some performance improvement for F16 as well.
Maybe the data transfer is very slow for F16 and it defeats the purpose of offloading to the GPU?
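
One way to sanity-check the transfer hypothesis is a back-of-envelope estimate of how long it would take to push the F16 weights across the PCIe bus once. This is only a sketch: the parameter count (roughly 7e9 for the 7B model), the 2 bytes per F16 value, and especially the effective bandwidth of 12 GB/s are assumptions, not measurements from this system, and `transfer_seconds` is a hypothetical helper.

```python
def transfer_seconds(n_params: float, bytes_per_param: float,
                     bandwidth_bytes_per_s: float) -> float:
    """Time to copy n_params values host -> device at a given bandwidth."""
    return n_params * bytes_per_param / bandwidth_bytes_per_s


if __name__ == "__main__":
    # All three inputs are assumed values for illustration only.
    n_params = 7e9          # approximate size of the 7B model
    bytes_f16 = 2           # one F16 value
    bandwidth = 12e9        # assumed effective PCIe throughput, bytes/s

    t = transfer_seconds(n_params, bytes_f16, bandwidth)
    print(f"one full upload of the F16 weights: ~{t:.2f} s")
```

Comparing that figure with the measured seconds per pass gives a rough idea of how much of each pass could be spent on copies alone. Actual copy time would have to be measured, for example with a profiler.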

I noticed this after porting the latest ggml to whisper.cpp, where we use F16 precision, and was surprised that cuBLAS does not bring any improvement there either.

For example, some time ago I tried using NVBLAS in whisper.cpp and it did bring some decent improvements: ggml-org/whisper.cpp#220 (comment)

The NVBLAS code change was trivial: ggml-org/whisper.cpp#239
What could NVBLAS be doing better in this case?

Labels: question
