Description
I noticed that using cuBLAS with the F16 model does not give any benefit compared to non-BLAS, CPU-only mode:
```
# with cuBLAS
$ ▶ make clean && LLAMA_CUBLAS=1 make -j && time ./perplexity -m ./models/7B/ggml-model-f16.bin -f ./build/wiki.test.raw --no-mmap -t 12
....
system_info: n_threads = 12 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity : calculating perplexity over 655 chunks, batch_size=512
14.33 seconds per pass - ETA 2 hours 36 minutes
[1]4.2336,^C^C
```
```
# without BLAS
$ ▶ make clean && make -j && time ./perplexity -m ./models/7B/ggml-model-f16.bin -f ./build/wiki.test.raw --no-mmap -t 12
...
system_info: n_threads = 12 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
perplexity : calculating perplexity over 655 chunks, batch_size=512
13.75 seconds per pass - ETA 2 hours 30 minutes
[1]4.2335,^C
```
System:
- GeForce GTX 1660
- AMD Ryzen 9 5950X
In contrast, when using a quantized model, the cuBLAS run is significantly faster.
Is this expected?
I was hoping to have some performance improvement for F16 as well.
Maybe the data transfer is so slow for F16 that it defeats the purpose of offloading to the GPU?
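To make that hypothesis concrete, here is a back-of-envelope sketch in C. All of the numbers are assumptions rather than measurements: a 4096 x 4096 F16 weight matrix per mat-mul (roughly 7B-sized), the batch_size=512 from the perplexity run, ~12 GB/s effective PCIe 3.0 x16 bandwidth, and ~10 TFLOPS of F16 throughput for the GTX 1660:

```c
// Rough check of the "transfer-dominated" hypothesis.
// Every constant below is an assumption, not a measurement.
#include <stdio.h>

int main(void) {
    const double m = 4096, k = 4096, n = 512;  // assumed GEMM shape (weights x batch)
    const double pcie_bps  = 12e9;             // assumed effective PCIe 3.0 x16 bandwidth, bytes/s
    const double gpu_flops = 10e12;            // assumed peak F16 throughput of a GTX 1660, FLOP/s

    const double weight_bytes = m * k * 2;       // F16 weights copied host -> device each mat-mul
    const double gemm_flops   = 2.0 * m * k * n; // multiply-adds for one GEMM

    const double t_copy = weight_bytes / pcie_bps;
    const double t_gemm = gemm_flops / gpu_flops;

    printf("copy weights : %.2f ms\n", t_copy * 1e3);
    printf("run the GEMM : %.2f ms\n", t_gemm * 1e3);
    printf("copy/compute : %.1fx\n", t_copy / t_gemm);
    return 0;
}
```

With those assumed numbers, just copying the F16 weights over PCIe (~2.8 ms) already takes longer than the GEMM itself (~1.7 ms), which would be consistent with the GPU offload showing no net benefit.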
I noticed this after porting the latest ggml to whisper.cpp, where we use F16 precision, and was surprised that cuBLAS does not bring any improvement.
For example, some time ago I tried using NVBLAS in whisper.cpp and it did bring some decent improvements: ggml-org/whisper.cpp#220 (comment)
The NVBLAS code change was very trivial: ggml-org/whisper.cpp#239
What could NVBLAS be doing better in this case?
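For reference, here is a minimal sketch of the kind of call that makes NVBLAS work as a drop-in. This is my paraphrase, not the actual whisper.cpp patch: as I understand it, ggml already funnels large matrix multiplications through a standard `cblas_sgemm` call after converting the F16 data to F32, and NVBLAS simply intercepts that sgemm at link/preload time and runs it via cuBLAS, so no CUDA-specific code is needed in ggml itself:

```c
// Sketch only: the shapes/transpose flags mirror the usual BLAS fast path
// in ggml, not the exact change from whisper.cpp#239.
#include <cblas.h>

// C (m x n) = A (m x k) * B^T, with B stored as (n x k); all buffers are F32
// (the F16 weights are converted to F32 before the call).
// When the binary is linked against -- or LD_PRELOAD-ed with -- libnvblas.so
// (with an nvblas.conf pointing at a CPU BLAS fallback), this call is
// intercepted and executed on the GPU.
static void mul_mat_blas(const float * A, const float * B, float * C,
                         int m, int n, int k) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                m, n, k,
                1.0f, A, k,
                      B, k,
                0.0f, C, n);
}
```

Usage would be something along the lines of `LD_PRELOAD=/usr/lib/libnvblas.so ./perplexity ...`, with `NVBLAS_CONFIG_FILE` pointing at a valid nvblas.conf.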