Skip to content

Conversation

noemotiovon
Copy link
Contributor

@noemotiovon noemotiovon commented Aug 25, 2025

What does this PR do?

  1. Refactored the mask computation in Flash Attention, unified the logic without separating prefill and decode.
  2. Optimized performance in non-alibi scenarios by reducing one repeat operation.
  3. Optimized tensor layout in FA from BNSD to BSND, reducing tensor movement time (BSND layout is contiguous in memory), significantly improving performance (15%).
  4. Updated operator management to explicitly mark unsupported cases on 310P devices and when dim is not divisible by 16.

OP Test

......
  11821/11821 tests passed
  Backend CANN0: OK
Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

Model Test

......
llama_perf_sampler_print:    sampling time =      58.82 ms /   257 runs   (    0.23 ms per token,  4369.49 tokens per second)
llama_perf_context_print:        load time =    1744.56 ms
llama_perf_context_print: prompt eval time =      30.98 ms /    20 tokens (    1.55 ms per token,   645.58 tokens per second)
llama_perf_context_print:        eval time =    1266.71 ms /   236 runs   (    5.37 ms per token,   186.31 tokens per second)
llama_perf_context_print:       total time =    2386.79 ms /   256 tokens
llama_perf_context_print:    graphs reused =        235

@github-actions github-actions bot added ggml changes relating to the ggml tensor library for machine learning Ascend NPU issues specific to Ascend NPUs labels Aug 25, 2025
@noemotiovon noemotiovon force-pushed the fa branch 2 times, most recently from df36aa8 to 09f0444 Compare August 26, 2025 08:10
@noemotiovon noemotiovon changed the title [CANN]Optimization of unnecessary repeat in the FA operator CANN: refactor mask handling and improve performance in FA Aug 26, 2025
Copy link
Collaborator

@hipudding hipudding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job! only one minor modification required.

@noemotiovon
Copy link
Contributor Author

noemotiovon commented Aug 26, 2025

I also optimized tensor layout in FA from BNSD to BSND in this PR, reducing tensor movement time (BSND layout is contiguous in memory), significantly improving performance (15%).

Model Test:

......
> 
llama_perf_sampler_print:    sampling time =      97.20 ms /   409 runs   (    0.24 ms per token,  4207.60 tokens per second)
llama_perf_context_print:        load time =    2098.21 ms
llama_perf_context_print: prompt eval time =      27.14 ms /    20 tokens (    1.36 ms per token,   736.97 tokens per second)
llama_perf_context_print:        eval time =    1708.84 ms /   388 runs   (    4.40 ms per token,   227.05 tokens per second)
llama_perf_context_print:       total time =    3553.72 ms /   408 tokens
llama_perf_context_print:    graphs reused =        386

1. Refactored the mask computation in Flash Attention, unified the logic without separating prefill and decode.
2. Optimized performance in non-alibi scenarios by reducing one repeat operation.
3. Updated operator management to explicitly mark unsupported cases on 310P devices and when dim is not divisible by 16.

Signed-off-by: noemotiovon <757486878@qq.com>
Signed-off-by: noemotiovon <757486878@qq.com>
Signed-off-by: noemotiovon <757486878@qq.com>
@hipudding hipudding merged commit 1e74897 into ggml-org:master Aug 27, 2025
49 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Aug 28, 2025
…nemotron-nano-15409

* origin/master: (59 commits)
scripts: add sqlite3 check for compare-commits.sh (ggml-org#15633)
kv-cache : remove LLAMA_SET_ROWS checks (ggml-org#15505)
gguf-py: byteswapping improvements (ggml-org#12851)
cli : change log to warning to explain reason for stopping (ggml-org#15604)
model-conversion : add mmproj conversion target (ggml-org#15628)
cuda: Add cublasLt_static linking when GGML_STATIC is enabled (ggml-org#15622)
server: higher timeout for tests (ggml-org#15621)
presets : add qwen3-30B-a3b FIM (ggml-org#15616)
HIP: Enable support for ggml_backend_cuda_register_host_buffer (ggml-org#15615)
kv-cache : better estimate of n_kv for multi-sequence batches (ggml-org#15610)
CANN: refactor mask handling and improve performance in FA (ggml-org#15561)
ggml-cpu : add basic RVV support for vector f32 ops (ggml-org#15057)
common : add -m to bash completion for --model [no ci] (ggml-org#15591)
OpenCL: add fused group_norm/norm, mul, add (ggml-org#15314)
tests : fix test-opt with GGML_BACKEND_DL (ggml-org#15599)
SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (ggml-org#15592)
mtmd : fix mtmd ios build (ggml-org#15579)
tests: add performance test for mul mat id (ggml-org#15543)
llamafile: PowerPC Sgemm Optimization (ggml-org#15558)
graph : fix assert in memory-less build_attn (ggml-org#15590)
...
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Aug 28, 2025
…upport

* origin/master: (61 commits)
scripts: add sqlite3 check for compare-commits.sh (ggml-org#15633)
kv-cache : remove LLAMA_SET_ROWS checks (ggml-org#15505)
gguf-py: byteswapping improvements (ggml-org#12851)
cli : change log to warning to explain reason for stopping (ggml-org#15604)
model-conversion : add mmproj conversion target (ggml-org#15628)
cuda: Add cublasLt_static linking when GGML_STATIC is enabled (ggml-org#15622)
server: higher timeout for tests (ggml-org#15621)
presets : add qwen3-30B-a3b FIM (ggml-org#15616)
HIP: Enable support for ggml_backend_cuda_register_host_buffer (ggml-org#15615)
kv-cache : better estimate of n_kv for multi-sequence batches (ggml-org#15610)
CANN: refactor mask handling and improve performance in FA (ggml-org#15561)
ggml-cpu : add basic RVV support for vector f32 ops (ggml-org#15057)
common : add -m to bash completion for --model [no ci] (ggml-org#15591)
OpenCL: add fused group_norm/norm, mul, add (ggml-org#15314)
tests : fix test-opt with GGML_BACKEND_DL (ggml-org#15599)
SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (ggml-org#15592)
mtmd : fix mtmd ios build (ggml-org#15579)
tests: add performance test for mul mat id (ggml-org#15543)
llamafile: PowerPC Sgemm Optimization (ggml-org#15558)
graph : fix assert in memory-less build_attn (ggml-org#15590)
...
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Aug 28, 2025
…g-model-disabled-agent-prefill

* origin/master: (76 commits)
scripts: add sqlite3 check for compare-commits.sh (ggml-org#15633)
kv-cache : remove LLAMA_SET_ROWS checks (ggml-org#15505)
gguf-py: byteswapping improvements (ggml-org#12851)
cli : change log to warning to explain reason for stopping (ggml-org#15604)
model-conversion : add mmproj conversion target (ggml-org#15628)
cuda: Add cublasLt_static linking when GGML_STATIC is enabled (ggml-org#15622)
server: higher timeout for tests (ggml-org#15621)
presets : add qwen3-30B-a3b FIM (ggml-org#15616)
HIP: Enable support for ggml_backend_cuda_register_host_buffer (ggml-org#15615)
kv-cache : better estimate of n_kv for multi-sequence batches (ggml-org#15610)
CANN: refactor mask handling and improve performance in FA (ggml-org#15561)
ggml-cpu : add basic RVV support for vector f32 ops (ggml-org#15057)
common : add -m to bash completion for --model [no ci] (ggml-org#15591)
OpenCL: add fused group_norm/norm, mul, add (ggml-org#15314)
tests : fix test-opt with GGML_BACKEND_DL (ggml-org#15599)
SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (ggml-org#15592)
mtmd : fix mtmd ios build (ggml-org#15579)
tests: add performance test for mul mat id (ggml-org#15543)
llamafile: PowerPC Sgemm Optimization (ggml-org#15558)
graph : fix assert in memory-less build_attn (ggml-org#15590)
...
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Aug 28, 2025
…nemotron-nano-15409

* origin/master: (59 commits)
scripts: add sqlite3 check for compare-commits.sh (ggml-org#15633)
kv-cache : remove LLAMA_SET_ROWS checks (ggml-org#15505)
gguf-py: byteswapping improvements (ggml-org#12851)
cli : change log to warning to explain reason for stopping (ggml-org#15604)
model-conversion : add mmproj conversion target (ggml-org#15628)
cuda: Add cublasLt_static linking when GGML_STATIC is enabled (ggml-org#15622)
server: higher timeout for tests (ggml-org#15621)
presets : add qwen3-30B-a3b FIM (ggml-org#15616)
HIP: Enable support for ggml_backend_cuda_register_host_buffer (ggml-org#15615)
kv-cache : better estimate of n_kv for multi-sequence batches (ggml-org#15610)
CANN: refactor mask handling and improve performance in FA (ggml-org#15561)
ggml-cpu : add basic RVV support for vector f32 ops (ggml-org#15057)
common : add -m to bash completion for --model [no ci] (ggml-org#15591)
OpenCL: add fused group_norm/norm, mul, add (ggml-org#15314)
tests : fix test-opt with GGML_BACKEND_DL (ggml-org#15599)
SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (ggml-org#15592)
mtmd : fix mtmd ios build (ggml-org#15579)
tests: add performance test for mul mat id (ggml-org#15543)
llamafile: PowerPC Sgemm Optimization (ggml-org#15558)
graph : fix assert in memory-less build_attn (ggml-org#15590)
...
Minh141120 pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 29, 2025
…15561)

* CANN(flash-attn): refactor mask handling and improve performance

1. Refactored the mask computation in Flash Attention, unified the logic without separating prefill and decode.
2. Optimized performance in non-alibi scenarios by reducing one repeat operation.
3. Updated operator management to explicitly mark unsupported cases on 310P devices and when dim is not divisible by 16.

Signed-off-by: noemotiovon <757486878@qq.com>

* [CANN]: fix review

Signed-off-by: noemotiovon <757486878@qq.com>

* [CANN]: Optimization FA BNSD to BSND

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ascend NPU issues specific to Ascend NPUs ggml changes relating to the ggml tensor library for machine learning
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants