CUDA: fuse adds, fuse add with rms norm #15631
49cf98e to 4d10578
Could you also post results with
Isn't this missing an extra case for ggml_cuda_can_fuse to handle the fused add?
Sorry, I accidentally pressed the wrong button and submitted the review before I was done.
No because it doesn't use it for the fused adds, I use
Results: master vs. this PR
61f2b2a to 34ee6e6
34ee6e6 to b64ba1c
Performance changes
There seems to be another issue: this PR introduces a segfault in
I thought this test would be run by the CI. I was wrong.
You have to add
I think it only works in the comment, not the title. :)
Ok thanks! Trying it in #15660. I think it might be good to auto-trigger test-backend-ops, at least for the affected backends in a PR.
I saw, but it only works in comment/description, so it wasn't triggered. Edit: And yes, it would be useful to somehow auto-trigger ggml-ci for the backend in question, but I don't think that's easily doable.
* CUDA: fused add with rms_norm_mul
* Non-broadcast fuse works
* Add fused adds
* format
* Remove n_fuse from template params
* Address review comments
* Move template inside binbcast
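To make the "fused adds" item concrete, here is a minimal CPU sketch in C++ of what fusing a chain of adds means. It is not the PR's CUDA kernel: the function names `add_chain_unfused` and `add_chain_fused` are hypothetical, and the sketch assumes all operands have the same length (no broadcasting). The unfused version makes one pass over the destination per add, while the fused version reads each operand once and writes the destination once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference: one full pass over dst for every add in the chain.
std::vector<float> add_chain_unfused(const std::vector<float>& x,
                                     const std::vector<std::vector<float>>& srcs) {
    std::vector<float> dst = x;
    for (const auto& s : srcs) {
        assert(s.size() == dst.size());  // assumption: no broadcasting
        for (std::size_t i = 0; i < dst.size(); ++i) {
            dst[i] += s[i];              // dst is re-read and re-written per add
        }
    }
    return dst;
}

// Hypothetical fused form: a single pass, accumulating each element in a local.
std::vector<float> add_chain_fused(const std::vector<float>& x,
                                   const std::vector<std::vector<float>>& srcs) {
    for (const auto& s : srcs) {
        assert(s.size() == x.size());    // assumption: no broadcasting
    }
    std::vector<float> dst(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        float acc = x[i];
        for (const auto& s : srcs) {
            acc += s[i];                 // same addition order as the unfused chain
        }
        dst[i] = acc;                    // one write per element
    }
    return dst;
}
```

Both functions add the operands in the same order per element, so they should agree; the difference is only in how many times the intermediate result travels through memory, which is the cost a fused GPU kernel would aim to avoid.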
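This PR does two things, as the title and commit list indicate: it fuses consecutive adds, and it fuses an add with rms_norm_mul.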
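As an illustration of fusing an add with rms_norm_mul, the following C++ sketch compares a three-pass CPU reference (RMS norm, then multiply, then add) with a single-pass version. It is a sketch under stated assumptions, not the PR's CUDA code: the names `rms_scale`, `rms_norm_mul_add_unfused`, and `rms_norm_mul_add_fused` are hypothetical, the input is treated as a single row, and the weight and bias vectors are assumed to have the same length as the row.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: scale = 1 / sqrt(mean(x^2) + eps) for one row.
float rms_scale(const std::vector<float>& x, float eps) {
    assert(!x.empty());
    float sum_sq = 0.0f;
    for (float v : x) sum_sq += v * v;
    return 1.0f / std::sqrt(sum_sq / static_cast<float>(x.size()) + eps);
}

// Unfused reference: three passes with two intermediate buffers.
std::vector<float> rms_norm_mul_add_unfused(const std::vector<float>& x,
                                            const std::vector<float>& w,
                                            const std::vector<float>& b,
                                            float eps) {
    assert(w.size() == x.size() && b.size() == x.size());
    const float scale = rms_scale(x, eps);
    std::vector<float> normed(x.size()), scaled(x.size()), out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) normed[i] = x[i] * scale;      // rms_norm
    for (std::size_t i = 0; i < x.size(); ++i) scaled[i] = normed[i] * w[i];  // mul
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = scaled[i] + b[i];     // add
    return out;
}

// Fused sketch: after the reduction, one pass does norm, mul, and add together.
std::vector<float> rms_norm_mul_add_fused(const std::vector<float>& x,
                                          const std::vector<float>& w,
                                          const std::vector<float>& b,
                                          float eps) {
    assert(w.size() == x.size() && b.size() == x.size());
    const float scale = rms_scale(x, eps);
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = x[i] * scale * w[i] + b[i];  // no intermediate buffers
    }
    return out;
}
```

The two versions may differ in the last bits if the compiler contracts the multiply and add into a fused multiply-add, so a comparison between them should use a small tolerance rather than exact equality.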
Results on a 4090 (Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes), reported as two Master vs. PR comparisons.