
ggml-vulkan: adds support for op CONV_TRANSPOSE_1D #13813


Merged
merged 5 commits into from
Jun 4, 2025

Conversation

etasnadi
Contributor

  • ggml-vulkan: adds support for op CONV_TRANSPOSE_1D

  • test-backend-ops: adds additional tests for CONV_TRANSPOSE_1D

  • test-backend-ops: adds more sophisticated tests for CONV_TRANSPOSE_1D
@github-actions github-actions bot added the testing, Vulkan, and ggml labels May 26, 2025
Number of additional tests reduced to 108.
@etasnadi
Contributor Author

etasnadi commented May 27, 2025

The Ubuntu 22 runner stopped because test-backend-ops crashes with a segfault. I've observed this behavior on an unmodified build before as well, so I don't know whether it is related to my modifications or whether they merely make it more likely to happen.

Edit: when the app crashes with a segfault, it is always during exit, after the tests have executed.

@jeffbolznv
Collaborator

It's showing failures in some of the conv_transpose_1d tests, e.g.:

29:   CONV_TRANSPOSE_1D(ne_input=[2173,1,1,1],ne_kernel=[1337,1,1,1],s0=1,p0=0,d0=1): [CONV_TRANSPOSE_1D] NMSE = 0.459058786 > 0.000000100 FAIL
29:   CONV_TRANSPOSE_1D(ne_input=[2173,1,1,1],ne_kernel=[1337,1,1,1],s0=2,p0=0,d0=1): [CONV_TRANSPOSE_1D] NMSE = 0.222932697 > 0.000000100 FAIL
29:   CONV_TRANSPOSE_1D(ne_input=[2173,1,1,1],ne_kernel=[1337,1,1,1],s0=3,p0=0,d0=1): [CONV_TRANSPOSE_1D] NMSE = 0.167964696 > 0.000000100 FAIL

Then it appears to crash on the first test after the conv_transpose_1d tests, which does seem like it's related to this change somehow.
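For context on what these failure lines mean: test-backend-ops compares the backend's output against a CPU reference using a normalized mean squared error. A minimal Python sketch of that metric, assuming the usual NMSE definition (sum of squared differences normalized by the energy of the reference), is:

```python
def nmse(reference, backend, eps=1e-12):
    # Normalized mean squared error: sum of squared differences,
    # normalized by the energy of the reference output.
    # (Sketch of the usual definition, not taken from the ggml source.)
    num = sum((a - b) ** 2 for a, b in zip(reference, backend))
    den = sum(a * a for a in reference)
    return num / (den + eps)

# Matching outputs give an NMSE near zero, well under the 1e-7 threshold.
assert nmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) < 1e-7
```

Values like 0.459 are far above the 1e-7 threshold, i.e. the backend output is substantially wrong, not merely imprecise.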

@etasnadi
Contributor Author

It's showing failures in some of the conv_transpose_1d tests, e.g.:

29:   CONV_TRANSPOSE_1D(ne_input=[2173,1,1,1],ne_kernel=[1337,1,1,1],s0=1,p0=0,d0=1): [CONV_TRANSPOSE_1D] NMSE = 0.459058786 > 0.000000100 FAIL
29:   CONV_TRANSPOSE_1D(ne_input=[2173,1,1,1],ne_kernel=[1337,1,1,1],s0=2,p0=0,d0=1): [CONV_TRANSPOSE_1D] NMSE = 0.222932697 > 0.000000100 FAIL
29:   CONV_TRANSPOSE_1D(ne_input=[2173,1,1,1],ne_kernel=[1337,1,1,1],s0=3,p0=0,d0=1): [CONV_TRANSPOSE_1D] NMSE = 0.167964696 > 0.000000100 FAIL

Then it appears to crash on the first test after the conv_transpose_1d tests, which does seem like it's related to this change somehow.

That's interesting. I could finally reproduce these errors with llvmpipe, but not with the discrete GPU. I need time to investigate what the root cause is.

@jeffbolznv
Collaborator

Wild guess - any chance of uninitialized shared memory?

@etasnadi
Contributor Author

Wild guess - any chance of uninitialized shared memory?

It could be, but there is also a chance that the code is correct and llvmpipe itself is the root cause of the failing tests.

I executed parameter sweeps over Cin, K, and L, and it turns out that if one parameter (or a combination of them) is large enough, the test fails. Even Cin=20000, L=64, K=4 fails, which is suspicious because Cin does not influence the logic much; Cin=20000, L=64, K=3 passes.

Argmax also segfaults on llvmpipe on my computer for some reason.
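For readers unfamiliar with the op being swept here: a minimal single-channel Python reference for a 1-D transposed convolution, assuming the standard semantics for s0 (stride), p0 (padding), and d0 (dilation) rather than anything taken from the ggml source, might look like:

```python
def conv_transpose_1d(x, w, s0=1, p0=0, d0=1):
    # Single-channel reference sketch, assuming the usual
    # transposed-convolution output length:
    #   out_len = (L-1)*s0 + d0*(K-1) + 1 - 2*p0
    L, K = len(x), len(w)
    out = [0.0] * ((L - 1) * s0 + d0 * (K - 1) + 1 - 2 * p0)
    for i in range(L):
        for k in range(K):
            # Each input element scatters into the output through
            # every kernel tap.
            j = i * s0 + k * d0 - p0
            if 0 <= j < len(out):
                out[j] += x[i] * w[k]
    return out

assert conv_transpose_1d([1.0, 2.0], [1.0, 1.0]) == [1.0, 3.0, 2.0]
```

With multiple input channels, the per-output work grows with Cin*K, which is one plausible (but unconfirmed) reason the sweeps fail only once the parameters are large enough.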

etasnadi added 2 commits May 28, 2025 23:06
* Removes extra whitespaces.

* Adds int64->int32 casts to prevent possible warnings.
@etasnadi
Contributor Author

etasnadi commented May 28, 2025

Wild guess - any chance of uninitialized shared memory?

No, I was not aware that llvmpipe simply skips loop iterations after reaching a certain limit, so I reduced the maximum test problem size to 1337x13 (input x kernel) in commit 2813cf4 in order to pass the tests on llvmpipe. Llvmpipe starts to trim loops after ~30k iterations, so this number should be safe, and it is also more realistic than my previous edge cases. 5120x512 also worked for me with 128 threads, resulting in ~20k iterations per thread.
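The iteration counts above can be reproduced with simple arithmetic, assuming the total work (input elements times kernel elements) is split evenly across the workgroup's threads:

```python
def iters_per_thread(input_len, kernel_len, threads=128):
    # Rough per-thread iteration estimate, assuming an even split
    # of input_len * kernel_len total iterations across threads.
    return input_len * kernel_len // threads

# The 5120x512 case with 128 threads lands right at ~20k iterations,
assert iters_per_thread(5120, 512) == 20480
# while the reduced 1337x13 problem stays well under the ~30k
# llvmpipe trimming limit even for a single thread.
assert 1337 * 13 < 30000
```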

@etasnadi
Contributor Author

etasnadi commented Jun 3, 2025

@slaren @0cc4m @jeffbolznv any plans to launch the pipelines to see if this can be merged finally?

@@ -9964,6 +10029,8 @@ static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggm
case GGML_OP_COUNT_EQUAL:
case GGML_OP_IM2COL:
case GGML_OP_TIMESTEP_EMBEDDING:
case GGML_OP_CONV_TRANSPOSE_1D:
return op->src[0]->type == GGML_TYPE_F32 && op->src[1]->type == GGML_TYPE_F32;
Collaborator


I think the CI crash happens here - these other ops shouldn't fall through to this logic.
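One plausible failure mode (an illustration, not taken from the ggml source): grouped switch cases share a single return, so any op in the group with only one source tensor would hit a null src[1] dereference in the shared two-source F32 check. A Python sketch of the hazard, with illustrative names:

```python
class Tensor:
    def __init__(self, type_):
        self.type = type_

class Op:
    def __init__(self, srcs):
        self.src = srcs  # unary ops only populate src[0]

F32 = "f32"

def supports_op_shared_check(op):
    # Mirrors the shared fall-through logic: assumes both sources exist,
    # so a unary op (src[1] is None) would crash here.
    return op.src[0].type == F32 and op.src[1].type == F32

def supports_op_guarded(op):
    # Only inspect src[1] when it is actually present.
    if op.src[1] is None:
        return op.src[0].type == F32
    return op.src[0].type == F32 and op.src[1].type == F32

unary = Op([Tensor(F32), None])          # stand-in for a single-source op
binary = Op([Tensor(F32), Tensor(F32)])  # stand-in for CONV_TRANSPOSE_1D

assert supports_op_guarded(unary) and supports_op_guarded(binary)
```

Whether this is the exact mechanism of the CI crash is speculation; the point is that the other cases should not fall through into a check that dereferences src[1].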

Contributor Author


Yes, it's definitely an issue.

Contributor Author


Thanks for pointing this out, fixed in commit 4e04f45. The snippet did indeed cause the crash, and it now seems to be working locally:

  5635/5635 tests passed
  Backend Vulkan0: OK

@0cc4m
Collaborator

0cc4m commented Jun 4, 2025

Thank you!

@0cc4m 0cc4m merged commit 0d39844 into ggml-org:master Jun 4, 2025
46 checks passed