
Conversation

@Flakebi (Contributor) commented Sep 3, 2025

Shared memory is a memory region that is shared between all threads in a thread block/workgroup on GPUs. Dynamic shared memory is in that memory region, though the allocated size is specified late, when launching a kernel.

Shared memory in amdgpu and nvptx lives in address space 3. Dynamic shared memory is implemented by creating an external global variable in address space 3. The global is declared with size 0, as the actual size is only known at runtime. It is defined behavior in LLVM to access an external global outside the defined size.

As far as I know, there is no similar way to get the allocated size of dynamic shared memory on amdgpu and nvptx, so users have to pass this out-of-band or rely on target-specific ways.
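For illustration, the declaration this lowers to looks roughly like the following. The global name and the zero-sized `[0 x i8]` type follow the codegen in this PR; the `align` value here is a made-up example for a call instantiated with a 4-aligned `T`:

```llvm
; Sketch: an extern, size-0 global in the GPU shared-memory address space.
; Accessing an external global outside its declared size is defined behavior.
@dynamic_shared_memory = external addrspace(3) global [0 x i8], align 4
```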

Tracking issue: #135516

@rustbot (Collaborator) commented Sep 3, 2025

r? @petrochenkov

rustbot has assigned @petrochenkov.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot added labels Sep 3, 2025: A-compiletest, A-LLVM, A-testsuite, S-waiting-on-review, T-bootstrap, T-compiler, T-libs
@rustbot (Collaborator) commented Sep 3, 2025

Some changes occurred in src/tools/compiletest

cc @jieyouxu

Some changes occurred in compiler/rustc_codegen_ssa

cc @WaffleLapkin

Some changes occurred to the intrinsics. Make sure the CTFE / Miri interpreter
gets adapted for the changes, if necessary.

cc @rust-lang/miri, @RalfJung, @oli-obk, @lcnr


@Flakebi force-pushed the dynamic-shared-memory branch from 0aa0e58 to 3ebaccb on September 3, 2025 22:43

@Flakebi force-pushed the dynamic-shared-memory branch from 3ebaccb to 2378959 on September 3, 2025 22:50
#[rustc_nounwind]
#[unstable(feature = "dynamic_shared_memory", issue = "135513")]
#[cfg(any(target_arch = "amdgpu", target_arch = "nvptx64"))]
pub fn dynamic_shared_memory<T: ?Sized>() -> *mut T;
@RalfJung (Member), Sep 4, 2025

Note that outside the GPU world, "shared memory" typically refers to memory shared between processes. So I would suggest using a name that's less likely to be confused, like something that explicitly involves "GPU" or so.

This sounds like a form of "global" memory (similar to a static item), but then apparently OpenCL calls it "local" which is very confusing...

@Flakebi (Contributor, author), Sep 4, 2025

Does it make sense to add a mod gpu?
I think there are more intrinsics for GPUs that can be added (although more in the traditional intrinsic sense, relating to an instruction; edit: re-exposing intrinsics from core::arch::nvptx and the amdgpu equivalent).

Contributor Author

Or should it be in core::arch::gpu?
(From #135516 (comment), cc @workingjubilee)

@RalfJung (Member), Sep 4, 2025

Rust intrinsic names are not namespaced. They are exposed in a module, but inside the compiler they are identified entirely by their name. So moving them into a different module doesn't alleviate the need for a clear name that will be understandable to non-GPU people working in the compiler (which is the vast majority of compiler devs).

If there's more GPU intrinsics to come, moving them into a gpu.rs file here still might make sense.

I don't have a strong opinion on how the eventually stable public API is organized, I am commenting entirely as someone who has an interest in keeping the set of intrinsics the Rust compiler offers understandable and well-defined (the ones in this folder, not the ones in core::arch which you call "more traditional" but that's very dependent on your background ;). These intrinsics are just an implementation detail, but every intrinsic we add here is a new language primitive -- it's like adding a new keyword, just without the syntax discussions and perma-unstable. In the past we used to have intrinsics that entirely break the internal consistency of the language, and we used to have intrinsics whose safety requirements were very poorly documented.

Comment on lines +3244 to +3245
/// All pointers returned by `dynamic_shared_memory` point to the same address,
/// so alias the same memory.
Member

Does this mean that all dynamic_shared_memory::<T> for the same T return the same pointer, or do even dynamic_shared_memory::<T> and dynamic_shared_memory::<U> point to the same memory?

Contributor Author

All of them alias, independent of the type.
It’s probably worth explaining the concept of shared memory in the comment?

Maybe this makes it clearer:

Returns a pointer to dynamic shared memory.

Shared memory is a memory region that is shared between all threads in
the same block/work-group. It is usually faster than global memory, which is
shared between all threads on a GPU.
Dynamic shared memory is in the shared memory region, though the allocated
size is specified late, when launching a gpu-kernel.

The pointer returned by dynamic_shared_memory() is the start of the dynamic
shared memory region. All calls to dynamic_shared_memory in a block/work-group,
independent of the generic type, return the same address, so alias the same memory.
The returned pointer is aligned by at least the alignment of T.

Other APIs

CUDA and HIP call this shared memory, shared between threads in a block.
OpenCL and SYCL call this local memory, shared between threads in a work-group.
GLSL calls this shared memory, shared between invocations in a work group.
DirectX calls this groupshared memory, shared between threads in a thread-group.

@RalfJung (Member), Sep 4, 2025

It’s probably worth explaining the concept of shared memory in the comment?

It's probably worth using a more specific term ("GPU shared memory" or so) since many people reading this will think "shared memory" refers to its more standard meaning of memory shared across different processes (often set up via mmap). It's unfortunate that GPU vendors chose to overload this term, but when working in a more general-purpose codebase you can't just assume everyone to know the conventions of the GPU community, and you can't give general-purpose terms a GPU-specific meaning without risking confusion.

Member

Other APIs

CUDA and HIP call this shared memory, shared between threads in a block.
OpenCL and SYCL call this local memory, shared between threads in a work-group.
GLSL calls this shared memory, shared between invocations in a work group.
DirectX calls this groupshared memory, shared between threads in a thread-group.

This sort of "translation guide" is not actually useful if you are not familiar with any of these things, so I would just leave it out as it is a distraction from the actual description. Especially since it's very easy to go look up a definition of, say, OpenCL's local memory, see it referred to as "this is GLSL's shared memory", look up a definition of that and see it referred to as basically the same idea as groupshared memory in DirectX, then look up a definition of that and... you get the idea. Our definition should stand on its own.

Member

The translation guide might be useful to people that are familiar with these things and wondering why we are making up our own terms.

if alignment > llvm::LLVMGetAlignment(global) {
llvm::LLVMSetAlignment(global, alignment);
}
}
Member

What does this do when there are multiple translation units? It seems like each would declare its own version of the global, potentially with a different alignment?

Contributor Author

All of them return the same pointer.
I assume the linker ensures the maximum alignment is taken and the runtime will ensure that it’s honored.
(Though note that multiple translation units are currently not really supported in the backends, they both rely on IR linking.)

@RalfJung (Member), Sep 4, 2025

All of them return the same pointer.

By which magic does that happen? That's not usually how LLVM globals work, AFAIK.

I assume the linker ensures the maximum alignment is taken and the runtime will ensure that it’s honored.

That would be rather surprising to me; usually a global can only be declared in one translation unit and then must be imported in all the others. Your current implementation might end up with a separate symbol for each TU, I am not sure.

EDIT: Ah I didn't realize these are extern. The name LLVMRustGetOrInsertGlobal does not make that clear. I understand the name is pre-existing.

(Though note that multiple translation units are currently not really supported in the backends, they both rely on IR linking.)

How does that work in rustc where each crate is a separate TU?

Contributor Author

Yes, the key is that they are extern. I agree it would not make sense for non-extern globals.

How does that work in rustc where each crate is a separate TU?

We do (fat/full) LTO. All crates compile to bitcode, then that bitcode is linked and compiled as one LLVM IR module.
The Rust amdgpu backend uses the existing lto/linker-plugin-lto infrastructure; the Rust nvptx backend declares IR/bitcode as its output format and uses the bitcode-linker to link them together.

@RalfJung (Member), Sep 4, 2025

Usually an alignment attribute on an extern global encodes an assumption -- it's a promise to LLVM that someone else has aligned the global sufficiently.

It looks like this here relies on some sort of magic handshake where whatever component actually creates the global will honor whatever alignment you can find in the bitcode?

That's pretty fragile -- usually in LLVM it's always correct to reduce the alignment attribute on such declarations, it just means we know less about what the real alignment is. (This is in contrast to alignment annotations on definitions which defines the real alignment, so it can be safely increased but not reduced.) Seems like some system here is (ab)using LLVM attributes in ways that break some of their core properties, or am I misunderstanding something?

Is this magic handshake documented anywhere?

@@ -532,6 +534,22 @@ impl<'ll, 'tcx> IntrinsicCallBuilderMethods<'tcx> for Builder<'_, 'll, 'tcx> {
return Ok(());
}

sym::dynamic_shared_memory => {
let global = self.declare_global_in_addrspace(
"dynamic_shared_memory",
Member

Is there a special magic meaning to this hard-coded name for the global?

When linking together Rust code and code in other languages, is this the name everyone must use consistently to ensure things behave correctly?

Contributor Author

The name has no meaning.
I didn’t try, but I’m rather confident to claim that all external addrspace(3) globals alias. There is no way to get two different dynamic shared memory pointers.
Using the name here and updating the global’s alignment is to generate readable and “clean” LLVM IR.

Member

I’m rather confident to claim that all external addrspace(3) globals alias

That seems very unusual, so if the intrinsic impl relies on this there should be a comment referring to the place where this is documented.

@@ -1716,6 +1716,8 @@ pub struct AddressSpace(pub u32);
impl AddressSpace {
/// LLVM's `0` address space.
pub const ZERO: Self = AddressSpace(0);
/// The address space for shared memory on nvptx and amdgpu.
pub const SHARED: Self = AddressSpace(3);
Member

This should have a name that makes it more clear that it is GPU-specific.

/// The returned pointer is the start of the dynamic shared memory region.
/// All pointers returned by `dynamic_shared_memory` point to the same address,
/// so alias the same memory.
/// The returned pointer is aligned by at least the alignment of `T`.
@RalfJung (Member), Sep 4, 2025

Speaking of safety requirements... how does one use this pointer? I get that it is aligned, but does it point to enough memory to store a T? If it's always the same address, doesn't everyone overwrite each other's data all the time? This API looks very odd for a non-GPU person, and it's not clear to me whether that is resolved by having more magic behavior (which should be documented or at least referenced here), or whether there's higher-level APIs built on top that deal with this (but this intrinsic provides so few guarantees, I can't see how that should be possible).

Typically, intrinsic documentations should be detailed enough that I can read and write code using the intrinsic and know exactly whether the code is correct and what it will do in all circumstances. I don't know if there's any hope of achieving that with GPU intrinsics, but if not then we need to have a bit of a wider discussion -- we have had bad experience with just importing "externally defined" semantics into Rust without considering all the interactions (in general, it is not logically coherent to have semantics externally defined).

The current docs would let me implement this intrinsic by just always returning 1024, and emitting a compile error if T has alignment bigger than 1024. I doubt that's a legal implementation. But that means the docs are not precise enough to describe what the implementation must do.

#[rustc_nounwind]
#[unstable(feature = "dynamic_shared_memory", issue = "135513")]
#[cfg(any(target_arch = "amdgpu", target_arch = "nvptx64"))]
pub fn dynamic_shared_memory<T: ?Sized>() -> *mut T;
Member

The ?Sized here is almost certainly wrong. The implementation uses align_of which only makes sense for sized types. Unsized types don't have an alignment by themselves, you need to know the wide ptr metadata to determine the alignment.
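For what it's worth, the point that unsized types have no alignment of their own can be seen on the CPU with `align_of_val`: the alignment of a `dyn` place comes from the wide-pointer metadata, not from the type alone (a small host-only sketch):

```rust
use std::any::Any;
use std::mem::align_of_val;

fn main() {
    // The same unsized type, `dyn Any`, over two different concrete values:
    let a: &dyn Any = &1u8;
    let b: &dyn Any = &1u32;
    // The alignment is read from each wide pointer's vtable metadata,
    // so `dyn Any` by itself does not determine it.
    assert_eq!(align_of_val(a), 1);
    assert_eq!(align_of_val(b), 4);
}
```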

/// The returned pointer is the start of the dynamic shared memory region.
/// All pointers returned by `dynamic_shared_memory` point to the same address,
/// so alias the same memory.
/// The returned pointer is aligned by at least the alignment of `T`.
Member

Is there some prior discussion of the design decision to determine the alignment by giving a type parameter? I could also be a const generic parameter, for instance. I don't have an opinion on the matter since I am an outsider to the GPU world, but as a compiler team member it'd be good to know if this is something you thought about for 5 minutes or whether there's some sort of larger design by a team that has a vision of how all these things will fit together.

Contributor Author

There is some discussion in #135516. I don’t mind either way, I thought (for 5 minutes ;)) that specifying the type of the returned pointer makes sense.
I’m not much of a GPU programmer, but I think in most cases, you would store an array in dynamic shared memory, or maybe a struct followed by a dynamically sized array (or maybe two/n arrays of different types).

For just a struct, static shared memory would make more sense, though we don’t support that yet (there’s some discussion in the tracking issue, but I think that’s more complicated to design and implement).

@RalfJung (Member) commented Sep 4, 2025

Sorry for drowning you in questions here, but extending the core language with new operations (as in, adding a new intrinsic doing things that couldn't be done before) is a big deal, and we had a bad experience in the past when this was done without wider discussion in the team to ensure that the intrinsics actually make sense in the context of Rust. Not everything that exists in the hardware can be 1:1 exposed in Rust, sometimes this requires a lot of work and sometimes it's just basically impossible. It can be a lot of work to clean these things up later, and as someone who did a bunch of that work, I'd rather not have to do it again. :)

@Flakebi (Contributor, author) commented Sep 4, 2025

I agree that it makes a lot of sense to have the discussion now. Thanks for taking a look and helping to design something useful!

Speaking of safety requirements... how does one use this pointer?

Heh, yes, that’s something that should be mentioned in the doc comment as well. (Especially comments on how to safely use it.)

I get that it is aligned, but does it point to enough memory to store a T?

Depends on the size specified on the CPU side when launching the gpu-kernel. It may or it may not.

If it's always the same address, doesn't everyone overwrite each other's data all the time? This API looks very odd for a non-GPU person, and it's not clear to me whether that is resolved by having more magic behavior (which should be documented or at least referenced here), or whether there's higher-level APIs built on top that deal with this (but this intrinsic provides so few guarantees, I can't see how that should be possible).

There are “higher-level APIs” like “do a fast matrix-matrix multiplication”, but not much in-between. I’d assume that people usually use this in its raw form.
On GPUs, accessing memory is orders of magnitude slower than it is on CPUs. But, GPUs

  1. have a lot more registers (e.g. up to 256 32-bit registers on amdgpu)
  2. and shared memory, which is essentially a software-defined cache.

Two general use cases are: 1) All threads in a group load a part from global memory (the RAM/VRAM) and store it in shared memory. Then all threads read from the collaboratively loaded data. 2) All threads in a group do some work and collaborate on shared memory (with atomics or so) to aggregate results. Then one of the threads stores the final result to global memory.
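As a CPU-side analogy for use case 2 (hypothetical: std threads and a static atomic stand in for the work-group and its shared memory; a real GPU kernel would additionally need work-group barriers):

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

// Stand-in for a shared-memory accumulator that all threads in a
// "group" collaborate on.
static SHARED_ACC: AtomicU32 = AtomicU32::new(0);

fn main() {
    // All threads do some work and aggregate their results atomically.
    let handles: Vec<_> = (1u32..=4)
        .map(|i| thread::spawn(move || SHARED_ACC.fetch_add(i * i, Ordering::Relaxed)))
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // One thread then stores the final result "to global memory".
    assert_eq!(SHARED_ACC.load(Ordering::Relaxed), 30); // 1 + 4 + 9 + 16
}
```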

So, shared memory is meant to be accessed collaboratively and the developer must ensure proper synchronization. It is hard to provide a safe abstraction for this and tbh, I don’t want to try 😅 (though I can see 3rd party crates doing this – at least to some extent).

From Rust’s perspective, guarantees should be the same as with memory that’s shared between processes.

Typically, intrinsic documentations should be detailed enough that I can read and write code using the intrinsic and know exactly whether the code is correct and what it will do in all circumstances. I don't know if there's any hope of achieving that with GPU intrinsics, but if not then we need to have a bit of a wider discussion -- we have had bad experience with just importing "externally defined" semantics into Rust without considering all the interactions (in general, it is not logically coherent to have semantics externally defined).

I agree, it would be nice to have good documentation for the intrinsics in Rust!

@RalfJung (Member) commented Sep 4, 2025

Depends on the size specified on the CPU side when launching the gpu-kernel. It may or it may not.

Wait, there's a single static size set when launching the kernel? Why is it called "dynamic" memory? "dynamic" memory usually means malloc/free, i.e. you can get any amount of fresh memory during runtime (until RAM is full obviously).

Are you saying dynamic shared memory is neither dynamic in the normal sense nor shared in the normal sense? ;)

@petrochenkov (Contributor)

r? @RalfJung

@rustbot assigned RalfJung and unassigned petrochenkov Sep 4, 2025
@RalfJung (Member) commented Sep 4, 2025

I won't be able to do the final approval here, I can just help with ensuring that the intrinsics are documented well enough that they can be understood without GPU expertise, and that the LLVM codegen looks vaguely reasonable.

I don't know if we have anyone who actually knows how the generated LLVM IR should look and can ensure it makes sense. r? @nikic maybe?

@rustbot assigned nikic and unassigned RalfJung Sep 4, 2025
//@ [x86] needs-llvm-components: x86
//@ [x86] should-fail
#![feature(core_intrinsics, dynamic_shared_memory)]
#![no_std]
Member

Would using mini-core instead of the precompiled core in the sysroot work instead of hard-coding -Ctarget-cpu=gfx900 in the build system? That way you can also omit the only-amdgpu and only-nvptx64 directives and instead make it run whenever LLVM supports it.

@Teapot4195 (Contributor), Sep 4, 2025

I don't think hardcoding gfx900 as the target will work at all, different uarches can have wildly different codegen'd binaries and are usually not backwards compatible. I think this is a case where we want the user to specify the target architecture that they are compiling for.

This would probably work for smoke testing but in the long term it's probably going to cause more pains than it's worth.

Member

Given that this is just about local testing that the compiler works at all, none of the binary compatibility concerns really matter to us. This code is not executed.

But I agree that we should probably just use the minicore for this instead of adding code to the build system that we might have to rip out later. It's easy to update tests... unless they are built on a fragile assumption in the build system.

Contributor Author

I could probably duplicate the intrinsic in minicore, but wouldn’t that defeat the purpose of this test?
The goal here is to test the intrinsic that is added to core.

I think this is a case where we want the user to specify the target architecture that they are compiling for.

Sure, that’s what the target enforces :) Unfortunately it’s the exact thing that throws a wrench into writing tests for the standard library.

Member

Duplicating the intrinsic declaration is fine; what you are testing is the implementation and that is not being duplicated.

//@ [nvptx] only-nvptx64
//@ [nvptx] needs-llvm-components: nvptx
//@ [x86] compile-flags: --target x86_64-unknown-linux-gnu
//@ [x86] only-x86_64
Member

This is not enough when running on x86_64 Windows. You'd need only-x86_64-unknown-linux-gnu, I think.

@workingjubilee (Member) left a comment

I agree with Ralf. More specifically, I suggest referring to this idea internally as "GPU workgroup memory" (or "GPU threadgroup memory"). The notion of a "block" is a term that is highly loaded in the compiler so, while it is somewhat fortunate that CUDA and HIP prefer the same terms, we actually want to pick the words that don't overlap.

I'm also not entirely sure of this entire approach, but we will likely encounter this definition-level issue regardless of approach, and some of the code will be similar anyway, yeah?


@@ -1716,6 +1716,8 @@ pub struct AddressSpace(pub u32);
impl AddressSpace {
/// LLVM's `0` address space.
pub const ZERO: Self = AddressSpace(0);
/// The address space for shared memory on nvptx and amdgpu.
Member

"shared" between whomst?

Contributor Author

The concept is called “shared memory”, so I meant it as a reference to that concept, not as an explanation ;)
I’ll make this more descriptive (though as it stands, we’ll first need to decide on nomenclature to use throughout the Rust codebase).

@@ -3238,6 +3238,23 @@ pub(crate) const fn miri_promise_symbolic_alignment(ptr: *const (), align: usize
)
}

/// Returns a pointer to dynamic shared memory.
Member

I know this is a short description, but it is just repeating the function signature:

pub fn dynamic_shared_memory<T: ?Sized>() -> *mut T;
/*     ^^^^^^^^^^^^^^^^^^^^               ^^ ^^^^^^
       |                                  |  |
       |                                  |  |
       -- "dynamic shared memory"         |  - "a pointer"
                                          - "returns"
*/

I'm basically repeating the same comment I already made about "shared with what?" except with adding that "dynamic" also is pretty vague since in computer programming it's always relative to some notion of "static". Yet the "static" is not obviously present in this description.



Comment on lines +537 to +542
sym::dynamic_shared_memory => {
let global = self.declare_global_in_addrspace(
"dynamic_shared_memory",
self.type_array(self.type_i8(), 0),
AddressSpace::SHARED,
);
Member

Hm. I'm not sure this is the correct design for this, because it makes the calls to an intrinsic alter global program state during compilation, which feels very dicey and assumes that the calls to that intrinsic will experience codegen. I'm not sure how substantial this concern is, however, so I'm willing to be persuaded.

@RalfJung (Member), Sep 5, 2025

Ah good point... if you call this function in dead code, it may or may not have any effect.

How do other languages handle controlling the alignment of this magic global?

Contributor Author

To be fair, even if it is called, it does not have any effect. It’s just a getter for a pointer that exists anyway. It does not change anything if it’s dead-code-eliminated.
So, to make it clear, calling the intrinsic does not alter any program state.

Member

We're talking about the funky behavior where DCE'ing the intrinsic removes its compile-time effect on the alignment.

@RalfJung
Member

RalfJung commented Sep 5, 2025

I'm honestly still lost about the overall picture here -- it looks like the program is in charge of controlling the alignment, but some other component (the CPU code that launches the kernel, presumably) is in charge of controlling the size of this memory? How on Earth does it make sense to split the responsibility of picking size and alignment across such a big abstraction barrier?

@Flakebi
Contributor Author

Flakebi commented Sep 5, 2025

I’ll need to look more into the alignment and how it works. I’m not too confident that what I claim here is actually correct; I just assumed that the alignment on the extern global is used for something (I can’t find anything documented, I’ll try to look into the amdgpu source code).
Regarding aliasing, Nvidia is clear about this (source)

the size of the array is determined at launch time (see Execution Configuration). All variables declared in this fashion, start at the same address in memory, so that the layout of the variables in the array must be explicitly managed through offsets.
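The CUDA programming guide illustrates this aliasing with overlapping `extern __shared__` arrays whose layout is managed through explicit offsets; a sketch along those lines (variable names made up, alignment of each carved-out region must be respected):

```cuda
// All `extern __shared__` declarations begin at the same base address,
// so distinct "arrays" must be carved out manually via offsets.
extern __shared__ float shared[];

__global__ void kernel() {
    short* array0 = (short*)shared;        // first 128 shorts
    float* array1 = (float*)&array0[128];  // starts after 128 shorts
    int*   array2 = (int*)&array1[64];     // starts after 64 floats
    // ...
}
```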

Wait, there's a single static size set when launching the kernel? Why is it called "dynamic" memory? "dynamic" memory usually means malloc/free, i.e. you can get any amount of fresh memory during runtime (until RAM is full obviously).

Are you saying dynamic shared memory is neither dynamic in the normal sense nor shared in the normal sense? ;)

Yes, it is more dynamic than the traditional, static shared memory though! In graphics shaders (which existed long before cuda/hip/…), shared memory is used by declaring an annotated global variable (the same also works in cuda/hip/…). Therefore the amount of shared memory available in a kernel is fixed at kernel compile-time.
For GPGPU APIs, this was not flexible enough. These APIs allow specifying the size of a workgroup at launch-time and often the size of shared memory needs to grow accordingly. Therefore the introduction of “dynamic” shared memory.
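In CUDA terms, the contrast looks roughly like this (a sketch; kernel and variable names are made up):

```cuda
// Static shared memory: size fixed when the kernel is compiled.
__global__ void static_kernel() {
    __shared__ float tile[256];
    // ...
}

// Dynamic shared memory: size chosen when the kernel is launched.
__global__ void dynamic_kernel() {
    extern __shared__ char dyn_smem[];  // points at `smem_bytes` bytes
    // ...
}

// Host side: the third execution-configuration parameter sets the
// dynamic shared memory size per block.
// dynamic_kernel<<<grid, block, smem_bytes>>>();
```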

Just to mention it, an alternative to the intrinsic would be to follow cuda/hip and get the pointer by declaring an #[groupshared] extern "C" dynamic_shared_mem: u8 global or so. All such declarations would alias though, so I’m not sure this is any clearer 😅

@RalfJung
Member

RalfJung commented Sep 5, 2025

Right, thanks for the explanation. Both "dynamic" and "shared" need to be carefully qualified then to avoid confusing it with the usual meaning these terms have outside the GPU realm (in particular since the other meanings do show up quite a bit in rustc, at least for "dynamic").

Since you mentioned it, I like "group-shared" as a term. That makes it clear that it's a specific kind of shared and it's hopefully something one can google.

"dynamic" is awkward... its basically "fixed at launch time" (in contrast to "fixed at build time"). Not sure what a good alternative term would be here. From a Rust AM perspective, this is much closer to static than to dynamic. If the intrinsic name also includes gpu and the docs are clear about this, maybe that's enough.

@kulst

kulst commented Sep 5, 2025

"dynamic" is awkward... its basically "fixed at launch time" (in contrast to "fixed at build time"). Not sure what a good alternative term would be here. From a Rust AM perspective, this is much closer to static than to dynamic. If the intrinsic name also includes gpu and the docs are clear about this, maybe that's enough.

To pick up on your nomenclature:
How about calling the intrinsic launched_groupshared_memory, and a later static alternative fixed_groupshared_memory or simply groupshared_memory?

@Flakebi
Contributor Author

Flakebi commented Sep 5, 2025

FWIW, since static groupshared memory has been mentioned a couple of times: I think static groupshared variables need to be declared as proper global variables in Rust. Getting them through an intrinsic will not work out well (think about wanting two globals with the same type, the intrinsic getting duplicated through inlining or tail duplication, or wanting to link to a cuda/hip program using the same variable).

@RalfJung
Member

RalfJung commented Sep 5, 2025

Yeah, what you call "static memory" seems to be very similar to Rust's statics -- the "groupshared" part makes them more like thread-local statics, but still.

The "dynamic" ones OTOH seem to be more like a single global pointer that's just passed in by whomever launches the kernel. Well not quite, something has to decide which "group" or so sees which value... or maybe they all initially see the same value but when they change it that's only visible within the same group or so?

So, groupshared_kernel_param_memory might also be an option: the memory is a parameter to the kernel (a bit like argv).

But, if that's all too alien I can also live with "dynamic" as long as it's clearly documented. It's not the only term that has different meanings in different communities.
