Unit tests and benchmark for subgroup2 and workgroup2 stuff #192

Open

keptsecret wants to merge 69 commits into master

Conversation

keptsecret (Contributor)

No description provided.

static void subbench(NBL_CONST_REF_ARG(type_t) sourceVal)
{
using config_t = nbl::hlsl::subgroup2::Configuration<SUBGROUP_SIZE_LOG2>;
using params_t = nbl::hlsl::subgroup2::ArithmeticParams<config_t, typename Binop::base_t, N, nbl::hlsl::jit::device_capabilities>;

don't use JIT traits, you need to benchmark both variants (compile the pipeline twice), see #190 (comment)
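
For context, a minimal sketch of the "compile twice" approach, reusing the SMacroDefinition/extraDefines pattern that appears later in this PR; the DEVICE_CAPS macro and the variant names are hypothetical:

// hypothetical: one compilation per capability variant instead of JIT traits
for (const char* caps : { "caps_variant_a", "caps_variant_b" })
{
    const IShaderCompiler::SMacroDefinition defines[1] = { { "DEVICE_CAPS", caps } };
    options.preprocessorOptions.extraDefines = { defines, defines + 1 };
    // create one pipeline per variant, then benchmark both
    auto spirv = compiler->compileToSPIRV((const char*)source->getContent()->getPointer(), options);
}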

Comment on lines +13 to +48
// method emulations on the CPU, to verify the results of the GPU methods
template<class Binop>
struct emulatedReduction
{
using type_t = typename Binop::type_t;

static inline void impl(type_t* out, const type_t* in, const uint32_t itemCount)
{
const type_t red = std::reduce(in,in+itemCount,Binop::identity,Binop());
std::fill(out,out+itemCount,red);
}

static inline constexpr const char* name = "reduction";
};
template<class Binop>
struct emulatedScanInclusive
{
using type_t = typename Binop::type_t;

static inline void impl(type_t* out, const type_t* in, const uint32_t itemCount)
{
std::inclusive_scan(in,in+itemCount,out,Binop());
}
static inline constexpr const char* name = "inclusive_scan";
};
template<class Binop>
struct emulatedScanExclusive
{
using type_t = typename Binop::type_t;

static inline void impl(type_t* out, const type_t* in, const uint32_t itemCount)
{
std::exclusive_scan(in,in+itemCount,out,Binop::identity,Binop());
}
static inline constexpr const char* name = "exclusive_scan";
};

example 23 already does that; don't check results here, so you can benchmark faster

Comment on lines +111 to +113
ISwapchain::SCreationParams swapchainParams = { .surface = m_surface->getSurface() };
if (!swapchainParams.deduceFormat(m_physicalDevice))
return logFail("Could not choose a Surface Format for the Swapchain!");

to achieve #190 (comment) you just need to add EUF_STORAGE usage here
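
A hedged sketch of that (the sharedParams.imageUsage member is an assumption about the exact struct layout):

ISwapchain::SCreationParams swapchainParams = { .surface = m_surface->getSurface() };
// assumed member: request storage usage so the compute shader can write the
// swapchain images directly
swapchainParams.sharedParams.imageUsage = IGPUImage::E_USAGE_FLAGS::EUF_STORAGE_BIT;
if (!swapchainParams.deduceFormat(m_physicalDevice))
    return logFail("Could not choose a Surface Format for the Swapchain!");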

Comment on lines +135 to +152
// TODO: get the element count from argv
const uint32_t elementCount = Output<>::ScanElementCount;
// populate our random data buffer on the CPU and create a GPU copy
inputData = new uint32_t[elementCount];
{
std::mt19937 randGenerator(0xdeadbeefu);
for (uint32_t i = 0u; i < elementCount; i++)
inputData[i] = randGenerator(); // TODO: change to using xoroshiro, then we can skip having the input buffer at all

IGPUBuffer::SCreationParams inputDataBufferCreationParams = {};
inputDataBufferCreationParams.size = sizeof(Output<>::data[0]) * elementCount;
inputDataBufferCreationParams.usage = IGPUBuffer::EUF_STORAGE_BUFFER_BIT | IGPUBuffer::EUF_TRANSFER_DST_BIT | IGPUBuffer::EUF_SHADER_DEVICE_ADDRESS_BIT;
m_utils->createFilledDeviceLocalBufferOnDedMem(
SIntendedSubmitInfo{.queue=getTransferUpQueue()},
std::move(inputDataBufferCreationParams),
inputData
).move_into(gpuinputDataBuffer);
}

use PCG or another hash in the shader and get rid of the input buffer; it will reduce the data access overhead
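
For reference, the widely used single-round PCG hash (Jarzynski & Olano, "Hash Functions for GPU Rendering") compiles as both C++ and HLSL; each invocation would hash its globalIndex() instead of loading from pc.inputBufAddress:

// standard 32-bit PCG hash: one permuted-congruential-generator round
uint32_t pcg_hash(uint32_t v)
{
    const uint32_t state = v * 747796405u + 2891336453u;
    const uint32_t word = ((state >> ((state >> 28u) + 4u)) ^ state) * 277803737u;
    return (word >> 22u) ^ word;
}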

Comment on lines +170 to +182
// create buffer to store BDA of output buffers
{
std::array<uint64_t, OutputBufferCount> outputAddresses;
for (uint32_t i = 0; i < OutputBufferCount; i++)
outputAddresses[i] = outputBuffers[i]->getDeviceAddress();

IGPUBuffer::SCreationParams params;
params.usage = IGPUBuffer::EUF_STORAGE_BUFFER_BIT | IGPUBuffer::EUF_TRANSFER_DST_BIT | IGPUBuffer::EUF_INLINE_UPDATE_VIA_CMDBUF | IGPUBuffer::EUF_SHADER_DEVICE_ADDRESS_BIT;
params.size = OutputBufferCount * sizeof(uint64_t);
m_utils->createFilledDeviceLocalBufferOnDedMem(SIntendedSubmitInfo{ .queue = getTransferUpQueue() }, std::move(params), outputAddresses.data()).move_into(gpuOutputAddressesBuffer);
}
pc.inputBufAddress = gpuinputDataBuffer->getDeviceAddress();
pc.outputAddressBufAddress = gpuOutputAddressesBuffer->getDeviceAddress();

I'd prefer the Push Constants to have an array of output addresses; as it stands you're doing 1 extra BDA load just to find where you're storing to (huge latency, as it introduces a dependent store)
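
A minimal sketch of that layout, using the names already in this PR:

struct PushConstantData
{
    uint64_t inputBufAddress;
    uint64_t outputAddresses[OutputBufferCount]; // replaces outputAddressBufAddress
};

// host side: no gpuOutputAddressesBuffer needed anymore
for (uint32_t i = 0; i < OutputBufferCount; i++)
    pc.outputAddresses[i] = outputBuffers[i]->getDeviceAddress();

At 8 output buffers that is 72 bytes of push constants including the input address, well within the 128-byte minimum the Vulkan spec guarantees.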

Comment on lines +209 to +210
binding[1].count = OutputBufferCount;
dsLayout = m_device->createDescriptorSetLayout(binding);

huh, aren't you using BDA for output?

smart_refctd_ptr<IGPUDescriptorSetLayout> benchLayout;
{
IGPUDescriptorSetLayout::SBinding binding[1];
binding[0] = { {},2,IDescriptor::E_TYPE::ET_STORAGE_IMAGE,IGPUDescriptorSetLayout::SBinding::E_CREATE_FLAGS::ECF_UPDATE_AFTER_BIND_BIT,IShader::E_SHADER_STAGE::ESS_COMPUTE,1u,nullptr };

nitpick: if you make it an array large enough to hold all your swapchain images, then it only needs ECF_PARTIALLY_BOUND_BIT (because the swapchain may be created with fewer images than your max)
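
A sketch of that binding, mirroring the struct layout above; MaxSwapchainImages is an assumed constant for the upper bound:

binding[0] = { {},2,IDescriptor::E_TYPE::ET_STORAGE_IMAGE,
    IGPUDescriptorSetLayout::SBinding::E_CREATE_FLAGS::ECF_PARTIALLY_BOUND_BIT,
    IShader::E_SHADER_STAGE::ESS_COMPUTE,MaxSwapchainImages,nullptr };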


// const auto MaxWorkgroupSize = m_physicalDevice->getLimits().maxComputeWorkGroupInvocations;
const auto MinSubgroupSize = m_physicalDevice->getLimits().minSubgroupSize;
const auto MaxSubgroupSize = m_physicalDevice->getLimits().maxSubgroupSize;

unused var?

Comment on lines +267 to +272
benchSets[i] = createBenchmarkPipelines<ArithmeticOp, DoWorkgroupBenchmarks>(workgroupBenchSource, benchPplnLayout.get(), elementCount, hlsl::findMSB(MinSubgroupSize), workgroupSizes[i], ItemsPerInvocation, NumLoops);
}
else
{
for (uint32_t i = 0; i < workgroupSizes.size(); i++)
benchSets[i] = createBenchmarkPipelines<ArithmeticOp, DoWorkgroupBenchmarks>(subgroupBenchSource, benchPplnLayout.get(), elementCount, hlsl::findMSB(MinSubgroupSize), workgroupSizes[i], ItemsPerInvocation, NumLoops);

benchmark with MAX subgroup size, not MIN
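
Concretely, that is just swapping in the limit already queried above:

// use the device's maximum subgroup size for the benchmark pipelines
benchSets[i] = createBenchmarkPipelines<ArithmeticOp, DoWorkgroupBenchmarks>(
    subgroupBenchSource, benchPplnLayout.get(), elementCount,
    hlsl::findMSB(MaxSubgroupSize), workgroupSizes[i], ItemsPerInvocation, NumLoops);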

Comment on lines +718 to +719
template<class BinOp>
using ArithmeticOp = emulatedReduction<BinOp>; // change this to test other arithmetic ops

bench all 3 within a single application run (separate pipelines, of course)
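
A sketch of one way to structure that: make the op a template-template parameter instead of a single ArithmeticOp alias; runBenchmarksFor is a hypothetical helper wrapping the existing per-op pipeline setup.

template<template<class> class Op>
void runBenchmarksFor()
{
    // builds the pipelines for Op<bit_and>, Op<plus>, ... and times them,
    // wrapping the existing per-op setup in this PR
}

void runAllBenchmarks()
{
    runBenchmarksFor<emulatedReduction>();
    runBenchmarksFor<emulatedScanInclusive>();
    runBenchmarksFor<emulatedScanExclusive>();
}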

Comment on lines +725 to +734
uint32_t* inputData = nullptr;
smart_refctd_ptr<IGPUBuffer> gpuinputDataBuffer;
constexpr static inline uint32_t OutputBufferCount = 8u;
smart_refctd_ptr<IGPUBuffer> outputBuffers[OutputBufferCount];
smart_refctd_ptr<IGPUBuffer> gpuOutputAddressesBuffer;
PushConstantData pc;

smart_refctd_ptr<ISemaphore> sema;
uint64_t timelineValue = 0;
smart_refctd_ptr<ICPUBuffer> resultsBuffer;

you don't compare results, so get rid of inputData and resultsBuffer

Comment on lines +535 to +544
void logTestOutcome(bool passed, uint32_t workgroupSize)
{
if (passed)
m_logger->log("Passed test #%u", ILogger::ELL_INFO, workgroupSize);
else
{
totalFailCount++;
m_logger->log("Failed test #%u", ILogger::ELL_ERROR, workgroupSize);
}
}

remove

options.preprocessorOptions.includeFinder = includeFinder;

const uint32_t subgroupSize = 0x1u << subgroupSizeLog2;
const uint32_t itemsPerWG = workgroupSize <= subgroupSize ? workgroupSize * itemsPerInvoc : itemsPerInvoc * max(workgroupSize >> subgroupSizeLog2, subgroupSize) << subgroupSizeLog2; // TODO use Config somehow

make an NBL_CONSTEXPR, non-templated variant of the config to use in C++. I was about to ask for that, because I realized that from the C++ side the utility is cumbersome to use
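
A sketch of what that could look like (struct and method names are assumptions); the itemsPerWG ternary above then collapses into one call:

struct SArithmeticConfig
{
    uint32_t workgroupSize;
    uint32_t subgroupSizeLog2;
    uint32_t itemsPerInvocation;

    // NBL_CONSTEXPR-labelled in Nabla terms; same math as the ternary above
    constexpr uint32_t getItemsPerWorkgroup() const
    {
        const uint32_t subgroupSize = 1u << subgroupSizeLog2;
        if (workgroupSize <= subgroupSize)
            return workgroupSize * itemsPerInvocation;
        const uint32_t subgroupCount = workgroupSize >> subgroupSizeLog2;
        return (itemsPerInvocation * (subgroupCount > subgroupSize ? subgroupCount : subgroupSize)) << subgroupSizeLog2;
    }
};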

Comment on lines +601 to +633
const uint32_t workgroupSizeLog2 = hlsl::findMSB(workgroupSize);
const std::string definitions[7] = {
"workgroup2::" + arith_name,
std::to_string(workgroupSizeLog2),
std::to_string(itemsPerWG),
std::to_string(itemsPerInvoc),
std::to_string(subgroupSizeLog2),
std::to_string(numLoops),
std::to_string(arith_name=="reduction")
};

const IShaderCompiler::SMacroDefinition defines[7] = {
{ "OPERATION", definitions[0] },
{ "WORKGROUP_SIZE_LOG2", definitions[1] },
{ "ITEMS_PER_WG", definitions[2] },
{ "ITEMS_PER_INVOCATION", definitions[3] },
{ "SUBGROUP_SIZE_LOG2", definitions[4] },
{ "NUM_LOOPS", definitions[5] },
{ "IS_REDUCTION", definitions[6] }
};
options.preprocessorOptions.extraDefines = { defines, defines + 7 };

overriddenUnspecialized = compiler->compileToSPIRV((const char*)source->getContent()->getPointer(), options);
}
else
{
const std::string definitions[5] = {
"subgroup2::" + arith_name,
std::to_string(workgroupSize),
std::to_string(itemsPerInvoc),
std::to_string(subgroupSizeLog2),
std::to_string(numLoops)
};

it would also be awesome if said NBL_CONSTEXPR config had a method to spit out its templated counterpart's instantiated type as a std::string

@devshgraphicsprogramming (Member) commented on Jun 3, 2025

can be like

template<template<typename> class Binop>
using {PREFIX}Config{SUFFIX} = nbl::hlsl::workgroup2::Config<Binop, ... stuff from the runtime struct ....>;

make one for subgroup too
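
A hedged sketch of the string-emitting method (the template parameter list of the real workgroup2 config is an assumption, as is the method name):

std::string getTemplatedTypeString() const
{
    // emits the instantiated type, e.g. "nbl::hlsl::workgroup2::ArithmeticConfiguration<8,5,1>"
    return "nbl::hlsl::workgroup2::ArithmeticConfiguration<" +
        std::to_string(workgroupSizeLog2) + "," +
        std::to_string(subgroupSizeLog2) + "," +
        std::to_string(itemsPerInvocation) + ">";
}

The result could feed options.preprocessorOptions.extraDefines directly, replacing the hand-assembled definitions[] array.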

Comment on lines +314 to +337
// barrier transition to GENERAL
{
IGPUCommandBuffer::SPipelineBarrierDependencyInfo::image_barrier_t imageBarriers[1];
imageBarriers[0].barrier = {
.dep = {
.srcStageMask = PIPELINE_STAGE_FLAGS::NONE,
.srcAccessMask = ACCESS_FLAGS::NONE,
.dstStageMask = PIPELINE_STAGE_FLAGS::COMPUTE_SHADER_BIT,
.dstAccessMask = ACCESS_FLAGS::SHADER_WRITE_BITS
}
};
imageBarriers[0].image = dummyImg.get();
imageBarriers[0].subresourceRange = {
.aspectMask = IImage::EAF_COLOR_BIT,
.baseMipLevel = 0u,
.levelCount = 1u,
.baseArrayLayer = 0u,
.layerCount = 1u
};
imageBarriers[0].oldLayout = IImage::LAYOUT::UNDEFINED;
imageBarriers[0].newLayout = IImage::LAYOUT::GENERAL;

cmdbuf->pipelineBarrier(E_DEPENDENCY_FLAGS::EDF_NONE, { .imgBarriers = imageBarriers });
}

does Nsight stop capturing if you don't transition the image?

I already mentioned this #190 (comment)

Comment on lines +339 to +363
// bind dummy image
IGPUImageView::SCreationParams viewParams = {
.flags = IGPUImageView::ECF_NONE,
.subUsages = IGPUImage::E_USAGE_FLAGS::EUF_STORAGE_BIT,
.image = dummyImg,
.viewType = IGPUImageView::ET_2D,
.format = dummyImg->getCreationParameters().format
};
auto dummyImgView = m_device->createImageView(std::move(viewParams));

video::IGPUDescriptorSet::SDescriptorInfo dsInfo;
dsInfo.info.image.imageLayout = IImage::LAYOUT::GENERAL;
dsInfo.desc = dummyImgView;

IGPUDescriptorSet::SWriteDescriptorSet dsWrites[1u] =
{
{
.dstSet = benchDs.get(),
.binding = 2u,
.arrayElement = 0u,
.count = 1u,
.info = &dsInfo,
}
};
m_device->updateDescriptorSets(1u, dsWrites, 0u, nullptr);

this needs to go; the swapchain images need to be written into the descriptor set at application startup

I already mentioned this #190 (comment)

Comment on lines +248 to +257
resultsBuffer = ICPUBuffer::create({ outputBuffers[0]->getSize() });
smart_refctd_ptr<IGPUCommandBuffer> cmdbuf;
{
smart_refctd_ptr<nbl::video::IGPUCommandPool> cmdpool = m_device->createCommandPool(computeQueue->getFamilyIndex(),IGPUCommandPool::CREATE_FLAGS::RESET_COMMAND_BUFFER_BIT);
if (!cmdpool->createCommandBuffers(IGPUCommandPool::BUFFER_LEVEL::PRIMARY,{&cmdbuf,1}))
{
logFail("Failed to create Command Buffers!\n");
return false;
}
}

you don't even use this command buffer

Comment on lines +366 to +369
const auto MinSubgroupSize = m_physicalDevice->getLimits().minSubgroupSize;
const auto MaxSubgroupSize = m_physicalDevice->getLimits().maxSubgroupSize;

const auto SubgroupSizeLog2 = hlsl::findMSB(MinSubgroupSize);

you should be benchmarking with max subgroup size, not min

Comment on lines +520 to +525
std::string caption = "[Nabla Engine] Geometry Creator";
{
caption += ", displaying [all objects]";
m_window->setCaption(caption);
}
m_surface->present(m_currentImageAcquire.imageIndex, rendered);

don't change the caption every frame, it affects your benchmark; I already mentioned this in #190

Comment on lines +270 to +289
// reflects calculations in workgroup2::ArithmeticConfiguration
uint32_t calculateItemsPerWorkgroup(const uint32_t workgroupSize, const uint32_t subgroupSize, const uint32_t itemsPerInvocation)
{
if (workgroupSize <= subgroupSize)
return workgroupSize * itemsPerInvocation;

const uint8_t subgroupSizeLog2 = hlsl::findMSB(subgroupSize);
const uint8_t workgroupSizeLog2 = hlsl::findMSB(workgroupSize);

const uint16_t levels = (workgroupSizeLog2 == subgroupSizeLog2) ? 1 :
(workgroupSizeLog2 > subgroupSizeLog2 * 2 + 2) ? 3 : 2;

const uint16_t itemsPerInvocationProductLog2 = max(workgroupSizeLog2 - subgroupSizeLog2 * levels, 0);
uint16_t itemsPerInvocation1 = (levels == 3) ? min(itemsPerInvocationProductLog2, 2) : itemsPerInvocationProductLog2;
itemsPerInvocation1 = uint16_t(1u) << itemsPerInvocation1;

uint32_t virtualWorkgroupSize = 1u << max(subgroupSizeLog2 * levels, workgroupSizeLog2);

return itemsPerInvocation * virtualWorkgroupSize;
}

we definitely need a workgroup2::CxprArithmeticConfiguration that does everything the templated one does, but with NBL_CONSTEXPR-labelled methods
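
A possible shape for it (a sketch, assuming the helper above and hlsl::findMSB are usable in constant expressions):

struct CxprArithmeticConfiguration
{
    uint32_t workgroupSize;
    uint32_t subgroupSize;
    uint32_t itemsPerInvocation;

    // same computation as calculateItemsPerWorkgroup() above, but evaluable at
    // compile time and shareable between the C++ and HLSL sides
    constexpr uint32_t getItemsPerWorkgroup() const
    {
        return calculateItemsPerWorkgroup(workgroupSize, subgroupSize, itemsPerInvocation);
    }
};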

return nbl::hlsl::glsl::gl_WorkGroupID().x*WORKGROUP_SIZE+nbl::hlsl::workgroup::SubgroupContiguousIndex();
}

bool canStore() {return true;}

unused

using params_t = nbl::hlsl::subgroup2::ArithmeticParams<config_t, typename Binop::base_t, N, nbl::hlsl::jit::device_capabilities>;
type_t value = sourceVal;

const uint64_t outputBufAddr = vk::RawBufferLoad<uint64_t>(pc.outputAddressBufAddress + Binop::BindingIndex * sizeof(uint64_t), sizeof(uint64_t));

dependent store (the store depends on this read); keep all output addresses in the push constants instead

Comment on lines +12 to +113
typedef vector<uint32_t, config_t::ItemsPerInvocation_0> type_t;

// final (level 1/2) scan needs to fit in one subgroup exactly
groupshared uint32_t scratch[config_t::SharedScratchElementCount];

struct ScratchProxy
{
template<typename AccessType, typename IndexType>
void get(const IndexType ix, NBL_REF_ARG(AccessType) value)
{
value = scratch[ix];
}
template<typename AccessType, typename IndexType>
void set(const IndexType ix, const AccessType value)
{
scratch[ix] = value;
}

uint32_t atomicOr(const uint32_t ix, const uint32_t value)
{
return nbl::hlsl::glsl::atomicOr(scratch[ix],value);
}

void workgroupExecutionAndMemoryBarrier()
{
nbl::hlsl::glsl::barrier();
//nbl::hlsl::glsl::memoryBarrierShared(); implied by the above
}
};


template<class Config, class Binop>
struct DataProxy
{
using dtype_t = vector<uint32_t, Config::ItemsPerInvocation_0>;
static_assert(nbl::hlsl::is_same_v<dtype_t, type_t>);

// we don't want to write/read storage multiple times in loop; doesn't seem optimized out in generated spirv
template<typename AccessType, typename IndexType>
void get(const IndexType ix, NBL_REF_ARG(dtype_t) value)
{
// value = inputValue[ix];
value = nbl::hlsl::promote<dtype_t>(globalIndex());
}
template<typename AccessType, typename IndexType>
void set(const IndexType ix, const dtype_t value)
{
// output[Binop::BindingIndex].template Store<type_t>(sizeof(uint32_t) + sizeof(type_t) * ix, value);
}

void workgroupExecutionAndMemoryBarrier()
{
nbl::hlsl::glsl::barrier();
//nbl::hlsl::glsl::memoryBarrierShared(); implied by the above
}
};

template<class Config, class Binop>
struct PreloadedDataProxy
{
using dtype_t = vector<uint32_t, Config::ItemsPerInvocation_0>;
static_assert(nbl::hlsl::is_same_v<dtype_t, type_t>);

NBL_CONSTEXPR_STATIC_INLINE uint32_t PreloadedDataCount = Config::VirtualWorkgroupSize / Config::WorkgroupSize;

template<typename AccessType, typename IndexType>
void get(const IndexType ix, NBL_REF_ARG(AccessType) value)
{
value = preloaded[(ix-nbl::hlsl::workgroup::SubgroupContiguousIndex())>>Config::WorkgroupSizeLog2];
}
template<typename AccessType, typename IndexType>
void set(const IndexType ix, const AccessType value)
{
preloaded[(ix-nbl::hlsl::workgroup::SubgroupContiguousIndex())>>Config::WorkgroupSizeLog2] = value;
}

void preload()
{
const uint32_t workgroupOffset = nbl::hlsl::glsl::gl_WorkGroupID().x * Config::VirtualWorkgroupSize;
[unroll]
for (uint32_t idx = 0; idx < PreloadedDataCount; idx++)
preloaded[idx] = vk::RawBufferLoad<dtype_t>(pc.inputBufAddress + (workgroupOffset + idx * Config::WorkgroupSize + nbl::hlsl::workgroup::SubgroupContiguousIndex()) * sizeof(dtype_t));
}
void unload()
{
const uint32_t workgroupOffset = nbl::hlsl::glsl::gl_WorkGroupID().x * Config::VirtualWorkgroupSize;
uint64_t outputBufAddr = vk::RawBufferLoad<uint64_t>(pc.outputAddressBufAddress + Binop::BindingIndex * sizeof(uint64_t));
[unroll]
for (uint32_t idx = 0; idx < PreloadedDataCount; idx++)
vk::RawBufferStore<dtype_t>(outputBufAddr + sizeof(uint32_t) + sizeof(dtype_t) * (workgroupOffset + idx * Config::WorkgroupSize + nbl::hlsl::workgroup::SubgroupContiguousIndex()), preloaded[idx], sizeof(uint32_t));
}

void workgroupExecutionAndMemoryBarrier()
{
nbl::hlsl::glsl::barrier();
//nbl::hlsl::glsl::memoryBarrierShared(); implied by the above
}

dtype_t preloaded[PreloadedDataCount];
};

maybe you want to make a common header for ex23 and ex29 with this?

Comment on lines +150 to +151
if (globalIndex()==0u)
vk::RawBufferStore<uint32_t>(outputBufAddr, nbl::hlsl::glsl::gl_SubgroupSize());

don't store this, it's only for the test

Comment on lines +153 to +154
PreloadedDataProxy<config_t,Binop> dataAccessor;
dataAccessor.preload();

best to generate the source data from a hash (see the PCG suggestion above)

Comment on lines +164 to +187
type_t benchmark()
{
const type_t sourceVal = vk::RawBufferLoad<type_t>(pc.inputBufAddress + globalIndex() * sizeof(type_t));

subbench<bit_and<uint32_t> >(sourceVal);
subbench<bit_xor<uint32_t> >(sourceVal);
subbench<bit_or<uint32_t> >(sourceVal);
subbench<plus<uint32_t> >(sourceVal);
subbench<multiplies<uint32_t> >(sourceVal);
subbench<minimum<uint32_t> >(sourceVal);
subbench<maximum<uint32_t> >(sourceVal);
return sourceVal;
}


uint32_t globalIndex()
{
return nbl::hlsl::glsl::gl_WorkGroupID().x*ITEMS_PER_WG+nbl::hlsl::workgroup::SubgroupContiguousIndex();
}

bool canStore()
{
return nbl::hlsl::workgroup::SubgroupContiguousIndex()<ITEMS_PER_WG;
}

see #190 for how to clean this up
