Conversation

@CrabExtra

No description provided.

Signed-off-by: CrabeExtra <abbasgaroosi7@gmail.com>
@devshgraphicsprogramming
Member

@Fletterio wanna try your hand at a review?

auto limits = m_physicalDevice->getLimits();
const uint32_t max_shared_memory_size = limits.maxComputeSharedMemorySize;
const uint32_t max_workgroup_size = limits.maxComputeWorkGroupInvocations; // Get actual GPU limit
const uint32_t bytes_per_elements = sizeof(uint32_t) * 2; // 2 uint32_t per element (key and value)

you should get this with sizeof() of the structs in app_resources/common.hlsl
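
For instance (a sketch; KeyValue is a hypothetical name for whatever struct common.hlsl ends up declaring):

// In app_resources/common.hlsl, shared between the shader and the host (name is an assumption):
struct KeyValue
{
    uint32_t key;
    uint32_t value;
};

// Host side:
const uint32_t bytes_per_element = sizeof(KeyValue);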

I said this 2 months ago, and it's still not done

Contributor

@Fletterio left a comment

Just realized these were all pending; most are outdated, see if any still hold up

}
}

GroupMemoryBarrierWithGroupSync();
Contributor

If compareDistance < waveSize, these barriers serve no purpose; you are overbarriering. In fact, writing to shared memory at the end of every such iteration is also pointless.

The proper way to avoid this overbarriering is to branch behaviour based on whether compareDistance < waveSize or not. All steps with compareDistance < waveSize can be done in one go: threads shuffle their elements around using subgroup intrinsics (shuffleXor, namely), once per every compareDistance value less than the starting one, and then write back to shared memory only once. This is what we do with the FFT, although I don't expect you to infer that from the code since it can be a bit obscure. @ me on discord if you want to figure out the way we handle this with the FFT; I can explain better there since I need to draw diagrams and write a bunch more
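
A sketch of that subgroup path, assuming the usual contiguous invocation-to-lane mapping and that every dist stays below WaveGetLaneCount(); every symbol here (sharedKeys, sharedVals, tid, sequenceLength, firstDistance, WorkgroupSize) is a placeholder, not the example's actual code:

groupshared uint32_t sharedKeys[WorkgroupSize]; // placeholder declarations for the sketch
groupshared uint32_t sharedVals[WorkgroupSize];

// Stand-in for a subgroup shuffle-XOR; when compiling to SPIR-V, DXC lowers
// WaveReadLaneAt with a non-uniform lane index to a subgroup shuffle.
uint32_t shuffleXor(const uint32_t v, const uint32_t mask)
{
    return WaveReadLaneAt(v, WaveGetLaneIndex() ^ mask);
}

// Once compareDistance drops below the wave size, keep the element in registers and
// exchange it with the XOR-partner lane, writing shared memory and barriering only
// once at the end instead of once per compareDistance.
void subgroupBitonicSteps(const uint32_t tid, const uint32_t sequenceLength, const uint32_t firstDistance)
{
    uint32_t key = sharedKeys[tid];
    uint32_t val = sharedVals[tid];

    const bool ascending = (tid & sequenceLength) == 0;

    for (uint32_t dist = firstDistance; dist > 0; dist >>= 1)
    {
        const uint32_t partnerKey = shuffleXor(key, dist);
        const uint32_t partnerVal = shuffleXor(val, dist);

        // The lower lane of each pair keeps the smaller key in an ascending run (and vice versa);
        // equal keys never swap.
        const bool keepSmaller = ascending == ((tid & dist) == 0);
        const bool takePartner = keepSmaller ? (partnerKey < key) : (partnerKey > key);
        if (takePartner)
        {
            key = partnerKey;
            val = partnerVal;
        }
    }

    // Single write-back and single barrier for the whole run of subgroup-sized steps.
    sharedKeys[tid] = key;
    sharedVals[tid] = val;
    GroupMemoryBarrierWithGroupSync();
}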

Comment on lines +37 to +59
struct Accessor
{
    static Accessor create(const uint64_t address)
    {
        Accessor accessor;
        accessor.address = address;
        return accessor;
    }

    template <typename AccessType, typename IndexType>
    void get(const IndexType index, NBL_REF_ARG(AccessType) value)
    {
        value = vk::RawBufferLoad<AccessType>(address + index * sizeof(AccessType));
    }

    template <typename AccessType, typename IndexType>
    void set(const IndexType index, const AccessType value)
    {
        vk::RawBufferStore<AccessType>(address + index * sizeof(AccessType), value);
    }

    uint64_t address;
};

there are readymade BDA accessors you can use AFAIK

add_subdirectory(12_MeshLoaders)
#
#add_subdirectory(13_MaterialCompiler EXCLUDE_FROM_ALL)
add_subdirectory(12_MeshLoaders EXCLUDE_FROM_ALL)

you erroneously added EXCLUDE_FROM_ALL to example 12 and now it's omitted from CI

uint64_t deviceBufferAddress;
};

NBL_CONSTEXPR uint32_t WorkgroupSizeLog2 = 10; // 1024 threads (2^10)

512 is optimal residency on all GPUs

so a WorkgroupSizeLog2 of 9
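
That is, the quoted constant would become:

NBL_CONSTEXPR uint32_t WorkgroupSizeLog2 = 9; // 512 threads (2^9)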

Comment on lines +271 to +298
    std::cout << "(" << key << "," << value << "), ";
    if ((i + 1) % 20 == 0) {
        std::cout << "\n";
    }
}
std::cout << "\nElement count: " << elementCount << "\n";

bool is_sorted = true;
int32_t error_index = -1;
for (uint32_t i = 1; i < elementCount; i++) {
    uint32_t prevKey = data[(i - 1) * 2];
    uint32_t currKey = data[i * 2];
    if (currKey < prevKey) {
        is_sorted = false;
        error_index = i;
        break;
    }
}

if (is_sorted) {
    std::cout << "Array is correctly sorted!\n";
}
else {
    std::cout << "Array is NOT sorted correctly!\n";
    std::cout << "Error at index " << error_index << ":\n";
    std::cout << " Previous key [" << (error_index - 1) << "] = " << data[(error_index - 1) * 2] << "\n";
    std::cout << " Current key [" << error_index << "] = " << data[error_index * 2] << "\n";
    std::cout << " (" << data[error_index * 2] << " < " << data[(error_index - 1) * 2] << " is WRONG!)\n";

please use m_logger instead of std::cout
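
For example (a sketch; this assumes the printf-style m_logger->log(format, level, ...) overload used elsewhere in the examples):

m_logger->log("Element count: %u", ILogger::ELL_INFO, elementCount);
m_logger->log("Array is NOT sorted correctly! Error at index %d: prev key %u, curr key %u",
    ILogger::ELL_ERROR, error_index, data[(error_index - 1) * 2], data[error_index * 2]);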


deviceLocalBufferParams.queueFamilyIndexCount = 1;
deviceLocalBufferParams.queueFamilyIndices = &queueFamilyIndex;
deviceLocalBufferParams.size = sizeof(uint32_t) * elementCount * 2; // *2 because we store (key, value) pairs

have a struct like KeyValue so you can take a sizeof of it then
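
e.g. (a sketch, reusing the hypothetical KeyValue struct from the earlier comment), which also lets you compute the size once up front and reuse it later:

const uint32_t inputSize = sizeof(KeyValue) * elementCount; // compute once, reuse for buffer creation and the upload
deviceLocalBufferParams.size = inputSize;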

};

NBL_CONSTEXPR uint32_t WorkgroupSizeLog2 = 10; // 1024 threads (2^10)
NBL_CONSTEXPR uint32_t ElementsPerThreadLog2 = 2; // 4 elements per thread (2^2) - VIRTUAL THREADING!

I wouldn't really call that virtual threads; virtual threading is when you make a workgroup of size, let's say, 512 behave as if it were 4096.

Processing multiple elements per invocation is an orthogonal extra to that, and it helps with subgroupShuffle utilization and loading from global memory.
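
A minimal sketch of the elements-per-invocation pattern in isolation (the names and the coalesced-load striding are assumptions, not the example's code):

// Sketch only: each invocation handles ElementsPerThread elements, strided by the
// workgroup size so that neighbouring invocations load neighbouring addresses.
void loadOwnedElements(const uint32_t localInvocationIndex) // e.g. SV_GroupIndex
{
    const uint32_t elementsPerThread = 1u << ElementsPerThreadLog2;
    const uint32_t workgroupSize = 1u << WorkgroupSizeLog2;
    for (uint32_t e = 0; e < elementsPerThread; e++)
    {
        const uint32_t index = localInvocationIndex + e * workgroupSize;
        // load the element at `index`, compare-exchange, store it back, etc.
    }
}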


IQueue* const queue = getComputeQueue();

const uint32_t inputSize = sizeof(uint32_t) * elementCount * 2; // *2 because we store (key, value) pairs

compute this before making your m_deviceLocalBuffer and use it throughout

Comment on lines +161 to +162
inputPtr[i * 2] = key;
inputPtr[i * 2 + 1] = value;

again another reason to have a named struct so you're not doing this unreadable type punning
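
i.e. something like (a sketch, again with the hypothetical KeyValue struct):

KeyValue* const input = reinterpret_cast<KeyValue*>(inputPtr); // or map the staging memory as KeyValue* directly
input[i].key = key;
input[i].value = value;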

Comment on lines +163 to +168
    std::cout << "(" << key << "," << value << "), ";
    if ((i + 1) % 20 == 0) {
        std::cout << "\n";
    }
}
std::cout << "\nElement count: " << elementCount << "\n";

use the m_logger instead of std::cout

or an std::ostringstream and then m_logger to print the contents
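
For example (a sketch; same printf-style m_logger->log assumption as above, and it needs <sstream>):

std::ostringstream out;
for (uint32_t i = 0; i < elementCount; i++)
{
    out << "(" << inputPtr[i * 2] << "," << inputPtr[i * 2 + 1] << "), ";
    if ((i + 1) % 20 == 0)
        out << "\n";
}
m_logger->log("Input array: %s\nElement count: %u", ILogger::ELL_INFO, out.str().c_str(), elementCount);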

{
assert(dstOffset == 0 && size == outputSize);

std::cout << "Sorted array: ";

name it differently, e.g. Output Array

"Sorted array" is confusing because it makes me think you're outputting a reference sorted array on the CPU


cmdbuf->pushConstants(m_pipeline->getLayout(), IShader::E_SHADER_STAGE::ESS_COMPUTE, 0u, sizeof(pc), &pc);

cmdbuf->dispatch(1, 1, 1);

I'd consider preparing multiple test data sets (right now you only have one random one) and having 1 workgroup per test (so still one dispatch).

I'd add the following as manual test cases:

  • all keys equal
  • keys already sorted
  • keys in exact reverse

Furthermore, this should be a stable sort (unlike the counting sort), so it's important to check stability: include a test with some neighbouring equal keys in the input array and check that they have not changed places.
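
A sketch of what generating those data sets could look like (KeyValue and the whole layout are assumptions, not existing code; values are set to the original index so stability can be checked afterwards):

#include <cstdint>
#include <random>
#include <vector>

struct KeyValue { uint32_t key; uint32_t value; }; // hypothetical, as suggested earlier

// One vector per test case; with 1 workgroup per test they can be concatenated into a single upload.
static std::vector<KeyValue> makeTestCase(const uint32_t elementCount, const uint32_t mode)
{
    std::vector<KeyValue> data(elementCount);
    std::mt19937 rng(0xdeadbeefu);
    for (uint32_t i = 0; i < elementCount; i++)
    {
        switch (mode)
        {
            case 0: data[i].key = rng() % 64u; break;            // random, with plenty of duplicate keys
            case 1: data[i].key = 42u; break;                    // all keys equal
            case 2: data[i].key = i; break;                      // already sorted
            case 3: data[i].key = elementCount - 1u - i; break;  // exact reverse
        }
        data[i].value = i; // unique, increasing values so instability is detectable
    }
    return data;
}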

Comment on lines +283 to +287
if (currKey < prevKey) {
    is_sorted = false;
    error_index = i;
    break;
}

because Bitonic should be stable, you should also check

else if (currKey==prevKey && currValue>prevValue) // check stability, this is why we've initialized the values in such a particular way
{
   // fail, not due to bad sorting but to instability
}

and also assert(currValue != prevValue); because of how we initialized the values
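
Put together, the verification loop could look roughly like this (a sketch; it assumes values were initialized as the original, increasing indices, so among equal keys instability shows up as a decreasing value, and the comparison flips for the opposite initialization):

bool sorted = true;
bool stable = true;
int32_t error_index = -1;
for (uint32_t i = 1; i < elementCount; i++)
{
    const uint32_t prevKey   = data[(i - 1) * 2];
    const uint32_t prevValue = data[(i - 1) * 2 + 1];
    const uint32_t currKey   = data[i * 2];
    const uint32_t currValue = data[i * 2 + 1];

    assert(currValue != prevValue); // values were initialized to be unique

    if (currKey < prevKey)
    {
        sorted = false; // plain ordering failure
        error_index = i;
        break;
    }
    if (currKey == prevKey && currValue < prevValue)
    {
        stable = false; // keys equal but original relative order lost: instability
        error_index = i;
        break;
    }
}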
