|
| 1 | +| Proposal ID | CP015 | |
| 2 | +|-------------|--------| |
| 3 | +| Name | SYCL Specialization Constant | |
| 4 | +| Date of Creation | 18 April 2018 | |
| 5 | +| Target | SYCL 1.2.1 extension / SYCL 2.2 | |
| 6 | +| Current Status | _Work in progress_ | |
| 7 | +| Reply-to | Victor Lomüller <victor@codeplay.com> | |
| 8 | +| Original author | Victor Lomüller <victor@codeplay.com>, Toomas Remmelg <toomas.remmelg@codeplay.com> | |
| 9 | +| Contributors | Victor Lomüller <victor@codeplay.com>, Toomas Remmelg <toomas.remmelg@codeplay.com>, Ruyman Reyes <ruyman@codeplay.com> | |
| 10 | + |
| 11 | +# SYCL Specialization Constant |
| 12 | + |
| 13 | +## Motivation |
| 14 | + |
| 15 | +Many applications use runtime known constants to adapt their behaviors to their runtime environment. |
| 16 | +Such constants are unknown when the developer compiles the application but will remain invariant through-out the application execution. |
| 17 | +This is especially true for highly tuned software that requires information about the hardware on which the the application is running. |
| 18 | + |
| 19 | +Since OpenCL C kernels are being fully compiled at runtime, those constants are usually expressed as macro and the value is passed to online compiler when the kernel is being compiled. |
| 20 | +However, SYCL being statically compiled, it is not possible to use this approach. Template based techniques might not be possible or come at the price of code size explosion. |
| 21 | + |
| 22 | +SPIR-V, the standard intermediate representation for shader and compute kernels, introduced "specialization constants" as a way to replace this macro usage in statically compiled kernels. |
| 23 | +Specialization constants in SPIR-V are treated as constants whose value is not known at the time of the SPIR-V module generation. |
| 24 | +Providing these constants before building the module for the actual target provides the compiler with the opportunity to further optimize the program. |
| 25 | + |
| 26 | +This proposal introduces a way to express such runtime constants in SYCL programs. |
| 27 | +Even if the motivation is derived from a SPIR-V concept, its usage is not limited to device compilers targeting SPIR-V. |
| 28 | + |
| 29 | +## Specialization Constant Overview |
| 30 | + |
| 31 | +The following SYCL program present a specialization constant are expressed. |
| 32 | + |
| 33 | +```cpp |
| 34 | +#include <CL/sycl.hpp> |
| 35 | +#include <vector> |
| 36 | + |
| 37 | +class specialized_kernel; |
| 38 | +class runtime_const; |
| 39 | + |
| 40 | +// Fetch a value at runtime. |
| 41 | +float get_value(); |
| 42 | + |
| 43 | +int main() { |
| 44 | + cl::sycl::queue queue; |
| 45 | + cl::sycl::program program(queue.get_context()); |
| 46 | + |
| 47 | + // Create a specialization constant. |
| 48 | + cl::sycl::experimental::spec_constant<float, runtime_const> my_constant = |
| 49 | + program.set_spec_constant<runtime_const>(get_value()); |
| 50 | + program.build_with_kernel_type<specialized_kernel>(); |
| 51 | + |
| 52 | + std::vector<float> vec(1); |
| 53 | + { |
| 54 | + cl::sycl::buffer<float, 1> buffer(vec.data(), vec.size()); |
| 55 | + |
| 56 | + queue.submit([&](cl::sycl::handler& cgh) { |
| 57 | + auto acc = cgh.get_access<cl::sycl::access::mode::write>(buffer); |
| 58 | + cgh.single_task<specialized_kernel>( |
| 59 | + program.get_kernel<specialized_kernel>(), |
| 60 | + [=]() { acc[0] = my_constant.get(); }); |
| 61 | + }); |
| 62 | + } |
| 63 | +} |
| 64 | +``` |
| 65 | +In this example, the call to `set_spec_constant` binds the value returned by the call to `get_value` to the SYCL `program`. |
| 66 | +At static compilation time, the value is unknown to the SYCL device compiler, thus cannot be used by the optimizations. |
| 67 | +At runtime, `get_value` is evaluated and bond to the SYCL `program`, giving the opportunity for the underlying OpenCL runtime to use it during the kernel build. |
| 68 | +The function `set_spec_constant` returns a `spec_constant` object allowing the user to use the value inside the kernel. |
| 69 | +After all runtime values are bounded to the program, the program is built. |
| 70 | + |
| 71 | +The specialization constant `my_constant` is later used inside `specialized_kernel` and the expression `my_constant.get()` returns the value returned by the call to `get_value()`. |
| 72 | +If the target natively supports specialization constant, this value will be known by the underlying OpenCL consumer when it builds the kernel. |
| 73 | + |
| 74 | +A more concrete example would be a blocked matrix-matrix multiply. |
| 75 | +In this version of the algorithm, threads in a work-group collectively load elements from global to local memory and then perform part of the operation on the block. |
| 76 | +Typically, the operation would look like this: |
| 77 | +```cpp |
| 78 | +template <typename T> |
| 79 | +void mat_multiply(cl::sycl::queue& q, T* MA, T* MB, T* MC, int matSize) { |
| 80 | + auto device = q.get_device(); |
| 81 | + // Choose a block size based on some information about the device. |
| 82 | + auto maxBlockSize = |
| 83 | + device.get_info<cl::sycl::info::device::max_work_group_size>(); |
| 84 | + auto blockSize = prevPowerOfTwo(std::sqrt(maxBlockSize)); |
| 85 | + blockSize = std::min(matSize, blockSize); |
| 86 | + |
| 87 | + { |
| 88 | + range<1> dimensions(matSize * matSize); |
| 89 | + buffer<T> bA(MA, dimensions); |
| 90 | + buffer<T> bB(MB, dimensions); |
| 91 | + buffer<T> bC(MC, dimensions); |
| 92 | + |
| 93 | + q.submit([&](handler& cgh) { |
| 94 | + auto pA = bA.template get_access<access::mode::read>(cgh); |
| 95 | + auto pB = bB.template get_access<access::mode::read>(cgh); |
| 96 | + auto pC = bC.template get_access<access::mode::write>(cgh); |
| 97 | + auto localRange = range<1>(blockSize * blockSize); |
| 98 | + |
| 99 | + accessor<T, 1, access::mode::read_write, access::target::local> pBA( |
| 100 | + localRange, cgh); |
| 101 | + accessor<T, 1, access::mode::read_write, access::target::local> pBB( |
| 102 | + localRange, cgh); |
| 103 | + |
| 104 | + cgh.parallel_for<class mxm_kernel<T>>( |
| 105 | + nd_range<2>{range<2>(matSize, matSize), |
| 106 | + range<2>(blockSize, blockSize)}, |
| 107 | + [=](nd_item<2> it) { |
| 108 | + // Current block |
| 109 | + int blockX = it.get_group(0); |
| 110 | + int blockY = it.get_group(1); |
| 111 | + |
| 112 | + // Current local item |
| 113 | + int localX = it.get_local(0); |
| 114 | + int localY = it.get_local(1); |
| 115 | + |
| 116 | + // Start in the A matrix |
| 117 | + int a_start = matSize * blockSize * blockY; |
| 118 | + // End in the b matrix |
| 119 | + int a_end = a_start + matSize - 1; |
| 120 | + // Start in the b matrix |
| 121 | + int b_start = blockSize * blockX; |
| 122 | + |
| 123 | + // Result for the current C(i,j) element |
| 124 | + T tmp = 0.0f; |
| 125 | + // We go through all a, b blocks |
| 126 | + for (int a = a_start, b = b_start; a <= a_end; |
| 127 | + a += blockSize, b += (blockSize * matSize)) { |
| 128 | + // Copy the values in shared memory collectively |
| 129 | + pBA[localY * blockSize + localX] = |
| 130 | + pA[a + matSize * localY + localX]; |
| 131 | + // Note the swap of X/Y to maintain contiguous access |
| 132 | + pBB[localX * blockSize + localY] = |
| 133 | + pB[b + matSize * localY + localX]; |
| 134 | + it.barrier(access::fence_space::local_space); |
| 135 | + // Now each thread adds the value of its sum |
| 136 | + for (int k = 0; k < blockSize; k++) { |
| 137 | + tmp += |
| 138 | + pBA[localY * blockSize + k] * pBB[localX * blockSize + k]; |
| 139 | + } |
| 140 | + // The barrier ensures that all threads have written to local |
| 141 | + // memory before continuing |
| 142 | + it.barrier(access::fence_space::local_space); |
| 143 | + } |
| 144 | + auto elemIndex = |
| 145 | + it.get_global(1) * it.get_global_range()[0] + it.get_global(0); |
| 146 | + // Each thread updates its position |
| 147 | + pC[elemIndex] = tmp; |
| 148 | + }); |
| 149 | + }); |
| 150 | + } |
| 151 | +} |
| 152 | +``` |
| 153 | +In this example, `blockSize` depends on a runtime feature that is a hardware constant for a given device, thus never changes once known. |
| 154 | +The main issue is that this value is treated as a constant variable, and the compiler is unable to use it to perform optimizations like constant propagation or loop unrolling. |
| 155 | +It can even have an adverse effect as the value will use a register and the compiler may be forced to spill or reload the value multiple times. |
| 156 | +
|
| 157 | +Using specialization constants, the routine can be rewritten as: |
| 158 | +```cpp |
| 159 | +template <typename T> |
| 160 | +void mat_multiply(cl::sycl::queue& q, T* MA, T* MB, T* MC, int matSize) { |
| 161 | + auto device = q.get_device(); |
| 162 | + // Choose a block size based on some information about the device. |
| 163 | + auto maxBlockSize = |
| 164 | + device.get_info<cl::sycl::info::device::max_work_group_size>(); |
| 165 | + auto blockSizeCst = prevPowerOfTwo(std::sqrt(maxBlockSize)); |
| 166 | + blockSizeCst = std::min(matSize, blockSize); |
| 167 | +
|
| 168 | + cl::sycl::program program(queue.get_context()); |
| 169 | +
|
| 170 | + // Create a specialization constant to encapsulate blockSize. |
| 171 | + auto blockSize = program.set_spec_constant<class block_size>(blockSizeCst); |
| 172 | + program.build_with_kernel_type<class mxm_kernel<T>>(); |
| 173 | +
|
| 174 | + { |
| 175 | + range<1> dimensions(matSize * matSize); |
| 176 | + buffer<T> bA(MA, dimensions); |
| 177 | + buffer<T> bB(MB, dimensions); |
| 178 | + buffer<T> bC(MC, dimensions); |
| 179 | +
|
| 180 | + q.submit([&](handler& cgh) { |
| 181 | + auto pA = bA.template get_access<access::mode::read>(cgh); |
| 182 | + auto pB = bB.template get_access<access::mode::read>(cgh); |
| 183 | + auto pC = bC.template get_access<access::mode::write>(cgh); |
| 184 | + auto localRange = range<1>(blockSize * blockSize); |
| 185 | +
|
| 186 | + accessor<T, 1, access::mode::read_write, access::target::local> pBA( |
| 187 | + localRange, cgh); |
| 188 | + accessor<T, 1, access::mode::read_write, access::target::local> pBB( |
| 189 | + localRange, cgh); |
| 190 | +
|
| 191 | + cgh.parallel_for<class mxm_kernel<T>>( |
| 192 | + program.get_kernel<class mxm_kernel<T>>(), |
| 193 | + nd_range<2>{range<2>(matSize, matSize), |
| 194 | + range<2>(blockSize, blockSize)}, |
| 195 | + [=](nd_item<2> it) { |
| 196 | + // Current block |
| 197 | + int blockX = it.get_group(0); |
| 198 | + int blockY = it.get_group(1); |
| 199 | +
|
| 200 | + // Current local item |
| 201 | + int localX = it.get_local(0); |
| 202 | + int localY = it.get_local(1); |
| 203 | +
|
| 204 | + // Start in the A matrix |
| 205 | + int a_start = matSize * blockSize * blockY; |
| 206 | + // End in the b matrix |
| 207 | + int a_end = a_start + matSize - 1; |
| 208 | + // Start in the b matrix |
| 209 | + int b_start = blockSize * blockX; |
| 210 | +
|
| 211 | + // Result for the current C(i,j) element |
| 212 | + T tmp = 0.0f; |
| 213 | + // We go through all a, b blocks |
| 214 | + for (int a = a_start, b = b_start; a <= a_end; |
| 215 | + a += blockSize, b += (blockSize * matSize)) { |
| 216 | + // Copy the values in shared memory collectively |
| 217 | + pBA[localY * blockSize + localX] = |
| 218 | + pA[a + matSize * localY + localX]; |
| 219 | + // Note the swap of X/Y to maintain contiguous access |
| 220 | + pBB[localX * blockSize + localY] = |
| 221 | + pB[b + matSize * localY + localX]; |
| 222 | + it.barrier(access::fence_space::local_space); |
| 223 | + // Now each thread adds the value of its sum |
| 224 | + for (int k = 0; k < blockSize; k++) { |
| 225 | + tmp += |
| 226 | + pBA[localY * blockSize + k] * pBB[localX * blockSize + k]; |
| 227 | + } |
| 228 | + // The barrier ensures that all threads have written to local |
| 229 | + // memory before continuing |
| 230 | + it.barrier(access::fence_space::local_space); |
| 231 | + } |
| 232 | + auto elemIndex = |
| 233 | + it.get_global(1) * it.get_global_range()[0] + it.get_global(0); |
| 234 | + // Each thread updates its position |
| 235 | + pC[elemIndex] = tmp; |
| 236 | + }); |
| 237 | + }); |
| 238 | + } |
| 239 | +} |
| 240 | +``` |
| 241 | +In this example, `blockSize` is now a specialization constant holding the value same value as before, meaning that the value is now injected inside the module, allow the OpenCL consumer to use the value in the optimizations. |
| 242 | +Note that the specialization constant ID is independent from the template parameter `T` from which the kernel depends on. This means that all kernel instances will share the same value. |
| 243 | + |
| 244 | + |
| 245 | +## Specialization Constant Representation |
| 246 | + |
| 247 | +Specialization constants are encapsulated into a `cl::sycl::experimental::spec_constant` immutable object which can be passed to a SYCL kernel as a parameter. |
| 248 | +This object can only be constructed by the SYCL runtime. |
| 249 | +Accessing the value is done either explicitly via a `get` function or an implicit conversion. |
| 250 | + |
| 251 | +The `cl::sycl::experimental::spec_constant` interface is defined as follows: |
| 252 | + |
| 253 | +```cpp |
| 254 | +namespace cl { |
| 255 | +namespace sycl { |
| 256 | +namespace experimental { |
| 257 | + |
| 258 | +template <typename T, typename ID = T> |
| 259 | +class spec_constant { |
| 260 | +private: |
| 261 | + // Implementation defined constructor. |
| 262 | + spec_constant(/* Implementation defined */); |
| 263 | +public: |
| 264 | + spec_constant(); |
| 265 | + |
| 266 | + T get() const; // explicit access. |
| 267 | + operator T() const; // implicit conversion. |
| 268 | +}; |
| 269 | + |
| 270 | +} // namespace experimental |
| 271 | +} // namespace sycl |
| 272 | +} // namespace cl |
| 273 | +``` |
| 274 | +
|
| 275 | +Where `T` is the type of the constant. To be valid, the type `T` must be standard layout and trivially copyable. |
| 276 | +The template parameter `ID` is a unique name to designate the specialization constant. |
| 277 | +The name follows the same requirement and restrictions as the SYCL kernel names. |
| 278 | +It is valid for a program to reuse a kernel name for a specialization constant name and vice versa. |
| 279 | +
|
| 280 | +There is no guarantees about the size of the object, whether or not the constant is stored in memory is left as an implementation detail. |
| 281 | +
|
| 282 | +A `cl::sycl::experimental::spec_constant` object is considered initialized once the result of a `cl::sycl::program::set_spec_constant` is assigned to it. |
| 283 | +
|
| 284 | +Once initialized, `cl::sycl::experimental::spec_constant` objects are immutable, attempts to circumvent this property produces undefined behavior. |
| 285 | +
|
| 286 | +`cl::sycl::experimental::spec_constant` is default constructible, although the object is not considered initialized until the result of the call to `cl::sycl::program::set_spec_constant` is assigned to it. |
| 287 | +
|
| 288 | +Attempts to use an uninitialized `cl::sycl::experimental::spec_constant` produces undefined behavior. |
| 289 | +
|
| 290 | +## Building Programs with Specialization Constants |
| 291 | +
|
| 292 | +SYCL program requiring a specialization constant value to be built must first set them before building. |
| 293 | +
|
| 294 | +The program interface is extended to include a mechanism to set the constant. |
| 295 | +
|
| 296 | +```cpp |
| 297 | +namespace cl { |
| 298 | +namespace sycl { |
| 299 | +
|
| 300 | +class program { |
| 301 | +// ... |
| 302 | +public: |
| 303 | + template <typename ID, typename T> |
| 304 | + spec_constant<T, ID> set_spec_constant(T cst); |
| 305 | +// ... |
| 306 | +}; |
| 307 | +
|
| 308 | +} // namespace sycl |
| 309 | +} // namespace cl |
| 310 | +``` |
| 311 | + |
| 312 | +The templated member function `set_spec_constant` takes a runtime value of type `T` that will be used to set the specialization constant named `ID`. |
| 313 | +Multiple specialization constants can be set for the same program by calling `set_spec_constant` multiple times. |
| 314 | +Previously created `cl::sycl::experimental::spec_constant` objects becomes invalids and any usage of invalided objects produce undefined behavior. |
| 315 | + |
| 316 | +A specialization constant value can be overwritten if the program was not built before by recalling `set_spec_constant` with the same `ID` and the new value. |
| 317 | +Although the type `T` of the specialization constant must remain the same. |
| 318 | + |
| 319 | +Once all specialization constants are set, the program can be compile/built using program's function `compile_with_kernel_type`/`build_with_kernel_type`. |
| 320 | + |
| 321 | +If a required specialization constant is not set before calling `compile_with_kernel_type` / `build_with_kernel_type`, a `cl::sycl::experimental::spec_const_error` is thrown and the build of the kernel fails. |
| 322 | + |
| 323 | +For a same kernel, it is valid to set different specialization constants to different `cl::sycl::program` that builds it. |
| 324 | + |
| 325 | +After the kernel is built, it is no longer possible to set new specialization constants. |
| 326 | +A `cl::sycl::experimental::spec_const_error` exception will be thrown if the user attempt change it after the kernel has been built. |
| 327 | + |
| 328 | +## Build issue caused by Specialization Constants |
| 329 | + |
| 330 | +The following error class is added: |
| 331 | +```cpp |
| 332 | +namespace cl { |
| 333 | +namespace sycl { |
| 334 | +namespace experimental { |
| 335 | + |
| 336 | +class spec_const_error : public compile_program_error; |
| 337 | + |
| 338 | +} // namespace experimental |
| 339 | +} // namespace sycl |
| 340 | +} // namespace cl |
| 341 | +``` |
| 342 | +
|
| 343 | +This error can be thrown if a specialization constant compilation error occurs. |
| 344 | +
|
| 345 | +## OpenCL Interoperability |
| 346 | +
|
| 347 | +In SYCL, specialization constants use typenames to identify them rather than using the SPIR-V/OpenCL numerical identifiers. |
| 348 | +
|
| 349 | +To allow interoperability with OpenCL, uses can use this a special templated type as the SYCL specialization constant identifier to specify the numerical identifier of a specialization constant inside the module: |
| 350 | +```cpp |
| 351 | +namespace cl { |
| 352 | +namespace sycl { |
| 353 | +namespace experimental { |
| 354 | +
|
| 355 | +template <unsigned NID> |
| 356 | +struct spec_constant_id { |
| 357 | + static constexpr unsigned id = NID; |
| 358 | +}; |
| 359 | +
|
| 360 | +} // namespace experimental |
| 361 | +} // namespace sycl |
| 362 | +} // namespace cl |
| 363 | +``` |
| 364 | + |
| 365 | +The runtime will use the value `NID` provided by the template parameter to set the specialization constant. |
| 366 | +If the specified identifier does not exist in the module, a `cl::sycl::experimental::spec_const_error` error is thrown. |
| 367 | + |
| 368 | +For example: |
| 369 | +```cpp |
| 370 | + // Create a specialization constant. |
| 371 | + auto my_constant = program.set_spec_constant<cl::sycl::experimental::spec_constant_id<42>>(get_value()); |
| 372 | +``` |
| 373 | +In this call, the runtime will bind the value with the specialization constant with the identifier `42`. |
0 commit comments