Skip to content

Commit f1b2484

Browse files
Victor LomullerNaghasan
authored andcommitted
Specialization constant extension for SYCL
It propose a way to express and use the SPIR-V concept of specialization constant into SYCL.
1 parent 15d9eee commit f1b2484

File tree

2 files changed

+374
-0
lines changed

2 files changed

+374
-0
lines changed

README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -53,3 +53,4 @@ Each proposal in the table below will be tagged with one of the following states
5353
| CP012 | [Data Movement in C++](data-movement/index.md) | ISO C++ SG1, SG14 | 30 May 2017 | 28 August 2017 | _Work in Progress_ |
5454
| CP013 | [Supporting Heterogeneous & Distributed Computing Through Affinity](affinity/index.md) | ISO C++ SG1, SG14 | 15 November 2017 | 27 Novemeber 2017 | _Work in Progress_ |
5555
| CP014 | [Shared Virtual Memory](svm/index.md) | SYCL 2.2 | 22 January 2018 | 22 January 2018 | _Work in Progress_ |
56+
| CP015 | [Specialization Constant](spec-constant/index.md) | SYCL 1.2.1 extension / SYCL 2.2 | 24 April 2018 | 24 April 2018 | _Work in Progress_ |

spec-constant/index.md

Lines changed: 373 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,373 @@
1+
| Proposal ID | CP015 |
2+
|-------------|--------|
3+
| Name | SYCL Specialization Constant |
4+
| Date of Creation | 18 April 2018 |
5+
| Target | SYCL 1.2.1 extension / SYCL 2.2 |
6+
| Current Status | _Work in progress_ |
7+
| Reply-to | Victor Lomüller <victor@codeplay.com> |
8+
| Original author | Victor Lomüller <victor@codeplay.com>, Toomas Remmelg <toomas.remmelg@codeplay.com> |
9+
| Contributors | Victor Lomüller <victor@codeplay.com>, Toomas Remmelg <toomas.remmelg@codeplay.com>, Ruyman Reyes <ruyman@codeplay.com> |
10+
11+
# SYCL Specialization Constant
12+
13+
## Motivation
14+
15+
Many applications use runtime known constants to adapt their behaviors to their runtime environment.
16+
Such constants are unknown when the developer compiles the application but will remain invariant through-out the application execution.
17+
This is especially true for highly tuned software that requires information about the hardware on which the the application is running.
18+
19+
Since OpenCL C kernels are being fully compiled at runtime, those constants are usually expressed as macro and the value is passed to online compiler when the kernel is being compiled.
20+
However, SYCL being statically compiled, it is not possible to use this approach. Template based techniques might not be possible or come at the price of code size explosion.
21+
22+
SPIR-V, the standard intermediate representation for shader and compute kernels, introduced "specialization constants" as a way to replace this macro usage in statically compiled kernels.
23+
Specialization constants in SPIR-V are treated as constants whose value is not known at the time of the SPIR-V module generation.
24+
Providing these constants before building the module for the actual target provides the compiler with the opportunity to further optimize the program.
25+
26+
This proposal introduces a way to express such runtime constants in SYCL programs.
27+
Even if the motivation is derived from a SPIR-V concept, its usage is not limited to device compilers targeting SPIR-V.
28+
29+
## Specialization Constant Overview
30+
31+
The following SYCL program present a specialization constant are expressed.
32+
33+
```cpp
34+
#include <CL/sycl.hpp>
35+
#include <vector>
36+
37+
class specialized_kernel;
38+
class runtime_const;
39+
40+
// Fetch a value at runtime.
41+
float get_value();
42+
43+
int main() {
44+
cl::sycl::queue queue;
45+
cl::sycl::program program(queue.get_context());
46+
47+
// Create a specialization constant.
48+
cl::sycl::experimental::spec_constant<float, runtime_const> my_constant =
49+
program.set_spec_constant<runtime_const>(get_value());
50+
program.build_with_kernel_type<specialized_kernel>();
51+
52+
std::vector<float> vec(1);
53+
{
54+
cl::sycl::buffer<float, 1> buffer(vec.data(), vec.size());
55+
56+
queue.submit([&](cl::sycl::handler& cgh) {
57+
auto acc = cgh.get_access<cl::sycl::access::mode::write>(buffer);
58+
cgh.single_task<specialized_kernel>(
59+
program.get_kernel<specialized_kernel>(),
60+
[=]() { acc[0] = my_constant.get(); });
61+
});
62+
}
63+
}
64+
```
65+
In this example, the call to `set_spec_constant` binds the value returned by the call to `get_value` to the SYCL `program`.
66+
At static compilation time, the value is unknown to the SYCL device compiler, thus cannot be used by the optimizations.
67+
At runtime, `get_value` is evaluated and bond to the SYCL `program`, giving the opportunity for the underlying OpenCL runtime to use it during the kernel build.
68+
The function `set_spec_constant` returns a `spec_constant` object allowing the user to use the value inside the kernel.
69+
After all runtime values are bounded to the program, the program is built.
70+
71+
The specialization constant `my_constant` is later used inside `specialized_kernel` and the expression `my_constant.get()` returns the value returned by the call to `get_value()`.
72+
If the target natively supports specialization constant, this value will be known by the underlying OpenCL consumer when it builds the kernel.
73+
74+
A more concrete example would be a blocked matrix-matrix multiply.
75+
In this version of the algorithm, threads in a work-group collectively load elements from global to local memory and then perform part of the operation on the block.
76+
Typically, the operation would look like this:
77+
```cpp
78+
template <typename T>
79+
void mat_multiply(cl::sycl::queue& q, T* MA, T* MB, T* MC, int matSize) {
80+
auto device = q.get_device();
81+
// Choose a block size based on some information about the device.
82+
auto maxBlockSize =
83+
device.get_info<cl::sycl::info::device::max_work_group_size>();
84+
auto blockSize = prevPowerOfTwo(std::sqrt(maxBlockSize));
85+
blockSize = std::min(matSize, blockSize);
86+
87+
{
88+
range<1> dimensions(matSize * matSize);
89+
buffer<T> bA(MA, dimensions);
90+
buffer<T> bB(MB, dimensions);
91+
buffer<T> bC(MC, dimensions);
92+
93+
q.submit([&](handler& cgh) {
94+
auto pA = bA.template get_access<access::mode::read>(cgh);
95+
auto pB = bB.template get_access<access::mode::read>(cgh);
96+
auto pC = bC.template get_access<access::mode::write>(cgh);
97+
auto localRange = range<1>(blockSize * blockSize);
98+
99+
accessor<T, 1, access::mode::read_write, access::target::local> pBA(
100+
localRange, cgh);
101+
accessor<T, 1, access::mode::read_write, access::target::local> pBB(
102+
localRange, cgh);
103+
104+
cgh.parallel_for<class mxm_kernel<T>>(
105+
nd_range<2>{range<2>(matSize, matSize),
106+
range<2>(blockSize, blockSize)},
107+
[=](nd_item<2> it) {
108+
// Current block
109+
int blockX = it.get_group(0);
110+
int blockY = it.get_group(1);
111+
112+
// Current local item
113+
int localX = it.get_local(0);
114+
int localY = it.get_local(1);
115+
116+
// Start in the A matrix
117+
int a_start = matSize * blockSize * blockY;
118+
// End in the b matrix
119+
int a_end = a_start + matSize - 1;
120+
// Start in the b matrix
121+
int b_start = blockSize * blockX;
122+
123+
// Result for the current C(i,j) element
124+
T tmp = 0.0f;
125+
// We go through all a, b blocks
126+
for (int a = a_start, b = b_start; a <= a_end;
127+
a += blockSize, b += (blockSize * matSize)) {
128+
// Copy the values in shared memory collectively
129+
pBA[localY * blockSize + localX] =
130+
pA[a + matSize * localY + localX];
131+
// Note the swap of X/Y to maintain contiguous access
132+
pBB[localX * blockSize + localY] =
133+
pB[b + matSize * localY + localX];
134+
it.barrier(access::fence_space::local_space);
135+
// Now each thread adds the value of its sum
136+
for (int k = 0; k < blockSize; k++) {
137+
tmp +=
138+
pBA[localY * blockSize + k] * pBB[localX * blockSize + k];
139+
}
140+
// The barrier ensures that all threads have written to local
141+
// memory before continuing
142+
it.barrier(access::fence_space::local_space);
143+
}
144+
auto elemIndex =
145+
it.get_global(1) * it.get_global_range()[0] + it.get_global(0);
146+
// Each thread updates its position
147+
pC[elemIndex] = tmp;
148+
});
149+
});
150+
}
151+
}
152+
```
153+
In this example, `blockSize` depends on a runtime feature that is a hardware constant for a given device, thus never changes once known.
154+
The main issue is that this value is treated as a constant variable, and the compiler is unable to use it to perform optimizations like constant propagation or loop unrolling.
155+
It can even have an adverse effect as the value will use a register and the compiler may be forced to spill or reload the value multiple times.
156+
157+
Using specialization constants, the routine can be rewritten as:
158+
```cpp
159+
template <typename T>
160+
void mat_multiply(cl::sycl::queue& q, T* MA, T* MB, T* MC, int matSize) {
161+
auto device = q.get_device();
162+
// Choose a block size based on some information about the device.
163+
auto maxBlockSize =
164+
device.get_info<cl::sycl::info::device::max_work_group_size>();
165+
auto blockSizeCst = prevPowerOfTwo(std::sqrt(maxBlockSize));
166+
blockSizeCst = std::min(matSize, blockSize);
167+
168+
cl::sycl::program program(queue.get_context());
169+
170+
// Create a specialization constant to encapsulate blockSize.
171+
auto blockSize = program.set_spec_constant<class block_size>(blockSizeCst);
172+
program.build_with_kernel_type<class mxm_kernel<T>>();
173+
174+
{
175+
range<1> dimensions(matSize * matSize);
176+
buffer<T> bA(MA, dimensions);
177+
buffer<T> bB(MB, dimensions);
178+
buffer<T> bC(MC, dimensions);
179+
180+
q.submit([&](handler& cgh) {
181+
auto pA = bA.template get_access<access::mode::read>(cgh);
182+
auto pB = bB.template get_access<access::mode::read>(cgh);
183+
auto pC = bC.template get_access<access::mode::write>(cgh);
184+
auto localRange = range<1>(blockSize * blockSize);
185+
186+
accessor<T, 1, access::mode::read_write, access::target::local> pBA(
187+
localRange, cgh);
188+
accessor<T, 1, access::mode::read_write, access::target::local> pBB(
189+
localRange, cgh);
190+
191+
cgh.parallel_for<class mxm_kernel<T>>(
192+
program.get_kernel<class mxm_kernel<T>>(),
193+
nd_range<2>{range<2>(matSize, matSize),
194+
range<2>(blockSize, blockSize)},
195+
[=](nd_item<2> it) {
196+
// Current block
197+
int blockX = it.get_group(0);
198+
int blockY = it.get_group(1);
199+
200+
// Current local item
201+
int localX = it.get_local(0);
202+
int localY = it.get_local(1);
203+
204+
// Start in the A matrix
205+
int a_start = matSize * blockSize * blockY;
206+
// End in the b matrix
207+
int a_end = a_start + matSize - 1;
208+
// Start in the b matrix
209+
int b_start = blockSize * blockX;
210+
211+
// Result for the current C(i,j) element
212+
T tmp = 0.0f;
213+
// We go through all a, b blocks
214+
for (int a = a_start, b = b_start; a <= a_end;
215+
a += blockSize, b += (blockSize * matSize)) {
216+
// Copy the values in shared memory collectively
217+
pBA[localY * blockSize + localX] =
218+
pA[a + matSize * localY + localX];
219+
// Note the swap of X/Y to maintain contiguous access
220+
pBB[localX * blockSize + localY] =
221+
pB[b + matSize * localY + localX];
222+
it.barrier(access::fence_space::local_space);
223+
// Now each thread adds the value of its sum
224+
for (int k = 0; k < blockSize; k++) {
225+
tmp +=
226+
pBA[localY * blockSize + k] * pBB[localX * blockSize + k];
227+
}
228+
// The barrier ensures that all threads have written to local
229+
// memory before continuing
230+
it.barrier(access::fence_space::local_space);
231+
}
232+
auto elemIndex =
233+
it.get_global(1) * it.get_global_range()[0] + it.get_global(0);
234+
// Each thread updates its position
235+
pC[elemIndex] = tmp;
236+
});
237+
});
238+
}
239+
}
240+
```
241+
In this example, `blockSize` is now a specialization constant holding the value same value as before, meaning that the value is now injected inside the module, allow the OpenCL consumer to use the value in the optimizations.
242+
Note that the specialization constant ID is independent from the template parameter `T` from which the kernel depends on. This means that all kernel instances will share the same value.
243+
244+
245+
## Specialization Constant Representation
246+
247+
Specialization constants are encapsulated into a `cl::sycl::experimental::spec_constant` immutable object which can be passed to a SYCL kernel as a parameter.
248+
This object can only be constructed by the SYCL runtime.
249+
Accessing the value is done either explicitly via a `get` function or an implicit conversion.
250+
251+
The `cl::sycl::experimental::spec_constant` interface is defined as follows:
252+
253+
```cpp
254+
namespace cl {
255+
namespace sycl {
256+
namespace experimental {
257+
258+
template <typename T, typename ID = T>
259+
class spec_constant {
260+
private:
261+
// Implementation defined constructor.
262+
spec_constant(/* Implementation defined */);
263+
public:
264+
spec_constant();
265+
266+
T get() const; // explicit access.
267+
operator T() const; // implicit conversion.
268+
};
269+
270+
} // namespace experimental
271+
} // namespace sycl
272+
} // namespace cl
273+
```
274+
275+
Where `T` is the type of the constant. To be valid, the type `T` must be standard layout and trivially copyable.
276+
The template parameter `ID` is a unique name to designate the specialization constant.
277+
The name follows the same requirement and restrictions as the SYCL kernel names.
278+
It is valid for a program to reuse a kernel name for a specialization constant name and vice versa.
279+
280+
There is no guarantees about the size of the object, whether or not the constant is stored in memory is left as an implementation detail.
281+
282+
A `cl::sycl::experimental::spec_constant` object is considered initialized once the result of a `cl::sycl::program::set_spec_constant` is assigned to it.
283+
284+
Once initialized, `cl::sycl::experimental::spec_constant` objects are immutable, attempts to circumvent this property produces undefined behavior.
285+
286+
`cl::sycl::experimental::spec_constant` is default constructible, although the object is not considered initialized until the result of the call to `cl::sycl::program::set_spec_constant` is assigned to it.
287+
288+
Attempts to use an uninitialized `cl::sycl::experimental::spec_constant` produces undefined behavior.
289+
290+
## Building Programs with Specialization Constants
291+
292+
SYCL program requiring a specialization constant value to be built must first set them before building.
293+
294+
The program interface is extended to include a mechanism to set the constant.
295+
296+
```cpp
297+
namespace cl {
298+
namespace sycl {
299+
300+
class program {
301+
// ...
302+
public:
303+
template <typename ID, typename T>
304+
spec_constant<T, ID> set_spec_constant(T cst);
305+
// ...
306+
};
307+
308+
} // namespace sycl
309+
} // namespace cl
310+
```
311+
312+
The templated member function `set_spec_constant` takes a runtime value of type `T` that will be used to set the specialization constant named `ID`.
313+
Multiple specialization constants can be set for the same program by calling `set_spec_constant` multiple times.
314+
Previously created `cl::sycl::experimental::spec_constant` objects becomes invalids and any usage of invalided objects produce undefined behavior.
315+
316+
A specialization constant value can be overwritten if the program was not built before by recalling `set_spec_constant` with the same `ID` and the new value.
317+
Although the type `T` of the specialization constant must remain the same.
318+
319+
Once all specialization constants are set, the program can be compile/built using program's function `compile_with_kernel_type`/`build_with_kernel_type`.
320+
321+
If a required specialization constant is not set before calling `compile_with_kernel_type` / `build_with_kernel_type`, a `cl::sycl::experimental::spec_const_error` is thrown and the build of the kernel fails.
322+
323+
For a same kernel, it is valid to set different specialization constants to different `cl::sycl::program` that builds it.
324+
325+
After the kernel is built, it is no longer possible to set new specialization constants.
326+
A `cl::sycl::experimental::spec_const_error` exception will be thrown if the user attempt change it after the kernel has been built.
327+
328+
## Build issue caused by Specialization Constants
329+
330+
The following error class is added:
331+
```cpp
332+
namespace cl {
333+
namespace sycl {
334+
namespace experimental {
335+
336+
class spec_const_error : public compile_program_error;
337+
338+
} // namespace experimental
339+
} // namespace sycl
340+
} // namespace cl
341+
```
342+
343+
This error can be thrown if a specialization constant compilation error occurs.
344+
345+
## OpenCL Interoperability
346+
347+
In SYCL, specialization constants use typenames to identify them rather than using the SPIR-V/OpenCL numerical identifiers.
348+
349+
To allow interoperability with OpenCL, uses can use this a special templated type as the SYCL specialization constant identifier to specify the numerical identifier of a specialization constant inside the module:
350+
```cpp
351+
namespace cl {
352+
namespace sycl {
353+
namespace experimental {
354+
355+
template <unsigned NID>
356+
struct spec_constant_id {
357+
static constexpr unsigned id = NID;
358+
};
359+
360+
} // namespace experimental
361+
} // namespace sycl
362+
} // namespace cl
363+
```
364+
365+
The runtime will use the value `NID` provided by the template parameter to set the specialization constant.
366+
If the specified identifier does not exist in the module, a `cl::sycl::experimental::spec_const_error` error is thrown.
367+
368+
For example:
369+
```cpp
370+
// Create a specialization constant.
371+
auto my_constant = program.set_spec_constant<cl::sycl::experimental::spec_constant_id<42>>(get_value());
372+
```
373+
In this call, the runtime will bind the value with the specialization constant with the identifier `42`.

0 commit comments

Comments
 (0)