This repository was archived by the owner on Apr 28, 2023. It is now read-only.
This mapping option controls the maximum number of elements per thread
that are promoted into the private memory (hopefully, registers, but we
cannot guarantee this at the CUDA level). The value is optional in the
protocol buffers. When not provided, query the maximum number of
registers per block from CUDA device properties and divide it by the
number of threads in the block to obtain the per-thread limit.
Note that using all registers in a single block will likely limit the
occupancy of SMs, potentially degrading performance. Introducing the
limiting factor is primarily motivated by this effect, and it lets the
caller require the mapper to use fewer registers, potentially
increasing the occupancy. Since register allocation is performed by the
downstream compiler, this option is a mere recommendation and is
expressed in terms of (untyped) elements rather than actual registers.
It would be impossible to account, at the CUDA level, for all registers
required by the main computation (that is, those needed to hold the data
loaded from memory during operations), which also contribute to the
register pressure of the kernel.
Although limiting the number of promoted elements to the number of
registers available per thread may seem too constraining for occupancy,
it is strictly better than the current approach, where we may promote
even more elements, which then get spilled into the slow local memory.
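
The default described above can be illustrated with a minimal sketch.
It assumes the quantity being divided is the per-block register budget
(for example, the `regsPerBlock` field of `cudaDeviceProp`). The helper
name and its parameters are hypothetical and are not the mapper's
actual API.

```cpp
#include <cstddef>

// Hypothetical helper, for illustration only (not the mapper's API).
// hasRequested / requested: mimic an optional protocol buffer field.
// regsPerBlock: assumed to come from CUDA device properties, e.g.
//               cudaDeviceProp::regsPerBlock.
// threadsInBlock: number of threads in the mapped block.
std::size_t maxPrivateElementsPerThread(bool hasRequested,
                                        std::size_t requested,
                                        std::size_t regsPerBlock,
                                        std::size_t threadsInBlock) {
  // An explicitly provided value always takes precedence.
  if (hasRequested) {
    return requested;
  }
  // Guard against division by zero for an empty block.
  if (threadsInBlock == 0) {
    return 0;
  }
  // Default: split the per-block register budget evenly across threads.
  return regsPerBlock / threadsInBlock;
}
```

Because the result counts untyped elements rather than registers, it
should be read as an upper bound on promotion, not as a prediction of
the register usage chosen by the downstream compiler.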