Commit bd12f57

Update the wording of the bulk_execution_affinity properties in P1436 (#112)
* Rename `bulk_execution_affinity_t::scatter_t` to `bulk_execution_affinity_t::spread_t`.
* Rename `bulk_execution_affinity_t::compact_t` to `bulk_execution_affinity_t::close_t`.
* Refine the wording of the `bulk_execution_affinity_t` properties to clarify the requirements on binding and chunking, based on feedback from SG1.
* Add a note about a follow-on paper proposing a chunking property.
* Add some additional wording to describe the terms of art used in the wording.
1 parent 6c369dd commit bd12f57

1 file changed: affinity/cpp-23/d1436r3.md (+40 −23 lines)

@@ -18,6 +18,10 @@ This paper is the result of discussions from many contributors within SG1, SG14 a
 
 ### P1436r3 (PRA 2020)
 
+* Rename `bulk_execution_affinity_t::scatter_t` to `bulk_execution_affinity_t::spread_t`.
+* Rename `bulk_execution_affinity_t::compact_t` to `bulk_execution_affinity_t::close_t`.
+* Refine the wording of the `bulk_execution_affinity_t` properties to clarify the requirements on binding and chunking, based on feedback from SG1.
+
 ### P1436r2 (BEL 2019)
 
 * Alter the wording on the `bulk_execution_affinity_t` properties so they are now hints that request the executor provide a particular pattern of binding, rather than a guarantee.
@@ -101,7 +105,7 @@ The affinity interface we propose should help computers achieve a much higher fr
 
 To identify the requirements for supporting affinity we have looked at a number of use cases where affinity between memory locality and execution can provide better performance.
 
-Consider the following code example *(Listing 1)* where the C\+\+17 parallel STL algorithm `for_each` is used to modify the elements of a `std::vector` `data` on an *executor* that will execute on a NUMA system with a number of CPU cores. However the memory is allocated by the `std::vector` default allocator immediately during the construction of `data` on memory local to the calling thread of execution. This means that the memory allocated for `data` may have poor locality to all of the NUMA regions on the system, other than the one in which the constructor executed. Therefore, accesses in the parallel `for_each` made by threads in other NUMA regions will incur high latency. In this example, this is avoided by migrating `data` to have better affinity with the NUMA regions on the system using an *executor* with the `bulk_execution_affinity.scatter` property applied, before it is accessed by the `for_each`. Note that a mechanism for migration is not yet specified in this paper, so this example currently uses an arbitrary vendor API, `vendor_api::migrate`. Our intention is that a future revision of this paper will specify a standard mechanism for migration
+Consider the following code example *(Listing 1)* where the C\+\+17 parallel STL algorithm `for_each` is used to modify the elements of a `std::vector` `data` on an *executor* that will execute on a NUMA system with a number of CPU cores. However, the memory is allocated by the `std::vector` default allocator immediately during the construction of `data`, on memory local to the calling thread of execution. This means that the memory allocated for `data` may have poor locality to all of the NUMA regions on the system, other than the one in which the constructor executed. Therefore, accesses in the parallel `for_each` made by threads in other NUMA regions will incur high latency. In this example, this is avoided by migrating `data` to have better affinity with the NUMA regions on the system using an *executor* with the `bulk_execution_affinity.spread` property applied, before it is accessed by the `for_each`. Note that a mechanism for migration is not yet specified in this paper, so this example currently uses an arbitrary vendor API, `vendor_api::migrate`. Our intention is that a future revision of this paper will specify a standard mechanism for migration.
 
 ```cpp
 // NUMA executor representing N NUMA regions.
@@ -112,12 +116,12 @@ numa_executor exec;
 std::vector<float> data(N * SIZE);
 
 // Require the NUMA executor to bind its migration of memory to the underlying
-// memory resources in a scatter pattern.
+// memory resources in a spread pattern.
 auto affinityExec = std::execution::require(exec,
-  bulk_execution_affinity.scatter);
+  bulk_execution_affinity.spread);
 
 // Migrate the memory allocated for the vector across the NUMA regions in a
-// scatter pattern.
+// spread pattern.
 vendor_api::migrate(data, affinityExec);
 
 // Placement of data is local to NUMA region 0, so data for execution on other
@@ -127,7 +131,7 @@ std::for_each(std::execution::par.on(affinityExec), std::begin(data),
 ```
 *Listing 1: Migrating previously allocated memory.*
 
-Now consider a similar code example *(Listing 2)* where again the C\+\+17 parallel STL algorithm `for_each` is used to modify the elements of a `std::vector` `data` on an *executor* that will execute on a NUMA system with a number of CPU cores. However, instead of migrating `data` to have affinity with the NUMA regions, `data` is allocated within a bulk execution by an *executor* with the `bulk_execution_affinity.scatter` property applied so that `data` is allocated with affinity. Then when the `for_each` is called with the same executor, `data` maintains its affinity with the NUMA regions.
+Now consider a similar code example *(Listing 2)* where again the C\+\+17 parallel STL algorithm `for_each` is used to modify the elements of a `std::vector` `data` on an *executor* that will execute on a NUMA system with a number of CPU cores. However, instead of migrating `data` to have affinity with the NUMA regions, `data` is allocated within a bulk execution by an *executor* with the `bulk_execution_affinity.spread` property applied so that `data` is allocated with affinity. Then when the `for_each` is called with the same executor, `data` maintains its affinity with the NUMA regions.
 
 ```cpp
 // NUMA executor representing N NUMA regions.
@@ -139,9 +143,9 @@ std::vector<std::unique_ptr<float[SIZE]>> data{};
 data.reserve(N);
 
 // Require the NUMA executor to bind its allocation of memory to the underlying
-// memory resources in a scatter patter.
+// memory resources in a spread pattern.
 auto affinityExec = std::execution::require(exec,
-  bulk_execution_affinity.scatter);
+  bulk_execution_affinity.spread);
 
 // Launch a bulk execution that will allocate each unique_ptr in the vector with
 // locality to the nearest NUMA region.
@@ -263,27 +267,27 @@ constexpr memory_locality_intersection_t memory_locality_intersection;
 
 ## Bulk execution affinity properties
 
-We propose an executor property group called `bulk_execution_affinity` which contains the nested properties `none`, `balanced`, `scatter` and `compact`. Each of these properties, if applied to an *executor* provides a hint to the `executor` that requests a particular binding of *execution agents* to the *execution resources* associated with the *executor* in a particular pattern.
+We propose an executor property group called `bulk_execution_affinity` which contains the nested properties `none`, `balanced`, `spread` and `close`. Each of these properties, if applied to an *executor*, provides a hint to the *executor* that requests a particular binding of *execution agents* to the *execution resources* associated with the *executor* in a particular pattern.
 
 ### Example
 
-Below is an example *(Listing 4)* of executing a parallel task over 8 threads using `bulk_execute`, with the affinity binding `bulk_execution_affinity.scatter`. We request affinity binding using `prefer` and then check to see if the executor is able to support it using `query`.
+Below is an example *(Listing 4)* of executing a parallel task over 8 threads using `bulk_execute`, with the affinity binding `bulk_execution_affinity.spread`. We request affinity binding using `prefer` and then check to see if the executor is able to support it using `query`.
 
 ```cpp
 {
   bulk_executor exec;
 
   auto affExec = execution::prefer(exec,
-    execution::bulk_execution_affinity.scatter);
+    execution::bulk_execution_affinity.spread);
 
-  if (execution::query(affExec, execution::bulk_execution_affinity.scatter)) {
-    std::cout << "bulk_execute using bulk_execution_affinity.scatter"
+  if (execution::query(affExec, execution::bulk_execution_affinity.spread)) {
+    std::cout << "bulk_execute using bulk_execution_affinity.spread"
               << std::endl;
   }
 
-  affExec.bulk_execute([](std::size_t i, shared s) {
+  execution::bulk_execute(affExec, [](std::size_t i) {
     func(i);
-  }, 8, sharedFactory);
+  }, 8);
 }
 ```
 *Listing 4: Example of using the bulk_execution_affinity property*
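
An editorial aside on the `prefer`/`query` pattern in Listing 4: under the P0443 properties mechanism this paper builds on, `require` yields an executor that is guaranteed to have the requested property (and is ill-formed if the executor cannot support it), whereas `prefer` is only a best-effort request, which is why the listing follows it with a `query`. A minimal sketch of the distinction, reusing the paper's hypothetical `bulk_executor`:

```cpp
// Sketch only: `bulk_executor` and the property objects are the paper's
// hypothetical API, not a shipping library.
bulk_executor exec;

// require: the resulting executor must provide spread binding; if the
// executor cannot support the property at all, this is ill-formed.
auto strictExec = execution::require(
    exec, execution::bulk_execution_affinity.spread);

// prefer: a best-effort request that may be dropped, so a query is needed
// afterwards to discover whether the binding was actually applied.
auto hintedExec = execution::prefer(
    exec, execution::bulk_execution_affinity.spread);
bool applied = execution::query(
    hintedExec, execution::bulk_execution_affinity.spread);
```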
@@ -292,24 +296,33 @@ Below is an example *(Listing 4)* of executing a parallel task over 8 threads us
 
 The `bulk_execution_affinity_t` properties are a group of mutually exclusive behavioral properties (as defined in P0443 [[22]][p0443]) which provide a hint to the *executor* to, if possible, bind the *execution agents* created by a bulk invocation from an *executor*, to the underlying *execution resources* in a particular pattern relative to their physical closeness.
 
+The `bulk_execution_affinity_t` nested properties are defined using the following terms of art:
+* *Available concurrency*; which is defined as the number of *execution resources* available to an *executor* which can be bound to *execution agents* concurrently, assuming no contention.
+* *Locality distance*; which is defined as an implementation-defined metric for measuring the relative affinity between *execution resources*, whereby *execution resources* with a lower *locality distance* are likely to have similar latency in memory access operations for a given memory location.
+
+The `bulk_execution_affinity_t` nested properties also refer to the subdivision of *execution resources*, which is an implementation-defined method of subdividing the *available concurrency*, generally based on groupings of *execution resources* with the lowest *locality distance* to each other.
+
+> [*Note:* An alternative term of art for *locality distance* could be *locality interference*. *--end note*]
+
 The `bulk_execution_affinity_t` property provides nested property types and objects as described below, where:
 * `e` denotes an executor object of type `E`,
 * `f` denotes a function object of type `F&&`,
-* `s` denotes a shape object of type `execution::executor_shape<E>`, and
-* `sf` denotes a function object of type `SF`.
+* `s` denotes a shape object of type `execution::executor_shape<E>`,
+* `sf` denotes a function object of type `SF`, and
+* a call to `execution::bulk_execute(e, f, s)` creates a consecutive sequence of work-items from `0` to `s-1`, mapped to the *available concurrency* of `e`, that is, a number of *execution resources*, which are subdivided in some implementation-defined way.
 
 | Nested Property Type | Nested Property Name | Requirements |
 |----------------------|----------------------|--------------|
-| `bulk_execution_affinity_t::none_t` | `bulk_execution_affinity_t::none` | A call to `e.bulk_execute(f, s, sf)` has no requirements on the binding of *execution agents* to the underlying *execution resources*. |
-| `bulk_execution_affinity_t::scatter_t` | `bulk_execution_scatter_t::scatter` | A call to `e.bulk_execute(f, s, sf)` should aim to bind the created *execution agents* to the underlying *execution resources* (ordered by physical closeness) such that they are distributed equally across the *execution resources* in a round-robin fashion. <br><br> If the execution context associated with `e` is not able to bind the *execution agents* to the underlying *execution resources* as requested it should fall back to `bulk_execution_affinity_t::none_t`. |
-| `bulk_execution_affinity_t::compact_t` | `bulk_execution_compact_t::compact` | A call to `e.bulk_execute(f, s, sf)` should aim to bind the created *execution agents* to the underlying *execution resources* such that they are distributed as close as possible to the *execution resource* of the *thread of execution* which created them. <br><br> If the execution context associated with `e` is not able to bind the *execution agents* to the underlying *execution resources* as requested it should fall back to `bulk_execution_affinity_t::none_t`. |
-| `bulk_execution_affinity_t::balanced_t` | `bulk_execution_balanced_t::balanced` | A call to `e.bulk_execute(f, s, sf)` should aim to bind the created *execution agents* to the underlying *execution resources* (ordered by physical closeness) such that they are distributed equally across the *execution resources* in a bin packing fashion. <br><br> If the execution context associated with `e` is not able to bind the *execution agents* to the underlying *execution resources* as requested it should fall back to `bulk_execution_affinity_t::none_t`. |
+| `bulk_execution_affinity_t::none_t` | `bulk_execution_affinity_t::none` | A call to `execution::bulk_execute(e, f, s)` is not required to bind the created *execution agents* for the work-items of the iteration space specified by `s` to *execution resources*. |
+| `bulk_execution_affinity_t::spread_t` | `bulk_execution_affinity_t::spread` | A call to `execution::bulk_execute(e, f, s)` should aim to bind the created *execution agents* for the work-items of the iteration space specified by `s` to *execution resources* such that the average *locality distance* of adjacent work-items in the same subdivision of the *available concurrency* is maximized and the average *locality distance* of adjacent work-items in different subdivisions of the *available concurrency* is maximized. The binding of all *execution agents* to all *execution resources* must not result in the difference between the number of *execution agents* assigned to any *execution resource* being greater than `1`. <br><br> If `e` is not able to fulfil this aim then it should fall back to `bulk_execution_affinity_t::none_t`. |
+| `bulk_execution_affinity_t::close_t` | `bulk_execution_affinity_t::close` | A call to `execution::bulk_execute(e, f, s)` should aim to bind the created *execution agents* for the work-items of the iteration space specified by `s` to *execution resources* such that the average *locality distance* between adjacent work-items is minimized. The binding of all *execution agents* to all *execution resources* must not result in the difference between the number of *execution agents* assigned to any *execution resource* being greater than `1`. <br><br> If `e` is not able to fulfil this aim then it should fall back to `bulk_execution_affinity_t::none_t`. |
+| `bulk_execution_affinity_t::balanced_t` | `bulk_execution_affinity_t::balanced` | A call to `execution::bulk_execute(e, f, s)` should aim to bind the created *execution agents* for the work-items of the iteration space specified by `s` to *execution resources* such that the average *locality distance* of adjacent work-items in the same subdivision of the *available concurrency* is minimized and the average *locality distance* of adjacent work-items in different subdivisions of the *available concurrency* is maximized. The binding of all *execution agents* to all *execution resources* must not result in the difference between the number of *execution agents* assigned to any *execution resource* being greater than `1`. <br><br> If `e` is not able to fulfil this aim then it should fall back to `bulk_execution_affinity_t::none_t`. |
 
-> [*Note:* An implementation is free to choose how it maps individual work items to the underlying *execution resources*, providing it aims to achieve the requested affinity relationship. *--end note*]
+> [*Note:* The subdivision of the *available concurrency* is implementation-defined. *--end note*]
 
-> [*Note:* It's expected that the default value of `bulk_execution_affinity_t` for most executors be `bulk_execution_affinity_t::none_t`. *--end note*]
+> [*Note:* If the number of work-items specified by `s` is larger than the *available concurrency*, the manner in which that iteration space is subdivided into a consecutive sequence of work-items is implementation-defined. *--end note*]
 
-> [*Note:* The terms used for the `bulk_execution_affinity_t` nested properties are derived from the OpenMP properties [[33]][openmp-affinity] including the Intel specific balanced affinity binding [[34]][intel-balanced-affinity] *--end note*]
+> [*Note:* It's expected that the default value of `bulk_execution_affinity_t` for most executors be `bulk_execution_affinity_t::none_t`. *--end note*]
 
 > [*Note:* If two *executors* `e1` and `e2` invoke a bulk execution function in order, where `execution::query(e1, execution::context) == query(e2, execution::context)` is `true` and `execution::query(e1, execution::bulk_execution_affinity) == query(e2, execution::bulk_execution_affinity)` is `false`, this will likely result in `e1` binding *execution agents* if necessary to achieve the requested affinity pattern and then `e2` rebinding to achieve the new affinity pattern. Rebinding *execution agents* to *execution resources* may take substantial time and may affect performance of subsequent code. *--end note*]
 
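To make the table's wording concrete, here is a small self-contained sketch (an editorial illustration, not proposal wording) of two bindings that satisfy the `close` and `spread` aims for 8 work-items over 4 *execution resources* ordered by physical closeness: contiguous blocks keep adjacent work-items at minimal *locality distance*, while a round-robin assignment, as in the previous `scatter` wording, pushes adjacent work-items apart. Both keep the per-resource agent counts within `1` of each other, as the table requires. Real implementations may choose any mapping that satisfies the stated aims.

```cpp
// Editorial sketch: plausible `close` (block) and `spread` (round-robin)
// assignments of 8 work-items to 4 execution resources.
#include <cstddef>
#include <iostream>

int main() {
  constexpr std::size_t items = 8, resources = 4;

  for (std::size_t i = 0; i < items; ++i) {
    // close: contiguous blocks, so adjacent work-items share a resource or
    // a physically neighbouring one -> items 0..7 map to 0,0,1,1,2,2,3,3.
    std::size_t closeBinding = i / (items / resources);
    // spread: round-robin across the resource ordering, so adjacent
    // work-items land on different resources -> 0,1,2,3,0,1,2,3.
    std::size_t spreadBinding = i % resources;
    std::cout << "work-item " << i << ": close -> " << closeBinding
              << ", spread -> " << spreadBinding << '\n';
  }
}
```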
@@ -454,6 +467,10 @@ The value returned from `execution::query(e1, memory_locality_intersection_t(e2)
 
 There are a number of additional features which we are considering for inclusion in this paper but are not ready yet.
 
+## Iteration space subdivision property
+
+This proposal specifies, for the `bulk_execution_affinity_t` properties, that when the size of an invocation of `execution::bulk_execute` is greater than the *available concurrency*, it is implementation-defined how that iteration space is subdivided into a consecutive sequence of work-items. The authors of this proposal intend to propose a follow-up property for specifying how an iteration space should be subdivided into chunks in this case.
+
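To illustrate the kind of chunking such a follow-up property might control, here is a minimal editorial sketch (an assumption, not proposal wording) that subdivides an iteration space into consecutive per-agent chunks whose sizes differ by at most `1`, mirroring the constraint in the `bulk_execution_affinity_t` requirements:

```cpp
// Editorial sketch: one plausible implementation-defined subdivision of an
// iteration space of `size` work-items into consecutive chunks, one per
// agent, when `size` exceeds the available concurrency.
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

std::vector<std::pair<std::size_t, std::size_t>> // [begin, end) per agent
chunk(std::size_t size, std::size_t availableConcurrency) {
  std::vector<std::pair<std::size_t, std::size_t>> chunks;
  const std::size_t base = size / availableConcurrency;  // items every agent gets
  const std::size_t extra = size % availableConcurrency; // first `extra` agents get one more
  std::size_t begin = 0;
  for (std::size_t a = 0; a < availableConcurrency; ++a) {
    const std::size_t length = base + (a < extra ? 1 : 0);
    chunks.emplace_back(begin, begin + length);
    begin += length;
  }
  return chunks;
}

int main() {
  // 10 work-items over an available concurrency of 4 yields
  // [0,3) [3,6) [6,8) [8,10): chunk sizes differ by at most 1.
  for (auto [b, e] : chunk(10, 4))
    std::cout << '[' << b << ',' << e << ") ";
  std::cout << '\n';
}
```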
 ## Migrating data
 
 This paper currently provides a mechanism for detecting whether two *executors* share a common memory locality. However, it does not provide a way to invoke migration of data allocated local to one *executor* into the locality of another *executor*.
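
For context, the `vendor_api::migrate(data, affinityExec)` call in Listing 1 implies a shape roughly like the following; this signature is purely hypothetical, since the paper deliberately leaves the migration mechanism unspecified:

```cpp
// Hypothetical signature only -- the paper does not specify this API.
namespace vendor_api {

// Move the pages backing `data` so that they gain affinity with the
// execution resources of `exec`, e.g. in the pattern requested by the
// executor's bulk_execution_affinity property.
template <typename Range, typename Executor>
void migrate(Range& data, const Executor& exec);

} // namespace vendor_api
```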
