Update the wording of the bulk_execution_affinity properties in P1436 (#112)
* Rename `bulk_execution_affinity_t::scatter_t` to `bulk_execution_affinity_t::spread_t`.
* Rename `bulk_execution_affinity_t::compact_t` to `bulk_execution_affinity_t::close_t`.
* Refine the wording of the `bulk_execution_affinity_t` properties to clarify the requirements on binding and chunking, based on feedback from SG1.
* Add a note about a follow-on paper proposing a chunking property.
* Add some additional wording to describe the terms of art used in the wording.
affinity/cpp-23/d1436r3.md (40 additions, 23 deletions)
@@ -18,6 +18,10 @@ This paper is the result of discussions from many contributors within SG1, SG14 a
### P1436r3 (PRA 2020)
* Rename `bulk_execution_affinity_t::scatter_t` to `bulk_execution_affinity_t::spread_t`.
* Rename `bulk_execution_affinity_t::compact_t` to `bulk_execution_affinity_t::close_t`.
* Refine the wording of the `bulk_execution_affinity_t` properties to clarify the requirements on binding and chunking, based on feedback from SG1.
### P1436r2 (BEL 2019)
* Alter the wording on the `bulk_execution_affinity_t` properties so they are now hints that request the executor provide a particular pattern of binding, rather than a guarantee.
@@ -101,7 +105,7 @@ The affinity interface we propose should help computers achieve a much higher fr
To identify the requirements for supporting affinity, we have looked at a number of use cases where affinity between memory locality and execution can provide better performance.
Consider the following code example *(Listing 1)* where the C++17 parallel STL algorithm `for_each` is used to modify the elements of a `std::vector` `data` on an *executor* that will execute on a NUMA system with a number of CPU cores. However, the memory is allocated by the `std::vector` default allocator immediately during the construction of `data`, on memory local to the calling thread of execution. This means that the memory allocated for `data` may have poor locality to all of the NUMA regions on the system other than the one in which the constructor executed. Therefore, accesses in the parallel `for_each` made by threads in other NUMA regions will incur high latency. In this example, this is avoided by migrating `data` to have better affinity with the NUMA regions on the system using an *executor* with the `bulk_execution_affinity.spread` property applied, before it is accessed by the `for_each`. Note that a mechanism for migration is not yet specified in this paper, so this example currently uses an arbitrary vendor API, `vendor_api::migrate`. Our intention is that a future revision of this paper will specify a standard mechanism for migration.
```cpp
// NUMA executor representing N NUMA regions.
numa_executor exec;

std::vector<float> data(N * SIZE);

// Require the NUMA executor to bind its migration of memory to the underlying
// memory resources in a spread pattern.
auto affinityExec = std::execution::require(exec,
                                            bulk_execution_affinity.spread);

// Migrate the memory allocated for the vector across the NUMA regions in a
// spread pattern.
vendor_api::migrate(data, affinityExec);

// Placement of data is local to NUMA region 0, so data for execution on other
```
Now consider a similar code example *(Listing 2)* where again the C++17 parallel STL algorithm `for_each` is used to modify the elements of a `std::vector` `data` on an *executor* that will execute on a NUMA system with a number of CPU cores. However, instead of migrating `data` to have affinity with the NUMA regions, `data` is allocated within a bulk execution by an *executor* with the `bulk_execution_affinity.spread` property applied, so that `data` is allocated with affinity. Then, when the `for_each` is called with the same executor, `data` maintains its affinity with the NUMA regions.
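The diff does not include Listing 2 itself, so below is only a rough first-touch sketch of the pattern this paragraph describes. The `bulk_execute` signature follows Listing 4 later in this paper; `numa_executor`, `N`, `SIZE`, `shared` and `sf` are reused from the paper's other listings, and the raw array standing in for `std::vector` is my assumption, chosen so that construction does not touch the pages on the calling thread.

```cpp
#include <cstddef>
#include <memory>

// Rough sketch only, not the paper's Listing 2.
numa_executor exec;

auto affinityExec = std::execution::require(exec,
                                            bulk_execution_affinity.spread);

// Allocate storage without writing to it; default-initialized floats are not
// written, so no pages are faulted in on the calling thread yet.
std::unique_ptr<float[]> data(new float[N * SIZE]);

// First-touch each element from within the bulk execution, so each page is
// faulted in (and therefore placed) local to the execution resource that the
// spread binding assigns to its work-item.
affinityExec.bulk_execute([&](std::size_t i, shared s) {
  data[i] = 0.0f;
}, N * SIZE, sf);
```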
We propose an executor property group called `bulk_execution_affinity` which contains the nested properties `none`, `balanced`, `spread` and `close`. Each of these properties, if applied to an *executor*, provides a hint to the *executor* that requests a particular binding of *execution agents* to the *execution resources* associated with the *executor*, in a particular pattern.
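For orientation, here is a minimal sketch of what such a property group could look like using P0443-style nested properties; this is an assumption for illustration, not the proposal's normative definition.

```cpp
// Minimal sketch, assuming P0443-style property machinery; the authoritative
// definition lives in the proposals themselves.
struct bulk_execution_affinity_t {
  struct none_t {};
  struct balanced_t {};
  struct spread_t {};
  struct close_t {};

  static constexpr none_t none{};
  static constexpr balanced_t balanced{};
  static constexpr spread_t spread{};
  static constexpr close_t close{};
};

inline constexpr bulk_execution_affinity_t bulk_execution_affinity{};
```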
### Example
Below is an example *(Listing 4)* of executing a parallel task over 8 threads using `bulk_execute`, with the affinity binding `bulk_execution_affinity.spread`. We request affinity binding using `prefer` and then check to see if the executor is able to support it using `query`.
```cpp
{
  bulk_executor exec;

  auto affExec = execution::prefer(exec,
      execution::bulk_execution_affinity.spread);

  if (execution::query(affExec, execution::bulk_execution_affinity.spread)) {
    std::cout << "bulk_execute using bulk_execution_affinity.spread"
              << std::endl;
  }

  affExec.bulk_execute([](std::size_t i, shared s) {
    // Per-work-item task; the remainder of the listing is elided in the diff.
  }, 8, sf);
}
```
*Listing 4: Example of using the bulk_execution_affinity property*
@@ -292,24 +296,33 @@ Below is an example *(Listing 4)* of executing a parallel task over 8 threads us
The `bulk_execution_affinity_t` properties are a group of mutually exclusive behavioral properties (as defined in P0443 [[22]][p0443]) which provide a hint to the *executor* to, if possible, bind the *execution agents* created by a bulk invocation to the underlying *execution resources* in a particular pattern relative to their physical closeness.
The `bulk_execution_affinity_t` nested properties are defined using the following terms of art:
* **Available concurrency**, which is defined as the number of *execution resources* available to an *executor* which can be bound to *execution agents* concurrently, assuming no contention.
* **Locality distance**, which is defined as an implementation-defined metric for measuring the relative affinity between *execution resources*, whereby *execution resources* with a lower *locality distance* are likely to have similar latency in memory access operations for a given memory location.
The `bulk_execution_affinity_t` nested properties also refer to the subdivision of *execution resources*, which is an implementation-defined method of subdividing the *available concurrency*, generally based on groupings of *execution resources* with the lowest *locality distance* to each other.
> [*Note:* An alternative term of art for *locality distance* could be *locality interference*. *--end note*]
The `bulk_execution_affinity_t` property provides nested property types and objects as described below, where:
* `e` denotes an executor object of type `E`,
* `f` denotes a function object of type `F&&`,
* `s` denotes a shape object of type `execution::executor_shape<E>`,
* `sf` denotes a function object of type `SF`, and
* a call to `execution::bulk_execute(e, f, s)` creates a consecutive sequence of work-items from `0` to `s-1`, mapped to the *available concurrency* of `e`, that is, to a number of *execution resources*, which are subdivided in some implementation-defined way (a toy subdivision is sketched below).
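As a toy illustration of one possible implementation-defined subdivision (the helper below is hypothetical and not part of the proposal's wording), the iteration space can be split into near-equal consecutive chunks, one per execution resource:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: split work-items [0, s) into c consecutive chunks whose
// sizes differ by at most 1. Resource r receives the work-items
// [chunk_begin(r, s, c), chunk_begin(r + 1, s, c)).
std::size_t chunk_begin(std::size_t r, std::size_t s, std::size_t c) {
  return r * (s / c) + std::min(r, s % c);
}
```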
| Nested Property Type | Nested Property Name | Requirements |
|----------------------|----------------------|--------------|
| `bulk_execution_affinity_t::none_t` | `bulk_execution_affinity_t::none` | A call to `execution::bulk_execute(e, f, s)` is not required to bind the created *execution agents* for the work-items of the iteration space specified by `s` to *execution resources*. |
| `bulk_execution_affinity_t::spread_t` | `bulk_execution_affinity_t::spread` | A call to `e.bulk_execute(f, s, sf)` should aim to bind the created *execution agents* for the work-items of the iteration space specified by `s` to *execution resources* such that the average locality distance of adjacent work-items in the same subdivision of the available concurrency is maximized and the average locality distance of adjacent work-items in different subdivisions of the available concurrency is maximized. The binding of all *execution agents* to all *execution resources* must not result in the difference between the number of *execution agents* assigned to any *execution resources* being greater than `1`. <br><br> If `e` is not able to fulfil this aim, it should fall back to `bulk_execution_affinity_t::none_t`. |
| `bulk_execution_affinity_t::close_t` | `bulk_execution_affinity_t::close` | A call to `e.bulk_execute(f, s, sf)` should aim to bind the created *execution agents* for the work-items of the iteration space specified by `s` to *execution resources* such that the average locality distance between adjacent work-items is minimized. The binding of all *execution agents* to all *execution resources* must not result in the difference between the number of *execution agents* assigned to any *execution resources* being greater than `1`. <br><br> If `e` is not able to fulfil this aim, it should fall back to `bulk_execution_affinity_t::none_t`. |
| `bulk_execution_affinity_t::balanced_t` | `bulk_execution_affinity_t::balanced` | A call to `e.bulk_execute(f, s, sf)` should aim to bind the created *execution agents* for the work-items of the iteration space specified by `s` to *execution resources* such that the average locality distance of adjacent work-items in the same subdivision of the available concurrency is minimized and the average locality distance of adjacent work-items in different subdivisions of the available concurrency is maximized. The binding of all *execution agents* to all *execution resources* must not result in the difference between the number of *execution agents* assigned to any *execution resources* being greater than `1`. <br><br> If `e` is not able to fulfil this aim, it should fall back to `bulk_execution_affinity_t::none_t`. |
> [*Note:* The subdivision of the available concurrency is implementation-defined. *--end note*]
> [*Note:* If the number of work-items specified by `s` is larger than the available concurrency, the manner in which that iteration space is subdivided into a consecutive sequence of work-items is implementation-defined. *--end note*]
> [*Note:* It's expected that the default value of `bulk_execution_affinity_t` for most executors will be `bulk_execution_affinity_t::none_t`. *--end note*]
> [*Note:* If two *executors* `e1` and `e2` invoke a bulk execution function in order, where `execution::query(e1, execution::context) == query(e2, execution::context)` is `true` and `execution::query(e1, execution::bulk_execution_affinity) == query(e2, execution::bulk_execution_affinity)` is `false`, this will likely result in `e1` binding *execution agents* if necessary to achieve the requested affinity pattern and then `e2` rebinding to achieve the new affinity pattern. Rebinding *execution agents* to *execution resources* may take substantial time and may affect performance of subsequent code. *--end note*]
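To make the difference between the patterns concrete, here is an illustrative (not normative) mapping of 8 execution agents onto 4 execution resources ordered by physical closeness; the actual binding is implementation-defined, and the names `spread` and `close` below are just local variables demonstrating the two patterns.

```cpp
#include <cstddef>
#include <cstdio>

int main() {
  constexpr std::size_t agents = 8, resources = 4;
  for (std::size_t i = 0; i < agents; ++i) {
    // spread: adjacent agents land on different resources (round-robin).
    std::size_t spread = i % resources;
    // close: adjacent agents land on the same or neighbouring resources.
    std::size_t close = i / (agents / resources);
    std::printf("agent %zu: spread -> %zu, close -> %zu\n", i, spread, close);
  }
}
```

Both assignments keep the per-resource agent counts within one of each other, matching the evenness requirement in the table above.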
@@ -454,6 +467,10 @@ The value returned from `execution::query(e1, memory_locality_intersection_t(e2)
There are a number of additional features which we are considering for inclusion in this paper but which are not yet ready.
## Iteration space subdivision property
This proposal defines, for the `bulk_execution_affinity_t` properties, that when the size in an invocation of `execution::bulk_execute` is greater than the *available concurrency*, it is implementation-defined how that iteration space is subdivided into a consecutive sequence of work-items. The authors of this proposal intend to propose a follow-up property for specifying how an iteration space should be subdivided into chunks in this case.
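Purely as a hypothetical illustration of the direction such a follow-up property might take (neither the name nor the interface below appears in any published proposal; `exec` is reused from the earlier listings):

```cpp
#include <cstddef>

// Hypothetical chunking property; nothing here is proposed wording.
struct bulk_execution_chunk_size_t {
  std::size_t value;  // requested number of consecutive work-items per chunk
};

// Usage sketch: prefer 64-element chunks when the iteration space is
// subdivided across the available concurrency.
auto chunkedExec = std::execution::prefer(exec, bulk_execution_chunk_size_t{64});
```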
## Migrating data
This paper currently provides a mechanism for detecting whether two *executors* share a common memory locality. However, it does not provide a way to invoke migration of data allocated local to one *executor* into the locality of another *executor*.
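One hypothetical shape such a migration interface could take, mirroring the `vendor_api::migrate` call used in Listing 1 (the signature below is my assumption, not proposed wording):

```cpp
// Hypothetical: rebind the pages backing `range` to the memory resources with
// the lowest locality distance to the execution resources of `e`.
template <typename Range, typename Executor>
void migrate(Range& range, const Executor& e);

// Usage sketch, as in Listing 1 but without a vendor API:
//   std::execution::migrate(data, affinityExec);
```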