affinity/cpp-20/dXXX1r0.md
+17 -17 (17 additions & 17 deletions)
@@ -54,7 +54,7 @@
54
54
55
55
# Abstract
56
56
57
-
This paper is the result of a request from SG1 at the 2018 San Diego meeting to split P0796: Supporting Heterogeneous & Distributed Computing Through Affinity [[35]][p0796] into two separate papers, one for the high-level interface and one for the low-level interface. This paper focusses on the high-level interface; a series of properties for querying affinity relationships and requesting affinity on work being executed. [[36]][pXXXX] focusses on the low-level interface; a mechanism for discovering the topology and affinity properties of a given system.
57
+
This paper is the result of a request from SG1 at the 2018 San Diego meeting to split P0796: Supporting Heterogeneous & Distributed Computing Through Affinity [[35]][p0796] into two separate papers, one for the high-level interface and one for the low-level interface. This paper focusses on the high-level interface: a series of properties for querying affinity relationships and requesting affinity on work being executed. [[36]][pXXXX] focusses on the low-level interface: a mechanism for discovering the topology and affinity properties of a given system.
58
58
59
59
This paper provides an initial meta-framework for the drives toward an execution and memory affinity model for C\+\+. It accounts for feedback from the Toronto 2017 SG1 meeting on Data Movement in C\+\+[[1]][p0687r0] that we should define affinity for C\+\+ first, before considering inaccessible memory as a solution to the separate memory problem towards supporting heterogeneous and distributed computing.
60
60
@@ -75,7 +75,7 @@ The affinity problem is especially challenging for applications whose behavior c
75
75
76
76
Frequently, data are initialized at the beginning of the program by the initial thread and are used by multiple threads. While some OSes automatically migrate threads or data for better affinity, migration may have high overhead. In an optimal case, the OS may automatically detect which threads access which data most frequently, or it may replicate data which are read by multiple threads, or migrate data which are modified and used by threads residing on remote locality groups. However, the OS often does a reasonable job only if the machine is not overloaded, if the application carefully uses first-touch allocation, and if the program does not change its behavior with respect to locality.
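As a rough illustration of the first-touch allocation mentioned above, the sketch below allocates a buffer without writing to it and then lets each worker thread perform the first write to its own contiguous chunk. `first_touch_allocate` is a hypothetical helper name, not something from this paper or from any library. Whether the pages actually end up local to the writing thread depends on the OS placement policy and on where the threads are scheduled, neither of which this code controls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical helper (an assumption for illustration, not from the paper):
// allocate `count` floats and let `workers` threads each perform the first
// write to their own contiguous chunk.
std::unique_ptr<float[]> first_touch_allocate(std::size_t count,
                                              unsigned workers) {
    assert(workers > 0);
    // new float[count] default-initializes, so the elements are not written
    // by the calling thread here.
    std::unique_ptr<float[]> data(new float[count]);

    const std::size_t chunk = (count + workers - 1) / workers;
    std::vector<std::thread> threads;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(count, w * chunk);
        const std::size_t end = std::min(count, begin + chunk);
        // Each thread is the first to write to the range [begin, end).
        threads.emplace_back([&data, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                data[i] = 1.0f;
            }
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    return data;
}
```

A later parallel loop benefits only if it partitions the data the same way and its threads run on the same locality groups, which is the guarantee the affinity properties in this paper aim to express.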
77
77
78
-
Consider a code example *(Listing 1)* that uses the C\+\+17 parallel STL algorithm `for_each` to modify the entries of a `std::vector` `data`. The example applies a loop body in a lambda to each entry of the `std::vector` `data`, using an execution policy that distributes work in parallel across multiple CPU cores. We might expect this to be fast, but since `std::vector` containers are initialized automatically and automatically allocated on the master thread's memory, we find that it is actually quite slow even when we have more than one thread.
78
+
Consider a code example *(Listing 1)* that uses the C\+\+17 parallel STL algorithm `for_each` to modify the entries of an `std::vector` `data`. The example applies a loop body in a lambda to each entry of the `std::vector` `data`, using an execution policy that distributes work in parallel across multiple CPU cores. We might expect this to be fast, but since `std::vector` containers are initialized automatically and automatically allocated on the master thread's memory, we find that it is actually quite slow even when we have more than one thread.
79
79
80
80
```cpp
81
81
// NUMA executor representing N NUMA regions.
@@ -85,8 +85,8 @@ numa_executor exec;
85
85
// of execution, (N == 0).
86
86
std::vector<float> data(N * SIZE);
87
87
88
-
// Require the NUMA executor to bind it's migration of memory to the underlying
89
-
// memory resources in a scatter patter.
88
+
// Require the NUMA executor to bind its migration of memory to the underlying
89
+
// memory resources in a scatter pattern.
90
90
auto affinityExec = std::execution::require(exec,
91
91
bulk_execution_affinity.scatter);
92
92
@@ -107,8 +107,8 @@ numa_executor exec;
107
107
108
108
// Reserve space in a vector for a unique_ptr for each index in the bulk
109
109
// execution.
110
-
std::vector<std::unique_ptr<float>> data{};
111
-
data.reserve(N * SIZE);
110
+
std::vector<std::unique_ptr<float[]>> data{};
111
+
data.resize(N);
112
112
113
113
// Require the NUMA executor to bind its allocation of memory to the underlying
114
114
// memory resources in a scatter pattern.
@@ -118,10 +118,10 @@ auto affinityExec = std::execution::require(exec,
118
118
// Launch a bulk execution that will allocate each unique_ptr in the vector with
119
119
// locality to the nearest NUMA region.
120
120
affinityExec.bulk_execute([&](size_t id) {
121
-
data[id] = std::make_unique<float>(0.0f); }, N * SIZE, sharedFactory);
121
+
data[id] = std::make_unique<float[]>(SIZE); }, N, sharedFactory);
122
122
123
123
// Execute a for_each using the same executor so that each unique_ptr in the
@@ -250,7 +250,7 @@ We propose an executor property group called `bulk_execution_affinity` which con
250
250
251
251
### Example
252
252
253
-
Below *(Listing 2)* is an example of executing a parallel task over 8 threads using `bulk_execute`, with the affinity binding `bulk_execution_affinity.scatter`.
253
+
Below is an example of executing a parallel task over 8 threads using `bulk_execute`, with the affinity binding `bulk_execution_affinity.scatter`.
254
254
255
255
```cpp
256
256
{
@@ -264,7 +264,7 @@ Below *(Listing 2)* is an example of executing a parallel task over 8 threads us
264
264
}, 8, sharedFactory);
265
265
}
266
266
```
267
-
*Listing 2: Example of using the bulk_execution_affinity property*
267
+
*Listing 3: Example of using the bulk_execution_affinity property*
268
268
269
269
### Proposed Wording
270
270
@@ -273,19 +273,19 @@ The `bulk_execution_affinity_t` property is a behavioral property as defined in
273
273
The `bulk_execution_affinity_t` property provides nested property types and objects as described below, where:
274
274
* `e` denotes an executor object of type `E`,
275
275
* `f` denotes a function object of type `F&&`,
276
-
* `s` denotes a shape object of type `execution::executor_shape<E>`,
276
+
* `s` denotes a shape object of type `execution::executor_shape<E>`, and
277
277
* `sf` denotes a function object of type `SF`.
278
278
279
279
| Nested Property Type | Nested Property Name | Requirements |
| bulk_execution_affinity_t::none_t | bulk_execution_affinity_t::none | A call to `e.bulk_execute(f, s, sf)` may or may not bind the created *execution agents* to the underlying *execution resources*. The affinity binding pattern may or may not be consistent across invocations of the executor's bulk execution function. |
282
-
| bulk_execution_affinity_t::scatter_t | bulk_execution_scatter_t::scatter | A call to `e.bulk_execute(f, s, sf)` must bind the created *execution agents* to the underlying *execution resources* such that they are distributed across the *execution resources* where each *execution agent* far from it's preceding and following *execution agents*. The affinity binding pattern must be consistent across invocations of the executor's bulk execution function. |
283
-
| bulk_execution_affinity_t::compact_t | bulk_execution_compact_t::compact | A call to `e.bulk_execute(f, s, sf)` must bind the created *execution agents* to the underlying *execution resources* such that they are in sequence across the *execution resources* where each *execution agent* close to it's preceding and following *execution agents*. The affinity binding pattern must be consistent across invocations of the executor's bulk execution function. |
284
-
| bulk_execution_affinity_t::balanced_t | bulk_execution_balanced_t::balanced | A call to `e.bulk_execute(f, s, sf)` must bind the created *execution agents* to the underlying *execution resources* such that they are in sequence and evenly spread across the *execution resources* where each *execution agent* is close to it's preceding and following *execution agents* and all *execution resources* are utilized. The affinity binding pattern must be consistent across invocations of the executor's bulk execution function. |
281
+
| `bulk_execution_affinity_t::none_t` | `bulk_execution_affinity_t::none` | A call to `e.bulk_execute(f, s, sf)` has no requirements on the binding of *execution agents* to the underlying *execution resources*. |
282
+
| `bulk_execution_affinity_t::scatter_t` | `bulk_execution_affinity_t::scatter` | A call to `e.bulk_execute(f, s, sf)` must bind the created *execution agents* to the underlying *execution resources* such that they are distributed sparsely across the *execution resources*. The affinity binding pattern must be consistent across invocations of the executor's bulk execution function. |
283
+
| `bulk_execution_affinity_t::compact_t` | `bulk_execution_affinity_t::compact` | A call to `e.bulk_execute(f, s, sf)` must bind the created *execution agents* to the underlying *execution resources* such that they are distributed as close as possible to the *execution resource* of the *thread of execution* which created them. The affinity binding pattern must be consistent across invocations of the executor's bulk execution function. |
284
+
| `bulk_execution_affinity_t::balanced_t` | `bulk_execution_affinity_t::balanced` | A call to `e.bulk_execute(f, s, sf)` must bind the created *execution agents* to the underlying *execution resources* such that they are collected into groups, each group is distributed sparsely across the *execution resources*, and the *execution agents* within each group are distributed as close as possible to the first *execution resource* of that group. The affinity binding pattern must be consistent across invocations of the executor's bulk execution function. |
285
285
286
286
> [*Note:* The requirements of the `bulk_execution_affinity_t` nested properties do not enforce a specific binding, simply that the binding follows the requirements set out above and that the pattern is consistent across invocations of the bulk execution functions. *--end note*]
287
287
288
-
> [*Note:* If two *executors* `e1` and `e2` invoke a bulk execution function in order, where `execution::query(e1, execution::context) == query(e2, execution::context)` is `true` and `execution::query(e1, execution::bulk_execution_affinity) == query(e2, execution::bulk_execution_affinity)` is `false`, this will likely result in `e1` binding *execution agents* if necessary to achieve the requested affinity pattern and then `e2` rebinding to achieve the new affinity pattern. *--end note*]
288
+
> [*Note:* If two *executors* `e1` and `e2` invoke a bulk execution function in order, where `execution::query(e1, execution::context) == query(e2, execution::context)` is `true` and `execution::query(e1, execution::bulk_execution_affinity) == query(e2, execution::bulk_execution_affinity)` is `false`, this will likely result in `e1` binding *execution agents* if necessary to achieve the requested affinity pattern and then `e2` rebinding to achieve the new affinity pattern. Rebinding *execution agents* to *execution resources* may take substantial time and may affect performance of subsequent code. *--end note*]
289
289
290
290
> [*Note:* The terms used for the `bulk_execution_affinity_t` nested properties are derived from the OpenMP properties [[33]][openmp-affinity] including the Intel-specific balanced affinity binding [[34]][intel-balanced-affinity]. *--end note*]
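
To make the three binding patterns more concrete, the following sketch maps an *execution agent* index to a resource index on a toy machine. Everything in it is an assumption made for illustration: the `topology` struct, the three function names, and the flat model of `nodes` groups with `cores` resources each are hypothetical and not part of the proposed wording. As the note above says, an implementation is free to choose any binding that satisfies the requirements, so this is only one possible reading.

```cpp
#include <cassert>
#include <cstddef>

// Toy machine model (an assumption for illustration): `nodes` NUMA nodes,
// each with `cores` cores; resource id = node * cores + core.
struct topology {
    std::size_t nodes;
    std::size_t cores;
};

// scatter: consecutive agents are placed on different nodes (round-robin),
// moving to the next core only after every node has received an agent.
std::size_t scatter_resource(topology t, std::size_t agent) {
    assert(t.nodes > 0 && t.cores > 0);
    const std::size_t node = agent % t.nodes;
    const std::size_t core = (agent / t.nodes) % t.cores;
    return node * t.cores + core;
}

// compact: agents fill resources in order starting at resource 0, which
// stands in here for the resource of the creating thread of execution.
std::size_t compact_resource(topology t, std::size_t agent) {
    assert(t.nodes > 0 && t.cores > 0);
    return agent % (t.nodes * t.cores);
}

// balanced: agents are split into one contiguous group per node; within a
// group, agents fill that node's cores in order.
std::size_t balanced_resource(topology t, std::size_t agent,
                              std::size_t agents) {
    assert(t.nodes > 0 && t.cores > 0 && agent < agents);
    const std::size_t group_size = (agents + t.nodes - 1) / t.nodes;
    const std::size_t node = agent / group_size;
    const std::size_t core = (agent % group_size) % t.cores;
    return node * t.cores + core;
}
```

For four agents on `topology{2, 4}`, this model places them on resources 0, 4, 1, 5 under scatter, on 0, 1, 2, 3 under compact, and on 0, 1, 4, 5 under balanced.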
291
291
@@ -400,7 +400,7 @@ The value returned from `execution::query(e1, memory_locality_intersection_t(e2)
400
400
401
401
## Who should have control over bulk execution affinity?
402
402
403
-
This paper currently proposes the `bulk_execution_affinity_t` properties and it's nested properties for allowing an *executor* to make guarantees as to how *execution agents* are bound to the underlying *execution resources*. However providing control at this level may lead to *execution agents* being bound to *execution resources* within a critical path. A possible solution to this is to allow the *execution context* to be configured with `bulk_execution_affinity_t` nested properties, either instead of the *executor* property or in addition. This would allow the binding of *threads of execution* to be performed at the time of the *execution context* creation.
403
+
This paper currently proposes the `bulk_execution_affinity_t` property and its nested properties for allowing an *executor* to make guarantees as to how *execution agents* are bound to the underlying *execution resources*. However, providing control at this level may lead to *execution agents* being bound to *execution resources* within a critical path. A possible solution to this is to allow the *execution context* to be configured with `bulk_execution_affinity_t` nested properties, either instead of the *executor* property or in addition to it. This would allow the binding of *threads of execution* to be performed at the time of the *execution context* creation.