Commit b2e3f12

Gordon authored and committed
CP013: Make changes to the affinity paper based on feedback.
1 parent 291928f commit b2e3f12

1 file changed

+17
-17
lines changed

affinity/cpp-20/dXXX1r0.md

Lines changed: 17 additions & 17 deletions
@@ -54,7 +54,7 @@
5454

5555
# Abstract
5656

57-
This paper is the result of a request from SG1 at the 2018 San Diego meeting to split P0796: Supporting Heterogeneous & Distributed Computing Through Affinity [[35]][p0796] into two separate papers, one for the high-level interface and one for the low-level interface. This paper focusses on the high-level interface; a series of properties for querying affinity relationships and requesting affinity on work being executed. [[36]][pXXXX] focusses on the low-level interface; a mechanism for discovering the topology and affinity properties of a given system.
57+
This paper is the result of a request from SG1 at the 2018 San Diego meeting to split P0796: Supporting Heterogeneous & Distributed Computing Through Affinity [[35]][p0796] into two separate papers, one for the high-level interface and one for the low-level interface. This paper focusses on the high-level interface: a series of properties for querying affinity relationships and requesting affinity on work being executed. [[36]][pXXXX] focusses on the low-level interface: a mechanism for discovering the topology and affinity properties of a given system.
5858

5959
This paper provides an initial meta-framework for the drives toward an execution and memory affinity model for C\+\+. It accounts for feedback from the Toronto 2017 SG1 meeting on Data Movement in C\+\+ [[1]][p0687r0] that we should define affinity for C\+\+ first, before considering inaccessible memory as a solution to the separate memory problem towards supporting heterogeneous and distributed computing.
6060

@@ -75,7 +75,7 @@ The affinity problem is especially challenging for applications whose behavior c
7575

7676
Frequently, data are initialized at the beginning of the program by the initial thread and are used by multiple threads. While some OSes automatically migrate threads or data for better affinity, migration may have high overhead. In an optimal case, the OS may automatically detect which threads access which data most frequently, or it may replicate data which are read by multiple threads, or migrate data which are modified and used by threads residing on remote locality groups. However, the OS often does a reasonable job, if the machine is not overloaded, if the application carefully uses first-touch allocation, and if the program does not change its behavior with respect to locality.
7777

78-
Consider a code example *(Listing 1)* that uses the C\+\+17 parallel STL algorithm `for_each` to modify the entries of a `std::vector` `data`. The example applies a loop body in a lambda to each entry of the `std::vector` `data`, using an execution policy that distributes work in parallel across multiple CPU cores. We might expect this to be fast, but since `std::vector` containers are initialized automatically and automatically allocated on the master thread's memory, we find that it is actually quite slow even when we have more than one thread.
78+
Consider a code example *(Listing 1)* that uses the C\+\+17 parallel STL algorithm `for_each` to modify the entries of an `std::vector` `data`. The example applies a loop body in a lambda to each entry of the `std::vector` `data`, using an execution policy that distributes work in parallel across multiple CPU cores. We might expect this to be fast, but since `std::vector` containers are initialized automatically and automatically allocated on the master thread's memory, we find that it is actually quite slow even when we have more than one thread.
7979

8080
```cpp
8181
// NUMA executor representing N NUMA regions.
@@ -85,8 +85,8 @@ numa_executor exec;
8585
// of execution, (N == 0).
8686
std::vector<float> data(N * SIZE);
8787

88-
// Require the NUMA executor to bind it's migration of memory to the underlying
89-
// memory resources in a scatter patter.
88+
// Require the NUMA executor to bind its migration of memory to the underlying
89+
// memory resources in a scatter pattern.
9090
auto affinityExec = std::execution::require(exec,
9191
bulk_execution_affinity.scatter);
9292

@@ -107,8 +107,8 @@ numa_executor exec;
107107
108108
// Reserve space in a vector for a unique_ptr for each index in the bulk
109109
// execution.
110-
std::vector<std::unique_ptr<float>> data{};
111-
data.reserve(N * SIZE);
110+
std::vector<std::unique_ptr<float[SIZE]>> data{};
111+
data.reserve(N);
112112
113113
// Require the NUMA executor to bind its allocation of memory to the underlying
114114
// memory resources in a scatter pattern.
@@ -118,10 +118,10 @@ auto affinityExec = std::execution::require(exec,
118118
// Launch a bulk execution that will allocate each unique_ptr in the vector with
119119
// locality to the nearest NUMA region.
120120
affinityExec.bulk_execute([&](size_t id) {
121-
data[id] = std::make_unique<float>(0.0f); }, N * SIZE, sharedFactory);
121+
data[id] = std::make_unique<float>(); }, N, sharedFactory);
122122
123123
// Execute a for_each using the same executor so that each unique_ptr in the
124-
// vector mainains it's locality.
124+
// vector maintains its locality.
125125
std::for_each(std::execution::par.on(affinityExec), std::begin(data),
126126
std::end(data), [=](float &value) { do_something(value); });
127127
```
@@ -250,7 +250,7 @@ We propose an executor property group called `bulk_execution_affinity` which con
250250
251251
### Example
252252
253-
Below *(Listing 2)* is an example of executing a parallel task over 8 threads using `bulk_execute`, with the affinity binding `bulk_execution_affinity.scatter`.
253+
Below is an example of executing a parallel task over 8 threads using `bulk_execute`, with the affinity binding `bulk_execution_affinity.scatter`.
254254
255255
```cpp
256256
{
@@ -264,7 +264,7 @@ Below *(Listing 2)* is an example of executing a parallel task over 8 threads us
264264
}, 8, sharedFactory);
265265
}
266266
```
267-
*Listing 2: Example of using the bulk_execution_affinity property*
267+
*Listing 3: Example of using the bulk_execution_affinity property*
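
The call shape used in the example can be mimicked with a small sequential stand-in. This is only a sketch: `toy_executor`, `toy_affinity`, and the member functions `require` and `query` are hypothetical names rather than the interface proposed in the paper. It also assumes that `bulk_execute(f, n, sf)` calls `sf()` once and then `f(i, shared)` for each index, which the listing does not spell out. No binding to hardware takes place; the requested property is merely recorded.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for the bulk_execution_affinity property values.
enum class toy_affinity { none, scatter, compact, balanced };

// Hypothetical executor: runs every agent inline on the calling thread and
// only records the requested affinity instead of binding anything.
class toy_executor {
public:
  // Returns a copy carrying the requested property; *this is unchanged.
  toy_executor require(toy_affinity a) const {
    toy_executor copy = *this;
    copy.affinity_ = a;
    return copy;
  }

  toy_affinity query() const { return affinity_; }

  // Assumed semantics: build the shared state once, then call f(i, shared)
  // for every index i in [0, shape).
  template <class F, class SharedFactory>
  void bulk_execute(F&& f, std::size_t shape, SharedFactory&& sf) const {
    auto shared = std::forward<SharedFactory>(sf)();
    for (std::size_t i = 0; i < shape; ++i) {
      f(i, shared);
    }
  }

private:
  toy_affinity affinity_ = toy_affinity::none;
};

// Usage mirroring the example: 8 agents with the scatter pattern requested.
inline std::size_t sum_of_agent_indices() {
  toy_executor exec;
  auto affinityExec = exec.require(toy_affinity::scatter);
  assert(affinityExec.query() == toy_affinity::scatter);

  std::size_t total = 0;
  affinityExec.bulk_execute(
      [&](std::size_t id, int& /*shared*/) { total += id; }, 8,
      [] { return 0; });
  return total;  // 0 + 1 + ... + 7
}
```

Returning a new executor from `require` rather than mutating the original keeps the two affinity settings distinguishable, which is the situation the notes on rebinding describe.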
268268

269269
### Proposed Wording
270270

@@ -273,19 +273,19 @@ The `bulk_execution_affinity_t` property is a behavioral property as defined in
273273
The `bulk_execution_affinity_t` property provides nested property types and objects as described below, where:
274274
* `e` denotes an executor object of type `E`,
275275
* `f` denotes a function object of type `F&&`,
276-
* `s` denotes a shape object of type `execution::executor_shape<E>`,
276+
* `s` denotes a shape object of type `execution::executor_shape<E>`, and
277277
* `sf` denotes a function object of type `SF`.
278278

279279
| Nested Property Type | Nested Property Name | Requirements |
280280
|----------------------|----------------------|--------------|
281-
| bulk_execution_affinity_t::none_t | bulk_execution_affinity_t::none | A call to `e.bulk_execute(f, s, sf)` may or may not bind the created *execution agents* to the underlying *execution resources*. The affinity binding pattern may or may not be consistent across invocations of the executor's bulk execution function. |
282-
| bulk_execution_affinity_t::scatter_t | bulk_execution_scatter_t::scatter | A call to `e.bulk_execute(f, s, sf)` must bind the created *execution agents* to the underlying *execution resources* such that they are distributed across the *execution resources* where each *execution agent* far from it's preceding and following *execution agents*. The affinity binding pattern must be consistent across invocations of the executor's bulk execution function. |
283-
| bulk_execution_affinity_t::compact_t | bulk_execution_compact_t::compact | A call to `e.bulk_execute(f, s, sf)` must bind the created *execution agents* to the underlying *execution resources* such that they are in sequence across the *execution resources* where each *execution agent* close to it's preceding and following *execution agents*. The affinity binding pattern must be consistent across invocations of the executor's bulk execution function. |
284-
| bulk_execution_affinity_t::balanced_t | bulk_execution_balanced_t::balanced | A call to `e.bulk_execute(f, s, sf)` must bind the created *execution agents* to the underlying *execution resources* such that they are in sequence and evenly spread across the *execution resources* where each *execution agent* is close to it's preceding and following *execution agents* and all *execution resources* are utilized. The affinity binding pattern must be consistent across invocations of the executor's bulk execution function. |
281+
| `bulk_execution_affinity_t::none_t` | `bulk_execution_affinity_t::none` | A call to `e.bulk_execute(f, s, sf)` has no requirements on the binding of *execution agents* to the underlying *execution resources*. |
282+
| `bulk_execution_affinity_t::scatter_t` | `bulk_execution_scatter_t::scatter` | A call to `e.bulk_execute(f, s, sf)` must bind the created *execution agents* to the underlying *execution resources* such that they are distributed sparsely across the *execution resources*. The affinity binding pattern must be consistent across invocations of the executor's bulk execution function. |
283+
| `bulk_execution_affinity_t::compact_t` | `bulk_execution_compact_t::compact` | A call to `e.bulk_execute(f, s, sf)` must bind the created *execution agents* to the underlying *execution resources* such that they are distributed as close as possible to the *execution resource* of the *thread of execution* which created them. The affinity binding pattern must be consistent across invocations of the executor's bulk execution function. |
284+
| `bulk_execution_affinity_t::balanced_t` | `bulk_execution_balanced_t::balanced` | A call to `e.bulk_execute(f, s, sf)` must bind the created *execution agents* to the underlying *execution resources* such that they are collected into groups, each group is distributed sparsely across the *execution resources* and the *execution agents* within each group are distributed as close as possible to the first *execution resource* of that group. The affinity binding pattern must be consistent across invocations of the executor's bulk execution function. |
285285

286286
> [*Note:* The requirements of the `bulk_execution_affinity_t` nested properties do not enforce a specific binding, simply that the binding follows the requirements set out above and that the pattern is consistent across invocations of the bulk execution functions. *--end note*]
287287

288-
> [*Note:* If two *executors* `e1` and `e2` invoke a bulk execution function in order, where `execution::query(e1, execution::context) == query(e2, execution::context)` is `true` and `execution::query(e1, execution::bulk_execution_affinity) == query(e2, execution::bulk_execution_affinity)` is `false`, this will likely result in `e1` binding *execution agents* if necessary to achieve the requested affinity pattern and then `e2` rebinding to achieve the new affinity pattern. *--end note*]
288+
> [*Note:* If two *executors* `e1` and `e2` invoke a bulk execution function in order, where `execution::query(e1, execution::context) == query(e2, execution::context)` is `true` and `execution::query(e1, execution::bulk_execution_affinity) == query(e2, execution::bulk_execution_affinity)` is `false`, this will likely result in `e1` binding *execution agents* if necessary to achieve the requested affinity pattern and then `e2` rebinding to achieve the new affinity pattern. Rebinding *execution agents* to *execution resources* may take substantial time and may affect performance of subsequent code. *--end note*]
289289

290290
> [*Note:* The terms used for the `bulk_execution_affinity_t` nested properties are derived from the OpenMP properties [[33]][openmp-affinity] including the Intel-specific balanced affinity binding [[34]][intel-balanced-affinity]. *--end note*]
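
To make the three binding patterns concrete, the following sketch maps agent indices to resource indices on a hypothetical two-level machine. Everything here is an assumption for illustration: the `topology` struct, the `bind_agents` function, the flat resource numbering (`group * per_group + slot`), and the choice of resource 0 as the location of the creating thread. Since the wording does not enforce a specific binding, this is one mapping that appears to satisfy the requirements, not the mapping an implementation must produce.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names for the three binding patterns in the table.
enum class affinity_pattern { scatter, compact, balanced };

// Hypothetical machine: `groups` locality groups (e.g. NUMA regions), each
// holding `per_group` execution resources numbered group * per_group + slot.
struct topology {
  std::size_t groups;
  std::size_t per_group;
};

// Returns, for each agent index, the resource index it would be bound to.
std::vector<std::size_t> bind_agents(affinity_pattern pattern,
                                     std::size_t agents, topology t) {
  assert(t.groups > 0 && t.per_group > 0);
  const std::size_t total = t.groups * t.per_group;
  // Agents per locality group for the balanced pattern, rounded up.
  const std::size_t chunk = (agents + t.groups - 1) / t.groups;

  std::vector<std::size_t> binding(agents);
  for (std::size_t i = 0; i < agents; ++i) {
    switch (pattern) {
    case affinity_pattern::compact:
      // Fill resources in order, starting next to the creating thread
      // (assumed to sit on resource 0).
      binding[i] = i % total;
      break;
    case affinity_pattern::scatter:
      // Round-robin over locality groups so neighbouring agents are far apart.
      binding[i] = (i % t.groups) * t.per_group + (i / t.groups) % t.per_group;
      break;
    case affinity_pattern::balanced: {
      // Split agents into contiguous chunks, one chunk per locality group,
      // and pack each chunk from the first resource of its group.
      const std::size_t group = i / chunk;
      const std::size_t slot = (i % chunk) % t.per_group;
      binding[i] = group * t.per_group + slot;
      break;
    }
    }
  }
  return binding;
}
```

With 4 agents on 2 groups of 4 resources, this sketch yields `{0, 1, 2, 3}` for compact, `{0, 4, 1, 5}` for scatter, and `{0, 1, 4, 5}` for balanced. Because the mapping depends only on its arguments, repeated calls give the same result, which matches the consistency requirement in the table.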
291291

@@ -400,7 +400,7 @@ The value returned from `execution::query(e1, memory_locality_intersection_t(e2)
400400

401401
## Who should have control over bulk execution affinity?
402402

403-
This paper currently proposes the `bulk_execution_affinity_t` properties and it's nested properties for allowing an *executor* to make guarantees as to how *execution agents* are bound to the underlying *execution resources*. However providing control at this level may lead to *execution agents* being bound to *execution resources* within a critical path. A possible solution to this is to allow the *execution context* to be configured with `bulk_execution_affinity_t` nested properties, either instead of the *executor* property or in addition. This would allow the binding of *threads of execution* to be performed at the time of the *execution context* creation.
403+
This paper currently proposes the `bulk_execution_affinity_t` property and its nested properties for allowing an *executor* to make guarantees as to how *execution agents* are bound to the underlying *execution resources*. However, providing control at this level may lead to *execution agents* being bound to *execution resources* within a critical path. A possible solution to this is to allow the *execution context* to be configured with `bulk_execution_affinity_t` nested properties, either instead of the *executor* property or in addition. This would allow the binding of *threads of execution* to be performed at the time of the *execution context* creation.
404404

405405
| Straw Poll |
406406
|------------|
