Commit bd12f57

Update the wording of the bulk_execution_affinity properties in P1436 (#112)
* Rename `bulk_execution_affinity_t::scatter_t` to `bulk_execution_affinity_t::spread_t`.
* Rename `bulk_execution_affinity_t::compact_t` to `bulk_execution_affinity_t::close_t`.
* Refine the wording of the `bulk_execution_affinity_t` properties to clarify the requirements on binding and chunking, based on feedback from SG1.
* Add a note about a follow-on paper proposing a chunking property.
* Add some additional wording to describe the terms of art used in the wording.
1 parent 6c369dd commit bd12f57

1 file changed: affinity/cpp-23/d1436r3.md (+40 −23 lines)

@@ -18,6 +18,10 @@ This paper is the result of discussions from many contributors within SG1, SG14 a
 
 ### P1436r3 (PRA 2020)
 
+* Rename `bulk_execution_affinity_t::scatter_t` to `bulk_execution_affinity_t::spread_t`.
+* Rename `bulk_execution_affinity_t::compact_t` to `bulk_execution_affinity_t::close_t`.
+* Refine the wording of the `bulk_execution_affinity_t` properties to clarify the requirements on binding and chunking, based on feedback from SG1.
+
 ### P1436r2 (BEL 2019)
 
 * Alter the wording on the `bulk_execution_affinity_t` properties so they are now hints that request the executor provide a particular pattern of binding, rather than a guarantee.
@@ -101,7 +105,7 @@ The affinity interface we propose should help computers achieve a much higher fr
 
 To identify the requirements for supporting affinity we have looked at a number of use cases where affinity between memory locality and execution can provide better performance.
 
-Consider the following code example *(Listing 1)* where the C\+\+17 parallel STL algorithm `for_each` is used to modify the elements of a `std::vector` `data` on an *executor* that will execute on a NUMA system with a number of CPU cores. However the memory is allocated by the `std::vector` default allocator immediately during the construction of `data` on memory local to the calling thread of execution. This means that the memory allocated for `data` may have poor locality to all of the NUMA regions on the system, other than the one in which the constructor executed. Therefore, accesses in the parallel `for_each` made by threads in other NUMA regions will incur high latency. In this example, this is avoided by migrating `data` to have better affinity with the NUMA regions on the system using an *executor* with the `bulk_execution_affinity.scatter` property applied, before it is accessed by the `for_each`. Note that a mechanism for migration is not yet specified in this paper, so this example currently uses an arbitrary vendor API, `vendor_api::migrate`. Our intention is that a future revision of this paper will specify a standard mechanism for migration
+Consider the following code example *(Listing 1)* where the C\+\+17 parallel STL algorithm `for_each` is used to modify the elements of a `std::vector` `data` on an *executor* that will execute on a NUMA system with a number of CPU cores. However, the memory is allocated by the `std::vector` default allocator immediately during the construction of `data`, on memory local to the calling thread of execution. This means that the memory allocated for `data` may have poor locality to all of the NUMA regions on the system, other than the one in which the constructor executed. Therefore, accesses in the parallel `for_each` made by threads in other NUMA regions will incur high latency. In this example, this is avoided by migrating `data` to have better affinity with the NUMA regions on the system using an *executor* with the `bulk_execution_affinity.spread` property applied, before it is accessed by the `for_each`. Note that a mechanism for migration is not yet specified in this paper, so this example currently uses an arbitrary vendor API, `vendor_api::migrate`. Our intention is that a future revision of this paper will specify a standard mechanism for migration.
 
 ```cpp
 // NUMA executor representing N NUMA regions.
@@ -112,12 +116,12 @@ numa_executor exec;
 std::vector<float> data(N * SIZE);
 
 // Require the NUMA executor to bind its migration of memory to the underlying
-// memory resources in a scatter pattern.
+// memory resources in a spread pattern.
 auto affinityExec = std::execution::require(exec,
-  bulk_execution_affinity.scatter);
+  bulk_execution_affinity.spread);
 
 // Migrate the memory allocated for the vector across the NUMA regions in a
-// scatter pattern.
+// spread pattern.
 vendor_api::migrate(data, affinityExec);
 
 // Placement of data is local to NUMA region 0, so data for execution on other
@@ -127,7 +131,7 @@ std::for_each(std::execution::par.on(affinityExec), std::begin(data),
 ```
 *Listing 1: Migrating previously allocated memory.*
 
-Now consider a similar code example *(Listing 2)* where again the C\+\+17 parallel STL algorithm `for_each` is used to modify the elements of a `std::vector` `data` on an *executor* that will execute on a NUMA system with a number of CPU cores. However, instead of migrating `data` to have affinity with the NUMA regions, `data` is allocated within a bulk execution by an *executor* with the `bulk_execution_affinity.scatter` property applied so that `data` is allocated with affinity. Then when the `for_each` is called with the same executor, `data` maintains its affinity with the NUMA regions.
+Now consider a similar code example *(Listing 2)* where again the C\+\+17 parallel STL algorithm `for_each` is used to modify the elements of a `std::vector` `data` on an *executor* that will execute on a NUMA system with a number of CPU cores. However, instead of migrating `data` to have affinity with the NUMA regions, `data` is allocated within a bulk execution by an *executor* with the `bulk_execution_affinity.spread` property applied so that `data` is allocated with affinity. Then when the `for_each` is called with the same executor, `data` maintains its affinity with the NUMA regions.
 
 ```cpp
 // NUMA executor representing N NUMA regions.
@@ -139,9 +143,9 @@ std::vector<std::unique_ptr<float[SIZE]>> data{};
 data.reserve(N);
 
 // Require the NUMA executor to bind its allocation of memory to the underlying
-// memory resources in a scatter patter.
+// memory resources in a spread pattern.
 auto affinityExec = std::execution::require(exec,
-  bulk_execution_affinity.scatter);
+  bulk_execution_affinity.spread);
 
 // Launch a bulk execution that will allocate each unique_ptr in the vector with
 // locality to the nearest NUMA region.
@@ -263,27 +267,27 @@ constexpr memory_locality_intersection_t memory_locality_intersection;
 
 ## Bulk execution affinity properties
 
-We propose an executor property group called `bulk_execution_affinity` which contains the nested properties `none`, `balanced`, `scatter` and `compact`. Each of these properties, if applied to an *executor* provides a hint to the `executor` that requests a particular binding of *execution agents* to the *execution resources* associated with the *executor* in a particular pattern.
+We propose an executor property group called `bulk_execution_affinity` which contains the nested properties `none`, `balanced`, `spread` and `close`. Each of these properties, if applied to an *executor*, provides a hint to the *executor* that requests a particular binding of *execution agents* to the *execution resources* associated with the *executor* in a particular pattern.
 
 ### Example
 
-Below is an example *(Listing 4)* of executing a parallel task over 8 threads using `bulk_execute`, with the affinity binding `bulk_execution_affinity.scatter`. We request affinity binding using `prefer` and then check to see if the executor is able to support it using `query`.
+Below is an example *(Listing 4)* of executing a parallel task over 8 threads using `bulk_execute`, with the affinity binding `bulk_execution_affinity.spread`. We request affinity binding using `prefer` and then check to see if the executor is able to support it using `query`.
 
 ```cpp
 {
   bulk_executor exec;
 
   auto affExec = execution::prefer(exec,
-    execution::bulk_execution_affinity.scatter);
+    execution::bulk_execution_affinity.spread);
 
-  if (execution::query(affExec, execution::bulk_execution_affinity.scatter)) {
-    std::cout << "bulk_execute using bulk_execution_affinity.scatter"
+  if (execution::query(affExec, execution::bulk_execution_affinity.spread)) {
+    std::cout << "bulk_execute using bulk_execution_affinity.spread"
               << std::endl;
   }
 
-  affExec.bulk_execute([](std::size_t i, shared s) {
+  execution::bulk_execute(affExec, [](std::size_t i) {
     func(i);
-  }, 8, sharedFactory);
+  }, 8);
 }
 ```
 *Listing 4: Example of using the bulk_execution_affinity property*
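
An editorial aside on the `prefer`/`query` pattern in Listing 4: under the P0443 properties mechanism this paper builds on, `require` yields an executor that is guaranteed to have the requested property (and is ill-formed if the executor cannot support it), whereas `prefer` is only a best-effort request, which is why the listing follows it with a `query`. A minimal sketch of the distinction, reusing the paper's hypothetical `bulk_executor`:

```cpp
// Sketch only: `bulk_executor` and the property objects are the paper's
// hypothetical API, not a shipping library.
bulk_executor exec;

// require: the resulting executor must provide spread binding; if the
// executor cannot support the property at all, this is ill-formed.
auto strictExec = execution::require(
    exec, execution::bulk_execution_affinity.spread);

// prefer: a best-effort request that may be dropped, so a query is needed
// afterwards to discover whether the binding was actually applied.
auto hintedExec = execution::prefer(
    exec, execution::bulk_execution_affinity.spread);
bool applied = execution::query(
    hintedExec, execution::bulk_execution_affinity.spread);
```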
@@ -292,24 +296,33 @@ Below is an example *(Listing 4)* of executing a parallel task over 8 threads us
 
 The `bulk_execution_affinity_t` properties are a group of mutually exclusive behavioral properties (as defined in P0443 [[22]][p0443]) which provide a hint to the *executor* to, if possible, bind the *execution agents* created by a bulk invocation from an *executor*, to the underlying *execution resources* in a particular pattern relative to their physical closeness.
 
+The `bulk_execution_affinity_t` nested properties are defined using the following terms of art:
+* *Available concurrency*; which is defined as the number of *execution resources* available to an *executor* which can be bound to *execution agents* concurrently, assuming no contention.
+* *Locality distance*; which is defined as an implementation-defined metric for measuring the relative affinity between *execution resources*, whereby *execution resources* with a lower *locality distance* are likely to have similar latency in memory access operations for a given memory location.
+
+The `bulk_execution_affinity_t` nested properties also refer to the subdivision of *execution resources*, which is an implementation-defined method of subdividing the *available concurrency*, generally based on groupings of *execution resources* with the lowest *locality distance* to each other.
+
+> [*Note:* An alternative term of art for *locality distance* could be *locality interference*. *--end note*]
+
 The `bulk_execution_affinity_t` property provides nested property types and objects as described below, where:
 * `e` denotes an executor object of type `E`,
 * `f` denotes a function object of type `F&&`,
-* `s` denotes a shape object of type `execution::executor_shape<E>`, and
-* `sf` denotes a function object of type `SF`.
+* `s` denotes a shape object of type `execution::executor_shape<E>`,
+* `sf` denotes a function object of type `SF`, and
+* a call to `execution::bulk_execute(e, f, s)` creates a consecutive sequence of work-items from `0` to `s-1`, mapped to the *available concurrency* of `e`, that is, a number of *execution resources*, which are subdivided in some implementation-defined way.
 
 | Nested Property Type | Nested Property Name | Requirements |
 |----------------------|----------------------|--------------|
-| `bulk_execution_affinity_t::none_t` | `bulk_execution_affinity_t::none` | A call to `e.bulk_execute(f, s, sf)` has no requirements on the binding of *execution agents* to the underlying *execution resources*. |
-| `bulk_execution_affinity_t::scatter_t` | `bulk_execution_scatter_t::scatter` | A call to `e.bulk_execute(f, s, sf)` should aim to bind the created *execution agents* to the underlying *execution resources* (ordered by physical closeness) such that they are distributed equally across the *execution resources* in a round-robin fashion. <br><br> If the execution context associated with `e` is not able to bind the *execution agents* to the underlying *execution resources* as requested it should fall back to `bulk_execution_affinity_t::none_t`. |
-| `bulk_execution_affinity_t::compact_t` | `bulk_execution_compact_t::compact` | A call to `e.bulk_execute(f, s, sf)` should aim to bind the created *execution agents* to the underlying *execution resources* such that they are distributed as close as possible to the *execution resource* of the *thread of execution* which created them. <br><br> If the execution context associated with `e` is not able to bind the *execution agents* to the underlying *execution resources* as requested it should fall back to `bulk_execution_affinity_t::none_t`. |
-| `bulk_execution_affinity_t::balanced_t` | `bulk_execution_balanced_t::balanced` | A call to `e.bulk_execute(f, s, sf)` should aim to bind the created *execution agents* to the underlying *execution resources* (ordered by physical closeness) such that they are distributed equally across the *execution resources* in a bin packing fashion. <br><br> If the execution context associated with `e` is not able to bind the *execution agents* to the underlying *execution resources* as requested it should fall back to `bulk_execution_affinity_t::none_t`. |
+| `bulk_execution_affinity_t::none_t` | `bulk_execution_affinity_t::none` | A call to `execution::bulk_execute(e, f, s)` is not required to bind the created *execution agents* for the work-items of the iteration space specified by `s` to *execution resources*. |
+| `bulk_execution_affinity_t::spread_t` | `bulk_execution_affinity_t::spread` | A call to `execution::bulk_execute(e, f, s)` should aim to bind the created *execution agents* for the work-items of the iteration space specified by `s` to *execution resources* such that the average *locality distance* of adjacent work-items in the same subdivision of the *available concurrency* is maximized and the average *locality distance* of adjacent work-items in different subdivisions of the *available concurrency* is maximized. The binding of all *execution agents* to all *execution resources* must not result in the difference between the number of *execution agents* assigned to any *execution resource* being greater than `1`. <br><br> If `e` is not able to fulfil this aim then it should fall back to `bulk_execution_affinity_t::none_t`. |
+| `bulk_execution_affinity_t::close_t` | `bulk_execution_affinity_t::close` | A call to `execution::bulk_execute(e, f, s)` should aim to bind the created *execution agents* for the work-items of the iteration space specified by `s` to *execution resources* such that the average *locality distance* between adjacent work-items is minimized. The binding of all *execution agents* to all *execution resources* must not result in the difference between the number of *execution agents* assigned to any *execution resource* being greater than `1`. <br><br> If `e` is not able to fulfil this aim then it should fall back to `bulk_execution_affinity_t::none_t`. |
+| `bulk_execution_affinity_t::balanced_t` | `bulk_execution_affinity_t::balanced` | A call to `execution::bulk_execute(e, f, s)` should aim to bind the created *execution agents* for the work-items of the iteration space specified by `s` to *execution resources* such that the average *locality distance* of adjacent work-items in the same subdivision of the *available concurrency* is minimized and the average *locality distance* of adjacent work-items in different subdivisions of the *available concurrency* is maximized. The binding of all *execution agents* to all *execution resources* must not result in the difference between the number of *execution agents* assigned to any *execution resource* being greater than `1`. <br><br> If `e` is not able to fulfil this aim then it should fall back to `bulk_execution_affinity_t::none_t`. |
 
-> [*Note:* An implementation is free to choose how it maps individual work items to the underlying *execution resources*, providing it aims to achieve the requested affinity relationship. *--end note*]
+> [*Note:* The subdivision of the *available concurrency* is implementation-defined. *--end note*]
 
-> [*Note:* It's expected that the default value of `bulk_execution_affinity_t` for most executors be `bulk_execution_affinity_t::none_t`. *--end note*]
+> [*Note:* If the number of work-items specified by `s` is larger than the *available concurrency*, the manner in which that iteration space is subdivided into a consecutive sequence of work-items is implementation-defined. *--end note*]
 
-> [*Note:* The terms used for the `bulk_execution_affinity_t` nested properties are derived from the OpenMP properties [[33]][openmp-affinity] including the Intel specific balanced affinity binding [[34]][intel-balanced-affinity] *--end note*]
+> [*Note:* It's expected that the default value of `bulk_execution_affinity_t` for most executors be `bulk_execution_affinity_t::none_t`. *--end note*]
 
 > [*Note:* If two *executors* `e1` and `e2` invoke a bulk execution function in order, where `execution::query(e1, execution::context) == query(e2, execution::context)` is `true` and `execution::query(e1, execution::bulk_execution_affinity) == query(e2, execution::bulk_execution_affinity)` is `false`, this will likely result in `e1` binding *execution agents* if necessary to achieve the requested affinity pattern and then `e2` rebinding to achieve the new affinity pattern. Rebinding *execution agents* to *execution resources* may take substantial time and may affect performance of subsequent code. *--end note*]
 
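To make the table's wording concrete, here is a small self-contained sketch (an editorial illustration, not proposal wording) of two bindings that satisfy the `close` and `spread` aims for 8 work-items over 4 *execution resources* ordered by physical closeness: contiguous blocks keep adjacent work-items at minimal *locality distance*, while a round-robin assignment, as in the previous `scatter` wording, pushes adjacent work-items apart. Both keep the per-resource agent counts within `1` of each other, as the table requires. Real implementations may choose any mapping that satisfies the stated aims.

```cpp
// Editorial sketch: plausible `close` (block) and `spread` (round-robin)
// assignments of 8 work-items to 4 execution resources.
#include <cstddef>
#include <iostream>

int main() {
  constexpr std::size_t items = 8, resources = 4;

  for (std::size_t i = 0; i < items; ++i) {
    // close: contiguous blocks, so adjacent work-items share a resource or
    // a physically neighbouring one -> items 0..7 map to 0,0,1,1,2,2,3,3.
    std::size_t closeBinding = i / (items / resources);
    // spread: round-robin across the resource ordering, so adjacent
    // work-items land on different resources -> 0,1,2,3,0,1,2,3.
    std::size_t spreadBinding = i % resources;
    std::cout << "work-item " << i << ": close -> " << closeBinding
              << ", spread -> " << spreadBinding << '\n';
  }
}
```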
@@ -454,6 +467,10 @@ The value returned from `execution::query(e1, memory_locality_intersection_t(e2)
 
 There are a number of additional features which we are considering for inclusion in this paper but are not ready yet.
 
+## Iteration space subdivision property
+
+This proposal specifies, for the `bulk_execution_affinity_t` properties, that when the size of an invocation of `execution::bulk_execute` is greater than the *available concurrency*, it is implementation-defined how that iteration space is subdivided into a consecutive sequence of work-items. The authors of this proposal intend to propose a follow-up property for specifying how an iteration space should be subdivided into chunks in this case.
+
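To illustrate the kind of chunking such a follow-up property might control, here is a minimal editorial sketch (an assumption, not proposal wording) that subdivides an iteration space into consecutive per-agent chunks whose sizes differ by at most `1`, mirroring the constraint in the `bulk_execution_affinity_t` requirements:

```cpp
// Editorial sketch: one plausible implementation-defined subdivision of an
// iteration space of `size` work-items into consecutive chunks, one per
// agent, when `size` exceeds the available concurrency.
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

std::vector<std::pair<std::size_t, std::size_t>> // [begin, end) per agent
chunk(std::size_t size, std::size_t availableConcurrency) {
  std::vector<std::pair<std::size_t, std::size_t>> chunks;
  const std::size_t base = size / availableConcurrency;  // items every agent gets
  const std::size_t extra = size % availableConcurrency; // first `extra` agents get one more
  std::size_t begin = 0;
  for (std::size_t a = 0; a < availableConcurrency; ++a) {
    const std::size_t length = base + (a < extra ? 1 : 0);
    chunks.emplace_back(begin, begin + length);
    begin += length;
  }
  return chunks;
}

int main() {
  // 10 work-items over an available concurrency of 4 yields
  // [0,3) [3,6) [6,8) [8,10): chunk sizes differ by at most 1.
  for (auto [b, e] : chunk(10, 4))
    std::cout << '[' << b << ',' << e << ") ";
  std::cout << '\n';
}
```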
 ## Migrating data
 
 This paper currently provides a mechanism for detecting whether two *executors* share a common memory locality. However, it does not provide a way to invoke migration of data allocated local to one *executor* into the locality of another *executor*.
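
For context, the `vendor_api::migrate(data, affinityExec)` call in Listing 1 implies a shape roughly like the following; this signature is purely hypothetical, since the paper deliberately leaves the migration mechanism unspecified:

```cpp
// Hypothetical signature only -- the paper does not specify this API.
namespace vendor_api {

// Move the pages backing `data` so that they gain affinity with the
// execution resources of `exec`, e.g. in the pattern requested by the
// executor's bulk_execution_affinity property.
template <typename Range, typename Executor>
void migrate(Range& data, const Executor& exec);

} // namespace vendor_api
```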
