
Commit be88cd4

Gordon authored and committed

CP013: Final changes before submitting affinity paper.

* Corrections based on feedback.
* Update Readme and front matter.

1 parent 03a643e · commit be88cd4

3 files changed, 23 additions, 21 deletions


README.md

Lines changed: 1 addition & 1 deletion
@@ -51,7 +51,7 @@ Each proposal in the table below will be tagged with one of the following states
| CP009 | [Async Work Group Copy & Prefetch Builtins](async-work-group-copy/index.md) | SYCL 1.2.1 | 07 August 2017 | 07 August 2017 | _Accepted with changes_ |
| CP011 | [Mem Fence Builtins](mem-fence/index.md) | SYCL 1.2.1 | 11 August 2017 | 9 September 2017 | _Accepted_ |
| CP012 | [Data Movement in C++](data-movement/index.md) | ISO C++ SG1, SG14 | 30 May 2017 | 28 August 2017 | _Work in Progress_ |
-| CP013 | [Executor properties for affinity-based execution <br> System topology discovery for heterogeneous & distributed computing](affinity/index.md) | ISO C++ SG1, SG14, LEWG | 15 November 2017 | 12 January 2019 | _Work in Progress_ |
+| CP013 | [P1436: Executor properties for affinity-based execution](affinity/index.md) | ISO C++ SG1, SG14, LEWG | 15 November 2017 | 21 January 2019 | _Work in Progress_ |
| CP014 | [Shared Virtual Memory](svm/index.md) | SYCL 2.2 | 22 January 2018 | 22 January 2018 | _Work in Progress_ |
| CP015 | [Specialization Constant](spec-constant/index.md) | SYCL 1.2.1 extension / SYCL 2.2 | 24 April 2018 | 24 April 2018 | _Work in Progress_ |
| CP019 | [On-chip Memory Allocation](onchip-memory/index.md) | SYCL 1.2.1 extension / SYCL 2.2 | 03 December 2018 | 03 December 2018 | _Work in Progress_ |

affinity/cpp-20/d1436r0.md

Lines changed: 12 additions & 12 deletions
@@ -1,6 +1,6 @@
# D1436r0: Executor properties for affinity-based execution

-**Date: 2019-01-12**
+**Date: 2019-01-21**

**Audience: SG1, SG14, LEWG**

@@ -78,13 +78,13 @@ The affinity problem is especially challenging for applications whose behavior c

Frequently, data are initialized at the beginning of the program by the initial thread and are used by multiple threads. While some OSes automatically migrate threads or data for better affinity, migration may have high overhead. In an optimal case, the OS may automatically detect which threads access which data most frequently, or it may replicate data which are read by multiple threads, or migrate data which are modified and used by threads residing on remote locality groups. However, the OS often does a reasonable job, if the machine is not overloaded, if the application carefully uses first-touch allocation, and if the program does not change its behavior with respect to locality.

-The affinity interface we propose should help computers achieve a much higher fraction of peak memory bandwidth when using parallel algorithms. In the future, we plan to extend this to heterogeneous and distributed computing. This follows the lead of OpenMP [[2]][design-of-openmp], which has plans to integrate its affinity model with its heterogeneous model [3]. (One of the authors of this document participated in the design of OpenMP's affinity model.)
+The affinity interface we propose should help computers achieve a much higher fraction of peak memory bandwidth when using parallel algorithms. In the future, we plan to extend this to heterogeneous and distributed computing. This follows the lead of OpenMP [[2]][design-of-openmp], which has plans to integrate its affinity model with its heterogeneity model [3]. (One of the authors of this document participated in the design of OpenMP's affinity model.)

## Motivational examples

To identify the requirements for supporting affinity we have looked at a number of use cases where affinity between memory locality and execution can provide better performance.

-Consider the following code example *(Listing 1)* where the C\+\+17 parallel STL algorithm `for_each` is used to modify the elements of a `std::vector` `data` on an *executor* that will execute on a NUMA system with a number of CPU cores. However the memory is allocated by the `std::vector` default allocator immediately during the construction of `data` on memory local to the calling *thread of execution*. This means the memory allocated for `data` has bad locality to the NUMA regions on the system, therefore as is would incur higher latency when accessed. In this example, this is avoided by migrating `data` to have better affinity with the NUMA regions on the system using an *executor* with the `bulk_execution_affinity.scatter` property applied, before it is accessed by the `for_each`. Note that a mechanism for miration is not yet specified in this paper so this is example currently uses a vendor API; `vendor_api::migrate`, though the intention is that this will be replaced in a future revision.
+Consider the following code example *(Listing 1)* where the C\+\+17 parallel STL algorithm `for_each` is used to modify the elements of a `std::vector` `data` on an *executor* that will execute on a NUMA system with a number of CPU cores. However the memory is allocated by the `std::vector` default allocator immediately during the construction of `data` on memory local to the calling thread of execution. This means that the memory allocated for `data` may have poor locality to all of the NUMA regions on the system, other than the one in which the constructor executed. Therefore, accesses in the parallel `for_each` made by threads in other NUMA regions will incur high latency. In this example, this is avoided by migrating `data` to have better affinity with the NUMA regions on the system using an *executor* with the `bulk_execution_affinity.scatter` property applied, before it is accessed by the `for_each`. Note that a mechanism for migration is not yet specified in this paper, so this example currently uses an arbitrary vendor API, `vendor_api::migrate`. Our intention is that a future revision of this paper will specify a standard mechanism for migration.

```cpp
// NUMA executor representing N NUMA regions.
@@ -101,7 +101,7 @@ auto affinityExec = std::execution::require(exec,

// Migrate the memory allocated for the vector across the NUMA regions in a
// scatter pattern.
-vendor_api::migrate(std::begin(data), std::end(data), affinityExec);
+vendor_api::migrate(data, affinityExec);

// Placement of data is local to NUMA region 0, so data for execution on other
// NUMA nodes must be migrated when accessed.
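An aside on the call-site change in this hunk: it suggests `vendor_api::migrate` accepting a whole container rather than an iterator pair. Since the paper treats this API as an unspecified vendor placeholder, the following declarations are purely illustrative assumptions, written only to be consistent with the two call sites in the diff:

```cpp
namespace vendor_api {

// Hypothetical range form matching the call site added by this commit:
// migrate the container's storage so its pages gain affinity with the
// memory locality of the given executor.
template <class Container, class Executor>
void migrate(Container& data, const Executor& exec);

// Hypothetical iterator-pair form matching the call site removed by this
// commit.
template <class ForwardIt, class Executor>
void migrate(ForwardIt first, ForwardIt last, const Executor& exec);

} // namespace vendor_api
```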
@@ -110,7 +110,7 @@ std::for_each(std::execution::par.on(affinityExec), std::begin(data),
```
*Listing 1: Migrating previously allocated memory.*

-Now consider a similar code example *(Listing 2)* where again the C\+\+17 parallel STL algorithm `for_each` is used to modify the elements of a `std::vector` `data` on an *executor* that will execute on a NUMA system with a number of CPU cores. However instead of migrating `data` to have affinity with the NUMA regions, `data` is allocated within a bulk execution by an *executor* with the `bulk_execution_affinity.scatter` property applied so that `data` is allocated with affinity. Then when the `for_each` is called with the same executor, `data` maintains its affinity with the NUMA regions.
+Now consider a similar code example *(Listing 2)* where again the C\+\+17 parallel STL algorithm `for_each` is used to modify the elements of a `std::vector` `data` on an *executor* that will execute on a NUMA system with a number of CPU cores. However, instead of migrating `data` to have affinity with the NUMA regions, `data` is allocated within a bulk execution by an *executor* with the `bulk_execution_affinity.scatter` property applied so that `data` is allocated with affinity. Then when the `for_each` is called with the same executor, `data` maintains its affinity with the NUMA regions.

```cpp
// NUMA executor representing N NUMA regions.
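For readability, here is Listing 1 assembled from the fragments scattered across the hunks above, as it reads after this commit. Only the `require`, `migrate`, and `for_each` lines and their comments appear in the diff; the `numa_executor` and `std::vector` declarations, the constants, and the lambda body are assumptions filled in for illustration, and the whole listing targets the executors proposal's `require` and `par.on` interface rather than standard C++ as it exists today:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t N = 4;        // assumed number of NUMA regions
constexpr std::size_t SIZE = 1024;  // assumed elements per region

// NUMA executor representing N NUMA regions (assumed vendor-provided type).
numa_executor exec;

// Storage is allocated on construction, local to the calling thread of
// execution, i.e. with affinity to NUMA region 0 only.
std::vector<float> data(N * SIZE);

// Require the scatter affinity pattern on the executor.
auto affinityExec = std::execution::require(exec,
    bulk_execution_affinity.scatter);

// Migrate the memory allocated for the vector across the NUMA regions in a
// scatter pattern (placeholder vendor API, as noted in the prose above).
vendor_api::migrate(data, affinityExec);

// Each execution agent now has affinity with the NUMA region holding the
// elements it touches, avoiding the remote accesses described above.
std::for_each(std::execution::par.on(affinityExec), std::begin(data),
              std::end(data), [](float& value) { value += 1.0f; });
```

Listing 2's body is truncated in this diff, so it is not reconstructed here; per the prose above, it replaces the `migrate` call with allocation inside a bulk execution on the same scatter-property executor.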
@@ -151,7 +151,7 @@ Wherever possible, we also evaluate how an affinity-based solution could be scal

## State of the art

-The *affinity problem* existed for some time, and there are a number of third-party libraries and standards which provide APIs to solve the problem. In order to standardize this process for C\+\+, we must carefully look at all of these approaches and identify which we wish ideas are suitable for adoption into C\+\+. Below is a list of the libraries and standards from which this proposal will draw:
+The *affinity problem* has existed for some time, and there are a number of third-party libraries and standards which provide APIs to solve the problem. In order to standardize this process for C\+\+, we must carefully look at all of these approaches and identify which ideas are suitable for adoption into C\+\+. Below is a list of the libraries and standards from which this proposal will draw:

* Portable Hardware Locality [[4]][hwloc]
* SYCL 1.2 [[5]][sycl-1-2-1]
@@ -183,7 +183,7 @@ This can be scaled to heterogeneous and distributed systems, as the relative aff

## Inaccessible memory

-The initial solution proposed by this paper may only target systems with a single addressable memory region. It may thus exclude devices like discrete GPUs. However, in order to maintain a unified interface going forward, the initial solution should consider these devices and be able to scale to support them in the future.
+The initial solution proposed by this paper may only target systems with a single addressable memory region. It may therefore exclude certain heterogeneous devices such as some discrete GPUs. However, in order to maintain a unified interface going forward, the initial solution should consider these devices and be able to scale to support them in the future.

# Proposal

@@ -432,19 +432,19 @@ There are a number of additional features which we are considering for inclusion

## Migrating data

-This paper currently provides a mechanism for detecting whether two *executors* share a common memory locality. However it does not provide a way to invoke migration of data allocated local to one *executor* into the locality of another *executor*.
+This paper currently provides a mechanism for detecting whether two *executors* share a common memory locality. However, it does not provide a way to invoke migration of data allocated local to one *executor* into the locality of another *executor*.

-We envision that this mechanic could be facilitated by a customization point on two *executors* and perhaps a `spann` or `mdspan` accessor.
+We envision that this mechanism could be facilitated by a customization point on two *executors* and perhaps a `span` or `mdspan` accessor.

## Supporting different affinity domains

This paper currently assumes a NUMA-like system; however, there are many other kinds of systems with different architectures, with different kinds of processors, memory and connections between them.

-In order to accurately take advantage of the range of systems available now and in the future we will need some way to parameterized or enumerate the different affinity domains which an executor can structure around.
+In order to accurately take advantage of the range of systems available now and in the future, we will need some way to parameterize or enumerate the different affinity domains which an executor can structure around.

-Furthermore in order to have control over those affinity domains we need a way in which to mask out the components of that domain that we wish to work with.
+Furthermore, in order to have control over those affinity domains, we need a way in which to mask out the components of that domain that we wish to work with.

-However whichever option we opt for, it must be in such a way as to allow further additions as new system architectures become available.
+However, whichever option we opt for, it must be done in such a way as to allow further additions as new system architectures become available.


# Acknowledgments
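To make the "customization point on two *executors*" idea in this hunk concrete, here is one hypothetical shape it could take. Nothing below is proposed wording; the name `migrate`, the parameter order, and the use of `std::span` are all assumptions for illustration only:

```cpp
#include <span>

// Hypothetical customization point: migrate the memory viewed by `data` from
// the memory locality of `from` into the memory locality of `to`. A real
// design might instead hang this off an mdspan accessor, as the paper's
// future-work text suggests.
template <class T, class SourceExecutor, class DestinationExecutor>
void migrate(std::span<T> data,
             const SourceExecutor& from,
             const DestinationExecutor& to);
```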

affinity/index.md

Lines changed: 10 additions & 8 deletions
@@ -1,19 +1,21 @@
-# Supporting Heterogeneous & Distributed Computing Through Affinity
+# P1436: Executor properties for affinity-based execution

| | |
|---|---|
| ID | CP013 |
-| Name | Executor properties for affinity-based execution <br> System topology discovery for heterogeneous & distributed computing |
-| Target | ISO C++ SG1 SG14 |
+| Name | Executor properties for affinity-based execution |
+| Target | ISO C++ SG1, SG14, LEWG |
| Initial creation | 15 November 2017 |
-| Last update | 12 August 2018 |
-| Reply-to | Michael Wong <michael.wong@codeplay.com> |
+| Last update | 21 January 2019 |
+| Reply-to | Gordon Brown <gordon@codeplay.com> |
| Original author | Gordon Brown <gordon@codeplay.com> |
-| Contributors | Ruyman Reyes <ruyman@codeplay.com>, Michael Wong <michael.wong@codeplay.com>, H. Carter Edwards <hcedwar@sandia.gov>, Thomas Rodgers <rodgert@twrodgers.com> |
+| Contributors | Ruyman Reyes <ruyman@codeplay.com>, Michael Wong <michael.wong@codeplay.com>, H. Carter Edwards <hcedwar@sandia.gov>, Thomas Rodgers <rodgert@twrodgers.com>, Mark Hoemmen <mhoemme@sandia.gov> |

## Overview

-This paper provides an initial meta-framework for the drives toward memory affinity for C++, given the direction from Toronto 2017 SG1 meeting that we should look towards defining affinity for C++ before looking at inaccessible memory as a solution to the separate memory problem towards supporting heterogeneous and distributed computing.
+This paper is the result of a request from SG1 at the 2018 San Diego meeting to split P0796: Supporting Heterogeneous & Distributed Computing Through Affinity [[35]][p0796] into two separate papers, one for the high-level interface and one for the low-level interface. This paper focusses on the high-level interface: a series of properties for querying affinity relationships and requesting affinity on work being executed. [[36]][p1437] focusses on the low-level interface: a mechanism for discovering the topology and affinity properties of a given system.
+
+The aim of this paper is to provide a number of executor properties that, if supported, allow the user of an executor to query and manipulate the binding between *execution agents* and the underlying *execution resources* of the *threads of execution* they are run on.

## Versions

2325
| [P0796r1][p0796r1] | _Published_ |
2426
| [D0796r2][p0796r2] | _Published_ |
2527
| [D0796r3][p0796r3] | _Published_ |
26-
| [DXXX1r0](cpp-20/dXXX1r0.md) <br> [DXXX2r0](cpp-20/dXXX2r0.md) | _Work In Progress_ |
28+
| [DXXX1r0](cpp-20/d1436r0.md) | _Work In Progress_ |
2729

2830
[p0796r0]: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0796r0.pdf
2931
[p0796r1]: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0796r1.pdf
