
Commit be88cd4

Gordon authored and committed

CP013: Final changes before submitting affinity paper.

* Corrections based on feedback.
* Update Readme and front matter.

1 parent 03a643e · commit be88cd4

3 files changed, 23 additions, 21 deletions


README.md

Lines changed: 1 addition & 1 deletion
@@ -51,7 +51,7 @@ Each proposal in the table below will be tagged with one of the following states
| CP009 | [Async Work Group Copy & Prefetch Builtins](async-work-group-copy/index.md) | SYCL 1.2.1 | 07 August 2017 | 07 August 2017 | _Accepted with changes_ |
| CP011 | [Mem Fence Builtins](mem-fence/index.md) | SYCL 1.2.1 | 11 August 2017 | 9 September 2017 | _Accepted_ |
| CP012 | [Data Movement in C++](data-movement/index.md) | ISO C++ SG1, SG14 | 30 May 2017 | 28 August 2017 | _Work in Progress_ |
-| CP013 | [Executor properties for affinity-based execution <br> System topology discovery for heterogeneous & distributed computing](affinity/index.md) | ISO C++ SG1, SG14, LEWG | 15 November 2017 | 12 January 2019 | _Work in Progress_ |
+| CP013 | [P1436: Executor properties for affinity-based execution](affinity/index.md) | ISO C++ SG1, SG14, LEWG | 15 November 2017 | 21 January 2019 | _Work in Progress_ |
| CP014 | [Shared Virtual Memory](svm/index.md) | SYCL 2.2 | 22 January 2018 | 22 January 2018 | _Work in Progress_ |
| CP015 | [Specialization Constant](spec-constant/index.md) | SYCL 1.2.1 extension / SYCL 2.2 | 24 April 2018 | 24 April 2018 | _Work in Progress_ |
| CP019 | [On-chip Memory Allocation](onchip-memory/index.md) | SYCL 1.2.1 extension / SYCL 2.2 | 03 December 2018 | 03 December 2018 | _Work in Progress_ |

affinity/cpp-20/d1436r0.md

Lines changed: 12 additions & 12 deletions
@@ -1,6 +1,6 @@
# D1436r0: Executor properties for affinity-based execution

-**Date: 2019-01-12**
+**Date: 2019-01-21**

**Audience: SG1, SG14, LEWG**

@@ -78,13 +78,13 @@ The affinity problem is especially challenging for applications whose behavior c

Frequently, data are initialized at the beginning of the program by the initial thread and are used by multiple threads. While some OSes automatically migrate threads or data for better affinity, migration may have high overhead. In an optimal case, the OS may automatically detect which threads access which data most frequently, or it may replicate data which are read by multiple threads, or migrate data which are modified and used by threads residing on remote locality groups. However, the OS often does a reasonable job, if the machine is not overloaded, if the application carefully uses first-touch allocation, and if the program does not change its behavior with respect to locality.

-The affinity interface we propose should help computers achieve a much higher fraction of peak memory bandwidth when using parallel algorithms. In the future, we plan to extend this to heterogeneous and distributed computing. This follows the lead of OpenMP [[2]][design-of-openmp], which has plans to integrate its affinity model with its heterogeneous model [3]. (One of the authors of this document participated in the design of OpenMP's affinity model.)
+The affinity interface we propose should help computers achieve a much higher fraction of peak memory bandwidth when using parallel algorithms. In the future, we plan to extend this to heterogeneous and distributed computing. This follows the lead of OpenMP [[2]][design-of-openmp], which has plans to integrate its affinity model with its heterogeneity model [3]. (One of the authors of this document participated in the design of OpenMP's affinity model.)

## Motivational examples

To identify the requirements for supporting affinity we have looked at a number of use cases where affinity between memory locality and execution can provide better performance.

-Consider the following code example *(Listing 1)* where the C\+\+17 parallel STL algorithm `for_each` is used to modify the elements of a `std::vector` `data` on an *executor* that will execute on a NUMA system with a number of CPU cores. However the memory is allocated by the `std::vector` default allocator immediately during the construction of `data` on memory local to the calling *thread of execution*. This means the memory allocated for `data` has bad locality to the NUMA regions on the system, therefore as is would incur higher latency when accessed. In this example, this is avoided by migrating `data` to have better affinity with the NUMA regions on the system using an *executor* with the `bulk_execution_affinity.scatter` property applied, before it is accessed by the `for_each`. Note that a mechanism for miration is not yet specified in this paper so this is example currently uses a vendor API; `vendor_api::migrate`, though the intention is that this will be replaced in a future revision.
+Consider the following code example *(Listing 1)* where the C\+\+17 parallel STL algorithm `for_each` is used to modify the elements of a `std::vector` `data` on an *executor* that will execute on a NUMA system with a number of CPU cores. However the memory is allocated by the `std::vector` default allocator immediately during the construction of `data` on memory local to the calling thread of execution. This means that the memory allocated for `data` may have poor locality to all of the NUMA regions on the system, other than the one in which the constructor executed. Therefore, accesses in the parallel `for_each` made by threads in other NUMA regions will incur high latency. In this example, this is avoided by migrating `data` to have better affinity with the NUMA regions on the system using an *executor* with the `bulk_execution_affinity.scatter` property applied, before it is accessed by the `for_each`. Note that a mechanism for migration is not yet specified in this paper, so this example currently uses an arbitrary vendor API, `vendor_api::migrate`. Our intention is that a future revision of this paper will specify a standard mechanism for migration.

```cpp
// NUMA executor representing N NUMA regions.
@@ -101,7 +101,7 @@ auto affinityExec = std::execution::require(exec,

// Migrate the memory allocated for the vector across the NUMA regions in a
// scatter pattern.
-vendor_api::migrate(std::begin(data), std::end(data), affinityExec);
+vendor_api::migrate(data, affinityExec);

// Placement of data is local to NUMA region 0, so data for execution on other
// NUMA nodes must be migrated when accessed.
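An aside on the call-site change in this hunk: it suggests `vendor_api::migrate` accepting a whole container rather than an iterator pair. Since the paper treats this API as an unspecified vendor placeholder, the following declarations are purely illustrative assumptions, written only to be consistent with the two call sites in the diff:

```cpp
namespace vendor_api {

// Hypothetical range form matching the call site added by this commit:
// migrate the container's storage so its pages gain affinity with the
// memory locality of the given executor.
template <class Container, class Executor>
void migrate(Container& data, const Executor& exec);

// Hypothetical iterator-pair form matching the call site removed by this
// commit.
template <class ForwardIt, class Executor>
void migrate(ForwardIt first, ForwardIt last, const Executor& exec);

} // namespace vendor_api
```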
@@ -110,7 +110,7 @@ std::for_each(std::execution::par.on(affinityExec), std::begin(data),
```
*Listing 1: Migrating previously allocated memory.*

-Now consider a similar code example *(Listing 2)* where again the C\+\+17 parallel STL algorithm `for_each` is used to modify the elements of a `std::vector` `data` on an *executor* that will execute on a NUMA system with a number of CPU cores. However instead of migrating `data` to have affinity with the NUMA regions, `data` is allocated within a bulk execution by an *executor* with the `bulk_execution_affinity.scatter` property applied so that `data` is allocated with affinity. Then when the `for_each` is called with the same executor, `data` maintains its affinity with the NUMA regions.
+Now consider a similar code example *(Listing 2)* where again the C\+\+17 parallel STL algorithm `for_each` is used to modify the elements of a `std::vector` `data` on an *executor* that will execute on a NUMA system with a number of CPU cores. However, instead of migrating `data` to have affinity with the NUMA regions, `data` is allocated within a bulk execution by an *executor* with the `bulk_execution_affinity.scatter` property applied so that `data` is allocated with affinity. Then when the `for_each` is called with the same executor, `data` maintains its affinity with the NUMA regions.

```cpp
// NUMA executor representing N NUMA regions.
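For readability, here is Listing 1 assembled from the fragments scattered across the hunks above, as it reads after this commit. Only the `require`, `migrate`, and `for_each` lines and their comments appear in the diff; the `numa_executor` and `std::vector` declarations, the constants, and the lambda body are assumptions filled in for illustration, and the whole listing targets the executors proposal's `require` and `par.on` interface rather than standard C++ as it exists today:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t N = 4;        // assumed number of NUMA regions
constexpr std::size_t SIZE = 1024;  // assumed elements per region

// NUMA executor representing N NUMA regions (assumed vendor-provided type).
numa_executor exec;

// Storage is allocated on construction, local to the calling thread of
// execution, i.e. with affinity to NUMA region 0 only.
std::vector<float> data(N * SIZE);

// Require the scatter affinity pattern on the executor.
auto affinityExec = std::execution::require(exec,
    bulk_execution_affinity.scatter);

// Migrate the memory allocated for the vector across the NUMA regions in a
// scatter pattern (placeholder vendor API, as noted in the prose above).
vendor_api::migrate(data, affinityExec);

// Each execution agent now has affinity with the NUMA region holding the
// elements it touches, avoiding the remote accesses described above.
std::for_each(std::execution::par.on(affinityExec), std::begin(data),
              std::end(data), [](float& value) { value += 1.0f; });
```

Listing 2's body is truncated in this diff, so it is not reconstructed here; per the prose above, it replaces the `migrate` call with allocation inside a bulk execution on the same scatter-property executor.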
@@ -151,7 +151,7 @@ Wherever possible, we also evaluate how an affinity-based solution could be scal

## State of the art

-The *affinity problem* existed for some time, and there are a number of third-party libraries and standards which provide APIs to solve the problem. In order to standardize this process for C\+\+, we must carefully look at all of these approaches and identify which we wish ideas are suitable for adoption into C\+\+. Below is a list of the libraries and standards from which this proposal will draw:
+The *affinity problem* has existed for some time, and there are a number of third-party libraries and standards which provide APIs to solve the problem. In order to standardize this process for C\+\+, we must carefully look at all of these approaches and identify which ideas are suitable for adoption into C\+\+. Below is a list of the libraries and standards from which this proposal will draw:

* Portable Hardware Locality [[4]][hwloc]
* SYCL 1.2 [[5]][sycl-1-2-1]
@@ -183,7 +183,7 @@ This can be scaled to heterogeneous and distributed systems, as the relative aff

## Inaccessible memory

-The initial solution proposed by this paper may only target systems with a single addressable memory region. It may thus exclude devices like discrete GPUs. However, in order to maintain a unified interface going forward, the initial solution should consider these devices and be able to scale to support them in the future.
+The initial solution proposed by this paper may only target systems with a single addressable memory region. It may therefore exclude certain heterogeneous devices such as some discrete GPUs. However, in order to maintain a unified interface going forward, the initial solution should consider these devices and be able to scale to support them in the future.

# Proposal

@@ -432,19 +432,19 @@ There are a number of additional features which we are considering for inclusion

## Migrating data

-This paper currently provides a mechanism for detecting whether two *executors* share a common memory locality. However it does not provide a way to invoke migration of data allocated local to one *executor* into the locality of another *executor*.
+This paper currently provides a mechanism for detecting whether two *executors* share a common memory locality. However, it does not provide a way to invoke migration of data allocated local to one *executor* into the locality of another *executor*.

-We envision that this mechanic could be facilitated by a customization point on two *executors* and perhaps a `spann` or `mdspan` accessor.
+We envision that this mechanism could be facilitated by a customization point on two *executors* and perhaps a `span` or `mdspan` accessor.

## Supporting different affinity domains

This paper currently assumes a NUMA-like system; however, there are many other kinds of systems with different architectures, with different kinds of processors, memory and connections between them.

-In order to accurately take advantage of the range of systems available now and in the future we will need some way to parameterized or enumerate the different affinity domains which an executor can structure around.
+In order to accurately take advantage of the range of systems available now and in the future, we will need some way to parameterize or enumerate the different affinity domains which an executor can structure around.

-Furthermore in order to have control over those affinity domains we need a way in which to mask out the components of that domain that we wish to work with.
+Furthermore, in order to have control over those affinity domains, we need a way in which to mask out the components of that domain that we wish to work with.

-However whichever option we opt for, it must be in such a way as to allow further additions as new system architectures become available.
+However, whichever option we opt for, it must be done in such a way as to allow further additions as new system architectures become available.


# Acknowledgments
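To make the "customization point on two *executors*" idea in this hunk concrete, here is one hypothetical shape it could take. Nothing below is proposed wording; the name `migrate`, the parameter order, and the use of `std::span` are all assumptions for illustration only:

```cpp
#include <span>

// Hypothetical customization point: migrate the memory viewed by `data` from
// the memory locality of `from` into the memory locality of `to`. A real
// design might instead hang this off an mdspan accessor, as the paper's
// future-work text suggests.
template <class T, class SourceExecutor, class DestinationExecutor>
void migrate(std::span<T> data,
             const SourceExecutor& from,
             const DestinationExecutor& to);
```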

affinity/index.md

Lines changed: 10 additions & 8 deletions
@@ -1,19 +1,21 @@
-# Supporting Heterogeneous & Distributed Computing Through Affinity
+# P1436: Executor properties for affinity-based execution

| | |
|---|---|
| ID | CP013 |
-| Name | Executor properties for affinity-based execution <br> System topology discovery for heterogeneous & distributed computing |
-| Target | ISO C++ SG1 SG14 |
+| Name | Executor properties for affinity-based execution |
+| Target | ISO C++ SG1, SG14, LEWG |
| Initial creation | 15 November 2017 |
-| Last update | 12 August 2018 |
-| Reply-to | Michael Wong <michael.wong@codeplay.com> |
+| Last update | 21 January 2019 |
+| Reply-to | Gordon Brown <gordon@codeplay.com> |
| Original author | Gordon Brown <gordon@codeplay.com> |
-| Contributors | Ruyman Reyes <ruyman@codeplay.com>, Michael Wong <michael.wong@codeplay.com>, H. Carter Edwards <hcedwar@sandia.gov>, Thomas Rodgers <rodgert@twrodgers.com> |
+| Contributors | Ruyman Reyes <ruyman@codeplay.com>, Michael Wong <michael.wong@codeplay.com>, H. Carter Edwards <hcedwar@sandia.gov>, Thomas Rodgers <rodgert@twrodgers.com>, Mark Hoemmen <mhoemme@sandia.gov> |

## Overview

-This paper provides an initial meta-framework for the drives toward memory affinity for C++, given the direction from Toronto 2017 SG1 meeting that we should look towards defining affinity for C++ before looking at inaccessible memory as a solution to the separate memory problem towards supporting heterogeneous and distributed computing.
+This paper is the result of a request from SG1 at the 2018 San Diego meeting to split P0796: Supporting Heterogeneous & Distributed Computing Through Affinity [[35]][p0796] into two separate papers, one for the high-level interface and one for the low-level interface. This paper focusses on the high-level interface: a series of properties for querying affinity relationships and requesting affinity on work being executed. [[36]][p1437] focusses on the low-level interface: a mechanism for discovering the topology and affinity properties of a given system.
+
+The aim of this paper is to provide a number of executor properties that, if supported, allow the user of an executor to query and manipulate the binding between *execution agents* and the underlying *execution resources* of the *threads of execution* they are run on.

## Versions

2325
| [P0796r1][p0796r1] | _Published_ |
2426
| [D0796r2][p0796r2] | _Published_ |
2527
| [D0796r3][p0796r3] | _Published_ |
26-
| [DXXX1r0](cpp-20/dXXX1r0.md) <br> [DXXX2r0](cpp-20/dXXX2r0.md) | _Work In Progress_ |
28+
| [DXXX1r0](cpp-20/d1436r0.md) | _Work In Progress_ |
2729

2830
[p0796r0]: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0796r0.pdf
2931
[p0796r1]: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0796r1.pdf
