**README.md** (1 addition & 1 deletion)

````diff
@@ -51,7 +51,7 @@ Each proposal in the table below will be tagged with one of the following states
 | CP009 |[Async Work Group Copy & Prefetch Builtins](async-work-group-copy/index.md)| SYCL 1.2.1 | 07 August 2017 | 07 August 2017 |_Accepted with changes_|
 | CP011 |[Mem Fence Builtins](mem-fence/index.md)| SYCL 1.2.1 | 11 August 2017 | 9 September 2017 |_Accepted_|
 | CP012 |[Data Movement in C++](data-movement/index.md)| ISO C++ SG1, SG14 | 30 May 2017 | 28 August 2017 |_Work in Progress_|
-| CP013 |[Executor properties for affinity-based execution <br> System topology discovery for heterogeneous & distributed computing](affinity/index.md)| ISO C++ SG1, SG14, LEWG | 15 November 2017 |12 January 2019 |_Work in Progress_|
+| CP013 |[P1436: Executor properties for affinity-based execution](affinity/index.md)| ISO C++ SG1, SG14, LEWG | 15 November 2017 |21 January 2019 |_Work in Progress_|
 | CP014 |[Shared Virtual Memory](svm/index.md)| SYCL 2.2 | 22 January 2018 | 22 January 2018 |_Work in Progress_|
 | CP015 |[Specialization Constant](spec-constant/index.md)| SYCL 1.2.1 extension / SYCL 2.2 | 24 April 2018 | 24 April 2018 |_Work in Progress_|
 | CP019 |[On-chip Memory Allocation](onchip-memory/index.md)| SYCL 1.2.1 extension / SYCL 2.2 | 03 December 2018 | 03 December 2018 |_Work in Progress_|
````
**affinity/cpp-20/d1436r0.md** (12 additions & 12 deletions)

````diff
@@ -1,6 +1,6 @@
 # D1436r0: Executor properties for affinity-based execution
 
-**Date: 2019-01-12**
+**Date: 2019-01-21**
 
 **Audience: SG1, SG14, LEWG**
 
@@ -78,13 +78,13 @@ The affinity problem is especially challenging for applications whose behavior c
 
 Frequently, data are initialized at the beginning of the program by the initial thread and are used by multiple threads. While some OSes automatically migrate threads or data for better affinity, migration may have high overhead. In an optimal case, the OS may automatically detect which threads access which data most frequently, or it may replicate data which are read by multiple threads, or migrate data which are modified and used by threads residing on remote locality groups. However, the OS often does a reasonable job, if the machine is not overloaded, if the application carefully uses first-touch allocation, and if the program does not change its behavior with respect to locality.
 
-The affinity interface we propose should help computers achieve a much higher fraction of peak memory bandwidth when using parallel algorithms. In the future, we plan to extend this to heterogeneous and distributed computing. This follows the lead of OpenMP [[2]][design-of-openmp], which has plans to integrate its affinity model with its heterogeneous model [3]. (One of the authors of this document participated in the design of OpenMP's affinity model.)
+The affinity interface we propose should help computers achieve a much higher fraction of peak memory bandwidth when using parallel algorithms. In the future, we plan to extend this to heterogeneous and distributed computing. This follows the lead of OpenMP [[2]][design-of-openmp], which has plans to integrate its affinity model with its heterogeneity model [3]. (One of the authors of this document participated in the design of OpenMP's affinity model.)
 
 ## Motivational examples
 
 To identify the requirements for supporting affinity we have looked at a number of use cases where affinity between memory locality and execution can provide better performance.
 
-Consider the following code example *(Listing 1)* where the C\+\+17 parallel STL algorithm `for_each` is used to modify the elements of a `std::vector``data` on an *executor* that will execute on a NUMA system with a number of CPU cores. However the memory is allocated by the `std::vector` default allocator immediately during the construction of `data` on memory local to the calling *thread of execution*. This means the memory allocated for `data` has bad locality to the NUMA regions on the system, therefore as is would incur higher latency when accessed. In this example, this is avoided by migrating `data` to have better affinity with the NUMA regions on the system using an *executor* with the `bulk_execution_affinity.scatter` property applied, before it is accessed by the `for_each`. Note that a mechanism for miration is not yet specified in this paper so this is example currently uses a vendor API; `vendor_api::migrate`, though the intention is that this will be replaced in a future revision.
+Consider the following code example *(Listing 1)* where the C\+\+17 parallel STL algorithm `for_each` is used to modify the elements of a `std::vector` `data` on an *executor* that will execute on a NUMA system with a number of CPU cores. However, the memory is allocated by the `std::vector` default allocator immediately during the construction of `data`, on memory local to the calling thread of execution. This means that the memory allocated for `data` may have poor locality to all of the NUMA regions on the system, other than the one in which the constructor executed. Therefore, accesses in the parallel `for_each` made by threads in other NUMA regions will incur high latency. In this example, this is avoided by migrating `data` to have better affinity with the NUMA regions on the system using an *executor* with the `bulk_execution_affinity.scatter` property applied, before it is accessed by the `for_each`. Note that a mechanism for migration is not yet specified in this paper, so this example currently uses an arbitrary vendor API, `vendor_api::migrate`. Our intention is that a future revision of this paper will specify a standard mechanism for migration.
 
 ```cpp
 // NUMA executor representing N NUMA regions.
@@ -101,7 +101,7 @@ auto affinityExec = std::execution::require(exec,
 
 // Migrate the memory allocated for the vector across the NUMA regions in a
-Now consider a similar code example *(Listing 2)* where again the C\+\+17 parallel STL algorithm `for_each` is used to modify the elements of a `std::vector` `data` on an *executor* that will execute on a NUMA system with a number of CPU cores. However instead of migrating `data` to have affinity with the NUMA regions, `data` is allocated within a bulk execution by an *executor* with the `bulk_execution_affinity.scatter` property applied so that `data` is allocated with affinity. Then when the `for_each` is called with the same executor, `data` maintains its affinity with the NUMA regions.
+Now consider a similar code example *(Listing 2)* where again the C\+\+17 parallel STL algorithm `for_each` is used to modify the elements of a `std::vector` `data` on an *executor* that will execute on a NUMA system with a number of CPU cores. However, instead of migrating `data` to have affinity with the NUMA regions, `data` is allocated within a bulk execution by an *executor* with the `bulk_execution_affinity.scatter` property applied so that `data` is allocated with affinity. Then when the `for_each` is called with the same executor, `data` maintains its affinity with the NUMA regions.
 
 ```cpp
 // NUMA executor representing N NUMA regions.
@@ -151,7 +151,7 @@ Wherever possible, we also evaluate how an affinity-based solution could be scal
 
 ## State of the art
 
-The *affinity problem* existed for some time, and there are a number of third-party libraries and standards which provide APIs to solve the problem. In order to standardize this process for C\+\+, we must carefully look at all of these approaches and identify which we wish ideas are suitable for adoption into C\+\+. Below is a list of the libraries and standards from which this proposal will draw:
+The *affinity problem* has existed for some time, and there are a number of third-party libraries and standards which provide APIs to solve the problem. In order to standardize this process for C\+\+, we must carefully look at all of these approaches and identify which ideas are suitable for adoption into C\+\+. Below is a list of the libraries and standards from which this proposal will draw:
 
 * Portable Hardware Locality [[4]][hwloc]
 * SYCL 1.2 [[5]][sycl-1-2-1]
@@ -183,7 +183,7 @@ This can be scaled to heterogeneous and distributed systems, as the relative aff
 
 ## Inaccessible memory
 
-The initial solution proposed by this paper may only target systems with a single addressable memory region. It may thus exclude devices like discrete GPUs. However, in order to maintain a unified interface going forward, the initial solution should consider these devices and be able to scale to support them in the future.
+The initial solution proposed by this paper may only target systems with a single addressable memory region. It may therefore exclude certain heterogeneous devices such as some discrete GPUs. However, in order to maintain a unified interface going forward, the initial solution should consider these devices and be able to scale to support them in the future.
 
 # Proposal
 
@@ -432,19 +432,19 @@ There are a number of additional features which we are considering for inclusion
 
 ## Migrating data
 
-This paper currently provides a mechanism for detecting whether two *executors* share a common memory locality. However it does not provide a way to invoke migration of data allocated local to one *executor* into the locality of another *executor*.
+This paper currently provides a mechanism for detecting whether two *executors* share a common memory locality. However, it does not provide a way to invoke migration of data allocated local to one *executor* into the locality of another *executor*.
 
-We envision that this mechanic could be facilitated by a customization point on two *executors* and perhaps a `spann` or `mdspan` accessor.
+We envision that this mechanism could be facilitated by a customization point on two *executors* and perhaps a `span` or `mdspan` accessor.
 
 ## Supporting different affinity domains
 
 This paper currently assumes a NUMA-like system, however there are many other kinds of systems with many different architectures with different kinds of processors, memory and connections between them.
 
-In order to accurately take advantage of the range of systems available now and in the future we will need some way to parameterized or enumerate the different affinity domains which an executor can structure around.
+In order to accurately take advantage of the range of systems available now and in the future, we will need some way to parameterize or enumerate the different affinity domains around which an executor can be structured.
 
-Furthermore in order to have control over those affinity domains we need a way in which to mask out the components of that domain that we wish to work with.
+Furthermore, in order to have control over those affinity domains, we need a way to mask out the components of the domain that we wish to work with.
 
-However whichever option we opt for, it must be in such a way as to allow further additions as new system architectures become available.
+However, whichever option we choose, it must be designed in such a way as to allow further additions as new system architectures become available.
````
**affinity/index.md** (10 additions & 8 deletions)

````diff
@@ -1,19 +1,21 @@
-# Supporting Heterogeneous & Distributed Computing Through Affinity
+# P1436: Executor properties for affinity-based execution
 
 |||
 |---|---|
 | ID | CP013 |
-| Name | Executor properties for affinity-based execution <br> System topology discovery for heterogeneous & distributed computing |
-| Target | ISO C++ SG1 SG14 |
+| Name | Executor properties for affinity-based execution |
+| Target | ISO C++ SG1, SG14, LEWG|
 | Initial creation | 15 November 2017 |
-| Last update |12 August 2018|
-| Reply-to |Michael Wong <michael.wong@codeplay.com> |
+| Last update |21 January 2019|
+| Reply-to |Gordon Brown <gordon@codeplay.com>|
 | Original author | Gordon Brown <gordon@codeplay.com>|
-| Contributors | Ruyman Reyes <ruyman@codeplay.com>, Michael Wong <michael.wong@codeplay.com>, H. Carter Edwards <hcedwar@sandia.gov>, Thomas Rodgers <rodgert@twrodgers.com>|
+| Contributors | Ruyman Reyes <ruyman@codeplay.com>, Michael Wong <michael.wong@codeplay.com>, H. Carter Edwards <hcedwar@sandia.gov>, Thomas Rodgers <rodgert@twrodgers.com>, Mark Hoemmen <mhoemme@sandia.gov>|
 
 ## Overview
 
-This paper provides an initial meta-framework for the drives toward memory affinity for C++, given the direction from Toronto 2017 SG1 meeting that we should look towards defining affinity for C++ before looking at inaccessible memory as a solution to the separate memory problem towards supporting heterogeneous and distributed computing.
+This paper is the result of a request from SG1 at the 2018 San Diego meeting to split P0796: Supporting Heterogeneous & Distributed Computing Through Affinity [[35]][p0796] into two separate papers, one for the high-level interface and one for the low-level interface. This paper focusses on the high-level interface: a series of properties for querying affinity relationships and requesting affinity on work being executed. [[36]][p1437] focusses on the low-level interface: a mechanism for discovering the topology and affinity properties of a given system.
+
+The aim of this paper is to provide a number of executor properties that, if supported, allow the user of an executor to query and manipulate the binding between *execution agents* and the underlying *execution resources* of the *threads of execution* they are run on.
 
 ## Versions
 
@@ -23,7 +25,7 @@ This paper provides an initial meta-framework for the drives toward memory affin
 |[P0796r1][p0796r1]|_Published_|
 |[D0796r2][p0796r2]|_Published_|
 |[D0796r3][p0796r3]|_Published_|
-|[DXXX1r0](cpp-20/dXXX1r0.md) <br> [DXXX2r0](cpp-20/dXXX2r0.md)|_Work In Progress_|
````