Commit 84ff94a

Gordon authored and committed
Merge branch 'master' into issue-41
2 parents 92f6c80 + c99c7b1 commit 84ff94a

1 file changed

+14
-14
lines changed

affinity/cpp-20/d0796r3.md

Lines changed: 14 additions & 14 deletions
Original file line number | Diff line number | Diff line change
@@ -48,7 +48,7 @@
4848

4949
# Abstract
5050

51-
This paper provides an initial meta-framework for the drives toward an execution and memory affinity model for C++. It accounts for feedback from the Toronto 2017 SG1 meeting on Data Movement in C++ [[1]][p0687r0] that we should define affinity for C++ first, before considering inaccessible memory as a solution to the separate memory problem towards supporting heterogeneous and distributed computing.
51+
This paper provides an initial meta-framework for the drives toward an execution and memory affinity model for C\+\+. It accounts for feedback from the Toronto 2017 SG1 meeting on Data Movement in C\+\+ [[1]][p0687r0] that we should define affinity for C\+\+ first, before considering inaccessible memory as a solution to the separate memory problem towards supporting heterogeneous and distributed computing.
5252

5353
This paper is split into two main parts:
5454

@@ -63,13 +63,13 @@ On almost all computer architectures, the cost of accessing different data may d
6363

6464
One strategy to improve applications' performance, given the importance of affinity, is processor and memory *binding*. Keeping a process bound to a specific thread and local memory region optimizes cache affinity. It also reduces context switching and unnecessary scheduler activity. Since memory accesses to remote locations incur higher latency and/or lower bandwidth, control of thread placement to enforce affinity within parallel applications is crucial to fuel all the cores and to exploit the full performance of the memory subsystem on NUMA computers.
6565

66-
Operating systems (OSes) traditionally take responsibility for assigning threads or processes to run on processing units. However, OSes may use high-level policies for this assignment that do not necessarily match the optimal usage pattern for a given application. Application developers must leverage the placement of memory and *placement of threads* for best performance on current and future architectures. For C++ developers to achieve this, native support for *placement of threads and memory* is critical for application portability. We will refer to this as the *affinity problem*.
66+
Operating systems (OSes) traditionally take responsibility for assigning threads or processes to run on processing units. However, OSes may use high-level policies for this assignment that do not necessarily match the optimal usage pattern for a given application. Application developers must leverage the placement of memory and *placement of threads* for best performance on current and future architectures. For C\+\+ developers to achieve this, native support for *placement of threads and memory* is critical for application portability. We will refer to this as the *affinity problem*.
6767

6868
The affinity problem is especially challenging for applications whose behavior changes over time or is hard to predict, or when different applications interfere with each other's performance. Today, most OSes already can group processing units according to their locality and distribute processes, while keeping threads close to the initial thread, or even avoid migrating threads and maintain first touch policy. Nevertheless, most programs can change their work distribution, especially in the presence of nested parallelism.
6969

7070
Frequently, data are initialized at the beginning of the program by the initial thread and are used by multiple threads. While some OSes automatically migrate threads or data for better affinity, migration may have high overhead. In an optimal case, the OS may automatically detect which threads access which data most frequently, or it may replicate data which are read by multiple threads, or migrate data which are modified and used by threads residing on remote locality groups. However, the OS often does a reasonable job, if the machine is not overloaded, if the application carefully uses first-touch allocation, and if the program does not change its behavior with respect to locality.
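As an illustration of the first-touch allocation mentioned above, here is a minimal sketch in standard C++ that initializes and then processes an array with the same per-thread partition. The helper names `for_each_chunk` and `first_touch_sum` are hypothetical and are not part of the paper. The sketch assumes an OS with a first-touch page placement policy, and it cannot pin threads, so nothing guarantees that the second pass runs near the memory touched in the first pass; that gap is the affinity problem the paper describes.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical helper: split [0, n) into `parts` contiguous chunks and run
// fn(begin, end) on each chunk in its own thread. `parts` must be nonzero.
template <class Fn>
void for_each_chunk(std::size_t n, unsigned parts, Fn fn) {
    std::vector<std::thread> pool;
    pool.reserve(parts);
    const std::size_t chunk = (n + parts - 1) / parts;  // ceiling division
    for (unsigned p = 0; p < parts; ++p) {
        const std::size_t begin = std::min(n, p * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        pool.emplace_back([=] { fn(begin, end); });
    }
    for (auto& t : pool) t.join();
}

// Hypothetical example: initialize and update an array with the same partition.
double first_touch_sum(std::size_t n, unsigned parts) {
    // new double[n] default-initializes, so no element is written here and
    // (on a first-touch OS) the pages are typically not yet placed.
    std::unique_ptr<double[]> data(new double[n]);
    double* a = data.get();

    // First touch: each thread writes its own chunk.
    for_each_chunk(n, parts, [a](std::size_t b, std::size_t e) {
        for (std::size_t i = b; i < e; ++i) a[i] = 1.0;
    });

    // Compute pass: the same partition is reused. Without an affinity
    // interface the threads are not bound, so locality is only hoped for.
    for_each_chunk(n, parts, [a](std::size_t b, std::size_t e) {
        for (std::size_t i = b; i < e; ++i) a[i] *= 2.0;
    });

    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += a[i];
    return sum;
}
```

A call such as `first_touch_sum(1000, 4)` runs both passes on four threads. Whether this improves locality depends on the OS policy and on where the scheduler places the threads.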
7171

72-
Consider a code example *(Listing 1)* that uses the C++17 parallel STL algorithm `for_each` to modify the entries of a `valarray` `a`. The example applies a loop body in a lambda to each entry of the `valarray` `a`, using an execution policy that distributes work in parallel across multiple CPU cores. We might expect this to be fast, but since `valarray` containers are initialized automatically and automatically allocated on the master thread's memory, we find that it is actually quite slow even when we have more than one thread.
72+
Consider a code example *(Listing 1)* that uses the C\+\+17 parallel STL algorithm `for_each` to modify the entries of a `valarray` `a`. The example applies a loop body in a lambda to each entry of the `valarray` `a`, using an execution policy that distributes work in parallel across multiple CPU cores. We might expect this to be fast, but since `valarray` containers are initialized automatically and automatically allocated on the master thread's memory, we find that it is actually quite slow even when we have more than one thread.
7373

7474
```cpp
7575
// C++ valarray STL containers are initialized automatically.
@@ -93,7 +93,7 @@ The affinity interface we propose should help computers achieve a much higher fr
9393
9494
# Background Research: State of the Art
9595
96-
The problem of effectively partitioning a system’s topology has existed for some time, and there are a range of third-party libraries and standards which provide APIs to solve the problem. In order to standardize this process for C++, we must carefully look at all of these approaches and identify which we wish to adopt. Below is a list of the libraries and standards from which this proposal will draw:
96+
The problem of effectively partitioning a system’s topology has existed for some time, and there are a range of third-party libraries and standards which provide APIs to solve the problem. In order to standardize this process for C\+\+, we must carefully look at all of these approaches and identify which we wish to adopt. Below is a list of the libraries and standards from which this proposal will draw:
9797
9898
* Portable Hardware Locality [[4]][hwloc]
9999
* SYCL 1.2 [[5]][sycl-1-2-1]
@@ -108,7 +108,7 @@ The problem of effectively partitioning a system’s topology has existed for so
108108
* Windows SetThreadAffinityMask() [[14]][windows-set-thread-affinity-mask]
109109
* Chapel [[15]][chapel]
110110
* X10 [[16]][x10]
111-
* UPC++ [[17]][upc++]
111+
* UPC\+\+ [[17]][upc++]
112112
* TBB [[18]][tbb]
113113
* HPX [[19]][hpx]
114114
* MADNESS [[20]][madness][[32]][madness-journal]
@@ -119,12 +119,12 @@ Some systems give additional user control through explicit binding of threads to
119119
120120
## Problem Space
121121
122-
In this paper we describe the problem space of affinity for C++, the various challenges which need to be addressed in defining a partitioning and affinity interface for C++, and some suggested solutions. These include:
122+
In this paper we describe the problem space of affinity for C\+\+, the various challenges which need to be addressed in defining a partitioning and affinity interface for C\+\+, and some suggested solutions. These include:
123123
124124
* How to represent, identify and navigate the topology of execution resources available within a heterogeneous or distributed system.
125125
* How to query and measure the relative affinity between different execution resources within a system.
126126
* How to bind execution and allocation to particular execution resource(s).
127-
* What kind of and level of interface(s) should be provided by C++ for affinity.
127+
* What kind of and level of interface(s) should be provided by C\+\+ for affinity.
128128
129129
Wherever possible, we also evaluate how an affinity-based solution could be scaled to support both distributed and heterogeneous systems. We also have addressed some aspects of dynamic topology discovery.
130130
@@ -136,9 +136,9 @@ There are also some additional challenges which we have been investigating but a
136136
137137
### Querying and representing the system topology
138138
139-
The first task in allowing C++ applications to leverage memory locality is to provide the ability to query a *system* for its *resource topology* (commonly represented as a tree or graph) and traverse its *execution resources*.
139+
The first task in allowing C\+\+ applications to leverage memory locality is to provide the ability to query a *system* for its *resource topology* (commonly represented as a tree or graph) and traverse its *execution resources*.
140140
141-
The capability of querying underlying *execution resources* of a given *system* is particularly important towards supporting affinity control in C++. The current proposal for executors [[22]][p0443r7] mentions execution resources in passing, but leaves the term largely unspecified. This is intentional: *execution resources* will vary greatly between one implementation and another, and it is out of the scope of the current executors proposal to define those. There is current work [[23]][p0737r0] on extending the executors proposal to describe a typical interface for an *execution context*. In this paper a typical *execution context* is defined with an interface for construction and comparison, and for retrieving an *executor*, waiting on submitted work to complete and querying the underlying *execution resource*. Extending the executors interface to provide topology information can serve as a basis for providing a unified interface to expose affinity. This interface cannot mandate a specific architectural definition, and must be generic enough that future architectural evolutions can still be expressed.
141+
The capability of querying underlying *execution resources* of a given *system* is particularly important towards supporting affinity control in C\+\+. The current proposal for executors [[22]][p0443r7] mentions execution resources in passing, but leaves the term largely unspecified. This is intentional: *execution resources* will vary greatly between one implementation and another, and it is out of the scope of the current executors proposal to define those. There is current work [[23]][p0737r0] on extending the executors proposal to describe a typical interface for an *execution context*. In this paper a typical *execution context* is defined with an interface for construction and comparison, and for retrieving an *executor*, waiting on submitted work to complete and querying the underlying *execution resource*. Extending the executors interface to provide topology information can serve as a basis for providing a unified interface to expose affinity. This interface cannot mandate a specific architectural definition, and must be generic enough that future architectural evolutions can still be expressed.
142142
143143
Two important considerations when defining a unified interface for querying the *resource topology* of a *system*, are (a) what level of abstraction such an interface should have, and (b) at what granularity it should describe the topology's *execution resources*. As both the level of abstraction of an *execution resource* and the granularity that it is described in will vary greatly from one implementation to another, it’s important for the interface to be generic enough to support any level of abstraction. To achieve this we propose a generic hierarchical structure of *execution resources*, each *execution resource* being composed of other *execution resources* recursively. Each *execution resource* within this hierarchy can be used to place memory (i.e., allocate memory within the *execution resource’s* memory region), place execution (i.e., bind an execution to an *execution resource’s execution agents*), or both.
144144
@@ -157,7 +157,7 @@ The interface for querying the *resource topology* of a *system* must be flexibl
157157
158158
### Topology discovery & fault tolerance
159159
160-
In traditional single-CPU systems, users may reason about the execution resources with standard constructs such as `std::thread`, `std::this_thread` and `thread_local`. This is because the C++ machine model requires that a system have **at least one thread of execution, some memory, and some I/O capabilities**. Thus, for these systems, users may make some assumptions about the system resource topology as part of the language and its supporting standard library. For example, one may always ask for the available hardware concurrency, since there is always at least one thread, and one may always use thread-local storage.
160+
In traditional single-CPU systems, users may reason about the execution resources with standard constructs such as `std::thread`, `std::this_thread` and `thread_local`. This is because the C\+\+ machine model requires that a system have **at least one thread of execution, some memory, and some I/O capabilities**. Thus, for these systems, users may make some assumptions about the system resource topology as part of the language and its supporting standard library. For example, one may always ask for the available hardware concurrency, since there is always at least one thread, and one may always use thread-local storage.
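The standard constructs named above can be sketched as follows. The helpers `usable_concurrency` and `run_workers` are hypothetical names introduced for illustration only. Note that `std::thread::hardware_concurrency()` is only a hint and may return 0 when the value is not computable, so the sketch falls back to 1 in that case.

```cpp
#include <thread>
#include <vector>

// Hypothetical helper: hardware_concurrency() is a hint and may return 0
// when the value is not computable; treat that as "at least one thread".
unsigned usable_concurrency() {
    const unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1u : n;
}

// Every thread of execution gets its own copy of this variable.
thread_local int per_thread_counter = 0;

// Hypothetical helper: start `workers` threads; each increments its own
// thread-local counter `bumps` times and reports the value it observed.
std::vector<int> run_workers(unsigned workers, int bumps) {
    std::vector<int> results(workers, -1);
    std::vector<std::thread> pool;
    pool.reserve(workers);
    for (unsigned i = 0; i < workers; ++i) {
        pool.emplace_back([&results, i, bumps] {
            for (int b = 0; b < bumps; ++b) ++per_thread_counter;
            results[i] = per_thread_counter;  // one element per thread, no race
        });
    }
    for (auto& t : pool) t.join();
    return results;
}
```

Each worker observes only its own increments, and the calling thread's copy of `per_thread_counter` is left untouched. None of these constructs says where a thread runs or which memory is close to it.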
161161
162162
This assumption, however, does not hold on newer, more complex systems, especially on heterogeneous systems. On these systems, even the type and number of high-level resources available in a particular *system* is not known until the physical hardware attached to a particular system has been identified by the program. This often happens as part of a run-time initialization API [[6]][opencl-2-2] [[7]][hsa] which makes the resources available through some software abstraction. Furthermore, the resources which are identified often have different levels of parallel and concurrent execution capabilities. We refer to this process of identifying resources and their capabilities as *topology discovery*, and we call the point at which this occurs the *point of discovery*.
163163
@@ -294,7 +294,7 @@ Each **memory resource** may also have any number of child **memory resources**
294294

295295
The **system topology** can be discovered by calling `this_system::discover_topology`. This will discover all **execution resources** and **memory resources** available within the system and construct the **system topology** DAG, describing a read-only snapshot at the point of the call, and then return an `execution_resource` object exposing the **system execution resource**.
296296

297-
A call to `this_system::discover_topology` may invoke C++ library, system or third party library API calls required to discover certain **execution resources**. However, `this_system::discover_topology` must be thread safe and must initialize and finalize any OS or third-party state before returning.
297+
A call to `this_system::discover_topology` may invoke C\+\+ library, system or third party library API calls required to discover certain **execution resources**. However, `this_system::discover_topology` must be thread safe and must initialize and finalize any OS or third-party state before returning.
298298

299299
Below *(Figure 2)* is an example of what a typical **system topology** could look like.
300300

@@ -858,7 +858,7 @@ Thanks to Christopher Di Bella, Toomas Remmelg, and Morris Hafner for their revi
858858
# References
859859

860860
[p0687r0]: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0687r0.pdf
861-
[[1]][p0687r0] P0687r0: Data Movement in C++
861+
[[1]][p0687r0] P0687r0: Data Movement in C\+\+
862862

863863
[design-of-openmp]: https://link.springer.com/chapter/10.1007/978-3-642-30961-8_2
864864
[[2]][design-of-openmp] The Design of OpenMP Thread Affinity
@@ -905,7 +905,7 @@ Thanks to Christopher Di Bella, Toomas Remmelg, and Morris Hafner for their revi
905905
[[16]][x10] X10
906906

907907
[upc++]: https://bitbucket.org/berkeleylab/upcxx/wiki/Home
908-
[[17]][upc++] UPC++
908+
[[17]][upc++] UPC\+\+
909909

910910
[tbb]: https://www.threadingbuildingblocks.org/
911911
[[18]][tbb] TBB
@@ -921,7 +921,7 @@ Thanks to Christopher Di Bella, Toomas Remmelg, and Morris Hafner for their revi
921921

922922
[p0443r7]:
923923
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0443r7.html
924-
[[22]][p0443r7] A Unified Executors Proposal for C++
924+
[[22]][p0443r7] A Unified Executors Proposal for C\+\+
925925

926926
[p0737r0]: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0737r0.html
927927
[[23]][p0737r0] P0737r0 : Execution Context of Execution Agents
