Commit c99c7b1

Gordon authored and committed
Fix uses of "C++" in test.
1 parent d7c02c2 commit c99c7b1

1 file changed: 14 additions and 14 deletions

affinity/cpp-20/d0796r3.md

Lines changed: 14 additions & 14 deletions
@@ -43,7 +43,7 @@
4343

4444
# Abstract
4545

46-
This paper provides an initial meta-framework for the drives toward an execution and memory affinity model for C++. It accounts for feedback from the Toronto 2017 SG1 meeting on Data Movement in C++ [[1]][p0687r0] that we should define affinity for C++ first, before considering inaccessible memory as a solution to the separate memory problem towards supporting heterogeneous and distributed computing.
46+
This paper provides an initial meta-framework for the drives toward an execution and memory affinity model for C\+\+. It accounts for feedback from the Toronto 2017 SG1 meeting on Data Movement in C\+\+ [[1]][p0687r0] that we should define affinity for C\+\+ first, before considering inaccessible memory as a solution to the separate memory problem towards supporting heterogeneous and distributed computing.
4747

4848
This paper is split into two main parts:
4949

@@ -58,13 +58,13 @@ On almost all computer architectures, the cost of accessing different data may d
5858

5959
One strategy to improve applications' performance, given the importance of affinity, is processor and memory *binding*. Keeping a process bound to a specific thread and local memory region optimizes cache affinity. It also reduces context switching and unnecessary scheduler activity. Since memory accesses to remote locations incur higher latency and/or lower bandwidth, control of thread placement to enforce affinity within parallel applications is crucial to fuel all the cores and to exploit the full performance of the memory subsystem on NUMA computers.
6060

61-
Operating systems (OSes) traditionally take responsibility for assigning threads or processes to run on processing units. However, OSes may use high-level policies for this assignment that do not necessarily match the optimal usage pattern for a given application. Application developers must leverage the placement of memory and *placement of threads* for best performance on current and future architectures. For C++ developers to achieve this, native support for *placement of threads and memory* is critical for application portability. We will refer to this as the *affinity problem*.
61+
Operating systems (OSes) traditionally take responsibility for assigning threads or processes to run on processing units. However, OSes may use high-level policies for this assignment that do not necessarily match the optimal usage pattern for a given application. Application developers must leverage the placement of memory and *placement of threads* for best performance on current and future architectures. For C\+\+ developers to achieve this, native support for *placement of threads and memory* is critical for application portability. We will refer to this as the *affinity problem*.
6262

6363
The affinity problem is especially challenging for applications whose behavior changes over time or is hard to predict, or when different applications interfere with each other's performance. Today, most OSes already can group processing units according to their locality and distribute processes, while keeping threads close to the initial thread, or even avoid migrating threads and maintain first touch policy. Nevertheless, most programs can change their work distribution, especially in the presence of nested parallelism.
6464

6565
Frequently, data are initialized at the beginning of the program by the initial thread and are used by multiple threads. While some OSes automatically migrate threads or data for better affinity, migration may have high overhead. In an optimal case, the OS may automatically detect which threads access which data most frequently, or it may replicate data which are read by multiple threads, or migrate data which are modified and used by threads residing on remote locality groups. However, the OS often does a reasonable job, if the machine is not overloaded, if the application carefully uses first-touch allocation, and if the program does not change its behavior with respect to locality.
6666

67-
Consider a code example *(Listing 1)* that uses the C++17 parallel STL algorithm `for_each` to modify the entries of a `valarray` `a`. The example applies a loop body in a lambda to each entry of the `valarray` `a`, using an execution policy that distributes work in parallel across multiple CPU cores. We might expect this to be fast, but since `valarray` containers are initialized automatically and automatically allocated on the master thread's memory, we find that it is actually quite slow even when we have more than one thread.
67+
Consider a code example *(Listing 1)* that uses the C\+\+17 parallel STL algorithm `for_each` to modify the entries of a `valarray` `a`. The example applies a loop body in a lambda to each entry of the `valarray` `a`, using an execution policy that distributes work in parallel across multiple CPU cores. We might expect this to be fast, but since `valarray` containers are initialized automatically and automatically allocated on the master thread's memory, we find that it is actually quite slow even when we have more than one thread.
6868

6969
```cpp
7070
// C++ valarray STL containers are initialized automatically.
@@ -88,7 +88,7 @@ The affinity interface we propose should help computers achieve a much higher fr
8888
8989
# Background Research: State of the Art
9090
91-
The problem of effectively partitioning a system’s topology has existed for some time, and there are a range of third-party libraries and standards which provide APIs to solve the problem. In order to standardize this process for C++, we must carefully look at all of these approaches and identify which we wish to adopt. Below is a list of the libraries and standards from which this proposal will draw:
91+
The problem of effectively partitioning a system’s topology has existed for some time, and there are a range of third-party libraries and standards which provide APIs to solve the problem. In order to standardize this process for C\+\+, we must carefully look at all of these approaches and identify which we wish to adopt. Below is a list of the libraries and standards from which this proposal will draw:
9292
9393
* Portable Hardware Locality [[4]][hwloc]
9494
* SYCL 1.2 [[5]][sycl-1-2-1]
@@ -103,7 +103,7 @@ The problem of effectively partitioning a system’s topology has existed for so
103103
* Windows SetThreadAffinityMask() [[14]][windows-set-thread-affinity-mask]
104104
* Chapel [[15]][chapel]
105105
* X10 [[16]][x10]
106-
* UPC++ [[17]][upc++]
106+
* UPC\+\+ [[17]][upc++]
107107
* TBB [[18]][tbb]
108108
* HPX [[19]][hpx]
109109
* MADNESS [[20]][madness][[32]][madness-journal]
@@ -114,12 +114,12 @@ Some systems give additional user control through explicit binding of threads to
114114
115115
## Problem Space
116116
117-
In this paper we describe the problem space of affinity for C++, the various challenges which need to be addressed in defining a partitioning and affinity interface for C++, and some suggested solutions. These include:
117+
In this paper we describe the problem space of affinity for C\+\+, the various challenges which need to be addressed in defining a partitioning and affinity interface for C\+\+, and some suggested solutions. These include:
118118
119119
* How to represent, identify and navigate the topology of execution resources available within a heterogeneous or distributed system.
120120
* How to query and measure the relative affinity between different execution resources within a system.
121121
* How to bind execution and allocation to particular execution resource(s).
122-
* What kind of and level of interface(s) should be provided by C++ for affinity.
122+
* What kind of and level of interface(s) should be provided by C\+\+ for affinity.
123123
124124
Wherever possible, we also evaluate how an affinity-based solution could be scaled to support both distributed and heterogeneous systems. We also have addressed some aspects of dynamic topology discovery.
125125
@@ -131,9 +131,9 @@ There are also some additional challenges which we have been investigating but a
131131
132132
### Querying and representing the system topology
133133
134-
The first task in allowing C++ applications to leverage memory locality is to provide the ability to query a *system* for its *resource topology* (commonly represented as a tree or graph) and traverse its *execution resources*.
134+
The first task in allowing C\+\+ applications to leverage memory locality is to provide the ability to query a *system* for its *resource topology* (commonly represented as a tree or graph) and traverse its *execution resources*.
135135
136-
The capability of querying underlying *execution resources* of a given *system* is particularly important towards supporting affinity control in C++. The current proposal for executors [[22]][p0443r7] mentions execution resources in passing, but leaves the term largely unspecified. This is intentional: *execution resources* will vary greatly between one implementation and another, and it is out of the scope of the current executors proposal to define those. There is current work [[23]][p0737r0] on extending the executors proposal to describe a typical interface for an *execution context*. In this paper a typical *execution context* is defined with an interface for construction and comparison, and for retrieving an *executor*, waiting on submitted work to complete and querying the underlying *execution resource*. Extending the executors interface to provide topology information can serve as a basis for providing a unified interface to expose affinity. This interface cannot mandate a specific architectural definition, and must be generic enough that future architectural evolutions can still be expressed.
136+
The capability of querying underlying *execution resources* of a given *system* is particularly important towards supporting affinity control in C\+\+. The current proposal for executors [[22]][p0443r7] mentions execution resources in passing, but leaves the term largely unspecified. This is intentional: *execution resources* will vary greatly between one implementation and another, and it is out of the scope of the current executors proposal to define those. There is current work [[23]][p0737r0] on extending the executors proposal to describe a typical interface for an *execution context*. In this paper a typical *execution context* is defined with an interface for construction and comparison, and for retrieving an *executor*, waiting on submitted work to complete and querying the underlying *execution resource*. Extending the executors interface to provide topology information can serve as a basis for providing a unified interface to expose affinity. This interface cannot mandate a specific architectural definition, and must be generic enough that future architectural evolutions can still be expressed.
137137
138138
Two important considerations when defining a unified interface for querying the *resource topology* of a *system* are (a) what level of abstraction such an interface should have, and (b) at what granularity it should describe the topology's *execution resources*. As both the level of abstraction of an *execution resource* and the granularity that it is described in will vary greatly from one implementation to another, it’s important for the interface to be generic enough to support any level of abstraction. To achieve this we propose a generic hierarchical structure of *execution resources*, each *execution resource* being composed of other *execution resources* recursively. Each *execution resource* within this hierarchy can be used to place memory (i.e., allocate memory within the *execution resource’s* memory region), place execution (i.e., bind an execution to an *execution resource’s execution agents*), or both.
139139
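
The recursive composition described above can be pictured with a small sketch. The `resource_node` type and the `count_leaves` and `depth` helpers are hypothetical names chosen for illustration; they are not the `execution_resource` interface the paper proposes, and they assume only that each resource has a name and a list of nested resources.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for illustration only; this is not the
// execution_resource interface proposed in the paper.
struct resource_node {
    std::string name;                     // e.g. "system", "numa0", "core1"
    std::vector<resource_node> children;  // nested resources; empty for a leaf
};

// Count the finest-grained resources (leaves) reachable from a node.
std::size_t count_leaves(const resource_node& node) {
    if (node.children.empty()) {
        return 1;
    }
    std::size_t total = 0;
    for (const resource_node& child : node.children) {
        total += count_leaves(child);
    }
    return total;
}

// Number of levels in the hierarchy rooted at a node (a leaf has depth 1).
std::size_t depth(const resource_node& node) {
    std::size_t deepest = 0;
    for (const resource_node& child : node.children) {
        deepest = std::max(deepest, depth(child));
    }
    return 1 + deepest;
}
```

Because every level has the same type, the same traversal works whether an implementation describes a system coarsely (one node per NUMA domain) or finely (down to individual cores).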
@@ -151,7 +151,7 @@ The interface for querying the *resource topology* of a *system* must be flexibl
151151
152152
### Topology discovery & fault tolerance
153153
154-
In traditional single-CPU systems, users may reason about the execution resources with standard constructs such as `std::thread`, `std::this_thread` and `thread_local`. This is because the C++ machine model requires that a system have **at least one thread of execution, some memory, and some I/O capabilities**. Thus, for these systems, users may make some assumptions about the system resource topology as part of the language and its supporting standard library. For example, one may always ask for the available hardware concurrency, since there is always at least one thread, and one may always use thread-local storage.
154+
In traditional single-CPU systems, users may reason about the execution resources with standard constructs such as `std::thread`, `std::this_thread` and `thread_local`. This is because the C\+\+ machine model requires that a system have **at least one thread of execution, some memory, and some I/O capabilities**. Thus, for these systems, users may make some assumptions about the system resource topology as part of the language and its supporting standard library. For example, one may always ask for the available hardware concurrency, since there is always at least one thread, and one may always use thread-local storage.
155155
156156
This assumption, however, does not hold on newer, more complex systems, especially on heterogeneous systems. On these systems, even the type and number of high-level resources available in a particular *system* is not known until the physical hardware attached to a particular system has been identified by the program. This often happens as part of a run-time initialization API [[6]][opencl-2-2] [[7]][hsa] which makes the resources available through some software abstraction. Furthermore, the resources which are identified often have different levels of parallel and concurrent execution capabilities. We refer to this process of identifying resources and their capabilities as *topology discovery*, and we call the point at which this occurs the *point of discovery*.
157157
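
As a minimal sketch of the traditional assumptions mentioned above, the following uses only standard facilities: `std::thread::hardware_concurrency` and `thread_local`. The helper names `usable_concurrency` and `run_workers` are hypothetical and exist only for this illustration. Note that `hardware_concurrency` is a hint and may return 0 when the value is not computable, so the sketch falls back to 1.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical helper: hardware_concurrency() is only a hint and may
// return 0, so fall back to the one thread that always exists.
unsigned usable_concurrency() {
    return std::max(1u, std::thread::hardware_concurrency());
}

// Each thread gets its own independent copy of this counter.
thread_local int local_counter = 0;

// Hypothetical helper: start n threads, let each increment its own
// thread-local counter, and report the final value seen by each thread.
std::vector<int> run_workers(unsigned n, int increments) {
    std::vector<int> results(n, 0);
    std::vector<std::thread> workers;
    workers.reserve(n);
    for (unsigned i = 0; i < n; ++i) {
        workers.emplace_back([&results, i, increments] {
            for (int k = 0; k < increments; ++k) {
                ++local_counter;
            }
            results[i] = local_counter;  // each thread writes its own slot
        });
    }
    for (std::thread& worker : workers) {
        worker.join();
    }
    return results;
}
```

Nothing here says where the threads run or where their memory lives, and none of it describes resources that only become visible after a run-time discovery step.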
@@ -282,7 +282,7 @@ The **system topology** is comprised of a directed acyclic graph (DAG) of **exec
282282

283283
The **system topology** can be discovered by calling `this_system::discover_topology`. This will discover all **execution resources** available within the system and construct the **system topology** DAG, describing a read-only snapshot at the point of the call, and then return an `execution_resource` object exposing the **system execution resource**.
284284

285-
A call to `this_system::discover_topology` may invoke C++ library, system or third party library API calls required to discover certain **execution resources**. However, `this_system::discover_topology` must be thread safe and must initialize and finalize any OS or third-party state before returning.
285+
A call to `this_system::discover_topology` may invoke C\+\+ library, system or third party library API calls required to discover certain **execution resources**. However, `this_system::discover_topology` must be thread safe and must initialize and finalize any OS or third-party state before returning.
286286

287287
### Execution resources
288288

@@ -774,7 +774,7 @@ Thanks to Christopher Di Bella, Toomas Remmelg, and Morris Hafner for their revi
774774
# References
775775

776776
[p0687r0]: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0687r0.pdf
777-
[[1]][p0687r0] P0687r0: Data Movement in C++
777+
[[1]][p0687r0] P0687r0: Data Movement in C\+\+
778778

779779
[design-of-openmp]: https://link.springer.com/chapter/10.1007/978-3-642-30961-8_2
780780
[[2]][design-of-openmp] The Design of OpenMP Thread Affinity
@@ -821,7 +821,7 @@ Thanks to Christopher Di Bella, Toomas Remmelg, and Morris Hafner for their revi
821821
[[16]][x10] X10
822822

823823
[upc++]: https://bitbucket.org/berkeleylab/upcxx/wiki/Home
824-
[[17]][upc++] UPC++
824+
[[17]][upc++] UPC\+\+
825825

826826
[tbb]: https://www.threadingbuildingblocks.org/
827827
[[18]][tbb] TBB
@@ -837,7 +837,7 @@ Thanks to Christopher Di Bella, Toomas Remmelg, and Morris Hafner for their revi
837837

838838
[p0443r7]:
839839
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0443r7.html
840-
[[22]][p0443r7] A Unified Executors Proposal for C++
840+
[[22]][p0443r7] A Unified Executors Proposal for C\+\+
841841

842842
[p0737r0]: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0737r0.html
843843
[[23]][p0737r0] P0737r0 : Execution Context of Execution Agents
