Merge branch 'master' into issue-41

Gordon · Gordon · commit 84ff94a46219 · 2018-10-08T14:55:08.000+01:00
diff --git a/affinity/cpp-20/d0796r3.md b/affinity/cpp-20/d0796r3.md
@@ -48,7 +48,7 @@
 
 # Abstract
 
-This paper provides an initial meta-framework for the drives toward an execution and memory affinity model for C++.  It accounts for feedback from the Toronto 2017 SG1 meeting on Data Movement in C++ [[1]][p0687r0] that we should define affinity for C++ first, before considering inaccessible memory as a solution to the separate memory problem towards supporting heterogeneous and distributed computing.
+This paper provides an initial meta-framework for the drives toward an execution and memory affinity model for C\+\+.  It accounts for feedback from the Toronto 2017 SG1 meeting on Data Movement in C\+\+ [[1]][p0687r0] that we should define affinity for C\+\+ first, before considering inaccessible memory as a solution to the separate memory problem towards supporting heterogeneous and distributed computing.
 
 This paper is split into two main parts:
 
@@ -63,13 +63,13 @@ On almost all computer architectures, the cost of accessing different data may d
 
 One strategy to improve applications' performance, given the importance of affinity, is processor and memory *binding*. Keeping a process bound to a specific thread and local memory region optimizes cache affinity. It also reduces context switching and unnecessary scheduler activity. Since memory accesses to remote locations incur higher latency and/or lower bandwidth, control of thread placement to enforce affinity within parallel applications is crucial to fuel all the cores and to exploit the full performance of the memory subsystem on NUMA computers. 
 
-Operating systems (OSes) traditionally take responsibility for assigning threads or processes to run on processing units. However, OSes may use high-level policies for this assignment that do not necessarily match the optimal usage pattern for a given application. Application developers must leverage the placement of memory and *placement of threads* for best performance on current and future architectures. For C++ developers to achieve this, native support for *placement of threads and memory* is critical for application portability. We will refer to this as the *affinity problem*. 
+Operating systems (OSes) traditionally take responsibility for assigning threads or processes to run on processing units. However, OSes may use high-level policies for this assignment that do not necessarily match the optimal usage pattern for a given application. Application developers must leverage the placement of memory and *placement of threads* for best performance on current and future architectures. For C\+\+ developers to achieve this, native support for *placement of threads and memory* is critical for application portability. We will refer to this as the *affinity problem*. 
 
 The affinity problem is especially challenging for applications whose behavior changes over time or is hard to predict, or when different applications interfere with each other's performance. Today, most OSes already can group processing units according to their locality and distribute processes, while keeping threads close to the initial thread, or even avoid migrating threads and maintain first touch policy. Nevertheless, most programs can change their work distribution, especially in the presence of nested parallelism.
 
 Frequently, data are initialized at the beginning of the program by the initial thread and are used by multiple threads. While some OSes automatically migrate threads or data for better affinity, migration may have high overhead. In an optimal case, the OS may automatically detect which thread accesses which data most frequently, or it may replicate data which are read by multiple threads, or migrate data which are modified and used by threads residing on remote locality groups. However, the OS often does a reasonable job, if the machine is not overloaded, if the application carefully uses first-touch allocation, and if the program does not change its behavior with respect to locality.
 
-Consider a code example *(Listing 1)* that uses the C++17 parallel STL algorithm `for_each` to modify the entries of a `valarray` `a`.  The example applies a loop body in a lambda to each entry of the `valarray` `a`, using an execution policy that distributes work in parallel across multiple CPU cores. We might expect this to be fast, but since `valarray` containers are initialized automatically and automatically allocated on the master thread's memory, we find that it is actually quite slow even when we have more than one thread.
+Consider a code example *(Listing 1)* that uses the C\+\+17 parallel STL algorithm `for_each` to modify the entries of a `valarray` `a`.  The example applies a loop body in a lambda to each entry of the `valarray` `a`, using an execution policy that distributes work in parallel across multiple CPU cores. We might expect this to be fast, but since `valarray` containers are initialized automatically and automatically allocated on the master thread's memory, we find that it is actually quite slow even when we have more than one thread.
 
 ```cpp
 // C++ valarray STL containers are initialized automatically.
@@ -93,7 +93,7 @@ The affinity interface we propose should help computers achieve a much higher fr
 
 # Background Research: State of the Art
 
-The problem of effectively partitioning a system’s topology has existed for some time, and there are a range of third-party libraries and standards which provide APIs to solve the problem. In order to standardize this process for C++, we must carefully look at all of these approaches and identify which we wish to adopt. Below is a list of the libraries and standards from which this proposal will draw:
+The problem of effectively partitioning a system’s topology has existed for some time, and there are a range of third-party libraries and standards which provide APIs to solve the problem. In order to standardize this process for C\+\+, we must carefully look at all of these approaches and identify which we wish to adopt. Below is a list of the libraries and standards from which this proposal will draw:
 
 * Portable Hardware Locality [[4]][hwloc]
 * SYCL 1.2 [[5]][sycl-1-2-1]
@@ -108,7 +108,7 @@ The problem of effectively partitioning a system’s topology has existed for so
 * Windows SetThreadAffinityMask() [[14]][windows-set-thread-affinity-mask]
 * Chapel [[15]][chapel]
 * X10 [[16]][x10]
-* UPC++ [[17]][upc++]
+* UPC\+\+ [[17]][upc++]
 * TBB [[18]][tbb]
 * HPX [[19]][hpx]
 * MADNESS [[20]][madness][[32]][madness-journal]
@@ -119,12 +119,12 @@ Some systems give additional user control through explicit binding of threads to
 
 ## Problem Space
 
-In this paper we describe the problem space of affinity for C++, the various challenges which need to be addressed in defining a partitioning and affinity interface for C++, and some suggested solutions.  These include:
+In this paper we describe the problem space of affinity for C\+\+, the various challenges which need to be addressed in defining a partitioning and affinity interface for C\+\+, and some suggested solutions.  These include:
 
 * How to represent, identify and navigate the topology of execution resources available within a heterogeneous or distributed system.
 * How to query and measure the relative affinity between different execution resources within a system.
 * How to bind execution and allocation to particular execution resource(s).
-* What kind of and level of interface(s) should be provided by C++ for affinity.
+* What kind of and level of interface(s) should be provided by C\+\+ for affinity.
 
 Wherever possible, we also evaluate how an affinity-based solution could be scaled to support both distributed and heterogeneous systems. We also have addressed some aspects of dynamic topology discovery.
 
@@ -136,9 +136,9 @@ There are also some additional challenges which we have been investigating but a
 
 ### Querying and representing the system topology
 
-The first task in allowing C++ applications to leverage memory locality is to provide the ability to query a *system* for its *resource topology* (commonly represented as a tree or graph) and traverse its *execution resources*.
+The first task in allowing C\+\+ applications to leverage memory locality is to provide the ability to query a *system* for its *resource topology* (commonly represented as a tree or graph) and traverse its *execution resources*.
 
-The capability of querying underlying *execution resources* of a given *system* is particularly important towards supporting affinity control in C++. The current proposal for executors [[22]][p0443r7] mentions execution resources in passing, but leaves the term largely unspecified. This is intentional: *execution resources* will vary greatly between one implementation and another, and it is out of the scope of the current executors proposal to define those. There is current work [[23]][p0737r0] on extending the executors proposal to describe a typical interface for an *execution context*. In this paper a typical *execution context* is defined with an interface for construction and comparison, and for retrieving an *executor*, waiting on submitted work to complete and querying the underlying *execution resource*. Extending the executors interface to provide topology information can serve as a basis for providing a unified interface to expose affinity. This interface cannot mandate a specific architectural definition, and must be generic enough that future architectural evolutions can still be expressed.
+The capability of querying underlying *execution resources* of a given *system* is particularly important towards supporting affinity control in C\+\+. The current proposal for executors [[22]][p0443r7] mentions execution resources in passing, but leaves the term largely unspecified. This is intentional: *execution resources* will vary greatly between one implementation and another, and it is out of the scope of the current executors proposal to define those. There is current work [[23]][p0737r0] on extending the executors proposal to describe a typical interface for an *execution context*. In this paper a typical *execution context* is defined with an interface for construction and comparison, and for retrieving an *executor*, waiting on submitted work to complete and querying the underlying *execution resource*. Extending the executors interface to provide topology information can serve as a basis for providing a unified interface to expose affinity. This interface cannot mandate a specific architectural definition, and must be generic enough that future architectural evolutions can still be expressed.
 
 Two important considerations when defining a unified interface for querying the *resource topology* of a *system* are (a) what level of abstraction such an interface should have, and (b) at what granularity it should describe the topology's *execution resources*. As both the level of abstraction of an *execution resource* and the granularity that it is described in will vary greatly from one implementation to another, it’s important for the interface to be generic enough to support any level of abstraction. To achieve this we propose a generic hierarchical structure of *execution resources*, each *execution resource* being composed of other *execution resources* recursively. Each *execution resource* within this hierarchy can be used to place memory (i.e., allocate memory within the *execution resource’s* memory region), place execution (i.e. bind an execution to an *execution resource’s execution agents*), or both.
 
@@ -157,7 +157,7 @@ The interface for querying the *resource topology* of a *system* must be flexibl
 
 ### Topology discovery & fault tolerance
 
-In traditional single-CPU systems, users may reason about the execution resources with standard constructs such as `std::thread`, `std::this_thread` and `thread_local`. This is because the C++ machine model requires that a system have **at least one thread of execution, some memory, and some I/O capabilities**. Thus, for these systems, users may make some assumptions about the system resource topology as part of the language and its supporting standard library. For example, one may always ask for the available hardware concurrency, since there is always at least one thread, and one may always use thread-local storage.
+In traditional single-CPU systems, users may reason about the execution resources with standard constructs such as `std::thread`, `std::this_thread` and `thread_local`. This is because the C\+\+ machine model requires that a system have **at least one thread of execution, some memory, and some I/O capabilities**. Thus, for these systems, users may make some assumptions about the system resource topology as part of the language and its supporting standard library. For example, one may always ask for the available hardware concurrency, since there is always at least one thread, and one may always use thread-local storage.
 
 This assumption, however, does not hold on newer, more complex systems, especially on heterogeneous systems. On these systems, even the type and number of high-level resources available in a particular *system* is not known until the physical hardware attached to a particular system has been identified by the program. This often happens as part of a run-time initialization API [[6]][opencl-2-2] [[7]][hsa] which makes the resources available through some software abstraction. Furthermore, the resources which are identified often have different levels of parallel and concurrent execution capabilities. We refer to this process of identifying resources and their capabilities as *topology discovery*, and we call the point at which this occurs the *point of discovery*.
 
@@ -294,7 +294,7 @@ Each **memory resource** may also have any number of child **memory resources**
 
 The **system topology** can be discovered by calling `this_system::discover_topology`. This will discover all **execution resources** and **memory resources** available within the system and construct the **system topology** DAG, describing a read-only snapshot at the point of the call, and then return an `execution_resource` object exposing the **system execution resource**.
 
-A call to `this_system::discover_topology` may invoke C++ library, system or third party library API calls required to discover certain **execution resources**. However, `this_system::discover_topology` must be thread safe and must initialize and finalize any OS or third-party state before returning.
+A call to `this_system::discover_topology` may invoke C\+\+ library, system or third party library API calls required to discover certain **execution resources**. However, `this_system::discover_topology` must be thread safe and must initialize and finalize any OS or third-party state before returning.
 
 Below *(Figure 2)* is an example of what a typical **system topology** could look like.
 
@@ -858,7 +858,7 @@ Thanks to Christopher Di Bella, Toomas Remmelg, and Morris Hafner for their revi
 # References
 
 [p0687r0]: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0687r0.pdf
-[[1]][p0687r0] P0687r0: Data Movement in C++
+[[1]][p0687r0] P0687r0: Data Movement in C\+\+
 
 [design-of-openmp]: https://link.springer.com/chapter/10.1007/978-3-642-30961-8_2
 [[2]][design-of-openmp] The Design of OpenMP Thread Affinity
@@ -905,7 +905,7 @@ Thanks to Christopher Di Bella, Toomas Remmelg, and Morris Hafner for their revi
 [[16]][x10] X10
 
 [upc++]: https://bitbucket.org/berkeleylab/upcxx/wiki/Home
-[[17]][upc++] UPC++
+[[17]][upc++] UPC\+\+
 
 [tbb]: https://www.threadingbuildingblocks.org/
 [[18]][tbb] TBB
@@ -921,7 +921,7 @@ Thanks to Christopher Di Bella, Toomas Remmelg, and Morris Hafner for their revi
 
 [p0443r7]:
 http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0443r7.html
-[[22]][p0443r7] A Unified Executors Proposal for C++
+[[22]][p0443r7] A Unified Executors Proposal for C\+\+
 
 [p0737r0]: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0737r0.html
 [[23]][p0737r0] P0737r0 : Execution Context of Execution Agents