Commit e452960

Author: Mark Hoemmen

Affinity: Minor grammar & code fixes

1 parent 3531755 commit e452960

1 file changed: +7 −4 lines changed

affinity/cpp-20/d0796r3.md

Lines changed: 7 additions & 4 deletions
@@ -41,7 +41,10 @@
 
 This paper provides an initial meta-framework for the drives toward an execution and memory affinity model for C++. It accounts for feedback from the Toronto 2017 SG1 meeting on Data Movement in C++ [[1]][p0687r0] that we should define affinity for C++ first, before considering inaccessible memory as a solution to the separate memory problem towards supporting heterogeneous and distributed computing.
 
-This paper is split into two main parts; firstly a series of executor properties which can be used to apply affinity requirements to bulk execution functions, and secondly an interface for discovering the execution resources within the system topology and querying relative affinity of execution resources.
+This paper is split into two main parts:
+
+1. A series of executor properties which can be used to apply affinity requirements to bulk execution functions.
+2. An interface for discovering the execution resources within the system topology and querying relative affinity of execution resources.
 
 # Motivation
 

@@ -55,7 +58,7 @@ Operating systems (OSes) traditionally take responsibility for assigning threads
 
 The affinity problem is especially challenging for applications whose behavior changes over time or is hard to predict, or when different applications interfere with each other's performance. Today, most OSes already can group processing units according to their locality and distribute processes, while keeping threads close to the initial thread, or even avoid migrating threads and maintain first touch policy. Nevertheless, most programs can change their work distribution, especially in the presence of nested parallelism.
 
-Frequently, data are initialized at the beginning of the program by the initial thread and are used by multiple threads. While some OSes automatically migrate threads or data for better affinity, migration may have high overhead. In an optimal case, the OS may automatically detect which thread access which data most frequently, or it may replicate data which are read by multiple threads, or migrate data which are modified and used by threads residing on remote locality groups. However, the OS often does a reasonable job, if the machine is not overloaded, if the application carefully used first-touch allocation, and if the program does not change its behavior with respect to locality.
+Frequently, data are initialized at the beginning of the program by the initial thread and are used by multiple threads. While some OSes automatically migrate threads or data for better affinity, migration may have high overhead. In an optimal case, the OS may automatically detect which thread access which data most frequently, or it may replicate data which are read by multiple threads, or migrate data which are modified and used by threads residing on remote locality groups. However, the OS often does a reasonable job, if the machine is not overloaded, if the application carefully uses first-touch allocation, and if the program does not change its behavior with respect to locality.
 
 Consider a code example *(Listing 1)* that uses the C++17 parallel STL algorithm `for_each` to modify the entries of a `valarray` `a`. The example applies a loop body in a lambda to each entry of the `valarray` `a`, using an execution policy that distributes work in parallel across multiple CPU cores. We might expect this to be fast, but since `valarray` containers are initialized automatically and automatically allocated on the master thread's memory, we find that it is actually quite slow even when we have more than one thread.
 

@@ -65,14 +68,14 @@ Consider a code example *(Listing 1)* that uses the C++17 parallel STL algorithm
 std::valarray<double> a(N);
 
 // Data placement is wrong, so parallel update is slow.
-std::for_each(par, std::begin(a), std::end(a),
+std::for_each(std::execution::par, std::begin(a), std::end(a),
   [=] (double& a_i) { a_i *= scalar; });
 
 // Use future affinity interface to migrate data at next
 // use and move pages closer to next accessing thread.
 ...
 // Faster, because data are local now.
-std::for_each(par, std::begin(a), std::end(a),
+std::for_each(std::execution::par, std::begin(a), std::end(a),
   [=] (double& a_i) { a_i *= scalar; });
 ```
 *Listing 1: Parallel vector update example*
