Affinity: Remove "parallel" in "parallel (execution,launch,region)"
Executors don't have to be about parallel execution. Thus, remove
references to "parallel execution," "parallel launch," and "parallel
region." The phrase "parallel region" is an OpenMP-ism that would need
definition anyway, and is better expressed using phrases like "the code
that would execute on ${HARDWARE}."
affinity/cpp-20/d0796r3.md
8 additions, 8 deletions
@@ -54,7 +54,7 @@ The affinity problem is especially challenging for applications whose behavior c
Frequently, data are initialized at the beginning of the program by the initial thread and are used by multiple threads. While some OSes automatically migrate threads or data for better affinity, migration may have high overhead. In an optimal case, the OS may automatically detect which threads access which data most frequently, or it may replicate data which are read by multiple threads, or migrate data which are modified and used by threads residing on remote locality groups. However, the OS often does a reasonable job if the machine is not overloaded, if the application carefully uses first-touch allocation, and if the program does not change its behavior with respect to locality.
- Consider a code example *(Listing 1)* that uses the C++17 parallel STL algorithm `for_each` to modify the entries of a `valarray` `a`. The example applies a loop body in a lambda to each entry of the `valarray` `a`, using a parallel execution policy that distributes work in parallel across multiple CPU cores. We might expect this to be fast, but since `valarray` containers are automatically initialized and allocated in the master thread's memory, we find that it is actually quite slow even when we have more than one thread.
+ Consider a code example *(Listing 1)* that uses the C++17 parallel STL algorithm `for_each` to modify the entries of a `valarray` `a`. The example applies a loop body in a lambda to each entry of the `valarray` `a`, using an execution policy that distributes work in parallel across multiple CPU cores. We might expect this to be fast, but since `valarray` containers are automatically initialized and allocated in the master thread's memory, we find that it is actually quite slow even when we have more than one thread.
```cpp
// C++ valarray STL containers are initialized automatically.
```
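The hunk above shows only the opening line of Listing 1. A self-contained sketch of the pattern the paragraph describes (an approximation, not the exact listing from d0796r3.md) could look like this:

```cpp
#include <algorithm>
#include <execution>
#include <valarray>

int main() {
  // The valarray is allocated and zero-initialized here, on the initial
  // ("master") thread. With first-touch allocation, all of its pages end up
  // in that thread's locality domain.
  std::valarray<double> a(100'000'000);

  // The parallel execution policy spreads iterations across CPU cores, but
  // every worker still reads and writes memory that lives in the initial
  // thread's locality domain, so the loop is slower than expected.
  std::for_each(std::execution::par, std::begin(a), std::end(a),
                [](double& x) { x += 1.0; });
}
```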
@@ -164,7 +164,7 @@ The *resource* objects returned from the *topology discovery interface* are opaq
The lifetime of a *resource* instance refers to both validity and uniqueness. First, if a *resource* instance exists, does it point to a valid underlying hardware or software resource? That is, could an instance's validity ever change at run time? Second, could a *resource* instance ever point to a different (but still valid) underlying resource? It suffices for now to define "point to a valid underlying resource" informally. We will elaborate this idea later in this proposal.
- Creation of a *context* expresses intent to use the *resource*, not just to view it as part of the *resource topology*. Thus, if a *resource* could ever cease to point to a valid underlying resource, then users must not be allowed to create a *context* from the resource instance, or launch parallel executions with that context. *Context* construction, and use of an *executor* with that *context* to launch a parallel execution, both assert validity of the *context*'s *resource*.
+ Creation of a *context* expresses intent to use the *resource*, not just to view it as part of the *resource topology*. Thus, if a *resource* could ever cease to point to a valid underlying resource, then users must not be allowed to create a *context* from the resource instance, or launch executions with that context. *Context* construction, and use of an *executor* with that *context* to launch an execution, both assert validity of the *context*'s *resource*.
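In other words, both operations are assertion points. Below is a deliberately toy, self-contained sketch of that contract; the type names echo the proposal's vocabulary, but every definition is a placeholder invented for illustration, not the interface proposed in this paper.

```cpp
#include <iostream>
#include <stdexcept>
#include <utility>

// Placeholder types: a resource handle and a context that asserts the
// resource's validity at construction and again at every launch.
struct execution_resource {
  int id;
  bool valid = true;
};

class execution_context {
public:
  // Construction expresses intent to use the resource, so it asserts that
  // the resource currently points at something valid.
  explicit execution_context(const execution_resource& r) : res_(r) {
    if (!r.valid) throw std::runtime_error("resource is no longer valid");
  }

  // A stand-in for "use of an executor with this context": launching work
  // asserts validity again.
  template <class F>
  void execute(F&& f) const {
    if (!res_.valid) throw std::runtime_error("resource is no longer valid");
    std::forward<F>(f)();  // run inline; a real executor would target res_
  }

private:
  execution_resource res_;
};

int main() {
  execution_resource cpu{0};
  execution_context ctx(cpu);  // asserts validity at construction
  ctx.execute([] { std::cout << "work placed on the chosen resource\n"; });
}
```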
If a *resource* is valid, then it must always point to the same underlying thing. For example, a *resource* cannot first point to one CPU core, and then suddenly point to a different CPU core. *Contexts* can thus rely on properties like binding of operating system threads to CPU cores. However, the "thing" to which a *resource* points may be a dynamic, possibly software-managed pool of hardware. Here are three examples of this phenomenon:
@@ -182,9 +182,9 @@ We should not assume that *resource* instances have the same lifetime as the run
We considered mandating that *execution resources* use reference counting, just like `shared_ptr`. This would clearly define resources' lifetimes. However, there are several arguments against requiring reference counting.
- 1. Holding a reference to the *execution resource* would prevent parallel execution from shutting down, thus (potentially) deadlocking the program.
- 2. Not all kinds of *resources* may have lifetimes that fit reference counting semantics. Some kinds of GPU *resources* only exist during parallel execution, for example; those *resources* cannot be valid if they escape the parallel region. In general, programming models that let a "host" processor launch code on a "different processor" have this issue.
- 3. Reference counting could have unattractive overhead if accessed concurrently, especially if code wants to traverse a particular subset of the *resource topology* inside a parallel region (e.g., to access GPU scratch memory).
+ 1. Holding a reference to the *execution resource* would prevent execution from shutting down, thus (potentially) deadlocking the program.
+ 2. Not all kinds of *resources* may have lifetimes that fit reference counting semantics. Some kinds of GPU *resources* only exist during execution, for example; those *resources* cannot be valid if they escape the scope of code that executes on the GPU. In general, programming models that let a "host" processor launch code on a "different processor" have this issue.
+ 3. Reference counting could have unattractive overhead if accessed concurrently, especially if code wants to traverse a particular subset of the *resource topology* inside a region executing on the GPU (e.g., to access GPU scratch memory).
4. Since users can construct arbitrary data structures from *resources* in a *resource hierarchy*, the proposal would need another *resource* type analogous to `weak_ptr`, in order to avoid circular dependencies that could prevent releasing *resources*.
5. There is no type currently in the Standard that has reference-counting semantics, but does not have `shared_` in its name (e.g., `shared_ptr` and `shared_future`). Adding a type like this sets a bad precedent for types with hidden costs and correctness issues (see (4)).
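Argument (4) above is the familiar ownership-cycle problem. A minimal sketch using only standard library types (nothing from this proposal):

```cpp
#include <memory>
#include <vector>

// A user-built structure over part of a resource hierarchy. If resources
// were reference counted like shared_ptr, strong links in both directions
// (parent -> children and child -> parent) would form an ownership cycle.
struct node {
  std::vector<std::shared_ptr<node>> children;
  std::shared_ptr<node> parent;  // strong back-edge; a weak handle breaks it
};

int main() {
  auto root  = std::make_shared<node>();
  auto child = std::make_shared<node>();
  root->children.push_back(child);
  child->parent = root;
  // When root and child go out of scope, the cycle keeps both nodes alive.
  // Avoiding this is exactly why a weak_ptr-like "weak resource" type would
  // be needed if resources were reference counted.
}
```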
@@ -194,7 +194,7 @@ Here, we elaborate on what it means for a *resource* to be "valid." This proposa
1. It is implementation defined whether any subset of the *resource topology* reflects the current state of the *system*, or just a "snapshot." Ability to iterate a *resource*'s children in the *resource topology* need not imply ability to create a *context* from that *resource*. This may even vary between subsets of the *resource topology*.
- 3. Use of a *context* to launch parallel execution asserts *resource* validity.
+ 3. Use of a *context* to launch execution asserts *resource* validity.
Here is a concrete example. Suppose that company "Aleph" makes an accelerator that can be viewed as a *resource*, and that has its own child *resources*. Users must call `Aleph_initialize()` in order to see the accelerator and its children as *resources* in the *resource topology*. Users must call `Aleph_finalize()` when they are done using the accelerator.
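A toy, self-contained rendering of that lifecycle; only the names `Aleph_initialize` and `Aleph_finalize` come from the example, and everything else (the global list, the counts) is invented for illustration:

```cpp
#include <iostream>
#include <string>
#include <vector>

// Pretend system topology: a flat list of visible top-level resources.
static std::vector<std::string> g_visible_resources = {"cpu_package_0"};

// Stand-ins for the vendor runtime from the example.
void Aleph_initialize() { g_visible_resources.push_back("aleph_accelerator"); }
void Aleph_finalize()   { g_visible_resources.pop_back(); }

int main() {
  std::cout << "before init: " << g_visible_resources.size() << " resource(s)\n";

  Aleph_initialize();  // the accelerator (and its children) become visible
  std::cout << "after init:  " << g_visible_resources.size() << " resource(s)\n";

  // ... create contexts from the accelerator's resources and launch work ...

  Aleph_finalize();    // the accelerator's resources cease to be valid;
                       // contexts created from them must not be used afterwards
  std::cout << "after finalize: " << g_visible_resources.size() << " resource(s)\n";
}
```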
@@ -210,7 +210,7 @@ Here, we elaborate on what it means for a *resource* to be "valid." This proposa
1. Nothing bad may happen. Users must be able to iterate past an invalidated *resource*. If users are iterating a *resource* R's children and one child becomes invalid, that must not invalidate R or the iterators to its children.
2. Iterating the children after invalidation of the parent must not be undefined behavior, but the child *resources* remain invalid. Attempts to view and iterate the children of the child *resources* may (but need not) fail.
3. *Context* creation asserts *resource* validity. If the *resource* is invalid, *context* creation must fail. (Compare to how MPI functions report an error if they are called after `MPI_Finalize` has been called on that process.)
- 4. Use of a *context* in an *executor* to launch parallel execution asserts *resource* validity, and must thus fail if the *resource* is no longer valid.
+ 4. Use of a *context* in an *executor* to launch execution asserts *resource* validity, and must thus fail if the *resource* is no longer valid.
### Querying the relative affinity of partitions
@@ -230,7 +230,7 @@ In this paper we propose an interface for querying and representing the executio
This paper is split into two main parts:
* A series of executor properties describes desired behavior when using parallel algorithms or libraries. These properties provide low granularity and are aimed at users who may have little or no knowledge of the system architecture.
- * A series of execution resource topology mechanisms for discovering detailed information about the system's topology and affinity properties, which can be used to hand optimise parallel applications and libraries for the best performance. These mechanisms provide high granularity and are aimed at users who have detailed knowledge of the system architecture.
+ * A series of execution resource topology mechanisms for discovering detailed information about the system's topology and affinity properties, which can be used to hand optimize parallel applications and libraries for the best performance. These mechanisms provide high granularity and are aimed at users who have detailed knowledge of the system architecture.
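As a rough illustration of the style of use the first bullet implies, here is a toy, self-contained property mechanism; `affinity_hint`, `prefer`, and `inline_executor` are all invented for illustration and are not names from this proposal or from the executors proposal:

```cpp
#include <cstdio>

// A (non-binding) property a user can request without knowing the topology.
struct affinity_hint { bool keep_work_near_its_data = false; };

// A trivial executor that just runs work inline but records the hint.
struct inline_executor {
  affinity_hint hint{};
  template <class F> void execute(F&& f) const { f(); }
};

// prefer returns a copy of the executor with the requested property attached;
// an implementation is free to ignore a preference it cannot honor.
inline_executor prefer(inline_executor ex, affinity_hint h) {
  ex.hint = h;
  return ex;
}

int main() {
  inline_executor ex;
  auto near_ex = prefer(ex, affinity_hint{true});
  near_ex.execute([] { std::puts("work launched with an affinity preference"); });
}
```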