@@ -75,21 +75,22 @@ aligned (non-distributed). We may now inspect the low-level
 
     p0 = a.pencil
 
-The ``p0`` :class:`.Pencil` object now contains information about the
-distribution of a 2D dataarray of global shape (8, 8), that will be
-distributed in the first axis and aligned in the second. However, ``p0``
-does not contain any of the data itself. Distributed arrays are instead
-created using the information that is in ``p0``. The distributed array
-``a`` uses the associated pencil to look up information about the global
-array, for example::
-
-    a.alignment
-    a.global_shape
-    a.subcomm
-    a.commsizes
-
-will return, respectively, ``1``, ``(8, 8)``, a list of two subcommunicators
-(subcomm) and finally their sizes. Naturally, their sizes will depend on the
+The ``p0`` :class:`.Pencil` object contains information about the
+distribution of a 2D data array of global shape (8, 8). The
+distributed array ``a`` has been created using the information that is in
+``p0``, and ``p0`` is used by ``a`` to look up information about
+the global array, for example::
+
+    >>> a.alignment
+    1
+    >>> a.global_shape
+    (8, 8)
+    >>> a.subcomm
+    (<mpi4py.MPI.Cartcomm at 0x10cc14a68>, <mpi4py.MPI.Cartcomm at 0x10e028690>)
+    >>> a.commsizes
+    [1, 1]
+
+Naturally, the sizes of the communicators will depend on the
 number of processors used to run the program. If we used 4, then
 ``a.commsizes`` would return ``[4, 1]``, since the first axis is the
 distributed one.
 
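+For example, running the same code with ``mpirun -np 4`` splits the first
+axis in four, so each rank ends up holding a local block of shape (2, 8).
+A small sketch of how this can be checked (``a.shape`` is simply the NumPy
+shape of the local data)::
+
+    print(a.shape)       # (2, 8) on every rank, since only axis 0 is split
+    print(a.commsizes)   # [4, 1]
+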
@@ -110,12 +111,13 @@ it would be a pure Numpy array (created on each processor) and it would
 not contain any of the information about the global array that it is
 part of ``(global_shape, pencil, subcomm, etc.)``. It contains the same
 amount of data as ``a`` though and ``a0`` is as such a perfectly fine
-distributed array.
+distributed array. Used together with ``p0`` it contains exactly the
+same information as ``a``.
 
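+The global metadata can thus be read off the pencil rather than the array.
+A small sketch (the ``shape`` and ``axis`` attributes of :class:`.Pencil`
+are assumed here, as they are not used elsewhere on this page)::
+
+    # a0 itself is a plain NumPy array; the global picture lives in p0
+    p0.shape      # global shape, here (8, 8)           (assumed attribute)
+    p0.axis       # the aligned (non-distributed) axis  (assumed attribute)
+    p0.subshape   # the local shape, equal to a0.shape
+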
 Since at least one axis needs to be aligned (non-distributed), a 2D array
 can only be distributed with
 one processor group. If we wanted to distribute the second axis instead
-of the first, then we would could have done::
+of the first, then we would have done::
 
     a = DistributedArray(N, [1, 0])
 
@@ -249,71 +251,98 @@ Multidimensional distributed arrays
 -----------------------------------
 
 The procedure discussed above remains the same for any type of array, of any
-dimension. With mpi4py-fft we can distribute any array of arbitrary dimensionality
+dimensionality. With mpi4py-fft we can distribute any array of arbitrary dimensionality
 using an arbitrary number of processor groups. How to distribute is completely
 configurable through the classes in the :mod:`.pencil` module.
 
-The
-
 We denote a global :math:`d`-dimensional array as :math:`u_{j_0, j_1, \ldots, j_{d-1}}`,
 where :math:`j_m \in \textbf{j}_m` for :math:`m=[0, 1, \ldots, d-1]`.
 A :math:`d`-dimensional array distributed with only one processor group in the
 first axis is denoted as :math:`u_{j_0/P, j_1, \ldots, j_{d-1}}`. If using more
 than one processor group, the groups are indexed, like :math:`P_0, P_1` etc.
 
-Lets illustrate using a 4-dimensional array and 3 processor groups::
+Let us illustrate using a 4-dimensional array with 3 processor groups. Let the
+array first be aligned only in axis 3 (:math:`u_{j_0/P_0, j_1/P_1, j_2/P_2, j_3}`),
+and then redistributed for alignment along axes 2, 1 and finally 0. Mathematically,
+we will be executing the following three global redistributions:
+
+.. math::
+    :label: 4d_redistribute
+
+    u_{j_0/P_0, j_1/P_1, j_2, j_3/P_2} \xleftarrow[P_2]{3 \rightarrow 2} u_{j_0/P_0, j_1/P_1, j_2/P_2, j_3} \\
+    u_{j_0/P_0, j_1, j_2/P_1, j_3/P_2} \xleftarrow[P_1]{2 \rightarrow 1} u_{j_0/P_0, j_1/P_1, j_2, j_3/P_2} \\
+    u_{j_0, j_1/P_0, j_2/P_1, j_3/P_2} \xleftarrow[P_0]{1 \rightarrow 0} u_{j_0/P_0, j_1, j_2/P_1, j_3/P_2}
+
+Now, it is not necessary to use three processor groups just because we have a
+four-dimensional array. We could just as well have used 2 or 1. The advantage
+of using more groups is that you can then use more processors in total. Assuming
+:math:`N = N_0 = N_1 = N_2 = N_3`, you can use a maximum of :math:`N^p` processors,
+where :math:`p` is the number of processor groups. So for an array of shape
+:math:`(8, 8, 8, 8)` it is possible to use 8, 64 and 512 processors for 1, 2 and
+3 processor groups, respectively. On the other hand, if you can get away with it,
+or if you do not have access to a great number of processors, then fewer groups
+are usually found to be faster for the same number of processors in total.
+
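+As a small sketch (assuming the same ``DistributedArray`` interface as above),
+the same four-dimensional array distributed over just two processor groups is
+obtained simply by leaving two axes aligned in the distribution specification::
+
+    a = DistributedArray((8, 8, 8, 8), [0, 0, 1, 1])  # axes 2 and 3 aligned
+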
+We can implement the global redistribution using the high-level :class:`.DistributedArray`
+class::
 
     N = (8, 8, 8, 8)
+    a3 = DistributedArray(N, [0, 0, 0, 1])
+    a2 = a3.redistribute(2)
+    a1 = a2.redistribute(1)
+    a0 = a1.redistribute(0)
+
+Note that the three redistribution steps correspond exactly to the three steps
+in :eq:`4d_redistribute`.
+
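+If in doubt, the alignment of each array can be inspected with the
+``alignment`` attribute used earlier. A small sketch, assuming the four arrays
+above have been created::
+
+    assert a3.alignment == 3
+    assert a2.alignment == 2
+    assert a1.alignment == 1
+    assert a0.alignment == 0
+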
+Using the low-level API, the same can be achieved with slightly more elaborate
+code. We start by creating pencils for the 4 different alignments::
+
     subcomm = Subcomm(comm, [0, 0, 0, 1])
-    p0 = Pencil(subcomm, N, axis=3)
-    p1 = p0.pencil(2)
-    p2 = p1.pencil(1)
-    p3 = p2.pencil(0)
+    p3 = Pencil(subcomm, N, axis=3)
+    p2 = p3.pencil(2)
+    p1 = p2.pencil(1)
+    p0 = p1.pencil(0)
 
 Here we have defined 4 different pencil groups, ``p0, p1, p2, p3``, aligned in
-axis 3, 2, 1 and 0, respectively. Transfer objects for arrays of type ``np.float``
+axis 0, 1, 2 and 3, respectively. Transfer objects for arrays of type ``float``
 are then created as::
 
-    transfer01 = p0.transfer(p1, np.float)
-    transfer12 = p1.transfer(p2, np.float)
-    transfer23 = p2.transfer(p3, np.float)
+    transfer32 = p3.transfer(p2, float)
+    transfer21 = p2.transfer(p1, float)
+    transfer10 = p1.transfer(p0, float)
 
 Note that we can create transfer objects between any two pencils, not just
-neighbouring axes.
-
-We may now perform three different global redistributions as::
+neighbouring axes. We may now perform three different global redistributions
+as::
 
     a0 = np.zeros(p0.subshape)
     a1 = np.zeros(p1.subshape)
     a2 = np.zeros(p2.subshape)
     a3 = np.zeros(p3.subshape)
-    a0[:] = np.random.random(a0.shape)
+    a3[:] = np.random.random(a3.shape)
-    transfer01.forward(a0, a1)
-    transfer12.forward(a1, a2)
-    transfer23.forward(a2, a3)
+    transfer32.forward(a3, a2)
+    transfer21.forward(a2, a1)
+    transfer10.forward(a1, a0)
 
 Storing this code under ``pencils4d.py``, we can use 8 processors that will
 give us 3 processor groups with 2 processors in each group::
 
     mpirun -np 8 python pencils4d.py
 
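+To see the resulting decomposition, we could, for example, print the local
+shapes from within the script. A small sketch; with the 8 processes split as
+2 x 2 x 2 over the three groups, the local shapes below should come out as
+(4, 4, 4, 8), (4, 4, 8, 4), (4, 8, 4, 4) and (8, 4, 4, 4)::
+
+    # a possible addition to pencils4d.py
+    print(comm.Get_rank(), p3.subshape, p2.subshape, p1.subshape, p0.subshape)
+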
-Mathematically, we will now, with the three calls to ``transfer``, be executing
-the three following global redistributions:
+Note that with the low-level approach we can now easily go back using the
+reverse ``backward`` method of the :class:`.Transfer` objects::
 
-.. math::
+    transfer10.backward(a0, a1)
 
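+A full round trip back to the original axis-3 alignment then takes just three
+``backward`` calls, reusing the transfer objects defined above (a small
+sketch)::
+
+    transfer10.backward(a0, a1)
+    transfer21.backward(a1, a2)
+    transfer32.backward(a2, a3)
+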
-    u_{j_0/P_0, j_1/P_1, j_2, j_3/P_2} \xleftarrow[P_2]{3 \rightarrow 2} u_{j_0/P_0, j_1/P_1, j_2/P_2, j_3} \\
-    u_{j_0/P_0, j_1, j_2/P_1, j_3/P_2} \xleftarrow[P_1]{2 \rightarrow 1} u_{j_0/P_0, j_1/P_1, j_2, j_3/P_2} \\
-    u_{j_0, j_1/P_0, j_2/P_1, j_3/P_2} \xleftarrow[P_0]{1 \rightarrow 0} u_{j_0/P_0, j_1, j_2/P_1, j_3/P_2}
+A different approach is also possible with the high-level API::
 
+    a0.redistribute(darray=a1)
+    a1.redistribute(darray=a2)
+    a2.redistribute(darray=a3)
 
-Now, it is not necessary to use three processor groups just because we have a
-four-dimensional array. We could just as well have been using 2 or 1. The advantage
-of using more groups is that you can then use more processors in total. Assuming
-:math:`N = N_0 = N_1 = N_2 = N_3`, you can use a maximum of :math:`N^p` processors,
-where :math:`p` is
-the number of processor groups. So for an array of shape :math:`(8,8,8,8)`
-it is possible to use 8, 64 and 512 number of processors for 1, 2 and 3
-processor groups, respectively. On the other hand, if you can get away with it,
-or if you do not have access to a great number of processors, then fewer groups
-are usually found to be faster for the same number of processors in total.
+which corresponds to the backward transfers. However, with the high-level
+API the transfer objects are created (and deleted on exit) during the call
+to ``redistribute`` and as such this latter approach may be slightly less
+efficient.
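+
+If the same redistribution is performed repeatedly, for example inside a
+time-stepping loop, it may therefore pay off to create the low-level
+:class:`.Transfer` objects once and reuse them. A sketch, reusing
+``transfer32``, ``a3`` and ``a2`` from above::
+
+    for step in range(10):
+        # ... update a3 in its axis-3 aligned layout ...
+        transfer32.forward(a3, a2)   # reuse the same Transfer object every step
+        # ... work on the axis-2 aligned data in a2 ...
+        transfer32.backward(a2, a3)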