Looking at the following extract from the Open MPI manual:
--map-by <foo>
Map to the specified object, defaults to socket. Supported options
include slot, hwthread, core, L1cache, L2cache, L3cache, socket,
numa, board, node, sequential, distance, and ppr. Any object can
include modifiers by adding a : and any combination of PE=n (bind n
processing elements to each proc), SPAN (load balance the processes
across the allocation), OVERSUBSCRIBE (allow more processes on a node
than processing elements), and NOOVERSUBSCRIBE. This includes PPR,
where the pattern would be terminated by another colon to separate
it from the modifiers.
I have a few questions regarding the syntax and some comments on them:

What do sequential, distance and ppr refer to? Especially ppr puzzles me. What is it abbreviating?

What does --map-by ppr:4:socket mean with regard to the extract of the manual? Of course I can see the result of the former option by looking at the reported bindings with --report-bindings (only 4 processes are mapped onto one socket and by default bound to 4 cores of one socket), but I cannot make any sense of the syntax. At another line of the manual it says that this new option replaces the deprecated use of --npersocket:
-npersocket, --npersocket <#persocket>
On each node, launch this many processes times the number of processor
sockets on the node. The -npersocket option also turns on the
-bind-to-socket option. (deprecated in favor of --map-by ppr:n:socket)
ppr means "processes per resource". Its syntax is ppr:N:resource and it means "assign N processes to each resource of type resource available on the host". For example, on a 4-socket system with 6-core CPUs, --map-by ppr:4:socket results in the following process map:
socket   ---- 0 ----    ---- 1 ----    ---- 2 ----    ---- 3 ----
core     0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5
process  A B C D        E F G H        I J K L        M N O P
(processes are labelled with consecutive letters starting from A in this example)
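For reference, a command line that produces such a mapping could look roughly like the following sketch (the executable name ./a.out is just a placeholder); adding --report-bindings makes mpirun report where each rank ends up:

# 4 processes per socket on the quad-socket node from the example above
mpirun --map-by ppr:4:socket --report-bindings ./a.out

With a ppr pattern the total number of processes is normally derived from the pattern and the allocation, so -np can usually be omitted.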
What the manual means is that the whole ppr:N:resource is to be regarded as a single specifier and that modifiers can be added after it, separated by :, e.g. ppr:2:socket:pe=2. This should read as "start two processes per socket and bind each of them to two processing elements" and should result in the following map given the same quad-socket system:
socket   ---- 0 ----    ---- 1 ----    ---- 2 ----    ---- 3 ----
core     0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5
process  A A B B        C C D D        E E F F        G G H H
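As a similar sketch with the same placeholder executable, the corresponding invocation would be:

# 2 processes per socket, each bound to 2 processing elements
mpirun --map-by ppr:2:socket:pe=2 --report-bindings ./a.out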
The sequential mapper reads the host file line by line and starts one process per host found there. It ignores the slot count if given.
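As an illustration (the host names and the file name are made up), a host file and invocation for the sequential mapper could look like the following sketch; depending on the Open MPI version the keyword is written as seq or sequential:

# hosts.txt -- the sequential mapper starts one process per line; slot counts are ignored
node01 slots=8
node02 slots=8
node01 slots=8

# launches 3 processes: one on node01, one on node02, one more on node01
mpirun --hostfile hosts.txt --map-by seq ./a.out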
The dist mapper maps processes onto NUMA nodes ordered by the distance of the latter from a given PCI resource. It only makes sense on NUMA systems. Again, let's use the toy quad-socket system, but this time extend the representation so as to show the NUMA topology:
Socket 0 ------------- Socket 1
   |                       |
   |                       |
   |                       |
Socket 2 ------------- Socket 3
   |
  ib0
The lines between the sockets represent CPU links. Those are, e.g., QPI links for Intel CPUs and HT links for AMD CPUs. ib0 is an InfiniBand HCA used to communicate with other compute nodes. Now, in that system, Socket 2 talks directly to the InfiniBand HCA. Socket 0 and Socket 3 have to cross one CPU link in order to talk to ib0, and Socket 1 has to cross 2 CPU links. That means that processes running on Socket 2 will have the lowest possible latency while sending and receiving messages, and processes on Socket 1 will have the highest possible latency.
How does it work? If your host file specifies e.g. 16 slots on that host and the mapping option is --map-by dist:ib0, it may result in the following map:
socket   ---- 0 ----    ---- 1 ----    ---- 2 ----    ---- 3 ----
core     0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5
process  G H I J K L                   A B C D E F    M N O P
6 processes are mapped to Socket 2, which is closest to the InfiniBand HCA, then 6 more are mapped to Socket 0, which is the second closest, and 4 more are mapped to Socket 3. It is also possible to spread the processes instead of linearly filling the processing elements. --map-by dist:ib0:span results in:
socket   ---- 0 ----    ---- 1 ----    ---- 2 ----    ---- 3 ----
core     0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5
process  E F G H        M N O P        A B C D        I J K L
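The corresponding command lines, again with a placeholder executable, would be along the lines of:

# fill the NUMA domains in order of increasing distance from ib0
mpirun -np 16 --map-by dist:ib0 --report-bindings ./a.out

# same, but load-balance the 16 processes across the NUMA domains
mpirun -np 16 --map-by dist:ib0:span --report-bindings ./a.out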
The actual NUMA topology is obtained using the hwloc library, which reads the distance information provided by the BIOS. hwloc includes a command-line tool called hwloc-ls (also known as lstopo) that can be used to display the topology of the system. Normally its output only includes the topology of the processing elements and the NUMA domains, but if you give it the -v option it also includes the PCI devices.
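For example, to inspect the topology of a node including I/O devices such as ib0, one could run something like:

hwloc-ls        # processing elements and NUMA domains only
hwloc-ls -v     # verbose output that also lists PCI devices such as the InfiniBand HCA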