Syntax of the --map-by option in Open MPI mpirun v1.8

el_tenedor · Jan 29, 2015 · Viewed 9.8k times

Looking at the following extract from the openmpi manual

--map-by <foo>
    Map to the specified object, defaults to socket. Supported options
    include slot, hwthread, core, L1cache, L2cache, L3cache, socket, 
    numa, board, node, sequential, distance, and ppr. Any object can 
    include modifiers by adding a : and any combination of PE=n (bind n
    processing elements to each proc), SPAN (load balance the processes 
    across the allocation), OVERSUBSCRIBE (allow more processes on a node
    than processing elements), and NOOVERSUBSCRIBE. This includes PPR,
    where the pattern would be terminated by another colon to separate 
    it from the modifiers.

I have a few questions regarding the syntax, along with some comments on it:

  • what do the options sequential, distance and ppr refer to?

ppr especially puzzles me. What is it an abbreviation for?

  • how should I understand an option like --map-by ppr:4:socket in terms of the manual extract above?

Of course I can see the result of the former option by looking at the reported bindings with --report-bindings (only 4 processes are mapped onto one socket and, by default, bound to 4 cores of that socket), but I cannot make sense of the syntax. Another line of the manual says that this new option replaces the deprecated --npersocket:

-npersocket, --npersocket <#persocket>
    On each node, launch this many processes times the number of processor
    sockets on the node. The -npersocket option also turns on the -bind-
    to-socket option. (deprecated in favor of --map-by ppr:n:socket) 
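
If I read this correctly, an old invocation such as the first command below would now be written as the second (the executable name is just a placeholder, and I have made the socket binding explicit since --npersocket implied it):

    # deprecated: launch 4 processes per processor socket, bound to sockets
    mpirun --npersocket 4 ./my_app

    # replacement suggested by the manual (binding made explicit)
    mpirun --map-by ppr:4:socket --bind-to socket ./my_app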

Answer

Hristo Iliev · Jan 30, 2015

ppr means processes per resource. Its syntax is ppr:N:resource and it means "assign N processes to each resource of type resource available on the host". For example, on a 4-socket system with 6-core CPUs, --map-by ppr:4:socket results in the following process map:

 socket   ---- 0 ----    ---- 1 ----    ---- 2 ----    ---- 3 ----
 core     0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5
 process  A B C D        E F G H        I J K L        M N O P

(processes are labelled with consecutive letters, A through P, in this example)
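
As a concrete sketch (./a.out is only a placeholder executable), such a map could be requested with:

    # 16 ranks, 4 per socket; --report-bindings prints the resulting bindings
    mpirun -np 16 --map-by ppr:4:socket --report-bindings ./a.out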

What the manual means is that the whole ppr:N:resource is to be regarded as a single specifier and that options can be added after it, separated by :, e.g. ppr:2:socket:pe=2. This reads as "start two processes per socket and bind each of them to two processing elements" and should result in the following map on the same quad-socket system:

 socket   ---- 0 ----    ---- 1 ----    ---- 2 ----    ---- 3 ----
 core     0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5
 process  A A B B        C C D D        E E F F        G G H H
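
Again as a sketch with a placeholder executable, this map corresponds to an invocation along the lines of:

    # 2 ranks per socket, each bound to 2 processing elements (8 ranks in total)
    mpirun -np 8 --map-by ppr:2:socket:pe=2 --report-bindings ./a.out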

The sequential mapper reads the host file line by line and starts one process on the host named in each line. It ignores the slot count if given.
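
As a hedged illustration (the host names and file name are made up, and depending on the exact 1.8.x version the keyword is written seq or sequential), a host file such as

    # hosts.txt -- one process is started per line; slot counts are ignored
    node01 slots=8
    node02 slots=8
    node01 slots=8

combined with

    mpirun --hostfile hosts.txt --map-by seq ./a.out

would start the first process on node01, the second on node02 and the third again on node01, in file order.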

The dist mapper maps processes onto NUMA nodes by the distance of each NUMA node from a given PCI resource. It only makes sense on NUMA systems. Again, let's use the toy quad-socket system, but this time extend the representation so as to show the NUMA topology:

 Socket 0 ------------- Socket 1
    |                      |
    |                      |
    |                      |
    |                      |
    |                      |
 Socket 2 ------------- Socket 3
    |
   ib0

The lines between the sockets represent CPU links, e.g. QPI links for Intel CPUs or HyperTransport links for AMD CPUs. ib0 is an InfiniBand HCA used to communicate with other compute nodes. In that system, Socket 2 talks directly to the InfiniBand HCA, Socket 0 and Socket 3 have to cross one CPU link in order to talk to ib0, and Socket 1 has to cross two CPU links. That means processes running on Socket 2 will have the lowest possible latency when sending and receiving messages, while processes on Socket 1 will have the highest.

How does it work? If your host file specifies e.g. 16 slots on that host and the mapping option is --map-by dist:ib0, it may result in the following map:

 socket   ---- 0 ----    ---- 1 ----    ---- 2 ----    ---- 3 ----
 core     0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5
 process  G H I J K L                   A B C D E F    M N O P

Six processes are mapped to Socket 2, which is closest to the InfiniBand HCA, then six more are mapped to Socket 0, which is the next closest, and the remaining four are mapped to Socket 3. It is also possible to spread the processes instead of linearly filling the processing elements. --map-by dist:ib0:span results in:

 socket   ---- 0 ----    ---- 1 ----    ---- 2 ----    ---- 3 ----
 core     0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5
 process  E F G H        M N O P        A B C D        I J K L
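
Expressed as commands (again with a placeholder executable; the host file is assumed to advertise the 16 slots mentioned above), the two maps correspond to:

    # fill the sockets in order of increasing distance from ib0
    mpirun -np 16 --map-by dist:ib0 --report-bindings ./a.out

    # spread (load-balance) the ranks across the sockets instead of filling them
    mpirun -np 16 --map-by dist:ib0:span --report-bindings ./a.out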

The actual NUMA topology is obtained using the hwloc library, which reads the distance information provided by the BIOS. hwloc includes a command-line tool called hwloc-ls (also known as lstopo) that can be used to display the topology of the system. Normally its output only includes the topology of the processing elements and the NUMA domains, but if you give it the -v option it also includes the PCI devices.
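
For example (standard hwloc commands; the exact output format varies between hwloc versions and systems):

    # text rendering of the machine topology: sockets, cores, caches, NUMA nodes
    hwloc-ls

    # verbose output additionally lists PCI devices such as the ib0 HCA
    hwloc-ls -v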