mpirun --cpu-set vs. --rankfile (Open MPI 1.4.5)

el_tenedor · Jun 5, 2013

I want to pin my MPI processes accurately to a list of (physical) cores. I am referring to the following excerpts from the mpirun --help output:

   -cpu-set|--cpu-set <arg0>  
                         Comma-separated list of ranges specifying logical
                         cpus allocated to this job [default: none]

...

   -rf|--rankfile <arg0>  
                         Provide a rankfile file

The topology of my processor is as follows:

-------------------------------------------------------------
CPU type:       Intel Core Bloomfield processor 
*************************************************************
Hardware Thread Topology
*************************************************************
Sockets:        1 
Cores per socket:       4 
Threads per core:       2 
-------------------------------------------------------------
HWThread        Thread          Core            Socket
0               0               0               0
1               0               1               0
2               0               2               0
3               0               3               0
4               1               0               0
5               1               1               0
6               1               2               0
7               1               3               0
-------------------------------------------------------------
Socket 0: ( 0 4 1 5 2 6 3 7 )
-------------------------------------------------------------

Now, if I start my program using mpirun -np 2 --cpu-set 0,1 --report-bindings ./solver, it starts normally but the --cpu-set argument I provided is ignored. On the other hand, starting my program with mpirun -np 2 --rankfile rankfile --report-bindings ./solver gives me the following output:

[neptun:14781] [[16333,0],0] odls:default:fork binding child [[16333,1],0] to slot_list 0
[neptun:14781] [[16333,0],0] odls:default:fork binding child [[16333,1],1] to slot_list 1
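
For reference, Open MPI 1.4 rankfiles use the syntax rank <N>=<host> slot=<slot list>; a minimal rankfile that yields exactly the binding reported above would look like this:

    rank 0=neptun slot=0
    rank 1=neptun slot=1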

Checking with top indeed confirms that mpirun uses the specified cores. But how should I interpret the --report-bindings output? Apart from the host (neptun) and the specified slots (0 and 1), I don't have a clue. The same goes for the other commands I tried:

$ mpirun --np 2 --bind-to-core --report-bindings ./solver
[neptun:15166] [[15694,0],0] odls:default:fork binding child [[15694,1],0] to cpus 0001
[neptun:15166] [[15694,0],0] odls:default:fork binding child [[15694,1],1] to cpus 0002

and

$ mpirun --np 2 --bind-to-socket --report-bindings ./solver
[neptun:15188] [[15652,0],0] odls:default:fork binding child [[15652,1],0] to socket 0 cpus 000f
[neptun:15188] [[15652,0],0] odls:default:fork binding child [[15652,1],1] to socket 0 cpus 000f

With --bind-to-core the top command once again shows that cores 0 and 1 are used, but why does the output say cpus 0001 and 0002? --bind-to-socket causes even more confusion: 2x 000f?

To summarize the questions that arose from my experiments:
- Why isn't my --cpu-set option working?
- How am I supposed to interpret the --report-bindings output?

Kind regards

Answer

Hristo 'away' Iliev · Jun 6, 2013

In both cases the output matches exactly what you have told Open MPI to do. The hexadecimal number in cpus ... shows the allowed CPUs (the affinity mask) for the process. This is a bit field with each bit representing one logical CPU.
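
Decoded bit by bit, the three masks that appear in your output are:

    0001 (hex) = 0001 (binary) -> only bit 0 set -> logical CPU 0
    0002 (hex) = 0010 (binary) -> only bit 1 set -> logical CPU 1
    000f (hex) = 1111 (binary) -> bits 0-3 set   -> logical CPUs 0, 1, 2 and 3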

With --bind-to-core each MPI process is bound to its own CPU core. Rank 0 ([...,0]) has its affinity mask set to 0001, which means logical CPU 0. Rank 1 ([...,1]) has its affinity mask set to 0002, which means logical CPU 1. The logical CPU numbering most likely matches the HWThread identifier in your topology output.
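
If you want to cross-check those masks while the job is running, you can query the affinity of the solver processes directly on Linux; the PIDs below are placeholders for the actual ranks:

    $ taskset -p <pid of rank 0>                  # should report affinity mask 1 (logical CPU 0)
    $ taskset -p <pid of rank 1>                  # should report affinity mask 2 (logical CPU 1)
    $ grep Cpus_allowed_list /proc/<pid>/status   # the same information as a CPU list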

With --bind-to-socket each MPI process is bound to all cores of the socket. In your particular case the affinity mask is set to 000f, or 0000000000001111 in binary, which corresponds to all four cores in the socket. Only a single hyperthread per core is being assigned.

You can further instruct Open MPI how to select the sockets on multi-socket nodes. With --bysocket the sockets are selected in round-robin fashion: the first rank is placed on the first socket, the next rank on the next socket, and so on until there is one process per socket; after that the next rank is again put on the first socket, and so on. With --bycore each socket receives as many consecutive ranks as it has cores.
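
For illustration, on a hypothetical node with two quad-core sockets the two policies would place four ranks like this (the exact --report-bindings output will of course differ):

    $ mpirun -np 4 --bysocket --bind-to-core --report-bindings ./solver
    # ranks 0 and 2 land on socket 0, ranks 1 and 3 on socket 1

    $ mpirun -np 4 --bycore --bind-to-core --report-bindings ./solver
    # ranks 0 through 3 all land on the four cores of socket 0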

I would suggest that you read the mpirun manual page for Open MPI 1.4.x, especially the Process Binding section. It contains examples of how the different binding options interact with each other. The --cpu-set option is not mentioned in the manual, but Jeff Squyres has written a nice page on processor affinity features in Open MPI (it is about v1.5, but most, if not all, of it also applies to v1.4).