Using the multiprocessing module for cluster computing

astrofrog · Mar 3, 2011 · Viewed 28.8k times

I'm interested in running a Python program using a computer cluster. I have in the past been using Python MPI interfaces, but due to difficulties in compiling/installing these, I would prefer solutions which use built-in modules, such as Python's multiprocessing module.

What I would really like to do is just set up a multiprocessing.Pool instance that would span across the whole computer cluster, and run a Pool.map(...). Is this something that is possible/easy to do?

If this is impossible, I'd like to at least be able to start Process instances on any of the nodes from a central script with different parameters for each node.

Answer

Shawn Chin · Mar 3, 2011

If by cluster computing you mean distributed-memory systems (multiple nodes rather than SMP), then Python's multiprocessing may not be a suitable choice. It can spawn multiple processes, but they will still be bound to a single node.
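To make the single-node limitation concrete, here is a minimal sketch of a Pool.map(...) call in which each worker reports where it ran. The names square_with_location and run, and the optional start_method argument, are hypothetical and chosen only for this illustration. On an ordinary machine every result should carry the same hostname, because the pool's workers are child processes of the script that created them.

```python
import multiprocessing as mp
import os
import socket


def square_with_location(x):
    # Return the result together with the host and PID of the worker
    # process that computed it.
    return x * x, socket.gethostname(), os.getpid()


def run(n_workers=4, n_items=8, start_method=None):
    # start_method=None uses the platform's default way of starting
    # worker processes; it is exposed here only for experimentation.
    ctx = mp.get_context(start_method)
    with ctx.Pool(processes=n_workers) as pool:
        # map() splits the inputs across the local worker processes and
        # returns the results in input order.
        return pool.map(square_with_location, range(n_items))


if __name__ == "__main__":
    results = run()
    print("values:", [value for value, _, _ in results])
    print("hosts: ", {host for _, host, _ in results})
    print("pids:  ", sorted({pid for _, _, pid in results}))
```

The set of hosts contains only the local machine's name, while the PIDs differ from that of the parent script: the work is parallel, but not distributed.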

What you will need is a framework that handles spawning of processes across multiple nodes and provides a mechanism for communication between the processors (pretty much what MPI does).

See the page on Parallel Processing on the Python wiki for a list of frameworks which will help with cluster computing.

From the list, pp, jug, pyro and celery look like sensible options, although I can't personally vouch for any of them since I have no experience with them (I mainly use MPI).

If ease of installation/use is important, I would start by exploring jug. It's easy to install, supports common batch cluster systems, and looks well documented.