I'm interested in running a Python program on a computer cluster. In the past I have used Python MPI interfaces, but because they are difficult to compile and install, I would prefer a solution that uses built-in modules, such as Python's multiprocessing module.
What I would really like to do is just set up a multiprocessing.Pool instance that would span the whole computer cluster, and run a Pool.map(...). Is this something that is possible/easy to do?
If this is impossible, I'd like to at least be able to start Process instances on any of the nodes from a central script with different parameters for each node.
If by cluster computing you mean distributed memory systems (multiple nodes rather than SMP), then Python's multiprocessing may not be a suitable choice. It can spawn multiple processes, but they will still be bound to a single node.
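To make the single-node limitation concrete, here is a minimal sketch of an ordinary Pool.map call. The function names (describe_worker, run_local_pool) and the worker count are made up for illustration. Each worker reports the hostname and process ID it ran under; with a plain multiprocessing.Pool, every result should carry the hostname of the machine that launched the script, because the pool only creates processes locally.

```python
import multiprocessing
import os
import socket


def describe_worker(x):
    # Runs inside a worker process: return the result together with
    # the host and PID that actually did the work.
    return x * x, socket.gethostname(), os.getpid()


def run_local_pool(values, workers=2):
    # A plain Pool starts its worker processes on the machine running
    # this script; it has no notion of other cluster nodes.
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(describe_worker, values)


if __name__ == "__main__":
    for square, host, pid in run_local_pool(range(4)):
        print(square, host, pid)
```

Running this on a cluster login node or inside a batch job would show different PIDs but a single hostname, which is why a separate framework is needed to reach the other nodes.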
What you will need is a framework that handles spawning of processes across multiple nodes and provides a mechanism for communication between the processes (pretty much what MPI does).
See the page on Parallel Processing on the Python wiki for a list of frameworks which will help with cluster computing.
From the list, pp, jug, pyro and celery look like sensible options, although I can't personally vouch for any of them since I have no experience with them (I mainly use MPI).
If ease of installation/use is important, I would start by exploring jug. It's easy to install, supports common batch cluster systems, and looks well documented.