Yesterday I asked a question: Reading data in parallel with multiprocess
I got very good answers, and I implemented the solution from the answer I marked as correct.
import multiprocessing
import os

import pandas as pd

def read_energies(motif):
    os.chdir("blabla/working_directory")
    complx_ener = pd.DataFrame()
    # complex function to fill that dataframe
    lig_ener = pd.DataFrame()
    # complex function to fill that dataframe
    return motif, complx_ener, lig_ener

COMPLEX_ENERGIS = {}
LIGAND_ENERGIES = {}
p = multiprocessing.Pool(processes=CPU)
for x in p.imap_unordered(read_energies, peptide_kd.keys()):
    COMPLEX_ENERGIS[x[0]] = x[1]
    LIGAND_ENERGIES[x[0]] = x[2]
However, this solution takes the same amount of time as simply iterating over peptide_kd.keys() and filling up the DataFrames one by one. Why is that? Is there a way to fill the desired dicts in parallel and actually get a speed increase? I am running it on a 48-core HPC.
You are incurring a good amount of overhead in (1) starting up each process, and (2) having to copy the pandas.DataFrame (etc.) across several processes. If you just need to have a dict filled in parallel, I'd suggest using a shared memory dict. If no key will be overwritten, then it's easy and you don't have to worry about locks.
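To get a feel for the copying cost in (2): multiprocessing moves arguments and return values between processes by pickling them, so a rough estimate is to time a pickle round trip of one result. The sketch below is only an illustration; the `table` dict of float columns is a made-up stand-in for one of your DataFrames, and `pickle_round_trip` is a hypothetical helper name.

```python
import pickle
import time

def pickle_round_trip(obj):
    """Serialize and deserialize obj, roughly what happens to every
    result sent from a worker process back to the parent."""
    start = time.perf_counter()
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    restored = pickle.loads(payload)
    elapsed = time.perf_counter() - start
    return restored, len(payload), elapsed

# Made-up stand-in for a DataFrame: 20 columns of 10000 floats each.
table = {"col%d" % i: [float(j) for j in range(10000)] for i in range(20)}

restored, nbytes, seconds = pickle_round_trip(table)
print("%d bytes pickled, round trip took %.4f s" % (nbytes, seconds))
```

If the round trip takes a noticeable fraction of the time needed to build the object in the first place, the parallel version has little room to win.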
(Note that I'm using multiprocess in the interpreter session below, which is a fork of multiprocessing, but only so I can demonstrate from the interpreter; otherwise, you'd have to run it from __main__.)
>>> from multiprocess import Process, Manager
>>>
>>> def f(d, x):
...     d[x] = x**2
...
>>> manager = Manager()
>>> d = manager.dict()
>>> job = [Process(target=f, args=(d, i)) for i in range(5)]
>>> _ = [p.start() for p in job]
>>> _ = [p.join() for p in job]
>>> print(d)
{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
This solution doesn't make copies of the dict to share across processes, so that part of the overhead is reduced. For large objects like a pandas.DataFrame, the copying cost can be significant compared to the cost of a simple operation like x**2. Similarly, spawning a Process can take time, and you may be able to do the above even faster (for lightweight objects) by using threads (i.e. importing from multiprocess.dummy instead of multiprocess, for either your originally posted solution or mine above).
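As a rough sketch of the thread-based variant, here is your loop run on a thread pool. It uses the standard library's multiprocessing.dummy (the counterpart of multiprocess.dummy), and the body of `read_energies` is a trivial placeholder I made up in place of your DataFrame-filling code; `read_all` and `workers` are hypothetical names too. Threads only help if the real work is I/O-bound or releases the GIL, which I'm assuming here, not asserting.

```python
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool

def read_energies(motif):
    # Placeholder for the real file reading; returns the same
    # (key, complex result, ligand result) shape as the original.
    return motif, len(motif), motif.upper()

def read_all(motifs, workers=4):
    """Fill both dicts from a pool of threads; nothing is pickled or
    copied, because threads share the parent's memory."""
    complex_energies = {}
    ligand_energies = {}
    with ThreadPool(workers) as pool:
        for motif, complx, lig in pool.imap_unordered(read_energies, motifs):
            complex_energies[motif] = complx
            ligand_energies[motif] = lig
    return complex_energies, ligand_energies
```

Switching back to processes is then a one-line change of the import, which makes it easy to time both variants on the real data.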
If you do need to share DataFrames (as your code suggests, rather than the dicts the question asks about), you might be able to do it by creating a shared memory numpy.ndarray.
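One possible way to set that up, sketched with multiprocessing.shared_memory from the standard library (Python 3.8+, which is my assumption, as are the helper names `create_shared_array`, `fill_row` and `demo`). To keep the sketch short, `fill_row` is called directly here; in real use it would be the target of a Process, which attaches to the block by name instead of receiving a copy of the array.

```python
import numpy as np
from multiprocessing import shared_memory

def create_shared_array(shape, dtype=np.float64):
    """Allocate a shared block and return (handle, ndarray view on it)."""
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    shm = shared_memory.SharedMemory(create=True, size=nbytes)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    arr[:] = 0
    return shm, arr

def fill_row(shm_name, shape, dtype, row, value):
    """What a worker would run: attach by name and write in place."""
    shm = shared_memory.SharedMemory(name=shm_name)
    view = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    view[row, :] = value
    del view      # drop the buffer reference before closing the handle
    shm.close()

def demo():
    shape = (3, 4)
    shm, arr = create_shared_array(shape)
    try:
        for row in range(shape[0]):
            fill_row(shm.name, shape, np.float64, row, float(row))
        result = arr.copy()   # private copy that outlives the block
    finally:
        del arr
        shm.close()
        shm.unlink()          # free the block once everyone is done
    return result
```

A DataFrame could then be built on top of the filled array in the parent, e.g. with pd.DataFrame(result), once the workers have finished.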