So, I am playing around with multiprocessing.Pool and Numpy, but it seems I have missed some important point. Why is the pool version much slower? I looked at htop and I can see several processes being created, but they all share one of the CPUs, adding up to ~100%.
$ cat test_multi.py
import numpy as np
from timeit import timeit
from multiprocessing import Pool

def mmul(matrix):
    for i in range(100):
        matrix = matrix * matrix
    return matrix

if __name__ == '__main__':
    matrices = []
    for i in range(4):
        matrices.append(np.random.random_integers(100, size=(1000, 1000)))

    pool = Pool(8)
    print timeit(lambda: map(mmul, matrices), number=20)
    print timeit(lambda: pool.map(mmul, matrices), number=20)
$ python test_multi.py
16.0265390873
19.097837925
[update]
Changed to timeit for benchmarking the processes. Still no change: the pool version is still slower, and I can see in htop that only one core is used although several processes are spawned.
[update2]
At the moment I am reading about @Jan-Philip Gehrcke's suggestion to use multiprocessing.Process() and Queue. But in the meantime I would like to know: could my problem have something to do with Numpy?
I have learned that one often gets better answers when others know the end goal, so here it is: I have a lot of files which are at the moment loaded and processed serially. The processing is CPU-intensive, so I assume much could be gained by parallelization. My aim is to call the Python function that analyses a file in parallel. Furthermore, this function is just an interface to C code; I assume that makes a difference.
Ubuntu 12.04, Python 2.7.3, i7 860 @ 2.80 GHz - Please leave a comment if you need more info.
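To make the Process()/Queue suggestion concrete to myself, here is a stripped-down sketch of the pattern as I understand it. analyse_file() is just a placeholder I made up for the real (C-backed) library call and the file names are dummies, so this only illustrates the structure, not a working solution:

import multiprocessing as mp

def analyse_file(path):
    # placeholder for the real, C-backed analysis function
    return len(path)

def worker(task_queue, result_queue):
    # pull file names until the sentinel None arrives, then exit
    for path in iter(task_queue.get, None):
        result_queue.put((path, analyse_file(path)))

if __name__ == '__main__':
    files = ['file_%d.dat' % i for i in range(8)]   # dummy file names
    tasks, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(tasks, results))
             for _ in range(4)]
    for p in procs:
        p.start()
    for f in files:
        tasks.put(f)
    for _ in procs:        # one sentinel per worker so that all of them stop
        tasks.put(None)
    for _ in files:
        print results.get()
    for p in procs:
        p.join()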
[update3]
Here are the results from Stefano's example code. For some reason there is no speed up. :/
testing with 16 matrices
base 4.27
1 5.07
2 4.76
4 4.71
8 4.78
16 4.79
testing with 32 matrices
base 8.82
1 10.39
2 10.58
4 10.73
8 9.46
16 9.54
testing with 64 matrices
base 17.38
1 19.34
2 19.62
4 19.59
8 19.39
16 19.34
[update 4] answer to Jan-Philip Gehrcke's comment
Sorry that I haven't made myself clearer. As I wrote in update 2, my main goal is to parallelize many serial calls of a 3rd-party Python library function. This function is an interface to some C code. I was recommended to use Pool, but this didn't work, so I tried something simpler, the example shown above with numpy. But there, too, I could not achieve a performance improvement, even though it looks embarrassingly parallelizable to me. So I assume I must have missed something important. This information is what I am looking for with this question and bounty.
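For reference, this is roughly the Pool pattern I was trying to get working for the real task; analyse_file() is again only a stand-in I invented for the 3rd-party function and the file list is a dummy:

from multiprocessing import Pool

def analyse_file(path):
    # stand-in for the 3rd-party, C-backed analysis function
    return len(path)

if __name__ == '__main__':
    files = ['file_%d.dat' % i for i in range(100)]  # dummy file names
    pool = Pool(4)                                   # one worker per physical core
    results = pool.map(analyse_file, files)
    pool.close()
    pool.join()
    print results[:5]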
[update 5]
Thanks for all your tremendous input. But reading through your answers only creates more questions for me. For that reason I will read about the basics and create new SO questions when I have a clearer understanding of what I don't know.
Regarding the fact that all of your processes are running on the same CPU, see my answer here.
During import, numpy changes the CPU affinity of the parent process, such that when you later use Pool, all of the worker processes that it spawns will end up vying for the same core rather than using all of the cores available on your machine.
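If you want to see this for yourself, a quick check (Linux only, and assuming taskset is installed) is to print the affinity mask of the parent and of the workers; show_affinity below is just a throwaway helper for illustration:

import numpy      # the import itself is what clobbers the affinity
import os
from multiprocessing import Pool

def show_affinity(_):
    # 'taskset -p <pid>' prints the current affinity mask of that pid
    os.system("taskset -p %d" % os.getpid())

if __name__ == '__main__':
    os.system("taskset -p %d" % os.getpid())   # parent process
    pool = Pool(4)
    pool.map(show_affinity, range(4))          # workers report their masks

On an affected machine you should see every reported mask come out as 1, i.e. everything is pinned to core 0.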
You can call taskset after you import numpy to reset the CPU affinity so that all cores are used:
import numpy as np
import os
from timeit import timeit
from multiprocessing import Pool

def mmul(matrix):
    for i in range(100):
        matrix = matrix * matrix
    return matrix

if __name__ == '__main__':
    matrices = []
    for i in range(4):
        matrices.append(np.random.random_integers(100, size=(1000, 1000)))

    print timeit(lambda: map(mmul, matrices), number=20)

    # after importing numpy, reset the CPU affinity of the parent process so
    # that it will use all cores
    os.system("taskset -p 0xff %d" % os.getpid())

    pool = Pool(8)
    print timeit(lambda: pool.map(mmul, matrices), number=20)
Output:
$ python tmp.py
12.4765810966
pid 29150's current affinity mask: 1
pid 29150's new affinity mask: ff
13.4136221409
If you watch CPU usage using top while you run this script, you should see it using all of your cores when it executes the 'parallel' part. As others have pointed out, in your original example the overhead involved in pickling data, process creation, etc. probably outweighs any possible benefit of parallelisation.
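If you want a rough feel for how large that serialisation cost is, a back-of-the-envelope sketch like the following (using cPickle; the exact numbers will of course depend on your machine) times a pickle round trip for one 1000x1000 matrix, which approximates what Pool pays to ship each argument to a worker and each result back:

import numpy as np
import cPickle as pickle
from timeit import timeit

matrix = np.random.random_integers(100, size=(1000, 1000))

# one dumps/loads round trip roughly models sending the matrix to a worker
# (the result coming back costs about the same again)
dt = timeit(lambda: pickle.loads(pickle.dumps(matrix, protocol=-1)),
            number=20)
print "20 pickle round trips of a 1000x1000 matrix: %.2f s" % dt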
Edit: I suspect that part of the reason why the single process seems to be consistently faster is that numpy
may have some tricks for speeding up that element-wise matrix multiplication that it cannot use when the jobs are spread across multiple cores.
For example, if I just use ordinary Python lists to compute the Fibonacci sequence, I can get a huge speedup from parallelisation. Likewise, if I do element-wise multiplication in a way that takes no advantage of vectorization, I get a similar speedup for the parallel version:
import numpy as np
import os
from timeit import timeit
from multiprocessing import Pool

def fib(dummy):
    n = [1, 1]
    for ii in xrange(100000):
        n.append(n[-1] + n[-2])

def silly_mult(matrix):
    for row in matrix:
        for val in row:
            val * val

if __name__ == '__main__':
    dt = timeit(lambda: map(fib, xrange(10)), number=10)
    print "Fibonacci, non-parallel: %.3f" % dt

    matrices = [np.random.randn(1000, 1000) for ii in xrange(10)]
    dt = timeit(lambda: map(silly_mult, matrices), number=10)
    print "Silly matrix multiplication, non-parallel: %.3f" % dt

    # after importing numpy, reset the CPU affinity of the parent process so
    # that it will use all CPUs
    os.system("taskset -p 0xff %d" % os.getpid())

    pool = Pool(8)

    dt = timeit(lambda: pool.map(fib, xrange(10)), number=10)
    print "Fibonacci, parallel: %.3f" % dt

    dt = timeit(lambda: pool.map(silly_mult, matrices), number=10)
    print "Silly matrix multiplication, parallel: %.3f" % dt
Output:
$ python tmp.py
Fibonacci, non-parallel: 32.449
Silly matrix multiplication, non-parallel: 40.084
pid 29528's current affinity mask: 1
pid 29528's new affinity mask: ff
Fibonacci, parallel: 9.462
Silly matrix multiplication, parallel: 12.163