(This question is about how to make multiprocessing.Pool() run code faster. I finally solved it, and the final solution can be found at the bottom of the post.)
Original Question:
I'm trying to use Python to compare a word with many other words in a list and retrieve a list of the most similar ones. To do that I am using the difflib.get_close_matches function. I'm on a relatively new and powerful Windows 7 Laptop computer, with Python 2.6.5.
What I want is to speed up the comparison process because my comparison list of words is very long and I have to repeat the comparison process several times. When I heard about the multiprocessing module it seemed logical that if the comparison could be broken up into worker tasks and run simultaneously (and thus making use of machine power in exchange for faster speed) my comparison task would finish faster.
However, even after having tried many different ways, and used methods that have been shown in the docs and suggested in forum posts, the Pool method just seems to be incredibly slow, much slower than just running the original get_close_matches function on the entire list at once. I would like help understanding why Pool() is being so slow and if I am using it correctly. Im only using this string comparison scenario as an example because that is the most recent example I could think of where I was unable to understand or get multiprocessing to work for rather than against me. Below is just an example code from the difflib scenario showing the time differences between the ordinary and the Pooled methods:
from multiprocessing import Pool
import random, time, difflib
# constants
wordlist = ["".join([random.choice([letter for letter in "abcdefghijklmnopqersty"]) for lengthofword in xrange(5)]) for nrofwords in xrange(1000000)]
mainword = "hello"
# comparison function
def findclosematch(subwordlist):
matches = difflib.get_close_matches(mainword,subwordlist,len(subwordlist),0.7)
if matches <> []:
return matches
# pool
print "pool method"
if __name__ == '__main__':
pool = Pool(processes=3)
t=time.time()
result = pool.map_async(findclosematch, wordlist, chunksize=100)
#do something with result
for r in result.get():
pass
print time.time()-t
# normal
print "normal method"
t=time.time()
# run function
result = findclosematch(wordlist)
# do something with results
for r in result:
pass
print time.time()-t
The word to be found is "hello", and the list of words in which to find close matches is a 1 million long list of 5 randomly joined characters (only for illustration purposes). I use 3 processor cores and the map function with a chunksize of 100 (listitems to be procesed per worker I think??) (I also tried chunksizes of 1000 and 10 000 but there was no real difference). Notice that in both methods I start the timer right before calling on my function and end it right after having looped through the results. As you can see below the timing results are clearly in favor of the original non-Pool method:
>>>
pool method
37.1690001488 seconds
normal method
10.5329999924 seconds
>>>
The Pool method is almost 4 times slower than the original method. Is there something I am missing here, or maybe misunderstanding about how the Pooling/multiprocessing works? I do suspect that part of the problem here could be that the map function returns None and so adds thousands of unneccessary items to the resultslist even though I only want actual matches to be returned to the results and have written it as such in the function. From what I understand that is just how map works. I have heard about some other functions like filter that only collects non-False results, but I dont think that multiprocessing/Pool supports the filter method. Are there any other functions besides map/imap in the multiprocessing module that could help me out in only returning what my function returns? Apply function is more for giving multiple arguments as I understand it.
I know there's also the imap function, which I tried but without any time-improvements. The reason being the same reason why I have had problems understanding what's so great about the itertools module, supposedly "lightning fast", which I've noticed is true for calling the function, but in my experience and from what I've read that's because calling the function doesn't actually do any calculations, so when it's time to iterate through the results to collect and analyze them (without which there would be no point in calling the cuntion) it takes just as much or sometimes more time than a just using the normal version of the function straightup. But I suppose that's for another post.
Anyway, excited to see if someone can nudge me in the right direction here, and really appreciate any help on this. I'm more interested in understanding multiprocessing in general than to get this example to work, though it would be useful with some example solution code suggestions to aid in my understanding.
The Answer:
Seems like the slowdown had to do with the slow startup time of additional processes. I couldnt get the .Pool() function to be fast enough. My final solution to make it faster was to manually split the workload list, use multiple .Process() instead of .Pool(), and return the solutions in a Queue. But I wonder if maybe the most crucial change might have been splitting the workload in terms of the main word to look for rather than the words to compare with, perhaps because the difflib search function is already so fast. Here is the new code running 5 processes at the same time, and turned out about x10 faster than running a simple code (6 seconds vs 55 seconds). Very useful for fast fuzzy lookups, on top of how fast difflib already is.
from multiprocessing import Process, Queue
import difflib, random, time
def f2(wordlist, mainwordlist, q):
for mainword in mainwordlist:
matches = difflib.get_close_matches(mainword,wordlist,len(wordlist),0.7)
q.put(matches)
if __name__ == '__main__':
# constants (for 50 input words, find closest match in list of 100 000 comparison words)
q = Queue()
wordlist = ["".join([random.choice([letter for letter in "abcdefghijklmnopqersty"]) for lengthofword in xrange(5)]) for nrofwords in xrange(100000)]
mainword = "hello"
mainwordlist = [mainword for each in xrange(50)]
# normal approach
t = time.time()
for mainword in mainwordlist:
matches = difflib.get_close_matches(mainword,wordlist,len(wordlist),0.7)
q.put(matches)
print time.time()-t
# split work into 5 or 10 processes
processes = 5
def splitlist(inlist, chunksize):
return [inlist[x:x+chunksize] for x in xrange(0, len(inlist), chunksize)]
print len(mainwordlist)/processes
mainwordlistsplitted = splitlist(mainwordlist, len(mainwordlist)/processes)
print "list ready"
t = time.time()
for submainwordlist in mainwordlistsplitted:
print "sub"
p = Process(target=f2, args=(wordlist,submainwordlist,q,))
p.Daemon = True
p.start()
for submainwordlist in mainwordlistsplitted:
p.join()
print time.time()-t
while True:
print q.get()
These problems usually boil down to the following:
The function you are trying to parallelize doesn't require enough CPU resources (i.e. CPU time) to rationalize parallelization!
Sure, when you parallelize with multiprocessing.Pool(8)
, you theoretically (but not practically) could get a 8x speed up.
However, keep in mind that this isn't free - you gain this parallelization at the expense of the following overhead:
task
for every chunk
(of size chunksize
) in your iter
passed to Pool.map(f, iter)
task
task
, and the task's
return value (think pickle.dumps()
)task
, and the task's
return value (think pickle.loads()
)Locks
on shared memory Queues
, while worker processes and parent processes get()
and put()
from/to these Queues
.os.fork()
for each worker process, which is expensive.In essence, when using Pool()
you want:
iter
to justify the one-time cost of (3) above.For a more in-depth exploration, this post and linked talk walk-through how large data being passed to Pool.map()
(and friends) gets you into trouble.
Raymond Hettinger also talks about proper use of Python's concurrency here.