Why is it important to protect the main loop when using joblib.Parallel?

Question 1

Why is it important to protect the main loop when using joblib.Parallel?

python multiprocessing joblib

Joe · Apr 9, 2015 · Viewed 12.9k times · Source

Answer

Answer

This is necessary because Windows doesn't have fork(). Because of this limitation, Windows needs to re-import your __main__ module in all the child processes it spawns, in order to re-create the parent's state in the child. This means that if you have the code that spawns the new process at the module-level, it's going to be recursively executed in all the child processes. The if __name__ == "__main__" guard is used to prevent code at the module scope from being re-executed in the child processes.

This isn't necessary on Linux because it does have fork(), which allows it to fork a child process that maintains the same state of the parent, without re-importing the __main__ module.

Question 2

The joblib docs contain the following warning:

Under Windows, it is important to protect the main loop of code to avoid recursive spawning of subprocesses when using joblib.Parallel. In other words, you should be writing code like this:
import ....

def function1(...):
    ...

def function2(...):
    ...

... if __name__ == '__main__':
    # do stuff with imports and functions defined about
    ...
No code should run outside of the “if __name__ == ‘__main__’” blocks, only imports and definitions.

Initially, I assumed this was just to prevent against the occasional odd case where a function passed to joblib.Parallel called the module recursively, which would mean it was generally good practice but often unnecessary. However, it doesn't make sense to me why this would only be a risk on Windows. Additionally, this answer seems to indicate that failure to protect the main loop resulted in the code running several times slower than it otherwise would have for a very simple non-recursive problem.

Out of curiosity, I ran the super-simple example of an embarrassingly parallel loop from the joblib docs without protecting the main loop on a windows box. My terminal was spammed with the following error until I closed it:

ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not suppo
rt forking. To use parallel-computing in a script, you must protect you main loop using "if __name__ == '__main__'". Ple
ase see the joblib documentation on Parallel for more information

My question is, what about the windows implementation of joblib requires the main loop to be protected in every case?

Apologies if this is a super basic question. I am new to the world of parallelization, so I might just be missing some basic concepts, but I couldn't find this issue discussed explicitly anywhere.

Finally, I want to note that this is purely academic; I understand why it is generally good practice to write one's code in this way, and will continue to do so regardless of joblib.

Why is it important to protect the main loop when using joblib.Parallel?

Answer

Related questions