ThreadPoolExecutor, ProcessPoolExecutor and global variables

Bram Vanroy picture Bram Vanroy · Jun 15, 2018 · Viewed 8k times · Source

I am new to parallelization in general and concurrent.futures in particular. I want to benchmark my script and compare the differences between using threads and processes, but I found that I couldn't even get that running because when using ProcessPoolExecutor I cannot use my global variables.

The following code will output Helloas I expect, but when you change ThreadPoolExecutor for ProcessPoolExecutor, it will output None.

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

greeting = None

def process():
    print(greeting)

    return None


def main():
    with ThreadPoolExecutor(max_workers=1) as executor:
        executor.submit(process)

    return None


def init():
    global greeting
    greeting = 'Hello'

    return None

if __name__ == '__main__':
    init()
    main()

I don't understand why this is the case. In my real program, init is used to set the global variables to CLI arguments, and there are a lot of them. Hence, passing them as arguments does not seem recommended. So how do I pass those global variables to each process/thread correctly?

I know that I can change things around, which will work, but I don't understand why. E.g. the following works for both Executors, but it also means that the globals initialisation has to happen for every instance.

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

greeting = None

def init():
    global greeting
    greeting = 'Hello'

    return None


def main():
    with ThreadPoolExecutor(max_workers=1) as executor:
        executor.submit(process)

    return None

def process():
    init()
    print(greeting)

    return None

if __name__ == '__main__':
    main()

So my main question is, what is actually happening. Why does this code work with threads and not with processes? And, how do I correctly pass set globals to each process/thread without having to re-initialise them for every instance?

(Side note: because I have read that concurrent.futures might behave differently on Windows, I have to note that I am running Python 3.6 on Windows 10 64 bit.)

Answer

jedwards picture jedwards · Jun 15, 2018

I'm not sure of the limitations of this approach, but you can pass (serializable?) objects between your main process/thread. This would also help you get rid of the reliance on global vars:

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def process(opts):
    opts["process"] = "got here"
    print("In process():", opts)

    return None


def main(opts):
    opts["main"] = "got here"
    executor = [ProcessPoolExecutor, ThreadPoolExecutor][1]
    with executor(max_workers=1) as executor:
        executor.submit(process, opts)

    return None


def init(opts):                         # Gather CLI opts and populate dict
    opts["init"] = "got here"

    return None


if __name__ == '__main__':
    cli_opts = {"__main__": "got here"} # Initialize dict
    init(cli_opts)                      # Populate dict
    main(cli_opts)                      # Use dict

Works with both executor types.

Edit: Even though it sounds like it won't be a problem for your use case, I'll point out that with ProcessPoolExecutor, the opts dict you get inside process will be a frozen copy, so mutations to it will not be visible across processes nor will they be visible once you return to the __main__ block. ThreadPoolExecutor, on the other hand, will share the dict object between threads.