Python3 can't pickle _thread.RLock objects on list with multiprocessing

Orhan Solak · May 17, 2018

I'm trying to parse websites that list car properties (154 kinds of properties). I have a huge list (named liste_test) that consists of 280,000 used-car advertisement URLs.

def araba_cekici(liste_test,headers,engine):
    for link in liste_test:
        try:
            page = requests.get(link, headers=headers)
        .....
        .....

When I start my code like this:

araba_cekici(liste_test,headers,engine)

It works and returns results. But in approximately one hour I could only obtain the properties for 1,500 URLs. That is very slow, so I need to use multiprocessing.

I found a multiprocessing solution here and applied it to my code, but unfortunately it does not work.

import sys
import numpy as np
import multiprocessing as multi

def chunks(n, page_list):
    """Splits the list into n chunks"""
    return np.array_split(page_list,n)

cpus = multi.cpu_count()

workers = []   
page_bins = chunks(cpus, liste_test)


for cpu in range(cpus):
    sys.stdout.write("CPU " + str(cpu) + "\n")
    # Process that will send the corresponding list of pages
    # to the function araba_cekici
    worker = multi.Process(name=str(cpu), 
                           target=araba_cekici, 
                           args=(page_bins[cpu],headers,engine))
    worker.start()
    workers.append(worker)

for worker in workers:
    worker.join()

And it gives:

TypeError: can't pickle _thread.RLock objects

I found several answers regarding this error, but none of them work (at least, I can't apply them to my code). I also tried Python's multiprocessing Pool, but unfortunately it gets stuck in Jupyter Notebook and the code seems to run forever.

Answer

IonicSolutions · Sep 20, 2018

Late answer, but since this question turns up when searching on Google: multiprocessing has to pickle the data it sends to the worker processes (for tasks handed to a Pool, and for Process arguments when child processes are spawned rather than forked, as on Windows), so every object you pass must be picklable.
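
The same error can be reproduced without starting any process. This is a minimal sketch, assuming the unpicklable object keeps a threading.RLock somewhere inside it. The helper name pickle_error and the "pool_lock" key are made up for illustration and are not taken from the question.

```python
import pickle
import threading


def pickle_error(obj):
    """Return the exception raised by pickle.dumps(obj), or None on success."""
    try:
        pickle.dumps(obj)
    except Exception as exc:
        return exc
    return None


# Hypothetical stand-in for an object that holds a lock internally.
holds_lock = {"pool_lock": threading.RLock()}
# Plain data such as a header dict pickles without trouble.
headers = {"User-Agent": "Mozilla/5.0"}

print(type(pickle_error(holds_lock)).__name__)  # TypeError
print(pickle_error(headers))                    # None
```

Running a check like this on each argument before passing it to a process is a quick way to find out which one is the culprit.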

In your code, you try to pass headers and engine, whose implementations you don't show. (Since headers holds the HTTP request headers, I suspect that engine is the issue here.) To solve your issue, you either have to make engine picklable, or only instantiate engine within the worker process.
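
The second option could look roughly like the sketch below. The question does not show what engine is, so FakeEngine, engine_url, worker and run_in_processes are hypothetical names used only for illustration. The idea is to pass a picklable setting (here a string) and build the engine inside each worker.

```python
import multiprocessing as mp
import threading


class FakeEngine:
    """Hypothetical stand-in for `engine`; the RLock makes it unpicklable."""

    def __init__(self, url):
        self.url = url
        self._lock = threading.RLock()


def worker(links, headers, engine_url):
    """Process one chunk of links with an engine created in this process."""
    # Only the picklable URL string crossed the process boundary;
    # the engine itself is built here, inside the worker.
    engine = FakeEngine(engine_url)
    # Placeholder for the real work (fetch each link, store via engine).
    return [(link, engine.url) for link in links]


def run_in_processes(link_chunks, headers, engine_url):
    """Start one process per chunk, passing only picklable arguments."""
    procs = [
        mp.Process(target=worker, args=(chunk, headers, engine_url))
        for chunk in link_chunks
    ]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
```

Call run_in_processes from under an `if __name__ == "__main__":` guard, which is required when child processes are spawned. Note that Process discards the return value of worker, so real code would write its results to the database or to a queue instead.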