I'm trying to parse the websites that contain car's properties(154 kinds of properties). I have a huge list(name is liste_test) that consist of 280.000 used car announcement URL.
def araba_cekici(liste_test,headers,engine):
for link in liste_test:
try:
page = requests.get(link, headers=headers)
.....
.....
When I start my code like that:
araba_cekici(liste_test,headers,engine)
It works and getting results. But approximately in 1 hour, I could only obtain 1500 URL's properties. It is very slow, and I must use multiprocessing.
I found a result on here with multiprocessing. Then I applied to my code, but unfortunately, it is not working.
import numpy as np
import multiprocessing as multi
def chunks(n, page_list):
"""Splits the list into n chunks"""
return np.array_split(page_list,n)
cpus = multi.cpu_count()
workers = []
page_bins = chunks(cpus, liste_test)
for cpu in range(cpus):
sys.stdout.write("CPU " + str(cpu) + "\n")
# Process that will send corresponding list of pages
# to the function perform_extraction
worker = multi.Process(name=str(cpu),
target=araba_cekici,
args=(page_bins[cpu],headers,engine))
worker.start()
workers.append(worker)
for worker in workers:
worker.join()
And it gives:
TypeError: can't pickle _thread.RLock objects
I found some kind of responses with respects to this error. But none of them works(at least I can't apply to my code). Also, I tried python multiprocess Pool but unfortunately it stucks on jupyter notebook and seems this code works infinitely.
Late answer, but since this question turns up when searching on Google: multiprocessing
sends the data to the worker processes via a multiprocessing.Queue
, which requires all data/objects sent to be picklable.
In your code, you try to pass header
and engine
, whose implementations you don't show. (Since header
holds the HTTP request header, I suspect that engine
is the issue here.) To solve your issue, you either have to make engine
picklable, or only instantiate engine
within the worker process.