How to use all the cpu cores using Dask?

ANKIT JHA · Jul 6, 2018 · Viewed 8.7k times

I have a pandas Series with more than 35000 rows. I want to use Dask to make processing it more efficient. However, both the Dask code and the pandas code are taking the same amount of time. Initially, "ser" is a pandas Series, and fun1 and fun2 are basic functions that perform pattern matching on individual rows of the Series.
Pandas
ser = ser.apply(fun1).apply(fun2)

Dask
ser = dd.from_pandas(ser, npartitions=16)
ser = ser.apply(fun1).apply(fun2)

On checking the CPU core usage, I found that not all of the cores were being used; only one core was running at 100%.

Is there any way to make the Series code faster using Dask, or to utilize all the CPU cores while performing Dask operations on a Series?

Answer

mdurant · Jul 9, 2018

See http://dask.pydata.org/en/latest/scheduler-overview.html

It is likely that the functions you are calling are pure Python, and so hold the GIL, the lock which ensures that only one Python instruction is carried out at a time within a process. In this case, you will need to run your functions in separate processes to see any parallelism. You could do this by using the multiprocessing scheduler:

ser = ser.apply(fun1).apply(fun2).compute(scheduler='processes')
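For example, here is a minimal, self-contained sketch of that approach. fun1 and fun2 below are hypothetical stand-ins for your pattern-matching functions; note that they must be defined at module top level so they can be pickled and sent to the worker processes:

import re

import pandas as pd
import dask.dataframe as dd

def fun1(s):
    # hypothetical pattern match on a single value
    return "hit" if re.search(r"foo", s) else s

def fun2(s):
    # hypothetical second transformation
    return s.upper()

if __name__ == "__main__":  # guard required on platforms that spawn worker processes
    ser = pd.Series(["foo bar"] * 35000)
    dser = dd.from_pandas(ser, npartitions=16)
    # meta tells dask the name/dtype of the output so it can skip a trial run
    result = (dser.apply(fun1, meta=("x", "object"))
                  .apply(fun2, meta=("x", "object"))
                  .compute(scheduler="processes"))

Keep in mind that moving data between processes has a cost, so this only pays off when fun1 and fun2 are expensive relative to the serialization overhead.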

or by using the distributed scheduler (which works fine on a single machine, and actually comes with some next-generation benefits, such as the status dashboard); in the simplest, default case, creating a client is enough:

from dask.distributed import Client
client = Client()

but you should read the docs.
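As a rough sketch of that route (the dashboard address in the comment is the usual default for a local Client, but may vary on your machine):

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

if __name__ == "__main__":
    client = Client()  # starts a local cluster; dashboard is typically at http://localhost:8787
    dser = dd.from_pandas(pd.Series(["foo bar"] * 35000), npartitions=16)
    # with a Client active, .compute() uses the distributed scheduler by default
    result = dser.apply(str.upper, meta=("x", "object")).compute()
    client.close()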