What are the different use cases of joblib versus pickle?

msunbot picture msunbot · Sep 27, 2012 · Viewed 36k times · Source

Background: I'm just getting started with scikit-learn, and read at the bottom of the page about joblib, versus pickle.

it may be more interesting to use joblib’s replacement of pickle (joblib.dump & joblib.load), which is more efficient on big data, but can only pickle to the disk and not to a string

I read this Q&A on Pickle, Common use-cases for pickle in Python and wonder if the community here can share the differences between joblib and pickle? When should one use one over another?

Answer

ogrisel picture ogrisel · Sep 27, 2012
  • joblib is usually significantly faster on large numpy arrays because it has a special handling for the array buffers of the numpy datastructure. To find about the implementation details you can have a look at the source code. It can also compress that data on the fly while pickling using zlib or lz4.
  • joblib also makes it possible to memory map the data buffer of an uncompressed joblib-pickled numpy array when loading it which makes it possible to share memory between processes.
  • if you don't pickle large numpy arrays, then regular pickle can be significantly faster, especially on large collections of small python objects (e.g. a large dict of str objects) because the pickle module of the standard library is implemented in C while joblib is pure python.
  • since PEP 574 (Pickle protocol 5) has been merged in Python 3.8, it is now much more efficient (memory-wise and cpu-wise) to pickle large numpy arrays using the standard library. Large arrays in this context means 4GB or more.
  • But joblib can still be useful with Python 3.8 to load objects that have nested numpy arrays in memory mapped mode with mmap_mode="r".