I am using Python 2.7 and trying to pickle an object. I am wondering what the real difference is between the pickle protocols.
import numpy as np
import pickle
class Data(object):
    def __init__(self):
        self.a = np.zeros((100, 37000, 3), dtype=np.float32)
d = Data()
print("data size: ", d.a.nbytes / 1000000.0)
print("highest protocol: ", pickle.HIGHEST_PROTOCOL)
pickle.dump(d, open("noProt", "w"))
pickle.dump(d, open("prot0", "w"), protocol=0)
pickle.dump(d, open("prot1", "w"), protocol=1)
pickle.dump(d, open("prot2", "w"), protocol=2)
out >> data size: 44.4
out >> highest protocol: 2
Then I found that the saved files have different sizes on disk:
noProt: 177.6MB
prot0: 177.6MB
prot1: 44.4MB
prot2: 44.4MB
I know that prot0 is a human-readable text file, so I don't want to use it.
I guess protocol 0 is the one given by default.
I wonder what the difference is between protocols 1 and 2. Is there a reason why I should choose one or the other?
Which is better to use, pickle or cPickle?
Use the latest protocol that can still be read by the oldest Python version that needs to load the data. Newer protocol versions support new language features and include optimisations.
From the pickle module data format documentation:
There are currently 6 different protocols which can be used for pickling. The higher the protocol used, the more recent the version of Python needed to read the pickle produced.
- Protocol version 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python.
- Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.
- Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.
- Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This was the default protocol in Python 3.0–3.7.
- Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. It is the default protocol starting with Python 3.8. Refer to PEP 3154 for information about improvements brought by protocol 4.
- Protocol version 5 was added in Python 3.8. It adds support for out-of-band data and speedup for in-band data. Refer to PEP 574 for information about improvements brought by protocol 5.
If a protocol is not specified, protocol 0 is used. If protocol is specified as a negative value or HIGHEST_PROTOCOL, the highest protocol version available will be used.
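To see how the protocol choice affects output size, here is a minimal sketch that pickles the same object under every protocol the running interpreter offers. It uses a plain list of floats as a stand-in for the NumPy array in the question (an assumption, so the absolute numbers will differ), but the text protocol 0 should still come out noticeably larger than the binary ones:

```python
import pickle

# A plain list of floats stands in for the NumPy array from the question.
values = [i / 7.0 for i in range(1000)]

# Pickle the same object once per available protocol and record the size.
sizes = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    sizes[protocol] = len(pickle.dumps(values, protocol=protocol))
    print("protocol %d: %d bytes" % (protocol, sizes[protocol]))
```

Protocol 0 writes each float as text, while protocols 1 and up store it in a fixed-width binary form, which is why the binary protocols are more compact here.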
So when you want to support loading the pickled data with Python 3.4 or newer, pick protocol 4. If you still need to support Python 2.7, pick protocol 2, especially if you are using custom classes derived from object (new-style classes), which any modern code does these days.
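That rule of thumb can be sketched as a small lookup. The helper name protocol_for and the table below are hypothetical and not part of the pickle module; the version numbers come from the documentation quoted above:

```python
import pickle

# (first Python version that can read the protocol, protocol number),
# newest first. Taken from the documentation quoted above.
_INTRODUCED = [
    ((3, 8), 5),
    ((3, 4), 4),
    ((3, 0), 3),
    ((2, 3), 2),
]

def protocol_for(oldest_reader):
    """Hypothetical helper: newest protocol that oldest_reader can load."""
    for version, protocol in _INTRODUCED:
        if oldest_reader >= version:
            # Never ask for more than the writing interpreter supports.
            return min(protocol, pickle.HIGHEST_PROTOCOL)
    return 1  # old binary format, readable by earlier versions

payload = {"a": [1.0, 2.0, 3.0]}
blob = pickle.dumps(payload, protocol=protocol_for((2, 7)))
```

Here protocol_for((2, 7)) selects protocol 2, so the resulting bytes remain loadable from Python 2.7.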
However, unless you are exchanging pickled data with other Python versions or otherwise need to maintain backwards compatibility with older Python versions, it's easiest to just stick with the highest protocol version you can lay your hands on:
with open("prot2", 'wb') as pfile:
    pickle.dump(d, pfile, protocol=pickle.HIGHEST_PROTOCOL)
pickle.HIGHEST_PROTOCOL will always be the right version for the current Python version. Because this is a binary format, make sure to use 'wb' as the file mode!
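The same applies when reading the data back: open the file with 'rb'. A minimal round-trip sketch, using a temporary directory and a plain dict in place of the Data object from the question (both are assumptions made to keep the example self-contained):

```python
import os
import pickle
import tempfile

# A plain dict stands in for the Data instance from the question.
payload = {"a": [0.0] * 1000}

# Write to a throwaway location rather than the working directory.
path = os.path.join(tempfile.mkdtemp(), "prot_highest")

# Binary mode for writing ...
with open(path, "wb") as pfile:
    pickle.dump(payload, pfile, protocol=pickle.HIGHEST_PROTOCOL)

# ... and binary mode for reading.
with open(path, "rb") as pfile:
    restored = pickle.load(pfile)
```

pickle.load() detects the protocol from the data itself, so no protocol argument is needed when reading.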
Python 3 no longer distinguishes between cPickle and pickle; always use pickle when using Python 3. It uses a compiled C extension under the hood.
If you are still using Python 2, then cPickle and pickle are mostly compatible; the differences lie in the API offered. For most use cases, just stick with cPickle; it is faster. Quoting the documentation again:
First, cPickle can be up to 1000 times faster than pickle because the former is implemented in C. Second, in the cPickle module the callables Pickler() and Unpickler() are functions, not classes. This means that you cannot use them to derive custom pickling and unpickling subclasses. Most applications have no need for this functionality and should benefit from the greatly improved performance of the cPickle module.
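If the same script has to run on both Python 2 and Python 3, one common pattern is to try cPickle first and fall back to pickle. This is only a sketch of that idiom; the payload is made up for illustration:

```python
try:
    import cPickle as pickle  # Python 2: the faster C implementation
except ImportError:
    import pickle  # Python 3: cPickle is gone, pickle is the only module

data = {"values": list(range(10))}

# Protocol 2 is the newest protocol that Python 2.7 can still read.
blob = pickle.dumps(data, protocol=2)
restored = pickle.loads(blob)
```

The rest of the code can then call pickle.dump() and pickle.load() without caring which module was actually imported.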