I wrote a sample program to train an SVM using sklearn. Here is the code:
from sklearn import svm
from sklearn import datasets
from sklearn.externals import joblib
clf = svm.SVC()
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf.fit(X, y)
print(clf.predict(X))
joblib.dump(clf, 'clf.pkl')
When I dump the model, I get this many files:
['clf.pkl', 'clf.pkl_01.npy', 'clf.pkl_02.npy', 'clf.pkl_03.npy', 'clf.pkl_04.npy', 'clf.pkl_05.npy', 'clf.pkl_06.npy', 'clf.pkl_07.npy', 'clf.pkl_08.npy', 'clf.pkl_09.npy', 'clf.pkl_10.npy', 'clf.pkl_11.npy']
I am confused whether I did something wrong or if this is normal. What are the *.npy files, and why are there 11 of them?
To save everything into one file, you should set the compress parameter to True or to any integer (1, for example).
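For example, with the code from the question (a minimal sketch; compress is a standard argument of joblib.dump, the rest is just the question's setup):

from sklearn import svm, datasets
from sklearn.externals import joblib

clf = svm.SVC()
iris = datasets.load_iris()
clf.fit(iris.data, iris.target)

# compress=1 (or compress=True) writes a single 'clf.pkl' instead of extra *.npy files
joblib.dump(clf, 'clf.pkl', compress=1)

# loading works the same way regardless of compression
clf = joblib.load('clf.pkl')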
However, you should know that the separated representation of NumPy arrays is necessary for the main features of joblib's dump/load: thanks to it, joblib can save and load objects containing NumPy arrays faster than pickle, and, unlike pickle, it can correctly save and load objects with memmapped NumPy arrays. If you want single-file serialization of the whole object (and don't need to save memmapped arrays), I think it would be better to use pickle; AFAIK, in that case joblib's dump/load works at about the same speed as pickle.
import pickle
import numpy as np
from sklearn.externals import joblib
vector = np.arange(0, 10**7)
%timeit joblib.dump(vector, 'vector.pkl')
# 1 loops, best of 3: 818 ms per loop
# file size ~ 80 MB
%timeit vector_load = joblib.load('vector.pkl')
# 10 loops, best of 3: 47.6 ms per loop
# Compressed
%timeit joblib.dump(vector, 'vector.pkl', compress=1)
# 1 loops, best of 3: 1.58 s per loop
# file size ~ 15.1 MB
%timeit vector_load = joblib.load('vector.pkl')
# 1 loops, best of 3: 442 ms per loop
# Pickle
%%timeit
with open('vector.pkl', 'wb') as f:
    pickle.dump(vector, f)
# 1 loops, best of 3: 927 ms per loop
%%timeit
with open('vector.pkl', 'rb') as f:
    vector_load = pickle.load(f)
# 10 loops, best of 3: 94.1 ms per loop