sklearn dumping model using joblib, dumps multiple files. Which one is the correct model?

kcc__ picture kcc__ · Nov 3, 2015 · Viewed 27.8k times · Source

I did a sample program to train a SVM using sklearn. Here is the code

from sklearn import svm
from sklearn import datasets
from sklearn.externals import joblib

clf = svm.SVC()
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf.fit(X, y)

print(clf.predict(X))
joblib.dump(clf, 'clf.pkl') 

When I dump the model file I get this amount of files. :

['clf.pkl', 'clf.pkl_01.npy', 'clf.pkl_02.npy', 'clf.pkl_03.npy', 'clf.pkl_04.npy', 'clf.pkl_05.npy', 'clf.pkl_06.npy', 'clf.pkl_07.npy', 'clf.pkl_08.npy', 'clf.pkl_09.npy', 'clf.pkl_10.npy', 'clf.pkl_11.npy']

I am confused if I did something wrong. Or is this normal? What is *.npy files. And why there are 11?

Answer

Ibraim Ganiev picture Ibraim Ganiev · Nov 3, 2015

To save everything into 1 file you should set compression to True or any number (1 for example).

But you should know that separated representation of np arrays is necessary for main features of joblib dump/load, joblib can load and save objects with np arrays faster than Pickle due to this separated representation, and in contrast to Pickle joblib can correctly save and load objects with memmap numpy arrays. If you want to have one file serialization of whole object (And don't want to save memmap np arrays) - i think that it would be better to use Pickle, AFAIK in this case joblib dump/load functionality will work at same speed as Pickle.

import numpy as np
from scikit-learn.externals import joblib

vector = np.arange(0, 10**7)

%timeit joblib.dump(vector, 'vector.pkl')
# 1 loops, best of 3: 818 ms per loop
# file size ~ 80 MB
%timeit vector_load = joblib.load('vector.pkl')
# 10 loops, best of 3: 47.6 ms per loop

# Compressed
%timeit joblib.dump(vector, 'vector.pkl', compress=1)
# 1 loops, best of 3: 1.58 s per loop
# file size ~ 15.1 MB
%timeit vector_load = joblib.load('vector.pkl')
# 1 loops, best of 3: 442 ms per loop

# Pickle
%%timeit
with open('vector.pkl', 'wb') as f:
    pickle.dump(vector, f)
# 1 loops, best of 3: 927 ms per loop
%%timeit                                    
with open('vector.pkl', 'rb') as f:
    vector_load = pickle.load(f)
# 10 loops, best of 3: 94.1 ms per loop