Sklearn joblib load function IO error from AWS S3

Jasmine · Aug 26, 2015 · Viewed 7.3k times

I am trying to load a pkl dump of my classifier from scikit-learn.

The joblib dump compresses my object much better than the cPickle dump, so I would like to stick with it. However, I am getting an error when trying to read the object from AWS S3.

Cases:

  • Pkl object hosted locally: pickle.load works, joblib.load works
  • Pkl object pushed to Heroku with app (load from static folder): pickle.load works, joblib.load works
  • Pkl object pushed to S3: pickle.load works, joblib.load returns IOError (tested from the Heroku app and from a local script)

Note that the pkl objects for joblib and pickle are different objects dumped with their respective methods (i.e. joblib.load only reads files written by joblib.dump(obj), and cPickle.load only reads files written by cPickle.dump(obj)).
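For reference, a minimal sketch of how the two kinds of dump might be produced; the toy estimator and file names below are placeholders, not the actual classifier from this question:

from sklearn.externals import joblib
from sklearn.linear_model import LogisticRegression
import cPickle

# toy estimator standing in for the real classifier
clf = LogisticRegression().fit([[0.0], [1.0]], [0, 1])

# joblib.dump is optimized for objects that hold large numpy arrays
joblib.dump(clf, 'classifier.pkl')

# cPickle writes a single byte stream readable by cPickle.load
with open('classifier_cpickle.pkl', 'wb') as f:
    cPickle.dump(clf, f)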

Joblib vs cPickle code

# case 2, this works for joblib, object pushed to heroku
resources_dir = os.getcwd() + "/static/res/" # main resource directory
input = joblib.load(resources_dir + 'classifier.pkl')

# case 3, this does not work for joblib, object hosted on s3
aws_app_assets = "https://%s.s3.amazonaws.com/static/res/" % keys.AWS_BUCKET_NAME
classifier_url_s3 = aws_app_assets + 'classifier.pkl'

# does not work with raw url, IO Error
classifier = joblib.load(classifier_url_s3)

# urllib2: can't open instance
# TypeError: coercing to Unicode: need string or buffer, instance found
req = urllib2.Request(url=classifier_url_s3)
f = urllib2.urlopen(req)
classifier = joblib.load(urllib2.urlopen(classifier_url_s3))

# but works with a cPickle object hosted on S3
classifier = cPickle.load(urllib2.urlopen(classifier_url_s3))

My app works fine in case 2, but because loading is very slow, I wanted to try pushing all static files out to S3, particularly these pickle dumps. Is there something inherently different about the way joblib loads vs. pickle that would cause this error?

This is my error

File "/usr/local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 409, in load
with open(filename, 'rb') as file_handle:
IOError: [Errno 2] No such file or directory: classifier url on s3
[Finished in 0.3s with exit code 1]

It is not a permissions issue, as I've made all my objects on S3 public for testing and the pickle.dump objects load fine. The joblib.dump object also downloads if I enter the URL directly in the browser.

I could be completely missing something.

Thanks.

Answer

volodymyr · Sep 3, 2015

joblib.load() expects the name of a file present on the filesystem.

Signature: joblib.load(filename, mmap_mode=None)
Parameters
-----------
filename: string
    The name of the file from which to load the object
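As an aside, more recent releases of the standalone joblib package also accept an open file object, so on a newer stack than the sklearn.externals.joblib used here you could stream the bytes and skip the temporary file. A hedged Python 3 sketch, reusing classifier_url_s3 from the question:

import io
import urllib.request  # Python 3 equivalent of the question's urllib2
import joblib  # standalone package, newer than sklearn.externals.joblib

# download the pickle bytes and hand joblib a seekable file-like object
data = urllib.request.urlopen(classifier_url_s3).read()
clf = joblib.load(io.BytesIO(data))  # recent joblib versions accept file objects

This is version-dependent, so copying to the local filesystem as shown below remains the portable route.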

Moreover, making all your resources public might not be a good idea for other assets, even if you don't mind the pickled model being accessible to the world.

It is rather simple to copy the object from S3 to the local filesystem of your worker first:

from boto.s3.connection import S3Connection
from sklearn.externals import joblib
import os

s3_connection = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
s3_bucket = s3_connection.get_bucket(keys.AWS_BUCKET_NAME)
local_file = '/tmp/classifier.pkl'
# the key is the object's path inside the bucket, not the full https:// URL
s3_bucket.get_key('static/res/classifier.pkl').get_contents_to_filename(local_file)
clf = joblib.load(local_file)
os.remove(local_file)
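If you have since moved to the newer boto3 SDK (not what this answer uses), the same idea looks roughly like this; the key path and the /tmp location are assumptions based on the question:

import os
import boto3
from sklearn.externals import joblib

# boto3 picks up credentials from the environment or an IAM role
s3 = boto3.client('s3')
local_file = '/tmp/classifier.pkl'
# `keys` is the asker's config module holding the bucket name
s3.download_file(keys.AWS_BUCKET_NAME, 'static/res/classifier.pkl', local_file)
clf = joblib.load(local_file)
os.remove(local_file)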

Hope this helped.

P.S. You can use this approach to pickle the entire sklearn pipeline, which also includes feature imputation. Just beware of library version conflicts between training and prediction.
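A hypothetical illustration of that P.S., using a toy dataset and the Imputer class from scikit-learn versions of that era (later replaced by SimpleImputer):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.linear_model import LogisticRegression
from sklearn.externals import joblib

# the imputation step is fitted and pickled together with the classifier
pipeline = Pipeline([
    ('impute', Imputer(strategy='mean')),
    ('clf', LogisticRegression()),
])
pipeline.fit(np.array([[0.0], [np.nan], [1.0]]), [0, 1, 1])
joblib.dump(pipeline, '/tmp/pipeline.pkl')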