sklearn and large datasets

Donbeo · May 26, 2014 · Viewed 21.6k times

I have a dataset of 22 GB. I would like to process it on my laptop. Of course, I can't load it all into memory.

I use sklearn a lot, but for much smaller datasets.

In this situation the classical approach should be something like:

Read part of the data -> partially train the estimator -> discard that data -> read the next part of the data -> continue training the estimator.

I have seen that some sklearn algorithms have a partial_fit method that should allow us to train the estimator on successive subsamples of the data.

Now I am wondering: is there an easy way to do that in sklearn? I am looking for something like:

r = read_part_of_data('data.csv')
m = sk.my_model
for i in range(n):
    x = r.read_next_chunk(20)  # read the next 20 lines
    m.partial_fit(x)

m.predict(new_x)

Maybe sklearn is not the right tool for this kind of thing? Let me know.

Answer

Alexis Perrier · Oct 31, 2015

I've used several scikit-learn classifiers with out-of-core capabilities to train linear models: Stochastic Gradient Descent, Perceptron and Passive-Aggressive, as well as Multinomial Naive Bayes, on a Kaggle dataset of over 30 GB. All of these classifiers share the partial_fit method you mention, though some behave better than others.
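
As a minimal sketch of the loop you describe, assuming the data is a CSV with a label column named 'target' and a binary classification problem (both assumptions, not details from your post), you can stream the file in chunks with pandas and update the model with partial_fit on each chunk:

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()          # any estimator exposing partial_fit works here
classes = np.array([0, 1])     # partial_fit must see the full set of classes on the first call

# read the 22 GB file in manageable pieces instead of loading it at once
for chunk in pd.read_csv('data.csv', chunksize=10000):
    X = chunk.drop('target', axis=1).values
    y = chunk['target'].values
    clf.partial_fit(X, y, classes=classes)

# once the streaming pass is done, predict on new data as usual:
# predictions = clf.predict(new_X)

The chunk size only controls memory use per iteration; the model itself stays small because only its weights are kept between chunks.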

You can find the methodology, the case study and some good resources in this post: http://www.opendatascience.com/blog/riding-on-large-data-with-scikit-learn/