For quick testing, debugging, creating portable examples, and benchmarking, R has a large number of data sets available (in the base R datasets package). The command library(help="datasets") at the R prompt describes nearly 100 historical datasets, each of which has an associated description and metadata.
Is there anything like this for Python?
You can use the rpy2 package to access all R datasets from Python.
Set up the interface:
>>> from rpy2.robjects import r, pandas2ri
>>> def data(name):
...     return pandas2ri.ri2py(r[name])
Then call data() with the name of any available dataset (just like in R):
>>> df = data('iris')
>>> df.describe()
       Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
To see a list of the available datasets with a description of each:
>>> print(r.data())
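Note: rpy2 requires an R installation with the R_HOME variable set, and pandas must be installed as well. The ri2py name used above comes from rpy2 2.x; rpy2 3.x renamed the conversion functions (ri2py became rpy2py), so adjust the call to match your version.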
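Since the rpy2 route depends on R being installed and R_HOME being set, a quick pre-flight check can save some confusion. The following is only a rough sketch using the standard library; the helper name r_environment_status is made up for illustration and is not part of rpy2, and passing this check does not guarantee that rpy2 itself will import cleanly.

```python
import os
import shutil


def r_environment_status():
    """Report what this process can see of an R installation (hypothetical helper)."""
    r_home = os.environ.get("R_HOME")
    return {
        "R_HOME": r_home,  # None when the variable is unset
        "R_HOME_exists": bool(r_home) and os.path.isdir(r_home),
        "R_on_PATH": shutil.which("R"),  # path to the R executable, or None
    }


if __name__ == "__main__":
    for key, value in r_environment_status().items():
        print(f"{key}: {value}")
```

If R_HOME comes back as None or points to a directory that does not exist, that is the first thing to fix before trying the rpy2 snippet.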
I just created PyDataset, a simple module that makes loading a dataset from Python as easy as it is in R (and it does not require an R installation, only pandas).
To start using it, install the module:
$ pip install pydataset
Then just load any dataset you wish (currently around 757 datasets are available):
from pydataset import data
titanic = data('titanic')
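If you are not sure which of the two approaches will be usable on a given machine, one option is to try them in order. The sketch below is only an illustration: load_dataset and the two private loader functions are hypothetical names, the rpy2 loader simply repeats the ri2py call from the rpy2 answer (which may need adjusting for your rpy2 version), and it assumes a loader signals failure either by raising or by returning None.

```python
def _from_pydataset(name):
    # Imported lazily so this file still loads without pydataset installed.
    from pydataset import data
    return data(name)


def _from_rpy2(name):
    # Same call as in the rpy2 answer; needs R, rpy2 and pandas.
    from rpy2.robjects import r, pandas2ri
    return pandas2ri.ri2py(r[name])


def load_dataset(name, loaders=(_from_pydataset, _from_rpy2)):
    """Return the dataset from the first loader that works (hypothetical helper)."""
    errors = []
    for loader in loaders:
        try:
            result = loader(name)
        except Exception as exc:  # e.g. ImportError when a package is missing
            errors.append(f"{loader.__name__}: {exc!r}")
            continue
        if result is not None:  # assumption: None means "nothing loaded"
            return result
        errors.append(f"{loader.__name__}: returned None")
    raise LookupError(f"could not load {name!r}: " + "; ".join(errors))
```

With either package installed, load_dataset('iris') should hand back the same kind of DataFrame as the snippets above; if both loaders fail, the LookupError message lists what went wrong with each.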