Datasets for Running Statistical Analysis on

Tal Galili · Feb 12, 2010

What datasets exist out on the internet that I can run statistical analysis on?

Answer

Shane · Feb 12, 2010

The datasets package is included with base R. Run this command to see a full list:

library(help="datasets")
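As a minimal sketch of what you can do with those built-in data sets (using only base R; the choice of mtcars and the regression are just illustrations, not part of the original answer), you can also get the listing as a data frame and start analyzing right away:

```r
# List the data sets shipped in the 'datasets' package as a data frame.
# data(package = ...)$results is a character matrix with Item and Title columns.
ds <- as.data.frame(data(package = "datasets")$results[, c("Item", "Title")])
head(ds)

# Data sets from 'datasets' are available by name in a standard R session.
str(mtcars)

# Illustrative analysis: fuel economy as a linear function of car weight.
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)
```

Running ?mtcars (or ? followed by any other data set name) opens its documentation, which usually cites the original source.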

Beyond that, there are many packages that can pull data, and many others that contain important data. Of these, you may want to start by looking at the HistData package, which "provides a collection of small data sets that are interesting and important in the history of statistics and data visualization".

For financial data, the quantmod package provides a common interface for pulling time series data from Google, Yahoo, FRED, and others:

library(quantmod)
getSymbols("YHOO", src = "google")   # from Google Finance (no longer supported in recent quantmod releases)
getSymbols("GOOG", src = "yahoo")    # from Yahoo Finance
getSymbols("DEXJPUS", src = "FRED")  # JPY/USD exchange rate from FRED

FRED (Federal Reserve Economic Data, maintained by the Federal Reserve Bank of St. Louis) is a real goldmine of free economic data.

Many R packages come bundled with data that is specific to their goal. So if you're interested in genetics, multilevel models, etc., the relevant packages will frequently have the canonical example for that analysis. Also, the book packages typically ship with the data needed to reproduce all the examples.

Here are some examples of relevant packages:

  • alr3: includes data to accompany Applied Linear Regression (http://www.stat.umn.edu/alr)
  • arm: includes some of the data from Gelman's "Data Analysis Using Regression and Multilevel/Hierarchical Models" (the rest of the data and code is on the book's website)
  • BaM: includes data from "Bayesian Methods: A Social and Behavioral Sciences Approach"
  • BayesDA: includes data from Gelman's "Bayesian Data Analysis"
  • cat: includes data for analysis of categorical-variable datasets
  • cimis: for retrieving data from CIMIS, the California Irrigation Management Information System
  • cshapes: includes GIS boundary data and related data
  • Ecdat: data sets for econometrics
  • ElemStatLearn: includes data from "The Elements of Statistical Learning: Data Mining, Inference, and Prediction"
  • emdbook: data from "Ecological Models and Data"
  • Fahrmeir: data from the book "Multivariate Statistical Modelling Based on Generalized Linear Models"
  • fEcoFin: "Economic and Financial Data Sets" for Rmetrics
  • fds: functional data sets
  • fma: data sets from "Forecasting: methods and applications"
  • gamair: data for "Generalized Additive Models: An Introduction with R"
  • geomapdata: data for topographic and geologic mapping
  • nutshell: contains all the data from the "R in a Nutshell" book
  • nytR: provides access to congressional vote data through the NY Times API
  • openintro: data from the OpenIntro Statistics textbook
  • primer: includes data for "A Primer of Ecology with R"
  • qtlbook: includes data for the R/qtl book
  • RGraphics: includes data from the "R Graphics" book
  • Read.isi: access to old World Fertility Survey data
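
To see which data sets any of the packages above actually ships, something like the following sketch may help. The helper name list_package_data is hypothetical (it is not part of any package), and it assumes the package is already installed; some of the packages listed may no longer be available on CRAN.

```r
# Hypothetical helper: return the data sets bundled with an installed package.
list_package_data <- function(pkg) {
  # system.file() returns "" when the package is not installed
  if (!nzchar(system.file(package = pkg))) {
    message("Package '", pkg, "' is not installed; try install.packages(\"", pkg, "\")")
    return(invisible(NULL))
  }
  # data(package = ...) lists data sets without attaching the package
  res <- data(package = pkg)$results
  data.frame(item = res[, "Item"], title = res[, "Title"],
             stringsAsFactors = FALSE)
}

head(list_package_data("datasets"))  # ships with base R
list_package_data("HistData")        # prints a message if HistData is not installed
```

Once you have found an item of interest, data(item_name, package = "pkg") loads it into your workspace without attaching the whole package.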