Large public datasets?

Jason · Dec 19, 2008 · Viewed 45.2k times

I am looking for some large public datasets, in particular:

  1. Large sample web server logs that have been anonymized.

  2. Datasets used for database performance benchmarking.

Any other links to large public datasets would be appreciated. I already know about Amazon's public datasets at: http://aws.amazon.com/publicdatasets/

Answer

MrGomez · Apr 23, 2012

1. Large sample web server logs that have been anonymized.

These work to start with:

There are many, many more data sets available than these (see the gamut of other answers), but this is the lowest hanging fruit that meets your original criteria. As a bonus, they have a contact link if you have specific needs they may know of.

2. Datasets used for database performance benchmarking.

This sounds like a misnomer, because you're asking for empirical data sets that describe well-defined algorithmic problems. Specifically, it sounds like you want data you can use to test and benchmark various database systems in real time: well-defined, normalized relational data that serves as a set of test cases for finding the most efficient solution for your needs.

I don't agree with this approach. Instead of finding a litany of database systems and their canned implementations, it's far better to explore the algorithmic guarantees of these systems as your first port of call. Once you've determined the algorithmic constraints that meet your needs, you can home in on a set of canned solutions and benchmark them on the efficiency of, for example, indexing, sorting, searching, insertion, deletion, and retrieval.
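As a rough illustration of checking a guarantee before timing anything, here is a minimal sketch that asks a database how it intends to execute a lookup, with and without an index. It uses Python's built-in `sqlite3` module purely because it needs no setup; the `hits` table, its columns, and the `idx_hits_path` index name are made up for the example, and other systems expose their plans through different commands.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable steps of SQLite's plan for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The fourth column of each row holds the textual description of the step.
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for whatever data you plan to store.
conn.execute(
    "CREATE TABLE hits (id INTEGER PRIMARY KEY, path TEXT, status INTEGER)"
)
conn.executemany(
    "INSERT INTO hits (path, status) VALUES (?, ?)",
    [(f"/page/{i}", 200 if i % 10 else 404) for i in range(1000)],
)

lookup = "SELECT * FROM hits WHERE path = ?"

# Without an index, the only option is to scan every row (linear in table size).
plan_without_index = query_plan(conn, lookup, ("/page/42",))

# With a B-tree index on the column, the lookup becomes a logarithmic search.
conn.execute("CREATE INDEX idx_hits_path ON hits (path)")
plan_with_index = query_plan(conn, lookup, ("/page/42",))

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

The exact wording of the plan text varies between SQLite versions, but the first plan should describe a scan and the second a search using the index. Knowing which of the two you get tells you how the operation will scale before you measure a single query.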

Wikipedia provides a terse article on database testing concepts that you can use to determine and write test cases for benchmarking performance. For example, you might use an agnostic data access interface like JDBC and JDBC Benchmark to determine the relative timings of each operation. From here, you can home in on a correct solution.
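To make the idea of timing each operation through a generic data access layer concrete, here is a small sketch. It is not JDBC: it swaps in Python's DB-API with the built-in `sqlite3` driver so that it runs without any setup. The `run_benchmark` function, the `bench` table, and the row count are all invented for the example, and the `?` placeholders and chained `execute(...).fetchone()` calls are specific to `sqlite3`, so another driver would need small adjustments.

```python
import sqlite3
import time


def run_benchmark(connect, n_rows=10_000):
    """Time a few basic operations on a fresh database.

    `connect` is any zero-argument function returning a connection, so the
    same harness can be pointed at a different database later.
    """
    conn = connect()
    cur = conn.cursor()
    cur.execute("CREATE TABLE bench (id INTEGER PRIMARY KEY, payload TEXT)")
    rows = [(i, f"row-{i}") for i in range(n_rows)]
    timings = {}

    def timed(label, action):
        # Wall-clock time for a single run of the given operation.
        start = time.perf_counter()
        action()
        timings[label] = time.perf_counter() - start

    timed("insert", lambda: cur.executemany(
        "INSERT INTO bench VALUES (?, ?)", rows))
    timed("lookup", lambda: cur.execute(
        "SELECT payload FROM bench WHERE id = ?", (n_rows // 2,)).fetchone())
    timed("sort", lambda: cur.execute(
        "SELECT id FROM bench ORDER BY payload").fetchall())
    timed("delete", lambda: cur.execute(
        "DELETE FROM bench WHERE id % 2 = 0"))
    conn.commit()

    # Sanity check: the rows with odd ids should be all that is left.
    remaining = cur.execute("SELECT COUNT(*) FROM bench").fetchone()[0]
    conn.close()
    return timings, remaining


timings, remaining = run_benchmark(lambda: sqlite3.connect(":memory:"))
for label, seconds in timings.items():
    print(f"{label:>6}: {seconds * 1000:.2f} ms")
print("rows remaining:", remaining)
```

A single run like this is noisy, so for real comparisons you would repeat each operation several times, use data shaped like your own workload, and compare the relative timings rather than the absolute numbers.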

In short, go to the research first to determine database guarantees. Once a set of candidate solutions has been identified, you can select among them by testing (or otherwise determining) the constant-factor performance of each desired operation, that is, how fast each one actually runs in practice.