How do I take an efficient simple random sample in SQL? The database in question is running MySQL; my table is at least 200,000 rows, and I want a simple random sample of about 10,000.
The "obvious" answer is to:
SELECT * FROM table ORDER BY RAND() LIMIT 10000
For large tables, that's too slow: it calls RAND()
for every row (which already puts it at O(n)), and sorts them, making it O(n lg n) at best. Is there a way to do this faster than O(n)?
Note: As Andrew Mao points out in the comments, If you're using this approach on SQL Server, you should use the T-SQL function NEWID()
, because RAND() may return the same value for all rows.
EDIT: 5 YEARS LATER
I ran into this problem again with a bigger table, and ended up using a version of @ignorant's solution, with two tweaks:
ORDER BY RAND()
RAND()
to an indexed column on every insert/update. (If your data set isn't very update-heavy, you may need to find another way to keep this column fresh.)To take a 1000-item sample of a table, I count the rows and sample the result down to, on average, 10,000 rows with the the frozen_rand column:
SELECT COUNT(*) FROM table; -- Use this to determine rand_low and rand_high
SELECT *
FROM table
WHERE frozen_rand BETWEEN %(rand_low)s AND %(rand_high)s
ORDER BY RAND() LIMIT 1000
(My actual implementation involves more work to make sure I don't undersample, and to manually wrap rand_high around, but the basic idea is "randomly cut your N down to a few thousand.")
While this makes some sacrifices, it allows me to sample the database down using an index scan, until it's small enough to ORDER BY RAND()
again.
I think the fastest solution is
select * from table where rand() <= .3
Here is why I think this should do the job.
This assumes that rand() is generating numbers in a uniform distribution. It is the quickest way to do this.
I saw that someone had recommended that solution and they got shot down without proof.. here is what I would say to that -
mysql is very capable of generating random numbers for each row. Try this -
select rand() from INFORMATION_SCHEMA.TABLES limit 10;
Since the database in question is mySQL, this is the right solution.