Suppose I have a table of customers and a table of purchases. Each purchase belongs to one customer. I want to get a list of all customers along with their last purchase in one SELECT statement. What is the best practice? Any advice on building indexes?
Please use these table/column names in your answer:
And in more complicated situations, would it be (performance-wise) beneficial to denormalize the database by putting the last purchase into the customer table?
If the (purchase) id is guaranteed to be sorted by date, can the statements be simplified by using something like LIMIT 1
?
This is an example of the greatest-n-per-group
problem that has appeared regularly on StackOverflow.
Here's how I usually recommend solving it:
SELECT c.*, p1.*
FROM customer c
JOIN purchase p1 ON (c.id = p1.customer_id)
LEFT OUTER JOIN purchase p2 ON (c.id = p2.customer_id AND
(p1.date < p2.date OR (p1.date = p2.date AND p1.id < p2.id)))
WHERE p2.id IS NULL;
Explanation: given a row p1
, there should be no row p2
with the same customer and a later date (or in the case of ties, a later id
). When we find that to be true, then p1
is the most recent purchase for that customer.
Regarding indexes, I'd create a compound index in purchase
over the columns (customer_id
, date
, id
). That may allow the outer join to be done using a covering index. Be sure to test on your platform, because optimization is implementation-dependent. Use the features of your RDBMS to analyze the optimization plan. E.g. EXPLAIN
on MySQL.
Some people use subqueries instead of the solution I show above, but I find my solution makes it easier to resolve ties.