Which provides better performance: one big join or multiple queries?

leora · Dec 19, 2009

I have a table called orders; one column on orders is customer_id.
I have a table called customers with 10 fields.

If I want to build up an array of order objects, where each order object has a customer object embedded in it, I have two choices.

Option 1:

a. First query the orders table. b. Loop through the records and query the customers table to get the record for each customer.

This would be something like:

    Select * from Applications

    Select * from Customers where id = 1
    Select * from Customers where id = 2
    Select * from Customers where id = 3
    Select * from Customers where id = ...

Option 2:

a. Do one query that joins the two tables and returns all the fields.

#2 is the obvious choice, because you are only doing one query versus 1 + [number of orders] queries (which could be hundreds or more).

This would be something like:

    Select * from Applications a
    Inner join Customers c on c.id = a.customerID

My main question is: what if I had 10 other tables hanging off the orders table (similar to customers), each with its id stored in the orders table? Should you do a single query that joins all 10 tables, or does that become inefficient at some point? (A rough sketch of such a join is below.)
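For illustration, a version of that query joining just three of those tables might look roughly like this (the Addresses and Payments tables and their columns are made up):

    Select *
    from Applications a
    Inner join Customers c on c.id = a.customerID
    Inner join Addresses d on d.id = a.addressID   -- hypothetical lookup table
    Inner join Payments p on p.id = a.paymentID    -- hypothetical lookup table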

Any suggestions would help. Is there any optimization to ensure fast performance?

Answer

Richard Beier · Dec 19, 2009

I agree with everyone who's said a single join will probably be more efficient, even with a lot of tables. It's also less development effort than doing the work in your application code. This assumes the tables are appropriately indexed, with an index on each foreign key column, and (of course) an index on each primary key column.
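As a rough sketch, assuming the table and column names from the question (the index name here is just illustrative), the foreign-key index would be something like:

    -- primary keys (Customers.id, Applications.id) typically get an index automatically
    CREATE INDEX IX_Applications_CustomerID ON Applications (customerID);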

Your best bet is to try the easiest approach (the big join) first, and see how well it performs. If it performs well, then great - you're done. If it performs poorly, profile the query and look for missing indexes on your tables.
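How you profile depends on the DBMS; in MySQL or PostgreSQL, for example, you can prefix the query with EXPLAIN to see whether the indexes are actually being used (SQL Server has an equivalent "show execution plan" option):

    EXPLAIN
    Select * from Applications a
    Inner join Customers c on c.id = a.customerID;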

Your option #1 is not likely to perform well, due to the number of network round-trips (as anijhaw mentioned). This is sometimes called the "select N+1" problem - you do one SELECT to get the list of N applications, and then do N SELECTs in a loop to get the customers. This record-at-a-time looping is natural to application programmers; but SQL works much better when you operate on whole sets of data at once.
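If for some reason you can't use the join, a common middle ground (not required here, but worth knowing) is to collect the customer ids from the first query and fetch all the matching customers in one additional query rather than N of them:

    -- two queries total instead of 1 + N; the id list comes from the first result set
    Select * from Customers where id in (1, 2, 3 /* , ... */);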

If option #2 is slow even with good indexing, you may want to look into caching. You can cache in the database (using a summary table or materialized/indexed view), in the application (if there is enough RAM), or in a dedicated caching server such as memcached. Of course, this depends on how up-to-date your query results need to be. If everything has to be fully up-to-date, then any cache would have to be updated whenever the underlying tables are updated - it gets complicated and becomes less useful.
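As a sketch of the database-side option, a materialized view built on the join might look roughly like this (exact syntax and refresh options vary by DBMS, and the extra customer columns are hypothetical):

    -- Oracle-style materialized view; SQL Server would use an indexed view instead
    CREATE MATERIALIZED VIEW application_customers AS
    Select a.*, c.name, c.email   -- c.name / c.email are made-up customer columns
    from Applications a
    Inner join Customers c on c.id = a.customerID;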

This sounds like a reporting query though, and reporting often doesn't need to be real-time. So caching might be able to help you.

Depending on your DBMS, another thing to think about is the impact of this query on other queries hitting the same database. If your DBMS allows readers to block writers, then this query could prevent updates to the tables if it takes a long time to run. That would be bad. Oracle doesn't have this problem, and neither does SQL Server when run in "read committed snapshot" mode. I don't know about MySQL though.
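For reference, on SQL Server that mode is turned on per database, roughly like this (the database name is a placeholder, and the statement needs the database to be otherwise idle):

    -- SQL Server: let readers work from row versions so they don't block writers
    ALTER DATABASE YourDatabase SET READ_COMMITTED_SNAPSHOT ON;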