I'm working on a reporting system that allows the user to arbitrarily query a set of fact tables, constraining on multiple dimension tables for each fact table. I've written a query-builder class that automatically assembles all the correct joins and subqueries based on the constraint parameters, and everything works as designed.
But I suspect I'm not generating the most efficient queries. On a set of tables with a few million records, these queries take about 10 seconds to run, and I'd like to get them under one second. I have a feeling that, if I could get rid of the subqueries, the result would be much more efficient.
Rather than show you my actual schema (which is much more complicated), I'll show you an analogous example that illustrates the point without having to explain my whole application and data model.
Imagine that I have a database of concert information, with artists and venues. Users can arbitrarily tag the artists and the venues. So the schema looks like this:
concert
    id
    artist_id
    venue_id
    date
artist
    id
    name
venue
    id
    name
tag
    id
    name
artist_tag
    artist_id
    tag_id
venue_tag
    venue_id
    tag_id
Pretty simple.
Now let's say I want to query the database for all concerts happening within one month of today, for all artists with the 'techno' and 'trombone' tags, performing at venues with the 'cheap-beer' and 'great-mosh-pits' tags.
The best query I've been able to come up with looks like this:
SELECT
concert.id AS concert_id,
concert.date AS concert_date,
artist.id AS artist_id,
artist.name AS artist_name,
venue.id AS venue_id,
venue.name AS venue_name
FROM
concert
INNER JOIN artist ON artist.id = concert.artist_id
INNER JOIN venue ON venue.id = concert.venue_id
WHERE (
artist.id IN (
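-- each required tag needs its own artist_tag row, hence the self-join
SELECT at_a.artist_id
FROM artist_tag AS at_a
INNER JOIN tag AS a ON (
a.id = at_a.tag_id
AND
a.name = 'techno'
) INNER JOIN artist_tag AS at_b ON (
at_b.artist_id = at_a.artist_id
) INNER JOIN tag AS b ON (
b.id = at_b.tag_id
AND
b.name = 'trombone'
)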
)
AND
venue.id IN (
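-- each required tag needs its own venue_tag row, hence the self-join
SELECT vt_a.venue_id
FROM venue_tag AS vt_a
INNER JOIN tag AS a ON (
a.id = vt_a.tag_id
AND
a.name = 'cheap-beer'
) INNER JOIN venue_tag AS vt_b ON (
vt_b.venue_id = vt_a.venue_id
) INNER JOIN tag AS b ON (
b.id = vt_b.tag_id
AND
b.name = 'great-mosh-pits'
)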
)
AND
concert.date BETWEEN NOW() AND (NOW() + INTERVAL 1 MONTH)
)
The query works, but I really don't like having those multiple subqueries. If I could accomplish the same logic purely using JOIN logic, I have a feeling the performance would drastically improve.
In a perfect world, I'd be using a real OLAP server. But my customers will be deploying to MySQL or MSSQL or Postgres, and I can't guarantee that a compatible OLAP engine will be available. So I'm stuck using an ordinary RDBMS with a star schema.
Don't get too hung up on the details of this example (my real application has nothing to do with music, but it has multiple fact tables with an analogous relationship to the ones I've shown here). In this model, the 'artist_tag' and 'venue_tag' tables function as fact tables, and everything else is a dimension.
It's important to note, in this example, that the queries are much simpler to write if I only allow the user to constrain against a single artist_tag or venue_tag value. It only gets really tricky when I allow the queries to include AND logic, requiring multiple distinct tags.
So, my question is: what are the best techniques that you know of for writing efficient queries against multiple fact tables?
My approach is a bit more generic: put the filter parameters in tables, then use GROUP BY, HAVING and COUNT to filter the results. I've used this basic approach several times for some very sophisticated 'searching' and it works very well (for me, at least).
I also don't join on the artist and venue dimension tables initially. I'd get the results as IDs (needing only artist_tag and venue_tag), then join those results to the artist and venue tables to get the dimension values. (Basically, search for the entity IDs in a subquery, then fetch the dimension values you need in an outer query. Keeping the two steps separate should improve things...)
DECLARE @artist_filter TABLE (
tag_id INT
)
DECLARE @venue_filter TABLE (
tag_id INT
)
INSERT INTO @artist_filter
SELECT id FROM tag
WHERE name IN ('techno','trombone')
INSERT INTO @venue_filter
SELECT id FROM tag
WHERE name IN ('cheap-beer','great-mosh-pits')
SELECT
concert.id AS concert_id,
concert.date AS concert_date,
artist_tag.artist_id AS artist_id,
venue_tag.venue_id AS venue_id
FROM
concert
INNER JOIN
artist_tag
ON artist_tag.artist_id = concert.artist_id
INNER JOIN
@artist_filter AS [artist_filter]
ON [artist_filter].tag_id = artist_tag.tag_id
INNER JOIN
venue_tag
ON venue_tag.venue_id = concert.venue_id
INNER JOIN
@venue_filter AS [venue_filter]
ON [venue_filter].tag_id = venue_tag.tag_id
WHERE
concert.date BETWEEN GETDATE() AND DATEADD(MONTH, 1, GETDATE())
GROUP BY
concert.id,
concert.date,
artist_tag.artist_id,
venue_tag.venue_id
HAVING
COUNT(DISTINCT [artist_filter].tag_id) = (SELECT COUNT(*) FROM @artist_filter)
AND
COUNT(DISTINCT [venue_filter].tag_id) = (SELECT COUNT(*) FROM @venue_filter)
(I'm on a netbook and suffering for it, so I'll leave out the outer query that gets the artist and venue names from the artist and venue tables.)
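To make the GROUP BY / HAVING COUNT idea concrete, here is a minimal sketch that runs as-is. It is not the T-SQL above: it assumes SQLite (through Python's sqlite3 module) as a stand-in engine and uses temp tables in place of table variables. The sample rows, the fixed dates, and the names make_db and find_concerts are all invented for illustration.

```python
import sqlite3

def make_db():
    """Build a tiny in-memory copy of the concert schema (sample rows are invented)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE concert (id INTEGER PRIMARY KEY, artist_id INT, venue_id INT, date TEXT);
        CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE artist_tag (artist_id INT, tag_id INT);
        CREATE TABLE venue_tag (venue_id INT, tag_id INT);
        INSERT INTO tag VALUES (1,'techno'),(2,'trombone'),(3,'cheap-beer'),(4,'great-mosh-pits');
        INSERT INTO artist_tag VALUES (1,1),(1,2),(2,1);   -- artist 2 lacks 'trombone'
        INSERT INTO venue_tag VALUES (1,3),(1,4),(2,3);    -- venue 2 lacks 'great-mosh-pits'
        INSERT INTO concert VALUES
            (1,1,1,'2024-01-15'),(2,1,2,'2024-01-16'),
            (3,2,1,'2024-01-17'),(4,1,1,'2024-06-01');     -- concert 4 is out of range
    """)
    return conn

def find_concerts(conn, artist_tags, venue_tags, start, end):
    """Return concerts whose artist AND venue carry every requested tag."""
    # Temp tables play the role of @artist_filter / @venue_filter.
    conn.executescript("""
        DROP TABLE IF EXISTS artist_filter;
        DROP TABLE IF EXISTS venue_filter;
        CREATE TEMP TABLE artist_filter (tag_id INT);
        CREATE TEMP TABLE venue_filter (tag_id INT);
    """)
    fill = "INSERT INTO {} SELECT id FROM tag WHERE name = ?"
    # set() avoids duplicate filter rows, which would break the COUNT comparison.
    conn.executemany(fill.format("artist_filter"), [(t,) for t in set(artist_tags)])
    conn.executemany(fill.format("venue_filter"), [(t,) for t in set(venue_tags)])
    return conn.execute("""
        SELECT c.id, c.date, c.artist_id, c.venue_id
        FROM concert AS c
        JOIN artist_tag    AS atag ON atag.artist_id = c.artist_id
        JOIN artist_filter AS af   ON af.tag_id = atag.tag_id
        JOIN venue_tag     AS vtag ON vtag.venue_id = c.venue_id
        JOIN venue_filter  AS vf   ON vf.tag_id = vtag.tag_id
        WHERE c.date BETWEEN ? AND ?
        GROUP BY c.id, c.date, c.artist_id, c.venue_id
        -- a group survives only if it matched every row in each filter table
        HAVING COUNT(DISTINCT af.tag_id) = (SELECT COUNT(*) FROM artist_filter)
           AND COUNT(DISTINCT vf.tag_id) = (SELECT COUNT(*) FROM venue_filter)
        ORDER BY c.id
    """, (start, end)).fetchall()

if __name__ == "__main__":
    db = make_db()
    print(find_concerts(db, ["techno", "trombone"],
                        ["cheap-beer", "great-mosh-pits"],
                        "2024-01-01", "2024-02-01"))
```

With this sample data only concert 1 survives: concert 2 is at a venue missing one tag, concert 3 has an artist missing one tag, and concert 4 falls outside the date range.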
EDIT
Note:
Another option would be to filter the artist_tag and venue_tag tables in sub-queries/derived tables. Whether this is worth it depends on how influential the join on the concert table is. My assumption here is that there are MANY artists and venues, but once filtered on the concert table (itself filtered by the dates) the number of artists/venues decreases dramatically.
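As a rough illustration of that derived-table option, the sketch below first reduces artist_tag and venue_tag to the IDs that carry every requested tag, and only then joins them to concert. It assumes SQLite via Python's sqlite3 as a stand-in engine and assumes both tag lists are non-empty. The sample data and the names make_db, matching_ids and find_concerts_prefiltered are hypothetical.

```python
import sqlite3

def make_db():
    """Build a tiny in-memory copy of the concert schema (sample rows are invented)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE concert (id INTEGER PRIMARY KEY, artist_id INT, venue_id INT, date TEXT);
        CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE artist_tag (artist_id INT, tag_id INT);
        CREATE TABLE venue_tag (venue_id INT, tag_id INT);
        INSERT INTO tag VALUES (1,'techno'),(2,'trombone'),(3,'cheap-beer'),(4,'great-mosh-pits');
        INSERT INTO artist_tag VALUES (1,1),(1,2),(2,1);   -- artist 2 lacks 'trombone'
        INSERT INTO venue_tag VALUES (1,3),(1,4),(2,3);    -- venue 2 lacks 'great-mosh-pits'
        INSERT INTO concert VALUES
            (1,1,1,'2024-01-15'),(2,1,2,'2024-01-16'),
            (3,2,1,'2024-01-17'),(4,1,1,'2024-06-01');     -- concert 4 is out of range
    """)
    return conn

def matching_ids(link_table, id_col, tags):
    """SQL for a derived table of IDs that carry all of the given tags."""
    marks = ",".join("?" * len(tags))
    return f"""(SELECT l.{id_col}
                FROM {link_table} AS l
                JOIN tag AS t ON t.id = l.tag_id
                WHERE t.name IN ({marks})
                GROUP BY l.{id_col}
                HAVING COUNT(DISTINCT t.id) = {len(tags)})"""

def find_concerts_prefiltered(conn, artist_tags, venue_tags, start, end):
    a_tags, v_tags = sorted(set(artist_tags)), sorted(set(venue_tags))
    sql = f"""
        SELECT c.id, c.date, c.artist_id, c.venue_id
        FROM concert AS c
        JOIN {matching_ids('artist_tag', 'artist_id', a_tags)} AS a
          ON a.artist_id = c.artist_id
        JOIN {matching_ids('venue_tag', 'venue_id', v_tags)} AS v
          ON v.venue_id = c.venue_id
        WHERE c.date BETWEEN ? AND ?
        ORDER BY c.id
    """
    # Parameters follow the order of the placeholders in the SQL text.
    return conn.execute(sql, (*a_tags, *v_tags, start, end)).fetchall()

if __name__ == "__main__":
    db = make_db()
    print(find_concerts_prefiltered(db, ["techno", "trombone"],
                                    ["cheap-beer", "great-mosh-pits"],
                                    "2024-01-01", "2024-02-01"))
```

Here the GROUP BY runs over the tag link tables rather than over concert rows, so which shape wins depends on the relative sizes, as described above. Only measuring on real data will tell.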
Also, there is often a need/desire to deal with the case where NO artist_tags and/or venue_tags are specified. From experience it is better to deal with this programmatically. That is, use IF statements and queries specially suited to those cases. A single SQL query CAN be written to handle it, but it is much slower than the programmatic alternative. Equally, writing similar queries several times may look messy and degrade maintainability, but the increase in complexity needed to get this into a single query is often harder to maintain.
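One way to handle the "no tags specified" case programmatically is to let the calling code add a filter clause only when that filter is actually in use. The sketch below is an assumption about how that might look, not a prescribed design: it uses SQLite via Python's sqlite3 as a stand-in engine, and build_query, make_db and the sample rows are hypothetical.

```python
import sqlite3

def make_db():
    """Build a tiny in-memory copy of the concert schema (sample rows are invented)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE concert (id INTEGER PRIMARY KEY, artist_id INT, venue_id INT, date TEXT);
        CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE artist_tag (artist_id INT, tag_id INT);
        CREATE TABLE venue_tag (venue_id INT, tag_id INT);
        INSERT INTO tag VALUES (1,'techno'),(2,'trombone'),(3,'cheap-beer'),(4,'great-mosh-pits');
        INSERT INTO artist_tag VALUES (1,1),(1,2),(2,1);   -- artist 2 lacks 'trombone'
        INSERT INTO venue_tag VALUES (1,3),(1,4),(2,3);    -- venue 2 lacks 'great-mosh-pits'
        INSERT INTO concert VALUES
            (1,1,1,'2024-01-15'),(2,1,2,'2024-01-16'),
            (3,2,1,'2024-01-17'),(4,1,1,'2024-06-01');     -- concert 4 is out of range
    """)
    return conn

def build_query(artist_tags, venue_tags, start, end):
    """Return (sql, params), joining a tag filter only when tags were supplied."""
    joins, params = [], []
    for link_table, id_col, tags in (("artist_tag", "artist_id", artist_tags),
                                     ("venue_tag", "venue_id", venue_tags)):
        tags = sorted(set(tags))
        if not tags:
            continue  # no constraint requested: leave this join out entirely
        marks = ",".join("?" * len(tags))
        joins.append(f"""
            JOIN (SELECT l.{id_col}
                  FROM {link_table} AS l
                  JOIN tag AS t ON t.id = l.tag_id
                  WHERE t.name IN ({marks})
                  GROUP BY l.{id_col}
                  HAVING COUNT(DISTINCT t.id) = ?) AS f_{id_col}
              ON f_{id_col}.{id_col} = c.{id_col}""")
        params += [*tags, len(tags)]
    sql = ("SELECT c.id FROM concert AS c" + "".join(joins) +
           "\nWHERE c.date BETWEEN ? AND ? ORDER BY c.id")
    return sql, params + [start, end]

if __name__ == "__main__":
    db = make_db()
    for a_tags, v_tags in ((["techno", "trombone"], ["cheap-beer"]),
                           (["techno", "trombone"], []),
                           ([], [])):
        sql, params = build_query(a_tags, v_tags, "2024-01-01", "2024-02-01")
        print(a_tags, v_tags, db.execute(sql, params).fetchall())
```

When neither list is given, the generated statement is just the date-range scan of concert, with no tag joins for the optimizer to reason about.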
EDIT
Another similar layout could be...
- Filter concert by artist as sub_query/derived_table
- Filter results by venue as sub_query/derived_table
- Join results on dimension tables to get names, etc
(Cascaded filtering)
SELECT
<blah>
FROM
(
SELECT
<blah>
FROM
(
SELECT
<blah>
FROM
concert
INNER JOIN
artist_tag
INNER JOIN
artist_filter
WHERE
GROUP BY
HAVING
)
INNER JOIN
venue_tag
INNER JOIN
venue_filter
GROUP BY
HAVING
)
INNER JOIN
artist
INNER JOIN
venue
By cascading the filtering, each subsequent filter has a reduced set to work on. This MAY reduce the work done by the GROUP BY - HAVING section of the query. For two levels of filtering I would guess this is unlikely to be dramatic.
The original may still be more performant as it benefits from additional filtering in a different manner. In your example:
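For reference, here is one possible way to fill in the cascaded outline so that it actually runs. Treat it as a sketch under assumptions rather than the definitive form: it uses SQLite via Python's sqlite3 as a stand-in engine, ordinary tables named artist_filter and venue_filter in place of the table variables, and invented sample rows, artist/venue names, and function names (make_db, cascaded_search).

```python
import sqlite3

def make_db():
    """Tiny in-memory schema with filter tables preloaded (all rows are invented)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE concert (id INTEGER PRIMARY KEY, artist_id INT, venue_id INT, date TEXT);
        CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE venue (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE artist_tag (artist_id INT, tag_id INT);
        CREATE TABLE venue_tag (venue_id INT, tag_id INT);
        CREATE TABLE artist_filter (tag_id INT);
        CREATE TABLE venue_filter (tag_id INT);
        INSERT INTO artist VALUES (1,'Artist A'),(2,'Artist B');
        INSERT INTO venue VALUES (1,'Venue X'),(2,'Venue Y');
        INSERT INTO tag VALUES (1,'techno'),(2,'trombone'),(3,'cheap-beer'),(4,'great-mosh-pits');
        INSERT INTO artist_tag VALUES (1,1),(1,2),(2,1);   -- artist 2 lacks 'trombone'
        INSERT INTO venue_tag VALUES (1,3),(1,4),(2,3);    -- venue 2 lacks 'great-mosh-pits'
        INSERT INTO concert VALUES
            (1,1,1,'2024-01-15'),(2,1,2,'2024-01-16'),
            (3,2,1,'2024-01-17'),(4,1,1,'2024-06-01');     -- concert 4 is out of range
        INSERT INTO artist_filter SELECT id FROM tag WHERE name IN ('techno','trombone');
        INSERT INTO venue_filter SELECT id FROM tag WHERE name IN ('cheap-beer','great-mosh-pits');
    """)
    return conn

def cascaded_search(conn, start, end):
    return conn.execute("""
        SELECT by_venue.concert_id, by_venue.concert_date, artist.name, venue.name
        FROM (
            -- level 2: keep concerts whose venue matches every venue_filter tag
            SELECT by_artist.concert_id, by_artist.concert_date,
                   by_artist.artist_id, by_artist.venue_id
            FROM (
                -- level 1: date range plus "artist has every artist_filter tag"
                SELECT c.id AS concert_id, c.date AS concert_date,
                       c.artist_id, c.venue_id
                FROM concert AS c
                JOIN artist_tag    AS atag ON atag.artist_id = c.artist_id
                JOIN artist_filter AS af   ON af.tag_id = atag.tag_id
                WHERE c.date BETWEEN ? AND ?
                GROUP BY c.id, c.date, c.artist_id, c.venue_id
                HAVING COUNT(DISTINCT af.tag_id) = (SELECT COUNT(*) FROM artist_filter)
            ) AS by_artist
            JOIN venue_tag    AS vtag ON vtag.venue_id = by_artist.venue_id
            JOIN venue_filter AS vf   ON vf.tag_id = vtag.tag_id
            GROUP BY by_artist.concert_id, by_artist.concert_date,
                     by_artist.artist_id, by_artist.venue_id
            HAVING COUNT(DISTINCT vf.tag_id) = (SELECT COUNT(*) FROM venue_filter)
        ) AS by_venue
        -- level 3: only now touch the dimension tables to fetch names
        JOIN artist ON artist.id = by_venue.artist_id
        JOIN venue  ON venue.id  = by_venue.venue_id
        ORDER BY by_venue.concert_id
    """, (start, end)).fetchall()

if __name__ == "__main__":
    print(cascaded_search(make_db(), "2024-01-01", "2024-02-01"))
```

In the sample data the inner level passes concerts 1 and 2 on to the venue level, which then drops concert 2, so the second GROUP BY sees fewer rows than it would in the single-level query.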
- There may be many artists in your date range, but few which meet at least one criterion
- There may be many venues in your date range, but few which meet at least one criterion
- Before the GROUP BY, however, all concerts are eliminated where...
---> the artist(s) meets NONE of the criteria
---> AND/OR the venue meets NONE of the criteria
When you are searching by many criteria, this filtering degrades. It also degrades when venues and/or artists share a lot of tags.
So when would I use the original, or when would I use the Cascaded version?
- Original : Few search criteria and venues/artists are dissimilar from each other
- Cascaded : Lots of search criteria or venues/artists tend to be similar