How do I avoid redundant data fields in the result set when using JOINs?

Ashkan Kh. Nazary picture Ashkan Kh. Nazary · Jul 5, 2010 · Viewed 12.8k times · Source

Following join is supposed to retrieve user info along with their messages for users with a certain status:

SELECT * FROM user, message WHERE message.user_id=user.id AND user.status=1

The problem is that all rows about a certain user in the result set contain redundant columns that repeat the same data about that user (those fields retrieved from user table), only fields from the message table contain non-redundant information. Something like this:

user.id  username email            message.id  subject
1        jane     [email protected]   120         Notification 
1        jane     [email protected]   122         Re:Hello 
1        jane     [email protected]   125         Quotation
2        john     [email protected]   127         Hi jane
2        john     [email protected]   128         Fix thiss 
2        john     [email protected]   129         Ok
3        jim      [email protected]      140         Re:Re:Quotation

As you can see many data are redundant and we do not want to first find the users and then go about their messages in loop like structures or something like that. Loops that cause micro-queries should be avoided at all costs.

I am not concerned about the output of my program, that is well handled in the UI. I think perhaps the network traffic produced by returning the result of this query could be substantially reduced if somehow I can manage to eliminate the repetition of user data in all rows pertaining to that user.

Answer

Borealid picture Borealid · Jul 5, 2010

There are several things you should know.

The first is that the default SQL JOIN construct is essentially a set cross product, restricted by the WHERE clause. This means it's multiplicative - you get duplicate results out which you then prune down. You also have to be careful in the presence of NULL fields.

The second is that there is a 'DISTINCT' keyword. When you prefix a column in the selection with this, you'll get at most one instance of a certain value for that column in the results. So, as per your query, 'SELECT DISTINCT user.id FROM' will eliminate the redundancies on the server side.

The third is that the correct way to solve this problem is likely not using the * operator. I suggest:

SELECT user.id,username,email,subject FROM message m,user WHERE m.user_id=user.id AND user.status=1

This uses the simple, easy-to-understand implicit-join syntax and should be valid SQL on whatever server. I can vouch for it working with MySQL, at least. It also aliases the 'message' table to 'm' as shorthand.

As you surmise, this will reduce the traffic from the SQL server to your database.

edit: if you want to eliminate the "redundant" email information, you can't - you must make two distinct queries. SQL results are tables and must be rectangular, with all known values filled. There's no 'ditto' entry.

edit 2: You only have to make two queries. For instance:

SELECT subject FROM message WHERE message.id IN (SELECT user.id FROM user WHERE status=1)

This is one query that contains a nested query, so it's really making two database hits. But it doesn't have any programmatic loops.