views:

53

answers:

4

Following join is supposed to retrieve user info along with their messages for users with a certain status:

SELECT * FROM user, message WHERE message.user_id=user.id AND user.status=1

The problem is that all rows about a certain user in the result set contain redundant columns that repeat the same data about that user (those fields retrieved from user table), only fields from the message table contain non-redundant information. Something like this:

user.id  username email            message.id  subject
1        jane     [email protected]   120         Notification 
1        jane     [email protected]   122         Re:Hello 
1        jane     [email protected]   125         Quotation
2        john     [email protected]   127         Hi jane
2        john     [email protected]   128         Fix thiss 
2        john     [email protected]   129         Ok
3        jim      [email protected]      140         Re:Re:Quotation

As you can see many data are redundant and we do not want to first find the users and then go about their messages in loop like structures or something like that. Loops that cause micro-queries should be avoided at all costs.

I am not concerned about the output of my program, that is well handled in the UI. I think perhaps the network traffic produced by returning the result of this query could be substantially reduced if somehow I can manage to eliminate the repetition of user data in all rows pertaining to that user.

+1  A: 

In the straight sql query there is not if you keep them as a single query. If you are programmatically printing this out, then you would order by user id and only reprint that information if the user id changes.

REW
+1  A: 

In the SQL standard, you would use NATURAL JOIN; this joins on common column names and only preserves one copy of those common names.

In practice, you carefully list the columns you want, rather than resorting to the '*' shorthand notation.

Jonathan Leffler
+1  A: 

There are several things you should know.

The first is that the default SQL JOIN construct is essentially a set cross product, restricted by the WHERE clause. This means it's multiplicative - you get duplicate results out which you then prune down. You also have to be careful in the presence of NULL fields.

The second is that there is a 'DISTINCT' keyword. When you prefix a column in the selection with this, you'll get at most one instance of a certain value for that column in the results. So, as per your query, 'SELECT DISTINCT user.id FROM' will eliminate the redundancies on the server side.

The third is that the correct way to solve this problem is likely not using the '*' operator. I suggest:

SELECT user.id,username,email,subject FROM message m,user WHERE m.user_id=user.id AND user.status=1

This uses the simple, easy-to-understand implicit-join syntax and should be valid SQL on whatever server. I can vouch for it working with MySQL, at least. It also aliases the 'message' table to 'm' as shorthand.

As you surmise, this will reduce the traffic from the SQL server to your database.

edit: if you want to eliminate the "redundant" email information, you can't - you must make two distinct queries. SQL results are tables and must be rectangular, with all known values filled. There's no 'ditto' entry.

edit 2: You only have to make two queries. For instance:

SELECT subject FROM message WHERE message.id IN (SELECT user.id FROM user WHERE status=1)

This is one query that contains a nested query, so it's really making two database hits. But it doesn't have any programmatic loops.

Borealid
A: 

Assuming you can use stored procedure, you could write one to run the above query and then use a cursor to store nulls for 'redundant information' to get something like

user.id  username email            message.id  subject 
1        jane     [email protected]   120         Notification  
null     null     null             122         Re:Hello  
null     null     null             125         Quotation 
2        john     [email protected]   127         Hi jane 
null     null     null             128         Fix thiss  
null     null     null             129         Ok 
3        jim      [email protected]      140         Re:Re:Quotation

and then return this resultset in a temporary table. but while this may reduce network traffic, it will add a processing overhead

Another way is to run 2 queries, one to get the user information, and the other to get the message information with just the linked user id and then do the "join" using application server side code. something like

SELECT DISTINCT user.* FROM user, message WHERE message.user_id=user.id AND user.status=1

and

SELECT user.id, message.* FROM user, message WHERE message.user_id=user.id AND user.status=1

which will result in 2 trips to the database, instead of 1, which might eventually be slower, even if the network traffic is reduced.

And another way is to bunch these 2 into a single resultset with something like

SELECT user.* FROM user, message WHERE message.user_id=user.id AND user.status=1
UNION ALL
SELECT user.id, message.* FROM user, message WHERE message.user_id=user.id AND user.status=1

to get something like

   user.id  username/message.id    email/subject 
    1        jane                   [email protected]   
    2        john                   [email protected]   
    3        jim                    [email protected]      
    1        120                    Notification           
    1        122                    Re:Hello           
    1        125                    Quotation          
    2        127                    Hi jane          
    2        128                    Fix thiss           
    2        129                    Ok          
    3        140                    Re:Re:Quotation

and then use application server logic to separate it out. reduced network traffic but more application server load / marginally more database server load.

But the saved network traffic is rarely worth the added complexity.

potatopeelings