tags:

views:

168

answers:

5

Here's the situation. I have two tables:

  • users (registered users of the website),
  • messages (personal messages they sent between each other)

The messages table has these columns (just the important ones):

  • id,
  • sender (id of user who sent the message),
  • receiver id of user to whom the message was sent),
  • reply_to (id of a message to which this message is reply to, can be NULL)

What I need to do is construct a SELECT query that will select a complete conversation between 2 users. I.e. if user A replies to message sent from user B and the user B replies back to the message, I would like to get three rows like this:

  • message03: reply to the message02
  • message02: reply to the message01
  • message01 from user A to user B

I'm sure that it is possible to construct such a SELECT query based on the reply_to field but I have never done something like it before so I need a little help.

The SELECT query should be for MySQL database.

+7  A: 

Actually you're incorrect: with ANSI SQL this isn't possible. Certain databases with vendor extensions (eg Oracle's CONNECT BY) may be able to do what you want but not plain old SQL.

My advice? Change your data so enable an easier solution.

In this case, give each message a conversation_id. If the user posts a new message, give that a new (currently unused) value. If they reply, keep the conversation_id of the message being replied to.

Then querying the data becomes trivial.

cletus
+1 To write one select to do this, you would have to define a max depth, and manually write out that many joins. But if the conversation goes one more message beyond how many joins you wrote, your query will fail to return all of them.
David Oneill
I don't think that's right David; can you explain further? Cletus's design makes sense to me.
Craig Walker
Note tha this is sort-of breaking normalization (though for a worthy cause), as the conversation_id could potentially get out of sync with the reply_to hierarchy. It's still a great idea though since it solves the problem without too much hassle. Just make sure that your application code reliably sets the conversation_id. It's also a good idea to have a checking/cleanup process that periodically checks all reply hierarchies for the correct conversation_id.
Craig Walker
`@Craig Walker`: this design allows selecting only the whole conversations (you cannot easily filter all replies to something that is not the message that started the conversation). However, the original question didn't ask for that, so this design does answer the question.
Quassnoi
This IS completely possible using a few nested select statements, albeit not a good idea. The `conversation_id` would CERTAINLY simplify it.
md5sum
@Quassnoi: That's true, though once you have the whole conversation you can do a relatively simple second-pass on the hierarchy in the application code (which probably supports recursion better). You'll want to do that regardless just to rebuild the hierarchy structure.
Craig Walker
@md5sum: the problem with nested selects is that you have to know how many selects you need ahead of time (ie: before you build your query). You can't do an arbitrary-depth recursion in plain SQL. For "reply-to" structures like this one that's almost certainly what you need.
Craig Walker
`@Craig Walker`: sure you can. However, questions tagged `SQL`, in my opinion, ask for the database-side answers, and most systems (including `MySQL` with the hack mentioned in my answer) support filtering on the parent message and building the hierarchical structures (preserving the order) on the database side.
Quassnoi
@Quassnoi: good point. As Cletus originally said: if you want straight SQL you're out of luck. If you want recursiveness, you'll have to go somewhere else, be it DB extensions and/or application code.
Craig Walker
+2  A: 

This is the adjacency list model.

MySQL has no native way to query it, but you can use a certain hack: create a function like this:

CREATE FUNCTION hierarchy_connect_by_parent_eq_prior_id(value INT) RETURNS INT
NOT DETERMINISTIC
READS SQL DATA
BEGIN
        DECLARE _id INT;
        DECLARE _parent INT;
        DECLARE _next INT;
        DECLARE CONTINUE HANDLER FOR NOT FOUND SET @id = NULL;

        SET _parent = @id;
        SET _id = -1;

        IF @id IS NULL THEN
                RETURN NULL;
        END IF;

        LOOP
                SELECT  MIN(id)
                INTO    @id
                FROM    messages
                WHERE   reply_to = _parent
                        AND id > _id;
                IF @id IS NOT NULL OR _parent = @start_with THEN
                        SET @level = @level + 1;
                        RETURN @id;
                END IF;
                SET @level := @level - 1;
                SELECT  id, reply_to
                INTO    _id, _parent
                FROM    messages
                WHERE   id = _parent;
        END LOOP;
END

and use it in a query:

SELECT  CONCAT(REPEAT('    ', level - 1), CAST(hi.id AS CHAR)) AS treeitem, parent, level
FROM    (
        SELECT  hierarchy_connect_by_parent_eq_prior_id(id) AS id, @level AS level
        FROM    (
                SELECT  @start_with := 0,
                        @id := @start_with,
                        @level := 0
                ) vars, messages
        WHERE   @id IS NOT NULL
        ) ho
JOIN    messages hi
ON      hi.id = ho.id

See this article in my blog for more detailed explanations on how it works:

Only the children of the original message (whose id should be used to initialize @start_with) will be selected.

This query additionally can be filtered for the values of sender_id and receiver_id to make sure only the messages between the users are selected.

Quassnoi
I've done similar things in SQL Server 2005+ by using CLR (ie: C#) functions.
Craig Walker
`@Craig Walker`: `SQL Server 2005` supports recursive `CTE`'s, there is no need to use `CLR` for that.
Quassnoi
@Quassnoi: Yes, recursive CTEs would be an even better (in fact ideal) solution to this problem. The problem I solved with CLR functions had some additional application-level complexity that warranted the jump.
Craig Walker
This method also assumes that during the course of a conversation with multiple users, no participants are also carrying on their own conversations externally.
md5sum
`md5sum`: this method assumes nothing like that. It builds the complete hierarchy starting from the message with the given `id` which later can be filtered on the values of `sender_id` and `receiver_id`. The side conversations are not a part of hierarchy and won't ever be returned by this query.
Quassnoi
@Quassnoi - You're right, missed the purpose of reply_to.
md5sum
+6  A: 

I would suggest adding a conversation_id field to the messages table. Each new non-reply message would have a generated conversation_id, and then each reply based on that message would use the same id. Then your query would be simple:

select * from messages where conversation_id = ? order by id asc
Kaleb Brasee
+2  A: 

I use a technique I found in Joe Celko's SQL for Smarties book - chapter 29 on the nested set model for trees. Maintaining the data is a little ugly (insert, update, delete), but selects are very fast. The code in the book is very thorough and well explained. There is also some info on how to convert the data you have into this new model.

Ray
+1  A: 

Using something like:

SELECT *
FROM [Messages] 
WHERE 
    (
        [Sender] = @Sender 
            AND [Reciever] = @Reciever
    ) OR (
        [Sender] = @Reciever 
            AND [Reciever] = @Sender)

Will get you an entire conversation history. As far as the reply_to field, I wouldn't use that because:

A) It will make it very complex to retrieve the first message of a conversation.

B) You can use other filters such as date or limit the history length to prevent a full output of every conversation the users have had.

Instead, I would add something along the lines of a ConversationId instead that would increment if there was no messages sent between users for a pre-specified length of time.

In specific answer to your question though, the following query will work, provided you can allow selection of the first message in a conversation:

SELECT * 
FROM [Messages] 
WHERE 
    (
        (
            [Sender] = @Sender 
                AND [Reciever] = @Reciever
        ) OR (
            [Sender] = @Reciever 
                AND [Reciever] = @Sender)
    )
    AND id >= @FirstMessageId
    AND id < 
        (
            SELECT TOP 1 [id] 
            FROM [Messages] 
            WHERE [id] > @FirstMessageId
                AND [reply_to] IS NULL
                AND  
                    (       
                        (
                            [Sender] = @Sender 
                                AND [Reciever] = @Reciever
                        ) OR (
                            [Sender] = @Reciever 
                                AND [Reciever] = @Sender
                        )
                    )
    )
md5sum
This only works if the conversation has exactly two conversants AND they're both known ahead of time. That happens to match the poster's example but probably wouldn't work in the real application.
Craig Walker
The poster's example data schema only allows for conversations between two people, and as I stated, it's better to use a `ConversationId`.
md5sum
`@md5sum`: no, it does not. You can put arbitrary values into `sender_id` and `receiver_id`. In fact, this schema is denormalized, so that the `receiver_id` does not necessarily match the `sender_id` in the message referenced by `reply_to`. You can put `4` different people in two messages.
Quassnoi
@Quassnoi - You're assuming that the OP is allowing multiple ids to be placed in the recipient field, which, in accordance to his question, I would assume he isn't. If he IS however, this query could be modified using an `IN` instead, which would further slow down execution. Once again, `ConversationId` will fix this.
md5sum
`@md5sum`: I assume nothing like that. There can be two replies to one message (in two separate records) and these replies can be posted by one or multiple persons.
Quassnoi
@Quassnoi - At that point he's just screwed without `ConversationId` LoL... :D
md5sum
`@md5sum`: btw, downvoting without explaining the reason is considered a bad tone.
Quassnoi
@Quassnoi - I haven't downvoted anything where it either wasn't already explained, or without explaining. In your case, if you'll edit your answer, the downvote will be removed, that was my honest mistake.
md5sum
`md5sum`: updated the answer.
Quassnoi
Done, I'd give you a + too, but it won't let me :( may try editing again, but I dunno...
md5sum