Hey guys, quick question. I have this query and I am trying to get the latest comment for each topic (so one comment per topic), then sort those results in descending order. I have what I think should work, but my join always messes up my results. The end results seem to be sorted properly, but instead of the latest comment from each topic, it seems to have just taken a random comment. If anyone has any ideas, I would really appreciate the advice.

SELECT * FROM comments 
JOIN topic ON topic.topic_id=comments.topic_id 
WHERE topic.creator='admin' 
GROUP BY comments.topic_id 
ORDER BY comments.time DESC

table comments is structured like
id time user message topic_id


table topic is structured like
topic_id subject_id topic_title creator timestamp description

A: 

If you're trying to get the latest comment, it should be ORDER BY comments.time DESC LIMIT 1. I doubt that'll solve your problem, though.

Andrew
Why the downvote? What I said is true. Otherwise he would've gotten all the comments for that topic.
Andrew
I'm not the downvoter, but the OP is looking for the latest comment per topic id. LIMIT 1 works over the entire result set, not within the groupings.
Larry Lustig
+2  A: 

This is a MySQL extension to standard SQL, and I don't think it is helpful at all. In standard SQL your command would not be allowed, since there's no way to determine which single row should be reported for each group produced by the GROUP BY. MySQL will execute this command with (as you found out) a random row returned.

You can see a discussion of this issue here: MySQL - Control which row is returned by a group by.

Larry Lustig
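For readers who want the portable form the answer above alludes to, here is a minimal sketch, not taken from the thread. It computes MAX(time) per topic in a derived table and joins back to comments, so no non-aggregated column is left undefined. SQLite (through Python's sqlite3) stands in for MySQL so the example runs anywhere; the helper names, the sample rows, and the integer `time` values are assumptions made for illustration.

```python
import sqlite3


def demo_connection():
    """Build a small in-memory database shaped like the tables in the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE topic (topic_id INTEGER PRIMARY KEY, topic_title TEXT, creator TEXT);
        CREATE TABLE comments (id INTEGER PRIMARY KEY, time INTEGER, user TEXT,
                               message TEXT, topic_id INTEGER);
        INSERT INTO topic VALUES (1, 'first', 'admin'), (2, 'second', 'admin'),
                                 (3, 'third', 'guest');
        INSERT INTO comments VALUES
            (1, 100, 'a', 'old on 1', 1),
            (2, 300, 'b', 'new on 1', 1),
            (3, 200, 'c', 'only on 2', 2),
            (4, 400, 'd', 'on 3', 3);
    """)
    return conn


def latest_comment_per_topic(conn, creator):
    """Return the newest comment of each topic created by `creator`, newest first."""
    sql = """
        SELECT c.topic_id, c.id, c.time, c.message
        FROM comments AS c
        JOIN topic AS t ON t.topic_id = c.topic_id
        JOIN (
            -- one row per topic: the timestamp of its newest comment
            SELECT topic_id, MAX(time) AS latest_time
            FROM comments
            GROUP BY topic_id
        ) AS latest
          ON latest.topic_id = c.topic_id AND latest.latest_time = c.time
        WHERE t.creator = ?
        ORDER BY c.time DESC
    """
    return conn.execute(sql, (creator,)).fetchall()
```

If two comments in one topic share the same latest time, this form returns both of them; a tie-breaker on id would be needed to force exactly one row.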
Thanks Larry, I asked the guy below as well, but I might as well ask you since you were the one who sent the article. Your article said there were performance issues with using select queries within select queries. Is this true? If so, when do I have to worry about this?
Scarface
@Scarface: each query you send to a database is a fairly expensive operation involving preparation, disk reads, sorting, etc. So if your query involves a second query, it becomes twice as expensive. That's not usually a problem if it's two queries instead of one (as in this case). But there are cases, called *correlated queries*, in which the nested query must be executed once for every candidate row in the outer query and that can have unacceptable performance implications.
Larry Lustig
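To make the distinction in the comment above concrete, here is a hedged sketch of the two kinds of nested query. It is an illustration only: SQLite stands in for MySQL, and the constant names, helper names, and sample data are assumptions. The first inner SELECT never mentions the outer row, so it can be evaluated once; the second refers to `c.topic_id` from the outer row, so logically it is evaluated again for each candidate row (an optimizer may or may not rewrite it).

```python
import sqlite3

# Non-correlated: the inner SELECT stands alone and can run a single time.
UNCORRELATED_SQL = """
    SELECT id, topic_id, message
    FROM comments
    WHERE topic_id IN (SELECT topic_id FROM topic WHERE creator = 'admin')
    ORDER BY id
"""

# Correlated: the inner SELECT uses c.topic_id from the outer row.
CORRELATED_SQL = """
    SELECT c.id, c.topic_id, c.message
    FROM comments AS c
    WHERE c.time = (
        SELECT MAX(c2.time) FROM comments AS c2 WHERE c2.topic_id = c.topic_id
    )
    ORDER BY c.id
"""


def build_demo():
    """Create an in-memory database with a few sample topics and comments."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE topic (topic_id INTEGER PRIMARY KEY, topic_title TEXT, creator TEXT);
        CREATE TABLE comments (id INTEGER PRIMARY KEY, time INTEGER, user TEXT,
                               message TEXT, topic_id INTEGER);
        INSERT INTO topic VALUES (1, 'first', 'admin'), (2, 'second', 'admin'),
                                 (3, 'third', 'guest');
        INSERT INTO comments VALUES
            (1, 100, 'a', 'old on 1', 1),
            (2, 300, 'b', 'new on 1', 1),
            (3, 200, 'c', 'only on 2', 2),
            (4, 400, 'd', 'on 3', 3);
    """)
    return conn


def run(sql):
    """Execute one of the queries above against fresh sample data."""
    return build_demo().execute(sql).fetchall()
```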
thanks larry, appreciate it
Scarface
+3  A: 

You've got a couple of things going on here. First, the reason your current query is returning weird results is that you aren't really using your GROUP BY clause in the way intended; it is intended to be used with aggregated fields (like COUNT(), SUM(), etc.). It is a convenient side effect that on MySQL, the GROUP BY clause also returns the first record that would be in the group--which, in your case, should be the first inserted message for each topic (not a random one). So your query as it is written is essentially returning the oldest message per topic (on MySQL only; note that other RDBMSs will throw an error if you try to use a GROUP BY clause like that!).

But you can actually abuse the GROUP BY clause to get what you want, and you are really close already. What you need to do is to do a sub-query to make a derived table first (with your messages ordered by DESC date), then query the derived table using the GROUP BY clause like this:

SELECT * FROM (
  SELECT
    topic.topic_title, comments.id, comments.topic_id, comments.message
  FROM comments
  JOIN topic ON topic.topic_id=comments.topic_id
  WHERE topic.creator='admin'
  ORDER BY comments.time DESC) derived_table
GROUP BY topic_id
Ken Taylor
Hey really nice answer, it worked. If I could donate points I would donate 10 of mine. Very detailed, and I don't feel mystified after reading. I just have one question however, when I read the article that one of the other guys sent me it said there were performance issues with using select queries within select queries. Is this true? If so, when do I have to worry about this?
Scarface
MySQL does return random values, not the first row, at least according to the docs: "The server is free to return any value from the group, so the results are indeterminate unless all values are the same." http://dev.mysql.com/doc/refman/5.1/en/group-by-hidden-columns.html
Larry Lustig
Also when you say derived table, does that just mean you put that bit at the end to label your query, basically as a filler?
Scarface
larry what exactly do you mean? You mean if you attempt to group something with different values, then the returned value is random within the group? I am sorry, I am kind of noob still so I always have a lot of questions.
Scarface
@Scarface (Thanks!) There are performance issues for a nested query (a "select within a select"), because you are running multiple queries and not just the one. It gets worse if you get too fancy; you can get into situations where your sub-query is running multiple times for the parent query, and that is ugly. But this isn't one of those times; your performance hit here shouldn't be too bad as long as your tables aren't huge.
Ken Taylor
@Larry Ha ha, I know what's in the docs...but nothing in a computer is random. I've run test and re-test on this, and what appears to be happening is that the first physically indexed record (i.e., the first record inserted) is what is being returned--which makes sense if you think about how an RDBMS stores data internally. I think the docs would be better written to say that the result is "unreliable" (which is true from a logic standpoint--you shouldn't code this way if you can avoid it), rather than the result is "random".
Ken Taylor
My topic table has about 4 more rows I did not list, you think that would be a problem? Also just one more question lol, I noticed that I got the information for the comments table, but I wanted to select rows from the topic table as well. Is this possible or should I just run a separate query?
Scarface
@Scarface Derived tables are just sub-queries (you can think about them as temporary tables that contain the result sets of the inner queries), but MySQL requires that you give them an alias--hence the "derived_table" identifier in the sample query (not all RDBMSs force you to give derived tables a name, by the way). The name is irrelevant--pick something that matters to you--but you have to have one. You can use derived tables in joins, etc. just like any other table; in that case, having an alias for it is critical.
Ken Taylor
@Scarface Just add the topic fields you want to select to the inner query like so: topic.topic_title, topic.creator, etc. They will bubble-up to the outer query.
Ken Taylor
@Larry: Was that changed in 5.1? According to my copy of *MySQL: The Definitive Guide*, using `GROUP BY` will implicitly sort unless an `ORDER BY` is present (so you should specify `ORDER BY NULL` to avoid the overhead if you don't need sorted results). Of course, arbitrary order should be expected since relational sets don't have an order.
Duncan
Thanks Ken, looks like I ended up giving you like 10 points anyway lol. Appreciate your time and teachings. Thanks everyone else as well for sparking discussion.
Scarface
No problem--my pleasure!
Ken Taylor
@Ken: For "random" you may prefer "undefined". You can test all you want, and I'm sure the results of your tests are accurate. But without an explicit guarantee the implementation can change at some future date. You must never rely on the internal storage mechanism of an RDBMS: unless the specification guarantees a certain result (especially in cases of ordering), the results can change from version to version. Many people got bitten when various RDBMSes implemented index-only retrieval, and result set order changed.
Larry Lustig
@Duncan: You are mixing up two ordering issues. You are correct that GROUP BY will implicitly order the rows in the actual result set once those rows are calculated. But it will *not* order the rows *inside* each group that were used to calculate the single row for that group in the result set. If you GROUP BY customer_id and don't ORDER BY then customer 1 will implicitly be before customer 2 in the result set. But the 100 (or whatever) customer 1 rows that were aggregated to produce the single customer 1 row will not have been internally ordered.
Larry Lustig
@Larry Absolutely agree! That's what I meant in my previous comment (though I like your term "undefined" better than my "unreliable"). This is problematic use of GROUP BY and works the way it does by accident, so to speak; it is not portable to other RDBMS's and cannot be guaranteed even on future versions of MySQL.
Ken Taylor
@Larry: Ah, that makes sense. I probably misunderstood what Paul DuBois meant. Perhaps he didn't mention internal ordering because it doesn't happen, and it's like telling people not to think about purple elephants. Fortunately I wasn't abusing `GROUP BY` at the time I read that, I was aggregating so all columns I was interested in had the same value anyway.
Duncan
ok ken lol I came back to this part of my site, and it turns out that your query does not work. While the right results are selected, they are in seemingly random order. Any more ideas haha?
Scarface
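One possible way to address the ordering problem raised in the last comment, offered as a sketch rather than as anything proposed in the thread: rank the comments inside each topic with a window function, keep rank 1, and put an explicit ORDER BY on the outer query, since only an outer ORDER BY fixes the order of the final rows. This assumes a database with window functions (MySQL 8.0 or later; SQLite 3.25 or later is used here as a stand-in). The names `RANKED_SQL`, `rn`, `ranked`, and `make_demo`, and the sample rows, are hypothetical.

```python
import sqlite3

RANKED_SQL = """
    SELECT topic_title, id, topic_id, message, time
    FROM (
        SELECT t.topic_title, c.id, c.topic_id, c.message, c.time,
               -- number the comments of each topic from newest to oldest;
               -- id breaks ties between equal timestamps
               ROW_NUMBER() OVER (
                   PARTITION BY c.topic_id
                   ORDER BY c.time DESC, c.id DESC
               ) AS rn
        FROM comments AS c
        JOIN topic AS t ON t.topic_id = c.topic_id
        WHERE t.creator = ?
    ) AS ranked
    WHERE rn = 1            -- newest comment of each topic
    ORDER BY time DESC      -- explicit order for the final result
"""


def make_demo():
    """Create an in-memory database with sample topics and comments."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE topic (topic_id INTEGER PRIMARY KEY, topic_title TEXT, creator TEXT);
        CREATE TABLE comments (id INTEGER PRIMARY KEY, time INTEGER, user TEXT,
                               message TEXT, topic_id INTEGER);
        INSERT INTO topic VALUES (1, 'first', 'admin'), (2, 'second', 'admin'),
                                 (3, 'third', 'guest');
        INSERT INTO comments VALUES
            (1, 100, 'a', 'old on 1', 1),
            (2, 300, 'b', 'new on 1', 1),
            (3, 200, 'c', 'old on 2', 2),
            (4, 500, 'd', 'new on 2', 2),
            (5, 400, 'e', 'on 3', 3);
    """)
    return conn


def latest_per_topic_sorted(creator):
    """Newest comment per topic for `creator`, with the newest topic activity first."""
    return make_demo().execute(RANKED_SQL, (creator,)).fetchall()
```

Because the topic columns are selected in the inner query, they are available to the outer query as well, which also covers the earlier request to pull fields from the topic table.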