views:

194

answers:

3

If I GROUP BY on a unique key, and apply a LIMIT clause to the query, will all the groups be calculated before the limit is applied?

If I have hundred records in the table (each has a unique key), Will I have 100 records in the temporary table created (for the GROUP BY) before a LIMIT is applied?

A case study why I need this:

Take Stack Overflow for example.

Each query you run to show a list of questions, also shows the user who asked this question, and the number of badges he has.

So, while a user<->question is one to one, user<->badges is one has many.

The only way to do it in one query (and not one on questions and another one on users and then combine results), is to group the query by the primary key (question_id) and join+group_concat to the user_badges table.

The same goes for the questions TAGS.

Code example:
Table Questions:
question_id  (int)(pk)|   question_body(varchar)


Table tag-question:
question-id (int) | tag_id (int)


SELECT:

SELECT quesuestions.question_id,
       questions.question_body,
       GROUP-CONCAT(tag_id,' ') AS 'tags-ids'
FROM
       questions
   JOIN
       tag_question
   ON
       questions.question_id=tag-question.question-id
GROUP BY
       questions.question-id
LIMIT 15
+4  A: 

Yes, the order the query executes is:

  • FROM
  • WHERE
  • GROUP
  • HAVING
  • SORT
  • SELECT
  • LIMIT

LIMIT is the last thing calculated, so your grouping will be just fine.

Now, looking at your rephrased question, then you're not having just one row per group, but many: in the case of stackoverflow, you'll have just one user per row, but many badges - i.e.

(uid, badge_id, etc.)
(1, 2, ...)
(1, 3, ...)
(1, 12, ...)

all those would be grouped together.

To avoid full table scan all you need are indexes. Besides that, if you need to SUM, for example, you cannot avoid a full scan.

EDIT:

You'll need something like this (look at the WHERE clause):

SELECT
  quesuestions.question_id,
  questions.question_body,
  GROUP_CONCAT(tag_id,' ') AS 'tags_ids'
FROM
  questions q1
  JOIN tag_question tq
    ON q1.question_id = tq.question-id
WHERE
  q1.question_id IN (
    SELECT
      tq2.question_id
    FROM
      tag_question tq2
        ON q2.question_id = tq2.question_id
      JOIN tag t
        tq2.tag_id = t.tag_id
    WHERE
      t.name = 'the-misterious-tag'
  )
GROUP BY
  q1.question_id
LIMIT 15
Seb
I have edited my question to add clarification and example.
Itay Moav
Looking at your question, then you're not having just one row per group, but many: in the case of stackoverflow, you'll have just one user per row, but many badges - i.e. rows like (uid, badge_id, etc.): (1, 2, ...), (1, 3, ...), (1, 12, ...), etc.
Seb
yes, but only one question_id, which is the group by index. and for each group I GROUP_CONCAT the tags/badges. If I have a LIMIT of 15, and have 100 questions, I would like the query to stop after it grouped 15 questions, and not do full table scan.
Itay Moav
Then you'll need to filter them out in the WHERE clause - e.g. if you're interested in the latest 15 questions, do "WHERE date >= (SELECT date FROM questions ORDER BY date DESC LIMIT 14,1)", and you're done. :)
Seb
not so simple, try this. I MUST use the group by to show the relevant TAGS....hmmmm,,,,,,Unless I use a subquery in the SELECT clause for each question. I will investigate this. Thanks.
Itay Moav
People are voting you up, but I think you might be wrong here. They do not read my entire question, I think.
Itay Moav
People are voting this up because it answers your _question_, not your problem. I believe you are not stating your question correctly, then... If you want help solving something, state your problem, not a problem you're facing with the way you intend to solve it. :)
Seb
If you have a problem with the tags, then filter your questions in the subquery as well: "WHERE date >= (SELECT date FROM questions NATURAL JOIN questions_tags ORDER BY date DESC LIMIT 14,1)"
Seb
I am putting a table example in the question. I am aware this is not a very clear subject, and I am not native English write.
Itay Moav
See at my updated answer.
Seb
Why joining with the question in the subselect?
Quassnoi
Or remove the GROUP and fetch the tags using a subquery (similar to what you wrote) in the SELECT clause.
Itay Moav
@Itay you need to group outside, not inside, because the sub-query will only return matching tags, when you want all of them.
Seb
For each question I want only the matching tags:SELECT question_id,(SELECT GROUP_CONCAT(tag_id,' ') WHERE out.question_id=in.question_id GROUP BY in.question_id)FROM questionsLIMIT 15
Itay Moav
I give up... every answer I give you you change some requirement... Try to rephrase your ENTIRE question above and let's go from there, because no one can guess what you ultimately need.
Seb
Sorry, what I meant is that I agree with your query in the last comment, and I showed another way of doing the same thing.
Itay Moav
Oh, ok! Then I'm glad I could help after all :)
Seb
+1 for your patience :)
Quassnoi
+1  A: 

If the field you're grouping on is indexed, it shouldn't do a full table scan.

GoatRider
+1  A: 

LIMIT does get applied after GROUP BY.

Will the temporary table be created or not, depends on how your indexes are built.

If you have an index on the grouping field and don't order by the aggregate results, then an INDEX SCAN FOR GROUP BY is applied, and each aggregate is counted on the fly.

That means that if you don't select an aggregate due to the LIMIT, it won't ever be calculated.

But if you order by an aggregate, then, of course, all of them need to be calculated before they can be sorted.

That's why they are calculated first and then the filesort is applied.

Update:

As for your query, see what EXPLAIN EXTENDED says for it.

Most probably, question_id is a PRIMARY KEY for your table, and most probably, it will be used in a scan.

That means no filesort will be applies and the join itself will not ever happen after the 15'th row.

To make sure, rewrite your query as following:

SELECT question_id,
       question_body,
       (
       SELECT  GROUP_CONCAT(tag_id, ' ')
       FROM    tag_question t
       WHERE   t.question_id = q.question_id
       )
FROM   questions q
ORDER BY
       question_id
LIMIT 15
  • First, it is more readable,
  • Second, it is more efficient, and
  • Third, it will return even untagged questions (which your current query doesn't).
Quassnoi
:-D read my last comment To Seb. (As for untagged questions, In my specific system I don't have such a case, there is always a default tag, but this is not part of this question) Thanks!
Itay Moav