I am trying to write a query in PostgreSQL that pulls a set of ordered data and filters it by a distinct field. I also need to pull several other fields from the same table row, but they need to be left out of the distinct evaluation. Example:

  SELECT DISTINCT(user_id) user_id, 
         created_at 
    FROM creations 
ORDER BY created_at   
   LIMIT 20

I need the user_id to be DISTINCT, but don't care whether the created_at date is unique or not. Because the created_at date is being included in the evaluation, I am getting duplicate user_id's in my result set.

Also, the data must be ordered by the date, so using DISTINCT ON is not an option here. It requires that the DISTINCT ON field be the first field in the ORDER BY clause, and that does not deliver the results that I seek.

How do I properly use the DISTINCT clause but limit its scope to only one field while still selecting other fields?

A: 

Using a sub-query was suggested by someone on the #postgresql IRC channel. It worked:

SELECT user_id  
FROM (SELECT DISTINCT ON (user_id) * FROM creations) ss  
ORDER BY created_at DESC  
LIMIT 20;
mindtonic
This will still yield duplicate user_id's if the same user_id has two of the top 20 created_at values
Matthew PK
@mindtonic, although this is essentially Bill's answer, this answer might be wrong - if you don't use ORDER BY with DISTINCT ON then the choice of values for the other fields is 'implementation specific' (read: most likely matches the order of inserts, but not guaranteed, see documentation)
Unreason
@Matthew PK, it will not - subquery will get distinct user_id's. Period.
Unreason
I think this answer is different from what was being asked. You could have just as easily `SELECT user_id, min(created_at) AS created_at FROM creations GROUP BY user_id ORDER BY min(created_at);`
vol7ron
@vol7ron, as I have stated in my answer. I claim that it is not different, but the question is not fully defined. Out of curiosity, what do you think is being asked?
Unreason
@Unreason, let us presume that it returns exactly one of each user_id as described... without any rationale for the aggregation you are not appropriately handling any of the duplicated rows... which is the entire basis of the problem! Let's say there are two rows having the same user_id, the select will return data corresponding to whatever row resolves first in the select. This is unsustainable because you are not specifying the return order or any aggregation. It may appear to return the results you're interested in but should not be depended on.
Matthew PK
@Unreason, of two duplicated rows... let's say that you want the most recent created_at, then you should `GROUP` and get the max. If you don't care which created_at you resolve... what is the purpose of selecting it? On the other hand, if, as was answered by Bill Karwin below, your leftmost `ORDER` matched the `DISTINCT ON` then you would be accomplishing the same task as `GROUP` and aggregating the results. Either way, you need to define your aggregation for it to make any sense.
Matthew PK
@Matthew PK, agree with all you said, as is hopefully clear from my answer. Was just correcting your incorrect claim that the above query will get duplicate user_id's. It will not.
Unreason
@Unreason: **Using a DISTINCT clause to filter data but still pull other fields that are not DISTINCT.** `I need the user_id to be DISTINCT, but don't care whether the created_at date is unique or not...Also, the data must be ordered by the date` The question isn't answerable without a way to either aggregate or choose a distinct record from the non-unique field. So the answer is incorrect; it answers a more well-defined question, which doesn't exist above.
vol7ron
@vol7ron: in Unreason's answer below he stipulates that the question is incomplete without explaining which data to return from non-unique rows. As such, Unreason explained these limitations and demonstrated how to select some aggregated data. The question was poorly formed but I think the answers have been helpful regardless.
Matthew PK
+1  A: 

As you've discovered, standard SQL treats DISTINCT as applying to the whole select-list, not just one column or a few columns. The reason for this is that it's ambiguous what value to put in the columns you exclude from the DISTINCT. For the same reason, standard SQL doesn't allow you to have ambiguous columns in a query with GROUP BY.
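
To illustrate with the query from the question: the parentheses around user_id only group an expression, they are not an argument list. As far as I can tell, the two statements below are therefore equivalent, and both deduplicate on the (user_id, created_at) pair rather than on user_id alone.

```sql
-- DISTINCT is not a function; (user_id) is just a parenthesized expression.
SELECT DISTINCT (user_id) user_id, created_at
FROM creations;

-- Same meaning: a row is dropped only when BOTH columns match another row.
SELECT DISTINCT user_id, created_at
FROM creations;
```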

But PostgreSQL has a nonstandard extension to SQL to allow for what you're asking: DISTINCT ON (expr).

SELECT DISTINCT ON (user_id) user_id, created_at 
FROM creations 
ORDER BY user_id, created_at   
LIMIT 20

You have to include the distinct expression(s) as the leftmost part of your ORDER BY clause.

See the manual on DISTINCT Clause for more information.
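
If the final ordering has to differ from the DISTINCT ON expression, one possible sketch is to do the de-duplication in a subquery and re-sort in the outer query. This assumes you want each user's most recent row; the alias `latest` is only illustrative.

```sql
SELECT user_id, created_at
FROM (
    -- One row per user_id. The inner ORDER BY decides which row survives:
    -- here, the newest created_at for each user.
    SELECT DISTINCT ON (user_id) user_id, created_at
    FROM creations
    ORDER BY user_id, created_at DESC
) AS latest
-- The outer query is free to sort by any column it likes.
ORDER BY created_at DESC
LIMIT 20;
```

Flipping `DESC` to `ASC` in the inner ORDER BY would keep each user's earliest row instead.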

Bill Karwin
SQL does this because the concept of `DISTINCT` naturally applies to entire select lists; the alternative would be a `GROUP BY` clause. Selecting only distinct results from a single column fundamentally necessitates an aggregation if there are any additional columns in the query
Matthew PK
"You have to include the distinct expression(s) as the leftmost part of your ORDER BY clause." This is the reason why I cannot use DISTINCT ON... I need the results to be in a specific order which is defined by a field unrelated to the DISTINCT ON clause.
mindtonic
+1, ah so postgres has DISTINCT ON... learning. thanks.
Unreason
+1  A: 

The GROUP BY should ensure distinct values of the grouped columns; this might give you what you are after.

(Note I'm putting in my 2 cents even though I am not familiar with PostgreSQL, but rather MySQL and Oracle)

In MySQL

SELECT user_id, created_at
FROM creations
GROUP BY user_id
ORDER BY user_id

In Oracle sqlplus

SELECT user_id, FIRST(created_at)
FROM creations
GROUP BY user_id
ORDER BY user_id

These will give you the user_id followed by the first created_at associated with that user_id. If you want a different created_at, you have the option to substitute FIRST with other functions like AVG, MIN, MAX, or LAST in Oracle. You can also try adding ORDER BY on other columns (including ones that are not returned) to give you a different created_at.

dsclementsen
+2  A: 

Your question is not well defined - when you say you also need other data from the same row, you are not defining which row.

You do say you need to order the results by created_at, so I will assume that you want values from the row with min created_at (earliest).

This now becomes one of the most common SO SQL questions - retrieving rows containing some aggregate value (MIN, MAX).

For example

SELECT user_id, MIN(created_at) AS created_at
FROM creations
GROUP BY user_id
ORDER BY MIN(created_at)
LIMIT 20

This approach will not let you (easily) pick other values from the same row.

One approach that will let you pick other values is

SELECT c.user_id, c.created_at, c.other_columns
FROM creations c LEFT JOIN creations c_help
     ON c.user_id = c_help.user_id AND c.created_at > c_help.created_at
WHERE c_help.user_id IS NULL
ORDER BY c.created_at
LIMIT 20
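
Another way to pick other values from the same row, assuming PostgreSQL 8.4 or later (where window functions are available), is to number each user's rows and keep only the first. This is a sketch rather than a tested solution; `foo` and `bar` stand in for whatever other columns you need, and `ranked` and `rn` are illustrative names.

```sql
SELECT user_id, created_at, foo, bar
FROM (
    SELECT c.*,
           -- Number each user's rows from earliest to latest created_at.
           ROW_NUMBER() OVER (PARTITION BY user_id
                              ORDER BY created_at) AS rn
    FROM creations c
) AS ranked
WHERE rn = 1            -- keep only the earliest row per user
ORDER BY created_at
LIMIT 20;
```

Unlike the self-join, this keeps exactly one row per user even when two rows share the same earliest created_at, though which of the tied rows is kept is not defined.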
Unreason
+1 Bingo. He's missing the point of aggregation. Without aggregating there is no reason for there to be a `DISTINCT` on any single column less the entire result set.
Matthew PK
Ok, let's say that the creations table has the following fields: `id, user_id, created_at, foo, bar, long_description`. What I want is to pull the 20 most recent creations `ORDER BY created_at DESC`, but filter the results by user so that there can be only one creation per user in the result set. I would also like to bring other fields with me such as `foo` and `bar` but leave out `long_description`. Is the answer really to join a creation to itself?
mindtonic
@mindtonic, it is an answer, maybe not the answer for you. Also, you should try it and test it. There are other approaches: obviously DISTINCT ON works (with a bit of reordering), correlated subquery would work, too, etc...
Unreason
+1  A: 

If you want the most recent created_at for each user then I suggest you aggregate like this:

SELECT user_id, MAX(created_at)
FROM creations
WHERE ....
GROUP BY user_id
ORDER BY MAX(created_at) DESC

This will return the most recent created_at for each user_id. If you only want the top 20, then append

LIMIT 20

EDIT: This is basically the same thing Unreason said above... define from which row you want the data by aggregation.

Matthew PK