ansaurus

Question

Answer 1

+6 A:

A definitive answer can only be given if you measure the performance of both approaches in realistic environments. Everything else is guesswork at best.

There are so many variables at play here - the structure of the tables and the types of data in them, the distribution of the data, what kind of indices you have at your disposal, how heavy the load on the server is - it's almost impossible to predict any outcome, really.

So really - my best advice is: try both approaches, on the live system, with live data, not just with a few dozen test rows - and measure, measure, measure.

Marc

marc_s 2009-07-28 16:18:21

+1 - sound advice

AdaTheDev 2009-07-28 16:20:24

might sound a bit frustrating, but most often, this is really almost the only advice available......

marc_s 2009-07-28 16:21:06

+1 - don't forget to use explain plans and indexes

northpole 2009-07-28 16:21:33

+1 for testing. But "with a few dozen test rows"? I'd use (a lot) more and possibly more than one concurrent client - if that is what will happen on the productive system. You have to stress your system enough to find a bad strategy (e.g. with bad complexity) and/or hotspots. With a small dataset you can do _anything_ and it doesn't matter. And even with a slightly bigger dataset and only one client you can't be sure if the dbms "buffers" some bottlenecks that will reveal themselves with more data/concurrent clients.

VolkerK 2009-07-28 16:33:36

@VolkerK: yes, what I was trying to say is to do the testing against the REAL data load - not just a few dozen test rows. Even just the number of rows can make a huge difference

marc_s 2009-07-28 17:02:01

ah darn, my latent "after 4pm" reading disability strikes again :( "with _live data_, _not_ just with a few dozen test rows". Yes, I completely agree with that.

VolkerK 2009-07-28 17:19:10

Answer 2

+1 A:

You want to order by the "return value" of your custom function? Then the database server can't use an index for that. The score has to be calculated for each record in the table (that hasn't been excluded with a WHERE clause) and stored in some temporary storage/table. Then the order by is performed on that temporary table. So this easily can get slower than your union queries (depending on your union statements of course).

VolkerK 2009-07-28 16:20:26

Yes it can - you just need to create an expression index on the function output.

Magnus Hagander 2009-07-29 05:32:44

Yes, it would be possible if it was the same function call every time, like in the manual's example "WHERE lower(col1) = 'value'". But Max wants to compute some kind of matching score for an ever changing search string. "Indexes on Expressions": http://www.postgresql.org/docs/8.4/interactive/indexes-expressional.html

VolkerK 2009-07-29 06:49:07

Answer 3

+1 A:

To add my little bit...

+1 to marc_s, completely agree with what he said - I would only say, you need a test db server with realistic data volumes in to test on, as opposed to production server.

For the function approach, the function would be executed for each record, and then ordered by that result - this will not be an indexed column and so I'd expect to see a negative impact in performance. However, how big that impact is and whether it is actually negative when compared to the cumulative time of the other approach, is only going to be known by testing.

AdaTheDev 2009-07-28 16:25:53

Answer 4

+1 A:

In PostgreSQL 8.3 and below, UNION implied DISTINCT which implied sorting, that means ORDER BY, UNION and DISTINCT were always of same efficiency, since the atter two aways used sorting.

On PostgreSQL 8.3, this query returns the sorted results:

SELECT  *
FROM    generate_series(1, 10) s
UNION
SELECT  *
FROM    generate_series(5, 15) s

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

Since PostgreSQL 8.4 it became possible to use HashAggregate for UNION which may be faster (and almost always is), but does not guarantee ordered output.

The same query returns the following on PostgreSQL 8.4:

SELECT  *
FROM    generate_series(1, 10) s
UNION
SELECT  *
FROM    generate_series(5, 15) s

10
15
8
6
7
11
12
2
13
5
4
1
3
14
9

, and as you can see the resuts are not sorted.

PostgreSQL change list mentions this:

SELECT DISTINCT and UNION/INTERSECT/EXCEPT no longer always produce sorted output (Tom)

So in new PostgreSQL versions, I'd advice to use UNION, since it's more flexible.

In old versions, the performance will be the same.

Quassnoi 2009-07-28 16:30:16

ansaurus

tags:

views:

answers:

SQL Performance: UNION or ORDER BY

related questions