views:

337

answers:

4

The problem: we have a very complex search query. If its result yields too few rows we expand the result by UNIONing the query with a less strict version of the same query.

We are discussing wether a different approach would be faster and/or better in quality. Instead of UNIONing we would create a custom sql function which would return a matching score. Then we could simply order by that matching score.

Regarding performance: will it be slower than a UNION?

We use PostgreSQL.

Any suggestions would be greatly appreciated.

Thank you very much Max

+6  A: 

A definitive answer can only be given if you measure the performance of both approaches in realistic environments. Everything else is guesswork at best.

There are so many variables at play here - the structure of the tables and the types of data in them, the distribution of the data, what kind of indices you have at your disposal, how heavy the load on the server is - it's almost impossible to predict any outcome, really.

So really - my best advice is: try both approaches, on the live system, with live data, not just with a few dozen test rows - and measure, measure, measure.

Marc

marc_s
+1 - sound advice
AdaTheDev
might sound a bit frustrating, but most often, this is really almost the only advice available......
marc_s
+1 - don't forget to use explain plans and indexes
northpole
+1 for testing. But "with a few dozen test rows"? I'd use (a lot) more and possibly more than one concurrent client - if that is what will happen on the productive system. You have to stress your system enough to find a bad strategy (e.g. with bad complexity) and/or hotspots. With a small dataset you can do _anything_ and it doesn't matter. And even with a slightly bigger dataset and only one client you can't be sure if the dbms "buffers" some bottlenecks that will reveal themselves with more data/concurrent clients.
VolkerK
@VolkerK: yes, what I was trying to say is to do the testing against the REAL data load - not just a few dozen test rows. Even just the number of rows can make a huge difference
marc_s
ah darn, my latent "after 4pm" reading disability strikes again :( "with _live data_, _not_ just with a few dozen test rows". Yes, I completely agree with that.
VolkerK
+1  A: 

You want to order by the "return value" of your custom function? Then the database server can't use an index for that. The score has to be calculated for each record in the table (that hasn't been excluded with a WHERE clause) and stored in some temporary storage/table. Then the order by is performed on that temporary table. So this easily can get slower than your union queries (depending on your union statements of course).

VolkerK
Yes it can - you just need to create an expression index on the function output.
Magnus Hagander
Yes, it would be possible if it was the same function call every time, like in the manual's example "WHERE lower(col1) = 'value'". But Max wants to compute some kind of matching score for an ever changing search string. "Indexes on Expressions": http://www.postgresql.org/docs/8.4/interactive/indexes-expressional.html
VolkerK
+1  A: 

To add my little bit...

+1 to marc_s, completely agree with what he said - I would only say, you need a test db server with realistic data volumes in to test on, as opposed to production server.

For the function approach, the function would be executed for each record, and then ordered by that result - this will not be an indexed column and so I'd expect to see a negative impact in performance. However, how big that impact is and whether it is actually negative when compared to the cumulative time of the other approach, is only going to be known by testing.

AdaTheDev
+1  A: 

In PostgreSQL 8.3 and below, UNION implied DISTINCT which implied sorting, that means ORDER BY, UNION and DISTINCT were always of same efficiency, since the atter two aways used sorting.

On PostgreSQL 8.3, this query returns the sorted results:

SELECT  *
FROM    generate_series(1, 10) s
UNION
SELECT  *
FROM    generate_series(5, 15) s

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

Since PostgreSQL 8.4 it became possible to use HashAggregate for UNION which may be faster (and almost always is), but does not guarantee ordered output.

The same query returns the following on PostgreSQL 8.4:

SELECT  *
FROM    generate_series(1, 10) s
UNION
SELECT  *
FROM    generate_series(5, 15) s

10
15
8
6
7
11
12
2
13
5
4
1
3
14
9

, and as you can see the resuts are not sorted.

PostgreSQL change list mentions this:

SELECT DISTINCT and UNION/INTERSECT/EXCEPT no longer always produce sorted output (Tom)

So in new PostgreSQL versions, I'd advice to use UNION, since it's more flexible.

In old versions, the performance will be the same.

Quassnoi