views:

34

answers:

1

I have two tables defined like this:

Page(id), Index(page_id, word)

page_id in Index is a foreign key to Page so that each Page is connected to a group of Index entries. The Index table is a index for the Page table so that you can do fast text searching. E.g:

SELECT page_id FROM Index where word = 'hello'

Would select all page_id's for all pages containing the word 'hello'. But now I want to select all page_id's for pages that contain all of the words 'word1', 'word2' and 'word3'. The best query I can come up with for this is:

SELECT page_id
FROM Index
WHERE word IN ('word1', 'word2', 'word3')
GROUP BY page_id
HAVING COUNT(1) = 3;

It works, but I wonder if someone can think of an alternative more efficient query?

[edit] The above example is slightly simplified. In the real Index table word is replaced with a word_id column that references a Word table. But the basic approach is the same. The RDBMS is PostgreSQL and there is about 2m rows in the Index table and 20k rows in Page.

A: 

As a small thing, I wouldn't call a table Index, that's bound to be confusing :)

Your query would find pages with 3 index entries for word1. That might not be a problem, but you could change the having to:

HAVING COUNT(DISTINCT word) = 3

to avoid this problem.

How this query performs compared to other depends on a lot of factors, like keyword density, the DBMS you use, and the total amount of rows. I would not worry about it until you encountered an actual performance problem.

Andomar
Thanks, you are right. In this case page_id and word is unique together so the distinct check is unnecessary. I have verified that it is this query that is causing a performance degradation.
Björn Lindqvist
@Björn Lindqvist: then post the full problem with the query plan "explain <statement>" :)
Andomar