Here is my query:

select word_id, count(sentence_id) 
from sentence_word 
group by word_id 
having count(sentence_id) > 100;

The table sentence_word contains 3 fields: word_id, sentence_id, and a primary key id. It has 350k+ rows. This query takes a whopping 85 seconds, and I'm wondering (hoping, praying?) if there is a faster way to find all the word_ids that have more than 100 sentence_ids.

I've tried taking out the select count part and just doing 'having count(1)', but neither change speeds it up.

I'd appreciate any help you can lend. Thanks!

+1  A: 

If that query is often performed, and the table rarely updated, you could keep an auxiliary table with word ids and corresponding sentence counts -- hard to think of any further optimization beyond that!

Alex Martelli
You mean, like an index? :-)
bignose
+6  A: 

If you don't already have one, create a composite index on sentence_id, word_id.
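
A sketch of such an index (the index name is illustrative; adjust the syntax for your SQL product):

```sql
CREATE INDEX ix_sentence_word ON sentence_word (sentence_id, word_id);
```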

Mitch Wheat
I believe the right order of the columns for this index would be (word_id, sentence_id).
Irina C
@Irina C: I just tested on SQL server with the index columns both ways around, and despite differing execution plans, the amount of logical IO was the same.
Mitch Wheat
+1  A: 

Your query is fine, but it needs a bit of help (indexes) to get faster results.

I don't have my resources at hand (or access to SQL), but I'll try to help you from memory.

Conceptually, the only way to answer that query is to count all the records that share the same word_id. That means that the query engine needs a fast way to find those records. Without an index on word_id, the only thing the database can do is go through the table one record at a time and keep running totals of every single distinct word_id it finds. That would usually require a temporary table and no results can be dispatched until the whole table is scanned. Not good.

With an index on word_id, it still has to go through the table, so you would think it wouldn't help much. However, the SQL engine can now compute the count for each word_id without waiting until the end of the table: it can dispatch the row and the count for that value of word_id (if it passes your HAVING clause), or discard the row (if it doesn't). That results in lower memory load on the server, possibly partial responses, and the temporary table is no longer needed. A second aspect is parallelism: with an index on word_id, SQL can split the job into chunks and use separate processor cores to run the query in parallel (depending on hardware capabilities and existing workload).

That might be enough to help your query; but you will have to try to see:

CREATE INDEX someindexname ON sentence_word (word_id)

(T-SQL syntax; you didn't specify which SQL product you are using)

If that's not enough (or doesn't help at all), there are two other solutions.

First, SQL allows you to precompute the COUNT(*) by using indexed views and other mechanisms. I don't have the details at hand (and I don't do this often). If your data doesn't change often, that would give you faster results but with a cost in complexity and a bit of storage.
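
On SQL Server, for example, this can be done with an indexed view; a hedged sketch (the view and index names are illustrative, and indexed views require SCHEMABINDING and COUNT_BIG):

```sql
CREATE VIEW dbo.word_sentence_counts
WITH SCHEMABINDING
AS
SELECT word_id, COUNT_BIG(*) AS num_sentences
FROM dbo.sentence_word
GROUP BY word_id;
GO

-- Materializes the view; the counts are then maintained automatically.
CREATE UNIQUE CLUSTERED INDEX ix_word_sentence_counts
    ON dbo.word_sentence_counts (word_id);
```

Keep in mind that the server then pays the maintenance cost on every insert/update/delete against sentence_word, which is why this fits best when the data changes rarely.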

Also, you might want to consider storing the results of the query in a separate table. That is practical only if the data never changes, or changes on a precise schedule (say, during a data refresh at 2 in the morning), or if it changes very little and you can live with imperfect results for a few hours (you would have to schedule a periodic data refresh); that's the moral equivalent of a poor man's data warehouse.
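
A minimal sketch of that summary-table approach (table and column names are illustrative; the refresh would run on your schedule):

```sql
-- One-time setup: a table holding precomputed counts per word.
CREATE TABLE word_counts (
    word_id        INT NOT NULL PRIMARY KEY,
    sentence_count INT NOT NULL
);

-- Periodic refresh (e.g. the 2 AM job): rebuild the summary.
TRUNCATE TABLE word_counts;
INSERT INTO word_counts (word_id, sentence_count)
SELECT word_id, COUNT(*)
FROM sentence_word
GROUP BY word_id;

-- The original question then becomes a trivial indexed lookup:
SELECT word_id, sentence_count
FROM word_counts
WHERE sentence_count > 100;
```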

The best way to find out for sure what works for you is to run the query and look at the query plan with and without some candidate indexes like the one above.

Euro Micelli
Thanks for the helpful response! In fact, all the relevant fields are indexed already... But simply changing the query to count(*) instead of count(sentence_id) was the answer!! Night and day. So that seems to be it. I think for some reason I assumed count() with a specific field was more efficient than with *, but that now appears to be a silly assumption. Will do a few more checks to confirm this was the problem.
Jeff
+3  A: 

having count(sentence_id) > 100;

There's a problem with this... Either the table has duplicate word/sentence pairs, or it doesn't.

If it does have duplicate word/sentence pairs, you should be using this code to get the correct answer:

HAVING COUNT(DISTINCT Sentence_ID) > 100


If the table does not have duplicate word/sentence pairs... then you shouldn't count sentence_ids, you should just count rows.

HAVING COUNT(*) > 100

In which case, you can create an index on *word_id only*, for optimum performance.
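
Putting it together, assuming no duplicate word/sentence pairs, the rewritten query would be:

```sql
SELECT word_id, COUNT(*)
FROM sentence_word
GROUP BY word_id
HAVING COUNT(*) > 100;
```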

David B
Amazing how something as simple as an * is the difference between an 85-second query and literally 0.3 seconds! Thanks, everyone-
Jeff
A: 

There is, surprisingly, an even faster way to accomplish that on large data sets:

SELECT totals.word_id, totals.num 
  FROM (SELECT word_id, COUNT(*) AS num FROM sentence_word GROUP BY word_id) AS totals
 WHERE num > 100;
Max