views:

270

answers:

3

I'm trying to find some suggestions in optimizing a query that I'm using to fetch a large group of data.

The original code that I'm working on looped through a large set of users, and calculated a date range for each one of them. It would then take that date range and query how many questions they answered and how many they got correct within that date range. Those results were tallied up, and it is those end tallies that we need.

What I've already done to speed this up (because it was taking several minutes) is this: Rather than querying each user individually, the script now just loops through each user calculates the date range that would be applicable to them (all other aspects of the query are identical for each user). That data is collected in a 3d array [startDate][endDate][userid], and a single query built to do that operaion on all of the users. Here's an example of the query that gets output:


SELECT COUNT(uapl.id) AS numAnswered,
SUM(CASE WHEN (a.correct OR q.survey OR uapl.answersId IS NULL) THEN 1 ELSE 0 END) AS numCorrect
FROM usersAnswersProgramsLink uapl
JOIN questions q ON uapl.questionsId=q.id
LEFT JOIN answers a ON uapl.answersId=a.id
WHERE
programsId=123
AND
(
  (
    CAST(timestamp AS date) >= '2009-09-01'
    AND CAST(timestamp AS date) <= '2009-09-21'
    AND usercontextid in('123','234','345','465','567')
  )
  OR
  (
    CAST(timestamp AS date) >= '2009-09-10'
    AND CAST(timestamp AS date) <= '2009-09-21'
    AND usercontextid in('321','432','543')
  )
  OR
  (
    CAST(timestamp AS date) >= '2009-09-16'
    AND CAST(timestamp AS date) <= '2009-09-21'
    AND usercontextid in('987','876')
  )
) 

This works relatively well in speeding up the code. With most of the tests I run on this, it now takes between 20% and 10% as long. With my worst case though, it's only 50%, and I'd like to improve that.

That worst case happens when I have a huge number of user id's to compare (ten-thousandish). The problem now is that there is no more optimizing to be done on the algorithm that I pulled those queries out of. It now goes by in milliseconds. It's this query that takes a long time.

So that's my conundrum. I'd like to speed it up even more. Any suggestions would be welcome. A couple of pieces of info that are pertinent here:

1) There is a 1 to many relationship between the date ranges and the users. none of those user id's will show up in multiple date ranges. 2) The end result we're looking for is just those tallies, but the date ranges need to be calculated on a per-user basis, hence the array of id's per date range.

One thing that I thought ~might~ make it faster would be to crate a temporary table with a column for the date range and a column for the user id's. Then rewrite that query using a JOIN to that table rather than putting those numbers in the query itself. Does anyone know if that would work?

Thanks for any suggestions!

A: 

One thing that I thought ~might~ make it faster would be to crate a temporary table with a column for the date range and a column for the user id's. Then rewrite that query using a JOIN to that table rather than putting those numbers in the query itself. Does anyone know if that would work?

That would be the approach I would take. It will also make the query clearer. You can add indexes to the temp table too, though you should do this after filling it with data. Don't assume you need an index though - test.

Oh - you might want to store timestamps rather than dates (it'll save casting) and perhaps an index on the "timestamp" column in your answers table.

PS - generally considered better not to name columns the same as built-in types. Even if the database doesn't get confused a human reader can.

Richard Huxton
A: 

First, I'd suggest you do add a coarse filter that would use the indexes on usercontextid and timestamp:

SELECT  COUNT(uapl.id) AS numAnswered,
        SUM(CASE WHEN (a.correct OR q.survey OR uapl.answersId IS NULL) THEN 1 ELSE 0 END) AS numCorrect
FROM    questions q
JOIN    usersAnswersProgramsLink uapl
ON      uapl.questionsId = q.id
LEFT JOIN
        answers a
ON      a.id = uapl.answersId
WHERE   programsId=123
        AND timestamp >= '2009-09-01'
        AND timestamp < '2009-09-22'
        AND usercontextid IN (/* all possible values here */)
        AND 
(
  (
    CAST(timestamp AS date) >= '2009-09-01'
    AND CAST(timestamp AS date) <= '2009-09-21'
    AND usercontextid in('123','234','345','465','567')
  )
  OR
  (
    CAST(timestamp AS date) >= '2009-09-10'
    AND CAST(timestamp AS date) <= '2009-09-21'
    AND usercontextid in('321','432','543')
  )
  OR
  (
    CAST(timestamp AS date) >= '2009-09-16'
    AND CAST(timestamp AS date) <= '2009-09-21'
    AND usercontextid in('987','876')
  )
)

You also need to clarify which tables do all these field belong to.

Quassnoi
+2  A: 

as already mentioned: please provide the result of EXPLAIN ANALYZE <query> as well as table structures and created indexes, without that it will be difficult to help

a index on timestamp::date could help ( a index on timestamp would not be used because of the cast)

you could also post the explain analyze output into http://explain.depesz.com/ which will highlight the problematic places in the execution plan

pfote
Yes, http://explain.depesz.com/ is great.
bortzmeyer