+2  A: 

Ok, so 900K rows isn't a massive table; it's reasonably big, but your queries really shouldn't be taking that long.

First things first, which of the 3 statements above is taking the most time?

The first problem I see is with your first query. Your WHERE clause doesn't include an indexed column, which means it has to do a full table scan.

Create an index on the "date_updated" column, then run the query again and see what that does for you.
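For instance, something like this (just a sketch -- assuming the table is the eev0 table referred to in the answer below, and that the column really is date_updated):

CREATE INDEX idx_eev0_date_updated ON eev0 (date_updated);

With that in place the optimizer at least has the option of a range scan on date_updated instead of reading every row.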

If you don't need the hashes and are only using them to avail of the dark magic, then remove them completely.

Edit: Someone with more SQL-fu than me will probably reduce your whole set of logic into one SQL statement without the use of the temporary tables.

Edit: My SQL is a little rusty, but are you joining twice in the third SQL statement? Maybe it won't make a difference, but shouldn't it be:

SELECT temp1.element_id, 
   temp1.category, 
   temp1.source_prefix, 
   temp1.source_name, 
   temp1.date_updated, 
   AVG(temp1.value) AS avg_value,
   SUM(temp1.value * temp1.weight) / SUM(temp1.weight) AS rating
FROM temp1 LEFT JOIN temp2 ON temp1.subcat_hash = temp2.subcat_hash
WHERE temp1.date_updated = temp2.maxdate
GROUP BY temp1.cat_hash;

or

SELECT temp1.element_id, 
   temp1.category, 
   temp1.source_prefix, 
   temp1.source_name, 
   temp1.date_updated, 
   AVG(temp1.value) AS avg_value,
   SUM(temp1.value * temp1.weight) / SUM(temp1.weight) AS rating
FROM temp1, temp2
WHERE temp2.subcat_hash = temp1.subcat_hash
AND temp1.date_updated = temp2.maxdate
GROUP BY temp1.cat_hash;
Glen
Last one. First is near instant, second is about 23 minutes.
Kuroki Kaze
I can remove the hashes, but then the query will take an infinite amount of time (okay, maybe not, but neither I nor the clients have that much patience). I suppose these hashes can be made into indexes somehow.
Kuroki Kaze
Don't think the index suggestion makes sense. An aggregate query like this will always result in a full table scan.
Andomar
+4  A: 

Using hashes is one of the ways in which a database engine can execute a join. It should be very rare that you'd have to write your own hash-based join, and this certainly doesn't look like one of those cases: a 900k-row table with some aggregates.

Based on your comment, this query might do what you are looking for:

SELECT cur.source_prefix, 
       cur.source_name, 
       cur.category, 
       cur.element_id,
       MAX(cur.date_updated) AS DateUpdated, 
       AVG(cur.value) AS AvgValue,
       SUM(cur.value * cur.weight) / SUM(cur.weight) AS Rating
FROM eev0 cur
LEFT JOIN eev0 next
    ON next.date_updated < '2009-05-01'
    AND next.source_prefix = cur.source_prefix 
    AND next.source_name = cur.source_name
    AND next.element_id = cur.element_id
    AND next.date_updated > cur.date_updated
WHERE cur.date_updated < '2009-05-01'
AND next.category IS NULL
GROUP BY cur.source_prefix, cur.source_name, 
    cur.category, cur.element_id

The GROUP BY performs the calculations per source+category+element.

The JOIN is there to filter out old entries. It looks for later entries, and then the WHERE clause filters out the rows for which a later entry exists. A join like this benefits from an index on (source_prefix, source_name, element_id, date_updated).
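Creating it would look something like this (sketch only; the index name is arbitrary):

CREATE INDEX idx_source_element_date
    ON eev0 (source_prefix, source_name, element_id, date_updated);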

There are many ways of filtering out old entries, but this one tends to perform reasonably well.
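For comparison, the same "no later entry exists" test can be written with a correlated subquery instead of the self-join -- roughly like the sketch below (an earlier revision of this answer used subqueries; as the comments note, MySQL coped better with the join form here):

SELECT cur.source_prefix,
       cur.source_name,
       cur.category,
       cur.element_id,
       MAX(cur.date_updated) AS DateUpdated,
       AVG(cur.value) AS AvgValue,
       SUM(cur.value * cur.weight) / SUM(cur.weight) AS Rating
FROM eev0 cur
WHERE cur.date_updated < '2009-05-01'
AND NOT EXISTS (
    SELECT 1
    FROM eev0 later
    WHERE later.source_prefix = cur.source_prefix
    AND later.source_name = cur.source_name
    AND later.element_id = cur.element_id
    AND later.date_updated > cur.date_updated
    AND later.date_updated < '2009-05-01'
)
GROUP BY cur.source_prefix, cur.source_name,
    cur.category, cur.element_id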

Andomar
Okay, I'll try to explain. There are measurements in this table. Each measurement has a source (identified by prefix + name) and a category. Each element can have measurements in all categories, or just in some. What I want to do is find the latest measurement for each element from a source, then calculate the weighted average per element+category. Sorry for my English, btw - not my main language :\
Kuroki Kaze
Post updated. Is the date_updated *exactly* equal for all of the latest measurements? Or are they just on the same day?
Andomar
They're just the latest for the same source and element. They may vary.
Kuroki Kaze
Edited again so it looks for the latest date_updated per source+element. It then groups on all the categories that have a measurement for that particular date_updated.
Andomar
Added picture to post :) In the meantime I'll try your query, thanks :)
Kuroki Kaze
Updated for the cut-off date... now I'm curious how you'll get it to work :) hehe
Andomar
At least I always have the option of leaving the current monstrosity in place and labeling it as "Magic" for future developers :)
Kuroki Kaze
The answer includes rows for 2009-04-29 and 2009-04-30 that the original doesn't. :D
Jonathan Leffler
Ok, index created in 18 min 52.54 sec. On to the query :)
Kuroki Kaze
66 hours and counting. Too long :(
Kuroki Kaze
Maybe it's trying a hash join and it doesn't fit into memory. Try replacing "LEFT JOIN" with "LEFT LOOP JOIN". How much memory does the server have? Maybe post the result of "EXPLAIN <query>".
Andomar
Added EXPLAIN for the solution. It seems to me it can't use indexes for the join.
Kuroki Kaze
Did you try the LOOP join? Can you post EXPLAIN EXTENDED and SHOW WARNINGS, like in this blog post? http://www.mysqlperformanceblog.com/2006/07/24/extended-explain/
Andomar
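For reference, the two statements go back to back in the same session, something like this (sketch -- the SELECT here is only a stand-in; the full query from the answer goes in its place):

EXPLAIN EXTENDED
SELECT cur.element_id, MAX(cur.date_updated)
FROM eev0 cur
WHERE cur.date_updated < '2009-05-01'
GROUP BY cur.element_id;

SHOW WARNINGS;  -- note 1003 shows the query as the optimizer rewrote it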
Also, no "LOOP JOIN" in MySQL, sorry )
Kuroki Kaze
Added EXPLAIN EXTENDED. The only warning shows my query with code 1003.
Kuroki Kaze
I wonder why it's not using the index. I've edited the query so it doesn't contain subqueries; does this allow MySQL to use the index? Maybe copy a subset from eev0 to testeev0 for testing.
Andomar
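A quick way to make such a test copy, as a sketch (testeev0 is just the name from the comment above, and the row limit is arbitrary; note that CREATE TABLE ... SELECT does not copy indexes, so they would need to be recreated on the copy):

CREATE TABLE testeev0 AS
SELECT *
FROM eev0
LIMIT 100000;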
Whoa, EXPLAIN for this looks good (added to post). Waiting for actual query to finish :)
Kuroki Kaze
Also, do all the ambiguous column names belong to the `cur` instance of the table?
Kuroki Kaze
Prefixed the column names; not sure what MySQL does with ambiguous names, I would expect it to throw an error.
Andomar
It throws errors, yes. Though it seems I guessed correctly )
Kuroki Kaze
35 minutes!!! This is a victory :)
Kuroki Kaze
Now I'll check the resulting values - what do we have here? :)
Kuroki Kaze
A query like this should run in seconds, not minutes, so I'm still confused. Don't forget to verify the results for a sample!
Andomar
It seems to me that all the data at this point was relevant, so the date filter removed exactly zero items (this table is filled in by a remote script). So it was an aggregate over a 180 MB file. You think it should take seconds?
Kuroki Kaze
I guess it depends on the amount of memory you have, but if you have 1GB of RAM, then a query over 180 MB should run in seconds. Here I just did an aggregate over a 9 gigabyte table with 22 million rows, and it finished in less than 1 second (not joining on anything.)
Andomar
I'm really not into FreeBSD, but I presume the server should have at least a gig.
Kuroki Kaze
The second run finished in 27 minutes (the first one was printing directly to the screen, the second uses "CREATE TABLE"), and the results seem reasonable. Net buffer length is set to ~8k, max_allowed_packet is ~16M. I suppose this limits the memory for the server?
Kuroki Kaze
Those sound like network settings; you'd check the memory with "ps" or "top" on the server (I guess -- I'm not into FreeBSD either!)
Andomar
Those are MySQL settings. I'm still curious whether I can squeeze this query into seconds (or at least under 10 minutes). Thanks for the help :)
Kuroki Kaze
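For what it's worth, net_buffer_length and max_allowed_packet are client/server protocol settings, not limits on how much memory a join or sort can use; the buffers that actually affect joins and sorts can be checked with something like this (sketch -- which of them matters depends on the MySQL version and storage engine):

SHOW VARIABLES LIKE 'key_buffer_size';
SHOW VARIABLES LIKE 'sort_buffer_size';
SHOW VARIABLES LIKE 'join_buffer_size';
SHOW VARIABLES LIKE 'tmp_table_size';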
You could try to load the data into Sql Server Express http://www.microsoft.com/express/sql/default.aspx
Andomar
Thanks, I don't think I'm ready to move this project onto SQL Server :)
Kuroki Kaze