I have a batch job that I run on a table which I'm sure I could write as a prepared statement. Currently it's all in Java and no doubt less efficient than it could be. For a table like so:

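CREATE TABLE thing (
  `tag` varchar(255),      -- length is assumed; MySQL requires one for varchar
  `document` varchar(255), -- length is assumed; MySQL requires one for varchar
  `weight` float
)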

I want to create a new table that contains the top N entries for every tag. Currently I do this:

create a new table with the same schema
select distinct tag
for each tag:
  select * where tag = <that tag> limit N, and insert the rows into the new table

This requires executing a query to get the distinct tags, then selecting the top N items for that tag and inserting them... all very inefficient.
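
For reference, the loop described above can be sketched as follows. This is only an illustration, not the actual Java job: it uses Python's built-in sqlite3 as a stand-in for MySQL, the sample rows and the `thing_top` table name are made up, and `ORDER BY weight DESC` assumes that "top" means highest weight.

```python
import sqlite3

# Made-up sample rows: (tag, document, weight)
ROWS = [
    ("a", "d1", 0.9), ("a", "d2", 0.5), ("a", "d3", 0.7), ("a", "d4", 0.1),
    ("b", "d5", 0.3), ("b", "d6", 0.8),
]

def make_db():
    """Build an in-memory stand-in for `thing` plus an empty target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE thing (tag TEXT, document TEXT, weight REAL)")
    conn.execute("CREATE TABLE thing_top (tag TEXT, document TEXT, weight REAL)")
    conn.executemany("INSERT INTO thing VALUES (?, ?, ?)", ROWS)
    return conn

def copy_top_n_loop(conn, n):
    """Per-tag loop: one query for the tags, then one INSERT ... SELECT per tag."""
    tags = [row[0] for row in conn.execute("SELECT DISTINCT tag FROM thing")]
    for tag in tags:
        conn.execute(
            "INSERT INTO thing_top (tag, document, weight) "
            "SELECT tag, document, weight FROM thing "
            "WHERE tag = ? ORDER BY weight DESC LIMIT ?",  # assumed ordering
            (tag, n),
        )
    return 1 + len(tags)  # number of statements issued

if __name__ == "__main__":
    conn = make_db()
    print(copy_top_n_loop(conn, 2), "statements issued")
    print(conn.execute(
        "SELECT tag, document FROM thing_top ORDER BY tag, weight DESC"
    ).fetchall())
```

The statement count grows with the number of distinct tags, which is the cost the question is trying to avoid.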

Is there a stored procedure (or even a simple query) that I could use to do this? If dialect is important, I'm using MySQL.

(And yes, I do have my indexes sorted!)

Cheers

Joe

A: 

If you were using SQL Server, I would suggest using the ROW_NUMBER function, partitioned by tag, and selecting the rows where row_number <= N. (In other words, order and number the rows for each tag according to their position within the tag group, then pick the top N rows from each group.) I found an article about simulating the ROW_NUMBER function in MySQL here:

http://www.xaprb.com/blog/2006/12/02/how-to-number-rows-in-mysql/

See if this helps you out!
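
As a rough sketch of the ROW_NUMBER idea: MySQL 8.0 and later support window functions natively, so the statement below should carry over there largely unchanged, but here it is run through Python's built-in sqlite3 (SQLite 3.25 or newer) purely so that it is executable. The sample rows, the `thing_top` table name and the descending sort on weight are assumptions, not part of the question.

```python
import sqlite3

# Made-up sample rows: (tag, document, weight)
ROWS = [
    ("a", "d1", 0.9), ("a", "d2", 0.5), ("a", "d3", 0.7), ("a", "d4", 0.1),
    ("b", "d5", 0.3), ("b", "d6", 0.8),
]

# Number the rows inside each tag by descending weight, keep rows 1..N.
TOP_N_SQL = """
INSERT INTO thing_top (tag, document, weight)
SELECT tag, document, weight
FROM (
    SELECT tag, document, weight,
           ROW_NUMBER() OVER (PARTITION BY tag ORDER BY weight DESC) AS rn
    FROM thing
) AS ranked
WHERE rn <= ?
"""

def make_db():
    """Build an in-memory stand-in for `thing` plus an empty target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE thing (tag TEXT, document TEXT, weight REAL)")
    conn.execute("CREATE TABLE thing_top (tag TEXT, document TEXT, weight REAL)")
    conn.executemany("INSERT INTO thing VALUES (?, ?, ?)", ROWS)
    return conn

def copy_top_n_window(conn, n):
    """Fill thing_top with the N heaviest rows per tag in a single statement."""
    conn.execute(TOP_N_SQL, (n,))
    return conn.execute(
        "SELECT tag, document FROM thing_top ORDER BY tag, weight DESC"
    ).fetchall()

if __name__ == "__main__":
    print(copy_top_n_window(make_db(), 2))
```

Note that the filter is `rn <= N`, since row numbers start at 1. Unlike a count-based approach, ROW_NUMBER returns exactly N rows per tag even when weights tie, breaking the tie arbitrarily.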

RMorrisey
A: 

I haven't done this in a while (spoiled by CTEs in SQL Server), and I'm assuming that your data is ordered by weight; try:

SELECT tag, document, weight
FROM thing
WHERE (SELECT COUNT(*)
       FROM thing as t
       WHERE t.tag = thing.tag AND t.weight < thing.weight
) < N;

I think that will do it.

EDIT: corrected error in code; need < N, not <= N.
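
To make the counting logic concrete, here is a small sketch that runs the same correlated subquery through Python's built-in sqlite3 (a stand-in for MySQL; the sample rows and function name are made up). The subquery counts how many rows of the same tag rank ahead of the current row, and it is that count, not the weight, which is compared with N. With `<` as in the query above, the rows kept are those with fewer than N lighter rows, i.e. the N lowest weights per tag; if "top" is meant as the highest weights, the comparison would need to be `>` instead.

```python
import sqlite3

# Made-up sample rows: (tag, document, weight)
ROWS = [
    ("a", "d1", 0.9), ("a", "d2", 0.5), ("a", "d3", 0.7), ("a", "d4", 0.1),
    ("b", "d5", 0.3), ("b", "d6", 0.8),
]

# Fixed whitelist of comparison operators, so nothing user-supplied
# is ever interpolated into the SQL text.
OPS = {"highest": ">", "lowest": "<"}

def make_db():
    """Build an in-memory stand-in for the `thing` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE thing (tag TEXT, document TEXT, weight REAL)")
    conn.executemany("INSERT INTO thing VALUES (?, ?, ?)", ROWS)
    return conn

def top_n_per_tag(conn, n, which="highest"):
    """Keep rows that have fewer than n same-tag rows ranked ahead of them."""
    op = OPS[which]
    sql = f"""
        SELECT tag, document, weight
        FROM thing
        WHERE (SELECT COUNT(*)
               FROM thing AS t
               WHERE t.tag = thing.tag AND t.weight {op} thing.weight
        ) < ?
        ORDER BY tag, weight DESC
    """
    return conn.execute(sql, (n,)).fetchall()

if __name__ == "__main__":
    conn = make_db()
    print(top_n_per_tag(conn, 2, "highest"))
    print(top_n_per_tag(conn, 2, "lowest"))
```

One caveat: if several rows in a tag share the same weight at the cut-off, this form can return more than N rows for that tag, because each tied row sees the same count.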

Stuart Ainsworth
Surely that would get all weights under the value of N? What I want is the top N values (i.e. `LIMIT N`).
Joe
If you use <= thing.weight, then you need <= N. If you use < thing.weight, then you need < N. Try it and see :)
Stuart Ainsworth