views:

767

answers:

1

I want grouped ranking on a very large table, I've found a couple of solutions for this problem e.g. in this post and other places on the web. I am, however, unable to figure out the worst case complexity of these solutions. The specific problem consists of a table where each row has a number of points and a name associated. I want to be able to request rank intervals such as 1-4. Here are some data examples:

name | points
Ab     14
Ac     14
B      16
C      16
Da     15
De     13

With these values the following "ranking" is created:

Query id | Rank | Name
1          1      B
2          1      C
3          3      Da
4          4      Ab
5          4      Ac
6          6      De

And it should be possible to create the following interval on query-id's: 2-5 giving rank: 1,3,4 and 4.

The database holds about 3 million records so if possible I want to avoid a solution with complexity greater than log(n). There are constantly updates and inserts on the database so these actions should preferably be performed in log(n) complexity as well. I am not sure it's possible though and I've tried wrapping my head around it for some time. I've come to the conclusion that a binary search should be possible but I haven't been able to create a query that does this. I am using a MySQL server.

I will elaborate on how the pseudo code for the filtering could work. Firstly, an index on (points, name) is needed. As input you give a fromrank and a tillrank. The total number of records in the database is n. The pseudocode should look something like this:

Find median point value, count rows less than this value (the count gives a rough estimate of rank, not considering those with same amount of points). If the number returned is greater than the fromrank delimiter, we subdivide the first half and find median of it. We keep doing this until we are pinpointed to the amount of points where fromrank should start. then we do the same within that amount of points with the name index, and find median until we have reached the correct row. We do the exact same thing for tillrank.

The result should be log(n) number of subdivisions. So given the median and count can be made in log(n) time it should be possible to solve the problem in worst case complexity log(n). Correct me if I am wrong.

+2  A: 

You need a stored procedure to be able to call this with parameters:

CREATE TABLE rank (name VARCHAR(20) NOT NULL, points INTEGER NOT NULL);

CREATE INDEX ix_rank_points ON rank(points, name);

CREATE PROCEDURE prc_ranks(fromrank INT, tillrank INT)
BEGIN
  SET @fromrank = fromrank;
  SET @tillrank = tillrank;
  PREPARE STMT FROM
  '
  SELECT  rn, rank, name, points
  FROM  (
    SELECT  CASE WHEN @cp = points THEN @rank ELSE @rank := @rn + 1 END AS rank,
            @rn := @rn + 1 AS rn,
            @cp := points,
            r.*
    FROM (
         SELECT @cp := -1, @rn := 0, @rank = 1
         ) var,
         (
         SELECT *
         FROM rank
         FORCE INDEX (ix_rank_points)
         ORDER BY
           points DESC, name DESC
         LIMIT ?
         ) r
    ) o
  WHERE rn >= ?
  ';
  EXECUTE STMT USING @tillrank, @fromrank;
END;

CALL prc_ranks (2, 5);

If you create the index and force MySQL to use it (as in my query), then the complexity of the query will not depend on the number of rows at all, it will depend only on tillrank.

It will actually take last tillrank values from the index, perform some simple calculations on them and filter out first fromrank values.

Time of this operation, as you can see, depends only on tillrank, it does not depend on how many records are there.

I just checked in on 400,000 rows, it selects ranks from 5 to 100 in 0,004 seconds (that is, instantly)

Important: this only works if you sort on names in DESCENDING order. MySQL does not support DESC clause in the indices, that means that the points and name must be sorted in one order for INDEX SORT to be usable (either both ASCENDING or both DESCENDING). If you want fast ASC sorting by name, you will need to keep negative points in the database, and change the sign in the SELECT clause.

You may also remove name from the index at all, and perform a final ORDER'ing without using an index:

CREATE INDEX ix_rank_points ON rank(points);

CREATE PROCEDURE prc_ranks(fromrank INT, tillrank INT)
BEGIN
  SET @fromrank = fromrank;
  SET @tillrank = tillrank;
  PREPARE STMT FROM
  '
  SELECT  rn, rank, name, points
  FROM  (
    SELECT  CASE WHEN @cp = points THEN @rank ELSE @rank := @rn + 1 END AS rank,
            @rn := @rn + 1 AS rn,
            @cp := points,
            r.*
    FROM (
         SELECT @cp := -1, @rn := 0, @rank = 1
         ) var,
         (
         SELECT *
         FROM rank
         FORCE INDEX (ix_rank_points)
         ORDER BY
           points DESC
         LIMIT ?
         ) r
    ) o
  WHERE rn >= ?
  ORDER BY rank, name
  ';
  EXECUTE STMT USING @tillrank, @fromrank;
END;

That will impact performance on big ranges, but you will hardly notice it on small ranges.

Quassnoi
Looks very nice, what is the complexity of this query, is it within log(n) range and if so can you explain how come. One thing missing though is sorting on the name as 2nd priority if two rows have the same amount of points.
Per Stilling
The algorithm my rank is counted on is that the ones with most points have rank 1 and if some people have the same amount of points I want them sorted according to their name. So if two people have max points,the row with third most points is recorded as rank 3, while the two others both have rank 1
Per Stilling
Why "De" has rank 5 then?
Quassnoi
Thats a typo, very sorry.
Per Stilling
Thank you for the great response!
Per Stilling
I did some testing, and as far as I can see the time does increase as soon as you choose small intervals in the high numbers.
Per Stilling
I think it is due to the limit clause, are you sure it can be optimized. I thought limit was a convenience clause and wasnt optimized in any way. I did a 50 record request at offset 90000-90050 and took a second, thats a bit slow computer though, but it took 200ms getting rank 1-50.
Per Stilling
That's exactly what I said, read my post carefully. It depends NOT on the interval itself, but on the tillrank (higher or your bounds). It cannot be further optimized, because when you need to take records from 999,900 to 1,000,000, you need to look on whole million.
Quassnoi
And did you create the index and forced it into the query?
Quassnoi
Ah, and ordering by name also matters. The index will work only if points and names are sorted in one direction, as MySQL does not support DESC clause in indexes. You'll need to keep NEGATIVE points in the database, and change sign in the query if you want to sort on names in ascending order.
Quassnoi
Hmm ok, I understood this as if it didnt matter how records there were, but how many you want to fetch:"Time of this operation, as you can see, depends only on how many records you need, it does not depend on how many records are there."
Per Stilling
I created your algorithm on my data, and created the index you also created and forced it exactly like you do. I do, however, think it should be possible to do the operation in log(n) time. I will elaborate in my question how I think it can be implemented.
Per Stilling
Did you change the names ordering from DESC to ASC? And what MySQL version are you using?
Quassnoi
If you use index, this operation will take LOG(n) time. N here means higher bound of your range, not the overall number of records.
Quassnoi
Ok, I will always be using a constant range, my problem is scaling according to the total number of records. I've posted an explanation of the solution I assume could work in log(n) time. I have not changed the name ordering and I believe you that within query range the query you made work in log n
Per Stilling
MySQL Server version: 5.0.51
Per Stilling
See updated post. It's very strange that you get 200 ms on (1, 50), with an index this should work instantly. I'm checking on MySQL 5.1.28, maybe FORCE support is broken in 5.0.51.
Quassnoi
Try to issue this: SELECT * FROM rank FORCE INDEX (ix_rank_points) LIMIT 10 and see how long does it take. It should return 10 your LEAST points and should take less than 10 milliseconds. If it doesn't, then it's something with FORCE INDEX .
Quassnoi
Here is my database, maybe the fact that it is innodb makes it slower.http://pastebin.com/m655e99ea
Per Stilling
Sorry if it's a stupid question, but did you replace "ix_rank_points" with "points2" in the query?
Quassnoi
That query takes 0.0003 seconds
Per Stilling
http://pastebin.com/m49c0ba13 here's the function i use
Per Stilling
However, I figured it out why it did 200ms, it's the EMS SQL query tool I have, it adds traffic delay. So 200ms is probably instant
Per Stilling
And how long does this function takes on (0, 50) and on (200000, 200050)?
Quassnoi
Can you look at the pseudocode i wrote, do you know whether count and median can be made with logn or in constant time using indexes ?
Per Stilling
Post a new question, I'll look it tomorrow
Quassnoi
100k takes about 800ms sec, dont have more than 100k in the database at the time being. The 0-50 is instant.
Per Stilling
Okay I will do that.
Per Stilling