views:

1435

answers:

3

Hi,

I have a very big table of measurement data in MySQL and I need to compute the percentile rank for each and every one of these values. Oracle appears to have a function called percent_rank but I can't find anything similar for MySQL. Sure I could just brute-force it in Python which I use anyways to populate the table but I suspect that would be quite inefficient because one sample might have 200.000 observations.

+1  A: 

there is no easy way to do this. see http://rpbouman.blogspot.com/2008/07/calculating-nth-percentile-in-mysql.html

Nir Levy
What I'm looking for is actually the inverse of that i.e. given a number it should tell me its rank. I'm somewhat confident this would be easier in Oracle but unfortunately that isn't a possibility.
lhahne
A: 

To get the rank, I'd say you need to (left) outer join the table on itself something like :

select t1.name, t1.value, count(distinct isnull(t2.value,0))  
from table t1  
left join table t2  
on t1.value>t2.value  
group by t1.name, t1.value

For each row, you will count how many (if any) rows of the same table have an inferior value.

Note that I'm more familiar with sqlserver so the syntax might not be right. Also the distinct may not have the right behaviour for what you want to achieve. But that's the general idea.
Then to get the real percentile rank you will need to first get the number of values in a variable (or distinct values depending on the convention you want to take) and compute the percentile rank using the real rank given above.

A: 

This is a relatively ugly answer, and I feel guilty saying it. That said, it might help you with your issue.

One way to determine the percentage would be to count all of the rows, and count the number of rows that are greater than the number you provided. You can calculate either greater or less than and take the inverse as necessary.

Create an index on your number. total = select count(*); less_equal = select count(*) where value > indexed_number;

The percentage would be something like: less_equal / total or (total - less_equal)/total

Make sure that both of them are using the index that you created. If they are not, tweak them until they are. The explain query should have "using index" in the right hand column. In the case of the select count(*) it should be using index for InnoDB and something like const for MyISAM. MyISAM will know this value at any time without having to calculate it.

If you needed to have the percentage stored in the database, you can use the setup from above for performance and then calculate the value for each row by using the second query as an inner select. The first query's value can be set as a constant.

Does this help?

Jacob

TheJacobTaylor
I actually tried that a few weeks ago and it was incredibly slow so I ended up calculating percentiles in python and putting the value in database.
lhahne
You tried to use the select count(*) and select count(*) <= yourvalue? Did you confirm that both of them were being handled by an index that only had the columns you needed? If the solution had to touch the data rows at all, I would expect it to be one or two orders of magnitude slower. If the indexes included more than the columns needed or the memory configuration of MySQL was not setup right, it to be very slow. If so, this should have been fast. Roughly how much time is "incredibly slow"? Depending on the order of magnitude of the expected response, my answer could be unwholesomely slow.
TheJacobTaylor