views:

144

answers:

2

Hi everyone, I've been stuck with the rather famous problem of ranking students by grade for a couple weeks now, and while I've learned much, I still haven't solved my problem (the ranks are generated, but the process is too slow):

I have a large table (320,000 rows) that contains the student codes (serves as an identifier, instead of their names), the students classroom, the test , the tests date, the subject, the question number and the students grade on that question. This table is the base for everything else that is calculated and its size makes all these calculations very very slow, to the point where I find me almost breaking everything here at work.

First, some intel on the school (very little info, required to understand the problem)
Here at the school we have weekly tests over several subjects. The school is also separated in classrooms with different purposes (one is focused on math, physics and chemistry, another one is focused on biology, and the last one focuses on history, Portuguese and geography). But they all do the same tests every week.

What we want to do is calculate the standard deviation for each question for everyone in the school (not per-classroom) and the average grade per question (also for everyone in the school), and then generate the following ranks (all of them per date):

-Rank per subject per classroom (with "raw" grades), Rank per subject considering the whole school (with "raw" grades) and Rank per subject considering the whole school (using normalized grades, with the standard deviation per question and the average grade per question information)
-The same ranks that were mentioned above, but not per Subject, considering instead all subjects

As you can see, after calculating the average grades and the standard deviations, we still need to calculate the sums of the grades on each question, and rank according to these sums (the actual subject/test grades). I've attacked this problem in a few ways:

1) Created two tables, one with the grades per student per subject (fields: Students code, Students classroom, Date of test, Subject, Grade, Normalized Grade, Rank in Classroom, Rank in School, Rank in School using normalized grades) and another with the grades per student per test (all subjects taken into account, fields: Students code, Students classroom, Date of test, Grade, Normalized Grade, Rank in Classroom, Rank in School, Rank in School using normalized grades).

The insertion of data in these tables takes about 50 seconds

Then, I tried using SQL to rank, however, I ran into some problems:
-Access has no ROW_NUMBER or RANK functions, and thus I have to use queries with COUNT, like (below is just a simplified version):

SELECT 1+(SELECT Count(*) FROM grades_table_per_subject t2 WHERE 
t2.Grade > t1.Grade AND t1.Date=t2.Date AND t1.Subject=t2.Subject) AS [Global Rank],
1+(SELECT Count(*) FROM grades_table_per_subject t3 WHERE t3.Grade > t1.Grade AND
t3.Date=t1.Date AND t3.Subject=t1.Subject AND t3.Classroom=t1.Classroom) AS
[Rank in classroom] FROM grades_table_per_subject;

*There still is the rank with the normalized grades in the query above, but I omitted it.
The table grades_table_per_subject has about 45,000 lines and this query takes more than 15 minutes here, even with indexing (tried many different index combinations, even some odd ones when I saw that the ones that should work didn't).
I also tried to ORDER BY Count(*) DESC the inner selects, but I hit ctrl+break after 7 minutes and no results.

2) Added the following fields to the tables above: Rank in Classroom, Rank in School, Rank in School using normalized grades

Then I tried using VBA with DAO and manually update the Rank fields, running the following code (simplified version):

Set rs = CurrentDb.OpenRecordset("SELECT Classroom, Date, Subject, Grade, [Rank in classroom] FROM
grades_table_per_subject ORDER BY Date, Classroom, Subject, Grade DESC;", dbOpenDynaset)
...
...
rs.movefirst
i=1
While Not rs.eof
 'Verifies if there was a change on either one of Subject, Classroom, Date and if so:
 ...
  i = 1
 ...

 rs.Edit
 rs![Rank in classroom]=i
 rs.Update

 i = i + 1
 rs.movenext
Wend
rs.close

This obviously builds only one of the ranks (in this case per subject per classroom), and it takes alone 3min 10sec.
I verified that it takes so long due to the writes on the table (rs.Edit and rs.Update are the culprits, commenting them makes the whole thing run in only 4 seconds), but I need the ranks written to the table to generate an access report later.

FINALLY:
I could generate all the ranks once and make ways for the users to access all the data very quickly, but the idea is that everything should be calculated on-the-fly. The times we have achieved, however, make this impossible.

Overall, the question to be asked is the following:
-Is there a way to calculate the ranks shown above through an Access Query under 10 seconds, or to use VBA and calculate-insert these ranks to the table in a similar time considering the size of the tables used here?

Also, I would love to see a list of efficient ranking algorithms, so that even if I can't do everything quickly, I can improve it as much as possible.

A: 

I could generate all the ranks once and make ways for the users to access all the data very quickly, but the idea is that everything should be calculated on-the-fly.

Why?

Why bother regenerating the same data over and over? It's most likely preferable to generate these statistics when the data changes and just look them up every other time. Redoing work you've already done whenever somebody wants to check something is just silly.

Anon.
I agree it is silly, but these are demands I must meet (my boss wants it like this, and we've discussed a lot about this).
Marcelo Zabani
To inform you a little more of why my boss wants it to be generated on-the-fly: He doesn't want more data in the database, if everything could be quickly generated through queries we would just have the original grades_per_question table
Marcelo Zabani
Everything involves tradeoffs. If your boss insists on using as little storage space as possible even if it takes vastly more processing power, you should ask him whether you should go ahead and compress the harddrives the database server uses.
Anon.
A: 

I just saw you say ms access only

so ignore this answer -- or consider moving to a real DB if you want to be able to do this type of power processing.

original answer below

I don't have access to your test data, but how fast does this run?

SELECT RANK () OVER (PARTITION BY [Date],[Subject] ORDER BY Grade) AS [Global Rank],
       RANK () OVER (PARTITION BY [Date],[Subject], Classroom ORDER BY Grade) AS [Rank in classroom]
FROM grades_table_per_subject

My guess is you are not going to be able to beat SQL Servers ranking speed in VBA, if this is not fast enough then you need to look in the profiler and see what indexes it suggests you make.

Hogan
You know SQL Server Express is free (http://www.microsoft.com/express/Database/) you could download it, pop your data into it and see if it solves the problem.
Hogan