I have the following data:

ExamEntry   Student_ID     Grade
  11           1             80
  12           2             70
  13           3             20
  14           3             68
  15           4             75

I want to find all the students who passed an exam. If a student took the exam more than once, only the last result counts.

So, in this case I'd get that all students passed.

Can I find it with one fast query? I do it this way:

  1. Find the list of entries by select max(ExamEntry) from data group by Student_ID

  2. Find the results:

select ExamEntry from data where ExamEntry in ( ).

But this is VERY slow: I get around 1000 entries, and this two-step process takes 10 seconds.

Is there a better way?

Thanks.

+1  A: 
SELECT student_id, MAX(ExamEntry)
FROM data
WHERE Grade > :threshold
GROUP BY student_id

Like this?

Quassnoi
+5  A: 

If your query is very slow with 1000 records in your table, there is something wrong. For a modern database system, a table containing 1000 entries is considered very, very small.
Most likely, you did not provide a (primary) key for your table?

Assuming that a student would pass if at least one of the grades is above the minimum needed, the appropriate query would be:

SELECT 
  Student_ID
, MAX(Grade) AS maxGrade
FROM table_name
GROUP BY Student_ID
HAVING maxGrade > MINIMUM_GRADE_NEEDED

If you really need the latest grade to be above the minimum:

SELECT 
  Student_ID
, Grade
FROM table_name
WHERE ExamEntry IN ( 
    SELECT 
      MAX(ExamEntry) 
    FROM table_name 
    GROUP BY Student_ID
)
AND Grade > MINIMUM_GRADE_NEEDED
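
To see how the two readings of "passed" differ, here is a minimal sketch that runs both on the question's sample data. It uses Python's built-in sqlite3 module purely as a stand-in database (an assumption, not the asker's setup). The table name data, the pass mark of 60 and the extra retake row (ExamEntry 16) are hypothetical additions for illustration.

```python
import sqlite3

MINIMUM_GRADE_NEEDED = 60  # hypothetical pass mark, not stated in the question

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (ExamEntry INTEGER PRIMARY KEY, Student_ID INTEGER, Grade INTEGER)"
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [
        (11, 1, 80), (12, 2, 70), (13, 3, 20), (14, 3, 68), (15, 4, 75),
        (16, 1, 40),  # hypothetical retake: student 1 fails the latest attempt
    ],
)

# Reading 1: a student passes if ANY attempt is above the minimum.
ever_passed = [
    row[0]
    for row in conn.execute(
        """
        SELECT Student_ID
        FROM data
        GROUP BY Student_ID
        HAVING MAX(Grade) > ?
        ORDER BY Student_ID
        """,
        (MINIMUM_GRADE_NEEDED,),
    )
]

# Reading 2: only the LATEST attempt (highest ExamEntry per student) counts.
latest_passed = [
    row[0]
    for row in conn.execute(
        """
        SELECT Student_ID
        FROM data
        WHERE ExamEntry IN (SELECT MAX(ExamEntry) FROM data GROUP BY Student_ID)
          AND Grade > ?
        ORDER BY Student_ID
        """,
        (MINIMUM_GRADE_NEEDED,),
    )
]

print(ever_passed)    # [1, 2, 3, 4]
print(latest_passed)  # [2, 3, 4]
```

Student 1 shows up only under the first reading, because the hypothetical retake is below the pass mark.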
Jacco
+1  A: 

I'll make some assumptions that you have a student table and test table and the table you are showing us is the test_result table... (if you don't have a similar structure, you should revisit your schema)

select s.id, s.name, t.name, max(r.score)
from student s
left outer join test_result r on r.student_id = s.id
left outer join test t on r.test_id = t.id
group by s.id, s.name, t.name

All the fields with id in it should be indexed.

If you really only have a single test (type) in your domain... then the query would be

select s.id, s.name, max(r.score)
from student s
left outer join test_result r on r.student_id = s.id
group by s.id, s.name
mson
+1  A: 

As mentioned, indexing is a powerful tool for speeding up queries. The order of the index, however, is fundamentally important.

An index in order of (ExamEntry) then (Student_ID) then (Grade) would be next to useless for finding exams where the student passed.

An index in the opposite order would fit perfectly, if all you wanted was to find what exams had been passed. This would enable the query engine to quickly identify rows for exams that have been passed, and just process those.

In MS SQL Server this can be done with...

CREATE INDEX [IX_results] ON [dbo].[results] 
(
    [Grade],
    [Student_ID],
    [ExamEntry]
)
ON [PRIMARY]

(I recommend reading more about indexes to see what other options there are, such as clustered indexes, etc.)

With that index, the following query would be able to ignore the 'failed' exams very quickly, and just display the students who ever passed the exam...

(This assumes that if you ever get over 60, you're counted as a pass, even if you subsequently take the exam again and get 27.)

SELECT
    Student_ID
FROM
    [results]
WHERE
    Grade >= 60
GROUP BY
    Student_ID

Should you definitely need the most recent value, then you need to change the order of the index back to something like...

CREATE INDEX [IX_results] ON [dbo].[results] 
(
    [Student_ID],
    [ExamEntry],
    [Grade]
)
ON [PRIMARY]

This is because the first thing we are interested in is the most recent ExamEntry for any given student, which can be found using the following query...

SELECT
   *
FROM
   [results]
WHERE
   [results].ExamEntry = (
                          SELECT
                              MAX([student_results].ExamEntry)
                          FROM
                              [results] AS [student_results]
                          WHERE
                              [student_results].Student_ID = [results].student_id
                         )
   AND [results].Grade > 60

Having a sub query like this can appear slow, especially since it appears to be executed for every row in [results].

This, however, is not the case...
- Both main and sub query reference the same table
- The query engine scans through the Index for every unique Student_ID
- The sub query is executed, for that Student_ID
- The query engine is already in that part of the index
- So a new Index Lookup is not needed
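
A way to check this kind of claim on your own data is to ask the engine for its plan. The sketch below is only an illustration under assumptions: it uses Python's built-in sqlite3 module rather than MS SQL Server or MySQL, the question's five sample rows, and SQLite's EXPLAIN QUERY PLAN (in MySQL you would prefix the query with EXPLAIN instead). The plan text differs between engines and versions, so it is printed rather than interpreted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (ExamEntry INTEGER, Student_ID INTEGER, Grade INTEGER)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [(11, 1, 80), (12, 2, 70), (13, 3, 20), (14, 3, 68), (15, 4, 75)],
)

# Index ordered for "latest entry per student": Student_ID first, then ExamEntry.
conn.execute("CREATE INDEX IX_results ON results (Student_ID, ExamEntry, Grade)")

query = """
SELECT r.Student_ID, r.ExamEntry, r.Grade
FROM results AS r
WHERE r.ExamEntry = (
          SELECT MAX(sr.ExamEntry)
          FROM results AS sr
          WHERE sr.Student_ID = r.Student_ID
      )
  AND r.Grade > 60
ORDER BY r.Student_ID
"""

latest_passes = conn.execute(query).fetchall()
print(latest_passes)

# The 4th column of each plan row is the human-readable description of a step.
plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
for step in plan:
    print(step)
```

On the sample data the query returns entries 11, 12, 14 and 15; entry 13 is dropped because student 3 has a later attempt.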

EDIT:

A comment was made that at 1000 records indexes are not relevant. It should be noted that the question states that 1000 records are returned, not that the table contains 1000 records. For a basic query to take as long as stated, I'd wager there are many more than 1000 records in the table. Maybe this can be clarified?

EDIT:

I have just investigated 3 queries, with 999 records in each (3 exam results for each of 333 students)

Method 1: WHERE a.ExamEntry = (SELECT MAX(b.ExamEntry) FROM results [b] WHERE a.Student_ID = b.student_id)

Method 2: WHERE a.ExamEntry IN (SELECT MAX(ExamEntry) FROM results GROUP BY Student_ID)

Method 3: USING an INNER JOIN instead of the IN clause

The following relative query costs were found:

Method    QueryCost(No Index)   QueryCost(WithIndex)
   1               23%                    9%
   2               38%                   46%
   3               38%                   46%

So, Method 1 is faster regardless of indexes, but indexes also definitely make Method 1 substantially faster.

The reason for this is that indexes allow lookups, where otherwise you need a scan. That is the difference between a linear law and a square law.
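
For anyone who wants to repeat the comparison, the sketch below rebuilds a data set of the same shape (3 results for each of 333 students) and checks that the three methods select the same rows. It is only a sketch under assumptions: it runs on Python's built-in sqlite3 module rather than MS SQL Server, the grades are made up, and it does not reproduce the query cost figures above, which depend on the engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (ExamEntry INTEGER PRIMARY KEY, Student_ID INTEGER, Grade INTEGER)"
)

# 3 exam results for each of 333 students -> 999 rows.
rows = []
entry = 0
for attempt in range(3):
    for student in range(1, 334):
        entry += 1
        grade = (student * 7 + attempt * 13) % 101  # arbitrary made-up grade
        rows.append((entry, student, grade))
conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)

# Method 1: correlated subquery.
method1 = """
SELECT a.ExamEntry FROM results AS a
WHERE a.ExamEntry = (SELECT MAX(b.ExamEntry) FROM results AS b
                     WHERE b.Student_ID = a.Student_ID)
ORDER BY a.ExamEntry
"""
# Method 2: IN with a grouped subquery.
method2 = """
SELECT a.ExamEntry FROM results AS a
WHERE a.ExamEntry IN (SELECT MAX(ExamEntry) FROM results GROUP BY Student_ID)
ORDER BY a.ExamEntry
"""
# Method 3: INNER JOIN against the grouped subquery.
method3 = """
SELECT a.ExamEntry FROM results AS a
INNER JOIN (SELECT MAX(ExamEntry) AS ExId FROM results GROUP BY Student_ID) AS latest
        ON a.ExamEntry = latest.ExId
ORDER BY a.ExamEntry
"""

answers = [conn.execute(q).fetchall() for q in (method1, method2, method3)]
print(len(answers[0]), answers[0] == answers[1] == answers[2])  # 333 True
```

Since all three return the same 333 rows, any difference you measure between them comes from the plan the engine picks, not from the result.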

Dems
At 1000 records, any indexes are irrelevant.
le dorfier
The statement was that 1000 records were returned, not that the table contained 1000 records.
Dems
A: 

Thanks for the answers!!

I think that Dems is probably closest to what I need, but I will elaborate a bit on the issue.

  1. Only the latest grade counts. If a student passed the first time, attended again and failed, he or she has failed overall. A student could have attended 3 or 4 exams, but still only the last one counts.
  2. I use MySQL server. I see the problem on both Linux and Windows installations.
  3. My data set is around 2K entries now and grows by ~1K per new exam.
  4. The query for a specific exam also returns ~1K entries, where ~1K is the number of students who attended (obtained with SELECT DISTINCT STUDENT_ID from results;). Almost all have passed and some have failed.

  5. I perform the following query in my code: SELECT ExamEntry, Student_ID from exams WHERE ExamEntry in ( SELECT MAX(ExamEntry) from exams GROUP BY Student_ID). As the subquery returns ~1K entries, it appears that the main query scans them in a loop, making the whole query run for a very long time with 50% server load (100% on Windows).

  6. I feel that there is a better way :-), just can't find it yet.

Alex
You would have a lot better luck getting useful answers if you would post your schema, showing your tables and indexes, and query EXPLAIN analysis (do you know how to do this?). You've been able to get a number of us making guesses on insufficient information.
le dorfier
le dorfier: no, please. I started using databases a month ago, so I'd appreciate any help. Would you give an example of the EXPLAIN syntax? Thanks.
Alex
Alex, did you try any of the suggested queries? With or without indexes? Any noticeable improvements in any?
Dems
Dems, no I haven't. I'll get back home and try then...
Alex
A: 

Hi All

I've used the hints given here, and here is the query I found. It runs almost 3 orders of magnitude faster than my first one (.03 sec instead of 10 sec):

SELECT ExamEntry, Student_ID, Grade from data,
       ( SELECT max(ExamEntry) as ExId FROM data GROUP BY Student_ID) as newdata
WHERE `data`.`ExamEntry`=`newdata`.`ExId` AND Grade > 60;
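
A runnable sketch of this derived-table join on the sample data from the question, assuming the inner SELECT reads from the same data table. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption), so the timings quoted above will not carry over; it only shows which rows come back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (ExamEntry INTEGER PRIMARY KEY, Student_ID INTEGER, Grade INTEGER)"
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [(11, 1, 80), (12, 2, 70), (13, 3, 20), (14, 3, 68), (15, 4, 75)],
)

# The derived table newdata holds one row per student: the latest ExamEntry.
# Joining back to data on that key picks up the grade of the latest attempt.
rows = conn.execute(
    """
    SELECT data.ExamEntry, data.Student_ID, data.Grade
    FROM data,
         (SELECT MAX(ExamEntry) AS ExId FROM data GROUP BY Student_ID) AS newdata
    WHERE data.ExamEntry = newdata.ExId
      AND data.Grade > 60
    ORDER BY data.Student_ID
    """
).fetchall()

print(rows)  # [(11, 1, 80), (12, 2, 70), (14, 3, 68), (15, 4, 75)]
```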

Thanks All!

Alex
A: 
select examentry,student_id,grade 
from data 
where examentry in 
  (select max(examentry) 
   from data 
   where grade > 60 
   group by student_id)
tuinstoel
tuinstoel - if you read my post, you'd see that this exact query was the problem in the first place. This kind of query runs for over 10 seconds on my machine. My query runs for .03 sec.
Alex
I haven't installed MySQL so I can't check. Are your statistics up to date?
tuinstoel
yes, it is very up to date :-)
Alex
A: 

don't use

where grade > 60

but

where grade between 60 and 100

that should go faster

Natrium