views:

123

answers:

2

I have a tblMachineReports with the columns: Status(varchar),LogDate(datetime),Category(varchar), and MachineID(int).

I want to retrieve the latest status update from each category for every machine, so in effect getting a snapshot of the latest statuses of all the machines unique to their MachineID.

The table data would look like

Category - Status - MachineID - LogDate
cata - status1 - 001 - date1
cata - status2 - 002 - date2
catb - status3 - 001 - date2
catc - status2 - 002 - date4
cata - status3 - 001 - date5
catc - status1 - 001 - date6
catb - status2 - 001 - date7
cata - status2 - 002 - date8
catb - status2 - 002 - date9
catc - status2 - 001 - date10


Restated, I have multiple machines reporting on multiple statuses in this tblMachineReports. All the rows are created through inserts, so their will obviously be duplicate entries for machines as new statuses come in. None of the columns can be predicted, so I can't do any ='some hard coded string' comparisons in any part of the select statement.


For the sample table I provided, the desired results would look like:

Category - Status - MachineID - LogDate
catc - status2 - 002 - date4
cata - status3 - 001 - date5
catb - status2 - 001 - date7
cata - status2 - 002 - date8
catb - status2 - 002 - date9
catc - status2 - 001 - date10

What would the select statement look like to achieve this, getting the latest status for each category on each machine, using MS SQL Server 2008? I have tried different combinations of subqueries combined with aggregate MAX(LogDates)'s, along with joins, group bys, distincts, and what-not, but have yet to find a working solution.

+2  A: 
 SELECT * FROM tblMachineReports MR1 
    WHERE NOT EXISTS
       (SELECT * FROM tblMachineReports MR2 WHERE
            MR2.Category = MR1.Category AND 
            MR2.MachineID = MR1.MachineID AND
            MR2.LogDate > MR1.LogDate)

The only issue is that if you have the same category & machine more than once on the same date, you will get all the records for that date, not only the last one. To fix that, change LogDate to DATETIME if you have access to the database structure or add a numerically increasing identity column to the table and use that instead of LogDate in the SELECT statement.

[edited to change < to > in final line of the SQL]

Larry Lustig
Since LogDate is DateTime in my schema, I believe it is accurate down to the millisecond, and the statistical probability of colliding record dates is so low for this application that stated problem should be negligible.
Simpleton
Then you should be set. In the event of a collision on the millisecond level, assuming you don't care which status you get, you could GROUP the result set and take the MAX or MIN status. That would get you only one result row.
Larry Lustig
I think the < sign should be a > greater than sign, I created a test dataset, and using your query the < sign returned the earliest record for each category on each machine, I appended an 'order by LogDate' to the end of your query and with the < all the dates are earlier than the dates I get back with the > sign. Am I wrong, or perhaps I wasn't properly verbose enough in the problem description, I need the latest in terms of date record for each category on each machine.
Simpleton
BTW comments and responses have been extremely helpful so far. Thanks so much for the time and consideration.
Simpleton
You're quite correct. I'll correct the typo now. . .
Larry Lustig
A quick word of caution. Using NOT EXISTS can be quite expensive on large tables, the query optimiser does not know to execute the query only once per category
TerrorAustralis
Larry - Dave wins the day on this one IMO, for this DB will accumulate thousands upon thousands of records, but as far as response time goes, well you both take the crown.
Simpleton
i'm putting +1 on this response because Larry helped me see the problems with my code
TerrorAustralis
Indeed, NOT EXISTS is an expensive technique when the inner SELECT statement is correlated to the outer one. However, unless the OP is writing some kind of botnet he or she is unlikely to have more than a few hundred machines (I was working on plant automation today for a midsize manufacturer, and they have about 50 per plant), and the performance should be fine over even a few thousand records.
Larry Lustig
@Simpleton: The response time should be about equal since each query uses a correlated query which will have to be evaluated for each candidate row in the original table (and both will benefit from a compound index on MachineID, Category). To get a (possibly) faster result, create a GROUP BY view on MachineID, Category, MAX(LogDate) and JOIN that back to the original table.
Larry Lustig
A: 

Appologies if you have already tried this but this form of grouping and aggregation should work

select *
from tblMachineReports t
where LogDate = (select max(LogDate) from tblMachineReports t2 where t.machineID = t2.machineID and t.category = t2.category)

Try that, might need some fine tuning

TerrorAustralis
You want category, not status, in the WHERE clause of the correlated sub-select, I think.
Larry Lustig
I have changed the solution i think since you posted this. I dont have a subselect any more
TerrorAustralis
I'm not a SQL Server specialist, but I think you were closer the first time. In this version you'll get multiple status values, not only the last one, back, and there's no non-aggregated LogDate value in the result set to use in the HAVING clause.
Larry Lustig
Thanks mate, included a non aggregated logged date. And check out the 'desired output' table. The poster WANTS multiple status values for a single machine, he just wants the LAST status value for each status.
TerrorAustralis
The OP wants the _last_ status for each machine/category pair. The current GROUP BY will collapse the triplets machine/category/status, providing multiple status codes per machine/category pair. For instance, it would include the date1 and date5 lines, although only date5 is desired. Without the aggregate function on LogDate, this SQL shouldn't run (except in MySQL, where it will report a "random" LogDate).
Larry Lustig
thanks for your feedback, i misread the question. I was reading he wanted last updated per machine/category/status
TerrorAustralis
Have modified my response to fit
TerrorAustralis
When I run this query I get 'Column 'tblMachineStatus.LogDate' is invalid in the HAVING clause because it is not contained in either an aggregate function or the GROUP BY clause.' on SQL Server 2008
Simpleton
changed solution to match new DB standards. I tested this solution in a temp table
TerrorAustralis
Worked great for me now, appreciate it much kind sir.
Simpleton