views: 145
answers: 4
Hi,

I am working on some business intelligence reports these days. The data is fetched by ordinary SQL SELECT statements, which are getting more and more complex.

The data I want to report is partly business-critical, so I would feel better if I could do something to prove the correctness and quality of the SQL statements.

I know there are ways to do this for application code, but what can I do to reach these goals at the SQL level?

Thanks in advance.

+6  A: 

I'm not aware of any SQL-level proof of correctness you could do, since what you care about is the intent (semantics) of the query rather than its syntactic correctness.

I would write a small test harness that takes the SQL statement, runs it against a known test database, and compares the result with an expected set of reference data (a spreadsheet, a simple CSV file, etc.).

For bonus points, wrap this in a unit test and make it part of your continuous build process.

If you use a spreadsheet or CSV for the reference data, it may be possible to walk through it with the business users to capture their requirements ahead of writing the SQL (i.e. test-driven development).
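
One way to express that comparison at the SQL level is to load the reference data into a table and diff it against the report query in both directions. A rough sketch, assuming a dialect that supports EXCEPT (e.g. SQL Server or PostgreSQL) and hypothetical names for the report view and the reference table:

    -- Rows the report returns that are not in the reference data
    SELECT client_id, report_month, total_amount
    FROM report_query_results      -- hypothetical view wrapping the report SELECT
    EXCEPT
    SELECT client_id, report_month, total_amount
    FROM expected_report;          -- reference rows loaded from the CSV/spreadsheet

    -- Rows expected by the business but missing from the report
    SELECT client_id, report_month, total_amount
    FROM expected_report
    EXCEPT
    SELECT client_id, report_month, total_amount
    FROM report_query_results;

    -- Both queries should return zero rows; the test harness fails the build otherwise.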

Paolo
+1 for running a test as part of the continuous build process, against a known test database
AdaTheDev
+1  A: 

Testing the correctness of the statements would require a detailed description of the logic behind the report requirement, and then, independently of your SQL, the creation of appropriate test data sets constructed against the requirements, to ensure that the correct data, and only the correct data, is selected for each test case.

Constructing these cases for more and more complex conditions will get very difficult, though; reporting is notorious for ever-changing requirements.
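
As a rough illustration of what one such test case might look like (all table and column names are made up), the seed data deliberately includes rows that must be excluded as well as rows that must appear:

    -- Test case: "the report only includes invoices posted within the reporting period"
    INSERT INTO invoices (invoice_id, client_id, posted_date, amount)
    VALUES (1, 100, '2010-03-15', 250.00);   -- inside the period: must appear in the report

    INSERT INTO invoices (invoice_id, client_id, posted_date, amount)
    VALUES (2, 100, '2010-04-02', 990.00);   -- outside the period: must not appear

    -- Run the report query for March 2010 and verify that only invoice 1 is returned.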

Andrew
+1  A: 

You could also consider capturing metrics such as how long each statement takes to run. You could do this either at the application level or by writing into an audit table at the beginning and end of each SQL statement. This is made easier if your statements are encapsulated in stored procedures, and it can also be used to monitor who is calling the procedure, at what times of day, and where from.
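
A minimal sketch of the audit-table approach, assuming SQL Server T-SQL (the procedure, table, and column names are only illustrative):

    CREATE TABLE report_audit (
        audit_id    INT IDENTITY PRIMARY KEY,
        proc_name   SYSNAME,
        called_by   SYSNAME,
        started_at  DATETIME,
        finished_at DATETIME
    );

    CREATE PROCEDURE rpt_monthly_revenue
    AS
    BEGIN
        DECLARE @audit_id INT;

        -- Record who called the procedure and when it started
        INSERT INTO report_audit (proc_name, called_by, started_at)
        VALUES ('rpt_monthly_revenue', SUSER_SNAME(), GETDATE());
        SET @audit_id = SCOPE_IDENTITY();

        -- ... the report SELECT goes here ...

        -- Record the end time so the duration can be derived later
        UPDATE report_audit
        SET finished_at = GETDATE()
        WHERE audit_id = @audit_id;
    END;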

Adamski
+1  A: 

Looking forward to reading the answers on this one. It's simple to check whether the statement works or not: either it runs or it doesn't. You can also check it against the requirement: does it return these 14 columns in the specified order?

What is hard to check is whether the result set is the correct answer. When you have tables with millions of rows joined to other tables with millions of rows, you can't physically check everything to know what the results should be. I like the theory of running against a test database with known conditions, but building this, and accounting for the edge cases that might affect the data in production, is something that I think would be hard to tackle.

You can sometimes look at things in a way that tells you whether they are right. Sometimes I add a small limiting where clause to a report query to get a result set I can check manually (say, only the records for a few days). Then I go into the database tables individually and see if they match what I have. For instance, if I know there were 12 meetings for the client in that time period, is that the result I got? If I got 14, then one of the joins probably needs more limiting criteria (there are two records and I only want the latest one). If I got 10, I figure out what is eliminating the other two (usually a join that should be a left join, or a where condition) and whether those two should be missing under the business rules I've been given.
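
For example, something along these lines (tables, columns, and dates are invented for illustration):

    -- Temporarily limit the report query so the result is small enough to verify by hand
    SELECT c.client_name, m.meeting_date, m.meeting_type
    FROM clients c
    JOIN meetings m ON m.client_id = c.client_id
    WHERE m.meeting_date BETWEEN '2010-03-01' AND '2010-03-03';  -- a few days only
    -- If the client is known to have had 12 meetings in this window, expect 12 rows back.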

Often when building a query, I return more columns than I actually need so that I can see the other data; when an unexpected value turns up, that can make you realize you forgot a filter you still need to add.

I look carefully at the number of results as I go through and add joins and where conditions. Did the count go up or down, and if so, is that what I wanted?
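
A quick way to watch the counts as the query grows (hypothetical tables again):

    SELECT COUNT(*) FROM speakers;                               -- baseline

    SELECT COUNT(*)
    FROM speakers s
    JOIN engagements e ON e.speaker_id = s.speaker_id;           -- did the join multiply rows unexpectedly?

    SELECT COUNT(*)
    FROM speakers s
    JOIN engagements e ON e.speaker_id = s.speaker_id
    WHERE e.status = 'confirmed';                                -- did the filter remove what it should?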

If there is somewhere already returning data that I expect my users will compare this report to, I will look there. For instance, if they can search the website for available speakers and I'm exporting speaker data to the client, the totals between the two had better match or be explainable by different business rules.
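
A sketch of that kind of reconciliation (again, the names are invented):

    -- Total according to the new export
    SELECT COUNT(*) AS export_speakers
    FROM speaker_export_view;

    -- Total according to the existing, trusted source (the site search)
    SELECT COUNT(*) AS site_speakers
    FROM speakers
    WHERE is_active = 1;

    -- Any difference between the two should be explainable by a documented business rule.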

When doing reporting, the users of the report often have a better idea of what the data should say than the developer. I always ask one of them to look at the QA data and confirm that the report is correct. They will often say things like, "What happened to project XYZ? It should be on this report." Then you know to look at that particular case.

One other thing you need to test is not just the correctness of the data, but the performance. If you only test against a small test database, you may have a query that returns the correct data but times out every time you try to run it on prod with the larger data set. So never test only against the limited data set, and if at all possible run load tests as well. You do not want a bad query to take down your prod system and have that be the first indication that there is a problem.

HLGEM