views:

41

answers:

4

I have a big fat query that's written dynamically to integrate some data. Basically what it does is query some tables, join some other ones, treat some data, and then insert it into a final table.

The problem is that there's too much data, and we can't really trust the sources, because there could be some errored or inconsistent data.

For example, I've spent almost an hour looking for an error while developing using a customer's database because somewhere in the middle of my big fat query there was an error converting some varchar to datetime. It turned out to be that they had some sales dating '2009-02-29', an out-of-range date. And yes, I know. Why was that stored as varchar? Well, the source database has 3 columns for dates, 'Month', 'Day' and 'Year'. I have no idea why it's like that, but still, it is.

But how the hell would I treat that, if the source is not trustable?

I can't HANDLE exceptions, I really need that it comes up to another level with the original message, but I wanted to provide some more info, so that the user could at least try to solve it before calling us.

So I thought about displaying to the user the row number, or some ID that would at least give him some idea of what record he'd have to correct. That's also a hard job because there will be times when the integration will run up to 80000 records. And in an 80000 records integration, a single dummy error message: 'The conversion of a varchar data type to a datetime data type resulted in an out-of-range datetime value' means nothing at all.

So any idea would be appreciated.

Oh I'm using SQL Server 2005 with Service Pack 3.


EDIT:

Ok, so for what I've read as answers, best thing to do is check each column that could be critical to raising errors, and if they do attend the condition, I should myself raise an error, with the message I find more descriptive, and add some info that could have been stored in a separate table or some variables, for example the ID of the row, or some other root information.

+1  A: 

for dates you can use the isdate function

select ISDATE('20090229'),ISDATE('20090227')

I usually insert into a staging table, do my checks and then insert into the real tables

SQLMenace
Dang -- my answer is pretty much the same thing you said about staging, but you were more succinct and clearer.
MJB
Yep, stage the data, cleanup the data, load the data to production tables.
HLGEM
+2  A: 

My suggestion would be to pre-validate the incoming data, and as you encounter errors, set aside the record. For example, check for invalid dates. Say you find 20 in a set of 80K. Pull those 20 out into a separate table, with the error message attached to the record. Run your other validation, then finally import the remaining (all valid) records into the desired target table(s).

This might have too much impact on performance, but would allow you to easily point out the errors and allow them to be corrected and then inserted in a second pass.

MJB
A: 

try something like this to find the rows:

...big fat query here...
WHERE ISDATE(YourBadVarcharColumn)!=1

Load the Data into a staging table, where most columns are varchar and allow NULLs, where you have a status column.

Run an UPDATE command like

UPDATE Staging
    SET Status='X'
    WHERE ISDATE(CONVERT(YourCharYear+YourCharMonth+YourCharDat+)!=1 
        OR OtherColumn<4...

Then just insert from your staging table where Status!='X'

INSERT INTO RealTable
        (col1, col2...)
    SELECT
        col1, col2, ...
        where Status!='X'
KM
you can then easily report the bad rows from the staging table based on the Status column
KM
+1  A: 

This sounds like a standard ETL issue: Extract, Transform, and Load. (Unless you have to run this query over and over again against the same set of data, in which case you'd pretty much do the same thing, only over and over again. So how critical is performance?)

What kind of error handling and/or "reporting of bad data" are you allowed to provide? If you have everything as "one big fat query", your options become very limited -- either the query works or it doesn't, and if it doesn't I'm guessing you get at best one RAISERROR message to tell the caller what's what.

In a situation like this, the general framework I'd try to set up is:

  • Starting with the source table(s)
  • Produce an interim set of tables (SQLMenace's staging tables) that you know are consistant and properly formed (valid data, keys, etc.)
  • Write the "not quite so big and fat query" against those tables

Done this way, you should always be able to return (or store) a valid data set... even if it is empty. The trick will be in determining when the routine fails -- when is the data too corrupt to process and produce the desired results, so you return a properly worded error message instead?

Philip Kelley
I can provide a string back to my program as an error message. But I'm able to concatenate it in my big query to provide some info on the errored record. Oh and as you pointed out, either the query works, or it doesn't, I don't intend to catch errors, I intend to point them out. So I'd stop execution for the first error I find, and show the user the responsible record.
Felipe Fiali