I'm brainstorming an application to handle various data integrity checks. Each individual check could query a number of production tables, evaluate the results, and report an error along with the data relevant to that check. For example, one check would look for customers with a scheduled payment but no remaining balance; a different check might look for credit card transactions that have been authorized but not settled for more than 3 days. Two completely unrelated checks. The dataset from the first one would contain things like customer number, scheduled payment date, payoff date, etc. The second check would have transaction number, card type, last 4 digits of card, amount, etc.

I would like to store the result datasets in a common schema so I can query for any errors from Check A for a specific customer in the last 3 months, or for how many times Check B has returned an error for distinct transactions. Other tables would also manage issue resolution and such. The only thing I've come up with so far is a table with ~20 columns: one column relating to the specific check, one for date/time, and the rest some form of varchar capable of holding any type of data. There are any number of reasons why this makes me cringe, but performance ranks pretty high up there. I'm hoping to avoid separate tables for each check, but combining that with a lookup table for secondary functionality may be the only way to go.

For the curious I'm trying to keep this in the Microsoft world (VB.NET and SQL Server), but I'm open to other ideas.
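
Roughly, the wide table I'm describing would look something like this (column names are just placeholders):

    -- Rough sketch of the ~20-column table described above (placeholder names).
    -- Every check shares one row shape; the generic varchar columns hold
    -- whatever data a given check happens to return.
    CREATE TABLE CheckResults (
        result_id   INT IDENTITY(1,1) PRIMARY KEY,
        check_id    INT          NOT NULL,   -- which integrity check produced this row
        result_time DATETIME     NOT NULL,
        value01     VARCHAR(255) NULL,       -- meaning differs per check
        value02     VARCHAR(255) NULL,
        value03     VARCHAR(255) NULL
        -- ... and so on up to ~20 generic columns
    );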

A: 

Hmmmm. Just brainstorming here. Do you really need to store the check data, or just a reference to be able to look it up? How about a table to store the types of checks, and then a related table that stores a check id, the type of check, the date of the check, and any other data you might need. Then another related table that stores the check id and the record id for the records which failed the check.
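
Something like this, perhaps (purely a sketch, with made-up names):

    -- Sketch of the reference-only idea: record which records failed which check
    -- and look the underlying data up in the production tables when needed.
    CREATE TABLE CheckTypes (
        check_type_id INT IDENTITY(1,1) PRIMARY KEY,
        name          VARCHAR(100) NOT NULL
    );

    CREATE TABLE Checks (
        check_id      INT IDENTITY(1,1) PRIMARY KEY,
        check_type_id INT      NOT NULL REFERENCES CheckTypes(check_type_id),
        run_date      DATETIME NOT NULL
    );

    CREATE TABLE CheckFailedRecords (
        check_id  INT NOT NULL REFERENCES Checks(check_id),
        record_id INT NOT NULL,   -- key of the failing row in the source table
        PRIMARY KEY (check_id, record_id)
    );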

An alternative is how we store our audit table data: a varchar(max) column for the actual data, a column that says which column it belonged to (in your case, which table it belonged to might be a separate column as well), and a column that relates it to the check that was performed.
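
In table form, that might be something like (again just a sketch):

    -- Sketch of the audit-style layout: one row per captured value, tagged with
    -- where it came from and which check flagged it.
    CREATE TABLE CheckAuditData (
        check_id      INT          NOT NULL,   -- the check that was performed
        source_table  VARCHAR(128) NOT NULL,   -- which table the value belonged to
        source_column VARCHAR(128) NOT NULL,   -- which column the value belonged to
        value         VARCHAR(MAX) NULL        -- the actual data, stored as text
    );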

If you really feel you must store the data in one record, this might be a place to store an XML document.
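
For example, SQL Server's xml data type would let you keep the whole dataset in one column and still query into it (the names below are made up):

    CREATE TABLE CheckResultsXml (
        result_id   INT IDENTITY(1,1) PRIMARY KEY,
        check_id    INT      NOT NULL,
        result_time DATETIME NOT NULL,
        result_data XML      NOT NULL          -- the full dataset for this failure
    );

    -- Pull a single value back out with XQuery.
    SELECT result_data.value('(/row/customer_number)[1]', 'VARCHAR(20)') AS customer_number
    FROM CheckResultsXml
    WHERE check_id = 1;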

HLGEM
Some of the checks capture point-in-time data that I might not necessarily be able to recreate after the fact, but may still need to reference. That's another problem I'm hoping to resolve with this, as right now I do that by searching log files for the error and reviewing the plain-text datasets.
clvrmonkey
+1  A: 

You can pull out those ~20 columns into one normalised "values" table. You could have a lookup table that defines which fields are expected for each check type.

To illustrate what I'm suggesting:

check_datasets

  • dataset_id (PK)
  • check_id (This would be the ID of the check that failed)
  • datetime

value_types

  • value_type_id (PK)
  • name

dataset_value_types

  • value_type_id (FK, PK)
  • dataset_id (FK, PK)

dataset_values

  • dataset_id (FK, PK)
  • value_type_id (FK, PK)
  • value (This column would hold the actual data value)

----------

So to explain, in the above tables:

  • check_datasets is the table that holds each failed check (I am assuming that you are only recording failed checks - otherwise, you'd probably want to add a pass/fail flag to this table).
  • value_types is a simple lookup table to provide names for value fields.
  • dataset_value_types is used to define which checks have which types of value fields (this is the many-to-many relationship between the check_datasets and value_types tables).
  • dataset_values is the table that holds the specific values for each check.

This design provides a normalised and easy way to query every value you want for each "dataset" that is returned.
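
In T-SQL, a rough sketch of these tables and of the kind of query they enable (IDs and names below are hypothetical):

    CREATE TABLE check_datasets (
        dataset_id INT IDENTITY(1,1) PRIMARY KEY,
        check_id   INT      NOT NULL,           -- the check that failed
        [datetime] DATETIME NOT NULL
    );

    CREATE TABLE value_types (
        value_type_id INT IDENTITY(1,1) PRIMARY KEY,
        name          VARCHAR(100) NOT NULL
    );

    CREATE TABLE dataset_value_types (
        value_type_id INT NOT NULL REFERENCES value_types(value_type_id),
        dataset_id    INT NOT NULL REFERENCES check_datasets(dataset_id),
        PRIMARY KEY (value_type_id, dataset_id)
    );

    CREATE TABLE dataset_values (
        dataset_id    INT NOT NULL REFERENCES check_datasets(dataset_id),
        value_type_id INT NOT NULL REFERENCES value_types(value_type_id),
        value         VARCHAR(MAX) NULL,        -- mixed datatypes live here (see the drawback below)
        PRIMARY KEY (dataset_id, value_type_id)
    );

    -- e.g. "errors from Check A for a specific customer in the last 3 months"
    SELECT d.dataset_id, d.[datetime]
    FROM check_datasets d
    JOIN dataset_values v  ON v.dataset_id = d.dataset_id
    JOIN value_types   vt  ON vt.value_type_id = v.value_type_id
    WHERE d.check_id = 1                         -- hypothetical ID for Check A
      AND vt.name = 'customer_number'
      AND v.value = '12345'
      AND d.[datetime] >= DATEADD(MONTH, -3, GETDATE());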


The drawback to this design is that you still have mixed datatypes in the one field (value).

If you wanted to separate datatypes you could subtype dataset_values into separate tables (i.e. one table for integer values, one for varchars, one for LOBs, etc).

In the example above, this would mean creating as many sub-tables as you want for datatypes (e.g. dataset_values_int, dataset_values_text, etc). Each sub-table need only have two columns - one holding the value and the second referencing the parent table.

(1) dataset_values would become a parent table and possibly could look like this:

dataset_values

  • dataset_value_id (PK)
  • dataset_id (FK)
  • value_type_id (FK)
  • datatype (Is this value an integer, varchar, etc.? This column tells you which sub-table to check for the value)

(2) An example sub-table would be:

dataset_values_varchar

  • dataset_value_id (PK,FK)
  • value (In this table this would be a varchar, but it would vary for the other sub-tables.)
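
A sketch of this subtyped version in T-SQL (it replaces the single dataset_values table from the first layout; names are just examples):

    CREATE TABLE dataset_values (
        dataset_value_id INT IDENTITY(1,1) PRIMARY KEY,
        dataset_id       INT NOT NULL REFERENCES check_datasets(dataset_id),
        value_type_id    INT NOT NULL REFERENCES value_types(value_type_id),
        datatype         VARCHAR(20) NOT NULL   -- 'int', 'varchar', ... says which sub-table holds the value
    );

    CREATE TABLE dataset_values_varchar (
        dataset_value_id INT PRIMARY KEY REFERENCES dataset_values(dataset_value_id),
        value            VARCHAR(MAX) NULL
    );

    CREATE TABLE dataset_values_int (
        dataset_value_id INT PRIMARY KEY REFERENCES dataset_values(dataset_value_id),
        value            INT NULL
    );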
catchdave
This sounds like the best option I've heard so far. I don't even know why I considered a static column count for this.
clvrmonkey
So is it good enough to get an upvote or accepted?
catchdave
This is what I use. If you want to get even fancier with it, you can create another (enum) column in the types table that stores the SQL type that should be matched/forced on the data. Then, when you insert/select you can use that column to force the value into the correct type (in MySQL this would be something like: `SELECT CAST(value AS VARCHAR) FROM values_table`) or use it to validate in the insert/update trigger. Obviously, to dynamically add the type casting to the query, you will need to do this via a stored procedure. However, by using a stored procedure you don't need a table for every possible type.
Kevin Peno
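
A rough T-SQL take on that comment, applied to the single dataset_values table rather than per-type sub-tables (column and procedure names are invented; the type string is concatenated into dynamic SQL, so it should only ever come from your own trusted lookup table):

    -- Store the target SQL type on the value-type row...
    ALTER TABLE value_types ADD sql_type VARCHAR(30) NOT NULL DEFAULT 'varchar(max)';
    GO

    -- ...and build the cast dynamically in a stored procedure when reading back.
    CREATE PROCEDURE GetTypedValues
        @value_type_id INT
    AS
    BEGIN
        DECLARE @sql_type VARCHAR(30), @sql NVARCHAR(MAX);

        SELECT @sql_type = sql_type
        FROM value_types
        WHERE value_type_id = @value_type_id;

        SET @sql = N'SELECT dataset_id, CAST(value AS ' + @sql_type + N') AS typed_value
                     FROM dataset_values
                     WHERE value_type_id = @vt';

        EXEC sp_executesql @sql, N'@vt INT', @vt = @value_type_id;
    END;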