tags:

views:

336

answers:

7

As part of a data regression and quality assurance exercise, I need to look for diffs between two tables that should, ideally, be nearly identical. I have looked at a large number of commercial products and can't seem to find one that satisfies all of our requirements:

  • Must be able to compare LARGE tables (10 million rows, by 200 columns) very efficiently

  • Must work across different DB servers and different DB vendors (Oracle vs. DB2)

  • Must be able to compare tables having different structures, ignoring the columns that are not shared between the two tables

  • Must work with a user supplied, multi-column primary key-- can't rely on key defined in DB

  • Must run on linux/solaris. Will be run as part of a fully automated process that is executed within an enterprise environment.

  • Must be able to run headless (wihtout GUI)

  • Must produce formatted report that identifies row diffs (row on only one side) and value diffs

  • Customer is willing to pay enterprise level price for right solution. In other words, price no object.

Has anyone ever seen something like this?

A: 

Don't know if it's related, by have a look at http://stackoverflow.com/questions/847434/diffing-objects-from-a-relational-database

Martijn
A: 

I'd hash the DB rows based on your defined criteria and then use that. If the comparison details are fairly static you may want to persist the hash, either as a new column in the table itself or in a separate dedicated table. An appropriate index would then allow you to perform whatever comparisons you wish.

Visage
A: 

select 'Table1' as tblName, * from (select * from Table1 except select * from Table2) x union all select 'Table2' as tblName, * from (select * Table2 except select * from Table1) x

Also, There are commercially available tools which are designed for this

Red Gate do one for £245, there must be 'freeware' I'd guess

http://www.red-gate.com/products/SQL_Compare/index.htm

Stuart
guess you didn't read any of our requirements. Your SQL statement doesn't work because we need to cross different physical databases.Red Gate don't work because it is incapable of diffing very large data sets.thanks.
+2  A: 

Not the best solutions but for flexibility, we have implemented this as a set of perl scripts that extract the data and then do file comparison.

Most commercial databases have excellent bulk copy utility (bcp, sqlload etc.) and Perl is fast with string comparison and for proecessing large files.

Rahul
We tried this initially. We have two problems. One is if there is a row on one side that is missing on the other side. A generic diff algorithm will break at this point, preventing diffing beyond the point of first row mismatch.The second problem is that it's incredibly time consuming. Sometimes, we already know there are diffs, because the rowcounts don't match. And we just want a quick report that explains the first few discrepancies so that we can address that problem before moving on.
Also, we were really hoping for something that is more out-of-the-box. The bcp + perl option is not generic for the different instances of diff that we want to carry out (different tables, different ignore columns, etc). We would have to modify perl scipts every time we want to change how it works. Ideal solution would involve no programming, just some kind of a config file.
@unknown: ?? no clue what you mean by "will break at this point". For the rest, it's a matter of customizing to meet your needs. If you want to check rowcounts first, do so. If you want to stop after the first X differences, do that. It's a SMOP.
ysth
@unknown: perl scripts have been known to read config files :)
ysth
Thinking "way" out of the box, would an ETL tool like Informatica help? The two tables could be configured as sources and the diff logic could be implemented in the transformations. This would require less coding and would work with heterogenous tables.
Rahul
A: 

Over the years, I have developed certain scripts that almost exactly match your requirements. I could productize them, i.e. make them easier to use with some additional effort.

Please contact me at androknego at yahoo dot com (no spaces, replace at with @ and dot with .)

Andrew from NZSG
A: 

Only runs on Windows but satisfies all your other requirements Zidsoft CompareData http://www.zidsoft.com/

Farid Z
Target dbms can be on other machines/OS and the results of the comparison can be stored on DB2/Oracle/SQL Server, etc that are running on whatever OS the DBMS is running on (comparison results are stored in SQL tables that you can run reports on, etc)
Farid Z
A: 

A tool that meets all of your criteria now exists. It's free, it's Open Source, and it's very well documented, and it's a high-quality, well tested product.

http://www.diffkit.org