ansaurus

Question

Answer 1

A:

It sounds like you are looking for a Database Diff tool. Such a tool would look for differences between two tables (or two databases), and generate the necessary scripts to align them.

See the following post for more information:
http://stackoverflow.com/questions/104203/anyone-know-of-any-good-database-diff-tools

Robert Harvey 2009-07-04 17:39:17

Thanks but this does not help. I need an actual algorithm. I have a feeling this is a theoretical problem which has been studied, so someone out there knows exactly how to do it.

binarycoder 2009-07-04 17:44:15

Answer 2

A:

OpenDbDiff has source code available. You could look at that and figure out the algorithms.

http://opendbiff.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=25206

Robert Harvey 2009-07-04 17:50:21

Answer 3

+1 A:

I wrote one once, but it's someone else's IP, so I can't go into too much detail. However, I'm willing to tell you the process that taught me how to do this. This was a tool to make a shadow copy of a customer's "database" residing on salesforce.com, written in .NET 1.1.

I started out doing it the brute-force way: create the DataSet and database from the schema; turn off constraints in the DataSet; iterate through each table, loading rows that are not yet in the table and ignoring errors; repeat until there are no more rows to add, no more errors, or no change in the number of errors; then dump the DataSet to the database, again repeating until there are no errors.
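
The retry loop described above can be sketched roughly as follows. This is only an illustration in Python with an in-memory SQLite database, not the original .NET tool; the table names (`customer`, `orders`, `order_line`) and the helper `load_by_retry` are made up for the example.

```python
import sqlite3

def load_by_retry(conn, statements):
    """Run (sql, params) pairs repeatedly, ignoring constraint errors.

    Stops when everything has succeeded or when a full pass makes no
    progress. Returns (leftover_statements, number_of_passes).
    """
    pending = list(statements)
    passes = 0
    while pending:
        passes += 1
        failed = []
        for sql, params in pending:
            try:
                conn.execute(sql, params)
            except sqlite3.IntegrityError:
                failed.append((sql, params))  # retry on the next pass
        if len(failed) == len(pending):
            break  # no change in the number of errors: give up
        pending = failed
    return pending, passes

# Hypothetical three-level schema: customer <- orders <- order_line
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, "
             "customer_id INTEGER NOT NULL REFERENCES customer(id))")
conn.execute("CREATE TABLE order_line (id INTEGER PRIMARY KEY, "
             "order_id INTEGER NOT NULL REFERENCES orders(id))")

# Rows deliberately supplied in the worst order (children before parents)
rows = [
    ("INSERT INTO order_line VALUES (?, ?)", (1, 10)),
    ("INSERT INTO orders VALUES (?, ?)", (10, 100)),
    ("INSERT INTO customer VALUES (?)", (100,)),
]
leftover, passes = load_by_retry(conn, rows)
```

Each pass only gets one level of the dependency chain saved, so the number of passes grows with the depth of the foreign key graph, and every failed attempt is a wasted round trip.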

Brute force was the starting point because it wasn't certain that we could do this at all. The "schema" of salesforce.com wasn't a true relational schema. For instance, if I remember correctly, there were some columns which were foreign keys relating to one of several parent tables.

This took forever, even while debugging. I began to notice that most of the time was being spent on handling the constraint violations in the database. I also began to see a pattern in the constraint violations as each iteration converged, slowly, toward getting all the rows saved.

All the revelations I had were due to my boredom, watching the system sit at near 100% CPU for 15-20 minutes at a time, even with a small database. "Necessity is the mother of invention", and "the prospect of waiting another 20 minutes for the same rows tends to focus the mind", and I figured out how to speed things up by a factor of over 100.

John Saunders 2009-07-04 21:33:43

I am well aware that ordering the INSERTs, UPDATEs, and DELETEs based on the graph of foreign key constraints handles most of the problems. My goal here is to handle a maximum number of kinds of problems without brute force, not just most of the problems.

binarycoder 2009-07-04 22:02:30

What I learned handled all of the problems, without any remaining brute force. I just needed to start with brute force then continue until boredom made the solution plain. I'll edit with the reason brute force was the starting point.

John Saunders 2009-07-04 22:04:59

Answer 4

+1 A:

OK, I think that this is it, though the Unique Key thing is pretty hard to figure out. Note that any errors encountered in the SQL execution should result in complete rollback of the entire transaction.

UPDATE: The original order that I implemented was:

Each Table, BottomUp(All Deletes for table)
Each Table, TopDown(All Updates, then All Inserts)

After a counter-example was posted, I believe that I know how to correct for the restricted problem only (problem #1, without UCs) by changing the order to:

Each Table, TopDown(All Inserts)
Each Table, TopDown(All Updates)
Each Table, BottomUp(All Deletes)

This will definitely NOT work with Unique Constraints though, which as far as I can figure will need a row-content-based dependency sort (as opposed to the static table FK dependency sort I am currently using). What makes this particularly difficult is that it may require getting info about record content other than the changed rows (in particular, checking for the existence of UC conflict values and child-dependent records for intermediate steps).

Anyway, here's the current version:

Public Class TranformChangesToSQL
 Class ColVal
    Public name As String
    Public value As String  'note: assuming string values
 End Class

 Class Row
    Public Columns As List(Of ColVal)
 End Class

 Class FKDef
    'NOTE: all FKs are assumed to be of the same type: records in the FK table
    ' must have a record in the PK table matching on FK=PK columns.
    Public PKTableName As String
    Public FKTableName As String
    Public FK As String
 End Class

 Class TableInfo
    Public Name As String
    Public PK As String                     'name of the PK column
    Public UniqueKeys As List(Of String)    'column name of each Unique key
    'This table's Foreign Keys (FK):
    Public DependsOn As List(Of FKDef)
    'Other tables' FKs that point to this table
    Public DependedBy As List(Of FKDef)
    Public Columns As List(Of String)
    'note: all row collections are indexed by PK
    Public inserted As List(Of Row)     'inserted after-images
    Public deleted As List(Of Row)      'deleted before-images
    Public updBefore As List(Of Row)
    Public updAfter As List(Of Row)
 End Class

 Sub MakeSQL(ByVal tables As List(Of TableInfo))
    'Note: table dependencies (FKs) must NOT form a cycle.
    'TopologicalSort and AddSQL are assumed to be defined elsewhere.

    'Sort the tables by dependency so that
    ' child tables (FKs) are always after their parents (PK tables)
    TopologicalSort(tables)

    For Each tbl As TableInfo In tables
        'Do INSERTs; they *must* be done first, in parent->child order, because:
        '   they may have FKs dependent on parent inserts
        '   and there may be UPDATEs that will make child records dependent on them
        For Each r As Row In tbl.inserted
            Dim InsSQL As String = "INSERT INTO " & tbl.Name & "("
            Dim valstr As String = ") VALUES("
            Dim comma As String = ""
            For Each col As ColVal In r.Columns
                InsSQL = InsSQL & comma & col.name
                valstr = valstr & comma & Quote(col.value)
                comma = ", "    'needed for second and later columns
            Next
            AddSQL(InsSQL & valstr & ");")
        Next
    Next

    For Each tbl As TableInfo In tables
        'Do UPDATEs, also in parent->child order
        For Each aft As Row In tbl.updAfter
            'get the matching before-update row (same PK value)
            Dim pkValue As String = GetValue(aft, tbl.PK)
            Dim bef As Row = FindByPK(tbl.updBefore, tbl.PK, pkValue)
            Dim UpdSql As String = "UPDATE " & tbl.Name & " SET "
            Dim comma As String = ""
            For Each col As ColVal In aft.Columns
                If GetValue(bef, col.name) <> col.value Then
                    UpdSql = UpdSql & comma & col.name & " = " & Quote(col.value)
                    comma = ", "  'needed for second and later columns
                End If
            Next
            'only add it if any columns were different:
            If comma <> "" Then
                AddSQL(UpdSql & " WHERE " & tbl.PK & " = " & Quote(pkValue) & ";")
            End If
        Next
    Next

    'Now reverse the list so that DELETEs are done in child->parent order
    tables.Reverse()

    For Each tbl As TableInfo In tables
        'Do DELETEs; they *must* be done last, and in child->parent order, because:
        '   parents may have children that depend on them, so children must be deleted first,
        '   and there may be children dependent until after UPDATEs pointed them away
        For Each r As Row In tbl.deleted
            AddSQL("DELETE FROM " & tbl.Name & " WHERE " & tbl.PK & " = " & Quote(GetValue(r, tbl.PK)) & ";")
        Next
    Next

 End Sub

 'Quote a string value as a SQL literal, doubling embedded single quotes
 Private Function Quote(ByVal value As String) As String
    Return "'" & value.Replace("'", "''") & "'"
 End Function

 'Return the value of the named column in a row (Nothing if absent)
 Private Function GetValue(ByVal r As Row, ByVal colName As String) As String
    For Each col As ColVal In r.Columns
        If col.name = colName Then Return col.value
    Next
    Return Nothing
 End Function

 'Find the row whose PK column has the given value (Nothing if absent)
 Private Function FindByPK(ByVal rows As List(Of Row), ByVal pk As String, ByVal pkValue As String) As Row
    For Each r As Row In rows
        If GetValue(r, pk) = pkValue Then Return r
    Next
    Return Nothing
 End Function
End Class

RBarryYoung 2009-07-04 22:02:07
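
The answer's code calls a TopologicalSort routine without showing it. One possible way to order the tables and derive the INSERT / UPDATE / DELETE sequence is sketched below in Python, using Kahn's algorithm. This is an assumption about how such a sort could be done, not the answerer's implementation, and the function and table names are invented for the example.

```python
from collections import deque

def topological_sort(tables, foreign_keys):
    """Order tables so every parent (PK table) precedes its children (FK tables).

    tables: list of table names
    foreign_keys: list of (child, parent) pairs; child has an FK to parent
    """
    children = {t: [] for t in tables}
    pending_parents = {t: 0 for t in tables}
    for child, parent in foreign_keys:
        children[parent].append(child)
        pending_parents[child] += 1

    # Start with tables that depend on nothing
    ready = deque(t for t in tables if pending_parents[t] == 0)
    ordered = []
    while ready:
        table = ready.popleft()
        ordered.append(table)
        for child in children[table]:
            pending_parents[child] -= 1
            if pending_parents[child] == 0:
                ready.append(child)

    if len(ordered) != len(tables):
        raise ValueError("foreign keys form a cycle; no static table order exists")
    return ordered

def statement_order(tables, foreign_keys):
    """INSERTs top-down, UPDATEs top-down, DELETEs bottom-up."""
    top_down = topological_sort(tables, foreign_keys)
    bottom_up = list(reversed(top_down))
    return ([("INSERT", t) for t in top_down]
            + [("UPDATE", t) for t in top_down]
            + [("DELETE", t) for t in bottom_up])

# Hypothetical schema: customer <- orders <- order_line
tables = ["order_line", "customer", "orders"]
fks = [("orders", "customer"), ("order_line", "orders")]
plan = statement_order(tables, fks)
```

The sort looks only at the static FK graph between tables, never at row contents, which is why it cannot help with unique-constraint conflicts between individual rows.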

It's going to take me a while to fully digest this, but I am already aware of some ways that sort the foreign key graph. Your algorithm does not deal with all problems that are technically possible to deal with. There are actually ways to insert records with cycles and ways to delete them via clever updating. There are also some possible unique constraints which can cause your algorithm grief. These comments don't really allow enough characters to explain in detail, e.g., UC on two columns: Row1: A B Row2: B A ==> Row1: B A Row2: A B

binarycoder 2009-07-04 22:14:53
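
The two-column unique constraint example in the comment above can be reproduced with a small Python/SQLite sketch. It shows the naive single-row update failing, and one possible workaround: parking one row on a temporary value. The table `t`, its columns, and the `PLACEHOLDER` value are assumptions made for illustration; a real tool would have to pick a placeholder guaranteed not to collide with existing data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, c1 TEXT, c2 TEXT, "
             "UNIQUE (c1, c2))")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [(1, "A", "B"), (2, "B", "A")])
conn.commit()

# Naive attempt: move row 1 straight to its final value.
# Row 2 still holds (B, A), so the unique constraint rejects it.
try:
    conn.execute("UPDATE t SET c1 = 'B', c2 = 'A' WHERE id = 1")
    naive_failed = False
except sqlite3.IntegrityError:
    naive_failed = True
    conn.rollback()

# Workaround: three updates instead of two, via a temporary value.
PLACEHOLDER = "__tmp__"  # hypothetical value assumed not to occur in the data
with conn:  # commits on success, rolls back on error
    conn.execute("UPDATE t SET c1 = ? WHERE id = 1", (PLACEHOLDER,))
    conn.execute("UPDATE t SET c1 = 'A', c2 = 'B' WHERE id = 2")
    conn.execute("UPDATE t SET c1 = 'B', c2 = 'A' WHERE id = 1")

rows = conn.execute("SELECT id, c1, c2 FROM t ORDER BY id").fetchall()
```

Note that no ordering of the two original updates works here; the conflict depends on row contents, so a static table-level sort cannot resolve it.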

To the best of my knowledge, this is the correct answer for your question #1, with the limitations that you allowed (the PK stuff). As long as they cannot mess with the PK, the order of the original operations shouldn't matter. The UC stuff is much harder and more ambiguous however...

RBarryYoung 2009-07-04 22:23:06

Also, most SQL DBs do not permit circular FK dependencies, and those that do have no end of problems with them.

RBarryYoung 2009-07-04 22:24:14

I should mention that the Topological Sort is not sorting the rows, but rather the tables themselves, based solely on their static dependencies (FKs) and not on any data content of the rows. This approach also has the advantage of minimizing the possibility of deadlocks.

RBarryYoung 2009-07-04 22:46:20

This gets soooo close but fails on the cases that are a subset of question #1 that caused me to post here initially! I am going to revise the original question with a walkthrough of the dilemma.

binarycoder 2009-07-04 22:55:59

Answer 5

+1 A:

No, I don't find it fascinating. I don't find the quadrature-of-the-circle-problem fascinating either, and on that topic too there do exist people who strongly, or even violently, disagree with me.

When you say that "it has practical applications", do you mean to say that "the solution to this problem has practical applications"? I suggest that a solution that does not exist cannot, by definition, have "practical applications". (And I do suggest that the solution you're seeking does not exist, just like the quadrature of the circle.)

You argued something about "when other apps hang cascading deletes ...". Your initial problem statement contained no mention whatsoever of "other apps".

The problem I find way more fascinating is "how to build a DBMS that is good enough so that programmers will no longer be facing these kinds of problem, and no longer be forced to ask these kinds of question". Such a DBMS supports Multiple Assignment.

I wish you the best of luck.

2009-07-05 11:10:25

Erwin, you should have edited your original question instead of adding a new one. I recommend you copy this to the bottom of your original question, then delete it.

John Saunders 2009-07-05 11:56:29

Answer 6

+3 A:

Why are you even trying to do this? The correct way to do it is to get the database engine to defer the checking of the constraints until the transaction is committed.

The problem that you pose is intractable in the general case. If you consider just the transitive closure of the foreign keys in the rows you want to update in the database, it is only possible to solve this where the graph describes a tree. If there is a cycle in the graph and you can break it by replacing a foreign key value with NULL, then you can rewrite one SQL statement and add another that updates the column later. If you can't replace a key value with NULL, then it can't be solved.
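
As a rough illustration of the NULL trick, here is a Python/SQLite sketch with two tables that reference each other. The schema (`employee`, `department`) is hypothetical; the point is only that the nullable FK lets one INSERT go in with NULL and be patched by a later UPDATE.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
# employee.dept_id is nullable, department.manager_id is not
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, "
             "dept_id INTEGER REFERENCES department(id))")
conn.execute("CREATE TABLE department (id INTEGER PRIMARY KEY, "
             "manager_id INTEGER NOT NULL REFERENCES employee(id))")

# Inserting the employee with its final dept_id fails: department 10
# does not exist yet, and it cannot exist before its manager does.
try:
    conn.execute("INSERT INTO employee (id, dept_id) VALUES (1, 10)")
    direct_failed = False
except sqlite3.IntegrityError:
    direct_failed = True

# Break the cycle: insert with NULL, insert the other side, then patch.
conn.execute("BEGIN")
conn.execute("INSERT INTO employee (id, dept_id) VALUES (1, NULL)")
conn.execute("INSERT INTO department (id, manager_id) VALUES (10, 1)")
conn.execute("UPDATE employee SET dept_id = 10 WHERE id = 1")
conn.execute("COMMIT")

employee = conn.execute("SELECT id, dept_id FROM employee").fetchall()
department = conn.execute("SELECT id, manager_id FROM department").fetchall()
```

If both columns were NOT NULL, there would be no statement that could go first, which is the unsolvable case described above (short of deferring the constraint check).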

As I say, the correct way to do this is to turn off the constraints until all of the SQL has been run and then turn them back on for the commit. The commit will fail if the constraints aren't met. Postgres (for example) has a feature which makes this very easy.

KayEss 2009-07-10 03:22:24
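
To show what deferred checking looks like in practice, here is a small sketch using Python's sqlite3 module, chosen only because it is self-contained. SQLite supports `DEFERRABLE INITIALLY DEFERRED` foreign keys; in PostgreSQL the corresponding mechanism is declaring constraints `DEFERRABLE` and issuing `SET CONSTRAINTS ALL DEFERRED` inside the transaction. The `parent` and `child` tables are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE child (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER NOT NULL
        REFERENCES parent(id) DEFERRABLE INITIALLY DEFERRED)""")

# Statements in the "wrong" order: the FK is only checked at COMMIT.
conn.execute("BEGIN")
conn.execute("INSERT INTO child VALUES (1, 100)")  # parent 100 not there yet
conn.execute("INSERT INTO parent VALUES (100)")
conn.execute("COMMIT")

# If the constraint is still violated at COMMIT, the commit itself fails.
conn.execute("BEGIN")
conn.execute("INSERT INTO child VALUES (2, 999)")  # parent 999 never arrives
try:
    conn.execute("COMMIT")
    commit_failed = False
except sqlite3.IntegrityError:
    commit_failed = True
    conn.execute("ROLLBACK")  # the failed transaction is still open

child_rows = conn.execute(
    "SELECT id, parent_id FROM child ORDER BY id").fetchall()
```

With deferral in place, the order of statements inside the transaction no longer matters for foreign keys, so no dependency sort is needed for them.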

Algorithms for Updating Relational Data
