views:

437

answers:

4

I'm reverse engineering the relationships between a medium-sized number of tables (50+) in an Oracle database where there are no foreign keys defined between the tables. I can count (somewhat) on being able to match column names across tables. For example, column name "SomeDescriptiveName" is probably the same across the set of tables.

What I would like to be able to do is to find a better way of extracting some set of relationships based on those matching column names than manually going through the tables one by one. I could do something with Java DatabaseMetaData methods but it seems like this is one of those tasks that someone has probably had to script before. Maybe extract the columns names with Perl or some other scripting lang, use the column names as a hash key and add tables to an array pointed to by the hash key?

Anyone have any tips or suggestions that might make this simpler or provide a good starting point? It's an ugly need, if foreign keys had already been defined, understanding the relationships would have been much easier.

Thanks.

+1  A: 

You pretty much wrote the answer in your question.

my %column_tables;
foreach my $table (@tables) {
    foreach my $column ($table->columns) {
        push @{$column_tables[$column]}, $table;
    }
}
print "Likely foreign key relationships:\n";
foreach my $column (keys %column_tables) {
    my @tables = @{$column_tables[$column]};
    next
        if @tables < 2;
    print $column, ': ';
    foreach my $table (@tables) {
        print $table->name, ' ';
    }
    print "\n";
}
chaos
Of course, this would work as long as all columns which are logically linked have the same name. Otherwise, you'd have to do an extensive analysis of the data, to try to get the relationships... a tough task I wouldn't want, indeed.
Joe Pineda
+1  A: 

You can use a combination of three (or four) approaches, depending on how obfuscated the schema is:

  • dynamic methods
    • observation:
      • enable tracing in the RDBMS (or ODBC layer), then
      • perform various activities in the application (ideally record creation), then
      • identify which tables were altered in tight sequence, and with what column-value pairs
      • values occurring in more than one column during the sequence interval may indicate a foreign key relationship
  • static methods (just analyzing existing data, no need to have a running application)
    • nomenclature: try to infer relationships from column names
    • statistical: look at minimum/maximum (and possibly the average) of unique values in all numerical columns, and attempt to perform a match
    • code reverse engineering: your last resort (unless dealing with scripts) - not for the faint of heart :)

Cheers, V.

vladr
A: 

My strategy would be to use the Oracle system catalog to find columns that are the same in column name and data type but different in table name. Also which one of the columns is part of a table's primary or unique key.

Here's a query that may be close to doing this, but I don't have an Oracle instance handy to test it:

SELECT col1.table_name || '.' || col1.column_name || ' -> ' 
    || col2.table_name || '.' || col2.column_name
FROM all_tab_columns col1 
  JOIN all_tab_columns col2
    ON (col1.column_name = col2.column_name 
    AND col1.data_type = col2.data_type)
  JOIN all_cons_columns cc
    ON (col2.table_name = cc.table_name 
    AND col2.column_name = cc.column_name)
  JOIN all_constraints con
    ON (cc.constraint_name = con.constraint_name 
    AND cc.table_name = con.table_name 
    AND con.constraint_type IN ('P', 'U')
WHERE col1.table_name != col2.table_name;

Of course this won't get any case of columns that are related but have different names.

Bill Karwin
A: 

This is an interesting question. The approach I took was a brute force search for columns that matched types and values for a small sample set. You'll probably have to tweak the heuristics to provide good results for your schema. I ran this on a schema that didn't use auto-incremented keys and it worked well. The code is written for MySQL, but it's very easy to adapt to Oracle.

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:mysql:host=localhost;database=SCHEMA", "USER", "PASS");

my @list;
foreach my $table (show_tables()) {
    foreach my $column (show_columns($table)) {
        push @list, { table => $table, column => $column };
    }
}

foreach my $m (@list) {
    my @match;
    foreach my $f (@list) {
        if (($m->{table} ne $f->{table}) &&
            ($m->{column}{type} eq $f->{column}{type}) &&
            (samples_found($m->{table}, $m->{column}{name}, $f->{column}{samples})))
        {
            # For better confidence, add other heuristics such as
            # joining the tables and verifying that every value
            # appears in the master. Also it may be useful to exclude
            # columns in large tables without an index although that
            # heuristic may fail for composite keys.
            #
            # Heuristics such as columns having the same name are too
            # brittle for many of the schemas I've worked with. It may
            # be too much to even require identical types.

            push @match, "$f->{table}.$f->{column}{name}";
        }
    }
    if (@match) {
        print "$m->{table}.$m->{column}{name} $m->{column}{type} <-- @match\n";
    }
}

$dbh->disconnect();

exit;

sub show_tables {
    my $result = query("show tables");
    return ($result) ? @$result : ();
}

sub show_columns {
    my ($table) = @_;
    my $result = query("desc $table");
    my @columns;
    if ($result) {
        @columns = map {
            { name => $_->[0],
              type => $_->[1],
              samples => query("select distinct $_->[0] from $table limit 10") }
        } @$result;
    }
    return @columns;
}

sub samples_found {
    my ($table, $column, $samples) = @_;
    foreach my $v (@$samples) {
        my $result = query("select count(1) from $table where $column=?", $v);
        if (!$result || $result->[0] == 0) {
            return 0;
        }
    }
    return 1;
}

sub query {
    my ($sql, @binding) = @_;
    my $result = $dbh->selectall_arrayref($sql, undef, @binding);
    if ($result && $result->[0] && @{$result->[0]} == 1) {
        foreach my $row (@$result) {
            $row = $row->[0];
        }
    }
    return $result;
}
Ken Fox