views:

58

answers:

2

I'm looking for Python code that can take tabular data and establish which normal form(s) it is in (if any) and show any functional dependencies, etc.

A: 

I think that this code will be really hard to implement.

Look at this question about normalization algorithms of databases.

Pedro Ghilardi
+2  A: 

There are logical tests for "normalization". However, they're not trivial exercises in programming; they're relationships in the metadata that are imposed on the data. They require "thinking".

1NF -- no repeating groups. How does one identify a "repeating group"? It would be an array structure imposed on the columns of a table. How is that done? SQL doesn't provide a mechanism, so you'd have to look at the column names to check for a "pattern". COL_1, COL_2, COL_3, for example.

1NF -- consistent layout of rows. Duh. SQL imposes this by the very nature of table definition.

2NF -- data in a row depends on the key. You'd have to do a procedure something like this.

For each non-key column:
   Query distinct pairs (Key and the non-key column)
   Do all non-key values depend in a consistent way on a key value?
   Can you build a simple dict mapping non-key to key?

The full algorithm is here: http://en.wikipedia.org/wiki/Relational_model#Set-theoretic_formulation

3NF -- data in a row depends ONLY on the key. This is worse, because you have to compare all combinations of non-key columns against all combinations of non-key columns to be sure that there were no non-key dependencies among the values.

4NF and 5NF confuse me, so I'll stop here.

My point is that -- theoretically -- you can do it. Practically, it's a lot of complex permutations of data to assert that the normal form relationships actually hold.

It's much, much easier to have a hypothesis about a specific violation and probe just that issue with some SQL queries and some thinking.

The formal math is here:

http://en.wikipedia.org/wiki/Relational_model

S.Lott
I'm aware of what the normal forms are, I just wondered if there was pre-existing code in Python before I write my own.
James Tauber
@James Tauber: "I'm aware of what the normal forms are..." wasn't clear from your question. Sorry for all the redundancy. But, the question appeared uninformed, since it's usually done through manual analysis of the schema not automated grinding of the data. If you already know, then you're aware that the algorithms can be intractably slow for large datasets.
S.Lott