views:

714

answers:

5

I'm interested in learning about tools that operate like this:

Given a database schema and some data, suggest whether the schema is likely structured in any particular normal form, and then tell how the schema might be factored to produce further normalization.

Basically, a static analysis tool for database schema design.

Like other static analysis tools, such a database tool need not generate perfect results (I doubt such a "perfect" tool is computationally feasible), or applicable to all database systems, or free/open source, or anything else. The tool need not be stand-alone; it could be bundled as some sophisticated IDE feature. I'm just wondering what's out there.

A: 

It can't be done.

Normalization is defined in terms of functional dependencies which

  1. Cannot be expressed in SQL
  2. Cannot be inferred from the data

You can look at a table such as

A | B
--+--
1 | 1
1 | 2

and deduce that B is not dependent on A (because it has two distinct values for a single value of A), but you can never deduce that a dependency does exist, e.g. A may or may not depend on B.
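The refutation half of this can be mechanized. A minimal Python sketch (my own illustration, not part of the original answer): a scan can disprove a functional dependency from data, but a passing result only means the sample is *consistent* with the dependency, never that it holds in the intended domain.

```python
# Sketch: refuting (never proving) a functional dependency lhs -> rhs
# from sample data. Rows are dicts mapping column name to value.

def fd_holds_in_sample(rows, lhs, rhs):
    """Return False iff the data refutes lhs -> rhs.
    True only means 'not refuted by this sample'."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        val = tuple(row[c] for c in rhs)
        if key in seen and seen[key] != val:
            return False  # two distinct rhs values for one lhs value
        seen[key] = val
    return True

table = [
    {"A": 1, "B": 1},
    {"A": 1, "B": 2},
]
print(fd_holds_in_sample(table, ["A"], ["B"]))  # False: B is not dependent on A
print(fd_holds_in_sample(table, ["B"], ["A"]))  # True: consistent with B -> A, but no proof
```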

finnw
Keep in mind that I'm not looking for a perfectly correct tool; in fact, I qualified my question to exclude that possibility. Maybe the tool could consider an encoding of such functional dependencies as well.
Greg Mattes
But if you have a table with 1 million rows where B is 2 when A is 1 for 99.9999% of the rows (maybe there's an error in there, and all B values should be 2), and B is never 2 for any other value of A, you might like a tool that points that out so that you can consider it.
Greg Mattes
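Greg's heuristic idea can be sketched as well: rather than a yes/no check, measure how *close* a table comes to satisfying a candidate dependency. The illustrative function below (my own code, with made-up names) reports the fraction of rows that fit the dependency if each left-hand value keeps only its majority right-hand value:

```python
from collections import Counter, defaultdict

def approx_fd_confidence(rows, lhs, rhs):
    """Fraction of rows kept if, for each lhs value, only the majority
    rhs value is retained (1.0 means the FD holds exactly on this sample)."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[c] for c in lhs)
        groups[key][tuple(row[c] for c in rhs)] += 1
    kept = sum(max(c.values()) for c in groups.values())
    return kept / len(rows)

# 9999 rows where A=1, B=2; one suspect row where A=1, B=3; one other A value
rows = [{"A": 1, "B": 2}] * 9999 + [{"A": 1, "B": 3}] + [{"A": 4, "B": 5}]
print(round(approx_fd_confidence(rows, ["A"], ["B"]), 4))  # 0.9999
```

A tool could flag dependencies whose confidence is very high but not 1.0, exactly the "maybe there's an error in there" case.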
+2  A: 

I think that's hard to implement.

Like, a table

Id | Name | Surname | SSN

is in 1NF, and

Id | Name | Surname | Mobile

is not, but you can tell it neither from the design nor from the data, only from the field name.

I once saw an ultrasonography database that actually had both GENDER and LMP (last menstrual period) in one table.

Quassnoi
Both your examples are in 1NF. Do you mean 2NF?
finnw
It's not just hard, it's probably computationally impossible. But you can imagine a scan that determines that any non-NULL LMP entry is highly likely to be related to a 'Female' GENDER entry (some data entry error could have checked the 'Male' box by accident). Think heuristics here...
Greg Mattes
@finnw: one can have more than one mobile number. It's a classical example from Wikipedia :)
Quassnoi
@Quassnoi, That's a 2NF violation, because (Id, Mobile) is the only candidate key but Name and Surname depend on Id alone. It's still 1NF, until you do something like making Mobile a comma-separated string, but then it would probably be called Mobiles (plural)
finnw
+1  A: 

A tool like you describe, one that tried to analyze your data and metadata and advise you of possible non-normalized structure, would give wrong suggestions so frequently that it would make Clippy, the Microsoft Office assistant, seem like an indispensable writer's aid.

The process of normalization involves mapping software requirements to a logical data model. An analysis tool cannot know your data requirements any better than you do. So it can't infer from an incorrect database design which parts of it are wrong.

I understand that you have qualified the question and have limited expectations of the tool's capability. But you would want it to be useful for ordinary, everyday tasks, and it would not be reliable in even the simplest cases.

Compare to static code analysis tools. Suppose you write an application and deliver it to your client, and the client says "why can't I send email from this app?" How would a static code analysis tool tell you that you omitted a desired feature? It can't know those requirements.

Likewise, how would a database normalization analysis tool know whether it's appropriate for the UserAccount table to have a single MobilePhoneNumber attribute, or if it would be more appropriate to separate phone numbers into another table so a single user could have multiple phones listed?

Bill Karwin
wrt phone numbers, a tool could detect tables with several phone number columns, either by detecting the string "phonenumber" is repeated in column names, or by detecting that the data is of the same format, length, etc.
Greg Mattes
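For what it's worth, the column-name heuristic Greg describes is easy to sketch. This is illustrative code of my own, assuming a naming pattern; a real tool would also compare data formats and lengths as he suggests:

```python
import re

def suspicious_repeated_columns(column_names, pattern=r"(phone|mobile)"):
    """Flag tables where several columns match the same concept pattern --
    a hint (not proof) that the attribute belongs in a child table."""
    hits = [c for c in column_names if re.search(pattern, c, re.IGNORECASE)]
    return hits if len(hits) > 1 else []

print(suspicious_repeated_columns(
    ["Id", "Name", "HomePhone", "WorkPhone", "MobilePhone"]))
# ['HomePhone', 'WorkPhone', 'MobilePhone']
```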
Any such tools or techniques would have a high rate of "false positives." It would be more accurate, reliable, and cost-effective to have a human being do the design work.
Bill Karwin
+2  A: 

I have thought about this problem too. It is theoretically possible, and there are some research papers on the topic. There used to be a pretty cool tool at www.dbtools.cs.cornell.edu, developed by the well-known author Raghu Ramakrishnan, who is currently at Yahoo Research. You can refer to the following papers for more information:

  1. Diederich, J. and Milton, J. (1988), "New Methods and Fast Algorithms for Database Normalization", ACM Transactions on Database Systems, 13(3), 339-365.
  2. Bernstein, P. A. (1976), "Synthesizing Third Normal Form Relations from Functional Dependencies", ACM Transactions on Database Systems, 1(4), 277-298.
  3. Yazici, A. and Karakaya, Z. (2007), "JMathNorm: A Database Normalization Tool Using Mathematica", Lecture Notes in Computer Science, Vol. 4488, Proceedings of the 7th International Conference on Computational Science, Part II.

The third link is very interesting. Here is the abstract of the paper:

This paper is about designing a complete interactive tool, named JMathNorm, for relational database (RDB) normalization using Mathematica. It is an extension of the prototype developed by the same authors [1] with the inclusion of Second Normal Form (2NF) and Boyce-Codd Normal Form (BCNF) in addition to the existing Third Normal Form (3NF) module. The tool developed in this study is complete and can be used for real-time database design as well as an aid in teaching fundamental concepts of DB normalization to students with limited mathematical background. JMathNorm also supports interactive use of modules for experimenting with the fundamental set operations such as closure and full closure, together with modules to obtain the minimal cover of the functional dependency set and to test an attribute for a candidate key. JMathNorm's GUI interface is written in Java and utilizes Mathematica's JLink facility to drive the Mathematica kernel.
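The "closure" and "candidate key" modules the abstract mentions are standard textbook algorithms, and they are easy to express once the FDs are given explicitly. A small Python sketch of both (my own illustration, not JMathNorm's code; each FD is a pair of attribute sets):

```python
def closure(attrs, fds):
    """Closure of a set of attributes under a set of FDs (lhs_set, rhs_set)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_candidate_key(attrs, all_attrs, fds):
    """attrs is a candidate key iff its closure covers all attributes
    (superkey) and no proper subset's closure does (minimality)."""
    if closure(attrs, fds) != all_attrs:
        return False
    return all(closure(attrs - {a}, fds) != all_attrs for a in attrs)

fds = [({"A"}, {"B"}), ({"B"}, {"C"})]
print(sorted(closure({"A"}, fds)))                    # ['A', 'B', 'C']
print(is_candidate_key({"A"}, {"A", "B", "C"}, fds))  # True
```

Note that the hard part, as other answers point out, is obtaining the FDs in the first place; the algorithms themselves are mechanical.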

I am also very interested in an answer to this question. If anyone has come across a tool, please let us know!

bkm
Great information - Thanks!
Greg Mattes
About 20 years ago I implemented the process described in the Bernstein paper as an undergraduate project. Input to the program was the set of Functional Dependencies (FDs) for your database; the output was a normalized schema. The problem with this approach is that people who understand FDs have no trouble developing a normalized schema, and those who don't understand FDs cannot put together the inputs for the program anyway! Unfortunately, the only copy I have left is on a 5.25-inch floppy disk, and I have no way to read it.
NealB
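For readers curious what such a project looks like, here is a deliberately simplified Python sketch of Bernstein-style 3NF synthesis (my own illustration). It assumes the input FDs already form a minimal cover, a step the full algorithm would perform first:

```python
from collections import defaultdict

def closure(attrs, fds):
    """Closure of a set of attributes under FDs given as (lhs_set, rhs_set)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def synthesize_3nf(all_attrs, fds):
    """Simplified Bernstein synthesis: one relation per distinct FD
    left-hand side, plus a key relation if no relation contains a key.
    Assumes fds is already a minimal cover."""
    all_attrs = set(all_attrs)
    groups = defaultdict(set)
    for lhs, rhs in fds:
        groups[frozenset(lhs)] |= set(rhs)
    relations = [set(lhs) | rhs for lhs, rhs in groups.items()]
    if not any(closure(r, fds) == all_attrs for r in relations):
        key = set(all_attrs)  # greedily shrink the full set to one key
        for a in sorted(all_attrs):
            if closure(key - {a}, fds) == all_attrs:
                key -= {a}
        relations.append(key)
    return relations

# The Id/Name/Surname/Mobile example from an earlier answer:
fds = [({"Id"}, {"Name"}), ({"Id"}, {"Surname"})]
rels = synthesize_3nf({"Id", "Name", "Surname", "Mobile"}, fds)
print([sorted(r) for r in rels])
# [['Id', 'Name', 'Surname'], ['Id', 'Mobile']]
```

Note how it splits the earlier answer's non-2NF table into exactly the two relations the comments argued for.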
+3  A: 

It can be done, and there are at least two commercially available tools that can do normalization for you: GeneXus and DeKlarit. They use a process named NormalizationBySynthesis.

Luxspes