Hello all-knowing co-stackers.

Our master's thesis project is to create a database schema analyzer. As a foundation for this, we are working on quantifying bad database design.

Our supervisor has tasked us with analyzing a real-world schema of our choosing, so that we can identify several design issues. These issues are to be used as a starting point for the schema analyzer.

Finding a good schema is a bit difficult, because we do not want a schema that is well designed in all aspects, but rather one that is more "rare to medium".

We have already scheduled the following schemas for analysis: Wikimedia, Moodle and Drupal. We are not sure which category each fits into. It is not necessary that the schema be open source.

The database engine used is not important, though we would like to focus on SQL Server, PostgreSQL and Oracle.

Thanks to all in advance. :)

EDIT: Great responses already! I will surely work my way through them. For now, literature will be deferred, as this task is supposed to give us real-world examples that can be used in the thesis, i.e. "Design X is perceived by us as bad design, which our analyzer identifies and suggests improvements to", instead of us coming up with contrived examples.

EDIT2: Thank you all for your responses. I have accepted the answer which we, in the end, found most beneficial. I will update this post when we have some kind of tool ready :)

+1  A: 

vBulletin has a really bad database schema.

Yorirou
It may be me being a lame duck, but I can't find the vBulletin schema. Do you know if it is possible to get it without buying a license?
Benjamin
Can't you download a trial version? I don't know how to get the schema; I had to extract data from an existing system for a customer. It might also be a good idea to send an email to Jelsoft. Sometimes it works :)
Yorirou
We found some diagrams online, which we used instead. :) Btw, imagine how the email would have been worded: "Dear vBulletin, we heard you have a really bad DB design, plz share."
Benjamin
+4  A: 

Hi,

I'm working on a project that includes a geographical information system, and in my opinion these designs are often "medium" to "rare".

Here are some examples:

1) Geonames.org

You can find the data and the schema here: http://download.geonames.org/export/dump/ (scroll down to the bottom of the page for the schema; it's in plain text on the site!)

It'd be interesting to see how this DB design performs with such a HUGE amount of data!

2) OpenGeoDB

This one is very popular in German-speaking countries (Germany, Austria, Switzerland) because it's a database containing nearly every city/town/village in the German-speaking region, with zip code, name, hierarchy and coordinates.

This one comes with a .sql schema, and the table fields are in English, so this shouldn't be a problem.

http://fa-technik.adfc.de/code/opengeodb/

The interesting thing in both examples is how they managed the hierarchy of entities like Country -> State -> County -> City -> Village etc.
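
For reference, here is a minimal sketch of how such a hierarchy is commonly modelled with a self-referencing (adjacency-list) table. The table and column names below are invented for illustration; this is not the actual Geonames or OpenGeoDB DDL:

    -- Hypothetical adjacency-list table for a place hierarchy; each row
    -- points at its parent, so Country -> State -> County -> City -> Village
    -- all live in one table, distinguished by a level column.
    CREATE TABLE place (
        place_id  INTEGER PRIMARY KEY,
        name      VARCHAR(200) NOT NULL,
        level     VARCHAR(20)  NOT NULL,   -- 'country', 'state', 'county', 'city', 'village'
        parent_id INTEGER REFERENCES place (place_id)
    );

    -- Walking up the hierarchy then takes a recursive query (PostgreSQL syntax),
    -- which is where the performance questions start with data sets this size.
    WITH RECURSIVE ancestors AS (
        SELECT place_id, name, level, parent_id
        FROM place
        WHERE place_id = 42                -- some village
        UNION ALL
        SELECT p.place_id, p.name, p.level, p.parent_id
        FROM place p
        JOIN ancestors a ON p.place_id = a.parent_id
    )
    SELECT * FROM ancestors;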

PS: Maybe you could judge my DB design too ;) http://stackoverflow.com/questions/3686516/db-schema-of-a-role-based-access-control

sled
+5  A: 

Check out the Dell DVD Store; you can use it for free.

The Dell DVD Store is an open source simulation of an online ecommerce site, with implementations in Microsoft SQL Server, Oracle and MySQL, along with driver programs and web applications.

Bill Karwin has written a great book about bad designs: SQL Antipatterns

Frank Heikens
thx for the book hint! :)
sled
+3  A: 

"we are working on quantifying bad database design."

It seems to me that you are developing a model, process, or apparatus that takes a relational schema as input and scores it for quality.

I invite you to ponder the following:

Can a physical schema be "bad" while the logical schema is nonetheless "extremely good"? Do you intend to distinguish properly between "logical schema" and "physical schema"? How do you propose to achieve that?

How do you decide that a certain aspect of physical design is "bad"? Take, for example, the absence of some index. If the relvar that the "supposedly desirable index" would be defined on is itself constrained to be a singleton, then what detrimental effects would the absence of that index cause for the system? If there are no such detrimental effects, then what grounds are there for qualifying the absence of such an index as "bad"? (A concrete sketch of such a singleton table follows below.)

How do you decide that a certain aspect of logical design is "bad"? Choices in logical design are made as a consequence of what the actual requirements are. How can you make any judgment whatsoever about a logical design without a formalized, machine-readable way of specifying what the actual requirements are?
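
To make the singleton example concrete, here is a minimal sketch in generic SQL (not taken from any schema mentioned in the question) of a table constrained to hold at most one row, where an additional index cannot help:

    -- A singleton table: the CHECK on the primary key forces the only
    -- possible key value to be 1, so the table can never hold more than one row.
    CREATE TABLE app_settings (
        lock_id        INTEGER PRIMARY KEY CHECK (lock_id = 1),
        maintenance_on CHAR(1) NOT NULL,   -- 'Y'/'N'; use BOOLEAN where supported
        max_sessions   INTEGER NOT NULL
    );

    -- An analyzer that mechanically flags "no index on max_sessions" would be
    -- wrong here: with at most one row, any extra index is pure overhead.
    -- CREATE INDEX app_settings_sessions_ix ON app_settings (max_sessions);  -- pointless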

Erwin Smout
+3  A: 

Wow - you have an ambitious project ahead of you. Determining what constitutes good database design may be impossible, except in terms of broadly understood principles and guidelines.

Here are a few ideas that come to mind:

I work for a company that does database management for several large retail companies. We have custom databases designed for each of these companies, according to how they intend for us to use the data (for direct mail, email campaigns, etc.) and what kind of analysis and selection parameters they like to use. For example, a company that sells musical equipment in stores and online will want to distinguish between walk-in and online customers, categorize customers according to the type of items they buy (drums, guitars, microphones, keyboards, recording equipment, amplifiers, etc.), and keep track of how much they spent, and what they bought, over the past 6 months or the past year. They use this information to decide who will receive catalogs in the mail. These mailings are very expensive, maybe one or two dollars per customer, so the company wants to mail catalogs only to those most likely to buy something. They may have 15 million customers in their database, but only 3 million buy drums, and only 750,000 have purchased anything in the past year.
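
To give a flavour of what those selections look like in practice (the table and column names here are made up for illustration; this is not our customers' actual schema), a typical mailing selection is roughly:

    -- Illustrative catalog-mailing selection: customers who bought drums in
    -- the past year, ranked by spend (date arithmetic shown in PostgreSQL syntax).
    SELECT c.customer_id,
           c.channel,                          -- 'walk-in' vs 'online'
           SUM(o.amount) AS spend_last_year
    FROM customer c
    JOIN purchase o ON o.customer_id = c.customer_id
    WHERE o.category = 'drums'
      AND o.purchase_date >= CURRENT_DATE - INTERVAL '1 year'
    GROUP BY c.customer_id, c.channel
    ORDER BY spend_last_year DESC;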

If you were to analyze the database we created, you would find many "work" tables that are used for specific selection purposes and that may not actually be properly designed according to database design principles. While the "main" tables are efficiently designed and have proper relationships and indexes, these "work" tables would make it appear that the entire database is poorly designed, when in reality the work tables may be used only a few times, or even just once, and we simply haven't gone in yet to clear them out or drop them. The work tables far outnumber the main tables in this particular database.

One also has to take into account the volume of the data being managed. A customer base of 10 million may generate 10 to 20 million transactions per week, or even per day. Sometimes, for manageability, this data has to be partitioned into tables by date range, with a view used to select data from the proper sub-table. This is efficient at such volumes, but it may appear repetitive to an automated analyzer.
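
A rough sketch of that pattern (the table and column names are invented, not one of our actual databases; newer engine versions offer native partitioning instead):

    -- Manual date-range partitioning: one physical table per month, each with
    -- a CHECK constraint documenting the range it holds...
    CREATE TABLE txn_2010_08 (
        txn_id      BIGINT PRIMARY KEY,
        customer_id BIGINT NOT NULL,
        txn_date    DATE   NOT NULL CHECK (txn_date >= DATE '2010-08-01'
                                       AND txn_date <  DATE '2010-09-01'),
        amount      DECIMAL(12,2) NOT NULL
    );

    CREATE TABLE txn_2010_09 (
        txn_id      BIGINT PRIMARY KEY,
        customer_id BIGINT NOT NULL,
        txn_date    DATE   NOT NULL CHECK (txn_date >= DATE '2010-09-01'
                                       AND txn_date <  DATE '2010-10-01'),
        amount      DECIMAL(12,2) NOT NULL
    );

    -- ...stitched back together by a view so queries see one logical table.
    -- To a naive analyzer the near-identical monthly tables look like duplication.
    CREATE VIEW txn AS
        SELECT * FROM txn_2010_08
        UNION ALL
        SELECT * FROM txn_2010_09;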

Your analyzer would need to be user-configurable before the analysis begins: some items must be skipped, while others may be absolutely critical.

Also, how does one analyze stored procedures, user-defined functions, etc.? I have seen some really ugly code that works quite efficiently. And some of the ugliest, most inefficient code was written for one-time use only.

OK, I am out of ideas for the moment. Good luck with your project.

James Carr
Is this some kind of OLAP cube software? :)
sled
A: 

If you can get ahold of it, the project management system Clarity has a horrible database design. I don't know if they have a trial version you can download.

HLGEM
A: 

Favorite bad database designs that I've seen in the wild:

  • a generic design with one table containing a varchar for what kind of entity it was, another varchar for a generated key field, and a blob field holding XML. Querying it was horrible (see the sketch after this list). Perpetrated by developers with a religious aversion to relational databases.

  • a table in a reporting database that stored ages instead of dates, so a stored procedure had to be run on a schedule to update the ages in order for anything to work. Terribly fragile, for no good reason.
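
For the first of these, here is a rough reconstruction of the shape of that "one table holds everything" design (column names invented from memory, query shown in PostgreSQL syntax), to show why querying it was so painful:

    -- Everything lives in one untyped table; the real attributes hide in an XML blob.
    CREATE TABLE entity (
        entity_key  VARCHAR(64) PRIMARY KEY,   -- generated key, stored as text
        entity_type VARCHAR(64) NOT NULL,      -- 'customer', 'order', 'invoice', ...
        payload_xml TEXT        NOT NULL       -- the actual attributes, as XML
    );

    -- "Find customers named Smith" now means parsing XML for every candidate row
    -- instead of using a plain indexed column (xpath() is PostgreSQL-specific).
    SELECT entity_key
    FROM entity
    WHERE entity_type = 'customer'
      AND (xpath('/customer/lastname/text()', payload_xml::xml))[1]::text = 'Smith';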

Nathan Hughes