views:

58

answers:

3

Good morning everyone,

Please read everything before answering.

I have sort of an opinion question but at the same time one that may have a right answer. I'm trying to develop a suite of products and want to make sure that since I'm doing it myself, that I do it right the first time. I've re-wrote the schema multiple times, each time thinking it was better. Then I could come across some new idea and it would either require a lot of work on the schema, or it would break my schema.

In college, I learned about "rationalizing" (I think that's the word they used, could be way off) a database and there are 5 levels. From what I remember, level 3 was most common. I know the practice was to make sure that data wasn't repeated and to do that, you had to break up tables into smaller tables. And depending on how far you broke it up, the higher the level was. Well, I don't know if I want the highest level, but I know I want it as efficient as I can get it. I've had 4 years of SQL Server 2000/2005/2008 and 2 years of Oracle, about 6 months with Informix (5+ years ago), a touch here or there with mySQL and about 6 months of Access. My preference is SQL Server, but I would like the schema to be as efficient on either platform.

Here's a psuedo schema layout of some of the tables, then I'll explain what I want to do.

Manufacturers
  ManufacturerID (Identity)
  ManufacturerName
  ManufacturerStreetAddress
  ManufacturerZipCodeID
  ...

ZipCodes
  ZipCodeID (Identity)
  ZipCode
  ZipCodeStateID
  ...

States
  StateID (Identity)
  StateName
  StateAbbreviation
  ...

Cities
  CityID (Identity)
  CityName
  CityStateID
  ...

I apologize for it only being a psuedo schema, but that's all I have right now as I'm doing the design on paper on break, but had a question before I got too far. What I want to do is make sure that everything ties to each other properly. My belief is that a Zip-Code belongs to a State and a City, but no City belongs to any ONE Zip-Code, it could have many. If I put the Zip-Code in the Manufacturers table, I want to be able to get the State and the City. But I don't want to use any IDs too many times in other tables. What I mean by that is having StateID in ZipCodes and Cities may be one too many times. A state can have multiple cities with the same name, and multiple states can have cities with the same names. But I'm not sure if I'd want a CityNames table and then a CityStates table (CityNameID and StateID). I am well aware that there are location databases out there for purchase, maybe some free, that I could use and wouldn't have to worry about this. However, I'd like to work on my understanding of this because I believe it would help me schema design wise in the future, but also because I would like to have the customability of the layout if anything needed to be changed.

Questions:

  1. Does that psuedo schema, as it is, seem correct or could it be better (opinion)?
  2. Is it called "rationalizing" the database, or something else (will vote up for right answer)? And how far is too far (opinion)
  3. There will also be a Users table, and other tables that will include addresses (Teams, Capitols, etc.), so would the psuedo schema, if it is correct in theory, be a good plan for a database like that (opinion)?

Thank you all for your time, I will vote up any answer that is thorough and coherent. Database experts or people with many years of database experience are preferred, but I will listen to all answers. Also, I'm not sure if this should be a community wiki, but I am not marking it as one right now. Thanks.

Update: Also, I forgot to mention that I know with "rationalizing" the database comes the need for joins and sometimes subqueries. I normally abuse LEFT OUTER JOINs, but what would be the most efficient way to tie these tables together to display an address, rather than performing 4 different queries? Thanks.

Update: Ok, now this might be too normalized or not normalized enough or at all, but could you guys tell me if you like this psuedo schema better?

Manufacturers
  ManufacturerID (Identity)
  ManufacturerName
  ManufacturerStreetAddress
  ManufacturerCCSZID --CCSZ (Country, City, State, Zip), needs a better name
  ...

ZipCodes
  ZipCodeID (Identity)
  ZipCode
  ...

States
  StateID (Identity)
  StateName
  StateAbbreviation
  ...

Cities
  CityID (Identity)
  CityName
  ...

Countries
  CountryID (Identity)
  CountryName
  CountryAbbreviation
  ...

CountryCityStateZipCodes
  CountryCityStateZipCodeID (Identity)
  CCSZCountryID
  CCSZStateID
  CCSZCityID
  CCSZZipCodeID

And to get an address, it would look like:

SELECT  M.ManufacturerStreetAddress,
        CN.CountryName,
        CN.CountryAbbreviation,
        S.StateName,
        S.StateAbbreviation,
        C.CityName,
        Z.ZipCode
FROM Manufacturers M
LEFT OUTER JOIN CountryCityStateZipCodes CCSZ ON CCSZ.CountryCityStateZipCodeID = M.ManufacturerCCSZID
LEFT OUTER JOIN Countries CN ON CN.CountryID = CCSZ.CCSZCountryID
LEFT OUTER JOIN States S ON S.StateID = CCSZ.CCSZStateID
LEFT OUTER JOIN Cities C ON C.CityID = CCSZ.CCSZCityID
LEFT OUTER JOIN ZipCodes Z ON Z.ZipCodeID = CCSZ.CCSZZipCodeID

Or maybe you guys know a better way to write that query. But regardless, does that look better than the first schema?

+1  A: 

I've always heard it called "normalization," but we're talking about the same thing.

The easiest thing may be to combine city, state, and zip into one table. You can even consider using the zip code itself as the key, although I can think of two reasons why you'd want to avoid that:

  1. Northeastern states have zip codes that begin with 0, which will be truncated if you make zip code a numerical field.
  2. If you use zip code as a key, you cannot have that zip in multiple times for multiple towns. Like you said, the post office cares more about the zip than the town name. But this setup would restrict you from searching on those individual towns later.

To search by city, state, or zip later on, just JOIN this table to the Manufacturers table. You're OK using an INNER JOIN - unless there are fields in the Manufacturers table where ManufacturerZipCodeID is blank, in which case you'll want a LEFT JOIN to get those to show as well.

JohnK813
Thank you for your answer. I knew it ended in "alization", lol. Yeah, I didn't even think about the 0's in the zip-codes, even though I could easily use the code or view to lead with 0's. Looking through the database manually would be confusing though. I will be manually putting everything in, so no Manufacturer should have a null ZipCode, but that's good knowledge to know (about the Joins). I'll have to contact the post office about the zips crossing state boundaries though, because I didn't think about that either. Thanks.
XstreamINsanity
+1  A: 

I don't have much of a problem with the way you have things setup. A state ID in zip code might be dangerous - it wouldn't surprise me to learn that there are zip codes which cross state boundaries, but I'm not sure about that.

You're going to do a lot of joins by storing state, city and zip code in separate tables, but having dealt with databases that stored the addresses without consistency measures, that's much more of a nightmare than a few joins. For example you end up with "NY" and "ny" and "Ny" and "New York" and "NewYork". So I think have the separate table for state, city and zips will pay off in the long run.

Wade Williams
Yeah, that's what I was thinking of. I was also thinking while I was waiting for answers, but have now decided against it, was to allow the users to enter in their City, State, Zip and such if it isn't in the database. That way it would build on it's own and I wouldn't have to manually do it all. I could give myself editing capabilities, but at the same time, it might take just as long as me putting it in myself. And I believe there may be a few zip codes that cross state borders. Thanks for the answer though.
XstreamINsanity
+1  A: 

I am not a database expert but, in my perspective, the given pseudo schema seems to be incorrect. Here's the explanation. Facts known from problems are :

  1. A state can have multiple cities.
  2. A state is unique
  3. A cities can have multiple Zip Codes
  4. City name may be equals to another city name.
  5. A Zip Code is unique

First, write down the uniques. So we construct these two raw tables :

STATE
---
State ID (PK)
State Name

ZIP
---
Zip ID (PK)
Zip Code (NK)

Then, a logical question arises. Knowing a Zip ID, how would we retrieve City ID? To answer it, we need to provide a link between Zip and City. Where should this link be put on? It's not in City table since from Fact#3 we know that a city can have many different Zip code. So it must be in ZIP table. This is our next version of ZIP table :

ZIP
---
Zip ID (PK)
Zip Code (NK)
City ID (FK)

Now, since we can "move" from Zip to City, we will discuss about City table. A City name can have same name with others. So we don't need to force it (City Name field) to be unique. So this is our first version of City table :

CITY
----
City ID (PK)
City Name

Again, same logical question arises. How do we move to State knowing a City? A link must be created somewhere between these two tables. Again, knowing the fact#4 cannot guarantee anything about uniqueness of city name. The link must be put on City table. So this is our next version of City table :

CITY
---
City ID (PK)
City Name
State ID (FK)

With this link, we can retrieve State correctly. Overall, we can move from Zip to City through City ID (provided in Zip table) and we can continue to move from City to State through State ID (provided in City table).

Rationalizing a database is good from Database perspective but can be considered "evil" in Programming perspective. Because it pushes programmer to write more and more classes. After all, "too far" can be defined as "the table becomes irrational". City Name table seems irrational since it's an attribute, not an entity. I will happily label "too far" if my Database Analyst create such an irrational table :) On the other hand, over-rationalizing database can greatly impact the database performance. From my experience, it will makes a query runs slower.

Concerning another problems like Users, Teams, Capitols, etc. I cannot say anything for now since i haven't seen the problem yet.

jancrot
+1 for thoroughness. There wouldn't be much difference between the other tables. They would just have addresses as well, but I'd want to tie their addresses to the afore mentioned tables just like with Manufacturers. However, I am trying to find an example of if a zip code is shared by two states. I know my wife has family in Toney, AL and Toney is right on the other side of the border in TN, so they share the city, I'm not sure about the Zip. If they do, then there would need to be two entries for that zip code, one for each city ID.
XstreamINsanity
A question I would have for everyone is if a query runs 15 milliseconds slower, or even as much as two seconds slower, is that too much of a cost? Or is that moreso a question for the user? And I guess you guys would need to know the size of the database as well.
XstreamINsanity
The general rule when designing databases is to normalize as much as you can. Then if you see areas that will be performance problems, de-normalize those areas for performance reasons. However, you shouldn't make guesses about performance and denormalize because you think something will be a problem. You should test your assumptions and de-normalize when testing proves performance would be enhanced by doing so.
Wade Williams
Thanks, that's what I was thinking as well. I really appreciate all the help guys.
XstreamINsanity