views: 66
answers: 3

I'm importing data into a future database that will have one static MyISAM table (it will only ever be read from). I chose MyISAM because, as far as I understand, it's faster for my requirements (I'm not very experienced with MySQL/SQL at all).

That table will have various columns such as ID, Name, Gender, Phone, Status... as well as Country, City and Street columns. The question is: should I create separate tables (e.g. Country: Country_ID, Country_Name) for the last three columns and refer to them in the main table by ID (i.e. normalize), or just store them as VARCHAR in the main table (with duplicates, obviously)?
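To make the two options concrete, here is roughly what I mean (table and column names are just placeholders, trimmed to a few columns):

    -- Option A: address columns stored as VARCHAR directly in the main table
    CREATE TABLE person_flat (
        id      INT UNSIGNED NOT NULL PRIMARY KEY,
        name    VARCHAR(100),
        country VARCHAR(50),
        city    VARCHAR(50),
        street  VARCHAR(100)
    ) ENGINE=MyISAM;

    -- Option B: Country (and likewise City/Street) normalized into lookup tables
    CREATE TABLE country (
        country_id   SMALLINT UNSIGNED NOT NULL PRIMARY KEY,
        country_name VARCHAR(50) NOT NULL
    ) ENGINE=MyISAM;

    CREATE TABLE person_normalized (
        id         INT UNSIGNED NOT NULL PRIMARY KEY,
        name       VARCHAR(100),
        country_id SMALLINT UNSIGNED NOT NULL,  -- refers to country.country_id
        city_id    INT UNSIGNED NOT NULL,       -- city/street tables would look like country
        street_id  INT UNSIGNED NOT NULL
    ) ENGINE=MyISAM;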

My primary concern is speed - since the table won't be written to, data integrity is not a priority. The only operations will be selecting a specific row or searching for rows that match certain criteria.

Would searching by the Country, City and/or Street columns (and possibly other columns in the same search) be faster if I simply use VARCHAR?

EDIT: The table has about 30 columns and about 10m rows.

+4  A: 

Searching can be faster if you normalize, since the database only has to compare an integer instead of a string. The table data will also be smaller, which speeds up searches because more of it can be loaded into memory at once.

If your tables are indexed correctly then it will be very fast either way - you probably won't notice a significant difference.
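For example, with the flat layout sketched in the question (hypothetical names), a composite index over the columns you filter on lets MySQL avoid a full scan:

    -- composite index on the search columns
    CREATE INDEX idx_location ON person_flat (country, city, street);

    -- this can use the index because it filters on the leftmost columns
    SELECT *
    FROM person_flat
    WHERE country = 'Germany'
      AND city    = 'Berlin';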

You might also want to look at full-text search if you find yourself writing LIKE '%foo%', as a query with a leading wildcard can't use a regular index and will result in a full table scan.
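As a rough sketch (MyISAM supports FULLTEXT indexes natively; the table and column names are the hypothetical ones from the question):

    -- full-text index on the street column
    CREATE FULLTEXT INDEX idx_street_ft ON person_flat (street);

    -- uses the full-text index
    SELECT * FROM person_flat
    WHERE MATCH(street) AGAINST('maple');

    -- by contrast, a leading wildcard cannot use a regular index
    -- and forces a full table scan
    SELECT * FROM person_flat
    WHERE street LIKE '%maple%';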

Mark Byers
+1 @hello MAKE SURE YOU INDEX YOUR TABLES PROPERLY!!!!
Byron Whitlock
I don't necessarily agree that it *will* be faster to search if the tables are normalized, but overall this is good advice.
David M
+1  A: 

I'll try to give you something more than the usual "It Depends" answer.

#1 - Everything is fast for small N - if you have less than 100,000 rows, just load it flat, index it as you need to and move on to something higher priority.

Keeping everything flat in one table is faster for reading everything (all columns), but to seek or search into it you usually need indexes. If your data is very large, with a lot of redundant City and Country information, it might be better to move those into separate tables with surrogate foreign keys - but there's no hard and fast rule.

This is why some kind of data modeling principles are almost always applied - either traditional normalized modeling (e.g. Entity-Relationship) or dimensional modeling (e.g. Kimball). The rules and methodologies in both cases are designed to help you model the data without having to anticipate every use case. Obviously, knowing all the usage patterns will bias your data model towards supporting them - a lot of aggregation and analysis, for instance, is a strong indicator for a denormalized dimensional model.

So it really depends a lot on your data profile (row width and row count) and usage patterns.

Cade Roux
I did forget to mention the "size" of the table. It's about 30 columns wide (column types vary, mostly VARCHAR) and has about 10m rows. So I'm guessing normalizing would be wiser.
hello
@hello Normalizing is generally good, but a dimensional approach can have some real benefits - particularly, I'm thinking of the "junk dimension" technique, which can make searching on combinations of codes/types/demographics very quick.
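Roughly, a junk dimension looks like this (a sketch with invented names - one row per combination of the low-cardinality attributes, and the main table stores a single key):

    CREATE TABLE junk_dim (
        junk_id TINYINT UNSIGNED NOT NULL PRIMARY KEY,
        gender  CHAR(1)     NOT NULL,
        status  VARCHAR(20) NOT NULL
    ) ENGINE=MyISAM;

    -- assuming the main table carries a junk_id column, a search on a
    -- combination of flags becomes a tiny lookup plus an indexed join
    SELECT p.*
    FROM main_table p
    JOIN junk_dim j ON j.junk_id = p.junk_id
    WHERE j.gender = 'F'
      AND j.status = 'active';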
Cade Roux
A: 

I don't have much more than the usual "It Depends" answer, unfortunately.

Go with as much normalization as you need for the searches you actually do. If you never actually search for people who live on Elm Street in Sacramento or on Maple Avenue in Denver, any effort to normalize those columns is pretty much wasted. Ordinarily you would normalize something like that to avoid update errors, but you've stated that data integrity is not a concern.

Watch your slow query log like a hawk! That will tell you what you need to normalize. Do EXPLAIN on those queries and determine whether you can add an index to improve it or whether you need to normalize.
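For instance (a sketch - assuming MySQL 5.1+ where the slow query log can be toggled at runtime; the threshold and names are just examples):

    -- log anything that takes longer than one second
    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 1;

    -- then inspect the plan of a query that shows up in the log
    EXPLAIN SELECT *
    FROM main_table
    WHERE country = 'Germany'
      AND city    = 'Berlin';
    -- key = NULL with a large "rows" estimate means a full table scan;
    -- that's the column (or combination) to index or normalize first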

I've worked with some data models that we called "hyper-normalized." They were in all the proper normal forms, but often for data that just didn't need it given how we used it. Those kinds of data models are hard to understand at a casual glance, and they can be very annoying.

David M