views:

214

answers:

3

I was wondering if there has been any research about how much of the data stored in a databases consists of string data. Also, how much of that string data is free-text data (i.e. completly unstructered) and how much of it consists of identifiers such as proper names. My intuitive feeling is that often the size of a record is mainly defined by some large varchar fields, for instance a simple table containing an event:

Column | Type        | Size
------------------------
ID     | Integer     | 4
Date   | Date        | 6
Event  | Varchar(50) | 50

Even though only one column is a string column (i.e 33% of the columns), this one column makes up for 80% of a record's size (the actual data stored my be smaller of course). From my experience a lot of tables have such form. It would be even more extreme, if Event was some free text field that can be up to 2000 characters for instance.

So does anybody have some hard facts about this in real-world databses? Or some reliable estimate on that?

+3  A: 

I don't know if it's exactly what you're looking for, but there are a few large datasets publicly available online that you could look at to analyze, and should be fairly representative.

For example, the entire wireless licensing data for the U.S. FCC is available here. That's probably fairly representative of a customer database, which tends a little heavier on the string side.

Another example is the Bureau of Labor Statistics Data. That's probably fairly representative of a statistical database, which tends a little heavier on the numerical side.

As far as how those stack up against every database in the world, you probably aren't going to be able to do any better than an educated guess.

Karl Bielefeldt
Yes, this goes in the right direction. However, I'm preferably looking for a study that has already done analysis on such data sets.
inflagranti
A: 

Its important to realise that since Event is a VARCHAR field rather than CHAR it only takes up as much space as the number of characters actually in the fields (the VAR here stands for Variable). There may be a byte stuck onto the end as a flag.

Infact the VARCHAR field is probably more efficient than the int and date fields since they will always take up 4 or 6 bytes whereas the VARCHAR field can take up as little as 2 bytes.

I dont know of any studies though.

Toby Allen
+1  A: 

It really depends on the database. 80% of a record being text sounds like a good estimate for an average.

It would help to know the background as to why you are doing this research. I'm not so sure that finding a good answer to your question will really help you solve a problem.

If you really need some sort of answer you could collect the data yourself. Set-up a form where people can upload data and ask developers to upload their schemas (perhaps as .sql dump).

andyuk