Could a hashing algorithm be used to save space in a database?

tags:

database

+2 Q:

I have been using Git a lot recently, and I quite like how Git avoids storing duplicate data by using a hashing function based on SHA-1. I was wondering whether current databases do something similar, or is this inefficient for some reason?
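For readers unfamiliar with the Git behaviour referred to here, the following is a minimal sketch of content-addressed storage. The `git_blob_id` function follows Git's blob object format (SHA-1 over a `blob <size>\0` header plus the content); the `ContentStore` class is a hypothetical in-memory stand-in for Git's object database, not Git's actual implementation.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object id Git assigns to a blob with this content."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ContentStore:
    """Hypothetical in-memory store keyed by content hash (illustration only)."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        key = git_blob_id(content)
        # Identical content maps to the same key, so it is stored only once.
        self._objects.setdefault(key, content)
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
first = store.put(b"same file contents\n")
second = store.put(b"same file contents\n")  # duplicate, not stored again
third = store.put(b"different contents\n")
```

Note that this only deduplicates byte-for-byte identical content; two inputs that are merely similar get unrelated keys and are stored separately.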

+2 A:

There is no need for this. Databases already have a good way of avoiding duplicating data - database normalization.

For example, imagine you have a column that can contain one of five different strings. Instead of storing one of these strings in each row, you should move the strings out into a separate table. Create a table with two columns: one holding the string values and the other serving as the primary key. You can now use a foreign key in your original table instead of storing the whole string.
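A minimal sketch of this lookup-table layout, assuming SQLite through Python's `sqlite3` module. The table names `status` and `orders` and the five status strings are hypothetical, chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE status (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE          -- each string is stored exactly once
);
CREATE TABLE orders (
    id        INTEGER PRIMARY KEY,
    status_id INTEGER NOT NULL REFERENCES status(id)  -- small key instead of the string
);
""")

# The five possible strings (hypothetical values).
statuses = ["new", "paid", "shipped", "delivered", "cancelled"]
conn.executemany("INSERT INTO status (name) VALUES (?)", [(s,) for s in statuses])


def status_id(name: str) -> int:
    """Look up the primary key of a status string."""
    row = conn.execute("SELECT id FROM status WHERE name = ?", (name,)).fetchone()
    return row[0]


# Rows in the main table carry only the integer key.
for name in ["new", "paid", "paid", "shipped", "paid"]:
    conn.execute("INSERT INTO orders (status_id) VALUES (?)", (status_id(name),))

# A join recovers the original strings when they are needed.
rows = conn.execute(
    "SELECT o.id, s.name FROM orders o JOIN status s ON s.id = o.status_id ORDER BY o.id"
).fetchall()
```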

Mark Byers 2010-05-01 08:47:28

+1 A:

I came up with a nice "reuse-based-on-hash" technique (though it's probably widely used already).

I computed the hash-code of all fields in the row, and then I used this hash-code as primary key.

When inserting, I simply used "INSERT IGNORE" (to suppress errors about duplicate primary keys). Either way, I could be sure that the row I wanted to insert was present in the database after the insertion.
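A minimal sketch of how such a scheme might look, under several assumptions: it uses SQLite, whose `INSERT OR IGNORE` plays the role of MySQL's `INSERT IGNORE`; the hash is SHA-256 over a JSON encoding of the fields, since the hash actually used is not stated; and the `address` table and its columns are hypothetical.

```python
import hashlib
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE address (
    row_hash TEXT PRIMARY KEY,   -- hash of all other fields
    street   TEXT NOT NULL,
    city     TEXT NOT NULL
)
""")


def row_hash(*fields: str) -> str:
    """Hash all fields of a row (assumed scheme: SHA-256 over a JSON list)."""
    # JSON encoding keeps field boundaries unambiguous: ("ab", "c") != ("a", "bc").
    encoded = json.dumps(fields, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def insert_once(street: str, city: str) -> str:
    """Insert the row unless a row with the same hash already exists."""
    key = row_hash(street, city)
    # OR IGNORE silently skips the insert when the primary key is already present.
    conn.execute(
        "INSERT OR IGNORE INTO address (row_hash, street, city) VALUES (?, ?, ?)",
        (key, street, city),
    )
    return key


k1 = insert_once("1 Main St", "Springfield")
k2 = insert_once("1 Main St", "Springfield")  # duplicate, ignored
k3 = insert_once("2 Oak Ave", "Springfield")
row_count = conn.execute("SELECT COUNT(*) FROM address").fetchone()[0]
```

One consequence of this design is that a genuine hash collision would also be ignored silently: a different row with the same hash would never be stored, and the caller would get back a key pointing at the wrong data.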

If this is a known concept I'd be glad to hear about it!

aioobe 2010-05-01 09:01:19

What kind of hash did you use? What did you do when two different fields had the same hash by chance?

Mark Byers 2010-05-01 09:02:42

Do you have anything about this on the internet? I would be interested to hear more.

Zubair 2010-05-01 09:04:08

@Mark Byers: That is of course an issue in the general case. However, I had a very limited set of possible value combinations (vastly outnumbered by the range of the hash function).

@Zubair: Actually, I don't. I'd like to find something about the technique, though.

aioobe 2010-05-01 09:11:48

@aioobe: Why did you choose this hash-based solution instead of an auto-increment PK and a unique index on the remaining fields?
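For comparison, a minimal sketch of the alternative raised in this comment, again assuming SQLite and a hypothetical `address` table. The database enforces uniqueness on the real columns, so hash collisions are not a concern, but getting the key back takes a second statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE address (
    id     INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    street TEXT NOT NULL,
    city   TEXT NOT NULL,
    UNIQUE (street, city)                      -- unique index on the remaining fields
)
""")


def get_or_create(street: str, city: str) -> int:
    """Return the id of the row, inserting it first if it does not exist."""
    # The unique index makes the duplicate insert a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO address (street, city) VALUES (?, ?)",
        (street, city),
    )
    # Unlike the hash-as-key variant, the id is not known up front,
    # so it has to be read back with a SELECT.
    row = conn.execute(
        "SELECT id FROM address WHERE street = ? AND city = ?",
        (street, city),
    ).fetchone()
    return row[0]


a = get_or_create("1 Main St", "Springfield")
b = get_or_create("1 Main St", "Springfield")  # same row, same id
c = get_or_create("2 Oak Ave", "Springfield")
total = conn.execute("SELECT COUNT(*) FROM address").fetchone()[0]
```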

Mark Byers 2010-05-01 09:18:15

I don't remember the exact context, but I suppose it at least saves me a select since I can blindly insert a row and be sure that the hash-value exists as primary key. (Perhaps it's not as smart as I thought it was ;)

aioobe 2010-05-01 09:31:16

Allow me to rephrase: what does Git do in case of hash collisions?

aioobe 2010-05-01 09:49:36

The more I think about it, the more my suggestion fits as an answer to the question. It avoids duplication by relying on hash-codes.

aioobe 2010-05-01 09:59:48

Yes, interesting. Do you know if others have tried this technique too?

Zubair 2010-05-01 13:28:58

No I don't. I've googled without success. If you find someone, please let me know :)

aioobe 2010-05-01 14:33:36
