I have been using Git a lot recently, and I quite like how it avoids duplicating similar data by using a hashing function based on SHA-1. I was wondering whether current databases do something similar, or whether this is inefficient for some reason.
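
For reference, here is a minimal sketch of the content-addressing idea in Python. The blob header format (`blob <size>\0`) is the one Git uses for file contents; the `store` dictionary and the `put` helper are hypothetical stand-ins for an object database, not Git's actual storage code.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    # Git hashes a small header plus the content: b"blob <size>\0<content>"
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Hypothetical in-memory object store keyed by the hash.
store = {}

def put(content: bytes) -> str:
    oid = git_blob_id(content)
    # Identical content always maps to the same key, so it is kept only once.
    store.setdefault(oid, content)
    return oid

first = put(b"same bytes\n")
second = put(b"same bytes\n")
other = put(b"different bytes\n")
```

Here `first` and `second` are the same ID and `store` ends up with two entries, since identical content is stored once under its hash.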

+2  A: 

There is no need for this. Databases already have a good way of avoiding duplicate data: database normalization.

For example, imagine you have a column that can contain one of five different strings. Instead of storing one of these strings in each row, move the strings out into a separate table. Create a table with two columns: one holding the string values and the other serving as the primary key. You can then store a foreign key in your original table instead of the whole string.
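
The layout described above could look roughly like this. It is a sketch using Python's built-in `sqlite3` module; the `status` and `orders` tables and their contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    -- Lookup table: each distinct string is stored exactly once.
    CREATE TABLE status (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    -- Original table: stores a small foreign key instead of the string.
    CREATE TABLE orders (
        id        INTEGER PRIMARY KEY,
        status_id INTEGER NOT NULL REFERENCES status(id)
    );
""")

conn.executemany("INSERT INTO status (name) VALUES (?)",
                 [("pending",), ("shipped",), ("delivered",)])
status_ids = dict(conn.execute("SELECT name, id FROM status"))

# Four orders, but the strings themselves live only in the status table.
conn.executemany("INSERT INTO orders (status_id) VALUES (?)",
                 [(status_ids[s],) for s in ["pending", "shipped", "pending", "pending"]])

# A join recovers the original strings when they are needed.
rows = conn.execute("""
    SELECT o.id, s.name
    FROM orders AS o JOIN status AS s ON s.id = o.status_id
    ORDER BY o.id
""").fetchall()
```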

Mark Byers
+1  A: 

I came up with a nice "reuse-based-on-hash" technique (it's probably widely used, though).

I computed a hash code over all the fields in the row and used that hash code as the primary key.

When inserting, I simply used "INSERT IGNORE" (to suppress errors about duplicate primary keys). Either way, I could be sure that the row I wanted to insert was present in the database after the insertion.
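
A minimal sketch of the idea, with several assumptions: it uses Python's `sqlite3` module, where the equivalent of MySQL's `INSERT IGNORE` is spelled `INSERT OR IGNORE`; SHA-1 is picked only as an example hash; and the `person` table and the `row_key` and `insert_person` helpers are hypothetical names.

```python
import hashlib
import sqlite3

def row_key(*fields) -> str:
    """Hash all fields of a row into one key (SHA-1 chosen as an example)."""
    h = hashlib.sha1()
    for field in fields:
        data = str(field).encode("utf-8")
        # Length prefix keeps ("ab", "c") and ("a", "bc") from hashing alike.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        hash       TEXT PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL
    )
""")

def insert_person(first: str, last: str) -> str:
    key = row_key(first, last)
    # A duplicate primary key is silently skipped instead of raising an error.
    conn.execute("INSERT OR IGNORE INTO person VALUES (?, ?, ?)", (key, first, last))
    return key

k1 = insert_person("Ada", "Lovelace")
k2 = insert_person("Ada", "Lovelace")  # same fields, same key, no new row
k3 = insert_person("Alan", "Turing")
count = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
```

After these three calls the table holds two rows. Note that if two different rows ever produced the same hash, the second one would be dropped silently, so this relies on collisions being practically impossible for the data at hand.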

If this is a known concept I'd be glad to hear about it!

aioobe
What kind of hash did you use? What did you do when two different fields had the same hash by chance?
Mark Byers
Is there anything about this on the internet? I would be interested to hear more.
Zubair
@Mark Byers, that is of course an issue in the general case; however, I had a very limited set of possible value combinations (vastly outnumbered by the range of the hash function). @Zubair, actually I don't. I'd like to find something about the technique, though.
aioobe
@aioobe: Why did you choose this hash-based solution instead of an auto-increment PK and a unique index on the remaining fields?
Mark Byers
I don't remember the exact context, but I suppose it at least saves me a SELECT, since I can blindly insert a row and be sure that the hash value exists as a primary key afterwards. (Perhaps it's not as smart as I thought it was ;)
aioobe
Allow me to rephrase: what does Git do in case of hash collisions?
aioobe
The more I think about it, the more my suggestion fits as an answer to the question. It avoids duplication by relying on hash-codes.
aioobe
Yes, interesting. Do you know if others have tried this technique too?
Zubair
No, I don't. I've googled without success. If you find someone, please let me know :)
aioobe