Many file storage systems use hashes to avoid storing the same file content more than once (among other reasons), e.g., Git identifies content by SHA-1 and Dropbox by SHA-256. The file names and dates can differ, but as long as the content hashes to the same value, it never gets stored more than once.

It seems this would be a sensible thing to do in an OS file system in order to save space. Are there any file systems for Windows or *nix that do this, or is there a good reason why none of them do?

This would, for the most part, eliminate the need for duplicate-file-finder utilities, because at that point the only space left to reclaim would be the file entries in the file system, which for most users is too little to matter.
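A minimal sketch of the idea, assuming a toy content-addressable store (the `CasStore` class and its layout are invented for illustration, not any real filesystem's API): files whose bytes hash to the same SHA-256 digest share a single stored blob, and only the name-to-hash mapping differs.

    import hashlib
    import os
    import tempfile


    class CasStore:
        """Toy content-addressable store: one blob per unique SHA-256 digest."""

        def __init__(self, root):
            self.root = root
            os.makedirs(os.path.join(root, "blobs"), exist_ok=True)
            self.names = {}  # file name -> content hash

        def put(self, name, data: bytes):
            digest = hashlib.sha256(data).hexdigest()
            blob_path = os.path.join(self.root, "blobs", digest)
            if not os.path.exists(blob_path):      # store the content only once
                with open(blob_path, "wb") as f:
                    f.write(data)
            self.names[name] = digest              # names and dates can differ freely
            return digest

        def get(self, name) -> bytes:
            with open(os.path.join(self.root, "blobs", self.names[name]), "rb") as f:
                return f.read()


    store = CasStore(tempfile.mkdtemp())
    a = store.put("report-2009.txt", b"same bytes")
    b = store.put("copy of report.txt", b"same bytes")
    assert a == b   # identical content -> one blob on disk, two directory entries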

Edit: Arguably this could go on serverfault, but I feel developers are more likely to understand the issues and trade-offs involved.

+2  A: 

It would save space, but the time cost is prohibitive. The products you mention are already I/O-bound, so the computational cost of hashing is not a bottleneck for them. If you hashed at the filesystem level, all I/O operations, which are already slow, would get worse.

Matt
...my point about performance.
jldupont
But you wouldn't need to hash all files, only ones that had the exact same size as another file...
RedFilter
But how would you know that there is another file of the same size? Would you store that as an index in the filesystem table? Then adding and updating files becomes expensive in order to support searching for same-size files. Technically you certainly could hash files and try to detect duplicates, but since I/O is already the rate limiter for so many operations, I'm not sure you could do anything that would be performant enough and still be 100 percent accurate.
Matt
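A rough sketch of the size-first idea discussed in the two comments above, framed as a user-space duplicate finder rather than anything a real filesystem does (the function name and chunk size are invented for illustration): group files by size from metadata, which is cheap, and only hash the groups with more than one member, so most files are never read at all.

    import hashlib
    import os
    from collections import defaultdict


    def find_duplicates(top):
        """Group candidate files by size first; hash only size collisions."""
        by_size = defaultdict(list)
        for dirpath, _dirs, files in os.walk(top):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    by_size[os.path.getsize(path)].append(path)
                except OSError:
                    pass  # skip files that vanish or are unreadable

        by_hash = defaultdict(list)
        for size, paths in by_size.items():
            if len(paths) < 2:
                continue                    # unique size -> cannot be a duplicate
            for path in paths:
                h = hashlib.sha256()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)     # stream in 1 MiB chunks; still I/O-bound
                by_hash[(size, h.hexdigest())].append(path)

        return [paths for paths in by_hash.values() if len(paths) > 1]


    for group in find_duplicates("."):
        print(group)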
Interesting - just read Sun's blog post (thanks FR) - it seems that the claim is that performance ends up as a trade-off because of the disk writes you save, which I hadn't thought of. The storage of hashes is still an issue, but the assumption is that the hash table fits into memory, which is probably true, assuming that a machine with lots of storage will also have lots of memory.
Matt
Sorry for the typos - typing on my phone you know :)
Matt
+6  A: 

ZFS has supported deduplication since last month: http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup

Though I wouldn't call this a "common" filesystem (afaik, it is currently only supported by *BSD), it is definitely one worth looking at.

FRotthowe
It's also supported by Solaris...
prestomation
I plan to build a fileserver and Solaris is my choice exactly because of ZFS.
liori
Thanks for the info!
RedFilter
A: 

It would require a fair amount of work to make this work in a file system. First of all, a user might copy a file planning to edit one copy while the other remains intact; so when you eliminate the duplication, the hidden hard link you create would have to provide copy-on-write (COW) semantics.
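A hedged sketch of what that COW requirement means, using an invented in-memory store rather than any real filesystem's machinery: two names can share one deduplicated blob, but the moment one of them is written to, the share is broken so the other name still sees the original bytes.

    import hashlib


    class DedupStore:
        """Toy deduplicated store with copy-on-write on modification."""

        def __init__(self):
            self.blobs = {}     # digest -> bytes
            self.refs = {}      # digest -> reference count
            self.names = {}     # file name -> digest

        def put(self, name, data: bytes):
            digest = hashlib.sha256(data).hexdigest()
            if digest not in self.blobs:
                self.blobs[digest] = data
            self.refs[digest] = self.refs.get(digest, 0) + 1
            self.names[name] = digest

        def write(self, name, data: bytes):
            old = self.names[name]
            self.refs[old] -= 1             # break the share before modifying
            if self.refs[old] == 0:
                del self.blobs[old]         # last reference gone, reclaim
            self.put(name, data)            # the new content gets its own blob

        def read(self, name) -> bytes:
            return self.blobs[self.names[name]]


    s = DedupStore()
    s.put("a.txt", b"shared content")
    s.put("b.txt", b"shared content")       # deduplicated: one blob, two names
    s.write("a.txt", b"edited content")     # COW: b.txt is unaffected
    assert s.read("b.txt") == b"shared content"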

Second, the permissions on a file are often based on the directory into which that file's name is placed. You'd have to ensure that when you create your hidden hard link, the permissions are applied based on the link, not just on the location of the actual content.

Third, users are likely to be upset if they make (say) three copies of a file on physically separate media to ensure against data loss from hardware failure, then find out that there was really only one copy of the file, so when that hardware failed, all three copies disappeared.

This strikes me as a bit like a second-system effect -- a solution to a problem long after the problem ceased to exist (or at least to matter). With hard drives currently running less than US$100 per terabyte (so a dollar buys roughly 10 GB), I find it hard to believe that this would save most people a whole dollar's worth of hard drive space. At that point, it's hard to imagine most people caring much.

Jerry Coffin
Interesting, I was not aware of COW. The second point does not seem a concern, since you would ignore the location of the content; all permissions would be based on the link. Re the third point, storing a single copy only makes sense on the same physical drive; as soon as there is a new disk (as far as the OS can tell, anyway), a duplicate copy of the content would be needed.
RedFilter
Re your last point, it's all a matter of file size, which keeps growing. It would be nice to have the option, especially as what a filesystem holds may soon span the cloud, where trading computation for transfer time would be worth it (a la Dropbox).
RedFilter
+1  A: 

NTFS has single instance storage.

blowdart
Thanks. From here, http://blogs.techrepublic.com.com/datacenter/?p=266, I found: "Single Instance Storage will also be included in Windows Server 2008, but only in the Storage edition. The feature will not be made available in other editions." It is also implemented in Exchange.
RedFilter
In the next version of Exchange it's gone. However it's implemented in Windows Home Server as well, which is Win2003 underneath.
blowdart
+2  A: 

NetApp has supported deduplication (that's what it's called in the storage industry) in the WAFL filesystem (yeah, not your common filesystem) for a few years now. It is one of the most important features in enterprise filesystems today, and NetApp stands out because it supports deduplication on primary storage, whereas similar products support it only on their backup or secondary storage (they are too slow for primary storage).

The amount of duplicate data in a large enterprise with thousands of users is staggering. Many of those users store the same documents, source code, etc. in their home directories. Deduplication rates of 50-70% are often reported, saving large enterprises a lot of space and a lot of money.

All of this means that if you create any common filesystem on a LUN exported by a NetApp filer, you get deduplication for free, no matter what filesystem you create in that LUN. Cheers. Find out how it works here and here.
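A rough sketch of why the filesystem on top of the LUN doesn't need to know: block-level deduplication happens below it, at the virtual block device, by hashing fixed-size blocks and mapping identical ones to one physical copy. The block size and data structures here are invented for illustration and are not WAFL's actual format.

    import hashlib

    BLOCK_SIZE = 4096  # illustrative block size, not WAFL's actual layout


    class DedupBlockDevice:
        """Toy block device that stores each unique block's content only once."""

        def __init__(self):
            self.physical = {}   # digest -> block bytes (one copy per unique block)
            self.mapping = {}    # logical block number -> digest

        def write_block(self, lbn, data: bytes):
            assert len(data) == BLOCK_SIZE
            digest = hashlib.sha256(data).hexdigest()
            self.physical.setdefault(digest, data)   # duplicate blocks share storage
            self.mapping[lbn] = digest

        def read_block(self, lbn) -> bytes:
            return self.physical[self.mapping[lbn]]


    dev = DedupBlockDevice()
    block = b"x" * BLOCK_SIZE
    dev.write_block(0, block)
    dev.write_block(1, block)                # same bytes at a different logical address
    assert len(dev.physical) == 1            # only one physical copy kept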

Sudhanshu