I have a couple of identical files stored in more than one place on my hard disk. I figure I can save a lot of disk space by hard-linking them to point to the same file. I am a little worried about possibly disastrous side effects.

I guess it does not affect permissions, as those are stored in the respective directories, just like the file name, right? (Update: Apparently I guessed wrong. Permissions are shared, as Carl demonstrates in his answer.)

My biggest concern is that a change to one file inadvertently changes the other files too. Read-only files should be safe, then. Files that can change are also okay if the application writes a new file rather than updating the existing one in place. I believe most applications work that way, but probably not all.

Is there anything else to consider?

I am on OS X / HFS+.

+1  A: 

Don't use hard links if you want changes to one file not to be reflected in the other files. That's the whole point of hard links: multiple directory entries for the same file (same blocks on disk). Changing permissions through one of the names changes them for every name:

$ touch file
$ ln file link
$ ls -l
total 0
-rw-r--r--  2 owner group  0 Nov 11 16:44 file
-rw-r--r--  2 owner group  0 Nov 11 16:44 link
$ chmod 444 file
$ ls -l
total 0
-r--r--r--  2 owner group  0 Nov 11 16:44 file
-r--r--r--  2 owner group  0 Nov 11 16:44 link

From the ln man page:

A hard link to a file is indistinguishable from the original directory entry; any changes to a file are effectively independent of the name used to reference the file.
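The same check can be scripted. This is only a minimal sketch, assuming a Unix-like filesystem that supports hard links; the helper names link_info and demo_shared_metadata are made up for illustration. It creates a file and a hard link, changes the mode through one name, and compares the inode number, link count and permission bits seen through both names.

```python
import os
import stat
import tempfile


def link_info(path):
    """Return (inode number, link count, permission bits) for path."""
    st = os.stat(path)
    return st.st_ino, st.st_nlink, stat.S_IMODE(st.st_mode)


def demo_shared_metadata(directory):
    """Create 'file' plus a hard link 'link', chmod one name, report both."""
    original = os.path.join(directory, "file")
    link = os.path.join(directory, "link")
    with open(original, "w"):
        pass  # same effect as `touch file`
    os.link(original, link)    # same effect as `ln file link`
    os.chmod(original, 0o444)  # change permissions through one name only
    return link_info(original), link_info(link)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        via_file, via_link = demo_shared_metadata(tmp)
        print("same inode and mode through both names:", via_file == via_link)
```

Both names report identical values because the mode lives in the inode, not in the directory entry.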

Carl Norum
That was the main part of my question: Do applications really update files? Or just rewrite them? And are they generally hard-link aware, and will rewrite rather than update them, if there is more than one link to them?
Thilo
That's completely application dependent, but I expect it's almost the complete opposite of what you think is happening. I would guess 99% or more of all applications modify existing files rather than deleting them and creating new ones.
Carl Norum
Maybe 99% is too much. But I certainly wouldn't count on a deletion/recreation in the general case.
Carl Norum
@Carl: Yes, I am mostly thinking about read-only files. Primarily, I want to dedupe Time Machine backups. The fact that the permissions are shared (as you have shown above) worries me a little. Why is that, by the way? Where are the permissions stored? I thought in the directory.
Thilo
They're stored in the inode. This link has more: http://docstore.mik.ua/orelly/networking/puis/ch05_01.htm
Carl Norum
Doesn't Time Machine already use hard links where possible to avoid duplicating file blocks?
mipadi
Is that a legitimate online version of O'Reilly books?
Thilo
@mipadi: TM uses hard links only if the same file (path) has not changed from the previous version. It does not work if you just happen to store the same content in two different locations. It also does not work across machines (if you back up two machines to the backup disk).
Thilo
@Thilo, I have no idea. I just googled it.
Carl Norum
A: 

Hard links are not generally a best practice. Plain old soft/symbolic links (ln -s) should serve just as well.

bmargulies
I figure that for files that can change, soft links are even worse, because then the change is reflected in all copies. Also, if the target of a soft link gets deleted, the data is lost (that does not happen with a hard link).
Thilo
A: 

I wrote a little script to do just this. I'd only be concerned about permissions if your backup was spanning multiple users or system files.

I had a bunch of old backups on CDs and DVDs, many of which had a lot of redundant data on them. Rather than sift through all that info and delete the duplicates, I took the Time Machine route and made hard links between all the matching files (truly matching content: I took a SHA1 checksum of them all).

Now all my backup volumes look just like they would otherwise and most of the redundant files are history. The one hiccup is that a lot of media files store metadata in the file contents, so each version is slightly different. See this article for the Python code. No Warranties!!!
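For readers who just want the general shape of the approach, here is a rough sketch. It is not the script from the article; the names sha1_of and dedupe and the ".dedupe-tmp" suffix are made up for illustration. It assumes a filesystem with hard link support, and it assumes you are fine with duplicates taking on the permissions, owner and timestamps of the copy that is kept, since those live in the shared inode. Try it on a throwaway copy first.

```python
import hashlib
import os


def sha1_of(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedupe(root):
    """Replace duplicate regular files under root with hard links.

    Returns the number of files that were replaced by a link.
    """
    first_seen = {}  # (device, size, sha1) -> path of the copy we keep
    replaced = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            st = os.stat(path)
            # Hard links cannot cross devices, so the device is part of the key.
            key = (st.st_dev, st.st_size, sha1_of(path))
            keeper = first_seen.setdefault(key, path)
            if keeper == path or os.path.samefile(keeper, path):
                continue  # first copy, or already linked to it
            # Link under a temporary name, then rename over the duplicate,
            # so the duplicate's name never disappears halfway through.
            tmp = path + ".dedupe-tmp"
            os.link(keeper, tmp)
            os.replace(tmp, path)
            replaced += 1
    return replaced
```

Calling dedupe("your_backup_dir") returns how many files were turned into links. Running it a second time should find nothing left to do, because the remaining matches already share an inode.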

Make sure you run mdimport your_backup_dir/ afterwards: Spotlight and Finder get a bit flustered when you do massive data manipulations. I've de-duplicated my 240 GB backup folder in this manner and it took about 45 minutes.

Also note that most OS X apps will break your hard links and save to a new inode, while most UNIX-y apps will probably preserve the hard links (except emacs, I hear).
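The difference between the two saving styles is easy to see in a small experiment. This is only an illustrative sketch, not how any particular application is implemented; save_in_place, save_via_rename and read are made-up names. Saving in place rewrites the shared inode, so every link sees the change. Saving to a new file and renaming it over the old name leaves the other links pointing at the old content.

```python
import os
import tempfile


def save_in_place(path, text):
    """Overwrite the existing inode; every hard link sees the new content."""
    with open(path, "w") as handle:
        handle.write(text)


def save_via_rename(path, text):
    """Write a new file and rename it over path; other links keep the old data."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.replace(tmp, path)  # path now refers to a brand-new inode


def read(path):
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "file")
        link = os.path.join(tmp, "link")
        save_in_place(original, "v1")
        os.link(original, link)

        save_in_place(original, "v2")
        print("after in-place save, link reads:", read(link))

        save_via_rename(original, "v3")
        print("after rename save, link reads:", read(link))
        print("link count of link:", os.stat(link).st_nlink)
```

After the rename-style save the two names no longer share an inode, so the space saving for that file is gone, but the other copy is protected from the edit.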

andyvanee