I opened a file stream to a very big file using fopen(). Before performing any read operation on that stream, I deleted the file using unlink(). Even so, I was able to read the whole file.

My guess is that there is a buffer associated with the stream which holds the data of the file. But obviously that buffer has a limit, which is why I chose a_big_file, whose size is 551126688 bytes (about 526MB).

I want to know the exact reason for this behaviour. Here is the test code that I used.

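#include <stdio.h>
#include <unistd.h>

int main(void){

    FILE *fp;
    long long int file_size = 0;
    size_t bytes_read = 0;   /* fread() returns a size_t, not an int */
    char buf[1];

    fp = fopen("a_big_file", "r");
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }

    /* Remove the name; the stream opened above is still usable. */
    unlink("a_big_file");

    /* Read one byte at a time until fread() reports 0 (EOF or error). */
    while(0 != (bytes_read = fread(buf, 1, 1, fp))){
        file_size += (long long int)bytes_read;
    }

    /* file_size is signed, so print it with %lld rather than %llu. */
    printf("file_size is %lld\n", file_size);

    return 0;
}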

Output: file_size is 551126688

+17  A: 

In Unix and Unix-like operating systems, the file doesn't actually go away until the last open file handle on it is closed. This is a very useful trick for temporary files - if you unlink it as soon as you open it, the file won't be visible to other processes, and it will be removed from the system as soon as your program closes it, ends or crashes. That helps prevent the proliferation of orphan temp files.

Practically (glossing over some technical details here) what happens is that Unix file systems are reference counted. When you open the file, you actually get connected to the file's inode (which is the real indication of where the actual content of the file lives). But unlinking the file just removes the directory entry, so the file doesn't have a name any more. The file system will only reclaim the file's space (i.e. the inode and the data blocks it points to) once it isn't in any directory entry AND nobody has it open. Other processes can't open it in the ordinary manner because they can't map a file name to the inode.
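To make the reference counting a bit more concrete, here is a minimal sketch that asks the kernel for the link count of a file through a descriptor that is still open after the name has been removed. The function name `links_after_unlink` and the `/tmp/unlink_demo_XXXXXX` template are made up for illustration, and the value reported may differ on network or other unusual file systems.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Creates a scratch file, removes its only name while the descriptor is
 * still open, and returns the link count that fstat() reports afterwards.
 * Returns -1 if any step fails. The path template is a hypothetical name. */
long links_after_unlink(void)
{
    char path[] = "/tmp/unlink_demo_XXXXXX";
    int fd = mkstemp(path);          /* creates and opens the file */
    if (fd == -1)
        return -1;

    struct stat st;
    long links = -1;

    /* unlink() drops the directory entry; fstat() works on the open
     * descriptor, so it does not need a name to find the inode. */
    if (unlink(path) == 0 && fstat(fd, &st) == 0)
        links = (long)st.st_nlink;

    close(fd);                       /* the last reference goes away here */
    return links;
}
```

On a typical local file system the count comes back as 0: no directory entry refers to the inode any more, yet the inode is still alive because the descriptor is open.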

Note that Unix file systems allow multiple directory entries to point to the same inode - we call that a "hard link". If you do an "ls -l", one of the fields is the count of hard links to that same inode, and if you do an "ls -li", you can see the actual inode number.
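The same two fields that "ls -li" prints can be read from a program with stat(). The sketch below creates a file, adds a second name with link(), and checks that both names report the same inode and a link count of 2. The function name `hard_link_shares_inode` and the scratch paths are hypothetical, and it assumes /tmp allows hard links.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if a second name created with link() refers to the same inode
 * as the first and the link count is 2, 0 if not, -1 on error.
 * Both path names are made up for this example. */
int hard_link_shares_inode(void)
{
    char path[] = "/tmp/hardlink_demo_XXXXXX";
    char alias[sizeof path + 8];

    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    close(fd);

    /* Second directory entry for the same inode. */
    snprintf(alias, sizeof alias, "%s.link", path);
    if (link(path, alias) == -1) {
        unlink(path);
        return -1;
    }

    struct stat a, b;
    int result = -1;
    if (stat(path, &a) == 0 && stat(alias, &b) == 0)
        result = (a.st_dev == b.st_dev &&
                  a.st_ino == b.st_ino &&
                  a.st_nlink == 2 && b.st_nlink == 2);

    /* Clean up both names; the inode is freed once both are gone. */
    unlink(alias);
    unlink(path);
    return result;
}
```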

Paul Tomblin
Ok that's a quick answer :) , but I am interested in knowing what exactly happens inside the system. I am interested in knowing why can't other processes see the file. I mean what's going on internally, can you shed some light :)
Aman Jain
Because there is no longer a filename pointing to the inode. So if you already have a filehandle you can still use it (in any process), but you can't get a new one anymore.
puetzk
Now as per my understanding, removing an entry from filesystem means removing an inode. Does this mean that the location pointed to by this inode contains inconsistent data as it may be written by any new file(created after unlink)?
Aman Jain
Moreover, according to this logic, file_size will be shown correctly, but the data read may or may not be correct, depending on whether any new file's inode points to those locations.
Aman Jain
@Aman, as I stated above, the inode isn't recycled by the OS until there are no open file handles AND there are no directory entries. Nothing will be able to write to that file space unless it had it open before you unlinked it.
Paul Tomblin
There is absolutely no way that the file data would be corrupt, unless there was hard link making another directory entry on the same inode, or another process had the file open before you unlinked it.
Paul Tomblin
Please see my linux-specific answer for accessing a file's data even if there are no longer any file names associated with the inode, provided at least one process still has the file open: http://stackoverflow.com/questions/507109/is-fread-possible-after-a-file-is-removed/507301#507301
Chris Young
Can you refer me to a good book that can clear up my concepts regarding inodes, filesystems etc.?
Aman Jain
http://en.wikipedia.org/wiki/Inode is a good overview.
Paul Tomblin
+9  A: 

From the man page for unlink:

unlink() deletes a name from the filesystem. If that name was the last link to a file and no processes have the file open the file is deleted and the space it was using is made available for reuse.

If the name was the last link to a file but any processes still have the file open the file will remain in existence until the last file descriptor referring to it is closed.

The second paragraph explains the behaviour. :-)

[edit] BTW you should really close the file with fclose() before the return statement... [/edit]

Tooony
What would happen if I don't do a fclose(fp) before return? When the process ends, fp has no significance as the address space of the process gets freed up. Correct me if I am wrong.
Aman Jain
1) If you were writing, instead of reading, the last bits of buffered output could be lost. 2) If you were a thread instead of a process, the file handle would remain open. 3) It's bad style because it makes the reader stop and think through whether (1) or (2) apply, or whether it's harmless.
puetzk
The OS doesn't care about your silly buffered file IO anyways -- all that matters to it is that you had an open file descriptor, and then you died, so it can close the fd (and unlink the file).
ephemient
I agree with puetzk. On top of that, it is not defined what will happen with the open file handle when the process ends and its resources are returned to the OS. It is largely up to the OS what to do.
Tooony
Contd... Good practice is to show that you are in control of the resources inside your code. An fclose() does, if nothing else, show where you no longer want to use the open file. Also it avoids unwanted behaviour like that shown in the original question. :-D
Tooony
A: 

On Linux, a file is really removed only when its last open handle is closed.

People usually use temporary files this way: mkstemp(3) followed by immediate unlink(2). This way only you can access the file's data, no other process can.
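A minimal sketch of that pattern, assuming a writable /tmp: the file is created with mkstemp(3), its name is removed straight away with unlink(2), and the data is then written and read back purely through the descriptor. The function name `roundtrip_via_anonymous_temp` and the path template are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Writes msg to a temp file that has already lost its name, then reads it
 * back into out (at most outlen bytes). Returns the number of bytes read,
 * or -1 on error. The path template is a hypothetical scratch name. */
ssize_t roundtrip_via_anonymous_temp(const char *msg, char *out, size_t outlen)
{
    char path[] = "/tmp/anon_demo_XXXXXX";
    int fd = mkstemp(path);          /* created with mode 0600 */
    if (fd == -1)
        return -1;

    /* From here on no other process can find the file by name. */
    unlink(path);

    size_t len = strlen(msg);
    ssize_t n = -1;
    if (write(fd, msg, len) == (ssize_t)len &&
        lseek(fd, 0, SEEK_SET) == 0)
        n = read(fd, out, outlen);

    /* Closing the descriptor releases the storage; nothing is left behind
     * even if the program had crashed before this point. */
    close(fd);
    return n;
}
```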

Even if another process creates another file with the same name, the new file will have nothing in common with the original file.

Quassnoi
+3  A: 

On some systems, such as Linux, you can still easily access files that have no name on the filesystem, as long as a process still has them open. There's a list of each process's open file descriptors in

/proc/<pid>/fd

Edit: As per Paul Tomblin's comment, you can only access this directory if you are the same user as the process or root.

For example:

# Create a file with cat
chris@shrubbery:~$ cat > MYFILE
Hello

# Suspend the process and find its pid
[1]+  Stopped                 cat > MYFILE
chris@shrubbery:~$ ps waux | grep cat
chris     1311  0.0  0.0   5088   668 pts/6    T    14:29   0:00 cat
chris     1313  0.0  0.0   5168   840 pts/6    R+   14:29   0:00 grep cat

# Inspect the list of open files
chris@shrubbery:~$ cd /proc/1311/fd
chris@shrubbery:/proc/1311/fd$ ls -l
total 0
lrwx------ 1 chris chris 64 2009-02-03 14:29 0 -> /dev/pts/6
l-wx------ 1 chris chris 64 2009-02-03 14:29 1 -> /home/chris/MYFILE
lrwx------ 1 chris chris 64 2009-02-03 14:29 2 -> /dev/pts/6

# View MYFILE from the symlink on the /proc pseudofilesystem.
chris@shrubbery:/proc/1311/fd$ cat 1
Hello

# Delete the filename /home/chris/MYFILE
chris@shrubbery:/proc/1311/fd$ rm /home/chris/MYFILE
chris@shrubbery:/proc/1311/fd$ cat /home/chris/MYFILE
cat: /home/chris/MYFILE: No such file or directory

# But the process still has it open. 
# The /proc system knows the original name was deleted
chris@shrubbery:/proc/1311/fd$ ls -l
total 0
lrwx------ 1 chris chris 64 2009-02-03 14:29 0 -> /dev/pts/6
l-wx------ 1 chris chris 64 2009-02-03 14:29 1 -> /home/chris/MYFILE (deleted)
lrwx------ 1 chris chris 64 2009-02-03 14:29 2 -> /dev/pts/6

# We can still view the file, useful for debugging.
chris@shrubbery:/proc/1311/fd$ cat 1
Hello
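
The same thing can be done from inside a program. This is only a sketch and is Linux-specific, since it relies on /proc being mounted: it writes to a temp file, deletes the name, and then opens the file a second time through /proc/self/fd/<n>. The function name `reopen_deleted_via_proc` and the scratch path are made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Writes msg to a temp file, removes its name, then reopens the file via
 * the /proc/self/fd entry and reads the data into out (at most outlen
 * bytes). Returns the number of bytes read, or -1 on error.
 * Linux-specific; the path template is a hypothetical scratch name. */
ssize_t reopen_deleted_via_proc(const char *msg, char *out, size_t outlen)
{
    char path[] = "/tmp/proc_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;

    size_t len = strlen(msg);
    int wrote = (write(fd, msg, len) == (ssize_t)len);
    unlink(path);                    /* the name is gone either way */
    if (!wrote) {
        close(fd);
        return -1;
    }

    /* Build the /proc path for our own still-open descriptor. */
    char procpath[64];
    snprintf(procpath, sizeof procpath, "/proc/self/fd/%d", fd);

    /* A fresh open() gets its own file offset, starting at 0. */
    ssize_t n = -1;
    int fd2 = open(procpath, O_RDONLY);
    if (fd2 != -1) {
        n = read(fd2, out, outlen);
        close(fd2);
    }

    close(fd);
    return n;
}
```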
Chris Young
What permissions do you need to access this data? I would think you'd need to be either the same user as started the process or root?
Paul Tomblin
Yes, that is correct, I will add that to my answer, thanks.
Chris Young