tags:
views: 268
answers: 6

How do I find a loop in the file system on Linux? I am indexing all files to make search fast (O(1)). I am implementing this in C using the library functions in dir.h. I can scan through the entire file system, but the scan goes into a loop if there is a loop in the file system (for example, a loop mount). How can I detect such a loop? I have seen the updatedb command report when there is a loop in the file system, but I don't understand the logic. Can anyone please help me find a solution for this?

A: 

This is more typically referred to as a "cycle," so what you want to implement is "cycle detection." There are many ways to do this; I don't know if this is homework or not, but one simple (though not necessarily optimal) method is pointer chasing.

BobbyShaftoe
+5  A: 

The general way to prevent re-scanning nodes in a graph is to mark nodes as you pass them, and then ignore nodes that are marked. This isn't terribly practical if you don't want to modify the graph you are scanning, so you need a way to mark the nodes externally. The easiest way I can think of to do this under Linux would be to store the device/inode pair for each directory you visit. Then, when you look at a directory, first check that you haven't already seen a directory with the same device/inode. This not only handles cycles, but also trees that merge back into each other.

To get the device/inode number take a look at the stat/fstat functions and the st_dev and st_ino members of the stat struct.

For storing the data, you'll probably want to use a hash table or a binary tree.

Eclipse
+1  A: 
1800 INFORMATION
In directed graphs, not in DAGs. It is quite obvious that there are no cycles in directed acyclic graphs :)
Ben
Ah you found my deliberate mistake which I put in there to test you
1800 INFORMATION
+1  A: 

Perhaps I'm being a bit dim here, but aren't these the two ways you could create a cycle:

  • by creating a symbolic link
  • by mounting something twice

To deal with these, you can get a list of the mounts before you start indexing and ignore all but the first mount of the same filesystem, and you can ignore symbolic links as you come across them in the indexing process.

anon
+1  A: 

Simple way: just do a depth-first walk of the directory tree, keeping a stack of the nodes above you as you go. At each node you visit, if that node is already on the stack, you have a cycle.

// here's a stack of the nodes above the current one
node stack[1000];

walk(node, level) {
    if (node in stack[0..level-1]) then there is a cycle
    else {
        stack[level] = node
        for each subnode x of node
            walk(x, level+1)
    }
}
Mike Dunlavey
This is not efficient... I am looking for an efficient algorithm
suresh
@suresh: It's only inefficient when there are shared subtrees. If that is too much of a problem, then you need to mark the nodes as "visited".
Mike Dunlavey
+1  A: 

As others have said, there is no such thing as a loop in a file system if you realize that the path is part of a file name, unless it's a cyclic symbolic link.

For instance, if you bootstrap some distro (let's say Debian) to a loop device, or even to a directory, and do this ON a Debian machine, you have now duplicated a lot of things.

For instance, let's say you are running Debian Lenny and bootstrap a minimal copy of it to /lenny.

/lenny/usr/* is going to be the same as /usr/*. There is no 'cheap' way of avoiding this.

Since you're already calling stat() on each node (I'm assuming you're using ftw() / ftw64()), you may as well:

  • Have ftw()'s callback insert the name of the node into an array whose structure members can store a hash of the file that is unlikely to collide; MD5 isn't going to cut it for this.
  • Update a hash table based on that digest and the file name (not path).

That's in no danger of speeding up your scan, but it will significantly cut down on the time it takes to search.

If you use threads correctly and set affinity, the hashing and indexing can happen on one core while the other is I/O-bound (when multiple cores are available).

However, 'just' checking for duplicate mounts is not going to be a cure-all; besides, I'm sure your program would want to return the locations of all files named 'foo', even if there are four identical copies to mention.

Tim Post