I am thinking of implementing a file search program using indexing on Linux. I know there are several other file search programs, like beagled, but I am doing this for study purposes. I am stuck on how to do the indexing. I have the following idea, which I took from the maemo-mapper application: for example, a file named "suresh" is indexed in the filesystem as

/home/$USERNAME/.file_search_index/s/u/r/e/s/h/list.txt

This list.txt contains the locations of all files named "suresh". Please suggest a better idea/algorithm for implementing this. And if there is any material on various file search techniques, please post it.
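A minimal sketch of this per-character index in Python (the index root is as described above; note that characters like '.' or '/' in a name would need escaping before being used as directory components, which is glossed over here):

    import os

    INDEX_ROOT = os.path.expanduser("~/.file_search_index")

    def index_file(path):
        # "suresh" -> ~/.file_search_index/s/u/r/e/s/h/list.txt
        # NOTE: characters such as '.' or '/' in a name would need
        # escaping before being used as directory components.
        name = os.path.basename(path)
        bucket = os.path.join(INDEX_ROOT, *name)
        os.makedirs(bucket, exist_ok=True)
        with open(os.path.join(bucket, "list.txt"), "a") as f:
            f.write(path + "\n")

    def lookup(name):
        # one path to build, one file to read
        listing = os.path.join(INDEX_ROOT, *name, "list.txt")
        try:
            with open(listing) as f:
                return f.read().splitlines()
        except FileNotFoundError:
            return []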

+2  A: 

You haven't seen the locate command that comes with findutils? Like beagled, it's free software, so you can study the code.

The findutils package is always looking for contributors.

Information on the database format is at http://www.gnu.org/software/findutils/manual/html_node/find_html/Database-Formats.html
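As a taste of what that documentation describes: locate databases store the sorted paths front-compressed, so each entry records only how much of the previous path it shares. A toy sketch of the idea (a simplification, not the exact on-disk layout):

    import os

    def compress(paths):
        # each entry: (length of prefix shared with previous path, suffix)
        out, prev = [], ""
        for p in sorted(paths):
            shared = len(os.path.commonprefix([prev, p]))
            out.append((shared, p[shared:]))
            prev = p
        return out

    def decompress(entries):
        paths, prev = [], ""
        for shared, suffix in entries:
            prev = prev[:shared] + suffix
            paths.append(prev)
        return paths

Consecutive sorted paths share long prefixes (/usr/lib/..., /usr/share/...), which is what keeps the database compact enough to scan linearly.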

ashawley
There's also http://slocate.trakker.ca/ and http://carolina.mff.cuni.cz/~trmac/blog/mlocate/ (though the GNU Findutils locate may be the most widely-installed)
ephemient
Hi, do you know how the internals of locate work, or do you have any doc/link on it? I am asking because it would save me the time of reading through the code of locate and updatedb.
suresh
Yes, there is documentation. Added.
ashawley
+1  A: 

Beagle uses a very interesting approach with inotify. It starts, establishes a watch on the parent directory, and starts another thread that does a recursive scan. As more directories are accessed, the parent sees them and adds more watches, while still watching what it already knows about.

So, once it has started, you're watching an entire tree quite cheaply (one watch per directory) and have the whole thing indexed. This also helps ensure that no files are 'missed' during the scan.

So that's most of your battle: typically, FS search programs hit their sluggish point while indexing, 'updatedb' for instance.
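A single-threaded sketch of that watch-while-scanning approach, assuming the third-party pyinotify package (Beagle does the initial scan in a separate thread, and the root path here is illustrative):

    import os
    import pyinotify  # third-party package, assumed available

    ROOT = os.path.expanduser("~")
    MASK = pyinotify.IN_CREATE | pyinotify.IN_MOVED_TO | pyinotify.IN_DELETE

    wm = pyinotify.WatchManager()

    class Indexer(pyinotify.ProcessEvent):
        def process_default(self, event):
            if event.dir and event.mask & (pyinotify.IN_CREATE | pyinotify.IN_MOVED_TO):
                # a directory appeared after the scan started: watch it too
                wm.add_watch(event.pathname, MASK)
            print("update index for", event.pathname)

    notifier = pyinotify.Notifier(wm, Indexer())

    # initial scan: one watch per directory, index every file seen
    for dirpath, dirnames, filenames in os.walk(ROOT):
        wm.add_watch(dirpath, MASK)
        for name in filenames:
            print("index", os.path.join(dirpath, name))

    notifier.loop()  # from here on, events keep the index current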

As for storing the index, I would not favor splitting it up into directories. You'd be, in essence, calling stat() for each character in a file name: some-very-long-shared-object-name.so.0, for instance, would mean one call to stat() for every character in the name. You might try a well-designed SQLite3 database instead.
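For instance, a minimal sketch using Python's built-in sqlite3 module (the schema is illustrative): one indexed B-tree lookup replaces the per-character directory walk.

    import os
    import sqlite3

    db = sqlite3.connect(os.path.expanduser("~/.file_search_index.db"))
    db.execute("CREATE TABLE IF NOT EXISTS files (name TEXT, path TEXT)")
    db.execute("CREATE INDEX IF NOT EXISTS files_name ON files (name)")

    def index_file(path):
        db.execute("INSERT INTO files VALUES (?, ?)",
                   (os.path.basename(path), path))
        db.commit()

    def lookup(name):
        return [path for (path,) in
                db.execute("SELECT path FROM files WHERE name = ?", (name,))]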

I'm working on something very similar: a program that provides a slightly cheaper means of auditing for PCI certification (credit card processing), without using kernel auditing hooks.

Tim Post
Why not use directory splitting? I can find all the files in the index by looking at a single file, so retrieval is O(n) in the name length: just the time the OS filesystem takes to retrieve one file. I will use inotify to see whether any changes are made to files in the filesystem.
suresh
It's the eventual (and inevitable) calls to stat() that make me want to avoid that approach. Each lookup would be as expensive as the file name is long: even opening that one deep path makes the kernel resolve a directory component per character.
Tim Post
I don't want to do stat() on every folder. I will form the string "/home/$USERNAME/.file_search_index/s/u/r/e/s/h/list.txt", do stat() only on that list.txt, and then open that file to list the results of the search.
suresh
This is only for file-name search, not for keywords inside the files.
suresh