tags:

views:

130

answers:

4

I'm looking to apply a KMP (or similar) search to a large file (> 4GB).

I'm expecting this to give me problems, though. I can't copy the whole file into memory because there isn't enough space.

My question is, what is the best way to go about doing this search? Should I simply create a FILE* and search directly in the file, copy blocks (say 4k) into memory and search those, or something else entirely?

+1  A: 

Searching directly in the file would be very slow; using buffering will give much better performance. Note that your buffer has to be larger than the pattern you are searching for (SearchLength), of course, and you have to refill the buffer once you are SearchLength bytes before its end.

schnaader
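A minimal sketch of that overlapping-buffer approach (the function and parameter names are my own, and the inner scan is naive rather than KMP to keep it short; KMP would slot in the same way):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Count occurrences of pat (length m) in the file at path, reading it
 * in bufsize-byte blocks.  The last m-1 bytes of each block are carried
 * over to the next one, so a match straddling a block boundary is still
 * found.  A match can never fit entirely inside the m-1 carried bytes,
 * so nothing is counted twice. */
long count_matches(const char *path, const unsigned char *pat, size_t m,
                   size_t bufsize)
{
    if (m == 0 || bufsize < 2 * m) return -1;
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    unsigned char *buf = malloc(bufsize);
    if (!buf) { fclose(f); return -1; }

    long count = 0;
    size_t keep = 0;                      /* bytes carried over from the last block */
    for (;;) {
        size_t got = fread(buf + keep, 1, bufsize - keep, f);
        size_t len = keep + got;
        if (len < m) break;               /* end of file, no room for a match */

        for (size_t i = 0; i + m <= len; i++)   /* naive scan of this window */
            if (memcmp(buf + i, pat, m) == 0)
                count++;

        keep = m - 1;                     /* keep the tail for boundary matches */
        memmove(buf, buf + len - keep, keep);
    }
    free(buf);
    fclose(f);
    return count;
}
```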
+1  A: 

The best approach is to read the file in blocks and search those. You should make the block size a parameter, so you can experiment with what gives the best performance.

However, it is usually more efficient to index the file in some way so that you don't have to search linearly through the whole file. For example, KMP is a string-searching algorithm - are you just looking for occurrences of a word? Then you could build a hash table (on disk) of the words and their locations in the file and get very efficient search.

Larry Watanabe
Well, I'm trying to search for all occurrences of a hex string in a user-supplied file. Since the file will be different every time, and since I'm searching for hex values, hash tables don't seem like they'd be worth the cost.
samoz
True, that's why I said "usually" :) Every search problem is somewhat different. I would advocate just paging, but again, always use parameters so you can tune the settings for your particular setup.
Larry Watanabe
+2  A: 

If you are using a platform that supports it, you can use mmap(). Paging through the file is also a possibility, but remember to keep the buffer as large as possible to reduce the I/O overhead, and be careful at the boundary between two pages (a string could match but be split across the page boundary).

Alternatively, I suggest building an index of some sort and using it to restrict the search; KMP search is not particularly efficient. This of course depends on the nature of your file, how it gets created, etc.

Stefano Borini
+1 for using mmap. It should be noted that on 32-bit machines you will still need to mmap in blocks, because the address space is not large enough.
tsg
Yes, mmap (at least on OS X, but it's standard BSD) accepts a size_t len and an off_t offset. The OP needs to check whether these types hold 64-bit values; otherwise he will never be able to address past the 4 GiB limit.
Stefano Borini
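A rough sketch of the chunked-mmap() idea, assuming a POSIX system (the function and parameter names are my own): each window starts on a page boundary, since mmap offsets must be page-aligned, and consecutive windows overlap by m-1 bytes so a match straddling a window boundary is not missed.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count occurrences of pat (length m) by mapping the file in windows of
 * at most `window` bytes instead of mapping it all at once. */
long count_matches_mmap(const char *path, const unsigned char *pat,
                        size_t m, size_t window)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    if (m == 0 || window < page + m) { close(fd); return -1; }

    long count = 0;
    off_t fsize = st.st_size;
    off_t pos = 0;                        /* logical position of next unscanned byte */
    while (pos + (off_t)m <= fsize) {
        off_t base = pos - (pos % (off_t)page);  /* page-aligned mapping offset */
        size_t len = (size_t)(fsize - base);
        if (len > window) len = window;

        unsigned char *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, base);
        if (map == MAP_FAILED) { close(fd); return -1; }

        for (size_t i = (size_t)(pos - base); i + m <= len; i++)
            if (memcmp(map + i, pat, m) == 0)
                count++;
        munmap(map, len);

        if (base + (off_t)len >= fsize) break;
        pos = base + (off_t)len - (off_t)(m - 1);  /* back up for the overlap */
    }
    close(fd);
    return count;
}
```

On a 32-bit machine the comment above applies: keep `window` small and let `off_t` (with large-file support enabled) carry the 64-bit offset.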
+2  A: 

For the file access I would recommend using a memory-mapped file to avoid copying the data. It is trivial on Unix machines. You may have to split the mapping into smaller blocks if the file can't be mapped in one piece. I can provide some code if you are interested.

For the search I would recommend the Boyer-Moore search algorithm.

chmike
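For reference, a sketch of the Horspool simplification of Boyer-Moore, which keeps only the bad-character shift table (the function name is illustrative; it searches an in-memory buffer, such as a mapped window):

```c
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool: on a mismatch, shift by how far the last
 * character of the current window is from the end of the pattern.
 * Returns the offset of the first match of pat (length m) in text
 * (length n), or -1 if there is none. */
long bmh_search(const unsigned char *text, size_t n,
                const unsigned char *pat, size_t m)
{
    if (m == 0 || n < m) return -1;

    size_t shift[256];
    for (size_t c = 0; c < 256; c++)
        shift[c] = m;                      /* default: skip the whole pattern */
    for (size_t i = 0; i + 1 < m; i++)
        shift[pat[i]] = m - 1 - i;         /* distance from the pattern's end */

    size_t pos = 0;
    while (pos + m <= n) {
        if (memcmp(text + pos, pat, m) == 0)
            return (long)pos;
        pos += shift[text[pos + m - 1]];   /* shift on the window's last byte */
    }
    return -1;
}
```

Unlike KMP, the typical shift grows with the pattern length, which is why it tends to be faster on long patterns.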