tags:

views:

317

answers:

3

I have a big binary file (1 MB < size < 50 MB). I need to search for a string and extract the subsequent four bytes (which is the {size,offset} of actual data in another file). What is the most efficient way to do it so that the search would be fastest?

EDIT: The strings in the index files are in sorted order.

+2  A: 

Store the {string, size, offset} tuples in sorted order (by string) and use a binary search for the string.

You might also store, at the start of the file, offsets for each first letter of strings. For example if strings starting with 'a' began at position 120 and those starting with 'b' began at position 2000 in the file you could start the file with something like 120, 2000, ...

AgileJon
+2  A: 

Look up the Boyer–Moore string search algorithm.

smcameron
A: 

First, use memory mapping on the file. It will be a lot more efficient than reading it into RAM because instead of two copies (one in your program and one in file cache) there is only one copy.

If each string is a fixed length then a binary search is very easy because you can treat the memory as an array of character arrays.

If each string is variable length but 0 terminated, then you can use a variant of binary search where you jump to the middle of the string list, search for the next 0, then test the next string after that. Then jump forward or back to 1/4 or 3/4 of the string list and repeat.

If each string is variable length in Pascal style, with a byte count at the beginning it is trickier. A linear search from the beginning is not too slow, for infrequent searches. If you're looking for exact string matches, do not forget that you can skip most strings just by checking that the lengths do not match.

If you have to search the list often then building an array of char pointers to the string list would again make binary search really easy. If this file is really an index file for fast searches then it probably already has this in it somewhere, unless the designer intended to build a char pointer array while loading the file.

Zan Lynx
How to memorymap in C#?
iJeeves