tags:

views:

98

answers:

2

I need help in deciding with search algorithm to use for searching large files. here is what I am doing. Lets say file consists of time range t1 to t2. (t2>t1)

I need to get file offsets (fseek) of:

  1. time t3 that is bigger than t1
  2. time t4 that is smaller than time t2

    | ------| ---|----------------|
    
    
    t1      t3   t4              t2
    

Naive version is to iterate lines through whole file and return fseek when current time is t3, start from returned seek and iterate while current time is t4, return second fseek

Now lets say file is 100GB and i need to iterate while file just to get period of 2 seconds. Then this logic becomes too CPU and filesystem expensive. Looking for better solutions. Language in use is C. Lines are currently fixed size but I'd like to look into future and deal with some algorithm that doesn't use fixed size lengths.

+3  A: 

You can use a binary search if the times in the file are all sorted. Even better if the records in your file are of a fixed width, but you probably can make use of it even if they are not, with some work.

Mike Daniels
All times are in timestamp, sorted, fixed width.Sometimes there are multiple rows with same timestamp.
damir
@user196188: Perfect. If your records are fixed width, then you can compute the exact starting point of any record. You'd start your binary search by computing the offset of the record in the middle of the file, and then seeing if the time you are searching for is earlier or later than that time. You then look up the record that is halfway between the first time and the start/end of the file, and so on, until you've located the right record.
Mike Daniels
If there can be multiple records with the same timestamp, then you will have to examine the previous record(s) when you have found a matching timestamp, to ensure you wind up with the first record matching a given timestamp and not an arbitrary one.
Mike Daniels
Once you've got this working, you should then look at modifying it to be an interpolated binary search. This is the same as a binary search, except that instead of seeking to the middle of the current search window, you seek to position `(t - t1) / (t2 - t1)`.
caf
A: 

Since the values are fixed width, something like a binary search or an interpolation search sound like the best options. Also, if you plan on working with files in those size classes (100GB), you should consider using fgetpos/fsetpos due to the file size limits of fseek.

tsiki
Ok and If I need to add more types of files, i.e. files with rows not fixed length, any idea? Im currently using fseek,ftell with #define _FILE_OFFSET_BITS 64in gcc
damir
fgetpos/fsetpos are not useful because they can only return to locations you've already visited, not seek to arbitrary offsets. Use fseeko/ftello (these are POSIX not plain C) and make sure your build environment is for 64bit file offsets, as it seems to be.
R..
ok and lets complicate things a bit more, I'm sending this whole mess with RPC to java to read those offsets and present lines to user. I don't think java is compatible with seeking in off_t
damir
Hmm true, looks like fseeko/ftello is a better choice in this case.As for variable length rows, maybe making an index of the time values (as suggested above) would do the trick? You could also try some pre-made database management like Tokyo Cabinet.
tsiki
ftello returns off_ t, that wouldn't work.
damir