ansaurus

Question

Answer 1

+3 A:

You can use a binary search if the times in the file are all sorted. It is even better if the records in your file are of a fixed width, but with some work you can probably still use it if they are not.

Mike Daniels 2010-07-08 17:46:29

All times are timestamps, sorted, and fixed width. Sometimes there are multiple rows with the same timestamp.

damir 2010-07-08 17:48:16

@user196188: Perfect. If your records are fixed width, then you can compute the exact starting point of any record (record index times record width). You'd start your binary search by computing the offset of the record in the middle of the file, and then seeing if the time you are searching for is earlier or later than that record's time. You then look up the record that is halfway between that record and the start or end of the file, whichever side your time falls on, and so on, until you've located the right record.

Mike Daniels 2010-07-08 17:52:17

If there can be multiple records with the same timestamp, then you will have to examine the previous record(s) when you have found a matching timestamp, to ensure you wind up with the first record matching a given timestamp and not an arbitrary one.

Mike Daniels 2010-07-08 17:54:31
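A minimal sketch of the two comments above, assuming a hypothetical layout in which every record is `REC_SIZE` bytes and begins with a `TS_DIGITS`-digit decimal timestamp (both constants, and the helper names `read_ts` and `first_at_or_after`, are made up for illustration). Rather than stepping backwards after a hit, this variant keeps narrowing the window until it sits on the first record at or after the target, which handles duplicate timestamps in the same loop.

```c
#define _FILE_OFFSET_BITS 64     /* 64-bit off_t, as in the asker's build */
#define _POSIX_C_SOURCE 200809L  /* exposes fseeko */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical layout: adjust both constants to the real file format. */
#define REC_SIZE 32   /* bytes per record, including the newline */
#define TS_DIGITS 10  /* decimal digits of the leading timestamp */

/* Read the timestamp of record number idx; returns -1 on I/O error. */
static long long read_ts(FILE *f, off_t idx)
{
    char buf[TS_DIGITS + 1];
    /* Fixed width means the record's offset is just idx * REC_SIZE. */
    if (fseeko(f, idx * (off_t)REC_SIZE, SEEK_SET) != 0) return -1;
    if (fread(buf, 1, TS_DIGITS, f) != TS_DIGITS) return -1;
    buf[TS_DIGITS] = '\0';
    return strtoll(buf, NULL, 10);
}

/* Return the index of the first record whose timestamp is >= target,
   or nrec if every record is earlier. Returns -1 on I/O error. */
static off_t first_at_or_after(FILE *f, off_t nrec, long long target)
{
    off_t lo = 0, hi = nrec;            /* search window is [lo, hi) */
    assert(nrec >= 0);
    while (lo < hi) {
        off_t mid = lo + (hi - lo) / 2;
        long long ts = read_ts(f, mid);
        if (ts < 0) return -1;
        if (ts < target) lo = mid + 1;  /* answer lies to the right of mid */
        else hi = mid;                  /* mid may be the answer; keep it */
    }
    return lo;
}
```

The record count would be the file size divided by `REC_SIZE`. If the returned index is less than `nrec`, compare that record's timestamp with the target to tell an exact match from the next later record.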

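Once you've got this working, you should then look at modifying it to be an interpolated binary search (usually called an interpolation search). This is the same as a binary search, except that instead of seeking to the middle of the current search window, you seek to the fraction `(t - t1) / (t2 - t1)` of the way through it, where `t` is the time you are searching for and `t1` and `t2` are the times at the start and end of the window.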

caf 2010-07-09 02:21:54
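As a rough illustration of that probe calculation, here is a sketch that maps the fraction onto a record index. The function name `interpolate_probe` and its parameters are hypothetical. It assumes the window bounds are inclusive and that the timestamps at both ends have already been read.

```c
#include <assert.h>
#include <stdint.h>

/* Pick the next record index to probe inside the inclusive window [lo, hi].
   t_lo and t_hi are the timestamps stored at lo and hi; t is the target. */
static int64_t interpolate_probe(int64_t lo, int64_t hi,
                                 int64_t t_lo, int64_t t_hi, int64_t t)
{
    assert(lo <= hi && t_lo <= t_hi);
    if (t_hi == t_lo) return lo;  /* flat window: avoid dividing by zero */
    if (t <= t_lo) return lo;     /* clamp targets outside the window */
    if (t >= t_hi) return hi;
    /* (t - t1) / (t2 - t1) from the comment, scaled to the window width. */
    double frac = (double)(t - t_lo) / (double)(t_hi - t_lo);
    return lo + (int64_t)(frac * (double)(hi - lo));
}
```

The rest of the loop stays as in a plain binary search: read the record at the probe, then move `lo` or `hi` past it. When timestamps are spread fairly evenly this tends to need far fewer seeks, but a very skewed distribution can make it slower than plain bisection.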

Answer 2

A:

Since the values are fixed width, something like a binary search or an interpolation search sounds like the best option. Also, if you plan on working with files of that size (100 GB), you should consider using fgetpos/fsetpos due to the file size limits of fseek.

tsiki 2010-07-08 17:53:16

OK, and if I need to add more types of files, i.e. files with rows that are not fixed length, any ideas? I'm currently using fseek/ftell with `#define _FILE_OFFSET_BITS 64` in gcc.

damir 2010-07-08 18:03:29

fgetpos/fsetpos are not useful because they can only return to locations you've already visited, not seek to arbitrary offsets. Use fseeko/ftello (these are POSIX, not plain C) and make sure your build environment is set up for 64-bit file offsets, as it seems to be.

R.. 2010-07-08 18:09:25
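A small sketch of what that might look like on a POSIX system. The helper names `file_size` and `read_at` are made up for illustration. The `_FILE_OFFSET_BITS` define has to come before any system header is included.

```c
#define _FILE_OFFSET_BITS 64     /* must precede every #include */
#define _POSIX_C_SOURCE 200809L  /* exposes fseeko/ftello */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Fail the build early if off_t did not end up 64 bits wide. */
_Static_assert(sizeof(off_t) == 8, "off_t is not 64-bit");

/* Return the file size in bytes, or -1 on error. */
static off_t file_size(FILE *f)
{
    assert(f != NULL);
    if (fseeko(f, 0, SEEK_END) != 0) return -1;
    return ftello(f);
}

/* Read up to len bytes starting at byte offset pos; returns bytes read. */
static size_t read_at(FILE *f, off_t pos, void *buf, size_t len)
{
    assert(f != NULL);
    if (fseeko(f, pos, SEEK_SET) != 0) return 0;
    return fread(buf, 1, len, f);
}
```

Unlike `fsetpos`, `fseeko` accepts any computed offset, which is what a binary search over the file needs.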

OK, and let's complicate things a bit more: I'm sending this whole mess over RPC to Java, which reads those offsets and presents the lines to the user. I don't think Java is compatible with seeking in off_t.

damir 2010-07-08 18:13:55

Hmm, true, it looks like fseeko/ftello is a better choice in this case. As for variable-length rows, maybe making an index of the time values (as suggested above) would do the trick? You could also try a pre-made database management library like Tokyo Cabinet.

tsiki 2010-07-08 18:31:15
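One way the index idea could look, as a sketch only: scan the file once, remember the byte offset of every `stride`-th row together with its timestamp, then binary search that in-memory array to find where to start reading. It assumes each row begins with a decimal timestamp. The names `idx_entry`, `build_index` and `seek_hint` are hypothetical.

```c
#define _FILE_OFFSET_BITS 64
#define _POSIX_C_SOURCE 200809L  /* exposes getline and ftello */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* One index entry: a row's leading timestamp and its byte offset. */
struct idx_entry { long long ts; off_t pos; };

/* Scan the file once and index every stride-th row. Returns a malloc'd
   array (length in *count), or NULL if allocation fails or no rows exist. */
static struct idx_entry *build_index(FILE *f, size_t stride, size_t *count)
{
    struct idx_entry *idx = NULL;
    size_t n = 0, cap = 0, row = 0, linecap = 0;
    char *line = NULL;
    assert(stride > 0);
    rewind(f);
    for (;;) {
        off_t pos = ftello(f);  /* offset of the row about to be read */
        if (getline(&line, &linecap, f) < 0) break;
        if (row++ % stride != 0) continue;
        if (n == cap) {         /* grow the array geometrically */
            size_t ncap = cap ? cap * 2 : 1024;
            struct idx_entry *tmp = realloc(idx, ncap * sizeof *tmp);
            if (!tmp) { free(idx); idx = NULL; n = 0; break; }
            idx = tmp;
            cap = ncap;
        }
        idx[n].ts = strtoll(line, NULL, 10);
        idx[n].pos = pos;
        n++;
    }
    free(line);
    *count = n;
    return idx;
}

/* Byte offset to start a forward scan from: the last indexed row whose
   timestamp is < target, or the first indexed row if there is none. */
static off_t seek_hint(const struct idx_entry *idx, size_t count, long long target)
{
    size_t lo = 0, hi = count;
    assert(count > 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (idx[mid].ts < target) lo = mid + 1;
        else hi = mid;
    }
    return lo == 0 ? idx[0].pos : idx[lo - 1].pos;
}
```

After seeking to the hinted offset, read rows forward until the first one at or after the target. A larger `stride` keeps the index small for very large files at the cost of a longer forward scan.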

ftello returns off_t, that wouldn't work.

damir 2010-07-08 18:39:23


search algorithm on large files
