views: 306

answers: 7

For example, let's say I want to find a particular word or number in a file. The contents are in sorted order (obviously). Since I want to run a binary search on the file, it seems like a real waste of time to copy the entire file into an array and then run binary search... I've effectively made it a linear time algorithm, because I'll have to spend O(n) time copying the darn file before I can run my search.

Is there a faster way to do this? Is there maybe something like lseek which works with lines instead of bytes?

If there isn't, am I better off just doing a linear search instead (assuming I'm only running the search once for the entire duration of my program)?

+4  A: 

A disk-based binary search needs to be, at least initially, "block-aware", i.e. aware of the fact that whether you read a single byte or a whole bunch, the I/O cost is the same. The other thing it needs to be aware of is the relatively higher cost of a seek operation compared to a sequential read operation.

Several ways it can use this awareness of the characteristics of disk I/O (a short sketch of the first point follows the list):

  • Towards the end of the search, favor linear searching (scanning) rather than seeking.
  • In the beginning, check both the first and last elements in the block; this may help extrapolate a better guess for the next split.
  • Cache a tree (or even a short flat list) of some of the items found at various places in the file (a bit like the intermediate nodes in a formal B-tree structure).
  • Declare and use an appropriate buffer size
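For instance, a minimal C sketch of the first point, assuming POSIX I/O, one sorted word per line, and an outer probing loop (not shown) that keeps `lo` on a line boundary; `BLOCK` and `scan_tail` are names invented here:

```c
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK 4096   /* assumed block size for this sketch */

/* End-game of the search: once the byte range [lo, hi) still in play
 * fits in a single block, stop seeking, read it in one I/O, and scan
 * the sorted lines sequentially. Assumes lo falls on a line boundary
 * and that there is one word per line. */
static bool scan_tail(int fd, off_t lo, off_t hi, const char *key)
{
    char buf[BLOCK + 1];
    ssize_t n = pread(fd, buf, (size_t)(hi - lo), lo);  /* hi - lo <= BLOCK */
    if (n <= 0)
        return false;
    buf[n] = '\0';

    char *save = NULL;
    for (char *line = strtok_r(buf, "\n", &save); line != NULL;
         line = strtok_r(NULL, "\n", &save)) {
        int c = strcmp(key, line);
        if (c == 0)
            return true;    /* found */
        if (c < 0)
            return false;   /* lines are sorted, so we have passed it */
    }
    return false;
}
```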
mjv
+5  A: 

You cannot seek by line. It's pretty obvious once you think about it.

But you can do a sort-of binary search on a text file.

What you do is (sketched in C after the list):

  • Stat the file to get the length or seek to the end and get the position.
  • Memory map the file.
    (This is best, I think, but you can use lseek and read if you must.)
  • Seek to the middle of the file, minus your average line length. Just guess.
  • Scan forward for a newline, unless you are at position 0.
  • Read your line and compare.
  • Repeat for 1/4 or 3/4, then 1/8, 1/16, etc.
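A rough C version of those steps, assuming POSIX, one sorted word per line, and that the whole file fits in your address space; the only deviation is that it scans backward to the start of the line containing the guess (rather than forward for a newline), which sidesteps an end-of-range edge case, and `mmap_line_search` is just a name made up here:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if key is present, 0 if not, -1 on error. */
static int mmap_line_search(const char *path, const char *key)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }      /* stat for the length */
    size_t len = (size_t)st.st_size;
    if (len == 0) { close(fd); return 0; }
    char *base = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);  /* memory map */
    close(fd);
    if (base == MAP_FAILED) return -1;

    size_t klen = strlen(key);
    size_t lo = 0, hi = len;   /* invariant: lo and hi sit on line boundaries */
    int found = 0;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;                    /* guess a midpoint */
        size_t start = mid;
        while (start > lo && base[start - 1] != '\n')
            start--;                                        /* find the line boundary */
        char *nl = memchr(base + start, '\n', hi - start);
        size_t linelen = nl ? (size_t)(nl - (base + start)) : hi - start;

        size_t n = linelen < klen ? linelen : klen;         /* read the line and compare */
        int cmp = strncmp(key, base + start, n);
        if (cmp == 0 && linelen != klen)
            cmp = klen < linelen ? -1 : 1;

        if (cmp == 0) { found = 1; break; }
        else if (cmp < 0) hi = start;                       /* repeat on the left half */
        else lo = nl ? (size_t)(nl - base) + 1 : hi;        /* ... or the right half */
    }
    munmap(base, len);
    return found;
}
```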
Zan Lynx
The "minus average line length" bit isn't really necessary. Just be careful about fencepost errors.
Laurence Gonsalves
That's not a bad answer, assuming your file fits within your address space, of course. +1.
paxdiablo
Memory-map the file; it reduces the problem to scanning for lines in a block of memory with no visible I/O.
Jonathan Leffler
I suppose not, but the last time I did this (a while back) it seemed to get the middle string more reliably. For some reason this seemed important at the time. Heh.
Zan Lynx
Memory-mapping involves copying to memory, which could turn out to be costly for the problem at hand.
Murali VP
It only copies to memory the pages as they're accessed - demand paging. Which is no different to what happens when you use `read` (actually, it's better, because it involves one less memory-memory copy).
caf
+1  A: 

Yes, you can lseek, but it would help if the size of each word/number per line were fixed. If that is not the case, which is more likely, then you have to lseek by fractions of the file size and seek to the nearest word beginning to still achieve something close to the typical O(log n) time complexity of binary search.
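A minimal sketch of the fixed-size case, assuming every line is padded out to exactly `REC_LEN` bytes including its newline; the record length, the space padding, and the function name are assumptions made for illustration:

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define REC_LEN 16   /* 15 bytes of key padded with spaces, plus '\n' */

/* With fixed-size records, lseek/pread lands exactly on record
 * boundaries and the search is a textbook O(log n) binary search.
 * Returns the record number of the match, or -1 if absent or on error. */
static long fixed_record_search(int fd, const char *key)
{
    off_t size = lseek(fd, 0, SEEK_END);
    if (size <= 0 || size % REC_LEN != 0) return -1;

    long lo = 0, hi = size / REC_LEN;   /* record indices */
    char rec[REC_LEN + 1];
    while (lo < hi) {
        long mid = lo + (hi - lo) / 2;
        if (pread(fd, rec, REC_LEN, (off_t)mid * REC_LEN) != REC_LEN)
            return -1;
        rec[REC_LEN] = '\0';
        /* Trim the padding and newline before comparing. */
        char *end = rec + REC_LEN;
        while (end > rec && (end[-1] == '\n' || end[-1] == ' '))
            *--end = '\0';
        int cmp = strcmp(key, rec);
        if (cmp == 0) return mid;
        if (cmp < 0) hi = mid; else lo = mid + 1;
    }
    return -1;   /* not found */
}
```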

Murali VP
+2  A: 

If the file is small, like under a few hundred kilobytes, it's almost certainly faster to read (or virtual-memory map) the entire file into memory. This is because the overhead of doing several I/O operations to seek and transfer is much worse than just reading the whole file, which is what most programs do and most operating systems assume is done.

Unless all the lines are the same length, or have a very predictable length, there's no easy way to seek to line #n. But to perform a binary search, I'd work with byte offsets and read, say, 100 bytes (if the words are all less than 100 characters long) before and after the offset, for a total of 200 bytes. Then scan for the newlines before and after the middle of it to extract the word.
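A hedged sketch of that probe, assuming POSIX `pread()`, words shorter than 100 characters, and an outer caller that drives the byte-offset binary search; `WINDOW` and `read_word_at` are invented names:

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define WINDOW 100   /* assumes every word is shorter than this */

/* Read ~100 bytes on either side of byte offset `pos`, locate the
 * newlines bracketing `pos`, and copy the enclosed word into `word`
 * (which must hold WINDOW + 1 bytes). Returns the byte offset of the
 * start of that line, or -1 on error, so the caller can narrow its
 * search interval around the probed line. */
static off_t read_word_at(int fd, off_t pos, char *word)
{
    char buf[2 * WINDOW];
    off_t win = pos >= WINDOW ? pos - WINDOW : 0;
    ssize_t n = pread(fd, buf, sizeof buf, win);
    if (n <= 0)
        return -1;

    size_t mid = (size_t)(pos - win);
    if (mid >= (size_t)n)
        mid = (size_t)n - 1;

    size_t start = mid, end = mid;
    while (start > 0 && buf[start - 1] != '\n')   /* newline before the middle */
        start--;
    while (end < (size_t)n && buf[end] != '\n')   /* newline after the middle */
        end++;

    size_t len = end - start;
    if (len > WINDOW)
        len = WINDOW;   /* only possible if the word-length assumption is violated */
    memcpy(word, buf + start, len);
    word[len] = '\0';
    return win + (off_t)start;
}
```

The caller would probe the midpoint of its current byte interval, `strcmp()` the returned word against the key, and then move its lower bound just past the probed line or its upper bound to the line's start.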

wallyk
+1  A: 

There wouldn't be a line-based "lseek" function, because the file commands do not have the concept of a "line". This concept exists at a different layer of abstraction than the raw file commands.

As to whether it's faster or not, the answer will depend upon a number of factors, including the size of the file, the disk drive speed, and the amount of RAM available. If it isn't a large file, my guess is it would be faster to load the entire file into memory.

If it is a large file, I would use the binary search algorithm to narrow it down to a smaller range (say, a couple of megabytes), then load up that entire block.

Andrew Shepherd
A: 

There are so many performance tradeoffs here that it's impossible to know what makes sense until you have measurements on typical data.

If you're going to maintain this code, it needs to be simple. If searches are rare or the file is small, go with linear search. If the cost actually matters, you'll have to do some experiments.

The second thing I would try after linear search would be to mmap the file and scan through it for newlines. This does take linear time, but strchr can be very fast. It helps if you can guarantee the file ends in a newline. Once you have the lines demarcated, you can keep the number of comparisons small by doing a binary search.
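As a sketch of that combination, assuming POSIX, one sorted word per line, and a file that ends in a newline; `memchr` stands in for `strchr` because the mapping isn't NUL-terminated, and `mmap_index_search` is a made-up name:

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* One linear pass records where each line starts; the binary search
 * then compares the key against whole lines with no further I/O calls.
 * Returns 1 if found, 0 if not, -1 on error. */
static int mmap_index_search(const char *path, const char *key)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    size_t len = (size_t)st.st_size;
    if (len == 0) { close(fd); return 0; }
    char *base = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (base == MAP_FAILED) return -1;

    /* Pass 1: collect the offset of every line start. */
    size_t cap = 1024, nlines = 0;
    size_t *starts = malloc(cap * sizeof *starts);
    if (!starts) { munmap(base, len); return -1; }
    size_t off = 0;
    while (off < len) {
        if (nlines == cap) {
            cap *= 2;
            size_t *tmp = realloc(starts, cap * sizeof *starts);
            if (!tmp) { free(starts); munmap(base, len); return -1; }
            starts = tmp;
        }
        starts[nlines++] = off;
        char *nl = memchr(base + off, '\n', len - off);
        off = nl ? (size_t)(nl - base) + 1 : len;
    }

    /* Pass 2: ordinary binary search over the recorded lines. */
    size_t klen = strlen(key);
    size_t lo = 0, hi = nlines;
    int found = 0;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        size_t s = starts[mid];
        size_t e = (mid + 1 < nlines) ? starts[mid + 1] : len;
        while (e > s && base[e - 1] == '\n')
            e--;                                  /* drop the trailing newline */
        size_t linelen = e - s;
        size_t n = linelen < klen ? linelen : klen;
        int cmp = strncmp(key, base + s, n);
        if (cmp == 0 && linelen != klen)
            cmp = klen < linelen ? -1 : 1;
        if (cmp == 0) { found = 1; break; }
        if (cmp < 0) hi = mid; else lo = mid + 1;
    }

    free(starts);
    munmap(base, len);
    return found;
}
```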

Another option you should consider is Boyer-Moore string search. This is a sub-linear time search and depending on the size of the search pattern, it may be faster than the logarithmic binary search. Boyer-Moore is especially good with long search strings.

Finally, if you determine binary search is really good, but that identifying the lines is a performance bottleneck, you could precompute the start location of each line and store these precomputed locations in binary format in an auxiliary file.

I feel comfortable making only one prediction: it is almost certainly worth avoiding reading in one line at a time with something like readline() or fgets(), because this strategy invariably involves calling malloc() to hold the contents of the line. The cost of calling malloc() on every line is likely to swamp any cost of search or comparison.

Norman Ramsey
I was with you up till the last line... what did you mean by "it is almost certainly worth avoiding the cost of calling malloc() on every line"?
Goose Bumper
edited! hope it's clearer now
Norman Ramsey
Not really any clearer. I don't see *why* you would call malloc for each line. I never have done fgets that way, I have always used a static char buffer.
Zan Lynx
You would call malloc() so that you can read lines of any size, as opposed to being limited to the size of a static char buffer.
Norman Ramsey
Well, I suppose I can see using realloc to get a bigger buffer when needed, but there is no need to keep freeing it. That would be a mostly static buffer.
Zan Lynx
If you are keeping the lines around in order to do binary search on them, each one has to be saved on the heap. Otherwise, by the time you are ready to do a binary search, only the last line will be in the buffer. The way to save space for something is of course `malloc()`.
Norman Ramsey
There is still no reason to keep calling malloc enough times to make it a speed bottleneck. Get the file size and malloc all the space up front, then stuff the strings into it.
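In hedged form, that one-big-malloc layout might look like the following; the loading details and names such as `search_once` are assumptions, and error handling is kept minimal:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Comparator for stdlib bsearch(): the key is a plain string, each
 * array element is a pointer into the big buffer. */
static int cmp_key_line(const void *k, const void *elem)
{
    return strcmp((const char *)k, *(const char *const *)elem);
}

/* One malloc sized from the file length, one read of the whole file,
 * newlines overwritten with NULs in place, and a second array of
 * pointers to each line for bsearch(). Returns 1/0/-1. */
static int search_once(const char *path, const char *key)
{
    FILE *fp = fopen(path, "rb");
    if (!fp) return -1;
    fseek(fp, 0, SEEK_END);
    long size = ftell(fp);
    rewind(fp);

    char *buf = malloc((size_t)size + 1);        /* the single allocation */
    if (!buf || fread(buf, 1, (size_t)size, fp) != (size_t)size) {
        free(buf); fclose(fp); return -1;
    }
    fclose(fp);
    buf[size] = '\0';

    /* Count lines, then point into the buffer instead of copying. */
    size_t nlines = 0;
    for (long i = 0; i < size; i++)
        if (buf[i] == '\n') nlines++;
    if (size > 0 && buf[size - 1] != '\n') nlines++;   /* unterminated last line */
    if (nlines == 0) { free(buf); return 0; }

    char **lines = malloc(nlines * sizeof *lines);
    if (!lines) { free(buf); return -1; }
    size_t n = 0;
    for (char *p = buf; *p && n < nlines; ) {
        lines[n++] = p;
        char *nl = strchr(p, '\n');
        if (!nl) break;
        *nl = '\0';                              /* stuff the string in place */
        p = nl + 1;
    }

    int found = bsearch(key, lines, n, sizeof *lines, cmp_key_line) != NULL;
    free(lines);
    free(buf);
    return found;
}
```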
Zan Lynx
A: 

As mentioned above, since the file is a text file, predicting the byte at which a given line begins within the file can't be done reliably. The ersatz binary search idea is a pretty good one. But it really won't save you a ton unless the file is huge, given how fast sequential I/O is nowadays and how slow random I/O is.

As you mention, if you are going to read it in, you might as well linearly search it as you go. So do so: use a modified Boyer-Moore search as you read it in and you'll do pretty well.

Southern Hospitality
Boyer-Moore would usually be the smart thing to do, but in this case I just have one word per line. So strcmp() will have the same running time as Boyer-Moore in this case.
Goose Bumper