There are so many performance tradeoffs here that it's impossible to know what makes sense until you have measurements on typical data.
If you're going to maintain this code, it needs to be simple. If searches are rare or the file is small, go with linear search. If the cost actually matters, you'll have to do some experiments.
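If linear search wins, the baseline can stay very simple. Here is a minimal sketch; the key-prefix interpretation, function name, and fixed line buffer are my assumptions, and fgets() into a stack buffer is fine here because it does no per-line allocation (see the caveat at the end):

```c
#include <stdio.h>
#include <string.h>

/* Linear search: scan the file one line at a time for the first line
 * whose prefix matches `key`. Returns 1 on a hit (copying the line
 * into `out`), 0 otherwise. Assumes lines fit in a fixed buffer; a
 * real version would handle longer lines. */
static int linear_search(FILE *fp, const char *key, char *out, size_t outlen)
{
    char line[4096];
    size_t klen = strlen(key);

    rewind(fp);
    while (fgets(line, sizeof line, fp) != NULL) {
        if (strncmp(line, key, klen) == 0) {
            snprintf(out, outlen, "%s", line);
            return 1;
        }
    }
    return 0;
}
```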
The second thing I would try, after linear search, would be to mmap the file and scan it for newlines. This still takes linear time, but strchr can be very fast, and it helps if you can guarantee the file ends in a newline. Once you have the lines demarcated, you can keep the number of comparisons small by doing a binary search.
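A sketch of that approach follows, assuming a sorted file of newline-terminated lines searched by key prefix; error handling is minimal, and the code uses memchr rather than strchr because an mmap'd region is not NUL-terminated:

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s file key\n", argv[0]);
        return 2;
    }
    const char *key = argv[2];
    size_t klen = strlen(key);

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) { perror(argv[1]); return 1; }

    char *buf = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Pass 1: demarcate lines by scanning for newlines.
     * Assumes the file ends in a newline. */
    size_t nlines = 0, cap = 1024;
    char **lines = malloc(cap * sizeof *lines);
    for (char *p = buf; p < buf + st.st_size; ) {
        if (nlines == cap) lines = realloc(lines, (cap *= 2) * sizeof *lines);
        lines[nlines++] = p;
        char *nl = memchr(p, '\n', buf + st.st_size - p);
        if (!nl) break;                  /* no trailing newline: last line */
        p = nl + 1;
    }

    /* Pass 2: binary search on the line starts. Comparing only the
     * first klen bytes finds the first line with the given prefix. */
    size_t lo = 0, hi = nlines;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (strncmp(lines[mid], key, klen) < 0) lo = mid + 1;
        else hi = mid;
    }
    if (lo < nlines && strncmp(lines[lo], key, klen) == 0) {
        char *nl = memchr(lines[lo], '\n', buf + st.st_size - lines[lo]);
        size_t len = nl ? (size_t)(nl - lines[lo]) : (size_t)(buf + st.st_size - lines[lo]);
        printf("%.*s\n", (int)len, lines[lo]);
        return 0;
    }
    return 1;
}
```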
Another option you should consider is Boyer-Moore string search. Because it can skip ahead by up to the length of the pattern on a mismatch, it is sub-linear in practice, and depending on the size of the search pattern its small constant factors may beat the logarithmic binary search on files of moderate size. Boyer-Moore is especially good with long search strings.
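For illustration, here is a sketch of the Boyer-Moore-Horspool variant, which keeps only the bad-character skip table; full Boyer-Moore adds a good-suffix rule on top of this. The function name is mine:

```c
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool: find `pat` (length m) in `text` (length n).
 * Returns the offset of the first match, or -1. The skip table lets
 * the search jump up to m bytes per step, which is where the
 * sub-linear behavior comes from. */
static long bmh_search(const unsigned char *text, size_t n,
                       const unsigned char *pat, size_t m)
{
    size_t skip[256];
    if (m == 0 || m > n) return -1;

    for (size_t i = 0; i < 256; i++) skip[i] = m;   /* default: full jump */
    for (size_t i = 0; i + 1 < m; i++) skip[pat[i]] = m - 1 - i;

    for (size_t pos = 0; pos + m <= n; pos += skip[text[pos + m - 1]]) {
        if (memcmp(text + pos, pat, m) == 0)
            return (long)pos;
    }
    return -1;
}
```

This pairs naturally with the mmap approach above: the whole file is already one contiguous buffer, so bmh_search can run over it directly.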
Finally, if you determine binary search is really good, but that identifying the lines is a performance bottleneck, you could precompute the start location of each line and store these precomputed locations in binary format in an auxiliary file.
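A sketch of that precomputation step, assuming the offsets are stored as raw native-endian off_t values in a sidecar file (the naming and format are my own, and raw off_t is only portable between runs on the same machine and ABI):

```c
#include <stdio.h>
#include <sys/types.h>

/* Precompute the byte offset of each line start in `datafile` and
 * write them as raw off_t values to `idxfile`. A later search can
 * read or mmap the index and binary-search it without rescanning
 * the data file. Returns 0 on success, -1 on error. */
static int build_index(const char *datafile, const char *idxfile)
{
    FILE *in = fopen(datafile, "rb");
    FILE *out = fopen(idxfile, "wb");
    if (!in || !out) return -1;

    off_t pos = 0;
    int c, at_line_start = 1;
    while ((c = fgetc(in)) != EOF) {
        if (at_line_start) {
            fwrite(&pos, sizeof pos, 1, out);
            at_line_start = 0;
        }
        if (c == '\n') at_line_start = 1;
        pos++;
    }
    fclose(in);
    return fclose(out) == 0 ? 0 : -1;
}
```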
I feel comfortable making only one prediction: it is almost certainly worth avoiding reading the file one line at a time with something like readline() or fgets(). With an allocating reader such as getline(), that strategy means a malloc() for every line; even fgets() pays a buffer copy per line. That per-line overhead is likely to swamp any cost of search or comparison.
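A minimal sketch of the alternative, reading the whole file with a single allocation instead of one per line (the function name is mine):

```c
#include <stdio.h>
#include <stdlib.h>

/* Read an entire file into one malloc'd, NUL-terminated buffer.
 * One allocation total, rather than one per line. Caller frees. */
static char *slurp(const char *path, size_t *len)
{
    FILE *fp = fopen(path, "rb");
    if (!fp) return NULL;
    fseek(fp, 0, SEEK_END);
    long size = ftell(fp);
    rewind(fp);

    char *buf = malloc(size + 1);
    if (buf && fread(buf, 1, size, fp) == (size_t)size) {
        buf[size] = '\0';
        if (len) *len = (size_t)size;
    } else {
        free(buf);
        buf = NULL;
    }
    fclose(fp);
    return buf;
}
```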