views:

416

answers:

5

Is there a library that I can use to perform binary search in a very big text file (can be 10GB).

The file is a sort of a log file - every row starts with a date and time. Therefore rows are ordered.

A: 

The List object has a Binary Search method.

http://msdn.microsoft.com/en-us/library/w4e7fxsh%28VS.80%29.aspx

Jason M
This would require him to load the entire file into a list of lines. The file size prohibits this.
Jason Punyon
And how in gods name should he load a 10GB file in memory?
Steven
@Jason Punyon, @Steven -- Yeah, if you're excessively stupid and you plan on loading the entire thing into a single list, yes. Did I say that? No. I was just giving a starting spot. Go troll someone else.
Jason M
A: 

You would need to be able to stream the file, but you would also need random access. I'm not sure how you accomplish this short of a guarantee that each line of the file contains the same number of bytes. If you had that, you could get a Stream of the object and use the Seek method to move around in the file, and from there you could conduct your binary search by reading in the number of bytes that constitute a line. But again, this is only valid if the lines are the same number of bytes. Otherwise, you would jump in and out of the middle of lines.

Something like

byte[] buffer = new byte[lineLength];
stream.Seek(lineLength * searchPosition, SeekOrigin.Begin);
stream.Read(buffer, 0, lineLength);
string line = Encoding.Default.GetString(buffer);
Anthony Pegram
It makes sense, but unfortunately the lines are not guaranteed to be of the same length
Alex
+2  A: 

As the line lengths are not guaranteed to be the same length, you're going to need some form of recognisable line delimiter e.g. carriage return or line feed.

The binary search pattern can then be pretty much your traditional algorithm. Seek to the 'middle' of the file (by length), seek backwards (byte by byte) to the start of the line you happen to land in, as identified by the line delimiter sequence, read that record and make your comparison. Depending on the comparison, seek halfway up or down (in bytes) and repeat.

When you identify the start index of a record, check whether it was the same as the last seek. You may find that, as you dial in on your target record, moving halfway won't get you to a different record. e.g. you have adjacent records of 100 bytes and 50 bytes respectively, so jumping in at 75 bytes always takes you back to the start of the first record. If that happens, read on to the next record before making your comparison.

You should find that you will reach your target pretty quickly.

Neil Moss
This isn't bad as well, it will work.
csharptest.net
A: 

This shouldn't be too bad under the constraint that you hold an Int64 in memory for every line-feed in the file. That really depends upon how long the line of text is on average, given 1000 bytes per line you be looking at around (10,000,000,000 / 1000 * 4) = 40mb. Very big, but possible.

So try this:

  1. Scan the file and store the ordinal offset of each line-feed in a List
  2. Binary search the List with a custom comparer that scans to the file offset and reads the data.
csharptest.net
If your going to be performing the search only a few times then I'd go with Neil Moss's answer, If you going to do the search a lot or are uncomfortable writing a binary search, use this approach.
csharptest.net
+2  A: 

I started to write the pseudo-code on how to do it, but I gave up since it may seem condescending. You probably know how to write a binary search, it's really not complicated.

You won't find it in a library, for two reasons:

  1. It's not really "binary search" - the line sizes are different, so you need to adapt the algorithm (e.g. look for the middle of the file, then look for the next "newline" and consider that to be the "middle").
  2. Your datetime log format is most likely non-standard (ok, it may look "standard", but think a bit.... you probably use '[]' or something to separate the date from the log message, something like [10/02/2001 10:35:02] My message ).

On summary - I think your need is too specific and too simple to implement in custom code for someone to bother writing a library :)

Virgil
You are probably right. Thanks
Alex