views:

835

answers:

5

I have a big text file with more than 200,000 lines, and I need to read just a few of them. For instance: lines 10,000 to 20,000.

Important: I don't want to open and scan the full file to extract these lines, because of performance concerns.

Is this possible?

+1  A: 

You will have to search through the file to count the newlines, unless you know that all lines are the same length (in which case you could seek to the offset = line_number * line_size_in_bytes, where line_number counts from zero and line_size_in_bytes includes all characters in the line).

If the lines are of variable or unknown length, then while reading through the file once you could index the beginning offset of each line, so that subsequent reads can seek straight to the start of a given line.
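For instance, both cases could be sketched in Python like this (a rough sketch; the function names are just illustrative):

```python
def read_fixed_length_lines(path, start, count, line_size):
    """When every line is exactly line_size bytes (terminator included),
    seek straight to line `start` (counting from zero) and read `count` lines."""
    with open(path, "rb") as f:
        f.seek(start * line_size)
        return [f.read(line_size) for _ in range(count)]

def build_line_index(path):
    """Variable-length lines: one full pass that records the byte offset
    where each line begins, so later reads can seek directly."""
    offsets, pos = [], 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets
```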

Karl Voigtland
+6  A: 

If the lines are fixed length, then it would be possible to seek to a specific byte position and load just the lines you want. If lines are variable length, the only way to find the lines you're looking for is to parse the file and count the number of end-of-line markers. If the file changes infrequently, you might be able to get sufficient performance by performing this parsing once and then keeping an index of the byte positions of each line to speed future accesses (perhaps writing that index to disk so it doesn't need to be done every time your program is run).
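A hedged sketch of that idea in Python (storing the index as a JSON list of offsets is just one possible choice, and the mtime check is a simple stand-in for "file changes infrequently"):

```python
import json
import os

def load_or_build_index(path, index_path):
    """Reuse a saved line-offset index if it is newer than the data
    file; otherwise parse the file once, save the index, and return it."""
    if (os.path.exists(index_path)
            and os.path.getmtime(index_path) >= os.path.getmtime(path)):
        with open(index_path) as f:
            return json.load(f)
    offsets, pos = [], 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    with open(index_path, "w") as f:
        json.dump(offsets, f)
    return offsets

def read_line_range(path, offsets, start, end):
    """Read lines [start, end), counting from zero, via one seek."""
    with open(path, "rb") as f:
        f.seek(offsets[start])
        return [f.readline() for _ in range(end - start)]
```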

rmeador
Caveat: Some file formats include an index near the beginning, or sometimes near the end. Then you read the index and use it to compute the starting position of the data you need. Yes, this is easier and more common in binary formats, but I've seen it done in a text format.
dmckee
+1 for the answer. @dmckee: An index at the beginning seems like no real problem. For an index at the end, you can seek to the end, and you probably know the index size, so that doesn't seem to be a big problem either?
neuro
@neuro: the last element of an index at the end has to be a fixed-size offset for the beginning of the index. You seek to the end, back up by a known amount, read the index offset, go to the index, and proceed from there. Obvious, right? :)
dmckee
A: 

If these lines are all the same length you could compute an offset for a given line and read just those bytes.

If the lines are varying length then you really have to read the entire file to count how many lines there are. Line terminating characters are just arbitrary bytes in the file.

Brian Ensink
A: 

If the lines are fixed length, then you just compute the offset; no problem.

If they're not (e.g. a regular CSV file), then you'll need to go through the file, either to build an index or just to read the lines you need. To make reading the file a little faster, a good idea would be to use memory-mapped files (see the implementation that's part of the Boost iostreams: http://www.boost.org/doc/libs/1_39_0/libs/iostreams/doc/classes/mapped_file.html).
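Boost's mapped_file is for C++; as a rough sketch of the same idea using Python's standard mmap module instead:

```python
import mmap

def lines_between(path, start, end):
    """Return lines [start, end), counting from zero, by scanning a
    memory-mapped file; the OS pages the data in lazily as it is touched."""
    result = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = 0
            for lineno in range(end):
                nl = mm.find(b"\n", pos)
                if nl == -1:
                    break
                if lineno >= start:
                    result.append(mm[pos:nl + 1])
                pos = nl + 1
    return result
```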

Eugen Constantin Dinca
A: 

As others noted, if the lines are not of fixed width, this is impossible without building an index. However, if you are in control of the file's format, you can get ~O(log(size)) instead of O(size) performance in finding the start line, if you store the line number itself on each line, i.e. make the file contents look something like this:

1: val1, val2, val3
2: val4
3: val5, val6
4: val7, val8, val9, val10

With this format of the file, you can quickly find the needed line by binary search: start by seeking into the middle of the file. Read till the next newline. Then read the line and parse the number. If the number is bigger than the target, repeat the algorithm on the first half of the file; if it is smaller than the target line number, repeat it on the second half.
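A sketch of that search in Python, assuming the `N: ...` format above with strictly increasing line numbers (the helper name is made up):

```python
def _first_line_after(f, pos):
    """Return (start_offset, line) of the first full line beginning
    strictly after pos (or at the top of the file when pos == 0)."""
    f.seek(pos)
    if pos:
        f.readline()  # discard the (possibly partial) line we landed in
    start = f.tell()
    return start, f.readline()

def find_line_offset(path, target):
    """Binary-search a file of 'N: ...' lines with strictly increasing N;
    return the byte offset of line N == target, or -1 if absent."""
    with open(path, "rb") as f:
        size = f.seek(0, 2)
        lo, hi = 0, size - 1
        while lo < hi:
            mid = (lo + hi) // 2
            _, line = _first_line_after(f, mid)
            # Past EOF or line number >= target: keep searching lower.
            if not line or int(line.split(b":", 1)[0]) >= target:
                hi = mid
            else:
                lo = mid + 1
        start, line = _first_line_after(f, lo)
        if line and int(line.split(b":", 1)[0]) == target:
            return start
        return -1
```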

You'd need to be careful about the corner cases (e.g. the "beginning" and "end" of your range falling on the same line, etc.), but this approach worked excellently for me in the past for parsing logfiles that had a date on each line (and I needed to find the lines between certain timestamps).

Of course, this still does not beat the performance of the explicitly built index or the fixed-size records.

Andrew Y