tags:
views: 2410
answers: 8

Dear all,

I need to read the number of lines in a file before doing some operations on it. When I try reading the file and incrementing a line_count variable at each iteration until I reach EOF, it is not fast enough for my case. I used both ifstream and fgets; they were both slow. Is there a hacky way to do this, of the sort used by, for instance, BSD, the Linux kernel, or Berkeley DB (maybe using bitwise operations)?

As I said before, there are millions of lines in that file and it keeps getting larger; each line has about 40 or 50 characters. I'm using Linux.

Note: I'm sure there will be people who say "use a DB, idiot". But briefly, in my case I can't use a DB.

+3  A: 

It isn't slow because of your algorithm; it is slow because IO operations are slow. I suppose you are using a simple O(n) algorithm that simply goes over the file sequentially. In that case, there is no faster algorithm that can optimize your program.

However, while there is no faster algorithm, there is a faster mechanism called a "memory mapped file". There are some drawbacks to mapped files and it might not be appropriate for your case, so you'll have to read about it and figure that out yourself.

Memory mapped files won't let you implement an algorithm better than O(n), but they may reduce IO access time.
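For what it's worth, here is a minimal sketch of the memory-mapped approach on Linux. The function name and the abbreviated error handling are mine, and `madvise` is only a hint to the kernel, not a requirement:

```cpp
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>

// Count '\n' bytes by mapping the whole file into memory (Linux/POSIX).
long count_lines_mmap( const char * path ) {
    int fd = open( path, O_RDONLY );
    if ( fd < 0 ) return -1;
    struct stat st;
    if ( fstat( fd, &st ) < 0 ) { close( fd ); return -1; }
    if ( st.st_size == 0 )      { close( fd ); return 0; }
    char * p = static_cast<char *>( mmap( 0, st.st_size, PROT_READ,
                                          MAP_PRIVATE, fd, 0 ) );
    close( fd );                       // the mapping keeps the data alive
    if ( p == MAP_FAILED ) return -1;
    madvise( p, st.st_size, MADV_SEQUENTIAL );   // we scan front to back
    long n = std::count( p, p + st.st_size, '\n' );
    munmap( p, st.st_size );
    return n;
}
```

The scan itself is the same O(n) loop either way; what you save is the copy from kernel page cache into a user buffer.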

The question was about making the runtime faster, not how the algorithm scales. An O(n) algorithm which takes a couple of cycles per n is faster than one which takes a year for each n.
Pete Kirkham
+11  A: 

The only way to find the line count is to read the whole file and count the number of line-end characters. The fastest way to do this is probably to read the whole file into a large buffer with one read operation and then go through the buffer counting the '\n' characters.

As your current file size appears to be about 60Mb, this is not an attractive option. You can get some of the speed by not reading the whole file, but reading it in chunks, say of size 1Mb. You also say that a database is out of the question, but it really does look to be the best long-term solution.

Edit: I just ran a small benchmark on this and using the buffered approach (buffer size 1024K) seems to be a bit more than twice as fast as reading a line at a time with getline(). Here's the code - my tests were done with g++ using -O2 optimisation level:

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <ctime>
using namespace std;

unsigned int FileRead( istream & is, vector <char> & buff ) {
    is.read( &buff[0], buff.size() );
    return is.gcount();
}

unsigned int CountLines( const vector <char> & buff, int sz ) {
    int newlines = 0;
    const char * p = &buff[0];
    for ( int i = 0; i < sz; i++ ) {
        if ( p[i] == '\n' ) {
            newlines++;
        }
    }
    return newlines;
}

int main( int argc, char * argv[] ) {
    time_t now = time(0);
    if ( argc == 1 ) {
        cout << "lines\n";
        ifstream ifs( "lines.dat" );
        int n = 0;
        string s;
        while( getline( ifs, s ) ) {
            n++;
        }
        cout << n << endl;
    }
    else {
        cout << "buffer\n";
        const int SZ = 1024 * 1024;
        std::vector <char> buff( SZ );
        ifstream ifs( "lines.dat" );
        int n = 0;
        while( int cc = FileRead( ifs, buff ) ) {
            n += CountLines( buff, cc );
        }
        cout << n << endl;
    }
    cout << time(0) - now << endl;
}
anon
I'd like to add that the 1MB buffer should be 1,048,576 bytes, not 1,000,000 bytes. More precisely, it should be a multiple of the "block size" of the device storing the file. In my experience this is 512 bytes for hard disk drives, and 2048 bytes for CD/DVD drives. Don't know if that is in some specification or not.
Vilx-
Also, you should not read one block at a time. There is a constant overhead for every read operation, so the more data you read at once, the faster your process will be. However, I've also found that there is little difference in overall speed (for a HDD) between reading 400 blocks at once and reading 800 blocks at once. So I guess 1,048,576 bytes will be more than enough.
Vilx-
@neil can you post your code too?
systemsfault
Last time I played this game, I tested a lot of buffer sizes to find the sweet spot where we got most bang per buck (both RAM and speed were at a premium). This was output to flash, so the result isn't applicable to this problem, but I found that you got 90% or so of the max possible speed with 25% or so of the max beneficial memory. So it's all down to how the two trade off. Of course 1MB is nothing on a PC.
Steve Jessop
Above a certain point, using a larger buffer will actually result in *slower* performance, since less of the RAM will fit in L2 cache. And for really large buffers, the OS may need to page to disk in order to provide you with the requested amount of RAM.
j_random_hacker
Little bug: if the file does not end with the character \n, your second loop will report a line count one lower than your first.
j_random_hacker
True - I wasn't offering this as an actual solution to the question, just as a rough benchmark for performance of two possible approaches.
anon
Creating your own buffer is overkill. The fstream objects do their own buffering.
Martin York
The buffered version is faster than using the built-in stream buffering. See my comments to your answer.
anon
You can replace the fstream's own buffers by calling the relevant methods on ifs.rdbuf(). You'd still have more call overhead scanning smaller chunks of data, but it should help with the overhead from blocking I/O.
Steve Jessop
+6  A: 

Don't use C++ stl strings and getline ( or C's fgets), just C style raw pointers and either block read in page-size chunks or mmap the file.

Then scan the block at the native word size of your system (i.e., either uint32_t or uint64_t) using one of the magic 'SIMD Within A Register' (SWAR) algorithms for testing the bytes within the word. An example is here; the loop with 0x0a0a0a0a0a0a0a0aLL in it scans for line breaks. (That code gets to around 5 cycles per input byte while matching a regex on each line of a file.)
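A hedged reconstruction of that SWAR trick (my own sketch, not the linked code): XOR turns every byte equal to '\n' into a zero byte, a carry-safe mask then sets the high bit of exactly the zero bytes, and a popcount of the mask gives the number of matches in the word. The simpler `(v - 0x01…) & ~v & 0x80…` zero-byte test can overcount when matches are adjacent, which is why the mask below is the longer variant.

```cpp
#include <cstdint>
#include <cstring>
#include <cstddef>

// Count '\n' bytes eight at a time; tail bytes are handled one by one.
std::size_t swar_count_newlines( const char * buf, std::size_t len ) {
    const uint64_t nl   = 0x0a0a0a0a0a0a0a0aULL;  // '\n' in every byte
    const uint64_t low7 = 0x7f7f7f7f7f7f7f7fULL;
    std::size_t count = 0, i = 0;
    for ( ; i + 8 <= len; i += 8 ) {
        uint64_t w;
        std::memcpy( &w, buf + i, 8 );    // safe unaligned load
        uint64_t x = w ^ nl;              // '\n' bytes become 0x00
        // High bit set in exactly the zero bytes; per-byte sums below
        // cannot carry into the neighbouring byte.
        uint64_t zeros = ~( ( ( x & low7 ) + low7 ) | x | low7 );
        count += (std::size_t)__builtin_popcountll( zeros );  // GCC builtin
    }
    for ( ; i < len; ++i )                // leftover tail bytes
        if ( buf[i] == '\n' ) ++count;
    return count;
}
```

`__builtin_popcountll` is GCC/Clang-specific; on other compilers you would substitute a portable popcount.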

If the file is only a few tens or a hundred or so megabytes, and it keeps growing (i.e., something keeps writing to it), then there's a good likelihood that Linux has it cached in memory, so it won't be disk-IO limited, but memory-bandwidth limited.

If the file is only ever being appended to, you could also remember the number of lines and previous length, and start from there.


It has been pointed out that you could use mmap with C++ STL algorithms, creating a functor to pass to std::for_each. I suggested that you shouldn't do it, not because you can't do it that way, but because there is no gain in writing the extra code to do so. Or you could use Boost's mmapped iterator, which handles it all for you; but for the problem the code I linked to was written for, this was much, much slower, and the question was about speed, not style.

Pete Kirkham
There is no reason you can't use mmap or reading blocks in C++.
anon
You have to jump through a lot of hoops to convert between stl containers and strings and raw mmapped memory, which often involves copying and indirection, rather than just calling the functions and using the memory directly using the C subset of C++.
Pete Kirkham
There is of course no "C subset" of C++. Just because you don't use a std container or string does not make it "not C++" in some way.
anon
And to add to what Neil said, std algorithms work perfectly fine on raw memory, using pointers as iterators. You could trivially write a functor performing the SIMD tricks specified, and run that over the file data with std::for_each. STL is more than just the containers
jalf
@Neil you know full well that there is a subset of C++ which is close to idiomatic C. I've posted this exact response before and been criticised for it being C rather than C++; you can't win.
Pete Kirkham
Thanks Pete, this is actually the technique that I was trying to imply in the question. I'm going to try to implement this.
systemsfault
+1  A: 

You can only get a definitive answer by scanning the entire file looking for newline characters. There's no way around that.

However, there are a couple of possibilities which you may want to consider.

1/ If you're using a simplistic loop, reading one character at a time and checking for newlines, don't. Even though the I/O may be buffered, function calls themselves are expensive, time-wise.

A better option is to read large chunks of the file (say 5M) into memory with a single I/O operation, then process that. You probably don't need to worry too much about special assembly instructions since the C runtime library will be optimized anyway - a simple strchr() should do it.

2/ If you're saying that the general line length is about 40-50 characters and you don't need an exact line count, just grab the file size and divide by 45 (or whatever average you deem appropriate).

3/ If this is something like a log file and you don't have to keep it in one file (may require rework on other parts of the system), consider splitting the file periodically.

For example, when it gets to 5M, move it (e.g., x.log) to a dated file name (e.g., x_20090101_1022.log), work out how many lines there are at that point (storing the count in x_20090101_1022.count), then start a new x.log file. The characteristics of log files mean that this dated section will never change, so you will never have to recalculate the number of lines.

To process the log "file", you'd just cat x_*.log through some process pipe rather than cat x.log. To get the line count of the "file", do a wc -l on the current x.log (relatively fast) and add it to the sum of all the values in the x_*.count files.

paxdiablo
+1  A: 

The thing that takes time is loading 40+ MB into memory. The fastest way to do that is either to memory-map it, or to load it in one go into a big buffer. Once you have it in memory, one way or another, a loop traversing the data looking for \n characters is almost instantaneous, no matter how it is implemented.

So really, the most important trick is to load the file into memory as fast as possible. And the fastest way to do that is to do it as a single operation.

Otherwise, plenty of tricks may exist to speed up the algorithm. If lines are only added, never modified or removed, and if you're reading the file repeatedly, you can cache the lines read previously, and the next time you have to read the file, only read the newly added lines.

Or perhaps you can maintain a separate index file showing the location of known '\n' characters, so those parts of the file can be skipped over.

Reading large amounts of data from the harddrive is slow. There's no way around that.

jalf
Above a certain point, using a larger buffer will actually result in *slower* performance, since less of the RAM will fit in L2 cache. And for really large buffers, the OS may need to page to disk in order to provide you with the requested amount of RAM
j_random_hacker
+6  A: 

You wrote that it keeps getting larger. This sounds like a log file or something similar where new lines are appended but existing lines are not changed. If this is the case, you could try an incremental approach.

Parse to the end of the file. Remember the line count and the offset of EOF. When the file grows, fseek to the offset, parse to EOF, and update the line count and the offset.
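A sketch of that incremental idea (the struct and its names are hypothetical; any persistence of the state between program runs is left out):

```cpp
#include <fstream>
#include <string>
#include <cstddef>

// Keep (offset, lines) between scans; each call to update() only reads
// the bytes appended since the previous call.
struct LineTracker {
    std::streamoff offset;
    std::size_t    lines;
    LineTracker() : offset( 0 ), lines( 0 ) {}

    void update( const std::string & path ) {
        std::ifstream in( path.c_str(), std::ios::binary );
        in.seekg( offset );                        // resume after old EOF
        char buf[ 65536 ];
        while ( in.read( buf, sizeof buf ), in.gcount() > 0 ) {
            std::streamsize got = in.gcount();
            for ( std::streamsize i = 0; i < got; ++i )
                if ( buf[ i ] == '\n' ) ++lines;
            offset += got;                         // new resume point
        }
    }
};
```

After the first full scan, each subsequent count costs only as much as the newly appended data.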

Ludwig Weinzierl
Yes, that's true, it is like a log file (but not a log file: an aggregate of statistics from several log files from several developers), and this program is developed by another person who maintains it. But yours is a feasible and local solution. I just wondered about a globally optimal solution that can count very fast without any a priori knowledge about the file. This is just out of curiosity :).
systemsfault
+1  A: 

Some common gotchas to watch out for if getting an exact count is important:

  1. Many text files don't have a line separator at the end of the last line. So if your file says "Hello, World!", you could end up with a count of 0 rather than 1. Rather than just counting the line separators, you'll need a simple state machine to keep track.

  2. Text files from older versions of Mac OS may use '\r' rather than '\n' to separate lines. If you're reading the raw bytes into a buffer (e.g., with your stream in binary mode) and scanning them, you'll come up with a count of 0 on these files. You can't count both '\r's and '\n's, because PC files generally end a line with both. Again, you'll need a simple state machine. (Alternatively, you can read the file in text mode rather than binary mode. The text interfaces will normalize newlines to '\n' for files that conform to the convention used on your platform. If you're reading files from other platforms, you'll be back to binary mode with a state machine.)

  3. Some very obscure files use Unicode U+2028 LINE SEPARATOR (or even U+2029 PARAGRAPH SEPARATOR) as line separators instead of the more common carriage return and/or line feed.

  4. You'll have to consider whether you want to count some other control characters as line breakers. For example, should a form feed or vertical tab be considered going to a new line?

  5. If you ever have a super long line in the file, the getline() approach can throw an exception causing your simple line counter to fail on a small number of files. (This is particularly true if you're reading an old Mac file on a non-Mac platform, causing getline() to see the entire file as one gigantic line.) By reading chunks into a fixed-size buffer and using a state machine, you can make it bullet proof.

The code in the accepted answer suffers from most of these traps. Perhaps none of them apply in your situation.
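For illustration, a small state machine covering points 1 and 2 (treating '\n', '\r', and "\r\n" each as one line ending, and counting an unterminated final line); the Unicode separators in point 3 would need a decoder on top of this, and the function name is mine:

```cpp
#include <cstddef>

// One pass over raw bytes. prev_cr suppresses double-counting "\r\n";
// in_line remembers whether an unterminated final line is pending.
std::size_t count_lines_sm( const char * p, std::size_t n ) {
    std::size_t lines = 0;
    bool in_line = false;
    bool prev_cr = false;
    for ( std::size_t i = 0; i < n; ++i ) {
        if ( p[ i ] == '\n' ) {
            if ( !prev_cr ) ++lines;   // bare '\n' (a "\r\n" pair was
            in_line = false;           // already counted at the '\r')
            prev_cr = false;
        } else if ( p[ i ] == '\r' ) {
            ++lines;                   // '\r' alone, or start of "\r\n"
            in_line = false;
            prev_cr = true;
        } else {
            in_line = true;
            prev_cr = false;
        }
    }
    if ( in_line ) ++lines;            // final line without a terminator
    return lines;
}
```

Because the state is only two booleans, this works unchanged when feeding the file through in fixed-size chunks, as point 5 recommends.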

Adrian McCarthy
+1, excellent points. #3 scared me though -- first time I've heard of that! <shiver>
j_random_hacker
A: 

Remember that all fstreams are buffered, so in effect they already read in chunks; you do not have to recreate this functionality. All you need to do is scan the buffer. Don't use getline() though, as this will force you to size a string. So I would just use the STL std::count_if and stream iterators.

#include <iostream>
#include <fstream>
#include <iterator>
#include <algorithm>

struct TestEOL
{
    bool operator()(char c)
    {
        last = c;
        return last == '\n';
    }
    char last;
};

int main()
{
    std::fstream  file("Plop.txt");

    TestEOL       test;
    std::size_t   count = std::count_if(std::istreambuf_iterator<char>(file),
                                        std::istreambuf_iterator<char>(),
                                        test);

    if (test.last != '\n')  // If the last character checked is not '\n'
    {                       // then the last line in the file has not been
        ++count;            // counted, so increment the count so we count
    }                       // the last line even if it is not '\n' terminated.

    std::cout << count << "\n";
}
Martin York
I ran your code against the two possible implementations I posted. Yours runs at the same speed as the getline() version (i.e. half the speed of the explicitly buffered one) but unfortunately also prints out the wrong number of lines - 1000001 versus 1000000 for the implementations I posted on my million line text file, which being in Windows format is terminated by CR/LF.
anon
PS It seems to think that test.last contains the zero byte.
anon
This is because the file is opened in text mode. This forces an EOL translation which requires processing. If the file is opened in binary mode there is no EOL translation and it will be faster. Whether you need EOL translation or not will depend on what you are doing.
Martin York
@Neil: Windows format is terminated by CR/LF: this sequence should be auto-translated by the stream into '\n' because it is the EOL marker. If you move files from one OS to another then all bets are off and you should transform your files before use. See dos2unix.
Martin York