views: 10587, answers: 5
I work with huge data files, and sometimes I only need to know the number of lines in them. Usually I open them up and read them line by line until I reach the end of the file.

I was wondering if there is a smarter way to do that.

+2  A: 

On Unix-based systems, use the wc command on the command-line.

Peter Hilton
wc -l for the line count.
IainMH
@IainMH, your second suggestion just counts the number of entries in the current directory. Not what was intended? (or asked for by the OP)
Paul
@IainMH: that's what wc does anyway (reading the file, counting line-ending).
PhiLho
@PhiLho You'd have to use the -l switch to count the lines. (Don't you? - it's been a while)
IainMH
@Paul - you are of course 100% right. My only defence is that I posted that before my coffee. I'm as sharp as a button now. :D
IainMH
You can get wc.exe for Win32 systems: see http://unxutils.sourceforge.net/
Jason S
+1  A: 

The only way to know how many lines there are in a file is to count them. You can of course derive a metric from your data giving you the average length of one line, then get the file size and divide it by that average length, but that won't be accurate.

Esko
Interesting downvote. No matter what command-line tool you're using, they all DO THE SAME THING anyway, only internally. There's no magic way to figure out the number of lines; they have to be counted by hand. Sure, it could be saved as metadata, but that's a whole other story...
Esko
+1 to make you feel better.
Richie_W
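Esko's estimation idea can be sketched as below. This is an illustrative assumption, not code from the answer: it samples the first `sampleLines` lines to get an average line length, then divides the file size by that average. As Esko says, the result is only an approximation.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public class LineEstimator {
    // Roughly estimate the line count: average the byte length of the
    // first sampleLines lines, then divide the file size by that average.
    public static long estimateLines(String filename, int sampleLines) throws IOException {
        long sampledBytes = 0;
        int sampled = 0;
        BufferedReader reader = new BufferedReader(new FileReader(filename));
        try {
            String line;
            while (sampled < sampleLines && (line = reader.readLine()) != null) {
                sampledBytes += line.length() + 1; // +1 for the newline terminator
                sampled++;
            }
        } finally {
            reader.close();
        }
        if (sampled == 0) return 0;
        double avgLineLength = (double) sampledBytes / sampled;
        return Math.round(new File(filename).length() / avgLineLength);
    }
}
```

The estimate is exact only when lines have uniform length; the more line lengths vary, the further off it will be.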
+12  A: 

This is the fastest version I have found so far, about 6 times faster than readLines. On a 150MB log file this takes 0.35 seconds, versus 2.40 seconds when using readLines(). Just for fun, Linux's wc -l command takes 0.15 seconds.

public int count(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        while ((readChars = is.read(c)) != -1) {
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n')
                    ++count;
            }
        }
        return count;
    } finally {
        is.close();
    }
}
martinus
You were right, David. I thought the JVM would be good enough for this... I have updated the code; this one is faster.
martinus
BufferedInputStream should be doing the buffering for you, so I don't see how using an intermediate byte[] array will make it any faster. You're unlikely to do much better than using readLine() repeatedly anyway (since that will be optimized towards by the API).
wds
I've benchmarked it with and without the BufferedInputStream, and it is faster when using it.
martinus
It's neat, thank you so much.
Mark
You're going to close that InputStream when you're done with it, aren't you?
bendin
If buffering helped, it would be because BufferedInputStream buffers 8K by default. Increase your byte[] to this size or larger and you can drop the BufferedInputStream. e.g. try 1024*1024 bytes.
Peter Lawrey
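Peter Lawrey's suggestion might look like the sketch below: the same counting loop as martinus's answer, but reading into a 1 MB byte[] with no BufferedInputStream. The 1 MB size is his example figure, not a benchmarked optimum.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BigBufferCounter {
    // Count '\n' bytes using a large read buffer; each read() call
    // goes straight to the file stream, with no intermediate buffering.
    public static int count(String filename) throws IOException {
        InputStream is = new FileInputStream(filename);
        try {
            byte[] c = new byte[1024 * 1024];
            int count = 0;
            int readChars;
            while ((readChars = is.read(c)) != -1) {
                for (int i = 0; i < readChars; ++i) {
                    if (c[i] == '\n')
                        ++count;
                }
            }
            return count;
        } finally {
            is.close();
        }
    }
}
```

Like the original, this counts newline characters, so a file without a trailing newline reports one less than the number of lines (the issue raised in the last answer below).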
A: 

If you don't have any index structures, you won't get around reading the complete file. But you can optimize it by avoiding reading it line by line and instead using a regex to match all line terminators.

David Schmitt
Sounds like a neat idea. Has anyone tried it and got a regexp for it?
willcodejavaforfood
I doubt it is such a good idea: it will need to read the whole file at once (martinus avoids this), and regexes are overkill (and slower) for such usage (a simple search for fixed chars).
PhiLho
@will: what about /\n/? @PhiLo: good point.
David Schmitt
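The regex idea could be sketched as follows. Note this is an assumption about what was meant, not code from the answer, and it reads the whole file into memory first, which is exactly PhiLho's objection. The pattern covers all three common terminators rather than David's bare /\n/.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexCounter {
    // Count lines by matching every line terminator in the file contents.
    public static int count(String filename) throws IOException {
        String contents = new String(Files.readAllBytes(Paths.get(filename)));
        Matcher m = Pattern.compile("\r\n|\r|\n").matcher(contents);
        int count = 0;
        while (m.find()) {
            ++count;
        }
        return count;
    }
}
```

Since it counts terminators, it has the same trailing-newline caveat as the byte-counting approach.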
A: 

The answer with the count() method above gave me miscounts if a file didn't have a newline at the end: it failed to count the last line in the file.

This method works better for me:

public int countLines(String filename) throws IOException {
    LineNumberReader reader = new LineNumberReader(new FileReader(filename));
    try {
        while (reader.readLine() != null) {}
        return reader.getLineNumber();
    } finally {
        reader.close();
    }
}
Dave Bergert