Say you have a text file - what's the fastest and/or most memory efficient way to determine the number of lines of text in that file?

Is it simply a matter of scanning through it character by character and looking for newline characters?

+3  A: 

I'd read it 32 KB at a time (or more), count the number of \r\n's in the memory block, and repeat until done.

Toad
Just don't miss the case where the CR is the last byte of one block and the LF is the first byte of the next...
Guffa
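
A minimal sketch of guarding against that split (illustrative only, not code from the thread; it assumes Windows CRLF endings and a single-byte encoding such as ASCII/UTF-8, and the method name is made up):

using System.IO;

static int CountCrLf(string path)
{
    int lines = 0;
    byte prev = 0; // carries the last byte of the previous block
    var buffer = new byte[32768];
    using (var fs = File.OpenRead(path))
    {
        int read;
        while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                // counts a CRLF pair even when the CR ends one
                // block and the LF begins the next
                byte before = (i > 0) ? buffer[i - 1] : prev;
                if (buffer[i] == (byte)'\n' && before == (byte)'\r')
                    lines++;
            }
            prev = buffer[read - 1];
        }
    }
    return lines;
}
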
+10  A: 

Probably not the fastest, but it will be the most versatile...

int lines = 0;
/* if you need to use an encoding other than UTF-8 you may want to try...
   new StreamReader("myFile.txt", yourEncoding)
   ... instead of File.OpenText("myFile.txt")
*/
using (var fs = File.OpenText("myFile.txt"))
    while (!fs.EndOfStream)
    {
        fs.ReadLine();
        lines++;
    }

... this next version will probably be faster ...

If you need even more speed you might try a Duff's device and check 10 or 20 bytes per pass before the branch (a sketch of the unrolled loop follows the code below).

int lines = 0;
var buffer = new byte[32768];
var bufferLen = 1;
using (var fs = File.OpenRead("filename.txt"))
    while (bufferLen > 0)
    {
        bufferLen = fs.Read(buffer, 0, buffer.Length);
        for (int i = 0; i < bufferLen; i++)
            /* this is only known to work for UTF-8/ASCII; other
               encodings may need to search for different end-of-line
               characters */
            if (buffer[i] == 10) // '\n' (LF)
                lines++;
    }
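
As a rough sketch of that unrolling idea (illustrative, not benchmarked), the inner for loop above could test eight bytes per pass so the loop-bound branch runs less often, with a cleanup loop for the tail:

int i = 0;
for (; i + 8 <= bufferLen; i += 8)
{
    // same LF-counting logic and the same encoding caveat as above
    if (buffer[i] == 10) lines++;
    if (buffer[i + 1] == 10) lines++;
    if (buffer[i + 2] == 10) lines++;
    if (buffer[i + 3] == 10) lines++;
    if (buffer[i + 4] == 10) lines++;
    if (buffer[i + 5] == 10) lines++;
    if (buffer[i + 6] == 10) lines++;
    if (buffer[i + 7] == 10) lines++;
}
for (; i < bufferLen; i++)  // leftover bytes
    if (buffer[i] == 10) lines++;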
Matthew Whited
Uh... when people say "read 32 KB at a time" they mean 32768, not 32000. Disk access is fastest when reading in multiples of the cluster size (or equivalent), which is going to be a power of 2.
darron
I understand that; I was typing off the top of my head, but I will update it for you. -Thanks
Matthew Whited
If you really want to optimise it you could check for the real block size of the device. If you are streaming over a network it may be much smaller, and a massive SAN could be larger... but that's probably overkill.
Matthew Whited
Note that this only works for some encodings...
Jon Skeet
True... but I'm sure people can make the changes as needed.
Matthew Whited
The problem is that for many encodings, doing it at a binary level is going to be really hard, particularly if it's a variable width encoding. This is certainly more *efficient* than opening a reader and reading into a char array, but it's decidedly more limited. Interestingly though, it happens to work with UTF-8 despite that being variable width (assuming the text file is valid, of course).
Jon Skeet
I think the point Matthew is making is that you can load things into a buffer to save memory, not the method of counting the newlines.
Robert
@SmilingRob: But by providing an example which doesn't take the encoding into account, I think his answer could lead to problems. You can still load things into a buffer with a `TextReader` and get that to deal with the encoding issue.
Jon Skeet
@Jon: I posted two examples. The first can be changed to use 'new StreamReader("myFile.txt")' instead of 'File.OpenText("myFile.txt")' and will allow you to specify the encoding. Again, I was going for a quick sample, not a novel on how to program.
Matthew Whited
@Matthew: File.OpenText is fine, but I just wanted to point out the relative inflexibility of your second approach, that's all. You've made a big assumption about the encoding of the file; an assumption which could require a significant rewrite in some cases - do you think I'm wrong to point that out?
Jon Skeet
(Oops, hadn't spotted that File.OpenText can't take an encoding at all. I'd assumed there was an overload. What a strange oversight.)
Jon Skeet
There is no problem pointing it out, but I think most people would realize the limits pretty quickly. At the same time, it will work for most files by looking for the LF. This will allow it to work on UTF-8, UTF-16, and ASCII, and it will work with both DOS and Unix text files. Again, it was only meant as a sample, and there is a reason I left the first version in the answer. I will add a note to clarify for the future that it does not support all encodings. -Thanks
Matthew Whited
Honestly I thought it did at first, but I went back to check and saw I needed StreamReader. I will post both notes.
Matthew Whited
It won't work for UTF-16... it would count U+0A00 to U+0AFF as newlines incorrectly.
Jon Skeet
Good point. That would be a good case for a look-ahead check.
Matthew Whited
In fact, it's worse than I first thought... it would count U+010A as a newline, and U+020A as a newline etc. Basically in the BMP where you should count just U+000A as a newline, you'd count 511 different characters ... and you'd count U+0A0A twice.
Jon Skeet
You could fix that pretty quickly, though: by doing i += 2 and checking to make sure that buffer[i-1] is always == 0 you should be pretty safe. But then you need to worry about endianness and any other EOL characters (which is why this was meant as a quick sample).
Matthew Whited
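
A sketch of that look-ahead fix (illustrative only; it assumes UTF-16LE, assumes each Read returns an even number of bytes, and replaces the inner loop of the earlier byte-counting example):

// UTF-16LE: '\n' is the byte pair 0x0A 0x00 at an even offset,
// so checking both bytes avoids the false positives described above
for (int i = 0; i + 1 < bufferLen; i += 2)
    if (buffer[i] == 0x0A && buffer[i + 1] == 0x00)
        lines++;
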
I think my wider point is that dealing with text data in binary form is very rarely straightforward, and it's easy to miss subtleties. I prefer to use something like TextReader whenever I'm reading text data... it just deals with a lot of the complexity for you :)
Jon Skeet
Very true, which is why the File.OpenText version was my first thought right off the line. Well, really it was a LineReader I wrote a while back for a different answer, but I figured I should go with framework code for an example. This LineReader probably works similar to yours: http://hackersbasement.com/post/2009/06/10/Code-Golf-Extended82303b.aspx
Matthew Whited
My LineReader is slightly simpler as it just uses an iterator block to handle closing the reader etc. However, it's lazy about when it opens the reader - if you create an instance of my LineReader with a filename, it remembers a function which it will call to open the file when GetEnumerator is called.
Jon Skeet
It's a shame my down-voter pulled his comment. I love when people attack me for "pre-optimizing" stuff... sorry if my head works closer to the metal, guys; I went to school for electronics, not computers.
Matthew Whited
One issue I was looking at when I made my line reader is a good way to allow you to reset the reader. If the Stream it is created from is read-only, a second call to the enumerator could really confuse things.
Matthew Whited
Calling GetEnumerator on my implementation will open the file a second time (or do whatever it needs to do to get a reader).
Jon Skeet
That would be possible for files, but what if it's a NetworkStream, compression stream, or crypto stream? And maybe they only want to start it halfway into a stream.
Matthew Whited
Well, the general contract for GetEnumerator() suggests that it should return an independent object. That could well be fine for a compressed/crypto stream... likewise if you want to start halfway through a stream, you provide a function which opens the stream and does a seek. It also means you don't need to make the IEnumerable disposable - only the IEnumerator. Personally I prefer that as a design, but it's a matter of personal taste :)
Jon Skeet
True, but seeing that I only developed it for a proof of concept, I just did it in a quick and fairly reusable way. I just don't like handling the dispose of something that was passed into me. If I opened the file stream myself I would close it.
Matthew Whited
+5  A: 

If it's a fixed-length record format, you can get the size of one record and then divide the total file size by that amount to get the number of records. If you're just looking for an estimate, what I've done in the past is read the first x rows (e.g. 200) and use that to come up with an average row size, which you can then use to guess the total number of records (divide total file size by average row size). This works well if your records are fairly uniform and you don't need an exact count. I've used this on large files (do a quick check to get the file size; if it's over 20 MB, get an estimate rather than reading the entire file).
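
A sketch of that estimation approach (illustrative only; the helper name and the +2 CRLF allowance are assumptions, not the poster's code):

using System.IO;

static long EstimateLineCount(string path, int sampleLines)
{
    long fileSize = new FileInfo(path).Length;
    long sampledBytes = 0;
    int sampled = 0;
    using (var reader = new StreamReader(path))
    {
        string line;
        while (sampled < sampleLines && (line = reader.ReadLine()) != null)
        {
            // +2 approximates a trailing CRLF; use +1 for LF-only files
            sampledBytes += reader.CurrentEncoding.GetByteCount(line) + 2;
            sampled++;
        }
    }
    if (sampled == 0)
        return 0;
    double averageRowSize = (double)sampledBytes / sampled;
    return (long)(fileSize / averageRowSize);
}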

Other than that, the only 100% accurate way is to go through the file line by line using ReadLine.

TLiebe
They aren't fixed records, but that optional estimate technique is cool.
frou
+9  A: 

Unless you've got a fixed line length (in terms of bytes) you'll definitely need to read the data. Whether you can avoid converting all the data into text or not will depend on the encoding.

Now the most efficient way will be reinier's - counting line endings manually. However, the simplest code would be to use TextReader.ReadLine(). And in fact, the simplest way of doing that would be to use my LineReader class from MiscUtil, which converts a filename (or various other things) into an IEnumerable<string>. You can then just use LINQ:

int lines = new LineReader(filename).Count();

(If you don't want to grab the whole of MiscUtil, you can get just LineReader on its own from this answer.)
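
For a feel of the approach, a minimal sketch of such a class (illustrative only, not the actual MiscUtil source): it remembers a function that opens the reader, so each GetEnumerator call gets an independent reader, and the iterator block's using guarantees disposal:

using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;

public sealed class LineReader : IEnumerable<string>
{
    private readonly Func<TextReader> dataSource;

    // lazy: nothing is opened until GetEnumerator is called
    public LineReader(string filename)
        : this(() => File.OpenText(filename))
    {
    }

    public LineReader(Func<TextReader> dataSource)
    {
        this.dataSource = dataSource;
    }

    public IEnumerator<string> GetEnumerator()
    {
        // the iterator block disposes the reader even if the
        // caller abandons the enumeration early
        using (TextReader reader = dataSource())
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}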

Now that will create a lot of garbage which repeatedly reading into the same char array wouldn't - but it won't read more than one line at a time, so while you'll be stressing the GC a bit, it's not going to blow up with large files. It will also require decoding all the data into text - which you may be able to get away without doing for some encodings.

Personally, that's the code I'd use until I found that it caused a bottleneck - it's a lot simpler to get right than doing it manually. Do you absolutely know that in your current situation, code like the above will be the bottleneck?

As ever, don't micro-optimise until you have to... and you can very easily optimise this at a later date without changing your overall design, so postponing it isn't going to do any harm.

EDIT: To convert Matthew's answer to one which will work for any encoding - but which will incur the penalty of decoding all the data, of course, you might end up with something like the code below. I'm assuming that you only care about \n - rather than \r, \n and \r\n which TextReader normally handles:

public static int CountLines(string file, Encoding encoding)
{
    using (TextReader reader = new StreamReader(file, encoding))
    {
        return CountLines(reader);
    }
}

public static int CountLines(TextReader reader)
{
    char[] buffer = new char[32768];

    int charsRead;
    int count = 0;

    while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        for (int i = 0; i < charsRead; i++)
        {
            if (buffer[i] == '\n')
            {
                count++;
            }
        }
    }
    return count;
}
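
Usage, assuming the file name from earlier in the thread and System.Text imported for Encoding:

int lines = CountLines("myFile.txt", Encoding.UTF8);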
Jon Skeet
I think your solution will be slow. You'd be unnecessarily parsing and converting all the text to the proper internal encoding. A lot of overhead when all we want is to count the eof's.
Toad
@reinier: I think you mean EOL, not EOF... pretty sure you only need one EOF :)
Matthew Whited
It won't be the fastest possible way, no. It will easily work with any encoding you throw at it, however, and it's very simple to get right. Those *usually* matter more than getting the absolute fastest code in my experience. I would only start micro-optimising after I'd got good evidence that the simplest way was too slow.
Jon Skeet
I marked Matthew as the answer, but your answer and comments have been helpful, and I take them on board, Jon.
frou
+2  A: 

The simplest:

int lines = File.ReadAllLines(fileName).Length;

This will of course read the whole file into memory, so it's not memory efficient at all. The most memory-efficient approach is to read the file as a stream and look for the line break characters. That will also be the fastest, as it has a minimum of overhead.

There is no shortcut that you can use. Files are not line based, so there is no extra information to draw on; one way or the other you have to read and examine every single byte of the file.

Guffa
Why the downvote? If you don't explain why, it's rather pointless.
Guffa
Maybe because the question is "what's the fastest and/or most memory efficient way" and you specified the "simplest"? Either way, you're not below 0 right now, so your comment makes little sense.
bzlm
+1  A: 

I believe Windows uses two characters to mark the end of the line (0DH and 0AH, i.e. CR and LF, if I recall correctly), so you only need to check every second character against these two.
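
A sketch of that idea (illustrative only; it assumes CRLF endings throughout and ignores pairs split across buffer boundaries). Because the pair is two bytes, one of its bytes always lands on an even index, so a stride-2 scan plus a neighbour check finds each pair exactly once:

int count = 0;
for (int i = 0; i < bufferLen; i += 2)
{
    byte b = buffer[i];
    if (b == 0x0D)      // CR on the even index: the pair starts here
    {
        if (i + 1 < bufferLen && buffer[i + 1] == 0x0A) count++;
    }
    else if (b == 0x0A) // LF on the even index: the CR sits just before it
    {
        if (i > 0 && buffer[i - 1] == 0x0D) count++;
    }
}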

Emilio M Bumachar
You cannot always assume a file is coming from a Windows editor.
Aaron
+1  A: 

Since this is a purely sequential process with no dependencies between locations, consider map/reduce if the data is really huge. In C/C++, you can use OpenMP for parallelism: each thread reads a chunk and counts the CRLFs in that chunk; finally, in the reduce step, they sum their individual counts. Intel Threading Building Blocks provides C++ template-based constructs for parallelism. I agree this is a sledgehammer approach for small files, but from a pure performance perspective it is optimal (divide and conquer).
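
A C# sketch of that divide-and-conquer idea (the answer itself suggests OpenMP/TBB in C/C++; this version assumes .NET 4's Parallel.For, newer than the rest of the thread, and inherits the LF-only encoding caveat from the earlier answers):

using System;
using System.IO;
using System.Threading.Tasks;

static long CountLinesParallel(string path, int chunkCount)
{
    long fileSize = new FileInfo(path).Length;
    long chunkSize = (fileSize + chunkCount - 1) / chunkCount;
    long[] counts = new long[chunkCount]; // one slot per chunk: no sharing

    Parallel.For(0, chunkCount, chunk =>
    {
        long start = chunk * chunkSize;
        long remaining = Math.Min(chunkSize, fileSize - start);
        if (remaining <= 0) return;

        // each task opens its own handle and seeks to its chunk
        using (var fs = new FileStream(path, FileMode.Open,
                                       FileAccess.Read, FileShare.Read))
        {
            fs.Seek(start, SeekOrigin.Begin);
            var buffer = new byte[32768];
            long done = 0;
            while (done < remaining)
            {
                int toRead = (int)Math.Min(buffer.Length, remaining - done);
                int read = fs.Read(buffer, 0, toRead);
                if (read == 0) break;
                for (int i = 0; i < read; i++)
                    if (buffer[i] == 10) counts[chunk]++;
                done += read;
            }
        }
    });

    long total = 0; // the "reduce" step
    foreach (long c in counts) total += c;
    return total;
}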

hackworks