Say you have a text file - what's the fastest and/or most memory efficient way to determine the number of lines of text in that file?
Is it simply a matter of scanning through it character by character and looking for newline characters?
I'd read it 32 KB at a time (or more), count the number of \r\n sequences in the memory block, and repeat until done.
Probably not the fastest, but it will be the most versatile...
int lines = 0;

/* if you need to use an encoding other than UTF-8 you may want to try
   new StreamReader("myFile.txt", yourEncoding)
   instead of File.OpenText("myFile.txt") */
using (var fs = File.OpenText("myFile.txt"))
    while (!fs.EndOfStream)
    {
        fs.ReadLine();
        lines++;
    }
... this will probably be faster ...
If you need even more speed you might try a Duff's device and check 10 or 20 bytes before the branch (a rough sketch of that idea follows the block below).
int lines = 0;
var buffer = new byte[32768];
var bufferLen = 1;

using (var fs = File.OpenRead("filename.txt"))
    while (bufferLen > 0)
    {
        bufferLen = fs.Read(buffer, 0, buffer.Length);

        for (int i = 0; i < bufferLen; i++)
            /* this is only known to work for UTF-8/ASCII; other
               encodings may need to search for different end-of-line
               characters */
            if (buffer[i] == 10)
                lines++;
    }
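For illustration, here is a rough sketch of that unrolling idea - not a true Duff's device, just a manually unrolled inner loop that cuts down on loop-condition checks; it still only looks for the single byte '\n' (10), so the same UTF-8/ASCII caveat applies:

int lines = 0;
var buffer = new byte[32768];
var bufferLen = 1;

using (var fs = File.OpenRead("filename.txt"))
    while (bufferLen > 0)
    {
        bufferLen = fs.Read(buffer, 0, buffer.Length);
        int i = 0;

        // process eight bytes per iteration to reduce loop-condition checks
        for (; i + 8 <= bufferLen; i += 8)
        {
            if (buffer[i] == 10) lines++;
            if (buffer[i + 1] == 10) lines++;
            if (buffer[i + 2] == 10) lines++;
            if (buffer[i + 3] == 10) lines++;
            if (buffer[i + 4] == 10) lines++;
            if (buffer[i + 5] == 10) lines++;
            if (buffer[i + 6] == 10) lines++;
            if (buffer[i + 7] == 10) lines++;
        }

        // handle whatever is left at the end of the buffer
        for (; i < bufferLen; i++)
            if (buffer[i] == 10) lines++;
    }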
If it's a fixed-length record format, you can get the size of one record and divide the total file size by that amount to get the number of records. If you're just looking for an estimate, what I've done in the past is read the first x rows (e.g. 200), use those to come up with an average row size, and then divide the total file size by that average to guess the total number of records. This works well if your records are fairly uniform and you don't need an exact count. I've used this on large files: do a quick check of the file size, and if it's over 20 MB, get an estimate rather than reading the entire file.
Other than that, the only 100% accurate way is to go through the file line by line using ReadLine.
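A minimal sketch of that estimation approach (the 200-line sample size and the variable names are just illustrative, and the per-line byte count is approximated from the character count, so it's only a rough guess for multi-byte encodings):

long fileSize = new FileInfo(fileName).Length;
long sampleBytes = 0;
int sampleLines = 0;

using (var reader = new StreamReader(fileName))
{
    string line;
    while (sampleLines < 200 && (line = reader.ReadLine()) != null)
    {
        // approximate bytes per line: characters plus the line terminator
        sampleBytes += line.Length + Environment.NewLine.Length;
        sampleLines++;
    }
}

long estimatedLines = sampleLines == 0
    ? 0
    : fileSize / (sampleBytes / sampleLines);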
Unless you've got a fixed line length (in terms of bytes) you'll definitely need to read the data. Whether you can avoid converting all the data into text or not will depend on the encoding.
Now the most efficient way will be reinier's - counting line endings manually. However, the simplest code would be to use TextReader.ReadLine(). And in fact, the simplest way of doing that would be to use my LineReader class from MiscUtil, which converts a filename (or various other things) into an IEnumerable<string>. You can then just use LINQ:

int lines = new LineReader(filename).Count();

(If you don't want to grab the whole of MiscUtil, you can get just LineReader on its own from this answer.)
Now that will create a lot of garbage which repeatedly reading into the same char array wouldn't - but it won't read more than one line at a time, so while you'll be stressing the GC a bit, it's not going to blow up with large files. It will also require decoding all the data into text - which you may be able to get away without doing for some encodings.
Personally, that's the code I'd use until I found that it caused a bottleneck - it's a lot simpler to get right than doing it manually. Do you absolutely know that in your current situation, code like the above will be the bottleneck?
As ever, don't micro-optimise until you have to... and you can very easily optimise this at a later date without changing your overall design, so postponing it isn't going to do any harm.
EDIT: To convert Matthew's answer to one which will work for any encoding - but which will incur the penalty of decoding all the data, of course - you might end up with something like the code below. I'm assuming that you only care about \n - rather than \r, \n and \r\n, which TextReader normally handles:
public static int CountLines(string file, Encoding encoding)
{
    using (TextReader reader = new StreamReader(file, encoding))
    {
        return CountLines(reader);
    }
}

public static int CountLines(TextReader reader)
{
    char[] buffer = new char[32768];
    int charsRead;
    int count = 0;

    while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        for (int i = 0; i < charsRead; i++)
        {
            if (buffer[i] == '\n')
            {
                count++;
            }
        }
    }
    return count;
}
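For example, to count the lines of a UTF-8 file (the filename here is just illustrative):

int lineCount = CountLines("myFile.txt", Encoding.UTF8);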
The simplest:
int lines = File.ReadAllLines(fileName).Length;
This will of course read all of the file into memory, so it's not memory-efficient at all. The most memory-efficient approach is to read the file as a stream and look for the line-break characters; that will also be the fastest, as it has minimal overhead.
There is no shortcut that you can use. Files are not line-based, so there is no extra information you can exploit; one way or the other you have to read and examine every single byte of the file.
I believe Windows uses two characters to mark the end of the line (13 and 10 - CR and LF - if I recall correctly), so you only need to check every second character against these two.
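A quick sketch of that every-second-character idea - it only counts correctly if every line ends with a full CR LF pair and neither byte ever appears on its own:

// Each two-byte CR LF pair has exactly one byte at an even offset, so
// matching CR or LF at even offsets counts each line ending once.
// (Reading the whole file here is just for brevity.)
byte[] data = File.ReadAllBytes("myFile.txt");
int lines = 0;

for (int i = 0; i < data.Length; i += 2)
    if (data[i] == 0x0D || data[i] == 0x0A)
        lines++;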
Since this is a purely sequential process with no dependencies between locations, consider map/reduce if the data is really huge. In C/C++ you can use OpenMP for parallelism: each thread reads a chunk and counts the CRLFs in that chunk, and in the reduce step they sum their individual counts. Intel Threading Building Blocks provides C++ template-based constructs for parallelism. I agree this is a sledgehammer approach for small files, but from a pure performance perspective it is optimal (divide and conquer).
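For what it's worth, here is a rough C# sketch of that chunked divide-and-conquer idea (my illustration rather than the poster's code; it assumes a seekable file and a single-byte '\n' as in ASCII/UTF-8, and the chunk size is arbitrary):

static int CountLinesParallel(string fileName)
{
    const int chunkSize = 1 << 20; // 1 MB chunks; arbitrary
    long fileSize = new FileInfo(fileName).Length;
    int chunkCount = (int)((fileSize + chunkSize - 1) / chunkSize);
    int total = 0;

    // "map": each chunk is scanned independently;
    // "reduce": the partial counts are summed as each thread finishes.
    Parallel.For(0, chunkCount,
        () => 0,
        (chunk, state, localCount) =>
        {
            var buffer = new byte[chunkSize];
            using (var fs = File.OpenRead(fileName))
            {
                fs.Position = (long)chunk * chunkSize;
                int read = 0, r;
                while (read < chunkSize &&
                       (r = fs.Read(buffer, read, chunkSize - read)) > 0)
                    read += r;

                for (int i = 0; i < read; i++)
                    if (buffer[i] == (byte)'\n')
                        localCount++;
            }
            return localCount;
        },
        localCount => Interlocked.Add(ref total, localCount));

    return total;
}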