What is the most efficient way to display the last 10 lines of a very large text file (this particular file is over 10GB)? I was thinking of just writing a simple C# app, but I'm not sure how to do this effectively.
Thanks!
I'd likely just open it as a binary stream, seek to the end, then back up looking for line breaks. Back up 10 (or 11 depending on that last line) to find your 10 lines, then just read to the end and use Encoding.GetString on what you read to get it into a string format. Split as desired.
You should be able to use FileStream.Seek() to move to the end of the file, then work your way backwards, looking for \n until you have enough lines.
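The backwards scan described in the two answers above could look roughly like the sketch below. It is only an illustration under stated assumptions: lines end in '\n', the encoding is ASCII-compatible (ASCII, UTF-8, Latin-1), and the names TailSketch, Tail, ChunkSize and ReadFully are made up for this example. It reads the file in chunks from the end rather than byte by byte, and does not count a newline that merely terminates the final line.

```csharp
using System;
using System.IO;
using System.Text;

public static class TailSketch
{
    // Returns the text of the last `lineCount` lines of the file at `path`.
    // Assumption: byte 0x0A always means "line break" (true for ASCII, UTF-8, Latin-1).
    public static string Tail(string path, int lineCount, Encoding encoding)
    {
        if (lineCount <= 0) return string.Empty;
        const int ChunkSize = 64 * 1024;
        byte[] chunk = new byte[ChunkSize];
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            long length = fs.Length;
            long start = 0;    // byte offset where the tail begins (0 = whole file)
            long pos = length; // everything at or after pos has already been scanned
            int found = 0;
            bool done = false;
            while (pos > 0 && !done)
            {
                int toRead = (int)Math.Min(ChunkSize, pos);
                pos -= toRead;
                fs.Seek(pos, SeekOrigin.Begin);
                ReadFully(fs, chunk, toRead);
                for (int i = toRead - 1; i >= 0; i--)
                {
                    if (chunk[i] != (byte)'\n') continue;
                    long abs = pos + i;
                    if (abs == length - 1) continue; // newline that ends the last line
                    if (++found == lineCount)
                    {
                        start = abs + 1; // tail begins right after this newline
                        done = true;
                        break;
                    }
                }
            }
            // If the file has fewer lines than requested, start is 0 and the
            // whole file is returned (only sensible when that file is small).
            fs.Seek(start, SeekOrigin.Begin);
            byte[] tail = new byte[length - start];
            ReadFully(fs, tail, tail.Length);
            return encoding.GetString(tail);
        }
    }

    // Stream.Read may return fewer bytes than asked for, so loop until done.
    private static void ReadFully(Stream s, byte[] buffer, int count)
    {
        int off = 0;
        while (off < count)
        {
            int n = s.Read(buffer, off, count - off);
            if (n == 0) throw new EndOfStreamException();
            off += n;
        }
    }
}
```

Calling TailSketch.Tail(path, 10, Encoding.UTF8) only touches the last chunk or two of the file, so the 10GB in front of it is never read.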
Tail? Tail is a Unix command that will display the last few lines of a file. There is a Windows version in the Windows Server 2003 Resource Kit.
That is what the Unix tail command does. See http://en.wikipedia.org/wiki/Tail_(Unix)
There are lots of open-source implementations on the internet, and here is one for Win32: Tail for Win32
As the others have suggested, you can go to the end of the file and read backwards, effectively. However, it's slightly tricky - particularly because if you have a variable-length encoding (such as UTF-8) you need to be cunning about making sure you get "whole" characters.
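To illustrate the "whole characters" point for UTF-8 specifically: continuation bytes always have the bit pattern 10xxxxxx, so an arbitrary byte offset can be moved back to the start of a character by skipping over them. The sketch below assumes the stream really is UTF-8; Utf8Boundary and AlignToCharStart are hypothetical names for this example. (When you are only hunting for '\n', UTF-8 is forgiving, because byte 0x0A never occurs inside a multi-byte sequence.)

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf8Boundary
{
    // Moves `offset` backwards until it points at the first byte of a UTF-8
    // character. Continuation bytes look like 10xxxxxx; ASCII and lead bytes do not.
    public static long AlignToCharStart(Stream stream, long offset)
    {
        while (offset > 0 && offset < stream.Length)
        {
            stream.Seek(offset, SeekOrigin.Begin);
            int b = stream.ReadByte();
            if (b < 0 || (b & 0xC0) != 0x80) break; // not a continuation byte
            offset--;
        }
        return offset;
    }
}
```

Decoding from the aligned offset onwards then never starts in the middle of a character. Fixed-width encodings such as UTF-16 need a different check (keeping the offset a multiple of the code-unit size, and watching for surrogate pairs).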
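If you open the file with FileMode.Append it will seek to the end of the file for you, but that mode can only be combined with FileAccess.Write: reading throws NotSupportedException and seeking to a position before the end throws IOException. So open the file with FileMode.Open instead and call Seek with SeekOrigin.End, then seek back the number of bytes you want and read them. The seek itself is cheap even on a massive file; only the bytes you actually read cost I/O.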
One useful property is FileInfo.Length. It gives the size of a file in bytes.
What structure is your file? Are you sure the last 10 lines will be near the end of the file? If you have a file with 12 lines of text and 10GB of 0s, then looking at the end won't really be that fast. Then again, you might have to look through the whole file.
If you are sure that the file contains numerous short strings each on a new line, seek to the end, then check back until you've counted 11 end of lines. Then you can read forward for the next 10 lines.
Seek to the end of the file, then seek backwards until you find ten newlines, and then read forward to the end, taking the various encodings into consideration. Be sure to handle cases where the number of lines in the file is less than ten. Below is an implementation (in C# as you tagged this), generalized to find the last numberOfTokens tokens in the file located at path, encoded in encoding, where the token separator is represented by tokenSeparator; the result is returned as a string (this could be improved by returning an IEnumerable<string> that enumerates the tokens).
public static string ReadEndTokens(string path, Int64 numberOfTokens, Encoding encoding, string tokenSeparator) {
    // Bytes per code unit: 1 for ASCII/UTF-8, 2 for UTF-16, 4 for UTF-32.
    int sizeOfChar = encoding.GetByteCount("\n");
    byte[] buffer = encoding.GetBytes(tokenSeparator);
    using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read)) {
        Int64 tokenCount = 0;
        // Walk backwards from the end, one code unit at a time, down to the start of the file.
        for (Int64 position = buffer.Length; position <= fs.Length; position += sizeOfChar) {
            fs.Seek(-position, SeekOrigin.End);
            fs.Read(buffer, 0, buffer.Length);
            // A separator at the very end of the file terminates the last token; don't count it.
            if (encoding.GetString(buffer) == tokenSeparator && position != buffer.Length) {
                tokenCount++;
                if (tokenCount == numberOfTokens) {
                    // The stream now sits just past the separator; read from here to the end.
                    byte[] returnBuffer = new byte[fs.Length - fs.Position];
                    fs.Read(returnBuffer, 0, returnBuffer.Length);
                    return encoding.GetString(returnBuffer);
                }
            }
        }
        // handle case where number of tokens in file is less than numberOfTokens
        // (this reads the whole file into memory, so it only suits small files)
        fs.Seek(0, SeekOrigin.Begin);
        buffer = new byte[fs.Length];
        fs.Read(buffer, 0, buffer.Length);
        return encoding.GetString(buffer);
    }
}
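I'm not sure how efficient it will be (the pipeline has to pass every line of the file through Select-Object), but in Windows PowerShell getting the last ten lines of a file is as easy as
Get-Content file.txt | Select-Object -Last 10
In PowerShell 3.0 and later, Get-Content file.txt -Tail 10 does the same thing by reading from the end of the file.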
I think the other posters have all shown that there is no real shortcut.
You can either use a tool such as tail (or PowerShell) or you can write some dumb code that seeks to the end of the file and then looks back for n newlines.
There are plenty of implementations of tail out there on the web - take a look at the source code to see how they do it. Tail is pretty efficient (even on very very large files) and so they must have got it right when they wrote it!
Open the file and start reading lines. After you've read 10 lines open another pointer, starting at the front of the file, so the second pointer lags the first by 10 lines. Keep reading, moving the two pointers in unison, until the first reaches the end of the file. Then use the second pointer to read the result. It works with any size file including empty and shorter than the tail length. And it's easy to adjust for any length of tail. The drawback, of course, is that you end up reading the entire file and that may be exactly what you're trying to avoid.
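A minimal sketch of the two-pointer idea above, using two StreamReader instances on the same file. LaggingTail and LastLines are names invented for this example. As the answer says, it reads the entire file (twice, in fact), so it is simple rather than fast.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LaggingTail
{
    public static List<string> LastLines(string path, int count)
    {
        var result = new List<string>();
        using (var leader = new StreamReader(path))
        using (var follower = new StreamReader(path))
        {
            // Let the leader get `count` lines ahead (or hit the end of a short file).
            for (int i = 0; i < count; i++)
            {
                if (leader.ReadLine() == null) break;
            }
            // Advance both readers in step until the leader reaches the end.
            while (leader.ReadLine() != null)
            {
                follower.ReadLine();
            }
            // Whatever the follower has not consumed yet is the tail.
            string line;
            while ((line = follower.ReadLine()) != null)
            {
                result.Add(line);
            }
        }
        return result;
    }
}
```

For an empty file, or one shorter than count, the leader hits the end immediately and the follower simply returns every line there is.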
Why not use File.ReadAllLines, which returns a string[]?
Then you can get the last 10 lines (the last 10 members of the array), which would be a trivial task.
This approach doesn't take any encoding issues into account, and I'm not sure about its exact efficiency (time taken to complete the method, etc.). Note that it loads the entire file into memory, which is a real problem for a 10GB file.
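If reading every line is acceptable but holding them all in memory is not, one variation (a different technique from File.ReadAllLines, swapped in here on purpose) is to stream the lines with File.ReadLines and keep only the most recent ones in a queue. StreamingTail and LastLines are hypothetical names; File.ReadLines requires .NET 4.0 or later. This is a sketch, and it still reads the whole file from the start.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class StreamingTail
{
    public static string[] LastLines(string path, int count)
    {
        if (count <= 0) return new string[0];
        // Sliding window over the file: never holds more than `count` lines.
        var window = new Queue<string>(count);
        foreach (string line in File.ReadLines(path))
        {
            if (window.Count == count) window.Dequeue(); // drop the oldest line
            window.Enqueue(line);
        }
        return window.ToArray(); // oldest to newest, i.e. file order
    }
}
```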
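I think the following code will solve the problem, with subtle changes regarding encoding
// Scan the raw FileStream: a StreamReader buffers ahead, so moving
// BaseStream.Position underneath it does not control what Read() returns.
using (FileStream fs = new FileStream(@"c:\test.txt", FileMode.Open, FileAccess.Read))
{
    long position = fs.Length;
    int count = 0;
    // Step back one byte at a time until 10 newlines are found or the start is reached.
    while (position > 0 && count < 10)
    {
        fs.Position = --position;
        // Ignore a newline that is the very last byte of the file.
        if (fs.ReadByte() == '\n' && position != fs.Length - 1)
        {
            ++count;
        }
    }
    // If we stopped on the 10th newline, the tail begins right after it.
    fs.Position = (count == 10) ? position + 1 : 0;
    StreamReader reader = new StreamReader(fs, Encoding.ASCII);
    string str = reader.ReadToEnd();
    string[] arr = str.Split('\n');
    reader.Close();
}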
If you have a file that has an even format per line (such as the output of a DAQ system), you can just use a StreamReader to get the length of the file, then read one of the lines with ReadLine().
Divide the total length by the length of that line. Now you have a long that approximates the number of lines in the file.
The key is to call ReadLine() once after seeking, before you collect the data for your array or whatever. This ensures that you start at the beginning of a new line and don't pick up leftover data from the previous one.
// GetReadFile and Tokenize are the poster's own members and are not shown here.
StreamReader leader = new StreamReader(GetReadFile);
StreamReader follower = new StreamReader(GetReadFile);
// Read a few lines so that tmper holds a representative full-length line
// (this assumes the file has well over 12 lines).
int count = 0;
string tmper = null;
while (count <= 12)
{
    tmper = leader.ReadLine();
    count++;
}
leader.Close();
long total = follower.BaseStream.Length; // total length of the file in bytes
long step = tmper.Length; // length of one line in characters (line terminator not included)
long size = total / step; // divide to get the approximate number of lines
long go = step * (size - 12); // approximate byte offset about 12 lines before the end
follower.BaseStream.Position = go; // jump there before the reader has buffered anything
string led = null;
string[] lead = null;
List<string[]> samples = new List<string[]>();
follower.ReadLine(); // discard the partial line we landed in
while (!follower.EndOfStream)
{
    led = follower.ReadLine();
    lead = Tokenize(led);
    samples.Add(lead);
}
follower.Close();
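When the line length really is constant, the offset arithmetic above can be made exact by counting bytes instead of characters. The sketch below assumes every line occupies exactly recordBytes bytes including its terminator, and that the file length is a whole multiple of that; FixedWidthTail, LastRecords and recordBytes are names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class FixedWidthTail
{
    // recordBytes: size of one line in bytes, INCLUDING its line terminator.
    public static string[] LastRecords(string path, int recordBytes, int count, Encoding encoding)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            long records = fs.Length / recordBytes;  // number of complete lines
            long take = Math.Min(count, records);    // don't go past the start
            // Jump straight to the first wanted line; nothing before it is read.
            fs.Seek((records - take) * recordBytes, SeekOrigin.Begin);
            var lines = new List<string>();
            using (var reader = new StreamReader(fs, encoding))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    lines.Add(line);
                }
            }
            return lines.ToArray();
        }
    }
}
```

Because the seek lands exactly on a line boundary, there is no partial line to throw away, unlike the approximate version above.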
Well duh, obviously you need a bit of code like this...
for( line = 0 to 10000000000; line++ ){
t = file.readline( checkEachCharForSpecialEncoding = true )
if( line > 9999999990 ) output t;
}
problem solved. take a quick holiday and wait for your answer.