What is the most efficient way to display the last 10 lines of a very large text file (this particular file is over 10GB)? I was thinking of just writing a simple C# app, but I'm not sure how to do this effectively.

Thanks!

+16  A: 

I'd likely just open it as a binary stream, seek to the end, then back up looking for line breaks. Back up 10 (or 11 depending on that last line) to find your 10 lines, then just read to the end and use Encoding.GetString on what you read to get it into a string format. Split as desired.
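
One way the seek-and-back-up idea might look in C# is sketched below. It is only an illustration under some assumptions: lines end in '\n', the encoding is ASCII or UTF-8 (so the byte 0x0A never appears inside another character), and the names TailSketch and ReadLastLines and the 64 KB chunk size are made up for this example. Scanning in chunks rather than doing one seek per byte keeps the number of I/O calls small.

```csharp
using System;
using System.IO;
using System.Text;

public static class TailSketch
{
    // Returns the last `lineCount` lines of the file as one string.
    // Assumes lineCount > 0, '\n' line endings, and ASCII or UTF-8 content.
    public static string ReadLastLines(string path, int lineCount, Encoding encoding)
    {
        const int ChunkSize = 64 * 1024;
        byte[] chunk = new byte[ChunkSize];

        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            long length = fs.Length;
            long start = 0;          // byte offset where the tail begins
            long chunkEnd = length;  // exclusive end of the next chunk to scan
            int newlines = 0;
            bool found = false;

            while (chunkEnd > 0 && !found)
            {
                int toRead = (int)Math.Min(ChunkSize, chunkEnd);
                long chunkStart = chunkEnd - toRead;
                fs.Seek(chunkStart, SeekOrigin.Begin);

                // Read may return fewer bytes than requested, so loop until full.
                int filled = 0;
                while (filled < toRead)
                {
                    int n = fs.Read(chunk, filled, toRead - filled);
                    if (n == 0) throw new EndOfStreamException();
                    filled += n;
                }

                for (int i = toRead - 1; i >= 0; i--)
                {
                    if (chunk[i] != (byte)'\n') continue;
                    // A newline as the very last byte only terminates the final line.
                    if (chunkStart + i == length - 1) continue;
                    if (++newlines == lineCount)
                    {
                        start = chunkStart + i + 1;
                        found = true;
                        break;
                    }
                }
                chunkEnd = chunkStart;
            }

            // If fewer lines were found, start stays 0 and the whole file is returned.
            fs.Seek(start, SeekOrigin.Begin);
            using (StreamReader reader = new StreamReader(fs, encoding))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

Calling TailSketch.ReadLastLines(path, 10, Encoding.UTF8) should then return the last ten lines, or the whole file if it has fewer than ten.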

ctacke
+5  A: 

You should be able to use FileStream.Seek() to move to the end of the file, then work your way backwards, looking for \n until you have enough lines.

Lolindrath
+15  A: 

Tail? Tail is a Unix command that will display the last few lines of a file. There is a Windows version in the Windows Server 2003 Resource Kit.

w4g3n3r
His tags indicate he's after a C# solution
ctacke
I noticed that. I just thought I'd throw it out there anyway.
w4g3n3r
tip : see tail version in C# at https://tail.svn.codeplex.com/svn/
lsalamon
+3  A: 

That is what the Unix tail command does. See http://en.wikipedia.org/wiki/Tail_(Unix)

There are lots of open source implementations on the internet, and here is one for Win32: Tail for Win32

zendar
+1  A: 

You could use the Windows version of the tail command and either redirect its output to a text file with the > symbol or view it on the screen, depending on what your needs are.

Jared
+9  A: 

As the others have suggested, you can go to the end of the file and read backwards, effectively. However, it's slightly tricky - particularly because if you have a variable-length encoding (such as UTF-8) you need to be cunning about making sure you get "whole" characters.
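
To illustrate the "whole characters" point for UTF-8 specifically: continuation bytes always have the bit pattern 10xxxxxx, so an arbitrary byte offset can be moved back to the start of a character by skipping over them. The sketch below assumes well-formed UTF-8 and a seekable stream; the names Utf8Boundary, IsContinuationByte and AlignToCharacterStart are invented for this example.

```csharp
using System;
using System.IO;

public static class Utf8Boundary
{
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    public static bool IsContinuationByte(byte b)
    {
        return (b & 0xC0) == 0x80;
    }

    // Moves `offset` backwards until it points at the first byte of a character.
    // Assumes the stream is seekable and holds well-formed UTF-8.
    public static long AlignToCharacterStart(Stream stream, long offset)
    {
        while (offset > 0)
        {
            stream.Seek(offset, SeekOrigin.Begin);
            int b = stream.ReadByte();
            // Stop at end of stream or at any byte that is not a continuation byte.
            if (b < 0 || !IsContinuationByte((byte)b)) break;
            offset--;
        }
        return offset;
    }
}
```

Since a UTF-8 character is at most four bytes long, the loop steps back at most three times on valid input. Fixed-width encodings such as UTF-16 need a different check (alignment to the code unit size, plus surrogate pairs).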

Jon Skeet
+1  A: 

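Opening the file with FileMode.Append positions the stream at the end of the file for you, but that mode only works with FileAccess.Write and does not allow reading or seeking back before the end. For reading, open the file with FileMode.Open and call Seek with SeekOrigin.End instead. Then you could seek back the number of bytes you want and read them. It might not be fast though regardless of what you do since that's a pretty massive file.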

Steven Behnke
+1  A: 

One useful method is FileInfo.Length. It gives the size of a file in bytes.

What structure is your file? Are you sure the last 10 lines will be near the end of the file? If you have a file with 12 lines of text and 10GB of 0s, then looking at the end won't really be that fast. Then again, you might have to look through the whole file.

If you are sure that the file contains numerous short strings each on a new line, seek to the end, then check back until you've counted 11 end of lines. Then you can read forward for the next 10 lines.

biozinc
+26  A: 

Seek to the end of the file, then step backwards until you find ten newlines, and then read forward to the end, taking the encoding into account. Be sure to handle cases where the number of lines in the file is less than ten. Below is an implementation (in C# as you tagged this), generalized to find the last numberOfTokens tokens in the file located at path, encoded in encoding, where the token separator is tokenSeparator. The result is returned as a string (this could be improved by returning an IEnumerable<string> that enumerates the tokens).

// requires: using System; using System.IO; using System.Text;
public static string ReadEndTokens(string path, Int64 numberOfTokens, Encoding encoding, string tokenSeparator) {

    // assumes a fixed-width encoding where every character takes sizeOfChar bytes
    int sizeOfChar = encoding.GetByteCount("\n");
    byte[] buffer = encoding.GetBytes(tokenSeparator);


    using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read)) {
        Int64 tokenCount = 0;

        // position is a byte offset measured back from the end of the file;
        // start far enough back that a whole separator fits in the buffer
        for (Int64 position = buffer.Length; position <= fs.Length; position += sizeOfChar) {
            fs.Seek(-position, SeekOrigin.End);
            fs.Read(buffer, 0, buffer.Length);

            if (encoding.GetString(buffer) == tokenSeparator) {
                tokenCount++;
                if (tokenCount == numberOfTokens) {
                    // the stream is now just past the separator, so read the rest
                    byte[] returnBuffer = new byte[fs.Length - fs.Position];
                    fs.Read(returnBuffer, 0, returnBuffer.Length);
                    return encoding.GetString(returnBuffer);
                }
            }
        }

        // handle case where number of tokens in file is less than numberOfTokens
        fs.Seek(0, SeekOrigin.Begin);
        buffer = new byte[fs.Length];
        fs.Read(buffer, 0, buffer.Length);
        return encoding.GetString(buffer);
    }
}
Jason
That assumes an encoding where the size of the character is always the same. It could get tricky in other encodings.
Jon Skeet
And, as Skeet informed me once, the Read method is not guaranteed to read the requested number of bytes. You have to check the return value to determine if you're done reading...
Will
@Jon: Variable-length character encoding. Oh joy.
Jason
@Will: There are several places where error checking should be added to the code. Thank you, though, for reminding me of one of the nasty facts about Stream.Read.
Jason
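
For the Stream.Read point raised above, a minimal sketch of a helper that keeps reading until the requested number of bytes has arrived might look like this. StreamHelpers and ReadFully are hypothetical names, not part of the framework.

```csharp
using System;
using System.IO;

public static class StreamHelpers
{
    // Reads exactly `count` bytes into `buffer` starting at `offset`,
    // looping because a single Read call may return fewer bytes than asked for.
    public static void ReadFully(Stream stream, byte[] buffer, int offset, int count)
    {
        while (count > 0)
        {
            int read = stream.Read(buffer, offset, count);
            if (read == 0)
            {
                // Read returns 0 only when the end of the stream has been reached.
                throw new EndOfStreamException("Stream ended before the requested bytes were read.");
            }
            offset += read;
            count -= read;
        }
    }
}
```

On .NET 7 and later, the built-in Stream.ReadExactly method covers the same need, so a helper like this is mainly useful on older frameworks.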
I've noticed this procedure is quite slow when executed on a ~4MB file. Any suggested improvements? Or other C# examples on tailing files?
GONeale
+5  A: 

I'm not sure how efficient it will be, but in Windows PowerShell getting the last ten lines of a file is as easy as

Get-Content file.txt | Select-Object -last 10
Eric Ness
This method is already quite sluggish with ~20 MB files.
Jan Wikholm
+1  A: 

I think the other posters have all shown that there is no real shortcut.

You can either use a tool such as tail (or PowerShell) or you can write some dumb code that seeks to the end of the file and then looks back for n newlines.

There are plenty of implementations of tail out there on the web - take a look at the source code to see how they do it. Tail is pretty efficient (even on very very large files) and so they must have got it right when they wrote it!

Fortyrunner
A: 

Open the file and start reading lines. After you've read 10 lines open another pointer, starting at the front of the file, so the second pointer lags the first by 10 lines. Keep reading, moving the two pointers in unison, until the first reaches the end of the file. Then use the second pointer to read the result. It works with any size file including empty and shorter than the tail length. And it's easy to adjust for any length of tail. The drawback, of course, is that you end up reading the entire file and that may be exactly what you're trying to avoid.
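
A simpler variant of the same forward-reading idea, swapped in here for the two-pointer version, is to keep a sliding window of the most recent lines in a queue. The sketch below is only an illustration; ForwardTail and LastLines are made-up names, and like the description above it still reads the entire input.

```csharp
using System.Collections.Generic;
using System.IO;

public static class ForwardTail
{
    // Reads every line from `reader` and keeps only the most recent `lineCount`.
    public static List<string> LastLines(TextReader reader, int lineCount)
    {
        if (lineCount <= 0) return new List<string>();

        var window = new Queue<string>(lineCount);
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Drop the oldest line once the window is full.
            if (window.Count == lineCount) window.Dequeue();
            window.Enqueue(line);
        }
        return new List<string>(window);
    }
}
```

Passing a StreamReader opened on the file and a lineCount of 10 would give the last ten lines, at the cost of a full pass over the file.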

Sisiutl
If the file is 10GB, I think it's safe to say that's exactly what he's trying to avoid :-)
gbjbaanb
A: 

Why not use File.ReadAllLines, which returns a string[]?

Then you can get the last 10 lines (or members of the array) which would be a trivial task.

This approach isn't taking into account any encoding issues and I'm not sure on the exact efficiency of this approach (time taken to complete method, etc).

dotnetdev
The asker is talking about a large file (> 10 GB)!
Ahmed Said
+3  A: 

I think the following code will solve the problem, with subtle changes regarding encoding

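        using (FileStream stream = new FileStream(@"c:\test.txt", FileMode.Open, FileAccess.Read))
        {
            // Scan the raw stream backwards; a StreamReader buffers ahead,
            // so moving BaseStream.Position between its reads is unreliable.
            long position = stream.Length;
            int count = 0;
            while (position > 0 && count <= 10)
            {
                stream.Position = --position;
                if (stream.ReadByte() == '\n')
                {
                    ++count;
                }
            }
            // Start just after the 11th newline from the end, or at the start of a short file
            stream.Position = count > 10 ? position + 1 : 0;
            StreamReader reader = new StreamReader(stream, Encoding.ASCII);
            string str = reader.ReadToEnd();
            string[] arr = str.Split('\n');
        }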
Ahmed Said
A: 

If your file has a fixed-length format per line (such as output from a DAQ system), you can use a StreamReader to get the length of the file, then read one of the lines with ReadLine().

Divide the total length by the length of that line. Now you have an approximate line count for the file, as a long.

The key is that you call ReadLine() once before reading the data for your array or whatever. This ensures that you start at the beginning of a new line and don't pick up any leftover data from the previous one.

    StreamReader leader = new StreamReader(GetReadFile);
    leader.BaseStream.Position = 0;
    StreamReader follower = new StreamReader(GetReadFile);

    int count = 0;
    string tmper = null;
    while (count <= 12)
    {
        tmper = leader.ReadLine();
        count++;
    }

    long total = follower.BaseStream.Length; // get total length of file
    long step = tmper.Length; // get length of 1 line
    long size = total / step; // divide to get number of lines
    long go = step * (size - 12); // get the byte location

    long cut = follower.BaseStream.Seek(go, SeekOrigin.Begin); // Go to that location
    follower.BaseStream.Position = go;

    string led = null;
    string[] lead = null;
    List<string[]> samples = new List<string[]>();

    follower.ReadLine(); // discard the partial line at the seek position

    while (!follower.EndOfStream)
    {
        led = follower.ReadLine();
        lead = Tokenize(led);
        samples.Add(lead);
    }
Gabe
A: 

Well duh, obviously you need a bit of code like this...

for( line = 0 to 10000000000; line++ ){
   t = file.readline( checkEachCharForSpecialEncoding = true )
   if( line > 9999999990 ) output t;
}

Problem solved. Take a quick holiday and wait for your answer.

needanewname