I'm writing a log file viewer for a web application, and for that I want to paginate through the lines of the log file. The items in the file are line-based, with the newest item at the bottom.

So I need a tail() method that can read n lines from the bottom and supports an offset. What I came up with looks like this:

def tail(f, n, offset=0):
    """Reads a n lines from f with an offset of offset lines."""
    avg_line_length = 74
    to_read = n + offset
    while 1:
        try:
            f.seek(-(avg_line_length * to_read), 2)
        except IOError:
            # woops.  apparently file is smaller than what we want
            # to step back, go to the beginning instead
            f.seek(0)
        pos = f.tell()
        lines = f.read().splitlines()
        if len(lines) >= to_read or pos == 0:
            return lines[-to_read:offset and -offset or None]
        avg_line_length *= 1.3

Is this a reasonable approach? What is the recommended way to tail log files with offsets?

A: 

Assumes a unix-like system.

import os
def tail(f, n, offset=0):
    stdin, stdout = os.popen2("tail -n %d %s" % (n + offset, f))
    stdin.close()
    lines = stdout.readlines()
    stdout.close()
    return lines[:len(lines) - offset]
Mark
Should be platform independent. Besides, if you read the question you will see that f is a file-like object.
Armin Ronacher
The question doesn't say platform dependence is unacceptable. I fail to see why this deserves two downvotes when it provides a very unixy (may be what you're looking for... certainly was for me) way of doing exactly what the question asks.
Shabbyrobe
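
For reference (not part of the original answer): the same shell-out idea can be written with the subprocess module, which replaces os.popen2 in later Python versions. This sketch assumes path is a filename rather than the file object from the question, and a unix-like system with tail available:

import subprocess

def tail_cmd(path, n, offset=0):
    # Ask the external `tail` for the last n+offset lines, then drop the
    # last `offset` of them so the result ends `offset` lines before EOF.
    proc = subprocess.Popen(["tail", "-n", str(n + offset), path],
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    lines = out.splitlines()
    return lines[:len(lines) - offset]
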
+1  A: 

For efficiency with very large files (common in logfile situations where you may want to use tail), you generally want to avoid reading the whole file (even if you do it without holding the whole file in memory at once). However, you do need to somehow work out the offset in lines rather than in characters. One possibility is reading backwards with seek() character by character, but this is very slow. Instead, it's better to process in larger blocks.

I have a utility function I wrote a while ago for reading files backwards that can be used here.

import os, itertools

def rblocks(f, blocksize=4096):
    """Read file as series of blocks from end of file to start.

    The data itself is in normal order, only the order of the blocks is reversed.
    ie. "hello world" -> ["ld","wor", "lo ", "hel"]
    Note that the file must be opened in binary mode.
    """
    if 'b' not in f.mode.lower():
        raise Exception("File must be opened using binary mode.")
    size = os.stat(f.name).st_size
    fullblocks, lastblock = divmod(size, blocksize)

    # The first(end of file) block will be short, since this leaves 
    # the rest aligned on a blocksize boundary.  This may be more 
    # efficient than having the last (first in file) block be short
    f.seek(-lastblock,2)
    yield f.read(lastblock)

    for i in range(fullblocks-1,-1, -1):
        f.seek(i * blocksize)
        yield f.read(blocksize)

def tail(f, nlines):
    buf = ''
    result = []
    for block in rblocks(f):
        buf = block + buf
        lines = buf.splitlines()

        # Keep all lines except the first, since it may be partial.
        # Earlier blocks must be prepended to preserve file order.
        if lines:
            result = lines[1:] + result
            if len(result) >= nlines:
                return result[-nlines:]

            buf = lines[0]

    return ([buf] + result)[-nlines:]


f=open('file_to_tail.txt','rb')
for line in tail(f, 20):
    print line

[Edit] Added more specific version (avoids need to reverse twice)

Brian
A quick test shows that this performs a lot worse than my version from above. Probably because of your buffering.
Armin Ronacher
I suspect it's because I'm doing multiple seeks backwards, so I'm not getting as good use of the read-ahead buffer. However, I think it may do better when your guess at the line length isn't accurate (eg. very large lines), as it avoids having to re-read data in this case.
Brian
+13  A: 

This may be quicker than yours. Makes no assumptions about line length. Backs through the file one block at a time till it's found the right number of '\n' characters.

def tail( f, window=20 ):
    f.seek( 0, 2 )
    bytes= f.tell()
    size= window
    block= -1
    while size > 0 and bytes+block*1024  > 0:
        # If your OS is rude about small files, you need this check
        # If your OS does 'the right thing' then just f.seek( block*1024, 2 )
        # is sufficient
        if (bytes+block*1024 > 0):
            ##Seek back once more, if possible
            f.seek( block*1024, 2 )
        else:
            #Seek to the beginning
            f.seek(0, 0)
        data= f.read( 1024 )
        linesFound= data.count('\n')
        size -= linesFound
        block -= 1
    f.seek( block*1024, 2 )
    f.readline() # find a newline
    lastBlocks= list( f.readlines() )
    print lastBlocks[-window:]

I don't like tricky assumptions about line length when -- as a practical matter -- you can never know things like that.

Generally, this will locate the last 20 lines on the first or second pass through the loop. If your 74 character thing is actually accurate, you make the block size 2048 and you'll tail 20 lines almost immediately.

Also, I don't burn a lot of brain calories trying to finesse alignment with physical OS blocks. Using these high-level I/O packages, I doubt you'll see any performance consequence of trying to align on OS block boundaries. If you use lower-level I/O, then you might see a speedup.

S.Lott
Nice. At least one poster that read the question and the code in there :)
Armin Ronacher
This works really well. I just pulled it into a script to read the last 4000 lines of a log file before I parse them. It works quickly and makes sense. Thanks!
Jeff Hellman
This fails on small logfiles -- IOError: invalid argument -- f.seek( block*1024, 2 )
ohnoes
A: 

On second thought, this is probably just as fast as anything here.

def tail( f, window=20 ):
    lines= ['']*window
    count= 0
    for l in f:
        lines[count%window]= l
        count += 1
    print lines[count%window:], lines[:count%window]

It's a lot simpler. And it does seem to rip along at a good pace.

S.Lott
Because nearly everything here doesn't work with log files of more than 30 MB or so without loading the same amount of memory into RAM ;) Your first version is a lot better, but for the test files here it performs slightly worse than mine and it doesn't work with different newline characters.
Armin Ronacher
I was wrong. Version 1 took 0.00248908996582 for 10 tails through the dictionary. Version 2 took 1.2963051796 for 10 tails through the dictionary. I'd almost vote myself down.
S.Lott
"doesn't work with different newline characters." Replace datacount('\n') with len(data.splitlines()) if it matters.
S.Lott
+2  A: 

If reading the whole file is acceptable then use a deque.

from collections import deque
deque(f, maxlen=n)

Prior to Python 2.6, deques didn't have a maxlen option, but it's easy enough to implement.

import itertools
def maxque(items, size):
    items = iter(items)
    q = deque(itertools.islice(items, size))
    for item in items:
        del q[0]
        q.append(item)
    return q
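
As a quick usage sketch (not part of the answer): on Python 2.6 or later the deque one-liner can be wrapped to match the signature from the question. The offset handling below is my own assumption based on the question's semantics:

from collections import deque

def tail(f, n, offset=0):
    # Keep only the last n+offset lines while iterating the file,
    # then drop the final offset lines so the result ends
    # offset lines before the end of the file.
    lines = list(deque(f, maxlen=n + offset))
    return lines[:len(lines) - offset]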

If it's a requirement to read the file from the end, then use a gallop (a.k.a. exponential) search.

def tail(f, n):
    assert n >= 0
    pos, lines = n+1, []
    while len(lines) <= n:
        try:
            f.seek(-pos, 2)
        except IOError:
            f.seek(0)
            break
        finally:
            lines = list(f)
        pos *= 2
    return lines[-n:]
Coady
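
A quick usage sketch of the exponential-search version above (the file name is made up). Like the rest of the thread it assumes Python 2; on Python 3 the file would need to be opened in binary mode for the seek relative to the end to work:

f = open("example.log")
last_ten = tail(f, 10)   # also handles files shorter than 10 lines,
                         # via the IOError fallback to the start
f.close()
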
+5  A: 

The code I ended up using. I think this is the best so far:

def tail(f, n, offset=None):
    """Reads a n lines from f with an offset of offset lines.  The return
    value is a tuple in the form ``(lines, has_more)`` where `has_more` is
    an indicator that is `True` if there are more lines in the file.
    """
    avg_line_length = 74
    to_read = n + (offset or 0)

    while 1:
        try:
            f.seek(-(avg_line_length * to_read), 2)
        except IOError:
            # woops.  apparently file is smaller than what we want
            # to step back, go to the beginning instead
            f.seek(0)
        pos = f.tell()
        lines = f.read().splitlines()
        if len(lines) >= to_read or pos == 0:
            return lines[-to_read:offset and -offset or None], \
                   len(lines) > to_read or pos > 0
        avg_line_length *= 1.3
Armin Ronacher
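
A hypothetical usage sketch for the log-viewer pagination described in the question (the file name, page size, and page number below are made up). Like the rest of the thread it assumes Python 2, where seeking from the end of a text-mode file is allowed:

per_page = 50
page = 2   # page 0 shows the newest 50 lines, page 1 the 50 before those, ...

f = open("app.log")
lines, has_more = tail(f, per_page, offset=page * per_page)
f.close()
# has_more is True when there are older lines before the returned window,
# so the viewer can still offer an "older entries" link.
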
A: 

What's the best way to read a 1 GB file that gets time-series data logged into it and generate a real-time graph from two of its columns (one is time and the other a number)? I see that you have different ways of tailing the file.

Since this isn't an answer to the question asked here, you should post it as a new question ("Ask Question" button in the top right of the page). Also more people would see it that way and try to answer.
sth
A: 

Based on S.Lott's top-voted answer (Sep 25 '08 at 21:43), but fixed for small files.

def tail(the_file, lines_2find=20):
    # We read at least lines_2find+1 line breaks (21 by default) from the
    # bottom, block by block for speed; the extra one ensures we don't get
    # a half line.
    the_file.seek(0, 2)                         # go to end of file
    bytes_in_file = the_file.tell()
    lines_found, total_bytes_scanned = 0, 0
    while lines_2find+1 > lines_found and bytes_in_file > total_bytes_scanned:
        byte_block = min(1024, bytes_in_file-total_bytes_scanned)
        the_file.seek(-(byte_block+total_bytes_scanned), 2)
        total_bytes_scanned += byte_block
        lines_found += the_file.read(1024).count('\n')
    the_file.seek(-total_bytes_scanned, 2)
    line_list = list(the_file.readlines())
    return line_list[-lines_2find:]

Hope this is useful.

Eyecue