views: 426
answers: 9

Is there a shorter (perhaps more pythonic) way of opening a text file and reading past the lines that start with a comment character?

In other words, a neater way of doing this

fin = open("data.txt")
line = fin.readline()
while line.startswith("#"):
    line = fin.readline()
+6  A: 
for line in open('data.txt'):
    if line.startswith('#'):
        continue
    # work with line

of course, if your commented lines are only at the beginning of the file, you might use some optimisations.
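For instance, a minimal sketch of one such optimisation (stop testing lines once the leading comment block is past; `data.txt` is simulated here with an in-memory file):

```python
import io

# Stand-in for open('data.txt'): two header comments, then data.
fin = io.StringIO("# header\n# more header\ndata 1\ndata 2\n")

kept = []
in_header = True
for line in fin:
    if in_header:
        if line.startswith('#'):
            continue            # still in the leading comment block
        in_header = False       # first real line: stop testing from here on
    kept.append(line)
```

This only pays for the startswith() test while the header lasts, and it keeps any '#' lines that appear later in the file.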

SilentGhost
+1 Clear and explicit. If there are more conditions to filter out lines, you just add the next check and it stays clear, unlike stacking filters.
Tomek Szpakowicz
+6  A: 

If you want to filter out all comment lines (not just those at the start of the file):

for line in open("data.txt"):
    if not line.startswith("#"):
        # process line

If you only want to skip those at the start, see ephemient's answer using itertools.dropwhile.

Wim
+4  A: 

You could make a generator that loops over the file that skips those lines:

fin = open("data.txt")
fileiter = (l for l in fin if not l.startswith('#'))

for line in fileiter:
    ...
ʞɔıu
+5  A: 

You could use a generator function

def readlines(filename):
    fin = open(filename)
    for line in fin:
        if not line.startswith("#"):
            yield line

and use it like

for line in readlines("data.txt"):
    # do things
    pass

Depending on exactly where the files come from, you may also want to strip() the lines before the startswith() check. I once had to debug a script like that months after it was written because someone had put a couple of space characters before the '#'.

iWerner
This filters all lines that start with `#`, not just those at the start ("head") of the file -- OP is not completely clear on the desired behavior.
ephemient
Besides, you could use a generator expression: `for line in (line for line in open('data.txt') if not line.startswith('#')):`
ephemient
See my answer for a version of this that only drops '#' lines from the beginning of the file, not from the whole file.
steveha
+8  A: 
from itertools import dropwhile
for line in dropwhile(lambda line: line.startswith('#'), open('data.txt')):
    pass
ephemient
+1  A: 

You could do something like

def drop(n, seq):
    for i, x in enumerate(seq):
        if i >= n:
            yield x

And then say

for line in drop(1, open(filename)):
    # whatever
Corey Porter
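For what it's worth, the standard library already provides this generic "drop the first n items" operation as itertools.islice, so the helper above can be replaced by (shown here over an in-memory list standing in for a file):

```python
from itertools import islice

lines = ["# header\n", "data 1\n", "data 2\n"]
# islice(iterable, 1, None) skips the first item, like drop(1, ...)
rest = list(islice(lines, 1, None))
```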
+7  A: 

At this stage in my arc of learning Python, I find this most Pythonic:

def iscomment(s):
    return s.startswith('#')

from itertools import dropwhile
with open(filename, 'r') as f:
    for line in dropwhile(iscomment, f):
        # do something with line

to skip all of the lines at the top of the file starting with #. To skip all lines starting with #:

from itertools import ifilterfalse  # filterfalse in Python 3
with open(filename, 'r') as f:
    for line in ifilterfalse(iscomment, f):
        # do something with line

That's almost all about readability for me; functionally there's almost no difference between:

for line in ifilterfalse(iscomment, f):

and

for line in (x for x in f if not x.startswith('#'))

Breaking out the test into its own function makes the intent of the code a little clearer; it also means that if your definition of a comment changes you have one place to change it.
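To make the behavioural difference between the two variants concrete, here is a small illustration over in-memory lines (using filterfalse, the Python 3 spelling of ifilterfalse):

```python
from itertools import dropwhile, filterfalse

def iscomment(s):
    return s.startswith('#')

lines = ['# a\n', 'x\n', '# b\n', 'y\n']
head_skipped = list(dropwhile(iscomment, lines))   # drops only the leading run
all_skipped = list(filterfalse(iscomment, lines))  # drops every comment line
```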

Robert Rossney
those `while`s should be `with`s, yes?
Autoplectic
Yikes. Fixed, thanks.
Robert Rossney
+1  A: 

I like @iWerner's generator function idea. One small change to his code and it does what the question asked for.

def readlines(filename):
    f = open(filename)
    # discard leading lines that start with '#';
    # yield the first non-comment line, then the rest.
    # (If every line is a comment, nothing is yielded.)
    for line in f:
        if not line.lstrip().startswith("#"):
            yield line
            break

    for line in f:
        yield line

and use it like

for line in readlines("data.txt"):
    # do things
    pass

But here is a different approach, which is almost embarrassingly simple. The idea is that we open the file and get a file object, which we can use as an iterator. Then we pull the lines we don't want out of the iterator and just return the iterator. This would be ideal if we always knew how many lines to skip; the problem here is that we don't know in advance, so we have to pull lines and look at them. And there is no way to put a line back into the iterator once we have pulled it.

So: open the file, pull lines and count how many have the leading '#' character; then use the .seek() method to rewind the file, discard that many lines again, and return the iterator.

One thing I like about this: you get the actual file object back, with all its methods; you can just use this instead of open() and it will work in all cases. I renamed the function to open_my_text() to reflect this.

def open_my_text(filename):
    f = open(filename, "rt")
    # count number of lines that start with '#'
    count = 0
    for line in f:
        if not line.lstrip().startswith("#"):
            break
        count += 1

    # rewind file, and discard lines counted above
    f.seek(0)
    for _ in range(count):
        f.readline()

    # return file object with comment lines pre-skipped
    return f

Instead of f.readline() I could have used f.next() (for Python 2.x) or next(f) (for Python 3.x) but I wanted to write it so it was portable to any Python.

EDIT: Okay, I know nobody cares and I'm not getting any upvotes for this, but I have re-written my answer one last time to make it more elegant.

You can't put a line back into an iterator. But, you can open a file twice, and get two iterators; given the way file caching works, the second iterator is almost free. If we imagine a file with a megabyte of '#' lines at the top, this version would greatly outperform the previous version that calls f.seek(0).

def open_my_text(filename):
    # open the same file twice to get two file objects
    # (We are opening the file read-only so this is safe.)
    ftemp = open(filename, "rt")
    f = open(filename, "rt")

    # use ftemp to look at lines, then discard from f
    for line in ftemp:
        if not line.lstrip().startswith("#"):
            break
        f.readline()

    # return file object with comment lines pre-skipped
    return f

This version is much better than the previous version, and it still returns a full file object with all its methods.
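A quick usage sketch of the two-file version, writing a throwaway file first since the function needs a real file on disk (the function body is repeated only to keep the sketch self-contained):

```python
import os
import tempfile

def open_my_text(filename):
    # open the same file twice; ftemp looks at lines, f skips them
    ftemp = open(filename, "rt")
    f = open(filename, "rt")
    for line in ftemp:
        if not line.lstrip().startswith("#"):
            break
        f.readline()
    ftemp.close()
    return f

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as w:
    w.write("# comment 1\n  # comment 2\ndata line\n")

f = open_my_text(path)
first = f.readline()   # first non-comment line of the file
f.close()
os.remove(path)
```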

steveha
Instead of count, why not use `f.tell()` in your loop to save the actual place in the file? Replace `count = 0` with `loc = 0`, `count += 1` with `loc = f.tell()`, and `f.seek(0)` with `f.seek(loc)` and remove your `for _ in range(count)` loop altogether.
Paul McGuire
I like the suggestion, but I just tried it and it doesn't work. The `.tell()` method does not track with the iterator; my short test file got slurped in completely and `.tell()` returned the end-of-file every time I called it. If `.tell()` did track with the iterator, I'd absolutely do it your way; it's cleaner. My code is messier, but has the advantage of actually working... :-)
steveha
+1  A: 

As a practical matter, if I knew I was dealing with reasonably sized text files (anything which will comfortably fit in memory) then I'd probably go with something like:

f = open("data.txt")
lines = [ x for x in f.readlines() if x[0] != "#" ]

... to snarf in the whole file and filter out all lines that begin with the octothorpe.

As others have pointed out one might want ignore leading whitespace occurring before the octothorpe like so:

lines = [ x for x in f.readlines() if not x.lstrip().startswith("#") ]

I like this for its brevity.

This assumes that we want to strip out all of the comment lines.

We can also "chop" the last characters (almost always newlines) off the end of each using:

lines = [ x[:-1] for x in ... ]

... assuming that we're not worried about the infamously obscure issue of a missing final newline on the last line of the file. (The only time a line from the .readlines() or related file-like object methods might NOT end in a newline is at EOF).

In reasonably recent versions of Python one can "chomp" (only newlines) off the ends of the lines using a conditional expression like so:

lines = [ x[:-1] if x[-1]=='\n' else x for x in ... ]

... which is about as complicated as I'll go with a list comprehension for legibility's sake.
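A simpler spelling of the same chomp, for what it's worth: since a line from readlines() carries at most one trailing newline, rstrip('\n') is equivalent and avoids the indexing:

```python
raw = ['data 1\n', 'data 2']           # last line missing its final newline
lines = [x.rstrip('\n') for x in raw]  # safe even without a trailing '\n'
```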

If we were worried about the possibility of an overly large file (or low memory constraints) impacting our performance or stability, and we're using a version of Python that's recent enough to support generator expressions (which are more recent additions to the language than the list comprehensions I've been using here), then we could use:

for line in (x[:-1] if x[-1] == '\n' else x
             for x in f if not x.lstrip().startswith('#')):
    # do stuff with each line

... which is at the limits of what I'd expect anyone else to parse in one line a year after the code's been checked in. (Note that iterating over f directly, rather than f.readlines(), is what keeps the memory footprint small.)

If the intent is only to skip "header" lines then I think the best approach would be:

f = open('data.txt')
for line in f:
    if line.lstrip().startswith('#'):
        continue
    # process line

... and be done with it.

Jim Dennis