I'm reading a file in Python in which records are separated by a blank line. If the file ends in two or more newlines, the last record is processed as expected, but if it ends in a single newline, the last record is not processed. Here's the code:

def fread():
    record = False
    for line in open('somefile.txt'):
        if line.startswith('Record'):
            record = True
            d = SomeObject()

        # do some processing with line
        d.process(line)

        if not line.strip() and record:
            yield d
            record = False

for record in fread():
    print(record)

In this data sample, everything works as expected ('\n' marks an empty line):

Record 1
data a
data b
data c
\n
Record 2
data a
data b
data c
\n
\n

But in this one, the last record isn't returned:

Record 1
data a
data b
data c
\n
Record 2
data a
data b
data c
\n

How can I preserve the last newline of the file so that I get the last record?

PS: I'm using the term "preserve" because I couldn't find a better one.

Thanks.

Edit: The original code was a stripped-down version, just to illustrate the problem, but it seems I stripped too much. I have now posted the function's full code.

A little more explanation: a SomeObject is created for each record in the file, and the records are separated by blank lines. At the end of each record the function yields the object so I can use it (save it to a db, compare it to other objects, etc.).

The main problem is that when the file ends in a single newline, the last record isn't yielded. It seems that Python does not read the last line when it's blank.

A: 

line.strip() will result in an empty string on an empty line. An empty string is False, so you swallow the empty line

>>> bool("\n".strip())
False
>>> bool("\n")
True
f3lix
A: 

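If you call readline repeatedly (in a loop) on your file object, instead of iterating with for ... in, it should work as you expect: readline returns an empty string at end of file, which you can treat as the end of the last record. Compare these: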

>>> x = open('/tmp/xyz')
>>> x.readline()
'x\n'
>>> x.readline()
'\n'
>>> x.readline()
'y\n'
>>> x.readline()
''
>>> open('/tmp/xyz').readlines()
['x\n', '\n', 'y\n']
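
A minimal sketch of such a readline loop applied to the question's record format. At end of file readline() returns '', which can be handled like a blank separator so the final record is emitted even when no blank line follows it. The name read_records and the use of a plain list of lines in place of SomeObject are assumptions made for illustration.

```python
import io


def read_records(f):
    """Yield each block of non-blank lines from file object f as a list."""
    record = []
    while True:
        line = f.readline()
        if line.strip():
            # Non-blank line: it belongs to the current record.
            record.append(line.rstrip("\n"))
            continue
        # Either a blank separator line or '' at end of file.
        if record:
            yield record
            record = []
        if line == "":
            # readline() returns '' only at end of file.
            break


# io.StringIO stands in for a real file whose last record
# is not followed by a blank line.
sample = io.StringIO("Record 1\ndata a\n\nRecord 2\ndata a\n")
records = list(read_records(sample))
print(records)
```

Because the file is still read one line at a time, this should also suit files too large to load at once.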
Jacob Gabrielson
+4  A: 

You might find that a slight twist in a more classically Pythonic direction improves the predictability of the code:

def fread():
    for line in open('text.txt'):
        if line.strip():
            d = SomeObject()
            yield d

    # The generator ends here; no explicit StopIteration is needed.
    return

for record in fread():
    print(record)

A generator ends when its body returns or runs off the end, which raises StopIteration for the caller automatically. Raising StopIteration yourself inside a generator used to be a common idiom, but since Python 3.7 (PEP 479) it is turned into a RuntimeError, so a plain return is used here. Using if line.strip() simply means that you'll do the yield if there's anything remaining in line after stripping whitespace. The construction of SomeObject() can be anywhere... I just happened to move it in case construction of SomeObject was expensive, or had side effects that shouldn't happen if the line is empty.

EDIT: I'll leave my answer here for posterity's sake, but DNS below got the original intent right, where several lines contribute to the same SomeObject() record (which I totally glossed over).

Jarret Hardie
Your code does match the author's code, but from the wording of his question, and his sample data, it doesn't look like he wants to create a new SomeObject on every non-empty line. He wants a block of lines to contribute to one SomeObject.
DNS
Ah yes.. you are right.
Jarret Hardie
Edited my post to refer to yours, DNS.
Jarret Hardie
+1 for pointing out StopIteration as the best way to exit a generator.
Luiz Damim
+5  A: 

The way it's written now probably doesn't work anyway; with d = SomeObject() inside your loop, a new SomeObject is being created for every line. Yet, if I understand correctly, what you want is for all of the lines in between empty lines to contribute to that one object. You could do something like this instead:

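def fread():
    d = None
    with open('somefile.txt') as f:
        for line in f:
            if line.strip():
                # Non-blank line: start a record if needed, then process it.
                if d is None:
                    d = SomeObject()
                d.process(line)
            elif d is not None:
                # A blank line closes the current record.
                yield d
                d = None

    # No trailing blank line: yield the record that is still pending.
    if d is not None:
        yield d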

This isn't great code, but it does work; that last object that misses its empty line is yielded when the loop is done.
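
To see why that final yield is needed, it may help to look at what iteration actually produces. A final "\n" terminates the last data line rather than starting a new, empty one, so a file that ends in a single newline never produces a blank line. In this small sketch io.StringIO stands in for the file, and the strings are shortened versions of the question's samples.

```python
import io

# The last record is followed by a single newline...
one_newline = "Record 2\ndata c\n"
# ...or by two newlines, i.e. a real blank line.
two_newlines = "Record 2\ndata c\n\n"

lines_one = list(io.StringIO(one_newline))
lines_two = list(io.StringIO(two_newlines))

print(lines_one)  # ['Record 2\n', 'data c\n']
print(lines_two)  # ['Record 2\n', 'data c\n', '\n']
```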

DNS
You understood correctly, and I edited my post to be clearer about what I want. Your approach is very good and solves my problem, thank you very much. But still, why doesn't Python read the last line when it's empty?
Luiz Damim
A: 

Replace open('somefile.txt'): with open('somefile.txt').read().split('\n'): and your code will work.
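
A short sketch of why this works, using a shortened sample string as a stand-in for the file's contents: when the text ends in a newline, split('\n') produces a trailing empty string, and that empty string acts as the blank separator the loop is waiting for. Note that read() loads the whole file into memory.

```python
text = "Record 1\ndata a\n\nRecord 2\ndata a\n"

# split("\n") keeps an empty string after the final newline.
parts = text.split("\n")
print(parts)  # ['Record 1', 'data a', '', 'Record 2', 'data a', '']

# splitlines() has no trailing empty element, so the last record
# would again be left without a terminating blank entry.
lines = text.splitlines()
print(lines)  # ['Record 1', 'data a', '', 'Record 2', 'data a']
```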

But Jarret Hardie's answer is better.

tgray
I can't read the whole file into memory; it's 100k+ lines.
Luiz Damim