Hey guys,

I need a function that iterates over all the lines in a file.
Here's what I have so far:

def LineFeed(file):
    ret = ""
    for byte in file:
     ret = ret + str(byte)
     if str(byte) == '\r':
      yield ret
      ret = ""

All the lines in the file end with \r (not \n), and I'm reading it in "rb" mode (I have to read this file in binary). The yield doesn't work and returns nothing. Maybe there's a problem with the comparison? I'm just not sure how you represent a byte/char in Python.

I'm getting the impression that a for loop over a file opened in "rb" mode still iterates over lines, not bytes. How can I iterate over bytes? My problem is that I don't have standard line endings. Also, my file is full of 0x00 bytes and I would like to get rid of them all, so I think I need a second yield function. How could I implement that? I just don't know how to represent the 0x00 byte, or the NULL char, in Python.

Help please.

+2  A: 

I think you are confused about what "for x in file" does. Assuming you got your handle with something like "file = open(file_name)", "byte" in this case will be an entire line, not a single character. So you only reach the yield when the entire line consists of a single carriage return. Try renaming "byte" to "line" and iterating over that with a second loop.
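To see the behaviour described above, here is a minimal sketch in Python 3. It assumes an in-memory io.BytesIO object standing in for the real file, and the sample data is made up. Iterating a binary file splits on b"\n" only, so a file that uses \r everywhere comes back as one big chunk. Splitting that chunk on b"\r" is one possible way to do the second pass.

```python
import io

# Hypothetical sample data: three "lines" terminated by \r, with no \n anywhere.
sample = io.BytesIO(b"one\rtwo\rthree\r")

# Iterating a binary file object yields chunks ending in b"\n", not single bytes.
chunks = list(sample)
print(len(chunks))  # prints 1: the whole content is a single "line"

# Second pass: break each chunk apart at the carriage returns.
found = []
for chunk in chunks:
    for line in chunk.split(b"\r"):
        if line:  # skip the empty piece after the final \r
            found.append(line)
print(found)
```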

danben
A: 

Edit:

  • String concatenation with string1 += string2 is slow. Try joining a list of strings instead.

  • ddaa is right: you shouldn't need the struct package if the binary file only contains ASCII. Also, my generator now returns the string after the final '\r', before EOF. With these two minor fixes, my code is suspiciously similar (practically identical) to this more recent answer.

Code snip:

def LineFeed(f):
    ret = []
    while True:
        oneByte = f.read(1)
        if not oneByte: break
        # Return everything up to, but not including the carriage return
        if oneByte == '\r':
            yield ''.join(ret)
            ret = []
        else:
            ret.append(oneByte)
    # Yield any text left over after the final carriage return
    if ret:
        yield ''.join(ret)
if __name__ == '__main__':
    lf = LineFeed( open('filename','rb') )

    for something in lf:
        doSomething(something)
Pete
There is no point in using struct here. Also, the function has a bug: it discards any text after the final \r.
ddaa
@ddaa: Good points; fixed!
Pete
A: 

So, your problem is iterating over the lines of a file, opened in binary mode, that uses '\r' as a line separator. Since the file is in binary mode, you cannot use the universal newline feature, and it turns out that '\r' is not interpreted as a line separator in binary mode.

Reading a file char by char is a terribly inefficient thing to do in Python, but here's how you could iterate over your lines:

def cr_lines(the_file):
    line = []
    while True:
        byte = the_file.read(1)
        if not byte:
            break
        line.append(byte)
        if byte == '\r':
            yield ''.join(line)
            line = []
    if line:
        yield ''.join(line)

To be more efficient, you would need to read bigger chunks of text and handle buffering in your iterator. Keep in mind that you could get strange bugs if you seek while iterating. Preventing those bugs would require a subclass of file so you can purge the buffer on seek.

Note the use of the ''.join(line) idiom. Accumulating a string with += has terrible performance and is a common mistake made by beginning programmers.
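The chunked approach mentioned above could be sketched roughly as follows, in Python 3. The name cr_lines_chunked and the chunk_size parameter are made up for illustration, and the sketch makes no attempt to handle seeking while iterating. Like cr_lines, it keeps the trailing '\r' on each line.

```python
import io

def cr_lines_chunked(the_file, chunk_size=4096):
    """Yield carriage-return-terminated lines from a binary file, read in chunks."""
    pending = b""  # bytes read so far that do not yet form a complete line
    while True:
        chunk = the_file.read(chunk_size)
        if not chunk:
            break
        pieces = (pending + chunk).split(b"\r")
        # The last piece has no terminator yet; hold it back for the next chunk.
        pending = pieces.pop()
        for piece in pieces:
            yield piece + b"\r"
    if pending:
        # Text after the final carriage return, if there is any.
        yield pending

# Hypothetical data; a tiny chunk size forces lines to span several reads.
demo = io.BytesIO(b"alpha\rbeta\rgamma")
result = list(cr_lines_chunked(demo, chunk_size=4))
print(result)
```

Each read pulls in up to chunk_size bytes, so the number of read calls drops by roughly that factor compared with reading one byte at a time. A single very long line is still re-joined on every read, which is a trade-off of keeping the sketch short.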

ddaa
Why do you say you can't use the universal newline? Universal newline is incompatible with writing, but open('cr_terminated.bin', 'Urb') works fine.
Jeffrey Harris
Because the OP states he needs to open the file in binary mode and the documentation says: "supplying 'U' opens the file as a text file". http://docs.python.org/library/functions.html#open
ddaa
+2  A: 

Perhaps if you were to explain what this file represents, why it has lots of '\x00', why you think you need to read it in binary mode, we could help you with your underlying problem.

Otherwise, try the following code; it avoids any dependence on (or interference from) your operating system's line-ending convention.

lines = open("the_file", "rb").read().split("\r")
for line in lines:
    process(line)

Edit: the ASCII NUL (not "NULL") byte is "\x00".
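As a rough sketch of how the NUL byte fits in with the split above (Python 3, where binary data is bytes, so the literals carry a b prefix): the helper name clean_lines and the sample bytes are made up for illustration.

```python
NUL = b"\x00"  # a single zero byte

def clean_lines(raw):
    """Drop every NUL byte, then split what remains on carriage returns."""
    return raw.replace(NUL, b"").split(b"\r")

# Hypothetical contents standing in for open("the_file", "rb").read()
raw = b"ab\x00c\rd\x00\x00ef\r"
lines = clean_lines(raw)
print(lines)
```

Note that when the data ends with '\r', split leaves an empty item at the end of the list, which you may want to skip.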

John Machin
+1 A sane answer to an absurd question.
hughdbrown
+1  A: 

If you're in control of how you open the file, I'd recommend opening it with universal newlines, since \r isn't recognized as a line ending if you just use 'rb' mode, but it is if you use 'Urb'.

This will only work if you aren't including \n as well as \r in your binary file somewhere, since the distinction between \r and \n is lost when using universal newlines.

Assuming you want your yielded lines to still be \r terminated:

NUL = '\x00'
def lines_without_nulls(path):
    with open(path, 'Urb') as f:
        for line in f:
            yield line.replace(NUL, '').replace('\n', '\r')
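For readers on Python 3, where recent versions no longer accept the 'U' flag, a comparable sketch is to open the file in text mode with newline=None, which turns on the same newline translation. The latin-1 encoding is an assumption made here so that every byte decodes to exactly one character, and the temporary file only exists for the demo.

```python
import os
import tempfile

NUL = "\x00"

def lines_without_nulls(path):
    # newline=None (the default) enables universal-newline translation:
    # "\r", "\n" and "\r\n" all reach the caller as "\n".
    # latin-1 maps each byte to one character, so no byte fails to decode.
    with open(path, "r", newline=None, encoding="latin-1") as f:
        for line in f:
            yield line.replace(NUL, "").replace("\n", "\r")

# Hypothetical file contents, written to a temporary file for the demo.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a\x00b\rcd\r")
result = list(lines_without_nulls(tmp.name))
os.remove(tmp.name)
print(result)
```

This yields str objects rather than bytes, so it only fits if text mode is acceptable for the file in question.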
Jeffrey Harris
Note that it's been so long since I used Python on Windows that I forgot what binary means. Generally speaking, it means preserving \r and \n, which is only really necessary on Windows. So universal + binary is kind of a contradiction in terms (once you're using universal you aren't using binary; you're munging newlines). I have no idea why anyone would ever *have* to use binary, but if you really do, my solution doesn't make sense, since I believe the 'Urb' mode is equivalent to 'Ub' mode.
Jeffrey Harris