ansaurus

Question

Answer 1

+2 A:

I think you are confused about what "for x in file" does. Assuming you got your handle with "file = open(file_name)", byte in this case will be an entire line, not a single character. So you are only calling yield when the entire line consists of a single carriage return. Try renaming "byte" to "line" and iterating over that with a second loop.
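
To illustrate the point, here is a minimal sketch. It uses an in-memory io.BytesIO as a stand-in for the real file, and the sample bytes are made up. Iterating over a binary file object splits on '\n' only, so '\r'-terminated data comes back as one big chunk, and a second loop is needed to reach individual bytes.

```python
import io

# Hypothetical sample data: three records terminated by carriage returns.
sample = io.BytesIO(b"one\rtwo\rthree\r")

# Iterating a binary file object splits on b"\n" only, so the whole
# buffer comes back as a single "line".
chunks = list(sample)

# A second loop is what actually visits one byte at a time. Slicing is
# used so that each item is a length-1 bytes object on Python 3 as well.
single_bytes = [chunk[i:i + 1] for chunk in chunks for i in range(len(chunk))]
```

Here chunks holds exactly one element, the entire buffer, which is why a comparison against a lone carriage return never matches in the outer loop.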

danben 2009-12-17 22:31:32

Answer 2

A:

Edit:

String concatenation with string1 += string2 is slow. Try joining a list of strings instead.
ddaa is right: you shouldn't need the struct package if the binary file only contains ASCII. Also, my generator now returns the string after the final '\r', before EOF. With these two minor fixes, my code is suspiciously similar (practically identical) to this more recent answer.

Code snip:

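def LineFeed(f):
    # f must be opened in binary mode. The bytes literals below behave
    # the same on Python 2.6+ and Python 3.
    ret = []
    while True:
        oneByte = f.read(1)
        if not oneByte: break
        # Return everything up to, but not including the carriage return
        if oneByte == b'\r':
            yield b''.join(ret)
            ret = []
        else:
            ret.append(oneByte)
    # Yield whatever follows the final carriage return, if anything.
    # (Testing oneByte here would never fire: it is always empty at EOF.)
    if ret:
        yield b''.join(ret)
if __name__ == '__main__':
    lf = LineFeed( open('filename','rb') )

    for something in lf:
        doSomething(something)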

Pete 2009-12-17 22:35:35

There is no point in using struct here. Also, the function has a bug: it discards any text after the final \r.

ddaa 2009-12-17 23:32:03

@ddaa: Good points; fixed!

Pete 2009-12-18 00:19:02

Answer 3

A:

So, your problem is iterating over the lines of a file open in binary mode that use '\r' as a line separator. Since the file is in binary mode, you cannot use the universal newline feature, and it turns out that '\r' is not interpreted as a line separator in binary mode.

Reading a file char by char is a terribly inefficient thing to do in Python, but here's how you could iterate over your lines:

def cr_lines(the_file):
    line = []
    while True:
        byte = the_file.read(1)
        if not byte:
            break
        line.append(byte)
        if byte == '\r':
            yield ''.join(line)
            line = []
    if line:
        yield ''.join(line)

To be more efficient, you would need to read bigger chunks of text and handle buffering in your iterator. Keep in mind that you could get strange bugs if you seek while iterating; preventing those bugs would require a subclass of file so that the buffer can be purged on seek.

Note the use of the ''.join(line) idiom. Accumulating a string with += has terrible performance and is a common mistake made by beginning programmers.
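
The chunked approach mentioned above could look roughly like the following. This is only a sketch under a few assumptions: the name cr_lines_chunked and the chunk_size parameter are made up for illustration, the file is opened in binary mode, and nothing seeks on the file while the generator is running (the seek problem described above is not handled). Bytes literals are used so that it runs on Python 3 as well.

```python
def cr_lines_chunked(the_file, chunk_size=4096):
    # Hypothetical helper: yields lines terminated by b"\r", keeping the
    # terminator, like the byte-by-byte version but with far fewer reads.
    parts = []  # pieces of the current, still unterminated line
    while True:
        chunk = the_file.read(chunk_size)
        if not chunk:
            break
        pieces = chunk.split(b"\r")
        # Every piece except the last was followed by b"\r" in this chunk.
        for piece in pieces[:-1]:
            parts.append(piece)
            parts.append(b"\r")
            yield b"".join(parts)
            parts = []
        # The last piece has no terminator yet; keep it for the next chunk.
        if pieces[-1]:
            parts.append(pieces[-1])
    # Anything left after the final carriage return is the last line.
    if parts:
        yield b"".join(parts)
```

Pieces of an unfinished line are collected in a list and joined once, so a line that spans many chunks does not fall back into the += pattern.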

ddaa 2009-12-17 23:06:37

Why do you say you can't use the universal newline? Universal newline is incompatible with writing, but open('cr_terminated.bin', 'Urb') works fine.

Jeffrey Harris 2009-12-18 02:18:52

Because the OP states he needs to open the file in binary mode and the documentation says: "supplying 'U' opens the file as a text file". http://docs.python.org/library/functions.html#open

ddaa 2009-12-18 08:15:27

Answer 4

+2 A:

Perhaps if you were to explain what this file represents, why it has lots of '\x00', why you think you need to read it in binary mode, we could help you with your underlying problem.

Otherwise, try the following code; it avoids any dependence on (or interference from) your operating system's line-ending convention.

# Read the whole file as bytes and split on carriage returns.
with open("the_file", "rb") as f:
    lines = f.read().split(b"\r")
for line in lines:
    process(line)

Edit: the ASCII NUL (not "NULL") byte is "\x00".

John Machin 2009-12-17 23:14:46

+1 A sane answer to an absurd question.

hughdbrown 2009-12-18 00:39:28

Answer 5

+1 A:

If you're in control of how you open the file, I'd recommend opening it with universal newlines, since \r isn't recognized as a line terminator if you just use 'rb' mode, but it is if you use 'Urb'.

This will only work if you aren't including \n as well as \r in your binary file somewhere, since the distinction between \r and \n is lost when using universal newlines.

Assuming you want your yielded lines to still be \r terminated:

NUL = '\x00'
def lines_without_nulls(path):
    with open(path, 'Urb') as f:
        for line in f:
            yield line.replace(NUL, '').replace('\n', '\r')
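
The caveat about mixing \n and \r can be seen in a small sketch. It is written in Python 3 terms, where universal newlines are requested with newline=None on a text stream rather than with the 'U' flag used above. The sample bytes and the in-memory stream are assumptions made for illustration.

```python
import io

# Made-up sample that mixes all three line-ending conventions.
raw = io.BytesIO(b"a\rb\nc\r\nd")

# newline=None turns on universal newlines: "\r", "\n" and "\r\n" are
# all translated to "\n" while reading.
text = io.TextIOWrapper(raw, encoding="latin-1", newline=None)
lines = list(text)

# Mapping "\n" back to "\r" cannot tell which terminator was originally
# there, so the original "\n" and "\r\n" both come back as "\r".
restored = [line.replace("\n", "\r") for line in lines]
```

After translation every line ends in "\n" regardless of what was in the file, which is why this approach is only safe when the data never contains "\n" on its own.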

Jeffrey Harris 2009-12-18 02:36:13

Note that it's been so long since I used Python on Windows that I forgot what binary means. Generally speaking it means preserving \r and \n, which is only really necessary on Windows. So universal + binary is kind of a contradiction in terms (once you're using universal you aren't using binary, you're munging newlines). I have no idea why anyone ever would *have* to use binary, but if you really do, my solution doesn't make sense, since I believe the 'Urb' mode is equivalent to 'Ub' mode.

Jeffrey Harris 2010-01-12 21:26:15
