I need to loop until I hit the end of a file-like object, but I'm not finding an "obvious way to do it", which makes me suspect I'm overlooking something, well, obvious. :-)

I have a stream (in this case, it's a StringIO object, but I'm curious about the general case as well) which stores an unknown number of records in "<length><data>" format, e.g.:

data = StringIO("\x07\x00\x00\x00foobar\x00\x04\x00\x00\x00baz\x00")

Now, the only clear way I can imagine to read this is using (what I think of as) an initialized loop, which seems a little un-Pythonic:

len_name = data.read(4)

while len_name != "":
    len_name = struct.unpack("<I", len_name)[0]
    names.append(data.read(len_name))

    len_name = data.read(4)

In a C-like language, I'd just stick the read(4) in the while's test clause, but of course that won't work for Python. Any thoughts on a better way to accomplish this?

+6  A: 

Have you seen how to iterate over lines in a text file?

for line in file_obj:
  use(line)

You can do the same thing with your own generator:

def read_blocks(file_obj, size):
  while True:
    data = file_obj.read(size)
    if not data:
      break
    yield data

for block in read_blocks(file_obj, 4):
  use(block)
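Applied to the records in the question, read_blocks could supply the 4-byte headers while the loop body reads each payload from the same stream. This is only a sketch: it assumes Python 3 (so the stream is an io.BytesIO holding bytes), and the names stream, header and length are made up for illustration.

```python
import io
import struct

def read_blocks(file_obj, size):
    # Same generator as above, repeated so the sketch is self-contained.
    while True:
        data = file_obj.read(size)
        if not data:
            break
        yield data

stream = io.BytesIO(b"\x07\x00\x00\x00foobar\x00\x04\x00\x00\x00baz\x00")
names = []
for header in read_blocks(stream, 4):
    (length,) = struct.unpack("<I", header)  # little-endian unsigned 32-bit
    names.append(stream.read(length))        # reading the payload advances the stream
```

Because the generator and the loop body share one stream, each payload read leaves the position at the next header. A truncated header (fewer than 4 bytes) would make struct.unpack raise struct.error; the sketch does not handle that case.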

Roger Pate
You can also structure your loop as the while loop in the generator. Use whatever is most readable.
Roger Pate
+2  A: 

I see, as predicted, that the typical and most popular answers are using very specialized generators to "read 4 bytes at a time". Sometimes generality isn't any harder (and much more rewarding;-), so I've suggested instead the following very general solution:

import operator
def funlooper(afun, *a, **k):
  wearedone = k.pop('wearedone', operator.not_)
  while True:
    data = afun(*a, **k)
    if wearedone(data): break
    yield data

Now your desired loop header is just: for len_name in funlooper(data.read, 4):.

Edit: made much more general by the wearedone idiom since a comment accused my slightly less general previous version (hardcoding the exit test as if not data:) of having "a hidden dependency", of all things!-)
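As a rough illustration of what the wearedone hook allows, the sketch below stops either at end of stream or at a marker line. It assumes Python 3; the "END" marker, the at_end_marker predicate and the sample stream are all hypothetical.

```python
import io
import operator

def funlooper(afun, *a, **k):
    # Same generator as above, repeated so this sketch runs on its own.
    wearedone = k.pop('wearedone', operator.not_)
    while True:
        data = afun(*a, **k)
        if wearedone(data):
            break
        yield data

stream = io.StringIO("alpha\nbeta\nEND\ngamma\n")

def at_end_marker(line):
    # Hypothetical stop condition: end of stream or a literal "END" line.
    return not line or line.strip() == "END"

lines = list(funlooper(stream.readline, wearedone=at_end_marker))
```

One consequence of popping wearedone out of the keyword arguments is that the wrapped function cannot itself receive a keyword argument with that name.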

The usual swiss army knife of looping, itertools, is fine too, of course, as usual:

import itertools as it

for len_name in it.takewhile(bool, it.imap(data.read, it.repeat(4))): ...

or, quite equivalently:

import itertools as it

def loop(pred, fun, *args):
  return it.takewhile(pred, it.starmap(fun, it.repeat(args)))

for len_name in loop(bool, data.read, 4): ...
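For anyone on Python 3: itertools.imap no longer exists there, but the built-in map is lazy and can take its place, while the starmap version works unchanged. A minimal sketch, assuming Python 3 and a made-up ten-byte stream:

```python
import io
import itertools as it

data = io.BytesIO(b"abcdefghij")

# map() is lazy in Python 3, so it stands in for Python 2's itertools.imap.
# takewhile(bool, ...) stops at the first empty (falsy) read.
blocks = list(it.takewhile(bool, map(data.read, it.repeat(4))))
```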
Alex Martelli
Though there's a hidden dependency, as funlooper requires the function to return a non-true result to indicate the end.
Roger Pate
@R.Pate, you can of course trivially add to funlooper a `wearedone` predicate argument defaulting to `operator.not_` and change the `if` to `if wearedone(data): break` -- I just didn't think it worth further generalizing the answer with this trivial code when I was sure (and correct in that;-) that the other answers would be WAY excessively specialized (to no benefit). Ah well, since the excessive specialization is winning the day anyway, let me edit the answer to show that generality isn't any harder (and much more rewarding) in this case;-).
Alex Martelli
I think you misinterpreted me in the wrong direction: IMHO the original funlooper was *too* general. Since we're already depending on the return value having a specific form, it's reasonable here to depend on this part of the file-like interface (the read method), instead of trying to pass a generic callable. Failing that, the user must at least be aware of the dependency.
Roger Pate
"Boolean false" is not "a specific form" -- it's a very general one that many, many kinds of Python objects can satisfy. One of your two answers to this question (!), the non-accepted one that currently has more upvotes, has exactly the same length and general structure as my original one (per normal SO etiquette, I **don't** post many answers to one question!), so nothing is gained from its extreme specialization. (The other one of your answers, the accepted one with iter and sentinel, is more concise -- not quite as general as mine w/itertools, but simpler).
Alex Martelli
If you didn't agree that it's depending on a specific form of return value, why did you change the answer? What other way did you see a dependency? (I liked your answer better originally.)
Roger Pate
I changed the answer to make it, essentially, "universal", rather than _just_ "very general". If the clear, obvious choice of "true or false" is to be criticized as "a hidden dependency" (one of the most horrible, killing defects a software component could possibly have!), then clearly it's necessary to rub in the critics' face the obvious technical superiority of this solution -- by making obvious the (previously implied) equivalence to (e.g.) `filter` (which takes a `None` to mean "true or false, as obvious", or otherwise a predicate).
Alex Martelli
That comparison with filter is pretty much exactly my point: filter has an explicit value that indicates this behavior, instead of it being implied and, thus, hidden. Thanks for clarifying why you changed it; we'll just have to agree to disagree.
Roger Pate
Alex Martelli
+1  A: 

The EOF marker in Python is an empty string, so what you have is pretty close to the best you are going to get without writing a function to wrap this up in an iterator. It could be written in a slightly more Pythonic way by changing the while test like this:

while len_name:
    len_name = struct.unpack("<I", len_name)[0]
    names.append(data.read(len_name))
    len_name = data.read(4)
Tendayi Mawushe
This requires duplicating the assignment to len_name before the loop (which you left out), and it's almost always desired to avoid this duplication.
Roger Pate
+4  A: 

I prefer the already mentioned iterator-based solution to turn this into a for-loop. Another solution, written directly, is Knuth's "loop-and-a-half":

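while 1:
    len_name = data.read(4)
    if not len_name:
        break
    # Decode the 4-byte little-endian length before reading the payload.
    len_name = struct.unpack("<I", len_name)[0]
    names.append(data.read(len_name))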

You can see by comparison how that's easily hoisted into its own generator and used as a for-loop.
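One way that hoisting might look is sketched below. It assumes Python 3 (bytes and io.BytesIO rather than the StringIO from the question), and read_records is a made-up name.

```python
import io
import struct

def read_records(file_obj):
    """Yield the payload of each <length><data> record in file_obj."""
    while True:
        header = file_obj.read(4)
        if not header:  # an empty read means end of stream
            break
        (length,) = struct.unpack("<I", header)  # 4-byte little-endian length
        yield file_obj.read(length)

data = io.BytesIO(b"\x07\x00\x00\x00foobar\x00\x04\x00\x00\x00baz\x00")
names = []
for name in read_records(data):
    names.append(name)
```

The loop-and-a-half stays inside the generator, so the calling code is a plain for-loop with no break and no duplicated read.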

Andrew Dalke
In this particular case, I think I like the `iter()` solution better, but I feel quite foolish for not having thought of this. A well deserved +1 for you. ;-)
Ben Blank
Wow. Yeah, that iter() solution is nice. Combining it with a "lambda:" and depending on closures makes it a bit harder to understand, but it's sweet nonetheless.
Andrew Dalke
+7  A: 

You can combine iteration through iter() with a sentinel:

for block in iter(lambda: file_obj.read(4), ""):
  use(block)
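On Python 3 with a binary stream, the sentinel has to be b"" rather than "", since that is what read() returns at end of file. A minimal sketch under that assumption, using functools.partial in place of the lambda (the sample stream is made up):

```python
import io
from functools import partial

file_obj = io.BytesIO(b"abcdefghij")

# Binary streams return b"" at end of file, so the sentinel must be b"" too;
# with "" the comparison would never succeed and the loop would not terminate.
blocks = list(iter(partial(file_obj.read, 4), b""))
```

partial(file_obj.read, 4) builds the same zero-argument callable as the lambda without relying on a closure.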
Roger Pate
Definitely the best answer. You got me on this one; I had forgotten about this very useful sentinel.
e-satis
I think I like this one best as well; what it's doing is very clear because there's so little code. Thanks for the help!
Ben Blank
A: 

I'd go with Tendayi's suggestion re function and iterator for readability:

def read4():
    len_name = data.read(4)
    if len_name:
        len_name = struct.unpack("<I", len_name)[0]
        return data.read(len_name)
    else:
        raise StopIteration

for d in iter(read4, ''):
    names.append(d)
John Keyes
No reason, just something I put together quickly. I've modified the snippet.
John Keyes