views:

878

answers:

4

In Python the interface of an iterable is a subset of the iterator interface. This has the advantage that in many cases they can be treated in the same way. However, there is an important semantic difference between the two, since for an iterable __iter__ returns a new iterator object and not just self. How can I test that an iterable is really an iterable and not an iterator? Conceptually I understand iterables to be collections, while an iterator only manages the iteration (i.e. keeps track of the position) but is not a collection itself.

The difference is for example important when one wants to loop multiple times. If an iterator is given then the second loop will not work since the iterator was already used up and directly raises StopIteration.

It is tempting to test for a next method, but this seems dangerous and somehow wrong. Should I just check that the second loop was empty?

Is there any way to do such a test in a more pythonic way? I know that this sound like a classic case of LBYL against EAFP, so maybe I should just give up? Or am I missing something?

Edit: S.Lott says in his answer below that this is primarily a problem of wanting to do multiple passes over the iterator, and that one should not do this in the first place. However, in my case the data is very large and depending on the situation has to be passed over multiple times for data processing (there is absolutely no way around this).

The iterable is also provided by the user, and for situations where a single pass is enough it will work with an iterator (e.g. created by a generator for simplicity). But it would be nice to safeguard against the case were a user provides only an iterator when multiple passes are needed.

Edit 2: Actually this is a very nice Example for Abstract Base Classes. The __iter__ methods in an iterator and an iterable have the same name but are sematically different! So hasattr is useless, but isinstance provides a clean solution.

+7  A: 

" However, there is an important semantic difference between the two..."

Not really semantic or important. They're both iterable -- they both work with a for statement.

"The difference is for example important when one wants to loop multiple times."

When does this ever come up? You'll have to be more specific. In the rare cases when you need to make two passes through an iterable collection, there are often better algorithms.

For example, let's say you're processing a list. You can iterate through a list all you want. Why did you get tangled up with an iterator instead of the iterable? Okay that didn't work.

Okay, here's one. You're reading a file in two passes, and you need to know how to reset the iterable. In this case, it's a file, and seek is required; or a close and a reopen. That feels icky. You can readlines to get a list which allows two passes with no complexity. So that's not necessary.

Wait, what if we have a file so big we can't read it all into memory? And, for obscure reasons, we can't seek, either. What then?

Now, we're down to the nitty-gritty of two passes. On the first pass, we accumulated something. An index or a summary or something. An index has all the file's data. A summary, often, is a restructuring of the data. With a small change from "summary" to "restructure", we've preserved the file's data in the new structure. In both cases, we don't need the file -- we can use the index or the summary.

All "two-pass" algorithms can be changed to one pass of the original iterator or iterable and a second pass of a different data structure.

This is neither LYBL or EAFP. This is algorithm design. You don't need to reset an iterator -- YAGNI.


Edit

Here's an example of an iterator/iterable issue. It's simply a poorly-designed algorithm.

it = iter(xrange(3))
for i in it: print i,; #prints 1,2,3 
for i in it: print i,; #prints nothing

This is trivially fixed.

it = range(3)
for i in it: print i
for i in it: print i

The "multiple times in parallel" is trivially fixed. Write an API that requires an iterable. And when someone refuses to read the API documentation or refuses to follow it after having read it, their stuff breaks. As it should.

The "nice to safeguard against the case were a user provides only an iterator when multiple passes are needed" are both examples of insane people writing code that breaks our simple API.

If someone is insane enough to read most (but not all of the API doc) and provide an iterator when an iterable was required, you need to find this person and teach them (1) how to read all the API documentation and (2) follow the API documentation.

The "safeguard" issue isn't very realistic. These crazy programmers are remarkably rare. And in the few cases when it does arise, you know who they are and can help them.


Edit 2

The "we have to read the same structure multiple times" algorithms are a fundamental problem.

Do not do this.

for element in someBigIterable:
    function1( element )
for element in someBigIterable:
    function2( element )
...

Do this, instead.

for element in someBigIterable:
    function1( element )
    function2( element )
    ...

Or, consider something like this.

for element in someBigIterable:
    for f in ( function1, function2, function3, ... ):
        f( element )

In most cases, this kind of "pivot" of your algorithms results in a program that might be easier to optimize and might be a net improvement in performance.

S.Lott
What about multiple times in parallel? E.g. several threads iterating over the same collection? Or even one thread, such as an easily-imagined naive implementation of "does this collection have the same element twice?".
Edmund
Thanks, I added an explanation to the question. You have a valid point, but in my case I belief this does not work.
nikow
"remarkably rare". I'd disagree, programmers that can't tell iterable from iterator are not by any means rare."you know who they are and can help them." That's usually not your job, and in corporation "helping them" would not be very well perceived, especially if it's another department.
vartec
@vartec: it's your application/library/framework, you need to support it. Helping the crazy programmers who refuse to read the API and can't figure out why it broke when they didn't follow the rules is support as I understand it. It *is* well perceived in my experience.
S.Lott
@S.Lott: In a perfect world you're right. In corporative politics saying, that the other department's code isn't correct results in conflict. If the other dept. has more political influence, your help will be perceived as "trying to cover incompetence". And it doesn't mater if your right or wrong.
vartec
+1 for your edit. I was teached, this is programming by contract. If one party doesn't comply, the other party doesn't need to comply, too.
unbeknown
@S.Lott: In this context, may I call upon your attention to this question: http://stackoverflow.com/questions/701088/py3k-memory-conservation-by-returning-iterators-rather-than-lists Thanks!
Lakshman Prasad
@becomingGuru: preoccupation with memory management can become silly. My point is that many "2-pass" algorithms do significant data reduction on the first pass; the second pass is not necessary because it's working on a smaller data structure.
S.Lott
@vartec: if your organization's corporate politics are so dysfunctional that help == conflict, you should find a better organization to work for. Writing useless code to work around organizational problems is an epic fail waiting to happen.
S.Lott
@heikogerlach: more importantly, if one party won't comply, the other party can't coerce compliance. If they won't comply, they wrote the bug; you can't fix their refusal to comply.
S.Lott
@S.Lott: I already did. But as far as I know it's pretty much the same in most big corporations. Big bureaucracies are always inefficient.
vartec
@vartec: Over the last 30 years, I've never worked at a place where helping someone become conflict or was perceived badly. An API contract has never been a problem in 100's of locations. Convoluted code to help the crazy programmers who can't follow the API is -- simply -- bad.
S.Lott
+1 for addressing the question of 2 passes over a large data at such length. I agree - if it seems like you need to iterate over the exact same, unchanged, entire data twice, there's a design issue that needs to be addressed.
Jarret Hardie
+10  A: 
'iterator' if obj is iter(obj) else 'iterable'
vartec
Wow, this seems to be the answer that I have been looking for, thanks! I will wait a little before accepting it, in case somebody can point out a problem with this.
nikow
Well, the problem is one "wasted" call to obj.__iter__(), but I don't see other reliable way to do it.
vartec
Although I don't know a counter-example, this is not *guaranteed* to work.
ΤΖΩΤΖΙΟΥ
@ΤΖΩΤΖΙΟΥ: well you could imagine objects, that doesn't have .next(), but has __iter__(self) = lambda x: x
vartec
@ΤΖΩΤΖΙΟΥ: but then again, what would be point of such object?
vartec
I never said anything about objects not having .next. Your premise is `iter(obj) is obj`, which AFAIK is true, but it's not guaranteed.
ΤΖΩΤΖΙΟΥ
A: 

Because of Python's duck typing,

Any object is iterable if it defines the next() and __iter__() method returns itself.

If the object itself doesnt have the next() method, the __iter__() can return any object, that has a next() method

You could refer this question to see Iterability in Python

Lakshman Prasad
Try this: class A(object): def __iter__(self): return iter([1,2,3]) def next(self): yield 7
vartec
Actually this is a problem of duck typing: it can hide a semantic / conceptual difference. It allows us to write for i in range(3) instead of for i in iter(range(3)), but can cause subtle problems.
nikow
Sorry, I did not exactly get the point? Something wrong?
Lakshman Prasad
+1  A: 
import itertools

def process(iterable):
    work_iter, backup_iter= itertools.tee(iterable)

    for item in work_iter:
        # bla bla
        if need_to_startover():
            for another_item in backup_iter:

That damn time machine that Raymond borrowed from Guido…

ΤΖΩΤΖΙΟΥ