Hi,

I have this short spider code:

class TestSpider(CrawlSpider):
    name = "test"
    allowed_domains = ["google.com", "yahoo.com"]
    start_urls = [
        "http://google.com"
    ]

    def parse2(self, response, i):
        print "page2, i: ", i
        # traceback.print_stack()


    def parse(self, response):
        for i in range(5):
            print "page1 i : ", i
            link = "http://www.google.com/search?q=" + str(i)
            yield Request(link, callback=lambda r:self.parse2(r, i))

and I would expect the output like this:

page1 i :  0
page1 i :  1
page1 i :  2
page1 i :  3
page1 i :  4

page2 i :  0
page2 i :  1
page2 i :  2
page2 i :  3
page2 i :  4

However, the actual output is this:

page1 i :  0
page1 i :  1
page1 i :  2
page1 i :  3
page1 i :  4

page2 i :  4
page2 i :  4
page2 i :  4
page2 i :  4
page2 i :  4

So the argument I pass in callback=lambda r:self.parse2(r, i) is somehow wrong.

What's wrong with the code?

A: 

The lambdas all access i through a closure, so they all reference the same variable and see whatever value i has in your parse function at the time the lambdas are called. Judging by your output, that is after the loop has finished, so every callback sees 4. A simpler reconstruction of the phenomenon is:

>>> def do(x):
...     for i in range(x):
...         yield lambda: i
... 
>>> delayed = list(do(3))
>>> for d in delayed:
...     print d()
... 
2
2
2

You can see that the i's in the lambdas are all bound to the variable i in the function do. They return whatever value it currently has, and Python keeps that scope alive as long as any of the lambdas are alive, so that the value is preserved for them. This is what's referred to as a closure.
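One way to see the shared binding directly is to look at the closure cells. This is a minimal sketch reusing do from above, written so that it also runs on Python 3; `__closure__` and `cell_contents` are standard function attributes, and the variable names `cells`, `same_cell` and `final_value` are just for illustration:

```python
def do(x):
    # Each lambda closes over the variable i of this one call to do().
    for i in range(x):
        yield lambda: i

delayed = list(do(3))

# Every lambda holds a reference to the very same closure cell.
cells = [d.__closure__[0] for d in delayed]
same_cell = all(c is cells[0] for c in cells)

# The cell holds the last value the loop assigned to i.
final_value = cells[0].cell_contents
results = [d() for d in delayed]

print(same_cell)    # True
print(final_value)  # 2
print(results)      # [2, 2, 2]
```

Since there is only one cell, there is only one value for all three lambdas to return.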

A simple but ugly workaround is:

>>> def do(x):
...     for i in range(x):
...         yield lambda i=i: i
... 
>>> delayed = list(do(3))
>>> for d in delayed:
...     print d()
... 
0
1
2

This works because, in the loop, the current value of i is bound to the parameter i of the lambda as its default value. Alternatively (and maybe a little bit clearer): lambda r, x=i: (r, x). The important part is that the default value is evaluated when the lambda is created, outside the body of the lambda (which is only executed later), so you are binding a variable to the current value of i instead of the value that it takes at the end of the loop. This way the lambdas are not closed over i and each can have its own value.
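To check that the value really is stored on each lambda rather than looked up later, you can inspect `__defaults__` and `__closure__`, both standard function attributes. A small sketch, again written to run on Python 3 as well, with illustrative variable names:

```python
def do(x):
    for i in range(x):
        # i=i evaluates i right now and stores the result as the default.
        yield lambda i=i: i

delayed = list(do(3))

# Each lambda carries its own stored default, and no closure is involved.
defaults = [d.__defaults__ for d in delayed]
closures = [d.__closure__ for d in delayed]
results = [d() for d in delayed]

print(defaults)  # [(0,), (1,), (2,)]
print(closures)  # [None, None, None]
print(results)   # [0, 1, 2]
```

One caveat: because i is now an ordinary parameter, a caller that passes an extra argument will override the stored value.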

So all you need to do is change the line

yield Request(link, callback=lambda r:self.parse2(r, i))

to

yield Request(link, callback=lambda r, i=i:self.parse2(r, i))

and you're cherry.
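If the default-parameter trick feels too subtle, `functools.partial` from the standard library is another way to freeze the current value. This is not what the fix above uses, just an alternative sketch: `parse2` here is a stand-in plain function rather than the real spider method, and the string "resp" is a placeholder for a response object.

```python
from functools import partial

def parse2(response, i):
    # Stand-in for the spider's parse2; returns a string instead of printing.
    return "page2 i: %d (%s)" % (i, response)

callbacks = []
for i in range(5):
    # partial stores the current value of i as a keyword argument.
    callbacks.append(partial(parse2, i=i))

# Later, something calls each callback with only the response.
results = [cb("resp") for cb in callbacks]
print(results)
```

In the spider this would presumably become callback=partial(self.parse2, i=i); I have not verified that against Scrapy here.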

aaronasterling
@Aaron: <curious> What would you do instead of `i=i` if the lambda didn't take any arguments, but merely used `i` in its body?
Manoj Govindan
@Manoj. It creates a default parameter that gets bound at the time that the lambda is created. This is one way to get an explicit binding of `i` in the body of the lambda to the value of `i` at the time that the lambda is created instead of its final value. It just removes the need for a closure altogether. Your method gives each lambda its own closure over the value of `value` in `make_f`.
aaronasterling
@Aaron: Understood that. I was wondering about the case if it were `lambda: foo(i)` to start with.
Manoj Govindan
@Manoj, sorry, misread your question. Not at my best. This is the same as my first example. It would be `lambda i=i: foo(i)`. Just stick the parameter on and, because it's a default, the caller doesn't need to know or care about it.
aaronasterling
@Aaron: No problemo. The default parameter makes sense. Thanks.
Manoj Govindan
A: 

lambda r:self.parse2(r, i) binds the variable name i, not the value of i. Later, when the lambda is evaluated, the current value of i in the closure (i.e. the last value of i) is used. This can be easily demonstrated:

>>> def make_funcs():
    funcs = []
    for x in range(5):
        funcs.append(lambda: x)
    return funcs

>>> f = make_funcs()
>>> f[0]()
4
>>> f[1]()
4
>>> 

Here make_funcs is a function that returns a list of functions, each bound to x. You'd expect the functions, when called, to return the values 0 to 4 respectively. And yet they all return 4 instead.

All is not lost however. There is a solution(s?).

>>> def make_f(value):
    def _func():
        return value
    return _func

>>> def make_funcs():
    funcs = []
    for x in range(5):
        funcs.append(make_f(x))
    return funcs

>>> f = make_funcs()
>>> f[0]()
0
>>> f[1]()
1
>>> f[4]()
4
>>> 

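I am using an explicit, named function here instead of lambda, but that is not what makes the difference. What matters is that each call to make_f creates a new scope in which value is bound to the current value of x, so each _func closes over its own variable rather than over the shared loop variable. Consequently the individual functions behave as expected.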
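For what it's worth, the same factory pattern also works with a lambda inside make_f, since it is the fresh scope per call that does the work. A minimal sketch, written to run on Python 3 as well; `distinct_cells` is just an illustrative name:

```python
def make_f(value):
    # value is a fresh local variable on every call to make_f,
    # so each returned lambda closes over its own variable.
    return lambda: value

def make_funcs():
    funcs = []
    for x in range(5):
        funcs.append(make_f(x))
    return funcs

f = make_funcs()
results = [g() for g in f]

# Count the closure cells: one per function, none shared.
distinct_cells = len(set(id(g.__closure__[0]) for g in f))

print(results)         # [0, 1, 2, 3, 4]
print(distinct_cells)  # 5
```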

I see that @Aaron has given you an answer for changing your lambda. Stick with that and you'll be good to go :)

Manoj Govindan