Dear everyone, I am using Scrapy for scraping, and I decided to write my own scheduler middleware that stores some requests elsewhere, to reduce how much the scheduler holds in memory.

Here is my code:

def enqueue_request_into_scheduler(self, spider, request):
    print "ENQUEUE SCHEDULER with request %s" % str(request)
    scrapyengine.scheduler.enqueue_request(spider, request)


def enqueue_request(self, spider, request):
    size = self.check_size(spider)
    self.enqueue_request_into_buffer(spider, request)

    if size >= SCHEDULER_PER_DOMAIN_UPPER_THRESHOLD:
        if len(self.buffer[spider]) >= BUFFER_LIMIT:
            self.flush_buffer(spider)
        # Note that if there's too much then return None
        return None

    elif size <= SCHEDULER_PER_DOMAIN_LOWER_THRESHOLD:
        # Flush buffer
        for buffered_request in self.buffer[spider]:
            print "CHECK RECURSION"
            self.enqueue_request_into_scheduler(spider, buffered_request)

        # Enqueue one file into scheduler
        for filed_request in self.unfile_requests(spider):
            self.enqueue_request_into_scheduler(spider, filed_request)

    else:   # Anyway if within the boundaries, keep going
        pass

When I run it, my log keeps printing:

ENQUEUE SCHEDULER with request <http://product.pconline.com.cn/dc/sony/300872.html>
ENQUEUE SCHEDULER with request <http://product.pconline.com.cn/dc/sony/300872.html>
CHECK RECURSION
ENQUEUE SCHEDULER with request <http://product.pconline.com.cn/dc/sony/300872.html>
ENQUEUE SCHEDULER with request <http://product.pconline.com.cn/dc/sony/300872.html>
CHECK RECURSION
ENQUEUE SCHEDULER with request <http://product.pconline.com.cn/dc/sony/300872.html>
ENQUEUE SCHEDULER with request <http://product.pconline.com.cn/dc/sony/300872.html>
CHECK RECURSION
ENQUEUE SCHEDULER with request <http://product.pconline.com.cn/dc/sony/300872.html>
ENQUEUE SCHEDULER with request <http://product.pconline.com.cn/dc/sony/300872.html>
CHECK RECURSION
ENQUEUE SCHEDULER with request <http://product.pconline.com.cn/dc/sony/300872.html>

Any help as to why this happens is appreciated. I tried to read the source code for the scheduler's enqueue_request: it calls the middleware's enqueue_request, and from there on I am completely lost in the code (I am not that familiar with Twisted).
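
To make my suspicion concrete, here is a minimal stand-alone sketch with no Scrapy in it. ToyScheduler, ToyBufferMiddleware, the guard flag and run are all made-up names, and the sketch assumes (from my reading of the source, which may be wrong) that the scheduler calls the middleware's enqueue_request on every call, including the calls my middleware makes while flushing. Under that assumption, flushing re-enters the middleware until the recursion limit is hit, unless the middleware empties its buffer and ignores re-entrant calls:

```python
class ToyScheduler(object):
    """Made-up stand-in for the scheduler; not Scrapy's real class."""

    def __init__(self, middleware):
        self.middleware = middleware
        self.queue = []

    def enqueue_request(self, spider, request):
        # Assumption: the middleware hook runs on every call, including
        # calls that the middleware itself makes while flushing.
        self.middleware.enqueue_request(self, spider, request)
        self.queue.append(request)


class ToyBufferMiddleware(object):
    """Made-up middleware that flushes previously buffered requests."""

    def __init__(self, buffered, guard):
        self.buffer = list(buffered)
        self.guard = guard        # hypothetical switch: True enables the fix
        self.flushing = False

    def enqueue_request(self, scheduler, spider, request):
        if self.guard and self.flushing:
            # Re-entered from our own flush: do nothing, let it be queued.
            return
        pending = self.buffer
        if self.guard:
            # Empty the buffer before re-enqueueing its contents.
            self.buffer = []
        self.flushing = True
        try:
            for buffered_request in pending:
                scheduler.enqueue_request(spider, buffered_request)
        finally:
            self.flushing = False


def run(guard):
    """Enqueue one request; return the queue, or None on runaway recursion."""
    middleware = ToyBufferMiddleware(["a", "b"], guard)
    scheduler = ToyScheduler(middleware)
    try:
        scheduler.enqueue_request("spider", "c")
    except RuntimeError:
        # On Python 3, RecursionError is a subclass of RuntimeError.
        return None
    return scheduler.queue
```

In this toy, run(False) ends in the recursion error, which looks like what my log shows, while run(True) queues the two buffered requests and then the new one. Whether the real scheduler behaves like ToyScheduler is exactly the part I am unsure about.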