Dear everyone, I am using Scrapy for scraping. I decided to write my own scheduler middleware that stores some requests, to reduce how many requests are held in memory.
Here is my code:
def enqueue_request_into_scheduler(self, spider, request):
    print "ENQUEUE SCHEDULER with request %s" % str(request)
    scrapyengine.scheduler.enqueue_request(spider, request)

def enqueue_request(self, spider, request):
    size = self.check_size(spider)
    self.enqueue_request_into_buffer(spider, request)
    if size >= SCHEDULER_PER_DOMAIN_UPPER_THRESHOLD:
        if len(self.buffer[spider]) >= BUFFER_LIMIT:
            self.flush_buffer(spider)
        # Note that if there's too much then return None
        return None
    elif size <= SCHEDULER_PER_DOMAIN_LOWER_THRESHOLD:
        # Flush buffer
        for buffered_request in self.buffer[spider]:
            print "CHECK RECURSION"
            self.enqueue_request_into_scheduler(spider, buffered_request)
        # Enqueue one file into scheduler
        for filed_request in self.unfile_requests(spider):
            self.enqueue_request_into_scheduler(spider, filed_request)
    else:  # Anyway if within the boundaries, keep going
        pass
When running, my log keeps printing:
ENQUEUE SCHEDULER with request <http://product.pconline.com.cn/dc/sony/300872.html>
ENQUEUE SCHEDULER with request <http://product.pconline.com.cn/dc/sony/300872.html>
CHECK RECURSION
ENQUEUE SCHEDULER with request <http://product.pconline.com.cn/dc/sony/300872.html>
ENQUEUE SCHEDULER with request <http://product.pconline.com.cn/dc/sony/300872.html>
CHECK RECURSION
ENQUEUE SCHEDULER with request <http://product.pconline.com.cn/dc/sony/300872.html>
ENQUEUE SCHEDULER with request <http://product.pconline.com.cn/dc/sony/300872.html>
CHECK RECURSION
ENQUEUE SCHEDULER with request <http://product.pconline.com.cn/dc/sony/300872.html>
ENQUEUE SCHEDULER with request <http://product.pconline.com.cn/dc/sony/300872.html>
CHECK RECURSION
ENQUEUE SCHEDULER with request <http://product.pconline.com.cn/dc/sony/300872.html>
Any help explaining why this happens is appreciated. I tried reading the source code for the scheduler's enqueue_request: it calls the middleware's enqueue_request, but from there on I am completely lost in the code (I am not familiar enough with Twisted).
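
To make the suspected cycle concrete, here is a minimal sketch that reproduces it without Scrapy. ToyScheduler and ToyMiddleware are made-up stand-ins, not Scrapy's real classes, and the max_depth cut-off exists only so the demo terminates. It assumes, as described above, that the scheduler's enqueue_request calls the middleware's enqueue_request, which then calls the scheduler again.

```python
# Toy stand-ins (hypothetical, NOT Scrapy's API). They only mimic the cycle:
# scheduler.enqueue_request -> middleware.enqueue_request -> scheduler.enqueue_request


class ToyScheduler(object):
    def __init__(self):
        self.middleware = None
        self.queue = []

    def enqueue_request(self, spider, request):
        # Assumption: the scheduler consults the middleware before queueing.
        self.middleware.enqueue_request(spider, request)
        self.queue.append(request)


class ToyMiddleware(object):
    def __init__(self, scheduler, max_depth):
        self.scheduler = scheduler
        self.max_depth = max_depth  # artificial cut-off so the demo ends
        self.depth = 0
        self.log = []

    def enqueue_request(self, spider, request):
        self.depth += 1
        self.log.append("ENQUEUE SCHEDULER with request %s" % request)
        if self.depth >= self.max_depth:
            self.log.append("stopped at depth %d" % self.depth)
            return
        # Same pattern as enqueue_request_into_scheduler in my code:
        # handing the request back to the scheduler re-enters this method.
        self.scheduler.enqueue_request(spider, request)


scheduler = ToyScheduler()
middleware = ToyMiddleware(scheduler, max_depth=4)
scheduler.middleware = middleware

scheduler.enqueue_request("spider", "http://example.com/page")
for line in middleware.log:
    print(line)
```

In this toy version a single request re-enters the middleware on every pass until the cut-off is hit, and it ends up queued once per pass. Whether the real scheduler behaves the same way is exactly what I am unsure about.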