Hello. I'm writing a simple crawler in Python using the threading and Queue modules. I fetch a page, check links and put them into a queue, when a certain thread has finished processing page, it grabs the next one from the queue. I'm using an array for the pages I've already visited to filter the links I add to the queue, but if there are more than one threads and they get the same links on different pages, they put duplicate links to the queue. So how can I find out whether some url is already in the queue to avoid putting it there again?
The way I solved this (actually I did this in Scala, not Python) was to use both a Set and a Queue, only adding links to the queue (and set) if they did not already exist in the set.
Both the set and queue were encapsulated in a single thread, exposing only a queue-like interface to the consumer threads.
Edit: someone else suggested SQLite and that is also something I am considering, if the set of visited URLs needs to grow large. (Currently each crawl is only a few hundred pages so it easily fits in memory.) But the database is something that can also be encapsulated within the set itself, so the consumer threads need not be aware of it.
SQLite is so simple to use and would fit perfectly... just a suggestion.
Also, instead of a set you might try using a dictionary. Operations on sets tend to get rather slow when they're big, whereas a dictionary lookup is nice and quick.
My 2c.
Why only use the array (ideally, a dictionary would be even better) to filter things you've already visited? Add things to your array/dictionary as soon as you queue them up, and only add them to the queue if they're not already in the array/dict. Then you have 3 simple separate things:
- Links not yet seen (neither in queue nor array/dict)
- Links scheduled to be visited (in both queue and array/dict)
- Links already visited (in array/dict, not in queue)
If you don't care about the order in which items are processed, I'd try a subclass of Queue
that uses set
internally:
class SetQueue(Queue):
def _init(self, maxsize):
self.maxsize = maxsize
self.queue = set()
def _put(self, item):
self.queue.add(item)
def _get(self):
return self.queue.pop()
As Paul McGuire pointed out, this would allow adding a duplicate item after it's been removed from the "to-be-processed" set and not yet added to the "processed" set. To solve this, you can store both sets in the Queue
instance, but since you are using the larger set for checking if the item has been processed, you can just as well go back to queue
which will order requests properly.
class SetQueue(Queue):
def _init(self, maxsize):
Queue._init(self, maxsize)
self.all_items = set()
def _put(self, item):
if item not in self.all_items:
Queue._put(self, item)
self.all_items.add(item)
The advantage of this, as opposed to using a set separately, is that the Queue
's methods are thread-safe, so that you don't need additional locking for checking the other set.
instead of "array of pages already visited" make an "array of pages already added to the queue"