ansaurus

Question

How to check if a task is already in a Python Queue?

Answer 1

+1 A:

The way I solved this (actually I did this in Scala, not Python) was to use both a Set and a Queue, only adding links to the queue (and set) if they did not already exist in the set.

Both the set and queue were encapsulated in a single thread, exposing only a queue-like interface to the consumer threads.
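A minimal Python sketch of that arrangement, assuming one dispatcher thread owns the set and hands unseen links to a work queue. The names `start_dispatcher`, `inbox`, `work` and `_STOP` are made up for illustration and are not from the Scala original.

```python
import queue
import threading

_STOP = object()  # hypothetical sentinel telling the dispatcher to exit


def start_dispatcher(inbox, work):
    """Forward each link from `inbox` to `work` the first time it is seen."""
    def run():
        seen = set()  # touched only by this thread, so it needs no lock
        while True:
            link = inbox.get()
            if link is _STOP:
                break
            if link not in seen:
                seen.add(link)
                work.put(link)

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


inbox, work = queue.Queue(), queue.Queue()
dispatcher = start_dispatcher(inbox, work)

# Producers may submit the same link more than once.
for link in ["/a", "/b", "/a", "/c", "/b"]:
    inbox.put(link)
inbox.put(_STOP)
dispatcher.join()

# Consumers only ever see the work queue, which holds each link once.
unique_links = [work.get() for _ in range(work.qsize())]
```

Consumers never touch the set, so swapping it for a database later would not change their code.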

Edit: someone else suggested SQLite and that is also something I am considering, if the set of visited URLs needs to grow large. (Currently each crawl is only a few hundred pages so it easily fits in memory.) But the database is something that can also be encapsulated within the set itself, so the consumer threads need not be aware of it.

Ben James 2009-10-17 10:26:06

Answer 2

+2 A:

SQLite is so simple to use and would fit perfectly... just a suggestion.
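A rough sketch of how that could look with the standard `sqlite3` module. The table name `seen` and the helper `mark_seen` are invented for this example. It uses an in-memory database; passing a file path to `sqlite3.connect` would keep the data on disk instead.

```python
import sqlite3

# ":memory:" keeps everything in RAM; a file path would persist it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")


def mark_seen(conn, url):
    """Record `url`; return True if it was new, False if already recorded."""
    # The PRIMARY KEY makes a repeated insert a conflict, which
    # INSERT OR IGNORE skips instead of raising an error.
    cur = conn.execute("INSERT OR IGNORE INTO seen (url) VALUES (?)", (url,))
    conn.commit()
    return cur.rowcount == 1


results = [mark_seen(conn, u) for u in ["/a", "/b", "/a"]]
```

Only URLs for which `mark_seen` returns True would then be put on the queue. A `sqlite3` connection is by default restricted to the thread that created it, so this sketch assumes a single thread does the checking.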

jldupont 2009-10-17 10:27:08

With the added advantage of giving you persistence if you choose to use an on-disk database. If you hit an unhandled exception, you can fix the error and continue where you left off.

gnibbler 2009-10-17 11:24:48

Answer 3

A:

Also, instead of a set you might try using a dictionary. Operations on sets tend to get rather slow when they're big, whereas a dictionary lookup is nice and quick.

My 2c.

sam 2009-10-17 10:32:23

This is incorrect: the `set` type is a hash table, just like the `dict` type.

Lukáš Lalinský 2009-10-17 10:48:45

Answer 4

+1 A:

use:

url in q.queue

which returns True iff url is in the queue
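A small sketch of what this check sees, assuming Python 3's `queue` module. `q.queue` is the underlying `deque`, and holding `q.mutex` while reading it is a precaution added here, not part of the one-liner above. The URL is a placeholder.

```python
import queue

q = queue.Queue()
q.put("http://example.com/a")

with q.mutex:  # hold the queue's own lock while peeking at its internals
    queued_before = "http://example.com/a" in q.queue

q.get()  # a worker takes the URL off the queue

with q.mutex:
    queued_after = "http://example.com/a" in q.queue
```

The check only covers items still waiting: once a worker has taken the URL, it reports False again.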

Guy 2009-10-17 10:34:52

Which doesn't help if it's been dequeued and processed already.

S.Lott 2009-10-17 12:34:07

Answer 5

+1 A:

Why only use the array (ideally, a dictionary would be even better) to filter things you've already visited? Add things to your array/dictionary as soon as you queue them up, and only add them to the queue if they're not already in the array/dict. Then you have 3 simple separate things:

Links not yet seen (neither in queue nor array/dict)
Links scheduled to be visited (in both queue and array/dict)
Links already visited (in array/dict, not in queue)
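The three states above can be sketched like this, single-threaded for clarity. The names `seen`, `pending`, `schedule` and `state` are illustrative only, and a plain `deque` stands in for the queue.

```python
from collections import deque

seen = set()       # everything ever scheduled
pending = deque()  # scheduled but not yet visited


def schedule(link):
    """Queue `link` unless it has been scheduled before."""
    if link in seen:
        return False
    seen.add(link)
    pending.append(link)
    return True


def state(link):
    """Classify `link` into one of the three states."""
    if link not in seen:
        return "not yet seen"
    if link in pending:
        return "scheduled"
    return "visited"


schedule("/a")
state_when_queued = state("/a")
pending.popleft()                  # a worker visits "/a"
state_when_done = state("/a")
state_unknown = state("/b")
rescheduled = schedule("/a")       # refused: "/a" is still in `seen`
```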

Amber 2009-10-17 10:36:59

It is important to keep the list of all previously-queued entries (I'd use a set, not a list, not sure what @sam's problem is with set). If you just search the queue for duplicates, you may reprocess an entry that was previously queued and *already* processed, thus removed from the queue.

Paul McGuire 2009-10-17 11:23:43

Yes, my answer assumed a second data structure in addition to the queue (hence things like 'in both queue and array/dict' and 'in array/dict, not in queue'). You add items to the 'seen' data structure before you queue them. You don't search the queue, you search your 'seen' array. By definition anything in the 'seen' array is either in the queue or already visited; neither of those cases needs to be queued again. The main trick is making sure that the check-'seen'-and-queue-if-not-found is atomic.
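One way to make that step atomic, sketched with a `threading.Lock` around the check and the add. The names `seen_lock` and `enqueue_once` are hypothetical, and the eight producer threads exist only to exercise the race.

```python
import queue
import threading

seen = set()                   # every link ever queued
seen_lock = threading.Lock()   # guards `seen`
work = queue.Queue()           # links waiting to be visited


def enqueue_once(link):
    """Queue `link` unless it was queued before; return True if queued."""
    with seen_lock:
        # Check and add happen under one lock, so two threads cannot
        # both decide that the same link is new.
        if link in seen:
            return False
        seen.add(link)
    work.put(link)
    return True


def producer():
    for n in range(100):
        enqueue_once("/page/%d" % n)


threads = [threading.Thread(target=producer) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

queued_count = work.qsize()
```

Even though eight threads each try the same 100 links, each link ends up on the queue once.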

Amber 2009-10-17 12:15:43

Answer 6

+2 A:

If you don't care about the order in which items are processed, I'd try a subclass of Queue that uses set internally:

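from queue import Queue  # Python 2: from Queue import Queue

class SetQueue(Queue):

    def _init(self, maxsize):
        # Keep pending items in a set: duplicates collapse, order is lost.
        self.maxsize = maxsize
        self.queue = set()

    def _put(self, item):
        self.queue.add(item)

    def _get(self):
        # set.pop() removes and returns an arbitrary item.
        return self.queue.pop()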

As Paul McGuire pointed out, this would allow adding a duplicate item after it's been removed from the "to-be-processed" set and not yet added to the "processed" set. To solve this, you can store both sets in the Queue instance. But since the larger set is the one you check to see whether an item has been processed, you can just as well go back to a regular queue for the pending items, which also keeps the requests in order.

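from queue import Queue  # Python 2: from Queue import Queue

class SetQueue(Queue):

    def _init(self, maxsize):
        Queue._init(self, maxsize)
        # Every item ever put, including ones already taken off the queue.
        self.all_items = set()

    def _put(self, item):
        # put() calls this with the queue's lock held, so check-and-add is atomic.
        if item not in self.all_items:
            Queue._put(self, item)
            self.all_items.add(item)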

The advantage of this, as opposed to using a set separately, is that the Queue's methods are thread-safe, so that you don't need additional locking for checking the other set.
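A short usage sketch, restating the second class for Python 3's `queue` module so that it runs on its own. The last line reads `unfinished_tasks`, an attribute of CPython's `Queue` implementation, to show one caveat: `put()` still counts an ignored duplicate as a task.

```python
from queue import Queue


class SetQueue(Queue):
    """Restates the second class above: a FIFO queue that drops repeats."""

    def _init(self, maxsize):
        Queue._init(self, maxsize)
        self.all_items = set()

    def _put(self, item):
        if item not in self.all_items:
            Queue._put(self, item)
            self.all_items.add(item)


q = SetQueue()
for url in ["/a", "/b", "/a"]:
    q.put(url)           # the second "/a" is dropped

first = q.get()          # FIFO order is kept
q.put("/a")              # already seen once, so ignored even after the get
remaining = q.qsize()

# put() was called four times, and each call bumped the task counter,
# although only two items were ever stored.
counted = q.unfinished_tasks
```

So if the workers rely on `task_done()` and `join()`, this version as written may leave `join()` waiting for the dropped duplicates; it behaves as intended when only `put()` and `get()` are used.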

Lukáš Lalinský 2009-10-17 10:46:37

This runs the risk of reprocessing an entry after it has been popped.

Paul McGuire 2009-10-17 11:24:36

Sure, you could also store the set of all items in the "queue" and modify `_put` to first check that set. It's protected by Queue's locking, so there are no race conditions.

Lukáš Lalinský 2009-10-17 11:59:23

This is so elegant. Very nice, even with the drawback of the first version.

e-satis 2009-10-17 15:23:18

Answer 7

A:

Instead of an "array of pages already visited", keep an "array of pages already added to the queue".

nosklo 2009-10-17 15:13:21
