tags:
views: 307
answers: 5

I'm working on a simple web crawler in Python and I want to make a simple queue class, but I'm not quite sure of the best way to start. I want something that holds only unique items to process, so that the crawler will only crawl each page once per script run (simply to avoid infinite looping). Can anyone give me, or point me to, a simple queue example that I could work from?

+3  A: 

I'd just use a set; it doesn't maintain order, but it will help you maintain uniqueness:

>>> q = set([9, 8, 7, 7, 8, 5, 4, 1])
>>> q.pop()
1
>>> q.pop()
4
>>> q.pop()
5
>>> q.add(3)
>>> q.add(3)
>>> q.add(3)
>>> q.add(3)
>>> q
set([3, 7, 8, 9])
Aaron Maenpaa
@S.Lott Yes, I see that now, but originally his answer suggested heapq.
The.Anti.9
@S.Lott He was responding to my initial answer, which was to suggest heapq, because I misread the question and thought he wanted a priority queue.
Aaron Maenpaa
But what if the crawler finds a link to page 3 a while later, when it has already been removed from the set?
Wookai
@Wookai: That's what I addressed with my solution.
Kamil Kisiel
+1  A: 

Why not use a list if you need order (or even a heapq, as zacherates originally suggested before a set was proposed instead), and also use a set to check for duplicates?
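A minimal sketch of that combination (the URLs and names here are just illustrative): a deque preserves crawl order, while a companion set gives O(1) duplicate checks.

```python
from collections import deque

to_visit = deque()  # preserves the order pages were discovered in
seen = set()        # fast membership test for duplicates

def enqueue(url):
    # Only queue a URL the first time we see it.
    if url not in seen:
        seen.add(url)
        to_visit.append(url)

enqueue("http://example.com/a")
enqueue("http://example.com/b")
enqueue("http://example.com/a")  # duplicate, silently ignored

# Drain the queue in FIFO order.
crawl_order = [to_visit.popleft() for _ in range(len(to_visit))]
```

Here `crawl_order` ends up holding each URL once, in the order it was first enqueued.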

Devin Jeanpierre
+2  A: 

A very simple example would be to stuff each item's URL into a dict, but as the key, not as the value. Then only process the next item if its URL is not among that dict's keys:

visited = {}
# grab next url from somewhere
if url not in visited:  # a dict membership test checks the keys
    # process url
    visited[url] = 1  # or whatever; the value is unimportant
# repeat with next url

You can get more efficient, of course, but this would be simple.

That's essentially an implementation of a set; in fact, Python's built-in set is implemented in a similar manner.
Devin Jeanpierre
Yeah, but at the time I posted it, the previous poster used a heapq, not a set.
It's better than a set, because it keeps track of the already-visited URLs.
Wookai
Well, instead of "visited[url] = 1" you'd have "visited.add(url)", if visited was a set - other than that, it'd be the same.
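A short sketch of that set-based equivalent (variable names are illustrative), showing the one-line difference from the dict version:

```python
visited = set()
processed = []  # stand-in for the real per-URL processing

for url in ["http://example.com/a", "http://example.com/b", "http://example.com/a"]:
    if url not in visited:
        processed.append(url)
        visited.add(url)  # was: visited[url] = 1 with a dict
```

The duplicate URL is skipped on its second appearance; everything else behaves exactly as with the dict.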
+1  A: 

If I understand correctly, you want to visit each page only once. I think the best way to do this would be to keep a queue of pages still to visit, and a set of visited pages. The problem with the other posted solution is that once you pop a page from the queue, you no longer have a record of whether or not you've been there.

I'd use a combination of a set and a list:

visited = set()
to_visit = []

def queue_page(url):
    if url not in visited and url not in to_visit:
        to_visit.append(url)

def visit(url):
    visited.add(url)
    ... # some processing

    # Add all found links to the queue
    for link in links:
        queue_page(link)

def page_iterator(start_url):
    visit(start_url)
    while to_visit:
        yield to_visit.pop(0)

for page in page_iterator(start):
    visit(page)

Of course this is a bit of a contrived example, and you'd probably be best off encapsulating it in some way, but it illustrates the concept.

Kamil Kisiel
A: 

I would extend the list class to add uniqueness-testing code to whichever list methods you are using. This could range from simply adding an .append_unique(item) method to the class, to overriding all of append, insert, extend, __setitem__, __setslice__, etc., to raise an exception (or stay silent, if you wish) when given a non-unique item.

For example, if you just wanted to make sure the append method maintained uniqueness:

class UniqueList(list):
    def append(self, item):
        if item not in self:
            list.append(self, item)
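For instance, a quick usage check of that subclass (the class is repeated here so the snippet runs on its own; the URLs are just illustrative):

```python
class UniqueList(list):
    def append(self, item):
        # Silently ignore items that are already in the list.
        if item not in self:
            list.append(self, item)

ul = UniqueList()
for url in ["http://example.com/a", "http://example.com/b", "http://example.com/a"]:
    ul.append(url)
# ul now holds each URL once, in first-seen order
```

Note that `item not in self` is a linear scan, so this stays simple at the cost of O(n) appends; the set-based approaches above avoid that.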
Mike Boers
Wouldn't a set be simpler? A list preserves order, but is preserving the order actually necessary here?
S.Lott
I had assumed that a queue would imply it being ordered, but I guess that is what the "non-priority" part is about...
Mike Boers