tags:
views: 307
answers: 5

I'm working on a simple web crawler in Python and I want to make a simple queue class, but I'm not quite sure of the best way to start. I want something that holds only unique items to process, so that the crawler will only crawl each page once per script run (simply to avoid infinite looping). Can anyone give me, or point me to, a simple queue example that I could work from?

+3  A: 

I'd just use a set; it doesn't maintain order, but it will help you maintain uniqueness:

>>> q = set([9, 8, 7, 7, 8, 5, 4, 1])
>>> q.pop()
1
>>> q.pop()
4
>>> q.pop()
5
>>> q.add(3)
>>> q.add(3)
>>> q.add(3)
>>> q.add(3)
>>> q
set([3, 7, 8, 9])
Aaron Maenpaa
@S.Lott Yes, I see that now, but originally his answer suggested heapq.
The.Anti.9
@S.Lott He was responding to my initial answer, which was to suggest heapq, because I misread the question and thought he wanted a priority queue.
Aaron Maenpaa
But what if the crawler finds a link to page 3 a while later, when it has already been removed from the set?
Wookai
@Wookai: That's what I addressed with my solution.
Kamil Kisiel
+1  A: 

Why not use a list if you need order (or even a heapq, as zacherates originally suggested before a set was proposed instead), and also use a set to check for duplicates?
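A minimal sketch of that combination (the URLs and names here are just illustrative): a deque preserves crawl order, while a companion set gives O(1) duplicate checks.

```python
from collections import deque

to_visit = deque()  # preserves the order pages were discovered in
seen = set()        # fast membership test for duplicates

def enqueue(url):
    # Only queue a URL the first time we see it.
    if url not in seen:
        seen.add(url)
        to_visit.append(url)

enqueue("http://example.com/a")
enqueue("http://example.com/b")
enqueue("http://example.com/a")  # duplicate, silently ignored

# Drain the queue in FIFO order.
crawl_order = [to_visit.popleft() for _ in range(len(to_visit))]
```

Here `crawl_order` ends up holding each URL once, in the order it was first enqueued.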

Devin Jeanpierre
+2  A: 

A very simple example would be to stuff each item's URL into a dict, but as the key, not as the value. Then only process the next item if its URL is not among that dict's keys:

visited = {}
# grab next url from somewhere
if url not in visited:  # a dict membership test checks the keys
    # process url
    visited[url] = 1  # or whatever; the value is unimportant
# repeat with next url

You can get more efficient, of course, but this would be simple.

That's essentially an implementation of a set; in fact, Python's built-in set is implemented in a similar manner.
Devin Jeanpierre
Yeah, but at the time I posted it, the previous poster used a heapq, not a set.
It's better than a set, because it keeps track of the already-visited URLs.
Wookai
Well, instead of "visited[url] = 1" you'd have "visited.add(url)", if visited was a set - other than that, it'd be the same.
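A short sketch of that set-based equivalent (variable names are illustrative), showing the one-line difference from the dict version:

```python
visited = set()
processed = []  # stand-in for the real per-URL processing

for url in ["http://example.com/a", "http://example.com/b", "http://example.com/a"]:
    if url not in visited:
        processed.append(url)
        visited.add(url)  # was: visited[url] = 1 with a dict
```

The duplicate URL is skipped on its second appearance; everything else behaves exactly as with the dict.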
+1  A: 

If I understand correctly, you want to visit each page only once. I think the best way to do this would be to keep a queue of pages still to visit, and a set of visited pages. The problem with the other posted solution is that once you pop a page from the queue, you no longer have a record of whether or not you've been there.

I'd use a combination of a set and a list:

visited = set()
to_visit = []

def queue_page(url):
    if url not in visited and url not in to_visit:
        to_visit.append(url)

def visit(url):
    visited.add(url)
    ... # some processing

    # Add all found links to the queue
    for link in links:
        queue_page(link)

def page_iterator(start_url):
    visit(start_url)
    while to_visit:
        yield to_visit.pop(0)

for page in page_iterator(start):
    visit(page)

Of course this is a bit of a contrived example, and you'd probably be best off encapsulating it in some way, but it illustrates the concept.

Kamil Kisiel
A: 

I would extend the list class to add uniqueness-testing code to whichever list methods you are using. This could range from simply adding an .append_unique(item) method to the class, to overriding all of append, insert, extend, __setitem__, __setslice__, etc., to raise an exception (or stay silent, if you wish) when given a non-unique item.

For example, if you just wanted to make sure the append method maintained uniqueness:

class UniqueList(list):
    def append(self, item):
        if item not in self:
            list.append(self, item)
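For instance, a quick usage check of that subclass (the class is repeated here so the snippet runs on its own; the URLs are just illustrative):

```python
class UniqueList(list):
    def append(self, item):
        # Silently ignore items that are already in the list.
        if item not in self:
            list.append(self, item)

ul = UniqueList()
for url in ["http://example.com/a", "http://example.com/b", "http://example.com/a"]:
    ul.append(url)
# ul now holds each URL once, in first-seen order
```

Note that `item not in self` is a linear scan, so this stays simple at the cost of O(n) appends; the set-based approaches above avoid that.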
Mike Boers
Wouldn't a set be simpler? A list preserves order, but is preserving the order actually necessary here?
S.Lott
I had assumed that a queue would imply it being ordered, but I guess that is what the "non-priority" part is about...
Mike Boers