I'm working on a simple web crawler in Python and I want to make a simple queue class, but I'm not quite sure of the best way to start. I want something that holds only unique items to process, so that the crawler will only crawl each page once per script run (simply to avoid infinite looping). Can anyone give me or point me to a simple queue example that I could work off of?
I'd just use a set: it doesn't maintain order, but it will help you maintain uniqueness:
>>> q = set([9, 8, 7, 7, 8, 5, 4, 1])
>>> q.pop()
1
>>> q.pop()
4
>>> q.pop()
5
>>> q.add(3)
>>> q.add(3)
>>> q.add(3)
>>> q.add(3)
>>> q
set([3, 7, 8, 9])
Why not use a list if you need order (or even a heapq, as zacherates suggested before a set was suggested instead), plus a set to check for duplicates?
A very simple approach would be to stuff each item's URL into a dict, but as the key, not as the value. Then only process the next item if its URL is not among that dict's keys:
visited = {}

# grab next url from somewhere
if url not in visited:
    # process url
    visited[url] = 1  # or whatever; the value is unimportant
# repeat with next url
You can get more efficient, of course, but this would be simple.
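To show the idea end to end, here's a minimal, runnable sketch of that pattern in a crawl loop. The `fetch_links` parameter is a made-up stand-in for however your crawler extracts URLs from a page, and the toy `links` dict stands in for real pages:

```python
from collections import deque

def crawl(start_url, fetch_links):
    visited = {}                  # keys are URLs; values are unused
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:        # skip anything already processed
            continue
        visited[url] = 1
        queue.extend(fetch_links(url))
    return list(visited)          # the pages, each processed once

# Toy link graph with a cycle (a <-> b), standing in for real pages
links = {"a": ["b", "c"], "b": ["a", "c"], "c": []}
print(crawl("a", links.get))  # ['a', 'b', 'c']
```

The cycle between a and b doesn't loop forever, because the dict lookup filters out anything already seen.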
If I understand correctly, you want to visit each page only once. I think the best way to do this would be to keep a queue of pages still to visit, and a set of visited pages. The problem with the other posted solution is that once you pop a page from the queue, you no longer have a record of whether or not you've been there.
I'd use a combination of a set and a list:
visited = set()
to_visit = []

def queue_page(url):
    # Only queue a page we haven't visited and haven't already queued
    if url not in visited and url not in to_visit:
        to_visit.append(url)

def visit(url):
    visited.add(url)
    ...  # some processing
    # Add all found links to the queue
    for link in links:
        queue_page(link)

def page_iterator(start_url):
    queue_page(start_url)
    while to_visit:
        yield to_visit.pop(0)

for page in page_iterator(start):
    visit(page)
Of course this is a bit of a contrived example, and you'd probably be best off encapsulating it in some way, but it illustrates the concept.
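One way that encapsulation might look is a small class wrapping the set and the list together; this is only a sketch, and the class and method names (CrawlQueue, push) are made up for illustration:

```python
from collections import deque

class CrawlQueue:
    """Hands out each URL at most once per run."""
    def __init__(self):
        self._seen = set()       # every URL ever enqueued
        self._pending = deque()  # URLs waiting to be visited

    def push(self, url):
        # Deduplicate at enqueue time, so the same URL can never
        # sit in the queue twice
        if url not in self._seen:
            self._seen.add(url)
            self._pending.append(url)

    def __iter__(self):
        # URLs pushed while iterating are picked up too, which is
        # exactly what a crawler needs
        while self._pending:
            yield self._pending.popleft()

q = CrawlQueue()
for u in ["a", "b", "a", "c"]:
    q.push(u)
print(list(q))  # ['a', 'b', 'c']
```

Deduplicating in push (rather than at pop time) keeps the pending list free of duplicates, so memory stays bounded by the number of distinct URLs.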
I would extend the list class to add unique-testing code to whichever of its methods you are using. This could range from simply adding an append_unique(item) method to the class, to overriding all of append, insert, extend, __setitem__, __setslice__ (Python 2 only), etc., to throw an exception (or stay silent, if you wish) in the case of a non-unique item.
For example, if you just wanted to make sure the append method maintained uniqueness:
class UniqueList(list):
    def append(self, item):
        if item not in self:
            list.append(self, item)
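Used as a crawl queue, duplicate appends are silently dropped (the class is repeated here so the snippet runs on its own):

```python
class UniqueList(list):
    def append(self, item):
        # Silently ignore items already in the list
        if item not in self:
            list.append(self, item)

pages = UniqueList()
for url in ["a", "b", "a", "c", "b"]:
    pages.append(url)
print(pages)  # ['a', 'b', 'c']
```

One caveat: the `item not in self` membership test is a linear scan, so this gets slow for a large number of URLs; pairing the list with a set, as in the previous answer, keeps the check O(1).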