Hi,
I'm using Scrapy to crawl a news website on a daily basis. How do I stop Scrapy from scraping URLs it has already scraped? Also, is there any clear documentation or examples for SgmlLinkExtractor?
-Avinash
As mentioned earlier, make sure your start_urls does not contain duplicates, and maintain a collection of previously scraped URLs in your spider. That way, your parse method only returns requests for URLs it hasn't seen yet.
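A minimal sketch of that idea, assuming a reasonably recent Scrapy and using hypothetical names (NewsSpider, example.com); the seen set lives on the spider and is checked before yielding new requests:

import scrapy

class NewsSpider(scrapy.Spider):
    name = "news"                                 # hypothetical spider name
    start_urls = ["http://example.com/news"]      # hypothetical start URL

    def __init__(self, *args, **kwargs):
        super(NewsSpider, self).__init__(*args, **kwargs)
        self.seen = set()                         # URLs already queued in this run

    def parse(self, response):
        for href in response.css("a::attr(href)").extract():
            url = response.urljoin(href)
            if url not in self.seen:              # only follow links we haven't queued yet
                self.seen.add(url)
                yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        yield {"url": response.url,
               "title": response.css("title::text").extract_first()}

Note that within a single run Scrapy's default duplicate filter already drops repeated requests; the explicit set mainly matters if you want to persist the list of seen URLs across your daily runs.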
This is straightforward: keep all of your previously crawled URLs in a Python dict. The next time you are about to fetch a URL, check whether it is already in the dict; if it is, skip it, otherwise crawl it.
def load_urls(prev_urls):
    # Build a dict keyed by URL for fast membership checks.
    prev = dict()
    for url in prev_urls:
        prev[url] = True
    return prev

def fresh_crawl(prev_urls, new_urls):
    # Crawl only the URLs that were not seen in earlier runs.
    for url in new_urls:
        if url not in prev_urls:
            crawl(url)  # crawl() is a placeholder for your actual fetch/parse call
    return

def main():
    purls = load_urls(prev_urls)   # prev_urls: URLs collected in earlier runs
    fresh_crawl(purls, nurls)      # nurls: URLs discovered in today's run
    return
The above code was typed straight into the SO text editor (i.e. the browser), so it may still need a few changes for your setup, but the logic is there...
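Since the crawl runs daily, the set of previously crawled URLs has to survive between runs. One simple way to do that (a sketch, using a hypothetical file name urls_seen.txt) is to load and save them as a plain text file, one URL per line:

import os

SEEN_FILE = "urls_seen.txt"   # hypothetical file name

def load_seen_urls(path=SEEN_FILE):
    # Read previously crawled URLs; empty set on the very first run.
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

def save_seen_urls(urls, path=SEEN_FILE):
    # Rewrite the file with everything seen so far.
    with open(path, "w") as f:
        for url in sorted(urls):
            f.write(url + "\n")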
NOTE: Beware that some websites constantly change their content, so sometimes you may have to recrawl a particular webpage (i.e. the same URL) just to get the updated version.
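One simple way to decide whether a recrawled page actually changed is to keep a fingerprint of its body; this sketch assumes a hypothetical fingerprints dict that you persist between runs alongside the seen URLs:

import hashlib

def page_fingerprint(body):
    # body: raw bytes of the response; the hash is a cheap "did it change?" marker.
    return hashlib.sha1(body).hexdigest()

def has_changed(url, body, fingerprints):
    # fingerprints: hypothetical dict mapping url -> fingerprint from the last run.
    new_fp = page_fingerprint(body)
    changed = fingerprints.get(url) != new_fp
    fingerprints[url] = new_fp
    return changed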