Hi,

I'm using Scrapy to crawl a news website on a daily basis. How do I restrict Scrapy from scraping already-scraped URLs? Also, is there any clear documentation or examples on SgmlLinkExtractor?

-Avinash

A: 

As previously said, make sure that your start_urls does not contain duplicates, and maintain a list of previously scraped URLs in your parse method. That way, parse only returns URLs that you haven't already seen.
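A minimal sketch of that idea, assuming a self-contained spider that follows every link it finds; the spider name, start URL, and CSS selector below are illustrative and not part of the original answer:

import scrapy

class NewsSpider(scrapy.Spider):
    name = "news"
    start_urls = ["http://example.com/news"]  # placeholder start URL

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen_urls = set()  # URLs already requested in this run

    def parse(self, response):
        self.seen_urls.add(response.url)
        # ... extract your items here ...
        # Follow only links that have not been seen yet.
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            if url not in self.seen_urls:
                self.seen_urls.add(url)
                yield scrapy.Request(url, callback=self.parse)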

Eric Fortin
A: 

This is straightforward. Maintain all your previously crawled URLs in a Python dict. The next time you are about to crawl a URL, check whether it is already in the dict; if not, crawl it.

def load_urls(prev_urls):
    # Build a dict keyed by previously crawled URLs for fast lookups.
    prev = dict()
    for url in prev_urls:
        prev[url] = True
    return prev

def fresh_crawl(prev_urls, new_urls):
    # Crawl only the URLs that were not seen in earlier runs.
    for url in new_urls:
        if url not in prev_urls:
            crawl(url)  # your existing fetch/parse routine
    return

def main(prev_urls, new_urls):
    # prev_urls: URLs stored from earlier runs; new_urls: URLs found today.
    purls = load_urls(prev_urls)
    fresh_crawl(purls, new_urls)
    return

The above code was typed straight into the SO text editor (i.e. the browser), so it might have syntax errors and you might need to make a few changes, but the logic is there...
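For a daily crawl, the dict of previously crawled URLs also needs to survive between runs. Here is a minimal sketch of one way to do that; the file name seen_urls.txt and the crawl() placeholder are assumptions, not part of the answer above:

import os

SEEN_FILE = "seen_urls.txt"  # assumed storage location for already-crawled URLs

def crawl(url):
    # Placeholder for whatever fetch/parse logic you already have.
    print("crawling", url)

def load_seen_urls():
    # Load URLs recorded by earlier runs, one per line.
    if not os.path.exists(SEEN_FILE):
        return {}
    with open(SEEN_FILE) as f:
        return {line.strip(): True for line in f if line.strip()}

def remember_url(url):
    # Append a newly crawled URL so tomorrow's run skips it.
    with open(SEEN_FILE, "a") as f:
        f.write(url + "\n")

def daily_run(new_urls):
    seen = load_seen_urls()
    for url in new_urls:
        if url not in seen:
            crawl(url)
            remember_url(url)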

NOTE: Beware that some websites constantly change their content, so sometimes you may have to recrawl a particular webpage (i.e. the same URL) just to get the updated content.
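If you do revisit the same URL, one simple way to decide whether to re-process it is to compare a hash of the page body against the hash stored from the last run. This is only a sketch of that idea; the names and the hash store are assumptions, not from the answer:

import hashlib

def body_hash(html_text):
    # Hash the page body so content changes can be detected cheaply.
    return hashlib.sha256(html_text.encode("utf-8")).hexdigest()

def has_changed(url, html_text, page_hashes):
    # page_hashes maps url -> hash recorded on the previous run.
    new_hash = body_hash(html_text)
    changed = page_hashes.get(url) != new_hash
    page_hashes[url] = new_hash
    return changed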

MovieYoda