Hi,
I'm using Scrapy to crawl a news website on a daily basis. How do I stop Scrapy from scraping URLs it has already scraped? Also, is there any clear documentation or examples for SgmlLinkExtractor?
-Avinash
As mentioned earlier, make sure your start_urls does not contain duplicates, and maintain a collection of previously scraped URLs in your spider. That way, your parse method only returns requests for URLs it hasn't seen yet.
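A minimal sketch of that idea, assuming a reasonably recent Scrapy and using hypothetical names (NewsSpider, example.com); the seen set lives on the spider and is checked before yielding new requests:

import scrapy

class NewsSpider(scrapy.Spider):
    name = "news"                                 # hypothetical spider name
    start_urls = ["http://example.com/news"]      # hypothetical start URL

    def __init__(self, *args, **kwargs):
        super(NewsSpider, self).__init__(*args, **kwargs)
        self.seen = set()                         # URLs already queued in this run

    def parse(self, response):
        for href in response.css("a::attr(href)").extract():
            url = response.urljoin(href)
            if url not in self.seen:              # only follow links we haven't queued yet
                self.seen.add(url)
                yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        yield {"url": response.url,
               "title": response.css("title::text").extract_first()}

Note that within a single run Scrapy's default duplicate filter already drops repeated requests; the explicit set mainly matters if you want to persist the list of seen URLs across your daily runs.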
This is straightforward: keep all of your previously crawled URLs in a Python dict. The next time you are about to fetch a URL, check whether it is already in the dict; if it is, skip it, otherwise crawl it.
def load_urls(prev_urls):
    # Build a dict keyed by URL for fast membership checks.
    prev = dict()
    for url in prev_urls:
        prev[url] = True
    return prev

def fresh_crawl(prev_urls, new_urls):
    # Crawl only the URLs that were not seen in earlier runs.
    for url in new_urls:
        if url not in prev_urls:
            crawl(url)  # crawl() is a placeholder for your actual fetch/parse call
    return

def main():
    purls = load_urls(prev_urls)   # prev_urls: URLs collected in earlier runs
    fresh_crawl(purls, nurls)      # nurls: URLs discovered in today's run
    return
The above code was typed straight into the SO text editor (i.e. the browser), so it may still need a few changes for your setup, but the logic is there...
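Since the crawl runs daily, the set of previously crawled URLs has to survive between runs. One simple way to do that (a sketch, using a hypothetical file name urls_seen.txt) is to load and save them as a plain text file, one URL per line:

import os

SEEN_FILE = "urls_seen.txt"   # hypothetical file name

def load_seen_urls(path=SEEN_FILE):
    # Read previously crawled URLs; empty set on the very first run.
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

def save_seen_urls(urls, path=SEEN_FILE):
    # Rewrite the file with everything seen so far.
    with open(path, "w") as f:
        for url in sorted(urls):
            f.write(url + "\n")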
NOTE: Beware that some websites constantly change their content, so sometimes you may have to recrawl a particular webpage (i.e. the same URL) just to get the updated version.
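One simple way to decide whether a recrawled page actually changed is to keep a fingerprint of its body; this sketch assumes a hypothetical fingerprints dict that you persist between runs alongside the seen URLs:

import hashlib

def page_fingerprint(body):
    # body: raw bytes of the response; the hash is a cheap "did it change?" marker.
    return hashlib.sha1(body).hexdigest()

def has_changed(url, body, fingerprints):
    # fingerprints: hypothetical dict mapping url -> fingerprint from the last run.
    new_fp = page_fingerprint(body)
    changed = fingerprints.get(url) != new_fp
    fingerprints[url] = new_fp
    return changed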