I have to confess that I haven't tested this, and I haven't tried using the Django models in Scrapy, but here goes:
The simplest way I can think of is to create a new spider for the deleted.rss file by extending XMLFeedSpider (copied from the Scrapy documentation, then modified). I suggest creating a separate spider because very little of the following logic is related to the logic used for scraping the site:
from scrapy import log
from scrapy.contrib.spiders import XMLFeedSpider
from myproject.items import DeletedUrlItem

class MySpider(XMLFeedSpider):
    domain_name = 'example.com'
    start_urls = ['http://www.example.com/deleted.rss']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        # called once for every <item> node in the feed
        item = DeletedUrlItem()
        # '#path/to/url' is a placeholder; note that extract() returns a list of matches
        item['url'] = node.select('#path/to/url').extract()
        return item  # return an Item

SPIDER = MySpider()
This is not a working spider for you to use, but IIRC RSS files are pure XML. I'm not sure what deleted.rss looks like, but I'm sure you can figure out how to extract the URLs from the XML. Now, this example imports myproject.items.DeletedUrlItem, which is just a placeholder name in this example, so you need to create the DeletedUrlItem using something like the code below:
from scrapy.item import Item, Field

class DeletedUrlItem(Item):
    url = Field()
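As for pulling the URLs out of the feed: since I don't know what deleted.rss actually looks like, here is a rough sketch using only the standard library on a made-up sample. The sample feed, the `<link>` element and the `extract_deleted_urls` helper are all my assumptions, so adjust the tag names to whatever the real file contains.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of what deleted.rss might contain; the real feed's
# element names may differ.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Deleted pages</title>
    <item>
      <title>Old page</title>
      <link>http://www.example.com/old-page</link>
    </item>
    <item>
      <title>Another old page</title>
      <link>http://www.example.com/another-old-page</link>
    </item>
  </channel>
</rss>
"""


def extract_deleted_urls(feed_xml):
    """Return the text of the <link> child of every <item> in the feed."""
    root = ET.fromstring(feed_xml)
    urls = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:  # skip items without a usable <link>
            urls.append(link.strip())
    return urls


deleted_urls = extract_deleted_urls(SAMPLE_FEED)
```

Trying the extraction on a saved copy of the feed like this is a quick way to work out the right path before putting it into `parse_node`.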
Instead of saving, you delete the items using Django's Model API in a Scrapy ItemPipeline (I assume you're using a DjangoItem):
# we raise a DropItem exception so Scrapy
# doesn't try to process the item any further
from scrapy.core.exceptions import DropItem
# import your model; 'myapp.models' is a placeholder for wherever it lives
from myapp.models import yourModel

class DeleteUrlPipeline(object):
    # note: newer Scrapy releases use process_item(self, item, spider)
    def process_item(self, spider, item):
        if item['url']:
            delete_item = yourModel.objects.get(url=item['url'])
            delete_item.delete()  # actually delete the item!
            raise DropItem("Deleted: %s" % item)
        return item  # nothing to delete, pass the item along
Notice the delete_item.delete() call.
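One thing to watch for: `get()` raises `DoesNotExist` when no row matches the URL, which can easily happen if the feed lists something you never scraped. In Django, `yourModel.objects.filter(url=...).delete()` removes whatever matches and does not complain when nothing does. The sketch below illustrates that "delete if present" behaviour without needing Django installed; `FakeUrlTable` and `delete_url_from` are hypothetical stand-ins I made up for the illustration, not part of Scrapy or Django.

```python
class FakeUrlTable(object):
    """Hypothetical in-memory stand-in for the Django model's table."""

    def __init__(self, urls):
        self.urls = set(urls)

    def delete_matching(self, url):
        """Remove the URL if present and return how many rows were removed.

        Mirrors the forgiving behaviour of filter(url=url).delete():
        a missing row is not an error.
        """
        if url in self.urls:
            self.urls.remove(url)
            return 1
        return 0


def delete_url_from(table, item):
    """Delete the row for item['url'], tolerating items with no URL."""
    url = item.get("url")
    if not url:
        return 0
    return table.delete_matching(url)


table = FakeUrlTable(["http://www.example.com/a", "http://www.example.com/b"])
first = delete_url_from(table, {"url": "http://www.example.com/a"})
second = delete_url_from(table, {"url": "http://www.example.com/a"})  # already gone
```

Here the second call is a harmless no-op, which is what you want if the same URL can show up in the feed more than once.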
I'm aware that this answer may contain errors, since it's written from memory :-) but I will definitely update it if you have comments or cannot figure this out.