ansaurus

Question

How to remove expired items from database with Scrapy

Answer 1

A:

If you have an HTTP URL that you suspect may no longer be valid (because you found it in a "deleted" feed, or just because you haven't checked it in a while), the simplest, fastest way to check is to send an HTTP HEAD request for that URL. In Python, that's best done with the httplib module of the standard library (renamed http.client in Python 3): make a connection object c to the host of interest with HTTPConnection (with HTTP 1.1 the connection may be reusable for checking multiple URLs, giving better performance and lower system load), then make one or more calls (more only if HTTP 1.1 is in use) to c's request method, with first argument 'HEAD' and second argument the URL you're checking (without the host part, of course;-).

After each request you call c.getresponse() to get an HTTPResponse object, whose status attribute will tell you if the URL is still valid.
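A minimal sketch of that request/getresponse loop, written against http.client (the Python 3 name of httplib). The helper names head_statuses and expired_paths, the GONE_STATUSES set, and the example host and paths are all assumptions made for illustration, not part of any library:

```python
import http.client

# Statuses treated as "the page is gone". This set is an assumption;
# adjust it to how the site you are checking actually behaves.
GONE_STATUSES = {404, 410}


def head_statuses(conn, paths):
    """Send one HEAD request per path over conn and return {path: status}."""
    statuses = {}
    for path in paths:
        conn.request("HEAD", path)
        response = conn.getresponse()
        response.read()  # finish the (empty) response so the connection can be reused
        statuses[path] = response.status
    return statuses


def expired_paths(statuses):
    """Return the paths whose status suggests the page no longer exists."""
    return [path for path, status in statuses.items() if status in GONE_STATUSES]


# Example use (hypothetical host and paths):
#   conn = http.client.HTTPConnection("www.example.com", timeout=10)
#   try:
#       print(expired_paths(head_statuses(conn, ["/videos/1", "/videos/2"])))
#   finally:
#       conn.close()
```

Passing the connection in, rather than creating it inside the helper, keeps one connection in use for the whole batch of paths on the same host.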

Yes, it's a bit low-level, but exactly for this reason it lets you optimize your task a lot better, with just a little knowledge of HTTP;-).

Alex Martelli 2010-01-18 03:55:30

I think this is a bad answer given the circumstances; it's like telling a programmer using Django to write HTTP responses by hand. That said, it's a perfectly valid method to check a URL for validity using Python.

Hannson 2010-01-18 04:47:34

@Hannson, I disagree that it's "bad" to use a slightly lower layer of abstraction when using that layer provides important improvements over operating at a higher layer. Especially if the delete feed cannot be 100% relied on to be complete (and on some sites it's not even there!), periodically re-scraping everything (with HTTP `GET`s, implicitly) instead of using HTTP `HEAD` just to check is simply a wanton waste of precious resources.

Alex Martelli 2010-01-18 05:57:38

True, lower level != bad but for a high-level web crawling framework it might not be the best way. I'm under the assumption that this particular site has a delete.rss feed (I might be wrong), but my point is that your answer wasn't what the questioner was (probably) looking for - I could be wrong though; while your answer wasn't incorrect it wasn't "correct" IMO either. For the record I didn't vote for or against your answer - it's valid either way.

Hannson 2010-01-18 06:22:15

I somehow missed this: 'I disagree that it's "bad" to use a slightly lower layer of abstraction when using that layer provides important improvements over operating at a higher layer' and I fully agree! When lower level is better it's just better; enough said!

Hannson 2010-01-18 06:24:36

Answer 2

+1 A:

I haven't tested this!
I have to confess that I haven't tried using the Django models in Scrapy, but here goes:

The simplest way I can imagine would be to create a new spider for the deleted.rss file by extending XMLFeedSpider (the code below is copied from the Scrapy documentation, then modified). I suggest creating a separate spider because very little of the following logic is related to the logic used for scraping the site:

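from scrapy.contrib.spiders import XMLFeedSpider  # scrapy.spiders in newer Scrapy releases
from myproject.items import DeletedUrlItem

class MySpider(XMLFeedSpider):
    domain_name = 'example.com'  # newer Scrapy releases use `name` instead
    start_urls = ['http://www.example.com/deleted.rss']
    iterator = 'iternodes' # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        # Called once for every <item> node in the feed
        item = DeletedUrlItem()
        # '#path/to/url' is a placeholder: replace it with the real XPath.
        # extract() returns a list, so keep only the first match.
        urls = node.select('#path/to/url').extract()
        item['url'] = urls[0] if urls else None

        return item # return an Item

SPIDER = MySpider()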

This is not a working spider for you to use, but IIRC RSS files are pure XML. I'm not sure what deleted.rss looks like, but I'm sure you can figure out how to extract the URLs from the XML. Now, this example imports myproject.items.DeletedUrlItem, which is only a name so far; you need to create the DeletedUrlItem yourself, using something like the code below:

from scrapy.item import Item, Field

class DeletedUrlItem(Item):
    url = Field()
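On the extraction step mentioned above: the real layout of deleted.rss is not known here, so the following is only a sketch using the standard library, assuming an ordinary RSS 2.0 feed where each item carries the deleted page's address in a link element. SAMPLE_FEED and the deleted_urls helper are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; the real deleted.rss may be laid out differently.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Deleted items</title>
    <item><link>http://www.example.com/videos/1</link></item>
    <item><link> http://www.example.com/videos/2 </link></item>
    <item><title>no link here</title></item>
  </channel>
</rss>"""


def deleted_urls(feed_xml):
    """Return the text of every channel/item/link element in an RSS document."""
    root = ET.fromstring(feed_xml)
    urls = []
    for link in root.iterfind("./channel/item/link"):
        if link.text and link.text.strip():
            urls.append(link.text.strip())
    return urls


print(deleted_urls(SAMPLE_FEED))
```

Once the right path for the real feed is known, the same expression (relative to each item node) is what would replace the placeholder XPath in the spider.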

Instead of saving, you delete the items using Django's model API in a Scrapy item pipeline (I assume you're using a DjangoItem):

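# we raise a DropItem exception so Scrapy
# doesn't try to process the item any further
from scrapy.core.exceptions import DropItem  # scrapy.exceptions in newer Scrapy releases

# import your model (placeholder path: adjust it to your own Django app)
from yourapp.models import yourModel

class DeleteUrlPipeline(object):

    # newer Scrapy releases pass the arguments as (item, spider)
    def process_item(self, spider, item):
        if item['url']:
            try:
                delete_item = yourModel.objects.get(url=item['url'])
            except yourModel.DoesNotExist:
                # nothing stored under this URL, so there is nothing to delete
                raise DropItem("Not in database: %s" % item)
            delete_item.delete() # actually delete the item!
            raise DropItem("Deleted: %s" % item)
        return item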

Notice the delete_item.delete().

I'm aware that this answer may contain errors; it's written from memory :-) but I will definitely update it if you have comments or cannot figure this out.

Hannson 2010-01-18 05:45:11

Nice answer. Since I'm going to have `DeletedUrlItem`s as well as `VideoItem`s, would you do an `isinstance` check inside the DeleteUrlPipeline to make sure it only runs on `DeletedUrlItem`s?

Gattster 2010-01-18 21:48:07

yeah, it's better to be safe than sorry, right?

Hannson 2010-01-20 04:28:12
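The `isinstance` guard discussed in the two comments above could look roughly like the sketch below. It is plain Python so that it runs without Scrapy or Django installed: DropItem, DeletedUrlItem and VideoItem are stand-ins for the real classes, and delete_by_url is a hypothetical callable standing in for the Django model lookup and delete. The (spider, item) argument order follows the answer above.

```python
class DropItem(Exception):
    """Stand-in for Scrapy's DropItem exception."""


class DeletedUrlItem(dict):
    """Stand-in for the Scrapy item that carries a deleted URL."""


class VideoItem(dict):
    """Stand-in for a regular scraped item."""


class DeleteUrlPipeline(object):
    def __init__(self, delete_by_url):
        # delete_by_url is a hypothetical callable that removes the database
        # row for a URL, e.g. a thin wrapper around the Django model.
        self.delete_by_url = delete_by_url

    def process_item(self, spider, item):
        if not isinstance(item, DeletedUrlItem):
            # Not ours: hand the item on to the next pipeline untouched.
            return item
        url = item.get("url")
        if url:
            self.delete_by_url(url)
        # Either way, a DeletedUrlItem should never be stored.
        raise DropItem("Deleted: %s" % url)
```

With this guard in place a `VideoItem` passes straight through, while a `DeletedUrlItem` triggers the delete and is then dropped.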
