I am spidering a video site that expires content frequently. I am considering using Scrapy to do my spidering, but am not sure how to delete expired items.

Strategies to detect if an item is expired are:

  1. Spider the site's "delete.rss".
  2. Every few days, try reloading the contents page and making sure it still works.
  3. Spider every page of the site's content indexes, and remove the video if it's not found.

Please let me know how to remove expired items in Scrapy. I will be storing my Scrapy items in a MySQL DB via Django.

2010-01-18 Update

I have found a solution that is working, but may still not be optimal. I am maintaining a "found_in_last_scan" flag on every video that I sync. When the spider starts, it sets all the flags to False. When it finishes, it deletes videos that still have the flag set to False. I did this by attaching to the signals.spider_opened and signals.spider_closed signals, roughly as sketched below. Please confirm this is a valid strategy and that there are no problems with it.
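Roughly, the wiring looks like this - a minimal sketch of what I'm doing; the ExpireVideos class name, the Video model and the import paths are simplified placeholders, and the item pipeline that saves each scraped video is what flips the flag back to True:

from scrapy.xlib.pydispatch import dispatcher
from scrapy.core import signals  # signal import path may differ between Scrapy versions

from myapp.models import Video  # placeholder Django model with a found_in_last_scan field

class ExpireVideos(object):
    """Reset the flag on spider_opened, purge anything not re-found on spider_closed."""

    def __init__(self):
        dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
        dispatcher.connect(self.spider_closed, signal=signals.spider_closed)

    def spider_opened(self, spider):
        # Assume every video is gone until the crawl sees it again.
        Video.objects.all().update(found_in_last_scan=False)

    def spider_closed(self, spider):
        # Whatever the crawl never touched is treated as expired.
        Video.objects.filter(found_in_last_scan=False).delete()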

A: 

If you have an HTTP URL which you suspect might not be valid at all any more (because you found it in a "deleted" feed, or just because you haven't checked it in a while), the simplest, fastest way to check is to send an HTTP HEAD request for that URL. In Python, that's best done with the httplib module of the standard library: make a connection object c to the host of interest with HTTPConnection (if HTTP 1.1, it may be reusable to check multiple URLs with better performance and lower system load), then make one call (or more, if feasible, i.e. if HTTP 1.1 is in use) to c's request method, first argument 'HEAD', second argument the URL you're checking (without the host part of course;-).

After each request you call c.getresponse() to get an HTTPResponse object, whose status attribute will tell you if the URL is still valid.
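For instance, a quick sketch (the host and paths here are made up, adjust to taste):

import httplib

def check_urls(host, paths):
    """HEAD each path on host over one reusable HTTP 1.1 connection."""
    statuses = {}
    conn = httplib.HTTPConnection(host)
    for path in paths:
        conn.request('HEAD', path)
        resp = conn.getresponse()
        statuses[path] = resp.status  # 200 = still there; 404/410 = gone
        resp.read()  # finish this response so the connection can be reused
    conn.close()
    return statuses

# e.g. check_urls('www.example.com', ['/videos/1.html', '/videos/2.html'])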

Yes, it's a bit low-level, but exactly for this reason it lets you optimize your task a lot better, with just a little knowledge of HTTP;-).

Alex Martelli
I think this is a bad answer given the circumstances; it's like telling a programmer using Django to handle HTTP responses manually... That said, it's a perfectly valid way to check a URL's validity using Python.
Hannson
@Hannson, I disagree that it's "bad" to use a slightly lower layer of abstraction when that layer provides important improvements over operating at a higher one -- especially if the delete feed can't be relied on to be 100% complete (and on some sites it's not even there!). Periodically re-scraping everything (with HTTP `GET`s, implicitly) instead of using HTTP `HEAD` for just-checking is simply a wanton waste of precious resources.
Alex Martelli
True, lower level != bad but for a high-level web crawling framework it might not be the best way. I'm under the assumption that this particular site has a delete.rss feed (I might be wrong), but my point is that your answer wasn't what the questioner was (probably) looking for - I could be wrong though; while your answer wasn't incorrect it wasn't "correct" IMO either. For the record I didn't vote for or against your answer - it's valid either way.
Hannson
I somehow missed this: 'I disagree that it's "bad" to use a slightly lower layer of abstraction when using that layer provides important improvements over operating at a higher layer' and I fully agree! When lower level is better it's just better; enough said!
Hannson
+1  A: 

I haven't tested this!
I have to confess that I haven't tried using the Django models in Scrapy, but here goes:

The simplest way, I imagine, would be to create a new spider for the deleted.rss file by extending XMLFeedSpider (the example below is copied from the Scrapy documentation, then modified). I suggest you create a separate spider because very little of the following logic is related to the logic used for scraping the site:

from scrapy.contrib.spiders import XMLFeedSpider
from myproject.items import DeletedUrlItem

class MySpider(XMLFeedSpider):
    domain_name = 'example.com'
    start_urls = ['http://www.example.com/deleted.rss']
    iterator = 'iternodes' # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        # Adjust this placeholder XPath to wherever the URL lives in the feed.
        item = DeletedUrlItem()
        item['url'] = node.select('path/to/url/text()').extract()[0] # first match
        return item # return an Item

SPIDER = MySpider()

This is not a working spider for you to use, but IIRC the RSS files are pure XML. I'm not sure what the deleted.rss looks like, but I'm sure you can figure out how to extract the URLs from the XML. This example imports myproject.items.DeletedUrlItem, which you need to create yourself, using something like the code below:

from scrapy.item import Item, Field

class DeletedUrlItem(Item):
    url = Field()

Instead of saving the items, you delete them using Django's Model API in a Scrapy ItemPipeline - I assume you're using a DjangoItem:

# We raise a DropItem exception so Scrapy
# doesn't try to process the item any further.
from scrapy.core.exceptions import DropItem

# Import your Django model (the path and name here are placeholders)
from myapp.models import YourModel

class DeleteUrlPipeline(object):

    def process_item(self, spider, item):
        if item['url']:
            delete_item = YourModel.objects.get(url=item['url'])
            delete_item.delete() # actually delete the item!
            raise DropItem("Deleted: %s" % item)
        return item

Notice the delete_item.delete().
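You also have to enable the pipeline in your project's settings.py, something like the line below (the module path is a placeholder for wherever you put the class; newer Scrapy versions use a dict with an order number instead of a plain list):

# settings.py - adjust the path to your project layout
ITEM_PIPELINES = ['myproject.pipelines.DeleteUrlPipeline']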


I'm aware that this answer may contain errors, as it's written from memory :-), but I will definitely update it if you've got comments or cannot figure this out.

Hannson
Nice answer. Since I'm going to have `DeletedUrlItems` as well as `VideoItems`, would you do an `isinstance` check inside the `DeleteUrlPipeline` to make sure it only runs on `DeletedUrlItems`?
Gattster
yeah, it's better to be safe than sorry, right?
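Something like this, say - just a sketch off the top of my head, assuming DeletedUrlItem lives in myproject.items and YourModel is the same placeholder model as in the answer:

from scrapy.core.exceptions import DropItem
from myproject.items import DeletedUrlItem
from myapp.models import YourModel  # same placeholder model as above

class DeleteUrlPipeline(object):

    def process_item(self, spider, item):
        # Only act on deletion items; pass everything else through untouched.
        if not isinstance(item, DeletedUrlItem):
            return item
        delete_item = YourModel.objects.get(url=item['url'])
        delete_item.delete()
        raise DropItem("Deleted: %s" % item)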
Hannson