views:

196

answers:

1

I am using Scrapy to fetch some data from iTunes' AppStore database. I start with this list of apps: http://itunes.apple.com/us/genre/mobile-software-applications/id36?mt=8

In the following code I have used the simplest regex which targets all apps in the US store.

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class AppStoreSpider(CrawlSpider):
    domain_name = 'itunes.apple.com'
    start_urls = ['http://itunes.apple.com/us/genre/mobile-software-applications/id6015?mt=8']

    rules = (
        Rule(SgmlLinkExtractor(allow='itunes\.apple\.com/us/app'),
            'parse_app', follow=True,
        ),
    )

def parse_app(self, response):
    ....

SPIDER = AppStoreSpider()

When I run it I receive the following:

 [itunes.apple.com] DEBUG: Crawled (200) <GET http://itunes.apple.com/us/genre/mobile-software-applications/id6015?mt=8&gt; (referer: None)
 [itunes.apple.com] DEBUG: Filtered offsite request to 'itunes.apple.com': <GET http://itunes.apple.com/us/app/bloomberg/id281941097?mt=8&gt;

As you can see, when it starts crawling the first page it says: "Filtered offsite request to 'itunes.apple.com'". and then the spider stops.. it also returns this message:

[ScrapyHTTPPageGetter,client] /usr/lib/python2.5/cookielib.py:1577: exceptions.UserWarning: cookielib bug!

Traceback (most recent call last): File "/usr/lib/python2.5/cookielib.py", line 1575, in make_cookies parse_ns_headers(ns_hdrs), request) File "/usr/lib/python2.5/cookielib.py", line 1532, in _cookies_from_attrs_set cookie = self._cookie_from_cookie_tuple(tup, request) File "/usr/lib/python2.5/cookielib.py", line 1451, in _cookie_from_cookie_tuple if version is not None: version = int(version) ValueError: invalid literal for int() with base 10: '"1"'

I have used the same script for other website and I didn't have this problem.

Any suggestion? 

A: 

Hi Eric,

When I hit that link in a browser, it automatically tries to open iTunes locally. That could be the "offsite request" mentioned in the error.

I would try:

1) Remove "?mt=8" from the end of the URL. It looks like it's not needed anyway and it could have something to do with the request.

2) Try the same request in the Scrapy Shell. It's a much easier way to debug your code and try new things. More details here: http://doc.scrapy.org/topics/shell.html?highlight=interactive

ChrisCast