Hi, I'm very new to Scrapy. Here is my spider for crawling the Twisted Web documentation:
class TwistedWebSpider(BaseSpider):
    name = "twistedweb3"
    allowed_domains = ["twistedmatrix.com"]
    start_urls = [
        "http://twistedmatrix.com/documents/current/web/howto/",
    ]
    rules = (
        Rule(SgmlLinkExtractor(),
             'parse',
             follow=True,
        ),
    )

    def parse(self, response):
        print response.url
        filename = response.url.split("/")[-1]
        filename = filename or "index.html"
        open(filename, 'wb').write(response.body)
When I run scrapy-ctl.py crawl twistedweb3, it fetches only index.html.
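To make clear where the name index.html comes from, here is a minimal standalone sketch of the file-naming logic in parse(). It is written for Python 3 so it can be run without Scrapy. The helper name filename_for and the some-page.html URL are made up for illustration and are not part of my spider.

```python
def filename_for(url):
    """Mirror the naming logic in parse(): last path segment, or index.html."""
    name = url.split("/")[-1]
    return name or "index.html"


# The start URL ends with "/", so its last path segment is empty
# and the fallback name is used.
start = "http://twistedmatrix.com/documents/current/web/howto/"
print(filename_for(start))

# Hypothetical linked page, for illustration only.
print(filename_for(start + "some-page.html"))
```

So the single index.html file I get corresponds to the start URL itself. No file for any linked page is written.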
I took the index.html content and tried SgmlLinkExtractor on it: it extracts the links I expected, but those links are not followed.
Can you show me what is wrong?
Also, suppose I want to get the CSS and JavaScript files too. How do I achieve this? I mean, how do I download the full website?
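To make the last question concrete, this is roughly what I mean by "full website": besides the a href links, the URLs in link, script and img tags would also have to be collected and fetched. The sketch below uses only the Python 3 standard library, not Scrapy. The names AssetLinkParser, collect_urls and URL_ATTRS, the sample HTML and the example.com base URL are all assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag -> attribute that may point at a page or an asset (assumed minimal set).
URL_ATTRS = {"a": "href", "link": "href", "script": "src", "img": "src"}


class AssetLinkParser(HTMLParser):
    """Collect absolute URLs of pages and assets referenced by a document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = URL_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the page URL.
                self.urls.append(urljoin(self.base_url, value))


def collect_urls(html, base_url):
    parser = AssetLinkParser(base_url)
    parser.feed(html)
    return parser.urls


# Made-up sample page, for illustration only.
SAMPLE_HTML = """
<html><head>
<link rel="stylesheet" href="style.css">
<script src="/js/site.js"></script>
</head><body>
<a href="page.html">A page</a>
</body></html>
"""
BASE = "http://example.com/docs/"

for url in collect_urls(SAMPLE_HTML, BASE):
    print(url)
```

What I am asking is how to get the equivalent behaviour from a Scrapy spider, so that the stylesheets and scripts are downloaded along with the pages.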