Hello. I'm trying to wrote parsing script using python/scrapy. How can I remove [] and u' from strings in result file?
Now I have text like this:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.markup import remove_tags
from googleparser.items import GoogleparserItem
import sys
class GoogleparserSpider(BaseSpider):
name = "google.com"
allowed_domains = ["google.com"]
start_urls = [
"http://www.google.com/search?q=this+is+first+test&num=20&hl=uk&start=0",
"http://www.google.com/search?q=this+is+second+test&num=20&hl=uk&start=0"
]
def parse(self, response):
print "===START======================================================="
hxs = HtmlXPathSelector(response)
qqq = hxs.select('/html/head/title/text()').extract()
print qqq
print "---DATA--------------------------------------------------------"
sites = hxs.select('/html/body/div[5]/div[3]/div/div/div/ol/li/h3')
i = 1
items = []
for site in sites:
try:
item = GoogleparserItem()
title1 = site.select('a').extract()
title2=str(title1)
title=remove_tags(title2)
link=site.select('a/@href').extract()
item['num'] = i
item['title'] = title
item['link'] = link
i= i+1
items.append(item)
except:
print 'EXCEPTION'
return items
print "===END========================================================="
SPIDER = GoogleparserSpider()
and I have result like this after running
python scrapy-ctl.py crawl google.com
2010-07-25 17:44:44+0300 [-] Log opened.
2010-07-25 17:44:44+0300 [googleparser] DEBUG: Enabled extensions: CoreStats, CloseSpider, WebService, TelnetConsole, MemoryUsage
2010-07-25 17:44:44+0300 [googleparser] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2010-07-25 17:44:44+0300 [googleparser] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloaderStats, UserAgentMiddleware, RedirectMiddleware, DefaultHeadersMiddleware, CookiesMiddleware, HttpCompressionMiddleware, RetryMiddleware
2010-07-25 17:44:44+0300 [googleparser] DEBUG: Enabled spider middlewares: UrlLengthMiddleware, HttpErrorMiddleware, RefererMiddleware, OffsiteMiddleware, DepthMiddleware
2010-07-25 17:44:44+0300 [googleparser] DEBUG: Enabled item pipelines: CsvWriterPipeline
2010-07-25 17:44:44+0300 [-] scrapy.webservice.WebService starting on 6080
2010-07-25 17:44:44+0300 [-] scrapy.telnet.TelnetConsole starting on 6023
2010-07-25 17:44:44+0300 [google.com] INFO: Spider opened
2010-07-25 17:44:45+0300 [google.com] DEBUG: Crawled (200) <GET http://www.google.com/search?q=this+is+first+test&num=20&hl=uk&start=0> (referer: None)
===START=======================================================
[u'this is first test - \u041f\u043e\u0448\u0443\u043a Google']
---DATA--------------------------------------------------------
2010-07-25 17:52:42+0300 [google.com] DEBUG: Scraped GoogleparserItem(num=1, link=[u'http://www.amazon.com/First-Protector-Small-Tamora-Pierce/dp/0679889175'], title=u"[u'Amazon.com: First Test (Protector of the Small) (9780679889175 ...']") in <http://www.google.com/search?q=this+is+first+test&num=100&hl=uk&start=0>
and this text in file:
1,[u'Amazon.com: First Test (Protector of the Small) (9780679889175 ...'],[u'http://www.amazon.com/First-Protector-Small-Tamora-Pierce/dp/0679889175']