Hi,

I need to write a program to scrape forums.

Should I write the program in Python using the Scrapy framework, or should I use PHP with cURL? Also, is there a PHP equivalent to Scrapy?

Thanks

+1  A: 

I wouldn't use PHP for a new application that I'm writing. I don't like the language for various reasons.

Also, its strength is as a server-side scripting language for delivering dynamic pages over the web, not as a general-purpose programming language. That's another minus point. I'd stick with Python.

As for which framework to use, there are lots of them around: Harvestman, Scrapy, etc. There's also the 80legs cloud-based crawler that you might be able to use.

Update: People have been downvoting this answer, probably because I said I didn't like PHP. Here's a list of reasons why. It's not entirely accurate, but it's a decent summary nevertheless: http://wiki.python.org/moin/PythonVsPhp

Noufal Ibrahim
Are you kidding? Just because you don't like PHP doesn't mean PHP wouldn't be perfect for this. You can get the page with cURL and then just scrape it with the DOMDocument class. I have done similar things before.
AntonioCS
My dislike of PHP is due to language issues: poor OO support, poor namespacing support, a less than optimal type system, etc. Both languages are Turing complete, so you *can* do this with either. That doesn't make PHP a better choice. Also, large-scale scraping/harvesting is a tad more complex than fetching a URL and using a parser on it, as I'm sure you'll appreciate.
Noufal Ibrahim
Noufal, can you expand on this: "Large scale scraping/harvesting is a tad more complex than fetching a URL and using a parser on it as I'm sure you'll appreciate"? Thanks.
seanieb
From reading a bit about DOMDocument, it doesn't look like a particularly high-quality parser; parsers like html5lib, lxml/libxml2 and BeautifulSoup all parse quite closely to how a browser parses a page. Depending on the software that generates the HTML, you may or may not have problems, but in my experience parse problems are a real drag when scraping.
Ian Bicking
+3  A: 

I would choose Python due to its superior libxml2 bindings, specifically things like lxml.html and pyquery. Scrapy has its own libxml2 bindings. I haven't looked at them to test them, though skimming the Scrapy documentation didn't leave me very impressed (I've done lots of scraping just using these parsers and manual coding). With any of these you get a truly superior HTML parser and querying via XPath, and with lxml.html and pyquery (also built on lxml) you get CSS selectors.
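As a rough sketch of the XPath querying mentioned above, here is what pulling posts out of a page with lxml.html might look like. The markup, the class names (`post`, `author`, `body`) and the `extract_posts` helper are invented for illustration; a real forum's HTML will differ, so the XPath expressions would need adjusting.

```python
import lxml.html

# Hypothetical forum markup, invented for illustration only.
SAMPLE_HTML = """
<html><body>
  <div class="post"><span class="author">alice</span><p class="body">First post</p></div>
  <div class="post"><span class="author">bob</span><p class="body">Second <b>post</b></p></div>
</body></html>
"""


def extract_posts(html):
    """Return a list of (author, body) tuples found in the page source."""
    doc = lxml.html.fromstring(html)
    posts = []
    # One XPath query selects every post container on the page.
    for node in doc.xpath('//div[@class="post"]'):
        # string(...) flattens nested tags (such as <b>) into plain text.
        author = node.xpath('string(.//span[@class="author"])').strip()
        body = node.xpath('string(.//p[@class="body"])').strip()
        posts.append((author, body))
    return posts


if __name__ == "__main__":
    for author, body in extract_posts(SAMPLE_HTML):
        print(author, "-", body)
```

The same selection can be written with CSS selectors through `doc.cssselect("div.post")`, which, as far as I know, needs the separate `cssselect` package installed alongside lxml.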

If you are doing a small job scraping a forum, I'd skip a scraping framework and just do it by hand -- it's easy, and parallelizing, etc., is not really needed.
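To give an idea of what "by hand" could mean here, this is a minimal sequential sketch: fetch a page, parse it, follow the "next page" link, repeat. It assumes the forum marks its pagination link with `rel="next"`, which many do not; `crawl_thread`, `fetch`, the delay and the page cap are all made-up names and values for illustration.

```python
import time
import urllib.request
from urllib.parse import urljoin

import lxml.html


def fetch(url):
    """Download a page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def crawl_thread(start_url, fetch_page=fetch, delay=1.0, max_pages=50):
    """Follow rel="next" links one page at a time.

    Returns a list of (url, parsed_document) pairs in visiting order.
    """
    pages = []
    seen = set()
    url = start_url
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        doc = lxml.html.fromstring(fetch_page(url))
        pages.append((url, doc))
        # Assumption: the pagination link carries rel="next".
        next_links = doc.xpath('//a[@rel="next"]/@href')
        url = urljoin(url, next_links[0]) if next_links else None
        if url:
            time.sleep(delay)  # be polite; no parallelism needed for a small job
    return pages

# Example (hypothetical URL):
# pages = crawl_thread("http://forum.example/thread/1")
```

Passing `fetch_page` in as a parameter keeps the loop easy to try against saved pages before pointing it at a live site.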

Ian Bicking