Hi,
I need to write a program to scrape forums.
Should I write the program in Python using the Scrapy framework, or should I use PHP cURL? Also, is there a PHP equivalent to Scrapy?
Thanks
I wouldn't use PHP for a new application that I'm writing. I don't like the language for various reasons.
Also, its strength is as a server-side scripting language for delivering dynamic pages over the web, not as a general-purpose programming language. That's another minus point. I'd stick with Python.
As for which framework to use, there are lots of them around: Harvestman, Scrapy, etc. There's also the 80legs cloud-based crawler that you might be able to use.
Update: People have been downvoting this answer, probably because I said I didn't like PHP. Here's a list of reasons why. It's not entirely accurate, but it's a decent summary nevertheless: http://wiki.python.org/moin/PythonVsPhp
I would choose Python because of its superior libxml2 bindings, specifically things like lxml.html and pyquery. Scrapy has its own libxml2 bindings. I haven't tested them, and skimming the Scrapy documentation didn't leave me very impressed (I've done lots of scraping using just these parsers and manual coding). With any of these you get a truly superior HTML parser and querying via XPath, and with lxml.html and pyquery (which is also built on lxml) you get CSS selectors as well.
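To give a rough idea of what parsing with lxml.html and XPath looks like, here is a minimal sketch. The markup in `SAMPLE_HTML` (the `post`, `author` and `body` class names) is invented for illustration, as is the `extract_posts` helper; a real forum will need its own XPath expressions.

```python
import lxml.html

# Hypothetical forum markup, invented for illustration only.
SAMPLE_HTML = """
<html><body>
  <div class="post"><span class="author">alice</span><p class="body">First post</p></div>
  <div class="post"><span class="author">bob</span><p class="body">A reply</p></div>
</body></html>
"""


def extract_posts(html_text):
    """Return a list of (author, body) tuples from a page of forum HTML."""
    doc = lxml.html.fromstring(html_text)
    posts = []
    # Select every post container, then query relative to each one.
    for post in doc.xpath('//div[@class="post"]'):
        author = post.xpath('string(.//span[@class="author"])').strip()
        body = post.xpath('string(.//p[@class="body"])').strip()
        posts.append((author, body))
    return posts


if __name__ == "__main__":
    for author, body in extract_posts(SAMPLE_HTML):
        print(author, "-", body)
```

If you prefer CSS selectors, `doc.cssselect("div.post")` should select the same elements, but note that lxml needs the separate `cssselect` package installed for that method to work.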
If you are doing a small job scraping a forum, I'd skip a scraping framework and just do it by hand: it's easy, and parallelizing etc. is not really needed.
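As a sketch of what "by hand" might look like, the loop below fetches one page at a time and follows "next page" links sequentially, with no framework and no parallelism. The names `crawl_thread`, `parse_page`, `delay` and `max_pages` are all assumptions made for this example; `parse_page` is a function you would write yourself (for example on top of lxml.html) that takes the page text and returns the posts found plus the URL of the next page, or `None` on the last page.

```python
import time
import urllib.request


def fetch(url):
    """Download a page and return its text (assumes UTF-8; real forums vary)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_thread(start_url, parse_page, fetch=fetch, delay=1.0, max_pages=50):
    """Follow 'next page' links one at a time, collecting posts.

    parse_page(text) must return (posts_on_page, next_url_or_None).
    """
    posts = []
    seen = set()  # guards against pagination loops
    url = start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page_posts, next_url = parse_page(fetch(url))
        posts.extend(page_posts)
        url = next_url
        if url:
            time.sleep(delay)  # be polite to the forum's server
    return posts
```

Passing `fetch` in as a parameter is just a convenience so the loop can be tried against canned pages before pointing it at a live site.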