views: 56

answers: 2

We're designing a large-scale web scraping/parsing project. Basically, the script needs to go through a list of web pages, extract the contents of a particular tag, and store it in a database. What language would you recommend for doing this on a large scale (tens of millions of pages)?

We're using MongoDB for the database, so anything with solid MongoDB drivers is a plus.

So far, we have been using (don't laugh) PHP, curl, and Simple HTML DOM Parser, but I don't think that's scalable to millions of pages, especially as PHP doesn't have proper multithreading.

We need something that is easy to develop in, can run on a Linux server, has a robust HTML/DOM parser to easily extract that tag, and can download millions of webpages in a reasonable amount of time. We're not really looking for a web crawler, because we don't need to follow links and index all content; we just need to extract one tag from each page on a list.

A: 

I do something similar using Java with the Commons HttpClient library, although I avoid a DOM parser because the specific tag I'm looking for can be found easily with a regex.

The slowest part of the operation is making the HTTP requests.

Quotidian
+2  A: 

If you're really talking about large scale, then you'll probably want something that lets you scale horizontally, e.g., a Map-Reduce framework like Hadoop. You can write Hadoop jobs in a number of languages, so you're not tied to Java. Here's an article on writing Hadoop jobs in Python, for instance. BTW, Python is probably the language I'd use anyway, thanks to libs like httplib2 for making the requests and lxml for parsing the results.
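
For illustration, here's a minimal sketch of that stack for a single page, assuming the tag can be located with an XPath expression; the URL, XPath, and MongoDB database/collection names below are placeholders, not anything from your setup:

    # Sketch only: fetch one page with httplib2, pull out one tag with an
    # lxml XPath, and store the result in MongoDB through pymongo.
    import httplib2
    from lxml import html
    from pymongo import MongoClient

    TAG_XPATH = "//title"  # placeholder: the XPath of the tag you actually need

    def scrape_one(url, collection):
        http = httplib2.Http()
        response, content = http.request(url, "GET")
        if response.status != 200:
            return
        matches = html.fromstring(content).xpath(TAG_XPATH)
        if matches:
            collection.insert_one({"url": url, "value": matches[0].text_content()})

    if __name__ == "__main__":
        pages = MongoClient()["scrape"]["pages"]  # assumes a local mongod
        scrape_one("http://example.com/", pages)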

If a Map-Reduce framework is overkill, you could keep it in Python and use multiprocessing.
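
A rough sketch of the multiprocessing route, assuming a per-URL function like scrape_one above; the stub body, pool size, and URL list are made up and would need to be filled in:

    # Sketch only: fan the per-URL work out across worker processes with
    # multiprocessing.Pool. fetch_and_extract stands in for the httplib2/lxml
    # work on a single URL and returns (url, extracted_value_or_None).
    import multiprocessing

    def fetch_and_extract(url):
        # placeholder: request the page, run the XPath, return the value
        return url, None

    if __name__ == "__main__":
        urls = ["http://example.com/page/%d" % i for i in range(1000)]  # placeholder list
        pool = multiprocessing.Pool(processes=32)  # tune to your bandwidth/CPU
        try:
            for url, value in pool.imap_unordered(fetch_and_extract, urls, chunksize=50):
                if value is not None:
                    pass  # write to MongoDB here, e.g. one connection per worker batch
        finally:
            pool.close()
            pool.join()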

UPDATE: If you don't want a Map-Reduce framework and you'd rather use a different language such as Java, check out ThreadPoolExecutor. I would definitely use the Apache Commons HTTP client, though; the HTTP support in the JDK proper is way less programmer-friendly.

Hank Gay