We're designing a large-scale web scraping/parsing project. Basically, the script needs to go through a list of web pages, extract the contents of a particular tag, and store it in a database. What language would you recommend for doing this at large scale (tens of millions of pages)?
We're using MongoDB for the database, so anything with solid MongoDB drivers is a plus.
So far, we have been using (don't laugh) PHP, curl, and Simple HTML DOM Parser, but I don't think that scales to millions of pages, especially since PHP doesn't have proper multithreading.
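For reference, here's roughly what the current version looks like (heavily simplified; `urls.txt`, the `scraper.pages` collection, and the `<title>` tag are placeholders for illustration, and it assumes `simple_html_dom.php` is on the include path). It fetches pages strictly one at a time, which is exactly the bottleneck:

```php
<?php
// Simplified sketch of the current approach: fetch each URL with curl,
// parse with Simple HTML DOM Parser, store the extracted tag in MongoDB.
require_once 'simple_html_dom.php';

$mongo      = new MongoClient();          // legacy PHP MongoDB driver
$collection = $mongo->scraper->pages;     // placeholder db/collection names

$urls = file('urls.txt', FILE_IGNORE_NEW_LINES);  // one URL per line

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $body = curl_exec($ch);
    curl_close($ch);
    if ($body === false) {
        continue;                         // skip failed downloads
    }

    $dom  = str_get_html($body);
    $node = $dom ? $dom->find('title', 0) : null;  // '<title>' stands in for the real tag
    if ($node) {
        $collection->insert(array(
            'url'     => $url,
            'content' => $node->plaintext,
        ));
    }
    if ($dom) {
        $dom->clear();                    // free parser memory between pages
    }
}
```

Since everything is sequential, throughput is bounded by one request's round-trip time at a time; at tens of millions of pages that's unworkable, which is why we're looking at languages with real concurrency.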
We need something that is easy to develop in, can run on a Linux server, has a robust HTML/DOM parser to easily extract that tag, and can download millions of web pages in a reasonable amount of time. We're not really looking for a web crawler, because we don't need to follow links and index all content; we just need to extract one tag from each page on a list.