Hi guys,

Looking for some guidance. In a nutshell, I've got a requirement to pull article content from specific websites for data analysis. We need to fetch the latest articles and store them in our database for processing later on.

I'm not really sure of the best approach. Our current news retrieval code (from a newsfeed provider) is written in C on UNIX; basically it uses CURL to fetch the feed and parses the XML for storage in a database.

But the solution I need now is different, since every website is obviously different. Basically I just want a cron job that calls something to pull the latest articles from the relevant website as required.
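To make it concrete, here's a minimal sketch of the kind of per-site fetcher I have in mind, in C with libcurl like our existing code. The URL and output file are placeholders; the real version would parse the page and write to the database instead of a file:

    /*
     * fetch_page.c -- a minimal sketch of the per-site fetcher a cron job
     * could call. Assumes libcurl; the URL and output path are placeholders.
     * Build: gcc fetch_page.c -lcurl -o fetch_page
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <curl/curl.h>

    /* libcurl write callback: append each received chunk to the open file. */
    static size_t write_cb(char *data, size_t size, size_t nmemb, void *userp)
    {
        /* size is 1 per the libcurl docs, so items written == bytes handled */
        return fwrite(data, size, nmemb, (FILE *)userp) * size;
    }

    int main(void)
    {
        const char *url = "http://www.example.com/news/latest"; /* placeholder */
        FILE *out = fopen("latest.html", "wb");
        CURL *curl;
        CURLcode res = CURLE_FAILED_INIT;

        if (!out)
            return EXIT_FAILURE;

        curl_global_init(CURL_GLOBAL_DEFAULT);
        curl = curl_easy_init();
        if (curl) {
            curl_easy_setopt(curl, CURLOPT_URL, url);
            curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
            curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
            curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);
            res = curl_easy_perform(curl);
            if (res != CURLE_OK)
                fprintf(stderr, "fetch failed: %s\n", curl_easy_strerror(res));
            curl_easy_cleanup(curl);
        }
        curl_global_cleanup();
        fclose(out);
        return (res == CURLE_OK) ? EXIT_SUCCESS : EXIT_FAILURE;
    }

A crontab entry could then run one of these per site, e.g. hourly:

    0 * * * * /usr/local/bin/fetch_page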

Any ideas appreciated. I'm also currently looking at AutomationAnywhere as a possible quick solution if it works for us.

Thanks!

Manoj

A: 

iMacros is a good solution for web scraping.

You can run iMacros for Firefox (free/open-source) on Linux and control it via the command line.

On Windows you can also use the paid Scripting Edition, which gives you extraction wizards, support for Flash automation, etc.

FrankJK
A: 

Take a look at the IRobotSoft visual web scraper. It will give you a quick start.

seagulf
A: 

Use their RSS feeds where available. A feed gives you structured XML, which fits straight into your existing CURL-and-XML pipeline and is far less fragile than scraping each site's HTML.
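For example, assuming a site exposes a standard RSS 2.0 feed, a small libxml2 parser (a sketch only; feed.xml stands in for whatever your cron job downloaded) can pull out the latest titles and links:

    /*
     * parse_rss.c -- sketch: print the <title> and <link> of each <item>
     * in an RSS 2.0 feed with libxml2. Assumes the feed was already
     * fetched (e.g. with CURL) to feed.xml.
     * Build: gcc parse_rss.c $(xml2-config --cflags --libs) -o parse_rss
     */
    #include <stdio.h>
    #include <libxml/parser.h>
    #include <libxml/tree.h>

    /* Walk the tree and print the title and link of every <item>. */
    static void print_items(xmlNode *node)
    {
        xmlNode *cur;
        for (cur = node; cur != NULL; cur = cur->next) {
            if (cur->type == XML_ELEMENT_NODE &&
                xmlStrcmp(cur->name, (const xmlChar *)"item") == 0) {
                xmlNode *field;
                for (field = cur->children; field; field = field->next) {
                    if (field->type != XML_ELEMENT_NODE)
                        continue;
                    xmlChar *text = xmlNodeGetContent(field);
                    if (xmlStrcmp(field->name, (const xmlChar *)"title") == 0)
                        printf("title: %s\n", (char *)text);
                    else if (xmlStrcmp(field->name, (const xmlChar *)"link") == 0)
                        printf("link:  %s\n", (char *)text);
                    xmlFree(text);
                }
            }
            print_items(cur->children); /* descend into <channel> etc. */
        }
    }

    int main(void)
    {
        xmlDoc *doc = xmlReadFile("feed.xml", NULL, 0);
        if (doc == NULL) {
            fprintf(stderr, "could not parse feed.xml\n");
            return 1;
        }
        print_items(xmlDocGetRootElement(doc));
        xmlFreeDoc(doc);
        xmlCleanupParser();
        return 0;
    }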

Plumo