Need to retrieve content from specific news sources / blogs etc. Third party software, or build my own? | ansaurus

Q:

Hi guys,

Looking for some guidance. In a nutshell, I have a requirement to fetch article content from specific sources for data analysis. We need to get the latest articles and store them in our database for later processing.

I'm not really sure of the best approach. Our current news retrieval code (for a newsfeed provider) is written in C and runs on UNIX. It basically uses cURL to fetch the XML, then parses it for storage in a database.

But the solution I need now is different, since every website is obviously different. Basically, I just want a cron job that calls something to fetch the latest articles from the relevant website as required.

Any ideas appreciated. I'm also currently looking at AutomationAnywhere as a possible quick solution, if it works for us.

Thanks!

Manoj

A:

iMacros is a good solution for web scraping.

You can run iMacros for Firefox (free/open-source) on Linux and control it via the command line.

On Windows you can also use the paid Scripting Edition, which adds extraction wizards, support for Flash automation, etc.

FrankJK 2010-10-22 21:24:45

A:

Take a look at the IRobotSoft visual web scraper. It will give you a quick start.

seagulf 2010-10-23 19:52:39

A:

Use their RSS feeds.

Plumo 2010-10-26 04:26:41
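
For sources that do publish a feed, the RSS suggestion above could be sketched roughly as follows. This is only an illustration, not anything taken from the answers: it assumes plain RSS 2.0 feeds, uses Python's standard library rather than the C/cURL stack mentioned in the question, and stores into SQLite as a stand-in for whatever database is actually in use. The function names, the `articles` table and its columns are all hypothetical.

```python
import sqlite3
import urllib.request
import xml.etree.ElementTree as ET


def fetch_feed(url, timeout=30):
    """Download a feed and return the raw bytes (the XML parser handles encoding)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def parse_rss(xml_data):
    """Return a list of dicts (title, link, published) from RSS 2.0 data."""
    root = ET.fromstring(xml_data)
    articles = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue  # no link means nothing usable as a unique key
        articles.append({
            "title": (item.findtext("title") or "").strip(),
            "link": link,
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return articles


def store_articles(conn, articles):
    """Insert articles, skipping links already stored; return the number of new rows."""
    # Hypothetical schema: the link doubles as the primary key for de-duplication.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "link TEXT PRIMARY KEY, title TEXT, published TEXT)"
    )
    before = conn.total_changes
    for a in articles:
        conn.execute(
            "INSERT OR IGNORE INTO articles (link, title, published) "
            "VALUES (?, ?, ?)",
            (a["link"], a["title"], a["published"]),
        )
    conn.commit()
    return conn.total_changes - before


def run(feed_url, db_path):
    """One polling pass: fetch, parse, store. Meant to be called from a cron job."""
    conn = sqlite3.connect(db_path)
    try:
        return store_articles(conn, parse_rss(fetch_feed(feed_url)))
    finally:
        conn.close()
```

A cron entry would then call `run()` once per feed URL on whatever schedule suits. Because the link is the primary key, repeated runs only add articles that were not seen before. Sites without a feed would still need a scraper such as the tools suggested in the other answers.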
