I'm looking for an aggregator for the editorial and op-ed pages of a bunch of English-language newspapers I want to follow. The objective is to generate a single HTML page that is just a collection of editorial pieces from the dozen or so international newspapers I follow, so that I can print it off in the morning. Since this is a very narrow requirement, I couldn't find anything already available, so I'm thinking of writing one on my own.

Now, I was a programmer for ~8 years in my previous life (and have since been swayed to the "Dark Side" that is Wall Street after my MBA). I no longer know enough about programming to make a good choice of scripting language, so I'm unsure which language would be best for this. Performance is not a key issue; libraries for parsing HTML, handling text and fetching data from live web pages are more important.

PS: I don't mind learning a new language. Previously I worked extensively with x86 ASM, C and Visual C++/MFC, almost exclusively in Win32 environments.

+1  A: 

Interpreted languages do well with code generation; you should think about Perl or Ruby.

+1  A: 

Use Python and the excellent lxml library for scraping HTML. It supports CSS selectors, which is a huge convenience, and it's rather fast. It handles broken HTML well too.
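To give a rough idea of what that looks like, here is a minimal sketch using lxml.html. The sample markup, the "editorial" class name and the function names are all made up for illustration; every newspaper's real markup will differ and needs its own selector. I've used XPath here rather than CSS selectors, because in current lxml releases the CSS selector support needs the separate cssselect package installed. The second paragraph tag is left unclosed on purpose to show that sloppy HTML still parses.

```python
from html import escape

import lxml.html

# Invented sample page; a real one would be fetched first,
# for example with urllib.request.urlopen(url).read().
SAMPLE_PAGE = """
<html><body>
  <div class="editorial">
    <h2>First headline</h2>
    <p>Opening paragraph.
    <p>Second paragraph, unclosed on purpose.
  </div>
  <div class="sidebar"><p>Unrelated link list</p></div>
</body></html>
"""


def extract_editorials(page_source):
    """Return a list of (headline, [paragraph, ...]) tuples.

    Assumes each piece lives in <div class="editorial">, which is a
    hypothetical layout chosen for this example.
    """
    root = lxml.html.fromstring(page_source)
    pieces = []
    for block in root.xpath('//div[@class="editorial"]'):
        headline = block.xpath("string(.//h2)").strip()
        paragraphs = [p.text_content().strip() for p in block.xpath(".//p")]
        pieces.append((headline, paragraphs))
    return pieces


def render_digest(pieces):
    """Join the extracted pieces into one plain HTML page for printing."""
    parts = ["<html><body>"]
    for headline, paragraphs in pieces:
        parts.append("<h2>%s</h2>" % escape(headline))
        parts.extend("<p>%s</p>" % escape(text) for text in paragraphs)
    parts.append("</body></html>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(render_digest(extract_editorials(SAMPLE_PAGE)))
```

For a dozen papers you would probably keep a small table of URL plus selector per site and loop over it, appending everything to one digest page.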

Wahnfrieden
Make sure you look at the lxml.html module. The documentation can be a little confusing, so just try playing around with it in an interactive Python shell - that's how I learned to use it.
Wahnfrieden