I'm writing some code that calculates certain statistics about word usage.

Does anyone know where I can find a database of raw news articles on various topics covering (say) the last year? Preferably they would be in either plain text or XML. Trying to scrape content from random web sites isn't a good option.

I know going forward I could probably archive them myself. However, I need to kick-start the process with a bunch of existing articles... the more the merrier.

Any other ideas for corpus datasets that are easily available in a simple-to-parse form would also be appreciated.

A: 

You might try the Internet Archive. They have a text section, but I don't know if it has news. You might also be able to use their Wayback Machine to pull up news articles from major sites using their RSS feeds.

DMKing
Thanks, those are nice ideas. To be honest I was a bit surprised not to have immediately found a raw dump of news articles ready to go just by Googling. I guess it must be copyright related... but then when did that ever stop anyone.
octonion
Someone else on the programming subreddit also suggested WikiNews. For what I'm doing, that might actually be more appropriate right now. Now I just need to figure out how to extract the articles from MediaWiki XML; hopefully that shouldn't be too hard.
octonion
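
For anyone attempting the same MediaWiki XML extraction, here is a minimal sketch in Python using only the standard library. It assumes the usual export layout of `<page>` elements containing a `<title>` and a `<revision><text>`. The `SAMPLE` document, its namespace version, and the names `local_name` and `iter_articles` are made up for illustration, so check them against a real dump. The yielded text is still raw wiki markup, not plain text.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature export, standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example headline</title>
    <revision><text>Some '''wiki''' text.</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(source):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export.

    `source` may be a filename or a binary file object. If a page holds
    several revisions, the text of the last one in the file is returned.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # release the page's subtree to keep memory use low


for title, text in iter_articles(io.BytesIO(SAMPLE)):
    print(title, len(text))
```

Parsing incrementally with `iterparse` rather than loading the whole tree should matter for full dumps, which can be large. Passing a path to the dump file instead of the `BytesIO` object works the same way.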