I'm not interested in binaries - I care only about text.

I'd like to download several entire newsgroups, as far back in time as I can, and put it into a database for data mining/analysis. Something on the order of several hundred-thousand to a few-million text messages.

I'm something of a beginner with newsgroups, especially interacting with them programmatically, so I'm looking for opinions and advice in several areas:

  • What are the more robust NNTP libraries out there? Language doesn't matter; I'm comfortable with whatever. There's Ruby, Python, Perl, and more I'm sure (although I can't link to them all). Does anyone have opinions on these?
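For what it's worth, Python ships an NNTP client in the standard library (`nntplib`, though it was removed in Python 3.13, so a third-party package would be needed on newer versions). A minimal sketch of pulling a group down with it might look like the following; the host name is a placeholder, and the exact set of headers kept is just an assumption about what you'd want to store:

```python
def headers_to_row(art_num, over):
    """Flatten an overview dict (as returned by NNTP.over) into the
    fields we'd likely store: (article number, subject, from, date,
    message-id). Missing headers become empty strings."""
    return (
        int(art_num),
        over.get("subject", ""),
        over.get("from", ""),
        over.get("date", ""),
        over.get("message-id", ""),
    )

def fetch_group(host, group, user=None, password=None, batch=1000):
    """Yield (headers_row, body_lines) for every article in `group`,
    fetching overviews in batches to limit round trips.
    Note: nntplib is stdlib up to Python 3.12 but removed in 3.13."""
    import nntplib  # imported here so the pure helpers above work anywhere

    with nntplib.NNTP(host, user=user, password=password) as srv:
        _, count, first, last, _ = srv.group(group)
        for start in range(first, last + 1, batch):
            end = min(start + batch - 1, last)
            _, overviews = srv.over((start, end))
            for art_num, over in overviews:
                _, info = srv.body(str(art_num))
                yield headers_to_row(art_num, over), info.lines
```

Batching the overview requests (rather than one `OVER` per article) matters at the scale of millions of messages; a provider will usually also rate-limit you, so a `time.sleep` between batches may be needed in practice.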

  • I was planning on using Giganews, since they're one of the largest providers out there, if not the largest. Are there specific sources for archives of messages in the common newsgroups (like comp.lang.c) from the 80s and early 90s? Would Giganews have them?

  • What technical considerations haven't I thought of? I was planning on a standard MySQL two-table structure: group, date, select headers, and an id in one table, and the message body in the other. I'd run my download script nonstop for as long as it takes to get the entire archive (shouldn't be more than a TB; it's text), then write analysis programs that download a chunk of compressed data from a service running on the DB machine, do the processing, and send back the results.
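To make the two-table idea concrete, here's a sketch of the schema and insert path. It uses `sqlite3` purely so the snippet is self-contained; the DDL translates to MySQL almost verbatim, and the column names are my assumptions, not anything canonical:

```python
import sqlite3

# Two-table layout: lightweight headers row per message, body kept separately
# so header-only scans (group/date filters) never touch the large text.
SCHEMA = """
CREATE TABLE headers (
    id         INTEGER PRIMARY KEY,
    newsgroup  TEXT NOT NULL,
    date       TEXT,
    subject    TEXT,
    sender     TEXT,
    message_id TEXT UNIQUE        -- dedupe cross-posts / re-downloads
);
CREATE TABLE bodies (
    id   INTEGER PRIMARY KEY REFERENCES headers(id),
    body TEXT                     -- consider storing compressed bytes instead
);
CREATE INDEX idx_group_date ON headers (newsgroup, date);
"""

def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def insert_message(conn, newsgroup, date, subject, sender, message_id, body):
    """Insert one message across both tables; returns the shared row id."""
    cur = conn.execute(
        "INSERT INTO headers (newsgroup, date, subject, sender, message_id) "
        "VALUES (?, ?, ?, ?, ?)",
        (newsgroup, date, subject, sender, message_id),
    )
    conn.execute(
        "INSERT INTO bodies (id, body) VALUES (?, ?)", (cur.lastrowid, body)
    )
    return cur.lastrowid
```

The `UNIQUE` constraint on `message_id` is worth having regardless of engine: a multi-week download will inevitably be restarted, and it turns re-inserts into cheap failures instead of silent duplicates.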

  • Anyone know of any public projects that are or have done this, that I might be able to borrow code from?