I think there is a wealth of natural language data on sites like Reddit, Digg, or news.google.com.

I have done a little research on text mining, but I can't work out how to use those tools to parse something like Reddit.

What kind of applications can you come up with?

+1  A: 

I'd start with the RSS feeds, and after that I might use Nutch; what you actually do with the data is more your call.
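As a rough illustration of the RSS starting point, here is a minimal sketch that pulls titles and links out of an RSS 2.0 document using only Python's standard library. The sample feed and the function name `parse_rss_items` are made up for the example; a real feed would have to be downloaded first and may use a different layout (Atom, for instance).

```python
import xml.etree.ElementTree as ET

# Made-up sample feed for illustration only; a real feed would be
# fetched from the site and may contain more fields.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return a list of (title, link) pairs from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    # Each story in an RSS 2.0 feed is an <item> element under <channel>.
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        items.append((title, link))
    return items


if __name__ == "__main__":
    for title, link in parse_rss_items(SAMPLE_RSS):
        print(title, link)
```

The titles collected this way are the raw text you would then hand to whatever text mining tools you already have.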

dlamblin
+1  A: 

In my experience, the best way to mine data on sites like Reddit or Digg is to start with the developer API they provide. You usually have a focused interest in a particular topic or trend, and an established public interface is the only reliable way to get at that data. You can also parse the feeds, and combining the two should uncover about 90% of what you would want to know. If you want to do deep research on data that is not available through an API, be prepared to spend a significant amount of time writing custom wrappers around a tool like cURL. If you have the budget, you can also contact the site and ask whether they offer paid research data on users.
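One way to picture combining API results with feed entries is to merge the two sets of records on a shared key such as the story URL. The sketch below assumes both sources have already been fetched and converted to lists of dictionaries; the field names (`url`, `title`, `score`, `published`) and the function `merge_by_url` are hypothetical and do not come from any particular site's API.

```python
def merge_by_url(api_records, feed_records):
    """Merge two lists of dicts into one dict keyed by each record's 'url'.

    Values from api_records take precedence; feed_records only fill in
    fields that are missing. All field names here are hypothetical.
    """
    merged = {}
    for record in api_records:
        merged[record["url"]] = dict(record)
    for record in feed_records:
        entry = merged.setdefault(record["url"], {})
        for key, value in record.items():
            # Keep the API value when both sources supply the same field.
            entry.setdefault(key, value)
    return merged


if __name__ == "__main__":
    # Made-up records standing in for parsed API and feed responses.
    api = [{"url": "http://example.com/1", "title": "First story", "score": 42}]
    feed = [
        {"url": "http://example.com/1", "title": "First story (feed)",
         "published": "2009-01-01"},
        {"url": "http://example.com/2", "title": "Second story",
         "published": "2009-01-02"},
    ]
    for url, fields in merge_by_url(api, feed).items():
        print(url, fields)
```

Letting the API win on conflicts is just one possible choice; if the feed is fresher for some field, the precedence could be reversed.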

hal10001
A: 

These are good ideas. I can get the data, but what applications can be built around it?

Berlin Brown