Are there any examples, tips, or guidance for the following scenario?

I have retrieved updates from several different news websites. I then want to analyse that information to predict current trends in the world.

When searching for this idea, I could only find information on data mining, but that is aimed at database systems. Data mining is similar to what I am trying to do, but the information in a database is more structured than what I have retrieved from websites. Could someone guide me on this aspect? I would really appreciate any help you can give.

Thanks.

A: 

First of all, you need some training data from the past: a collection of old news together with the state of the trend you want to analyze at different points in time.

Then, you have to decide how to quantify this information. If the trend is something like "Sold mobile phones", you can just take the number of phones sold. The news are harder to quantify. For example, you could measure the word frequency in the training news and take the n least frequent words as features (similar to spam filters).
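
As a rough illustration of the word-frequency idea, here is a minimal sketch using only the Python standard library. The headlines are invented, and the helper names (tokenize, select_vocabulary, to_feature_vector) are hypothetical; the selection rule simply follows the "n least frequent words" suggestion above.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def select_vocabulary(documents, n):
    """Pick the n least frequent words across all documents.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(word for doc in documents for word in tokenize(doc))
    ranked = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return [word for word, _ in ranked[:n]]

def to_feature_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokenize(document))
    return [counts[word] for word in vocabulary]

# Hypothetical training headlines, invented for illustration only.
news = [
    "New phone launch boosts phone sales",
    "Phone sales fall after recall",
    "Record phone sales after launch",
]

vocabulary = select_vocabulary(news, n=4)
features = [to_feature_vector(doc, vocabulary) for doc in news]

print(vocabulary)  # ['boosts', 'fall', 'new', 'recall']
print(features)    # [[1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 0, 0]]
```

Note that the third headline ends up with an all-zero vector, because none of its words made it into the vocabulary. With real data you would probably want a larger n, or a minimum count so that words seen only once are ignored.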

After that, you train a classifier on these features and the corresponding trend values from the past. A good choice is the "Random Forest" algorithm, since it usually works well with little parameter tuning.
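
A minimal sketch of this training step, assuming scikit-learn is installed and that the trend has been reduced to a binary label (1 = rose, 0 = fell). The feature rows and labels are made up for illustration; real ones would come from your own news and trend history.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical word-count features, one row per past time period.
X_train = [
    [3, 0, 1],
    [2, 0, 0],
    [3, 1, 0],
    [0, 2, 1],
    [0, 3, 0],
    [1, 3, 1],
]
# Hypothetical trend labels for the same periods: 1 = rose, 0 = fell.
y_train = [1, 1, 1, 0, 0, 0]

# Default settings apart from a fixed seed for reproducibility.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Feature vectors for new, unseen periods (also invented).
X_new = [[2, 0, 1], [0, 2, 0]]
predictions = model.predict(X_new)          # one label per row
probabilities = model.predict_proba(X_new)  # one row of class probabilities per input

print(predictions)
print(probabilities)
```

If the trend is a number (such as units sold) rather than a class, RandomForestRegressor from the same module would be the analogous choice.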

You will need a lot of background knowledge to actually implement this plan. "The Elements of Statistical Learning" by Hastie, Tibshirani and Friedman is a good book to learn from. It can be downloaded for free from the authors' homepage.

Bernhard Kausler
"The news are harder to quantify." That's the core of the problem: finding a way to quantify how likely a trend is to be picked up, or how much each piece of information found in the news impacts each trend.
Gastoni
A: 

If you are looking for data extraction algorithms, you should check out cluster analysis and "non-negative matrix factorization".
You can extract general topics with these. Getting the current trend from the topics is relatively easy.
But predicting which (if any) of the other topics will become the next trend calls for magic or neural nets.
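
To make the topic extraction part concrete, here is a small sketch using scikit-learn's NMF on TF-IDF vectors. The headlines are invented, the number of topics (2) is an arbitrary assumption, and treating the topic with the largest total weight as the "current trend" is just one simple heuristic, not a standard definition.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical headlines, invented for illustration only.
headlines = [
    "Phone maker reports record phone sales",
    "New phone launch lifts sales forecast",
    "Phone sales grow in emerging markets",
    "Election campaign enters final week",
    "Candidates clash in election debate",
    "Voters prepare for election day",
]

# Turn each headline into a TF-IDF weighted word vector.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(headlines)

# Factorize into document-topic and topic-word matrices.
nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
doc_topics = nmf.fit_transform(tfidf)  # shape: (n_documents, n_topics)
topic_words = nmf.components_          # shape: (n_topics, n_words)

# Invert the word -> column mapping so columns can be named.
index_to_word = {index: word for word, index in vectorizer.vocabulary_.items()}

# The three highest-weighted words describe each topic.
top_words = []
for weights in topic_words:
    top = weights.argsort()[::-1][:3]
    top_words.append([index_to_word[i] for i in top])

# Heuristic: the topic with the largest total weight dominates the news.
dominant_topic = int(doc_topics.sum(axis=0).argmax())

print(top_words)
print(dominant_topic)
```

Running this over successive time windows and comparing the topic weights would show which topics are growing, but that only describes the present; it does not predict the next trend.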

David R