ansaurus

Question

Newbie: where to start given a problem to predict future success or not

Answer 1

+3 A:

Features

The first thing you'll need to do is decide what information you'll use as evidence to classify a user's prediction as being accurate or not. For example, you could start with simple stuff like the identity of the user making the prediction, and their historical accuracy when making predictions on the same or similar goods. This information will be provided to downstream machine learning tools as features that will be used to classify the users' predictions.
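As a rough illustration of the kind of features described above, here is a minimal sketch in Python. The record layout (user, good, was_correct), the feature names, and the 0.5 fallback for users with no history are all assumptions made for the example, not something given in the question.

```python
def build_features(history, user, good):
    """Return a feature dict for a new prediction by `user` about `good`.

    `history` is a list of (user, good, was_correct) tuples describing past,
    already-resolved predictions. This record layout is hypothetical.
    """
    user_total = user_correct = 0
    good_total = good_correct = 0
    for past_user, past_good, was_correct in history:
        if past_user != user:
            continue
        user_total += 1
        user_correct += was_correct
        if past_good == good:
            good_total += 1
            good_correct += was_correct
    return {
        "user_id": user,
        # Fall back to 0.5 (no information) when there is no history.
        "user_accuracy": user_correct / user_total if user_total else 0.5,
        "user_accuracy_on_good": good_correct / good_total if good_total else 0.5,
        "user_prediction_count": user_total,
    }


history = [
    ("alice", "widgets", True),
    ("alice", "widgets", False),
    ("alice", "gadgets", True),
    ("bob", "widgets", False),
]
features = build_features(history, "alice", "widgets")
print(features)
```

When you compute features like these for a historical example, use only predictions that were resolved before that example was made. Otherwise the outcome you are trying to predict leaks into the features.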

Training, Development, and Test Data

You'll want to split your 100k historical examples into three parts: training, development, and test. You should put most of the data, say 80% of it, in your training set. This will be the dataset you use to train your prediction accuracy classifier. Generally speaking, the more data you use to train your classifier, the more accurate the resulting model will be.
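A three-way split could be sketched like this in plain Python. The 80% training share comes from the text above; dividing the remainder evenly into 10% development and 10% test is an assumption, as is the fixed shuffle seed.

```python
import random


def split_dataset(examples, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle `examples` and split them into (train, dev, test) lists.

    Whatever is left after the train and dev shares goes to the test set.
    """
    shuffled = list(examples)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


# Stand-in for the 100k historical examples.
examples = list(range(100_000))
train, dev, test = split_dataset(examples)
print(len(train), len(dev), len(test))
```

If your features depend on each user's history, a split by time (oldest examples for training, newest for test) may be a better fit than a random shuffle.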

The two other data sets, development and test, will be used to evaluate the performance of your classifier. You'll use the development set to evaluate the accuracy of different configurations of your classifier or variations in the feature representation. It's called the development set since you use it to continuously evaluate classification performance as you develop your model or system.

Later, after you've built a model that achieves good performance on the development data, you'll probably want an unbiased estimate of how well your classifier will perform on new data. For this, you'll use the test set to evaluate how well the classifier does on data other than what you used to develop it.
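The workflow of the last two paragraphs can be sketched as follows. The threshold "classifier" is only a toy stand-in for a real trained model, and the tiny hand-made data sets and the `user_accuracy` feature name are assumptions for illustration.

```python
def accuracy(classify, examples):
    """Fraction of (features, label) pairs that `classify` labels correctly."""
    correct = sum(classify(features) == label for features, label in examples)
    return correct / len(examples)


def make_threshold_classifier(threshold):
    # Toy stand-in for a trained model: call a prediction "accurate" when the
    # user's historical accuracy is at or above the threshold.
    return lambda features: features["user_accuracy"] >= threshold


dev_set = [
    ({"user_accuracy": 0.9}, True),
    ({"user_accuracy": 0.7}, True),
    ({"user_accuracy": 0.6}, False),
    ({"user_accuracy": 0.3}, False),
]
test_set = [
    ({"user_accuracy": 0.8}, True),
    ({"user_accuracy": 0.65}, False),
    ({"user_accuracy": 0.4}, False),
    ({"user_accuracy": 0.2}, True),
]

# Compare configurations on the development set only.
candidates = [0.25, 0.5, 0.65, 0.75]
dev_scores = {t: accuracy(make_threshold_classifier(t), dev_set) for t in candidates}
best_threshold = max(dev_scores, key=dev_scores.get)

# Touch the test set once, at the end, with the chosen configuration.
test_accuracy = accuracy(make_threshold_classifier(best_threshold), test_set)
print(best_threshold, dev_scores[best_threshold], test_accuracy)
```

In this toy run the chosen configuration scores higher on the development set than on the test set. That gap is the reason to keep a separate test set: the development score is optimistic because the configuration was picked to maximize it.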

Classifier/ML Packages

After you have your preliminary feature set and you've split the data into training, development, and test, you're ready to choose a machine learning package and classifier. A few good packages that support numerous types of classifiers include:

Weka (Java)
RapidMiner (Java)
Orange (Python)

Which classifier you should use depends on many factors including what kind of predictions you'd like to make (e.g., binary, multiclass), what kinds of features you'd like to use, and the amount of training data you want to use.

For example, if you just want to make a binary classification of whether a user's prediction is probably accurate or not, you might want to try support vector machines (SVMs). Their basic formulation is limited to making binary predictions. But if that is all you need, they are often a good choice, since they can result in very accurate models.

However, the time required to train an SVM scales poorly with the size of the training data. To train on substantial amounts of data, you might decide to use something like random forests. When random forests and SVMs are trained on data sets of the same size, random forests will typically produce a model that is either as accurate or nearly as accurate as an SVM model. However, random forests can allow you to use more training data, and using more training data will typically increase the accuracy of your model.
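If you want to try both classifiers side by side, a minimal sketch could look like the following. It assumes scikit-learn, which is not one of the packages listed above, and uses synthetic data from `make_classification` in place of your real features. The hyperparameters are just plausible starting values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data standing in for real features/labels.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "svm": SVC(kernel="rbf"),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Train each model on the training set and score it on the development set.
dev_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    dev_accuracy[name] = model.score(X_dev, y_dev)
    print(name, round(dev_accuracy[name], 3))
```

On a data set this small, both models train almost instantly. The difference in training time only becomes noticeable as the number of examples grows.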

Digging Deeper

Here are a few pointers to other good places to get started with machine learning:

Video Lectures from Andrew Ng's machine learning course at Stanford
Andrew Moore's machine learning tutorials
Hastie, Tibshirani, and Friedman's The Elements of Statistical Learning - the authors have posted a free PDF of the book online.

dmcer 2010-09-24 23:01:02

Thank you. I had been following Andrew Ng's lectures, understanding them bit by bit, but Andrew's tutorials have been very informative. Thank you...

akaphenom 2010-09-27 14:06:51
