Hi,

I have an application that needs to discriminate between good and bad HTTP GET requests.

For example:

http://somesite.com?passes=dodgy+parameter                # BAD
http://anothersite.com?passes=a+good+parameter            # GOOD

My system can make a binary decision about whether a URL is good or bad - but ideally I would like it to predict whether a previously unseen URL is good or bad.

http://some-new-site.com?passes=a+really+dodgy+parameter # BAD

I feel the need for a support vector machine (SVM) ... but I need to learn machine learning. Some questions:

1) Is an SVM appropriate for this task?
2) Can I train it with the raw URLs, without explicitly specifying 'features'?
3) How many URLs will I need for it to be good at predictions?
4) What kind of SVM kernel should I use?
5) After I train it, how do I keep it up to date?
6) How do I test unseen URLs against the SVM to decide whether each is good or bad?

+2  A: 

If I understand correctly you just want to learn if a URL is good or bad.

An SVM is not appropriate here. SVMs are only appropriate if the dataset is very complex and many of the data points lie close to the hyperplane; you'd use an SVM to add extra dimensions to the data.

Ideally you'd want a few thousand URLs to train on. The more the better; you could do it with just 100, but the results may not produce good classifications.

I'd suggest you build your data set first and use Weka (http://www.cs.waikato.ac.nz/ml/weka/).

You can measure which algorithm gives you the best results.
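
As a rough sketch of that kind of comparison (shown here with scikit-learn rather than Weka, purely for illustration, and with made-up URLs and labels), you could cross-validate a few classifiers over character n-gram features of the raw URLs:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.linear_model import LogisticRegression

    # Hypothetical labelled URLs -- in practice you'd load a few thousand
    # examples that your existing system has already judged.
    urls = [
        "http://somesite.com?passes=dodgy+parameter",
        "http://badsite.com?passes=another+dodgy+parameter",
        "http://evil.example?q=really+dodgy",
        "http://spam.example?x=dodgy+stuff",
        "http://anothersite.com?passes=a+good+parameter",
        "http://nicesite.com?passes=a+fine+parameter",
        "http://ok.example?q=perfectly+good",
        "http://fine.example?x=good+stuff",
    ]
    labels = [0, 0, 0, 0, 1, 1, 1, 1]   # 0 = bad, 1 = good

    # Character 3- to 5-grams of the raw URL string: no hand-crafted features.
    vectoriser = CountVectorizer(analyzer="char", ngram_range=(3, 5))
    X = vectoriser.fit_transform(urls)

    # Cross-validate a few different classifiers and compare mean accuracy.
    for clf in (LinearSVC(), MultinomialNB(), LogisticRegression()):
        scores = cross_val_score(clf, X, labels, cv=2)   # use cv=10 with real data
        print(type(clf).__name__, scores.mean())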

steve
+1 for Weka. It makes it easy to try a bunch of approaches to find out what works best.
Nate Kohl
So you'd waste huge amounts of CPU cycles because the complexity is irrelevant?
steve
+2  A: 
  1. I don't agree with steve that an SVM is a bad choice here, although I also don't think there's much reason to think it will do any better than any other discriminative learning algorithm.

  2. You are going to need to at least think about designing features. This is one of the most important parts of making a machine learning algorithm work well on a particular problem. It's hard to know what to suggest without more idea of the problem; as a start, you could use counts of the character n-grams present in the URL as features (see the sketch after this list).

  3. Nobody really knows how much data you need for any specific problem. The general approach is to get some data, learn a model, see if more training data helps, and repeat until you no longer see a significant improvement.

  4. Kernels are a tricky business. Some SVM libraries have string kernels, which allow you to train on strings without any feature extraction (I'm thinking of SVMsequel; there may be others). Otherwise, you need to compute numerical or binary features from your data and use the linear, polynomial or RBF kernel. There's no harm in trying them all, and it's worth spending some time finding the best settings for the kernel parameters. Your data is also obviously structured, and there's no point in letting the learning algorithm try to figure out the structure of URLs itself (unless you care about invalid URLs). You should at least split the URL up on the separators '/', '?', '.' and '=' (see the sketch after this list).

  5. I don't know what you mean by 'keep it up to date'. Retrain the model with whatever new data you have.

  6. This depends on the library you use. In svmlight there is a program called svm_classify that takes a model and an example and gives you a class label (good or bad). I'm sure it will be straightforward to do in any library.
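
To make points 2, 4 and 6 concrete, here is a minimal sketch (Python with scikit-learn as one possible library; the URLs, labels and parameter grid are invented for illustration) that tokenises URLs on the separators above, searches over linear/RBF kernel settings, and then classifies a previously unseen URL:

    import re

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Hypothetical training data: URLs your existing system has already labelled.
    train_urls = [
        "http://somesite.com?passes=dodgy+parameter",
        "http://badsite.com?passes=another+dodgy+parameter",
        "http://evil.example?q=really+dodgy",
        "http://spam.example?x=dodgy+stuff",
        "http://anothersite.com?passes=a+good+parameter",
        "http://nicesite.com?passes=a+fine+parameter",
        "http://ok.example?q=perfectly+good",
        "http://fine.example?x=good+stuff",
    ]
    train_labels = ["bad", "bad", "bad", "bad", "good", "good", "good", "good"]

    # Point 4: don't make the learner discover URL structure -- tokenise on
    # the separators '/', '?', '.', '=' (plus '&' and '+') ourselves.
    vectoriser = CountVectorizer(
        tokenizer=lambda url: re.split(r"[/?.=&+]+", url),
        token_pattern=None,
    )
    X = vectoriser.fit_transform(train_urls)

    # Point 4 continued: try linear and RBF kernels with a small parameter search.
    grid = GridSearchCV(
        SVC(),
        param_grid={"kernel": ["linear", "rbf"],
                    "C": [0.1, 1, 10],
                    "gamma": ["scale", 0.1]},
        cv=2,                                 # use more folds with real data
    )
    grid.fit(X, train_labels)
    print("best settings:", grid.best_params_)

    # Point 6: classify a previously unseen URL with the trained model.
    new_urls = ["http://some-new-site.com?passes=a+really+dodgy+parameter"]
    print(grid.predict(vectoriser.transform(new_urls)))

    # Point 5: 'keeping it up to date' just means re-running fit() periodically
    # on the old data plus whatever newly labelled URLs you have collected.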

StompChicken
+3  A: 

I think that steve and StompChicken both make excellent points:

  • Picking the best algorithm is tricky, even for machine learning experts. Using a general-purpose package like Weka will let you easily compare a bunch of different approaches to determine which works best for your data.
  • Choosing good features is often one of the most important factors in how well a learning algorithm will work.

It could also be useful to examine how other people have approached similar problems:

  • Qi, X. and Davison, B. D. 2009. Web page classification: Features and algorithms. ACM Computing Surveys 41, 2 (Feb. 2009), 1-31.
  • Kan, M.-Y. and Nguyen Thi, H. O. 2005. Fast webpage classification using URL features. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM ’05), New York, NY, pp. 325-326.
  • Devi, M. I., Rajaram, R., and Selvakuberan, K. 2007. Machine learning techniques for automated web page classification using URL features. In Proceedings of the International Conference on Computational Intelligence and Multimedia Applications (ICCIMA 2007), Volume 2 (December 13-15, 2007), Washington, DC, pp. 116-120.
Nate Kohl
+1 For links to relevant articles that don't require a subscription to view.