To take one example: StackOverflow users have already associated tags with a large number of questions.

Is there a .NET machine learning library that could use this historical data to 'learn' how to associate tags with newly created questions and suggest them to the user?

+1  A: 

This looks similar to spam filtering, but with more buckets.

A widely used technique for spam filtering is the Bayesian filter. A Google search will give you a lot of options, including the first hit on CodeProject.
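To make the "spam filter with more buckets" idea concrete, here is a minimal sketch of a naive Bayes tagger in plain Python. It is not taken from the CodeProject article or from any .NET library; the class name `NaiveBayesTagger`, the tokenizer, and the add-one smoothing are assumptions chosen for illustration. Each tag is treated as one bucket, and a question with several tags counts as a training example for each of them.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word-like tokens (keeps 'c#', 'c++')."""
    return re.findall(r"[a-z0-9#+]+", text.lower())


class NaiveBayesTagger:
    """Hypothetical multinomial naive Bayes with one bucket per tag."""

    def __init__(self):
        self.tag_counts = Counter()              # tag -> number of questions seen
        self.word_counts = defaultdict(Counter)  # tag -> word -> occurrences
        self.vocabulary = set()

    def train(self, text, tags):
        """Record the words of one question under every tag it carries."""
        words = tokenize(text)
        self.vocabulary.update(words)
        for tag in tags:
            self.tag_counts[tag] += 1
            self.word_counts[tag].update(words)

    def suggest(self, text, top_n=3):
        """Rank tags by log P(tag) + sum of log P(word | tag), add-one smoothed."""
        words = [w for w in tokenize(text) if w in self.vocabulary]
        total = sum(self.tag_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for tag, count in self.tag_counts.items():
            tag_total = sum(self.word_counts[tag].values())
            score = math.log(count / total)
            for word in words:
                score += math.log(
                    (self.word_counts[tag][word] + 1) / (tag_total + vocab_size)
                )
            scores[tag] = score
        return sorted(scores, key=scores.get, reverse=True)[:top_n]


if __name__ == "__main__":
    tagger = NaiveBayesTagger()
    tagger.train("How do I parse JSON in C#?", ["c#", "json"])
    tagger.train("LINQ query over a list in C#", ["c#", "linq"])
    tagger.train("Python list comprehension syntax", ["python", "list"])
    tagger.train("Read a JSON file in Python", ["python", "json"])
    print(tagger.suggest("C# JSON serialization", top_n=2))
```

With real data you would train on many thousands of questions and probably drop very common words first; the toy questions above are made up.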

Albin Sunnanbo
+1 for interesting article in links.
Paul Hadfield
@Paul: When a question has no answer you shouldn't +1 for 'interesting article in links', as it removes the question from the unanswered questions list. I haven't checked yet whether it answers the question.
Ciwee
@Ciwee, I don't think that giving a +1 to a comment would remove the question from the unanswered questions list. I think you're confusing that with accepting an answer.
Neowizard
@Neowizard: The unanswered button at the top shows questions that haven't been answered AND have no upvote, as you can see in the description in the top right-hand corner of the page.
Ciwee
@Ciwee: In my opinion @Albin gave an interesting insight into how what you propose (which I think is a good idea) could be implemented. If you would like to check the SO FAQ it clearly says that the community owns your question/answers and if you don't like that then SO is not the place for you - this extends to allowing people to vote up/down anything they please.
Paul Hadfield
@Ciwee, what you're saying can't be right, because I browse questions from the unanswered section all the time, and it contains many questions with up-voted, not-accepted answers. It might show questions with accepted answers and no up-votes, but that is not the case here, because no answer has been accepted.
Neowizard
A: 

Machine learning is a very complex field, and if you really want to create such an application you'll need to do some research no matter which library you use.

In any case, I'd suggest using an SVM (support vector machine). I've used it in Python for this exact purpose, and it's incredible. You'll need to find a C# implementation, though. The idea is to map features of the text (like "words that end with .Net") to dimensions, then use those features to create regions in the resulting space for tagging (anything in sub-space X will be tagged as Y).
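As a rough illustration of "features become dimensions, regions become tags", the sketch below maps each distinct word to its own dimension and trains one linear SVM per tag (one-vs-rest) by subgradient descent on the hinge loss. This is a hand-rolled toy, not LibSVM and not necessarily what the answer's author used; every function name, the binary bag-of-words features, and the hyperparameters (`epochs`, `lam`, `lr`) are assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9#+]+", text.lower())


def build_vocabulary(texts):
    """Give every distinct token its own dimension (vector index)."""
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def to_vector(text, vocab):
    """Binary bag-of-words vector; tokens outside the vocabulary are ignored."""
    vec = [0.0] * len(vocab)
    for token in tokenize(text):
        if token in vocab:
            vec[vocab[token]] = 1.0
    return vec


def decision(w, b, x):
    """Signed score: positive means x lies on the tag's side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(vectors, labels, epochs=200, lam=0.01, lr=0.1):
    """Subgradient descent on hinge loss plus an L2 penalty; labels are +1 or -1."""
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            violated = y * decision(w, b, x) < 1
            for i in range(len(w)):
                grad = lam * w[i] - (y * x[i] if violated else 0.0)
                w[i] -= lr * grad
            if violated:
                b += lr * y
    return w, b


def train_taggers(texts, tag_sets):
    """Train one binary classifier per tag: 'has this tag' versus 'does not'."""
    vocab = build_vocabulary(texts)
    vectors = [to_vector(t, vocab) for t in texts]
    models = {}
    for tag in sorted({tag for tags in tag_sets for tag in tags}):
        labels = [1 if tag in tags else -1 for tags in tag_sets]
        models[tag] = train_linear_svm(vectors, labels)
    return vocab, models


def suggest_tags(text, vocab, models):
    """Suggest every tag whose region contains the new text's vector."""
    x = to_vector(text, vocab)
    return sorted(tag for tag, (w, b) in models.items() if decision(w, b, x) > 0)
```

A call such as `vocab, models = train_taggers(texts, tag_sets)` followed by `suggest_tags(new_text, vocab, models)` is the intended usage. For anything beyond a toy you would rely on a tested library, and probably on richer features than single words.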

This is a really complex subject, and my explanation can only make it less clear, so I'll leave it up to you to read up on it and use it if you want.

Here's something to get you started from Wikipedia: Support vector machine (SVM)

Edit: It seems that LibSVM (the library I worked with in Python) is also available for C# from its home page. Good luck.

Neowizard
A: 

Since this is a many-to-many problem (multiple documents get multiple tags), the best approach may be to use a search engine library such as Lucene.NET. You would build an index of existing, tagged documents, then submit each new document's text as a query and check which similar documents come up. Count the tags of, say, the top ten retrieved documents and suggest the five (or however many you want) most popular tags among those.

The great thing about this is that the learning becomes incremental if you add each new document, with its tags (corrected by the user) to the index. I've built a system for email (multi-)classification that works like this and it performed pretty well. I wouldn't be surprised to learn that SO actually does this.
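The retrieve-then-vote scheme can be sketched without a real search engine. The class below is a tiny in-memory stand-in, not Lucene.NET and not its API: the name `TagIndex`, the TF-IDF weighting, and the cosine ranking are all assumptions for illustration. Calling `add` again after the user confirms the tags is the incremental learning step.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9#+]+", text.lower())


class TagIndex:
    """Hypothetical stand-in for a search index over tagged documents."""

    def __init__(self):
        self.docs = []             # list of (term counts, tags)
        self.doc_freq = Counter()  # term -> number of documents containing it

    def add(self, text, tags):
        """Index one tagged document; adding more later updates the 'model'."""
        counts = Counter(tokenize(text))
        self.docs.append((counts, list(tags)))
        self.doc_freq.update(counts.keys())

    def _weights(self, counts):
        """TF-IDF weights for the terms that occur somewhere in the index."""
        n = len(self.docs)
        return {
            term: tf * math.log(1 + n / self.doc_freq[term])
            for term, tf in counts.items()
            if self.doc_freq[term]
        }

    def search(self, text, top_k=10):
        """Return the tag lists of the top_k documents most similar to text."""
        query = self._weights(Counter(tokenize(text)))
        q_norm = math.sqrt(sum(v * v for v in query.values()))
        scored = []
        for counts, tags in self.docs:
            doc = self._weights(counts)
            dot = sum(w * doc.get(term, 0.0) for term, w in query.items())
            if dot > 0:
                d_norm = math.sqrt(sum(v * v for v in doc.values()))
                scored.append((dot / (q_norm * d_norm), tags))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [tags for _, tags in scored[:top_k]]

    def suggest(self, text, top_k=10, max_tags=5):
        """Count tags over the retrieved documents and return the most popular."""
        votes = Counter()
        for tags in self.search(text, top_k):
            votes.update(tags)
        return [tag for tag, _ in votes.most_common(max_tags)]
```

The defaults `top_k=10` and `max_tags=5` mirror the "top ten documents, five tags" figures mentioned above; a real index would also avoid rescanning every document on each query.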

(Note: search engines are not commonly included in the category of machine learning programs, but Baeza-Yates and Ribeiro-Neto have noted that retrieval is in fact a kind of clustering, as a cluster of documents is built around a query.)

larsmans