Has anyone used a Bayesian filter to let forum members classify posts and so over time only display interesting posts? A Bayesian filter seems to work well for detecting email spam. Is this a viable approach to filter forum posts for users?
The difficulty with classifying interesting or good forum posts via a Bayesian classifier, or any other automated classification system, is that the words and word structure of a post probably correlate only weakly with its value or usefulness.
Spam filters work primarily because the word choices and structure of spam are systematically unusual: the spammer is trying to promote a specific product, service, etc. That produces reasonable correlations and patterns that can be learned, though spammers can make this harder with various evasion techniques.
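To make the word-correlation point concrete, here is a minimal sketch of a naive Bayes text classifier with add-one smoothing. It is not any particular spam filter's implementation; the function names (`train`, `log_posterior`, `classify`) and the whitespace tokenization are assumptions made for illustration.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (text, label) pairs. Returns per-label word counts and document counts."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts

def log_posterior(text, label, word_counts, doc_counts):
    """Unnormalized log P(label | text) under the naive independence assumption."""
    vocab = set().union(*word_counts.values())
    total = sum(word_counts[label].values())
    # Log prior: fraction of training documents carrying this label.
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for word in text.lower().split():
        # Add-one (Laplace) smoothing so unseen words do not zero out the product.
        score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
    return score

def classify(text, word_counts, doc_counts):
    """Pick the label with the highest log posterior."""
    return max(doc_counts, key=lambda lab: log_posterior(text, lab, word_counts, doc_counts))
```

The classifier only works when some words are much more common under one label than the other, which is the property spam has and which "good" versus "bad" forum posts may well lack.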
Such word/structure patterns are unlikely to exist for good vs. bad forum posts. However, there is an alternative way to restructure the problem that might be useful:
- Allow users to classify posts as good or bad or otherwise rank them as you described.
- Use Bayesian classifiers or some other statistical inference method to identify the forum users whose rankings correlate most strongly with those of the overall community, i.e., the users who have the best taste and are good predictors of how the community as a whole would view the content.
- Use the rankings from the pool of good-predictor users identified in the previous step to filter forum posts. This requires that one or more such users actually rank the new content at some point, so the pool needs to be reasonably large and include regular users for such a filtering system to be useful.
- Rebuild the classifier periodically, since the community of users is presumably dynamic, has changing interests, etc.
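The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses plain Pearson correlation against the community's mean rating as the "statistical inference method", and the names `good_predictors`, `filter_posts`, `min_ratings`, `top_n`, and `threshold` are all hypothetical. Ratings are assumed to be stored as `{user: {post: score}}`.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series has no variance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def community_scores(ratings):
    """Mean rating per post across all users who rated it."""
    by_post = {}
    for user_ratings in ratings.values():
        for post, score in user_ratings.items():
            by_post.setdefault(post, []).append(score)
    return {post: mean(scores) for post, scores in by_post.items()}

def good_predictors(ratings, min_ratings=3, top_n=2):
    """Return the top_n users whose ratings track the community mean most closely.

    Simplification: the community mean includes the user's own rating,
    which slightly inflates every user's correlation.
    """
    consensus = community_scores(ratings)
    scored = []
    for user, user_ratings in ratings.items():
        if len(user_ratings) < min_ratings:
            continue  # too little data to judge this user
        posts = sorted(user_ratings)
        r = pearson([user_ratings[p] for p in posts], [consensus[p] for p in posts])
        scored.append((r, user))
    scored.sort(reverse=True)
    return [user for _, user in scored[:top_n]]

def filter_posts(new_ratings, predictors, threshold=0.5):
    """Keep posts whose mean rating among the predictor pool is at least threshold."""
    by_post = {}
    for user in predictors:
        for post, score in new_ratings.get(user, {}).items():
            by_post.setdefault(post, []).append(score)
    return sorted(p for p, scores in by_post.items() if mean(scores) >= threshold)
```

Posts that no predictor has rated never appear in the output, which is why the pool has to be large and active enough. Rerunning `good_predictors` on recent ratings is one way to handle the periodic rebuild.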
How well the approach I've proposed would actually work on your problem depends a lot on the nature of the forum, how willing users are to rank content, and how much they agree on the value of posted content. The overall size of the user community could also be a factor: if it's too small, there might not be enough data to work with; if it's too large, you could run into computational scaling problems when running the inference method against the ranking data.