bayesian

Bayesian filtering for spam

Is there a good, clean object-oriented implementation of Bayesian filtering for spam and text classification? For learning purposes. ...

Is there an R package for learning a Dirichlet prior from count data?

I'm looking for an R package which can be used to train a Dirichlet prior from count data. I'm asking for a colleague who's using R, and don't use it myself, so I'm not too sure how to look for packages. It's a bit hard to search for, because "R" is such a nonspecific search string. There doesn't seem to be anything on CRAN, but ar...

Naive Bayesian spam filtering effectiveness

How effective is naive Bayesian filtering against spam? I heard that spammers easily bypass such filters by stuffing messages with extra non-spam-related words. What programming techniques can you use with Bayesian filters to prevent that? ...

Measuring the performance of a classification algorithm

I have a classification problem on my hands, which I'd like to address with a machine learning algorithm (Bayes, or Markovian probably; the question is independent of the classifier to be used). Given a number of training instances, I'm looking for a way to measure the performance of an implemented classifier, taking data overf...
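
One common way to measure a classifier while guarding against overfitting is k-fold cross-validation: train on k-1 folds, score on the held-out fold, and average. The following is a minimal sketch, not a definitive implementation. The names `k_fold_indices`, `cross_val_accuracy`, and `majority_trainer` are hypothetical, and it assumes the classifier can be wrapped as a `train_fn(xs, ys)` that returns a `predict(x)` callable, with `k` no larger than the number of instances.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Split range(n) into k shuffled folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(train_fn, xs, ys, k=5, seed=0):
    """Average held-out accuracy over k folds.

    train_fn(xs, ys) is assumed to return a predict(x) callable.
    """
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        # Train only on instances outside the held-out fold.
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        predict = train_fn(train_x, train_y)
        # Score only on instances the model has not seen.
        correct = sum(predict(xs[i]) == ys[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


def majority_trainer(xs, ys):
    """Hypothetical baseline: always predict the most common training label."""
    most_common = max(set(ys), key=ys.count)
    return lambda x: most_common
```

A large gap between accuracy on the training data and the cross-validated accuracy is a typical sign of overfitting; comparing against a trivial baseline such as `majority_trainer` shows whether the classifier learned anything at all.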

probability interview question, random sampling

This is a good one because it's so counter-intuitive: Imagine an urn filled with balls, two-thirds of which are of one color and one-third of which are of another. One individual has drawn 5 balls from the urn and found that 4 are red and 1 is white. Another individual has drawn 20 balls and found that 12 are red and 8 are white. Whi...
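
The excerpt is cut off, so the following assumes the usual form of this puzzle: which individual should be more confident that the urn is two-thirds red rather than two-thirds white? A minimal sketch of the Bayesian calculation, assuming equal prior odds for the two hypotheses and draws that are effectively independent (a large urn or sampling with replacement); `posterior_majority_red` is a hypothetical name.

```python
from fractions import Fraction


def posterior_majority_red(red, white, prior=Fraction(1, 2)):
    """Posterior probability that the urn is 2/3 red, given the draws.

    Assumes independent draws and two hypotheses only:
    the urn is 2/3 red, or the urn is 2/3 white.
    The binomial coefficient is the same under both and cancels out.
    """
    p_hi, p_lo = Fraction(2, 3), Fraction(1, 3)
    like_red = p_hi ** red * p_lo ** white    # likelihood if urn is 2/3 red
    like_white = p_lo ** red * p_hi ** white  # likelihood if urn is 2/3 white
    evidence = prior * like_red + (1 - prior) * like_white
    return prior * like_red / evidence


small_sample = posterior_majority_red(4, 1)    # Fraction(8, 9)
large_sample = posterior_majority_red(12, 8)   # Fraction(16, 17)
```

Under these assumptions the likelihood ratio is 2 raised to the power (red minus white), so only the difference between the counts matters: 2^3 = 8 for the first sample and 2^4 = 16 for the second.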

Way to infer the size of the userbase of a site from sampling taken usernames

I just had a clever idea (I think). Suppose you wanted to estimate the size of the userbase of a site which does not publicize this information. Different usernames are likely to have been taken with different probabilities. For instance, if the username 'nick' doesn't exist on the system, it's likely to have an extremely small...

What's the best open-source Java Bayesian spam filter library?

In other answers on Stack Overflow it's been suggested that Weka is good, but there are others (Classifier4j, jBNC, Naiban). Does anyone have actual experience with these? ...

Is there a Bayesian filter library for .NET?

Is there a Bayesian filter library for .NET? I would like to set up a group of folders and have emails automatically moved to those folders based on what has previously been moved to each folder. If you are familiar with FogBugz auto-sort, that's exactly what I would like to do. ...

Analyzing, categorizing and indexing metadata

I have a large (~2.5M records) database of image metadata. Each record represents an image and has a unique ID, a description field, a comma-separated list of keywords (say 20-30 keywords per image), and some other fields. There's no real database schema, and I have no way of knowing which keywords exist in the database without iterati...

Bayesian spam filtering library for Python

I am looking for a Python library which does Bayesian Spam Filtering. I looked at SpamBayes and OpenBayes, but both seem to be unmaintained (I might be wrong). Can anyone suggest a good Python (or Clojure, Common Lisp, even Ruby) library which implements Bayesian Spam Filtering? Thanks in advance. Clarification: I am actually looking ...

Python Bayesian text classification modules

A quick Google search reveals that there are a good number of Bayesian classifiers implemented as Python modules. If I want wrapped, high-level functionality similar to dbacl, which of those modules is right for me?

Training
% dbacl -l one sample1.txt
% dbacl -l two sample2.txt

Classification
% dbacl -c one -c two sample3.txt -v one...

Calculating the probability of a token being spam in a Bayesian spam filter

I recently wrote a Bayesian spam filter. I used Paul Graham's article "A Plan for Spam" and a C# implementation of it that I found on CodeProject as references to create my own filter. I just noticed that the implementation on CodeProject uses the total number of unique tokens in calculating the probability of a token being spam (e.g. if the...
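
For reference, the per-token estimate in Graham's essay normalizes each token count by the number of messages in the corresponding corpus, doubles the non-spam count to bias against false positives, skips tokens seen fewer than five times, and clamps the result to the range 0.01 to 0.99. A minimal sketch under that reading; the function and parameter names are hypothetical, and both message counts are assumed to be nonzero.

```python
def token_spam_probability(good_count, bad_count, n_good_msgs, n_bad_msgs):
    """Estimate P(spam | token) in the style of "A Plan for Spam".

    good_count / bad_count: occurrences of the token in the non-spam / spam corpus.
    n_good_msgs / n_bad_msgs: number of MESSAGES (not tokens) in each corpus,
    both assumed to be greater than zero.
    Returns None when the token is too rare to estimate.
    """
    g = 2 * good_count  # non-spam occurrences are counted double
    b = bad_count
    if g + b < 5:
        return None  # too little evidence for this token
    good_freq = min(1.0, g / n_good_msgs)
    bad_freq = min(1.0, b / n_bad_msgs)
    p = bad_freq / (good_freq + bad_freq)
    # Clamp so that no single token is treated as conclusive.
    return max(0.01, min(0.99, p))
```

Swapping the message counts for token counts changes the scale of both ratios, so the resulting probabilities differ whenever the two corpora have different average message lengths.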

Naive Bayes calculation in SQL

I want to use naive Bayes to classify documents into a relatively large number of classes. I'm looking to confirm whether a mention of an entity name in an article really refers to that entity, on the basis of whether that article is similar to articles where that entity has been correctly verified. Say, we find the text "General Motors" in a...

What are the best resources for learning how to implement Naive Bayes Classifiers in SSAS?

After asking this question, I've decided to try and implement some Naive Bayes Classifiers using SQL Server Analysis Services. Can anyone point me to a decent book, website or any other resource on how to implement Naive Bayes Classifiers in SSAS? Similarly, I would be interested in learning about Decision Trees. ...

Implementing Bayesian classifier in Ruby?

I would like to implement a simple Bayesian classification system to do rudimentary sentiment analysis on short messages. Practical suggestions for implementing in Ruby would be welcome. Suggestions for other approaches besides Bayes would also be welcome. ...

Simple Sentiment Analysis

It appears that the simplest, naivest way to do basic sentiment analysis is with a Bayesian classifier (confirmed by what I'm finding here on SO). Any counter-arguments or other suggestions? ...

How does Stack Overflow's homepage filtering work?

How does Stack Overflow's homepage filtering work? I believe the questions that appear on the homepage are specifically related to your interests, which are indicated by the tags that you look at, question and answer. Does anyone know the name of the algorithm/technique or have some basic details (nothing that violates their IP) about h...

How to filter/sort/rank object model nodes?

I have some kind of object model and I need to filter and sort its nodes by some kind of property. What kinds of automated systems exist to generate and select properties of the object model that correlate to what I want? (I'm intentionally being abstract and non-specific) I'm thinking of a system that works kind of like spam filters ...

R: MCMClogit confusion

Could anybody explain to me why

simulatedCase <- rbinom(100,1,0.5)
simDf <- data.frame(CASE = simulatedCase)
posterior_m0 <<- MCMClogit(CASE ~ 1, data = simDf, b0 = 0, B0 = 1)

always results in an MCMC acceptance ratio of 0? Any explanation would be greatly appreciated! ...

Algorithms to find stuff a user would like based on other users' likes

I'm thinking of writing an app to classify movies in an HTPC based on what the family members like. I don't know statistics or AI, but the stuff here looks very juicy. I wouldn't know where to start. Here's what I want to accomplish: Compose a set of samples from each user's likes, rating each sample attribute separately. For examp...