classification

(human) Language of a document

Is there a way (a program, a library) to approximately know which language a document is written in? I have a bunch of text documents (~500K) in mixed languages to import into an i18n-enabled CMS (Drupal). I don't need perfect matches, only a rough guess. ...

Image Classification Algorithms Using Java

My goal is to implement different image classification methods to show how they function and the advantages and disadvantages of each. The ones I want to try to implement in Java include a minimum distance classifier and a k-nearest neighbour classifier. I was wondering what can be used to accomplish my task that already ex...
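As a rough illustration of the two methods named in this excerpt, here is a minimal sketch. It assumes each image has already been reduced to a numeric feature vector (feature extraction is not shown), that class labels are integers from 0 to numClasses - 1, and that distance is Euclidean. The class and method names are hypothetical and do not come from any library.

```java
import java.util.Arrays;
import java.util.Comparator;

class SimpleClassifiers {
    // Squared Euclidean distance between two feature vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Minimum distance classifier: pick the class whose mean vector is closest to the query.
    static int minimumDistance(double[][] train, int[] labels, int numClasses, double[] query) {
        double[][] means = new double[numClasses][query.length];
        int[] counts = new int[numClasses];
        for (int i = 0; i < train.length; i++) {
            counts[labels[i]]++;
            for (int d = 0; d < query.length; d++) {
                means[labels[i]][d] += train[i][d];
            }
        }
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < numClasses; c++) {
            if (counts[c] == 0) continue; // skip classes with no training samples
            for (int d = 0; d < query.length; d++) {
                means[c][d] /= counts[c];
            }
            double dist = squaredDistance(means[c], query);
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    // k-nearest neighbour: majority vote among the k closest training samples.
    // Ties between classes go to the lowest label (an arbitrary choice for this sketch).
    static int kNearest(double[][] train, int[] labels, int numClasses, double[] query, int k) {
        Integer[] order = new Integer[train.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> squaredDistance(train[i], query)));
        int[] votes = new int[numClasses];
        for (int i = 0; i < Math.min(k, order.length); i++) {
            votes[labels[order[i]]]++;
        }
        int best = 0;
        for (int c = 1; c < numClasses; c++) {
            if (votes[c] > votes[best]) best = c;
        }
        return best;
    }
}
```

The trade-off this makes visible: the minimum distance classifier only needs one mean per class at prediction time, while k-nearest neighbour keeps and scans every training sample.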

Best Java library for automatic language identification?

Which is the best Java library for automatic language identification/classification? Hypothetical syntax:

    String languageCode = LanguageIdentificationAPI.identifyLanguage("Hello world.");
    // languageCode would now contain "en" for English.

Thanks a lot in advance! ...

Measuring the performance of classification algorithm

I've got a classification problem on my hands, which I'd like to address with a machine learning algorithm (Bayes, or Markovian probably; the question is independent of the classifier to be used). Given a number of training instances, I'm looking for a way to measure the performance of an implemented classifier, taking data overf...

What's the best open-source Java Bayesian spam filter library?

In other answers on Stack Overflow it's been suggested that Weka is good, but there are others (Classifier4j, jBNC, Naiban). Does anyone have actual experience with these? ...

Best approach to what I think is a machine learning problem

Hello. I would like some expert guidance on the best approach to solving a problem. I have investigated some machine learning, neural networks, and the like. I've investigated Weka, some sort of Bayesian solution, R, and several different things. I'm not sure how to really proceed, though. Here's my problem. ...

NLP classify sentences/paragraph as funny

Is there a way to classify a particular sentence/paragraph as funny? There are very few pointers as to where one should go further on this. ...

Binarization in Natural Language Processing

Binarization is the act of transforming the colorful features of an entity into vectors of numbers, most often binary vectors, to make good examples for classifier algorithms. If we were to binarize the sentence "The cat ate the dog", we could start by assigning every word an ID (for example cat-1, ate-2, the-3, dog-4) and then simply r...
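The excerpt is cut off before it says what is done with the word IDs, so the following is only one common reading of the idea, sketched under assumptions: each distinct word gets an ID, and a sentence becomes a binary vector with a 1 at the ID of every word it contains. The `Binarizer` class and its methods are hypothetical names. IDs here are 0-based and assigned in order of first appearance, so they differ from the numbering in the example above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class Binarizer {
    // Splits a sentence into lowercase words, dropping punctuation and empty tokens.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String w : sentence.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!w.isEmpty()) tokens.add(w);
        }
        return tokens;
    }

    // Assigns each distinct word an ID (0, 1, 2, ...) in order of first appearance.
    static Map<String, Integer> buildVocabulary(List<String> sentences) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (String sentence : sentences) {
            for (String w : tokenize(sentence)) {
                vocab.putIfAbsent(w, vocab.size());
            }
        }
        return vocab;
    }

    // Builds a binary vector: position i is 1 if the word with ID i occurs in the sentence.
    // Words missing from the vocabulary are ignored (an assumption of this sketch).
    static int[] binarize(String sentence, Map<String, Integer> vocab) {
        int[] vector = new int[vocab.size()];
        for (String w : tokenize(sentence)) {
            Integer id = vocab.get(w);
            if (id != null) vector[id] = 1;
        }
        return vector;
    }
}
```

With a vocabulary built from "The cat ate the dog", the words map to the=0, cat=1, ate=2, dog=3, and the sentence "The dog" becomes the vector [1, 0, 0, 1]. Word order and repetition are lost in this representation.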

Free Topical Taxonomy (Classification System) for Commercial Use

I am looking for a taxonomy that is totally free. In my research, Dewey has legal problems. Library of Congress Classification is copyrighted except in the USA. DMOZ requires updates from users. Please correct me if I am wrong. So, is there any totally free taxonomy for commercial use? What I am looking for is something like a Googl...

Algorithm to classify a list of products? Take 2.

Hello all, I asked a question similar to this one a couple of weeks ago, but I did not ask it correctly. So I am re-asking the question here with more details, and I would like to get a more AI-oriented answer. I have a list representing products which are more or less the same. For instance, in the list below, they are all S...

TDD and the Bayesian Spam Filter problem

It's well known that Bayesian classifiers are an effective way to filter spam. These can be fairly concise (our one is only a few hundred LoC) but all core code needs to be written up-front before you get any results at all. However, the TDD approach mandates that only the minimum amount of code to pass a test can be written, so given t...

Neural networks for email spam detection

Let's say you have access to an email account with the history of received emails from the last years (~10k emails), classified into 2 groups: genuine email and spam. How would you approach the task of creating a neural network solution that could be used for spam detection - basically classifying any email either as spam or not spam? Let'...

Looking for approaches to categorize objects based on their properties.

I have a set of ~10K objects, each with approximately 150 distinct properties, about a quarter of which are multivalued and/or related to other properties. I have a set of about 120 categories that I would like to sort these objects into, with each category being defined as a 'template' object. If an instance matches the template exact...

Kernel methods for large scale dataset

Kernel-based classifiers usually require O(n^3) training time because of the inner-product computation between pairs of instances. To speed up the training, inner-product values can be pre-computed and stored in a two-dimensional array. However, when the number of instances is very large, say over 100,000, there will not be sufficient memory to d...
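To make the pre-computation concrete, here is a minimal sketch that fills the two-dimensional array of inner products. It assumes a plain linear kernel (the dot product) and dense `double` feature vectors; `GramMatrix` and its methods are hypothetical names, and `bytesNeeded` only counts the raw array payload, ignoring JVM object overhead.

```java
class GramMatrix {
    // Linear kernel: the plain inner product of two feature vectors.
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            sum += a[d] * b[d];
        }
        return sum;
    }

    // Pre-computes all pairwise inner products. The matrix is symmetric,
    // so each pair is computed once and written to both positions.
    static double[][] precompute(double[][] instances) {
        int n = instances.length;
        double[][] k = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double value = dot(instances[i], instances[j]);
                k[i][j] = value;
                k[j][i] = value;
            }
        }
        return k;
    }

    // Raw storage needed for a full n-by-n matrix of doubles (8 bytes each).
    static long bytesNeeded(long n) {
        return n * n * Double.BYTES;
    }
}
```

For 100,000 instances, `bytesNeeded(100_000)` comes to 80,000,000,000 bytes, roughly 80 GB, which is the memory limit the question runs into.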

Learning decision trees on huge datasets

I'm trying to build a binary classification decision tree out of huge (i.e. which cannot be stored in memory) datasets using MATLAB. Essentially, what I'm doing is: collect all the data; try out n decision functions on the data; pick out the best decision function to separate the classes within the data; split the original dataset into 2...

Detecting on-topic text?

I'd like to sift text (in particular, Twitter messages) to see if they relate to a particular topic. Have you been down that road? If so, I'd love to hear what approach you'd use. For my case, just searching for topic keywords gets me useful text about 7% of the time; the keywords have multiple meanings, some of which aren't on topic. F...

How to filter/sort/rank object model nodes?

I have some kind of object model and I need to filter and sort its nodes by some kind of property. What kinds of automated systems exist to generate and select properties of the object model that correlate to what I want? (I'm intentionally being abstract and non-specific) I'm thinking of a system that works kind of like spam filters ...

Out of the box spam filtering?

I work on a social media monitoring system. We don't crawl the web ourselves, we get feeds from aggregators like Spinn3r. In most cases, the "blogs" that are nothing but pages of links to porn sites are filtered, but we'd like something in-house that we can train on a quicker time frame than waiting for upstream providers to make changes...

Searching for Database of Entity Names (colleges, cities, personalities, countries...)

For an enterprise application research project that another person and I are working on, we are looking to remove certain content from the page to keep the posted messages universal (meaning not offensive and essentially anonymous). Right now we want to take a message that a user has posted to a message board, and remove any type of name, nam...

Text Classification in Java

Hi, I need some sort of solution in Java for the following requirements: Search a text for certain terms (each term can be 1-3 words), for example {"hello world", "hello"}. The match needs to be exact. There are about 500 term groups, each containing about 30 terms. Each text might contain up to 4000 words. Performance is ...
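One possible way to approach these requirements, sketched under assumptions: store every term in a hash map keyed by its words joined with single spaces, then slide over the text and look up every run of 1 to 3 consecutive words. `TermMatcher` is a hypothetical class, matching is case-sensitive, and punctuation between words is ignored; the excerpt does not say how either should be handled, and its performance requirement is cut off, so this may not meet it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TermMatcher {
    private static final int MAX_TERM_WORDS = 3; // terms are 1-3 words long
    private final Map<String, Set<String>> groupsByTerm = new HashMap<>();

    // Splits on anything that is not a letter or digit; case is preserved.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) out.add(w);
        }
        return out;
    }

    // Registers a term under a group name; a term may belong to several groups.
    void addTerm(String group, String term) {
        String key = String.join(" ", words(term));
        groupsByTerm.computeIfAbsent(key, k -> new TreeSet<>()).add(group);
    }

    // Returns the names of all groups that have at least one term in the text.
    Set<String> matchGroups(String text) {
        List<String> w = words(text);
        Set<String> found = new TreeSet<>();
        for (int start = 0; start < w.size(); start++) {
            StringBuilder candidate = new StringBuilder();
            for (int len = 1; len <= MAX_TERM_WORDS && start + len <= w.size(); len++) {
                if (len > 1) candidate.append(' ');
                candidate.append(w.get(start + len - 1));
                Set<String> groups = groupsByTerm.get(candidate.toString());
                if (groups != null) found.addAll(groups);
            }
        }
        return found;
    }
}
```

Because whole words are compared, "hello" does not match inside "hellos". Each word position triggers at most three map lookups, so a 4000-word text costs at most about 12,000 lookups regardless of how many terms are registered.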