Is there a way (a program, a library) to approximately know which language a document is written in?
I have a bunch of text documents (~500K) in mixed languages to import into an i18n-enabled CMS (Drupal).
I don't need perfect matches, only some guess.
...
My goal is to implement different image classification methods to show how they function and what the advantages and disadvantages of each method are. The ones I want to try to implement in Java include:
Minimum distance classifier
k-nearest neighbour classifier
I was wondering what can be used to accomplish my task that already ex...
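For reference, here is a rough sketch of the two methods named above, working on plain feature vectors. It is not taken from the question: the class name `SimpleClassifiers`, the method names, and the choice of Euclidean distance are assumptions made for illustration, and extracting the feature vectors from images is left out.

```java
import java.util.Arrays;

/** Hypothetical sketch: minimum-distance and k-NN classification on feature vectors. */
final class SimpleClassifiers {

    /** Euclidean distance between two equal-length feature vectors (assumed metric). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Minimum distance classifier: return the class whose mean vector is closest to the query. */
    static int minimumDistance(double[][] samples, int[] labels, int numClasses, double[] query) {
        int dim = query.length;
        double[][] means = new double[numClasses][dim];
        int[] counts = new int[numClasses];
        for (int i = 0; i < samples.length; i++) {
            counts[labels[i]]++;
            for (int j = 0; j < dim; j++) {
                means[labels[i]][j] += samples[i][j];
            }
        }
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < numClasses; c++) {
            if (counts[c] == 0) continue; // class without training samples
            for (int j = 0; j < dim; j++) {
                means[c][j] /= counts[c];
            }
            double d = distance(means[c], query);
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    /** k-NN: majority vote among the k training samples closest to the query. */
    static int kNearest(double[][] samples, int[] labels, int numClasses, double[] query, int k) {
        final double[] dist = new double[samples.length];
        Integer[] order = new Integer[samples.length];
        for (int i = 0; i < samples.length; i++) {
            dist[i] = distance(samples[i], query);
            order[i] = i;
        }
        // Sort sample indices by their distance to the query.
        Arrays.sort(order, (a, b) -> Double.compare(dist[a], dist[b]));
        int[] votes = new int[numClasses];
        for (int i = 0; i < Math.min(k, order.length); i++) {
            votes[labels[order[i]]]++;
        }
        int best = 0;
        for (int c = 1; c < numClasses; c++) {
            if (votes[c] > votes[best]) best = c; // ties go to the lower class index
        }
        return best;
    }
}
```

The trade-off shows up directly in the code: the minimum distance classifier only needs one mean vector per class at query time, while k-NN keeps every training sample and sorts all distances for each query.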
Which is the best Java library for automatic language identification/classification?
Hypothetical syntax:
String languageCode = LanguageIdentificationAPI.identifyLanguage("Hello world.");
// languageCode would now contain "en" for English.
Thanks a lot in advance!
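To make the hypothetical call above concrete, here is a deliberately naive sketch that guesses a language by counting common stopwords. It is not a real library: `NaiveLanguageGuesser`, its tiny word lists, and the `"und"` (undetermined) fallback are all assumptions for illustration, and a production identifier would typically use character n-gram statistics over far more languages.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: guess a language code by counting stopword hits. */
final class NaiveLanguageGuesser {

    // Tiny illustrative stopword lists; real lists would be much longer.
    private static final Map<String, Set<String>> STOPWORDS = new LinkedHashMap<>();
    static {
        STOPWORDS.put("en", new HashSet<>(Arrays.asList("the", "and", "is", "of", "to", "in")));
        STOPWORDS.put("fr", new HashSet<>(Arrays.asList("le", "la", "et", "les", "des", "est")));
        STOPWORDS.put("de", new HashSet<>(Arrays.asList("der", "die", "und", "das", "ist", "nicht")));
        STOPWORDS.put("es", new HashSet<>(Arrays.asList("el", "la", "y", "los", "es", "que")));
    }

    /** Returns the code with the most stopword hits, or "und" when no stopword matches. */
    static String guess(String text) {
        // Split on anything that is not a Unicode letter.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "und";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : STOPWORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) hits++;
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

Note that this sketch cannot classify very short inputs such as "Hello world.", since they contain none of the listed stopwords; it only becomes usable on sentence-length text.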
...
I've got a classification problem on my hands, which I'd like to address with a machine learning algorithm (probably Bayesian or Markovian; the question is independent of the classifier to be used). Given a number of training instances, I'm looking for a way to measure the performance of an implemented classifier, with taking data overf...
In other answers on Stack Overflow it's been suggested that Weka is good, but there are others (Classifier4j, jBNC, Naiban).
Does anyone have actual experience with these?
...
Hello,
I would like some expert guidance on the best approach to solving a problem. I have investigated some machine learning, neural networks, and things like that. I've looked into Weka, some sort of Bayesian solution, R, and several other things. I'm not sure how to proceed, though. Here's my problem.
...
Is there a way to classify a particular sentence/paragraph as funny? There are very few pointers as to where one should go further on this.
...
Binarization is the act of transforming the colorful features of an entity into vectors of numbers, most often binary vectors, to make good examples for classifier algorithms.
If we were to binarize the sentence "The cat ate the dog", we could start by assigning every word an ID (for example cat-1, ate-2, the-3, dog-4) and then simply r...
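Since the excerpt is cut off, the following is only a sketch of one common encoding that is compatible with the word IDs given above: a binary presence vector with one slot per known word. The class `Binarizer` and its method names are hypothetical, and the original post may go on to describe a different scheme.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: turn a sentence into a binary word-presence vector. */
final class Binarizer {

    /** Word IDs from the example in the text: cat-1, ate-2, the-3, dog-4. */
    static Map<String, Integer> exampleVocabulary() {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        vocab.put("cat", 1);
        vocab.put("ate", 2);
        vocab.put("the", 3);
        vocab.put("dog", 4);
        return vocab;
    }

    /** Slot (id - 1) is 1 if the word with that ID occurs in the sentence, else 0. */
    static int[] binarize(String sentence, Map<String, Integer> vocab) {
        int[] vector = new int[vocab.size()];
        for (String word : sentence.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            Integer id = vocab.get(word);
            if (id != null) {
                vector[id - 1] = 1; // unknown words are ignored
            }
        }
        return vector;
    }
}
```

With this vocabulary, "The cat ate the dog" maps to [1, 1, 1, 1] and "the dog" maps to [0, 0, 1, 1]. Word order and repetition are lost, which is the usual price of this encoding.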
I am looking for a taxonomy that is totally free. From my research, Dewey has legal problems, the Library of Congress Classification is copyrighted except in the USA, and DMOZ requires updates from users. Please correct me if I am wrong.
So, is there any totally free taxonomy for commercial use?
What I am looking for is something like a Googl...
Hello all,
I asked a question similar to this one a couple of weeks ago, but I did not ask it correctly. So I am re-asking the question here with more details, and I would like to get a more AI-oriented answer.
I have a list representing products which are more or less the same. For instance, in the list below, they are all S...
It's well known that Bayesian classifiers are an effective way to filter spam. These can be fairly concise (ours is only a few hundred LoC), but all the core code needs to be written up front before you get any results at all.
However, the TDD approach mandates that only the minimum amount of code to pass a test can be written, so given t...
Let's say you have access to an email account with the history of emails received over the last few years (~10k emails), classified into 2 groups:
genuine email
spam
How would you approach the task of creating a neural network solution that could be used for spam detection - basically classifying any email either as spam or not spam?
Let'...
I have a set of ~10K objects, each with approximately 150 distinct properties, about a quarter of which are multivalued and/or related to other properties.
I have a set of about 120 categories that I would like to sort these objects into, with each category being defined as a 'template' object. If an instance matches the template exact...
A kernel-based classifier usually requires O(n^3) training time because of the inner-product computation between two instances. To speed up the training, inner-product values can be pre-computed and stored in a two-dimensional array. However, when the number of instances is very large, say over 100,000, there will not be sufficient memory to d...
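To put numbers on the memory problem: a full 100,000 x 100,000 array of doubles needs 100,000 * 100,000 * 8 bytes = 80 GB. One common workaround, which the excerpt itself does not mention, is to compute kernel rows on demand and keep only the most recently used ones. The sketch below assumes a plain linear kernel (dot product); `KernelRowCache` and its members are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: LRU cache of kernel-matrix rows instead of a full n x n array. */
final class KernelRowCache {
    private final double[][] data;
    private final Map<Integer, double[]> rows;
    int computedRows = 0; // counts how often a row had to be (re)computed

    KernelRowCache(double[][] data, final int maxRows) {
        this.data = data;
        // accessOrder = true makes the map iterate from least to most recently used.
        this.rows = new LinkedHashMap<Integer, double[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, double[]> eldest) {
                return size() > maxRows; // evict the least recently used row
            }
        };
    }

    /** Linear kernel (assumed); any other kernel function could be substituted. */
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Returns row i of the kernel matrix, computing it only on a cache miss. */
    double[] row(int i) {
        double[] cached = rows.get(i);
        if (cached == null) {
            cached = new double[data.length];
            for (int j = 0; j < data.length; j++) {
                cached[j] = dot(data[i], data[j]);
            }
            computedRows++;
            rows.put(i, cached);
        }
        return cached;
    }

    double kernel(int i, int j) {
        return row(i)[j];
    }
}
```

Memory use is then bounded by `maxRows * n` doubles, at the cost of recomputing rows that have been evicted.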
I'm trying to build a binary classification decision tree out of huge datasets (i.e. too large to be stored in memory) using MATLAB. Essentially, what I'm doing is:
Collect all the data
Try out n decision functions on the data
Pick out the best decision function to separate the classes within the data
Split the original dataset into 2...
I'd like to sift text (in particular, Twitter messages) to see if they relate to a particular topic. Have you been down that road? If so, I'd love to hear what approach you'd use.
For my case, just searching for topic keywords gets me useful text about 7% of the time; the keywords have multiple meanings, some of which aren't on topic. F...
I have some kind of object model and I need to filter and sort its nodes by some kind of property. What kinds of automated systems exist to generate and select properties of the object model that correlate to what I want? (I'm intentionally being abstract and non-specific.)
I'm thinking of a system that works kind of like spam filters ...
I work on a social media monitoring system. We don't crawl the web ourselves, we get feeds from aggregators like Spinn3r. In most cases, the "blogs" that are nothing but pages of links to porn sites are filtered, but we'd like something in-house that we can train on a quicker time frame than waiting for upstream providers to make changes...
For an enterprise application research project that another person and I are working on, we are looking to remove certain content from the page to keep the posted messages universal (meaning not offensive and essentially anonymous). Right now we want to take a message that a user has posted to a message board, and remove any type of name, nam...
Hi,
I need some sort of solution in Java for the following requirements:
Search a text for certain terms (each term can be 1-3 words). For example: {"hello world", "hello"}. The match needs to be exact.
There are about 500 term groups, each containing about 30 terms.
Each text might contain up to 4000 words.
performance is ...
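A minimal sketch of one way to meet the requirements listed above: store every term in a hash map and look up each 1-, 2-, and 3-word window of the text. The class `TermMatcher` and its methods are hypothetical, and the sketch assumes that words are separated by whitespace, that matching is case-sensitive, and that each term belongs to a single group.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: exact matching of 1-3 word terms via n-gram lookups. */
final class TermMatcher {
    private static final int MAX_TERM_WORDS = 3;
    private final Map<String, String> termToGroup = new HashMap<>();

    /** Registers a term under a group; a later group overwrites an earlier one. */
    void addTerm(String group, String term) {
        termToGroup.put(normalize(term), group);
    }

    /** Collapses runs of whitespace so that terms and text windows compare equal. */
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    /** Returns the groups that have at least one term occurring in the text. */
    Set<String> matchingGroups(String text) {
        String[] words = normalize(text).split(" ");
        Set<String> groups = new TreeSet<>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder window = new StringBuilder();
            for (int len = 1; len <= MAX_TERM_WORDS && start + len <= words.length; len++) {
                if (len > 1) window.append(' ');
                window.append(words[start + len - 1]);
                String group = termToGroup.get(window.toString());
                if (group != null) groups.add(group);
            }
        }
        return groups;
    }
}
```

For a 4,000-word text this performs at most about 12,000 hash lookups, regardless of how many terms are registered. Punctuation handling is left out: a window such as "world," would not match the term "world" here.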