nlp

Which is better: GATE or RapidMiner?

Hi, I've started writing a simple sentiment analysis tool. Currently I am looking at GATE (http://gate.ac.uk) and RapidMiner (http://rapid-i.com/). Being a beginner, I am not able to concentrate on both. Could someone please tell me which one is better in terms of usage, learning curve, licensing, etc.? Thanks, Shiv ...

Can you programmatically detect pluralizations of English words, and derive the singular form?

The title says it all: Given some (English) word that we shall assume is a plural, is it possible to derive the singular form? I'd like to avoid lookup/dictionary tables if possible. Some examples: Examples -> Example (a simple 's' suffix); Glitches -> Glitch ('es' suffix, as opposed to the above); Countries -> Country ('ies' suffix)....
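
A minimal sketch of the three suffix rules from the examples above, assuming the input really is a regular plural. The function name `singularize` and the rule list are illustrative only; irregular forms (mice, children) and words such as "movies" or "glass" would still be handled wrongly without some kind of exception table.

```python
import re

# Ordered suffix rules: the first pattern that matches wins.
# Illustrative only, not an exhaustive description of English plurals.
_RULES = [
    (re.compile(r"ies$", re.IGNORECASE), "y"),                  # Countries -> Country
    (re.compile(r"(ch|sh|ss|x|z)es$", re.IGNORECASE), r"\1"),   # Glitches -> Glitch
    (re.compile(r"s$", re.IGNORECASE), ""),                     # Examples -> Example
]


def singularize(word: str) -> str:
    """Strip a regular plural suffix; return the word unchanged if none matches."""
    for pattern, replacement in _RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word, count=1)
    return word


for plural in ("Examples", "Glitches", "Countries"):
    print(plural, "->", singularize(plural))
```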

How to determine the (natural) language of a document?

I have a set of documents in two languages: English and German. There is no usable meta information about these documents; a program can look at the content only. Based on that, the program has to decide which of the two languages the document is written in. Is there any "standard" algorithm for this problem that can be implemented in a...
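
One lightweight approach for a two-language case like this, sketched here as an assumption rather than as a standard algorithm, is to count hits against short lists of very frequent function words. The word lists and the name `guess_language` below are illustrative; real lists would be longer, and character n-gram models are a common alternative.

```python
import re

# Tiny illustrative function-word lists (assumed, not from any library).
ENGLISH = {"the", "and", "of", "to", "is", "that", "it", "with", "for", "this"}
GERMAN = {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "mit", "für"}


def guess_language(text: str) -> str:
    """Return 'en', 'de', or 'unknown' based on function-word counts."""
    # [^\W\d_]+ matches runs of Unicode letters, so umlauts are kept.
    words = re.findall(r"[^\W\d_]+", text.lower())
    english_hits = sum(w in ENGLISH for w in words)
    german_hits = sum(w in GERMAN for w in words)
    if english_hits == german_hits:
        return "unknown"
    return "en" if english_hits > german_hits else "de"


print(guess_language("The cat is on the mat and it sleeps"))
print(guess_language("Der Hund ist nicht mit der Katze"))
```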

Google Wave Context-Aware Spell Checker

Is it possible to use the Google Wave Context-Aware Spell Checker via web services? If yes, can anyone please be kind enough to post a simple example? ...

Natural Language CFG builder Algorithm

Hello, I am working on a natural language processing project that aims to build libraries for the Arabic language. We are working on a POS tagger, and now I am thinking about the grammar phase. Since Arabic and many other languages have complicated grammar, it is very hard to build a context-free grammar (CFG) for them. For this reason I had an idea for an...

What is a fast and unsupervised way of checking quality of pdf-extracted text?

I am working on a somewhat large corpus with articles numbering in the tens of thousands. I am currently using PDFBox to extract with varying success, and I am looking for a way to programmatically check each file to see if the extraction was moderately successful or not. I'm currently thinking of running a spellchecker on each of them, but ...
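
A cheap stand-in for a full spellchecker, sketched under the assumption that a plain word list for the corpus language is available: score each extracted file by the fraction of its tokens found in that list, and flag files below some threshold. The name `extraction_score`, the sample vocabulary, and any threshold are hypothetical.

```python
import re
from typing import Set


def extraction_score(text: str, vocabulary: Set[str]) -> float:
    """Fraction of alphabetic tokens that appear in the vocabulary (0.0 to 1.0)."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0  # nothing recognisable was extracted
    known = sum(token in vocabulary for token in tokens)
    return known / len(tokens)


# Hypothetical vocabulary; in practice this would be loaded from a word list.
vocabulary = {"the", "quick", "brown", "fox"}
print(extraction_score("The quick brown fox", vocabulary))
print(extraction_score("Th e qu ick br own", vocabulary))
```

Text broken up by bad extraction tends to produce many short non-word fragments, so its score drops well below that of cleanly extracted text.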

How to automatically excerpt user generated content?

I run a website that allows users to write blog posts. I would really like to summarize the written content and use it to fill the <meta name="description".../> tag, for example. What methods can I employ to automatically summarize/describe the contents of user generated content? Are there any (preferably free) methods out there that have...

Correlation clustering in R

I'd like to use correlation clustering and I figure R is a good place to start. I can present the data to R as a set of large, sparse vectors or as a table with a pre-computed dissimilarity matrix. My question is: are there existing R functions to turn this into a hierarchical cluster with agnes that uses correlation clustering? Will I ...

Books/resources for Natural Language Processing for non-academics

I found "Natural Language Processing with Python" today, and am wondering what other good, non-academic (the research papers tend to be too dry and/or specific to certain areas) NLP resources the SO community knows about. I'm starting out in text processing for a couple of hobby projects, and am keen to find good places to start :) ...

Classifying Text Based on Groups of Keywords?

I have a list of requirements for a software project, assembled from the remains of its predecessor. Each requirement should map to one or more categories. Each of the categories consists of a group of keywords. What I'm trying to do is find an algorithm that would give me a score ranking which of the categories each requirement is likel...
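
A minimal sketch of one possible scoring scheme, assuming each category is simply a set of keywords: score a requirement by the share of a category's keywords that occur in it, then sort. The function name `rank_categories` and the sample categories are made up for illustration; stemming or weighting rarer keywords more heavily would be natural refinements.

```python
import re


def rank_categories(requirement, categories):
    """Return (category, score) pairs, best match first.

    The score is the fraction of the category's keywords found in the text.
    """
    words = set(re.findall(r"\w+", requirement.lower()))
    scores = []
    for name, keywords in categories.items():
        keys = {k.lower() for k in keywords}
        score = len(words & keys) / len(keys) if keys else 0.0
        scores.append((name, score))
    # Highest score first; ties are broken alphabetically by category name.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical categories and requirement text.
categories = {
    "security": {"login", "password", "encrypt"},
    "ui": {"button", "screen"},
}
print(rank_categories("User must enter a password at the login screen", categories))
```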

What’s the best profanity filter which supports Java integration?

What is the best profanity filter (free / open source or paid commercial) which supports Java integration? It needs to be able to take a string and return a clean string... Can be a web service and doesn't necessarily have to support Java... Happy programming... ...

Package to compare LSA, TFIDF, Cosine metrics and Language Models

Hi, I'm looking for a package (any language, really) that I can use on a corpus of 50 documents to perform interdocument similarity testing in various metrics, like tfidf, okapi, language models, lsa, etc. I want as a result a document similarity matrix, i.e. doc1 is x% similar to doc2, etc... This is for research purposes, not for pr...
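
For a corpus this small, one of the listed metrics can be sketched directly with the standard library. The block below builds a TF-IDF cosine similarity matrix; it is an illustration only, not a replacement for a proper package, and the function name and the smoothed IDF formula are choices made here rather than anything from the question.

```python
import math
import re
from collections import Counter


def tfidf_similarity_matrix(docs):
    """Return an n x n list of cosine similarities between TF-IDF vectors."""
    tokenized = [re.findall(r"\w+", doc.lower()) for doc in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption of this sketch) keeps weights positive.
        vectors.append(
            {t: count * (math.log((1 + n) / (1 + df[t])) + 1) for t, count in tf.items()}
        )

    def cosine(a, b):
        dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    return [[cosine(a, b) for b in vectors] for a in vectors]


matrix = tfidf_similarity_matrix(["the cat sat", "the cat sat down", "dogs bark loudly"])
for row in matrix:
    print(["%.2f" % value for value in row])
```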

How does AraMorph 1.2.1 work?

I have downloaded the AraMorph 1.2.1 Perl version from SourceForge, but I do not know how to use it. Could someone explain to me how I can get it to work? ...

Lemmatization in Java

Hi, I am looking for a lemmatisation implementation for English in Java. I found a few already, but I need something that does not need too much memory to run (1 GB at most). Thanks. I DO NOT NEED A STEMMER. ...

Integrating my program with a web2.0 website

I'm creating an ELIZA-like chatterbot, and I'd like to calibrate it with Omegle, using what the other person types as the input. If it were a regular HTML page, I could parse it and send back the response to some script, but checking the source code, I've noticed that the entire page is created using JavaScript, but obfuscates the entire...

In natural language processing, what is the purpose of chunking?

Does anyone know? Is this a place to ask computer science questions or just programming? ...

Natural Language Processing in C++

I'm working on a project that already has a C++ base. I would like to have a plug-in for some natural language processing. I really like GATE, but I'm not sure if it's worth launching the JVM and splitting the project into C++ and Java portions. I noticed UIMA has a C++ framework; I have not tried it, but it seems to have fewer features t...

Natural language command language

I'm interested in developing a natural language command language for a domain with existing rules. I was very impressed when Terry Winograd's SHRDLU showed the way (the conversation below is 40 years old! Astonishing). Can we do better now and if so where can I get examples? Person: Pick up a big red block. Computer: OK. Person: ...

NLTK tagging in German

I am using NLTK to extract nouns from a text-string starting with the following command: tagged_text = nltk.pos_tag(nltk.Text(nltk.word_tokenize(some_string))) It works fine in English. Is there an easy way to make it work for German as well? (I have no experience with natural language programming, but I managed to use the python nl...

extract grammar features from sentence on Google App Engine

Hello, for my GAE app I need to do some natural language processing to extract the subject and object from an input sentence. Apparently NLTK can't be installed (easily) on GAE, so I am looking for another solution. I noticed GAE comes with Antlr3, but from browsing their documentation it solves a different kind of grammar problem. Any...