nlp

Natural Language Processing: Find obscenities in English?

Given a set of words tagged for part of speech, I want to find those that are obscenities in mainstream English. How might I do this? Should I just make a huge list, and check for the presence of anything in the list? Should I try to use a regex to capture a bunch of variations on a single root? If it makes it easier, I don't want to fi...
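One way to combine the two ideas raised in the question (a word list plus a regex for variations on a root) is sketched below. This is only an illustration: the roots in `ROOTS`, the suffix set, and the function name `flag_obscenities` are hypothetical placeholders, not a recommended list.

```python
import re

# Hypothetical placeholder roots; a real list would be far longer.
ROOTS = ["darn", "heck"]

# One pattern for all roots: a listed root followed by an optional common suffix.
PATTERN = re.compile(
    r"^(?:" + "|".join(map(re.escape, ROOTS)) + r")(?:s|ed|ing|er|ers)?$",
    re.IGNORECASE,
)


def flag_obscenities(tagged_words):
    """Return the (word, tag) pairs whose word matches a listed root."""
    return [(word, tag) for word, tag in tagged_words if PATTERN.match(word)]
```

Anchoring the pattern with `^` and `$` avoids flagging innocent words that merely contain a root as a substring, at the cost of missing variations outside the suffix set.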

Calculating context-sensitive text correlation

Suppose I want to match address records (or person names or whatever) against each other to merge records that are most likely referring to the same address. Basically, I guess I would like to calculate some kind of correlation between the text values and merge the records if this value is over a certain threshold. Example: "West Lawnm...
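A minimal sketch of the "similarity score plus threshold" idea, using the standard library's `difflib.SequenceMatcher`. The normalization rules, the `0.85` threshold, and the function names are assumptions for illustration; real address matching usually needs domain-specific normalization (abbreviations, house numbers) on top of this.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def similarity(a, b):
    """Similarity in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def should_merge(a, b, threshold=0.85):
    """Treat two records as the same entity if they are similar enough.

    The threshold is an assumed value and should be tuned on real data.
    """
    return similarity(a, b) >= threshold
```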

Java Stanford NLP: Spell checking

I'm trying to check spelling accuracy of text samples using the Stanford NLP. It's just a metric of the text, not a filter or anything, so if it's off by a bit it's fine, as long as the error is uniform. My first idea was to check if the word is known by the lexicon: private static LexicalizedParser lp = new LexicalizedParser("englishP...

Java: Multidimensional Scaling?

I'm doing a Natural Language Processing project where I compute a bunch of attributes of a text, giving me a vector of values for each text. I want to compare these vectors with multidimensional scaling. What Java libraries/toolkits do you recommend for doing this? ...

Java Stanford NLP: ArrayIndexOutOfBounds after loading second lexicon

I am using the Stanford Natural Language processing toolkit. I've been trying to find spelling errors with Lexicon's isKnown method, but it produces quite a few false positives. So I thought I'd load a second lexicon, and check that too. However, that causes a problem. private static LexicalizedParser lp = new LexicalizedParser(Constant...

How to identify people's names in text (Java)

I have some input text that contains one or more people's names. I do not have a dictionary of these names. Which Java library can help me identify names in my input text? I looked through OpenNLP, but did not find any example, guide, or even a description of how to apply it in my code. (I saw javadoc, but it is pr...

What is the difference between the Forward-backward algorithm and the Viterbi algorithm?

What is the difference between the Forward-backward algorithm on an n-gram model and the Viterbi algorithm on an HMM? When I review the implementations of these two algorithms, the only difference I found is that the transition probability comes from different probabilistic models. Is there a difference between these two algorithms? ...
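To make the contrast concrete, here is a small sketch on a toy HMM. The forward pass (the first half of forward-backward) sums probability over all state paths, while Viterbi keeps only the best path by taking a maximum at the same point in the recursion. The model parameters and names below are invented for illustration and do not come from the question.

```python
# Toy HMM; all numbers are made up for illustration.
STATES = ("Rainy", "Sunny")
START = {"Rainy": 0.6, "Sunny": 0.4}
TRANS = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
EMIT = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}


def forward(obs, states, start, trans, emit):
    """Total probability of the observations, summed over all state paths."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: emit[s][o] * sum(alpha[p] * trans[p][s] for p in states)
            for s in states
        }
    return sum(alpha.values())


def viterbi(obs, states, start, trans, emit):
    """Probability and state sequence of the single most likely path."""
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            # Same recursion as forward(), but with max in place of sum.
            prob, path = max(
                ((best[p][0] * trans[p][s], best[p][1]) for p in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit[s][o], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])
```

Because the best path is just one of the paths that the forward pass sums over, the Viterbi probability can never exceed the forward probability for the same observations.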

NLTK - how to find out what corpora are installed from within Python?

I'm trying to load some corpora I installed with the NLTK installer but I got a: >>> from nltk.corpus import machado Traceback (most recent call last): File "<stdin>", line 1, in <module> ImportError: cannot import name machado But in the download manager (nltk.download()) the package machado is marked as installed a...

How to Convert English to Cron?

I did some searching but haven't found anything that looks useful yet. Does anyone know of something (tool, lib, etc.) that can parse English phrases and translate them into a cron string? For example: Every Tuesday at 15:00 converts to 0 15 * * 2 It seems like something that would have lots of gotchas and it would be pr...
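As a rough sketch of the simplest case, the single phrase shape from the example ("Every <weekday> at HH:MM") can be handled with one regular expression. The function name `english_to_cron` and the supported grammar are assumptions for illustration; anything beyond this one pattern would need a real parser.

```python
import re

# Standard cron day-of-week numbering: Sunday is 0.
DAYS = {
    "sunday": 0, "monday": 1, "tuesday": 2, "wednesday": 3,
    "thursday": 4, "friday": 5, "saturday": 6,
}

PHRASE = re.compile(r"^every\s+(\w+)\s+at\s+(\d{1,2}):(\d{2})$", re.IGNORECASE)


def english_to_cron(phrase):
    """Translate 'Every <weekday|day> at HH:MM' into a five-field cron string."""
    match = PHRASE.match(phrase.strip())
    if not match:
        raise ValueError(f"unsupported phrase: {phrase!r}")
    day = match.group(1).lower()
    hour, minute = int(match.group(2)), int(match.group(3))
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError(f"invalid time in phrase: {phrase!r}")
    if day == "day":
        day_of_week = "*"
    elif day in DAYS:
        day_of_week = str(DAYS[day])
    else:
        raise ValueError(f"unknown day: {day!r}")
    # Fields: minute hour day-of-month month day-of-week
    return f"{minute} {hour} * * {day_of_week}"
```

Rejecting unrecognized phrases with an error, rather than guessing, keeps the gotchas visible instead of silently scheduling a job at the wrong time.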

Using Aiksaurus for NLP

What is the original source for the thesaurus data in Aiksaurus? Is it possible to get data about antonyms for a word from Aiksaurus? ...

What is a good tool for Natural Language Detection in Java?

I need to do natural language detection (with confidence scores), preferably in Java; I'd really rather not introduce more platforms/technologies at this stage of the project. I have previously used the Google API for this in a PoC, but I now need to scale up to very large amounts of data, so any web-based solution won't cut it (also Google ar...

Detect Proper Nouns with WordNet?

I'm using JAWS to access WordNet. Given a word, is there any way to detect if it is a proper noun? It looks like the synsets have pretty coarse lexical categories. To clarify, there is no context for the words - they are just presented individually. If a word could conceivably be used as a common noun, it is acceptable. So "mark" is fin...

What are the prerequisites to learning natural language processing?

Hi, all. I am planning to learn natural language processing this year. But when I started reading introductory books on this topic, I found that I was missing a lot of points, mainly relating to mathematics. So I'm here asking what I should learn first so that I can learn NLP more smoothly. Thanks in advance. ...

How does Amazon's Statistically Improbable Phrases work?

How does something like Statistically Improbable Phrases work? According to Amazon: Amazon.com's Statistically Improbable Phrases, or "SIPs", are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If ...

Matching substrings from a dictionary to other string: suggestions?

Hello Stack Overflow people. I'd like some suggestions regarding the following problem. I am using Java. I have an array #1 with a number of Strings. For example, two of the strings might be: "An apple fell on Newton's head" and "Apples grow on trees". On the other side, I have another array #2 with terms like (Fruits => Apple, Orange...

Training set - proportion of pos / neg / neutral sentences

Hello, I am hand-tagging Twitter messages as Positive, Negative, or Neutral. I am trying to work out whether there is some logic for deciding what proportion of the training set should be positive, negative, and neutral. For example, if I am training a Naive Bayes classifier with 1000 Twitter messages, should the proportion o...

Morphophoneme processing library in Java

Are there any good Java libraries with prebuilt dictionaries that I can use to try and extract word roots from input words? I asked a more general question which supersedes this question. It is here. Please vote to close this question. ...

Natural Language Parsing tools: what is out there and what is not?

I'm looking for various NLP tools for a project I'm working on, and so far the Stanford NLP projects are the most useful I've found. Does anyone know of other tools that would be useful for a language understander? And more importantly, are there tools that are NOT out there? Most specifically, I'm looking fo...

How to choose a Feature Selection Algorithm? - advice

Is there a research paper or book that can tell me which feature selection algorithm would work best for the problem at hand? I am simply trying to classify Twitter messages as pos/neg (to begin with). I started out with frequency-based feature selection (having started with the NLTK book) but soon realised that for a ...

General frameworks for preparing training data?

As a student of computational linguistics, I frequently do machine learning experiments where I have to prepare training data from all kinds of different resources like raw or annotated text corpora or syntactic tree banks. For every new task and every new experiment I write programs (normally in Python and sometimes Java) to extract the...