Given a set of words tagged for part of speech, I want to find those that are obscenities in mainstream English. How might I do this? Should I just make a huge list, and check for the presence of anything in the list? Should I try to use a regex to capture a bunch of variations on a single root?
If it makes it easier, I don't want to fi...
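The two options raised above (a plain list lookup versus a regex over a root) can be compared side by side. This is only a minimal sketch: the words `darn` and `heck` are mild placeholders standing in for a real list, the suffix set is an assumption, and all function names are hypothetical.

```python
import re

# Hypothetical stand-in vocabulary; a real list would be far longer.
BLOCKLIST = {"darn", "heck"}

# One pattern covering each root plus an assumed set of common suffixes.
ROOT_PATTERN = re.compile(r"^(darn|heck)(s|ed|ing|er|ers)?$", re.IGNORECASE)


def is_flagged_exact(word):
    """Exact, case-insensitive lookup against the list."""
    return word.lower() in BLOCKLIST


def is_flagged_by_root(word):
    """Regex check that also catches suffixed variations of a root."""
    return ROOT_PATTERN.match(word) is not None


def flagged_words(tagged_words):
    """Return the words from (word, tag) pairs flagged by either check."""
    return [
        word
        for word, _tag in tagged_words
        if is_flagged_exact(word) or is_flagged_by_root(word)
    ]
```

Anchoring the pattern with `^` and `$` keeps unrelated words that merely contain a root (for example `heckle`) from being flagged, which is the usual weakness of substring matching.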
Suppose I want to match address records (or person names or whatever) against each other to merge records that are most likely referring to the same address. Basically, I guess I would like to calculate some kind of correlation between the text values and merge the records if this value is over a certain threshold.
Example:
"West Lawnm...
I'm trying to check spelling accuracy of text samples using the Stanford NLP. It's just a metric of the text, not a filter or anything, so if it's off by a bit it's fine, as long as the error is uniform.
My first idea was to check if the word is known by the lexicon:
private static LexicalizedParser lp = new LexicalizedParser("englishP...
I'm doing a Natural Language Processing project where I compute a bunch of attributes of a text, giving me a vector of values for each text. I want to compare these vectors with multidimensional scaling. What Java libraries/toolkits do you recommend for doing this?
...
I am using the Stanford Natural Language processing toolkit. I've been trying to find spelling errors with Lexicon's isKnown method, but it produces quite a few false positives. So I thought I'd load a second lexicon, and check that too. However, that causes a problem.
private static LexicalizedParser lp = new LexicalizedParser(Constant...
I have some input text that contains one or more personal names. I do not have a dictionary of these names. Which Java library can help me identify the names in my input text?
I looked through OpenNLP, but did not find any example, guide, or even a description of how it can be used in my code. (I saw javadoc, but it is pr...
What is the difference between the forward-backward algorithm on an n-gram model and the Viterbi algorithm on an HMM?
When I review the implementations of these two algorithms, the only thing I found is that the transition probability comes from different probabilistic models.
Is there a difference between these 2 algorithms?
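One way to see where the two computations diverge is to run both on the same model. The sketch below uses a toy HMM (not the n-gram model from the question) whose states, probabilities, and function names are all assumptions chosen for illustration. It shows only the forward half of forward-backward: the forward pass sums over all state paths, while Viterbi keeps the maximum and remembers the path. The full forward-backward algorithm would add a backward pass to obtain per-position state posteriors.

```python
# Toy HMM; every number here is an illustrative assumption.
STATES = ("Rainy", "Sunny")
START = {"Rainy": 0.6, "Sunny": 0.4}
TRANS = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
EMIT = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}


def forward(observations):
    """Total probability of the observations, summed over all state paths."""
    alpha = {s: START[s] * EMIT[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        alpha = {
            s: EMIT[s][obs] * sum(alpha[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def viterbi(observations):
    """Probability and state sequence of the single most likely path."""
    best = {s: (START[s] * EMIT[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        step = {}
        for s in STATES:
            # Same recurrence as forward(), with max in place of sum.
            prob, path = max(
                ((best[p][0] * TRANS[p][s], best[p][1]) for p in STATES),
                key=lambda pair: pair[0],
            )
            step[s] = (prob * EMIT[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda pair: pair[0])
```

Calling `viterbi(["walk", "shop", "clean"])` returns the best single path and its probability, while `forward(...)` on the same input returns a larger number because it accumulates every path, not just the best one.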
...
I'm trying to load some corpora I installed with the NLTK installer but I got a:
>>> from nltk.corpus import machado
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name machado
But in the download manager (nltk.download()) the package machado is marked as installed a...
I did some searching but haven't found anything useful yet. Does anyone know of something (a tool, library, etc.) that can parse English phrases and translate them into a cron string?
For example: Every Tuesday at 15:00 converts to 0 15 * * 2
It seems like something that would have lots of gotchas and it would be pr...
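For the single phrase shape in the example, a regex is enough to get started. The following is a minimal sketch, not an existing library: the function name `to_cron`, the pattern, and the decision to support only "Every <weekday> at HH:MM" are all assumptions, and anything outside that shape is rejected.

```python
import re

# Cron day-of-week numbers, with Sunday as 0.
WEEKDAYS = {
    "sunday": 0,
    "monday": 1,
    "tuesday": 2,
    "wednesday": 3,
    "thursday": 4,
    "friday": 5,
    "saturday": 6,
}

# Matches only phrases shaped like "Every Tuesday at 15:00".
PHRASE = re.compile(
    r"^every\s+(?P<day>[a-z]+)\s+at\s+(?P<hour>\d{1,2}):(?P<minute>\d{2})$",
    re.IGNORECASE,
)


def to_cron(phrase):
    """Translate 'Every <weekday> at HH:MM' into a five-field cron string."""
    match = PHRASE.match(phrase.strip())
    if match is None:
        raise ValueError(f"unsupported phrase: {phrase!r}")
    day = match.group("day").lower()
    if day not in WEEKDAYS:
        raise ValueError(f"unknown weekday: {day!r}")
    hour = int(match.group("hour"))
    minute = int(match.group("minute"))
    if hour > 23 or minute > 59:
        raise ValueError(f"invalid time in phrase: {phrase!r}")
    # Field order: minute, hour, day of month, month, day of week.
    return f"{minute} {hour} * * {WEEKDAYS[day]}"
```

With this sketch, `to_cron("Every Tuesday at 15:00")` yields `0 15 * * 2`, matching the example above. Phrases such as "every other week" or "the last Friday of the month" are exactly the gotchas a pattern like this does not cover.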
What is the original source for the thesaurus data in Aiksaurus?
Is it possible to get data about antonyms for a word from Aiksaurus?
...
I need to do natural language detection (with confidence scores), preferably in Java; I'd really rather not introduce more platforms/technologies at this stage of the project. I have previously used the Google API for this in a PoC, but I now need to scale up to very large amounts of data, so a web-based solution won't cut it (also Google ar...
I'm using JAWS to access WordNet. Given a word, is there any way to detect if it is a proper noun? It looks like the synsets have pretty coarse lexical categories.
To clarify, there is no context for the words - they are just presented individually. If a word could conceivably be used as a common noun, it is acceptable. So "mark" is fin...
Hi, all.
I am planning to learn natural language processing this year.
But when I started reading introductory books on the topic, I found that I was missing a lot of background, mainly in mathematics.
So I'm here asking what I should learn first so that I can learn NLP more smoothly.
Thanks in advance.
...
How does something like Statistically Improbable Phrases work?
According to Amazon:
Amazon.com's Statistically Improbable Phrases, or "SIPs", are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If ...
Hello, Stack Overflow people. I'd like some suggestions regarding the following problem. I am using Java.
I have an array #1 with a number of Strings. For example, two of the strings might be: "An apple fell on Newton's head" and "Apples grow on trees".
On the other side, I have another array #2 with terms like (Fruits => Apple, Orange...
Hello,
I am hand-tagging Twitter messages as Positive, Negative, or Neutral. Is there some logic one can use to decide what proportion of the messages in the training set should be positive, negative, and neutral?
For example, if I am training a Naive Bayes classifier with 1000 Twitter messages, should the proportion o...
Are there any good Java libraries with prebuilt dictionaries that I can use to try and extract word roots from input words?
I asked a more general question which supersedes this question. It is here. Please vote to close this question.
...
I'm looking for various NLP tools for a project I'm working on, and so far the most useful ones I've found are the Stanford NLP projects.
Does anyone know if there are other tools out there that would be useful for a language understander?
And more importantly, are there tools that are NOT out there?
Most specifically, I'm looking fo...
Is there a research paper or book that explains which feature selection algorithm would work best for the problem at hand?
I am simply trying to classify Twitter messages as pos/neg (to begin with). I started out with frequency-based feature selection (having started with the NLTK book) but soon realised that for a ...
As a student of computational linguistics, I frequently do machine learning experiments where I have to prepare training data from all kinds of different resources like raw or annotated text corpora or syntactic tree banks. For every new task and every new experiment I write programs (normally in Python and sometimes Java) to extract the...