nlp

Where can I find a dump of raw text on the web?

I am looking to do some text analysis in a program I am writing. I am looking for alternate sources of text in its raw form, similar to what is provided in the Wikipedia dumps (download.wikimedia.com). I'd rather not have to go through the trouble of crawling websites, trying to parse the HTML, extracting text, etc. ...

Extracting a set of words with Python/NLTK, then comparing it to a standard English dictionary.

I have: from __future__ import division import nltk, re, pprint f = open('/home/a/Desktop/Projects/FinnegansWake/JamesJoyce-FinnegansWake.txt') raw = f.read() tokens = nltk.wordpunct_tokenize(raw) text = nltk.Text(tokens) words = [w.lower() for w in text] f2 = open('/home/a/Desktop/Projects/FinnegansWake/catted-several-long-Russian-nov...

Effective way to determine if a message is spam?

Is there a way to determine whether a given message is spam? For example, people who post on forums and advertise their own sites for various products. ...

Detect English verb tenses using NLTK

I am looking for a way, given an English text, to count the verb phrases in it in past, present and future tenses. For now I am using NLTK to do POS (Part-Of-Speech) tagging and then counting, say, 'VBD' to get past tenses. This is not accurate enough though, so I guess I need to go further and use chunking, then analyze VP-chunks for specific tense ...
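A minimal sketch of the tag-counting approach described in this question, for illustration only. It assumes the text has already been tagged into (word, tag) pairs with Penn Treebank tags, the shape that nltk.pos_tag returns. The TENSE_BY_TAG mapping, the count_tenses helper, and the "will/shall + VB" rule for future tense are assumptions made here, not part of the original post, and they share the accuracy limits the asker mentions.

```python
from collections import Counter

# Rough mapping from Penn Treebank verb tags to a tense label.
# This mapping is an illustrative assumption, not a linguistic standard.
TENSE_BY_TAG = {
    "VBD": "past",
    "VBP": "present",
    "VBZ": "present",
}

def count_tenses(tagged_tokens):
    """Count rough tense labels in a list of (word, tag) pairs."""
    counts = Counter()
    previous_word = None
    for word, tag in tagged_tokens:
        # Treat "will"/"shall" followed by a base-form verb as future.
        if tag == "VB" and previous_word in ("will", "shall"):
            counts["future"] += 1
        elif tag in TENSE_BY_TAG:
            counts[TENSE_BY_TAG[tag]] += 1
        previous_word = word.lower()
    return counts

# Hand-tagged sample so the sketch runs without downloading tagger data.
sample = [
    ("She", "PRP"), ("walked", "VBD"), ("home", "NN"), (".", "."),
    ("He", "PRP"), ("runs", "VBZ"), ("daily", "RB"), (".", "."),
    ("They", "PRP"), ("will", "MD"), ("leave", "VB"), (".", "."),
]
tense_counts = count_tenses(sample)
```

Counting single tags like this misses compound forms such as "has been walking", which is why the question goes on to consider chunking.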

What's needed for NLP?

Hello, assuming that I know nothing about anything and that I'm starting programming TODAY, what would you say I need to learn in order to start working with Natural Language Processing? I've been struggling with some string parsing methods, but so far they are just annoying me and making me write ugly code. I'm lookin...

Extract inconsistently formatted date from string (date parsing, NLP)

I have a large list of files, some of which have dates embedded in the filename. The format of the dates is inconsistent and often incomplete, e.g. "Aug06", "Aug2006", "August 2006", "08-06", "01-08-06", "2006", "011004" etc. In addition to that, some filenames have unrelated numbers that look somewhat like dates, e.g. "20202010". In ...
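One possible starting point for the month-name formats listed in this question ("Aug06", "Aug2006", "August 2006") is a regular expression, sketched below. The MONTH_YEAR pattern, the find_month_year helper, and the rule that two-digit years mean 20xx are all assumptions made for this illustration. Purely numeric forms such as "08-06" or "011004" are deliberately not handled, since they are ambiguous without more context.

```python
import re

MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

# A month name (abbreviated or longer), an optional space, then a 4- or
# 2-digit year that is not followed by another digit.
MONTH_YEAR = re.compile(
    r"(?<![A-Za-z])(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)"
    r"[a-z]*\s?(\d{4}|\d{2})(?!\d)",
    re.IGNORECASE,
)

def find_month_year(filename):
    """Return (year, month) for the first month-name date found, else None."""
    match = MONTH_YEAR.search(filename)
    if match is None:
        return None
    month = MONTHS[match.group(1).lower()]
    year = int(match.group(2))
    if year < 100:
        year += 2000  # assumption: two-digit years are in the 2000s
    return (year, month)
```

For example, find_month_year("report_Aug06.txt") and find_month_year("August 2006 minutes.doc") both give (2006, 8), while find_month_year("20202010") gives None. The pattern accepts any word that merely starts with a month abbreviation, so a name like "mayhem06" would be a false positive.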

Tools for getting intent from Twitter statuses?

I am considering a project in which a publication's content is augmented by relevant, publicly available tweets from people in the area. But how could I programmatically find the relevant Tweets? I know that generating a structure representing the meaning of natural language is pretty much the holy grail of NLP, but perhaps there's some ...

Break/Decompose complex and compound sentences in nltk

Is there a way to decompose complex sentences into simple sentences in nltk or other natural language processing libraries? For example: "The park is so wonderful when the sun is setting and a cool breeze is blowing" ==> "The sun is setting. A cool breeze is blowing. The park is so wonderful." ...

What’s a good Python profanity filter library?

Like http://stackoverflow.com/questions/1521646/best-profanity-filter, but for Python — and I’m looking for libraries I can run and control myself locally, as opposed to web services. (And whilst it’s always great to hear your fundamental objections of principle to profanity filtering, I’m not specifically looking for them here. I know ...

I have text files in multiple languages. How to selectively delete one language in NLTK?

Maybe this is just impossible and I should give up all hope. Or maybe there's a really clever way to do it that I haven't thought of. Here's two examples of what I've got: يَبِسَ - يَيْبَسُ (yabisa, yaybasu)[y-b-s][ي-ب-س] (To become dry, stiff, rigid) 20:77 yabasan = dry. يَسَّرَ - يُيَسِّرُ (yassara, yuyassiru)[y-s-r][ي-س-ر...

How to find out if a sentence is a question (interrogative)?

Is there an open source Java library/algorithm for finding out whether a particular piece of text is a question or not? I am working on a question answering system that needs to analyze whether the text input by the user is a question. I think the problem can probably be solved by using open source NLP libraries, but it's obviously more complicated than s...

Algorithm to calculate similarity between texts

Hello all, I am trying to score the similarity between posts from social networks, but didn't find any good algorithms for that. Any thoughts? I tried Levenshtein, Jaro-Winkler, and others, but those are more suited to comparing texts without regard to sentiment. In posts we can get one text saying "I really love dogs" and another saying "I really...

Changing the words keeping its meaning intact...

Hi, we have a requirement in which we need to change the words or phrases in a sentence while keeping its meaning intact. This application is going to provide suggestions to users who are involved in copy-writing. I don't know where I should start... we have not yet finalized the technology but would like to do it in a Python o...

Transforming early modern English into 20th century spelling using the NLTK

I have a list of strings that are all early modern English words ending with 'th.' These include hath, appointeth, demandeth, etc. -- they are all conjugated for the third person singular. As part of a much larger project (using my computer to convert the Gutenberg etext of Gargantua and Pantagruel into something more like 20th century ...

Open source libraries for generating automated summaries

Hello All, I am looking for an open source library for generating automated summaries out of a few words. For example: if two qualities of a person are given, a) good thinking skills, b) bad handwriting, I need to generate a sentence like "Bob has good thinking skills however needs to improve on his handwriting". I need to know if any open sourc...

Topic modeling using mallet

Hey guys, I'm trying to use topic modeling with Mallet but have a question. How do I know when I need to rebuild the model? For instance, I have a number of documents I crawled from the web; using the topic modeling provided by Mallet I might be able to create the models and infer documents with it. But over time, with new data that I...

Automatic text translation

What tools or web services are available for machine text translation? For example: ENGLISH TEXT > SERVER or LIB > GERMAN TEXT. Libraries are also acceptable. Is the Google language API the only one? ...

How to efficiently filter a string against a long list of words in Python/Django?

Stack Overflow implemented its "Related Questions" feature by taking the title of the current question being asked and removing from it the 10,000 most common English words according to Google. The remaining words are then submitted as a fulltext search to find related questions. I want to do something similar in my Django site. What is ...
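The filtering step described in this question can be sketched with a set lookup, which keeps each membership test at roughly constant time regardless of the list's length. The COMMON_WORDS contents and the uncommon_words helper are hypothetical stand-ins; a real list of 10,000 words would be loaded from a file once at startup.

```python
import re

# Tiny stand-in for the 10,000-word list (hypothetical contents).
COMMON_WORDS = frozenset(
    {"a", "an", "the", "of", "in", "to", "how", "against", "long"}
)

def uncommon_words(title, common_words=COMMON_WORDS):
    """Return the words of a title that are not in the common-word set."""
    # Lowercase the title and split it into runs of letters, digits
    # and apostrophes.
    words = re.findall(r"[a-z0-9']+", title.lower())
    return [word for word in words if word not in common_words]

remaining = uncommon_words(
    "How to efficiently filter a string against a long list of words "
    "in Python/Django?"
)
# remaining == ['efficiently', 'filter', 'string', 'list', 'words',
#               'python', 'django']
```

The remaining words could then be joined with spaces and passed to whatever fulltext search the site uses.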

How do I get a list of the most common words in various languages?

Stack Overflow implemented its "Related Questions" feature by taking the title of the current question being asked and removing from it the 10,000 most common English words according to Google. The remaining words are then submitted as a fulltext search to find related questions. How do I get such a list of the most common English words?...

How to use the pretrained MaltParser parsing models for English

I am trying to use the pretrained English parsing model for MaltParser by following the steps on the following page, but I repeatedly get a null pointer exception. http://maltparser.org/mco/english_parser/engmalt.html I am trying this with MaltParser version 1.4 and Java version 6 on a Windows machine. I think the model was tra...