I am looking to do some text analysis in a program I am writing. I am looking for alternate sources of text in its raw form similar to what is provided in the Wikipedia dumps (download.wikimedia.com).
I'd rather not have to go through the trouble of crawling websites, parsing the HTML, extracting text, etc.
...
I have:
from __future__ import division
import nltk, re, pprint
f = open('/home/a/Desktop/Projects/FinnegansWake/JamesJoyce-FinnegansWake.txt')
raw = f.read()
tokens = nltk.wordpunct_tokenize(raw)
text = nltk.Text(tokens)
words = [w.lower() for w in text]
f2 = open('/home/a/Desktop/Projects/FinnegansWake/catted-several-long-Russian-nov...
Is there a way to determine whether a given message is spam? For example, messages from people who post on forums to advertise their own sites for various products.
...
I am looking for a way to count the verb phrases in a given English text by tense: past, present and future. For now I am using NLTK to do POS (part-of-speech) tagging and then counting, say, 'VBD' tags to get past tenses. This is not accurate enough though, so I guess I need to go further and use chunking, then analyze VP-chunks for specific tense ...
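The tag-counting baseline described above can be sketched roughly as follows. This is only a minimal sketch: it assumes the text has already been tagged with Penn Treebank tags (for example by a tagger such as `nltk.pos_tag`), and the function name `count_tenses`, the tag-to-tense mapping, and the "future modal followed by VB" heuristic are illustrative assumptions rather than anything NLTK provides.

```python
from collections import Counter

# Assumed mapping from Penn Treebank verb tags to a coarse tense label.
PAST_TAGS = {"VBD"}
PRESENT_TAGS = {"VBP", "VBZ"}
FUTURE_MODALS = {"will", "shall", "'ll"}


def count_tenses(tagged):
    """Count coarse tenses in a list of (word, tag) pairs."""
    counts = Counter()
    pending_future = False  # True after a future modal, until a verb is seen
    for word, tag in tagged:
        if tag == "MD" and word.lower() in FUTURE_MODALS:
            pending_future = True
        elif tag == "VB" and pending_future:
            counts["future"] += 1
            pending_future = False
        elif tag in PAST_TAGS:
            counts["past"] += 1
            pending_future = False
        elif tag in PRESENT_TAGS:
            counts["present"] += 1
            pending_future = False
    return counts


# Hand-tagged example; in practice the pairs would come from a tagger.
sample = [("She", "PRP"), ("walked", "VBD"), ("home", "NN"), (".", "."),
          ("He", "PRP"), ("walks", "VBZ"), (".", "."),
          ("They", "PRP"), ("will", "MD"), ("walk", "VB"), (".", ".")]
tense_counts = count_tenses(sample)
```

A flat count like this ignores perfect and progressive constructions ("has walked", "is walking"), which is one reason analyzing whole VP chunks may be needed.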
Hello,
Assuming that I know nothing about anything and that I'm starting programming TODAY, what would you say I need to learn in order to start working with Natural Language Processing?
I've been struggling with some string parsing methods but so far it is just annoying me and making me create ugly code. I'm lookin...
I have a large list of files, some of which have dates embedded in the filename. The format of the dates is inconsistent and often incomplete, e.g. "Aug06", "Aug2006", "August 2006", "08-06", "01-08-06", "2006", "011004" etc. In addition to that, some filenames have unrelated numbers that look somewhat like dates, e.g. "20202010".
In ...
I am considering a project in which a publication's content is augmented by relevant, publicly available tweets from people in the area. But how could I programmatically find the relevant Tweets? I know that generating a structure representing the meaning of natural language is pretty much the holy grail of NLP, but perhaps there's some ...
Is there a way to decompose complex sentences into simple sentences in NLTK or other natural language processing libraries?
For example:
The park is so wonderful when the sun is setting and a cool breeze is blowing ==> The sun is setting. A cool breeze is blowing. The park is so wonderful.
...
Like http://stackoverflow.com/questions/1521646/best-profanity-filter, but for Python — and I’m looking for libraries I can run and control myself locally, as opposed to web services.
(And whilst it’s always great to hear your fundamental objections of principle to profanity filtering, I’m not specifically looking for them here. I know ...
Maybe this is just impossible and I should give up all hope. Or maybe there's a really clever way to do it that I haven't thought of.
Here are two examples of what I've got:
يَبِسَ - يَيْبَسُ (yabisa,
yaybasu)[y-b-s][ي-ب-س] (To become dry,
stiff, rigid) 20:77 yabasan = dry.
يَسَّرَ - يُيَسِّرُ (yassara,
yuyassiru)[y-s-r][ي-س-ر...
Is there an open source Java library/algorithm for finding if a particular piece of text is a question or not?
I am working on a question answering system that needs to analyze if the text input by user is a question.
I think the problem can probably be solved by using open source NLP libraries, but it's obviously more complicated than s...
Hello all,
I am trying to score the similarity between posts from social networks, but I haven't found any good algorithms for that. Any thoughts?
I tried Levenshtein, Jaro-Winkler, and others, but those are mostly used to compare texts without regard to sentiment. In posts we can get one text saying "I really love dogs" and another saying "I really...
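To illustrate why plain edit distance falls short for this kind of comparison, here is a minimal sketch. The `levenshtein` function is a standard dynamic-programming implementation written for illustration (not taken from any particular library), and the example posts are hypothetical rather than the ones from the truncated text above.

```python
def levenshtein(a, b):
    """Edit distance between two sequences via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Word-level comparison of hypothetical posts.
positive = "I love this phone".split()
negative = "I hate this phone".split()
paraphrase = "this phone is great".split()

opposite_distance = levenshtein(positive, negative)   # opposite sentiment
similar_distance = levenshtein(positive, paraphrase)  # same sentiment
```

Here the two posts with opposite sentiment are only one edit apart, while the paraphrase with the same sentiment is four edits away, so a surface-level metric ranks them the wrong way round.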
Hi
We have a requirement in which we need to change the words or phrases in a sentence while keeping its meaning intact. This application is going to provide suggestions to users who are involved in copy-writing.
I don't know where I should start... we have not yet finalized the technology but would like to do it in a Python o...
I have a list of strings that are all early modern English words ending with 'th.' These include hath, appointeth, demandeth, etc. -- they are all conjugated for the third person singular.
As part of a much larger project (using my computer to convert the Gutenberg etext of Gargantua and Pantagruel into something more like 20th century ...
Hello All,
I am looking for an open source library for generating automated summaries from a few words. For example, if two qualities of a person are given, a) good thinking skills and b) bad handwriting, I need to generate a sentence like "Bob has good thinking skills however needs to improve on his handwriting". I need to know if any open sourc...
Hey guys,
I'm trying to use topic modeling with Mallet but have a question.
How do I know when I need to rebuild the model? For instance, I have a number of documents that I crawled from the web. Using the topic modeling provided by Mallet, I can build a model and infer topics for documents with it. But over time, with new data that I...
What tools or web services are available for machine text translation?
For example:
ENGLISH TEXT > SERVER or LIB > GERMAN TEXT
Libraries are also acceptable.
Is the Google language API the only one?
...
Stackoverflow implemented its "Related Questions" feature by taking the title of the current question being asked and removing from it the 10,000 most common English words according to Google. The remaining words are then submitted as a fulltext search to find related questions.
I want to do something similar in my Django site. What is ...
Stackoverflow implemented its "Related Questions" feature by taking the title of the current question being asked and removing from it the 10,000 most common English words according to Google. The remaining words are then submitted as a fulltext search to find related questions.
How do I get such a list of the most common English words?...
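The filtering step described above can be sketched roughly as follows. This is a minimal sketch, not Stack Overflow's actual implementation: `COMMON_WORDS` is a tiny hypothetical stand-in for the 10,000-word list, and `search_terms` is an illustrative name.

```python
import re

# Tiny stand-in list; the approach described above uses the
# 10,000 most common English words instead.
COMMON_WORDS = {"a", "an", "the", "how", "do", "i", "in", "to", "of",
                "is", "for", "my", "what", "and"}


def search_terms(title, common_words=COMMON_WORDS):
    """Return the title's words that are not in the common-word list."""
    words = re.findall(r"[a-z0-9']+", title.lower())
    return [w for w in words if w not in common_words]


terms = search_terms("How do I parse dates in Python filenames?")
```

The remaining terms would then be passed as the query to whatever fulltext search backend the site uses.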
I am trying to use the pretrained English parsing model for MaltParser by following the steps on the page below, but I repeatedly get a null pointer exception.
http://maltparser.org/mco/english_parser/engmalt.html
I am trying this with MaltParser version 1.4 and Java version 6 on a Windows machine. I think the model was tra...