nlp

It's probably simpler in awk, but how can I say this in Python?

I have: 'Rutsch is for rutterman ramping his roe', which is a phrase from Finnegans Wake. The epic riddle book is full of leitmotifs like this, such as 'take off that white hat,' and 'tip,' all of which get mutated into similar-sounding words depending on where you are in the book itself. All I want is a way to find obvious occurrences of t...

How can I infer the grammar behind a list of generated sentences?

I have a list of sentences generated from http://www.ywing.net/graphicspaper.php, a random computer graphics paper title generator. Some example sentences, sorted, are as follows: Abstract Ambient Occlusion using Texture Mapping Abstract Ambient Texture Mapping Abstract Anisotropic Soft Shadows Abstract Approximation Abstract Appr...

True definition of an English word?

What would be the best definition of an English word? What cases count as an English word besides plain \w+? Some definitions include \w+-\w+ or \w+'\w+; some exclude cases like \b[0-9]+\b. But I haven't seen any general consensus on those cases. Is there a formal definition? Can any of you clarify? (Edit: broaden the questi...
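The patterns mentioned in this question can be combined into one expression. The sketch below is only one possible reading, not a consensus definition: it assumes a word is a run of letters, optionally joined by internal hyphens or apostrophes, and that purely numeric tokens are excluded. The names `WORD_RE` and `find_words` are made up for illustration.

```python
import re

# Assumed definition: one or more letters, optionally followed by
# groups of (hyphen or apostrophe + letters). [^\W\d_] matches a
# "word character" that is neither a digit nor an underscore, i.e. a letter.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")

def find_words(text):
    """Return the substrings of text that match the assumed word pattern."""
    return WORD_RE.findall(text)

print(find_words("The well-known editor can't count 42 apples."))
```

Under this assumption "well-known" and "can't" are each one word and "42" is dropped. Other choices (keeping numbers, splitting clitics) are equally defensible depending on the task.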

How do I manipulate parse trees?

I've been playing around with natural language parse trees and manipulating them in various ways. I've been using Stanford's Tregex and Tsurgeon tools but the code is a mess and doesn't fit in well with my mostly Python environment (those tools are Java and aren't ideal for tweaking). I'd like to have a toolset that would allow for easy ...

Regexp for Tokenizing English Text

What would be the best regular expression for tokenizing an English text? By an English token, I mean an atom consisting of maximum number of characters that can be meaningfully used for NLP purposes. An analogy is a "token" in any programming language (e.g. in C, '{', '[', 'hello', '&', etc. can be tokens). There is one restriction: Th...

Wikipedia: pages across multiple languages

Hi, I want to use a Wikipedia dump for my project and need the following information: for a given Wikipedia entry, which other language editions contain the page? I would like downloadable data in CSV or another common format. Is there a way to get this data? Thanks, Bala ...

NLP research :: Clustering of English language words?

Hi, is there a partition of English words into high-level categories such as sports, basketball, etc.? It's required for my project. Is this data available somewhere? I am okay with words overlapping across categories. Thank you, Bala ...

Dealing with the example.cpp in CRF++ toolkit

Hi, I am just starting to learn how to use the CRF++ toolkit. I downloaded the Linux version of CRF++ 0.54. When I try to compile the example.cpp under sdk/ with the command g++ -o example example.cpp, I get the following error: hpl@hpl-desktop:~/Documents/CRF/CRF++-0.54$ g++ -o a example.cpp /tmp/ccmJQgGu.o: In function main': exampl...

Wikipedia categories

Hi, I want to get a list of all the Wikipedia categories. I can find them here: http://en.wikipedia.org/wiki/Special:Categories Is there a way to download all of them in XML/CSV format? Thank you, Bala ...

How to get POS tagging using Stanford Parser.

I'm using Stanford Parser to parse the dependency relations between pairs of words, but I also need the tags of the words. However, in ParseDemo.java, the program only outputs the tagging tree. I need each word's tag like this: My/PRP$ dog/NN also/RB likes/VBZ eating/VBG bananas/NNS ./. not like this: (ROOT (S (NP (PRP$ My...
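If only the bracketed tree text is available, the word/TAG form can be recovered from it by post-processing. The sketch below does this with a regular expression in Python; it does not use the Stanford Parser API. The tree string is written by hand for illustration (the one in the question is truncated), and `LEAF_RE` and `tree_to_tagged` are hypothetical names.

```python
import re

# A leaf looks like "(TAG word)": an opening parenthesis, a tag, whitespace,
# a word, and a closing parenthesis, with no nested parentheses inside.
LEAF_RE = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")

def tree_to_tagged(tree_text):
    """Convert a bracketed parse tree string into space-separated word/TAG tokens."""
    return " ".join(f"{word}/{tag}" for tag, word in LEAF_RE.findall(tree_text))

# Hand-written example tree (an assumption, not parser output).
tree = ("(ROOT (S (NP (PRP$ My) (NN dog)) (ADVP (RB also)) "
        "(VP (VBZ likes) (S (VP (VBG eating) (NP (NNS bananas))))) (. .)))")
print(tree_to_tagged(tree))
```

This relies on leaves being the only nodes whose content contains no parentheses, which holds for Penn Treebank style bracketing.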

MALLET tokenizer

Hi, I want to use MALLET's topic modeling, but can I provide my own tokenizer or a tokenized version of the text documents when I import the data into MALLET? I find MALLET's tokenizer inadequate for my usage... ...

Using NLTK and WordNet, how do I convert a simple tense verb into its present, past or past participle form?

Hi, using NLTK and WordNet, how do I convert a simple tense verb into its present, past or past participle form? For example, I want to write functions that would give me the verb in the expected form, as follows. v = 'go' present = present_tense(v) print present # prints "going" past = past_tense(v) print past # prints "went" Any suggestion...

Where can I learn more about the Google search "did you mean" algorithm?

Possible Duplicate: How do you implement a Did you mean? I am writing an application where I require functionality similar to Google's "did you mean?" feature used by their search engine: Is there source code available for such a thing or where can I find articles that would help me to build my own? ...
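As a starting point for the kind of feature described here, the Python standard library can rank known terms by string similarity. This is a minimal sketch and not how Google's feature works: it assumes a small in-memory vocabulary and uses `difflib.get_close_matches`. The function name `did_you_mean` and the word list are invented for illustration.

```python
import difflib

def did_you_mean(query, vocabulary, max_suggestions=3):
    """Return vocabulary entries that closely resemble the query, best match first."""
    # cutoff=0.6 is difflib's default similarity threshold (0.0 to 1.0).
    return difflib.get_close_matches(query, vocabulary, n=max_suggestions, cutoff=0.6)

# Toy vocabulary (an assumption); a real application would use its own term index.
vocabulary = ["spelling", "spewing", "selling", "language"]
print(did_you_mean("speling", vocabulary))
```

A production system would typically also weight candidates by how frequent they are in query logs or a corpus, which this sketch ignores.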

How can I make this Python2.6 function work with Unicode?

I've got this function, which I modified from material in chapter 1 of the online NLTK book. It's been very useful to me but, despite reading the chapter on Unicode, I feel just as lost as before. def openbookreturnvocab(book): fileopen = open(book) rawness = fileopen.read() tokens = nltk.wordpunct_tokenize(rawness) nltk...

Improving entity naming with custom file/code in NLTK

We've been working with the NLTK library in a recent project where we're mainly interested in the named entities part. In general we're getting good results using the NEChunkParser class. However, we're trying to find a way to provide our own terms to the parser, without success. For example, we have a test document where my name ...

Parser Generator or Library that Supports Suffix Agreement

Hi! I'm working on a syntactic parser for a language that relies heavily on suffix agreement. For example, in English a verb must agree with its pronoun, as in I, we, you-do or he, she, it, this-does, etc. In this language a verb has a different form for each pronoun. I know that in the literature this is handled by unification. But I coul...

Ruby NLP Libraries

Hi, does anyone know of any good NLP frameworks for Ruby? I am considering using the Java open-nlp library http://opennlp.sourceforge.net/ via JRuby. I am reluctant to go down the JRuby route for a few reasons, mainly because I have no Java background. Are there any Ruby frameworks or should I go down the JRuby route with open-n...

Natural Language Processing Algorithm for mood of an email

Hi, One simple question (but I haven't quite found an obvious answer in the NLP stuff I've been reading, which I'm very new to): I want to classify emails with a probability along certain dimensions of mood. Is there an NLP package out there specifically dealing with this? Is there an obvious starting point in the literature I start re...

Natural Language Processing

I have thousands of sentences in a file. I want to find only the valid/useful English words. Is it possible with Natural Language Processing? Sample sentence: ~@^.^@~ tic but sometimes world good famous tac Zorooooooooooo I just want to extract only English words like tic world good famous Any advice on how I can achieve this? Th...
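One simple baseline for this is a word-list lookup: keep only the alphabetic tokens that appear in a reference vocabulary. The sketch below assumes a tiny hand-picked vocabulary chosen to reproduce the desired output in the question; in practice it would be replaced by a real dictionary or corpus-derived word list. The name `keep_known_words` is hypothetical.

```python
def keep_known_words(sentence, vocabulary):
    """Return the whitespace-separated tokens that are alphabetic and in vocabulary."""
    return [token for token in sentence.split()
            if token.isalpha() and token.lower() in vocabulary]

# Toy vocabulary (an assumption), standing in for a real English word list.
vocabulary = {"tic", "world", "good", "famous"}
sample = "~@^.^@~ tic but sometimes world good famous tac Zorooooooooooo"
print(keep_known_words(sample, vocabulary))
```

Which tokens count as "useful" depends entirely on the word list supplied, so the choice of vocabulary is the real design decision.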

How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?

I want to split a sentence into a list of words. For English and European languages this is easy, just use split() >>> "This is a sentence.".split() ['This', 'is', 'a', 'sentence.'] But I also need to deal with sentences in languages such as Chinese that don't use whitespace as word separator. >>> u"这是一个句子".split() [u'\u8fd9\u662f\u...
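One classic dictionary-based approach to this problem is forward maximum matching: at each position, take the longest substring that appears in a lexicon, falling back to a single character. The sketch below is written for Python 3 and assumes a toy lexicon made up for this one sentence; `max_match`, `lexicon`, and `max_len` are illustrative names, and real segmenters use large dictionaries or statistical models.

```python
def max_match(text, lexicon, max_len=4):
    """Segment text greedily, preferring the longest lexicon entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Toy lexicon (an assumption) covering only the example sentence.
lexicon = {"这", "是", "一个", "句子"}
print(max_match("这是一个句子", lexicon))
```

Greedy matching can pick the wrong split when entries overlap, so the result is only as good as the lexicon.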