nltk

How to avoid computation every time a python module is reloaded

I have a Python module that uses a huge dictionary as a global variable. Currently the computation code sits in the top section of the module, so every first import or reload takes more than one minute, which is totally unacceptable. How can I save the computation result somewhere so that the next import/reload doesn't have to compu...
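One common approach to the caching problem described above is to store the computed dictionary on disk with the standard `pickle` module and load it on later imports. The following is a minimal sketch, not the asker's code: `load_or_compute`, `build_table`, and the cache file name are hypothetical names chosen for illustration.

```python
import os
import pickle
import tempfile


def load_or_compute(cache_path, compute):
    """Return the cached result if cache_path exists, else compute and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f, protocol=pickle.HIGHEST_PROTOCOL)
    return result


def build_table():
    # Hypothetical stand-in for the expensive computation.
    return {n: n * n for n in range(1000)}


# A temporary directory is used here only so the sketch runs anywhere.
cache_file = os.path.join(tempfile.mkdtemp(), "table.pickle")
table = load_or_compute(cache_file, build_table)        # computes and writes the cache
table_again = load_or_compute(cache_file, build_table)  # loads from the cache file
```

A reload still re-executes the module's top-level code, but with this pattern that code only reads the cache file instead of repeating the computation. Only unpickle files you created yourself, since loading an untrusted pickle can execute arbitrary code.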

How do I use Python libraries in C++?

I want to use the NLTK libraries in C++. Is there a glue language/mechanism I can use to do this? Reason: I haven't done any serious programming in C++ for a while and want to revise NLP concepts at the same time. Thanks ...

Practical examples of NLTK use

I'm playing about with the Natural Language Toolkit (NLTK). The documentation (Book and HOWTO) is a little heavy going. Are there any good but basic examples of the use of NLTK? I'm thinking of things like the NLTK articles on the Stream Hacker blog. ...

Python: Replace string with prefixStringSuffix keeping original case, but ignoring case when searching for match

So what I'm trying to do is replace a string "keyword" with "<b>keyword</b>" in a larger string. Example: myString = "HI there. You should higher that person for the job. Hi hi." keyword = "hi" result I would want would be: result = "<b>HI</b> there. You should higher that person for the job. <b>Hi</b> <b>hi</b>." I will not...
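The replacement described above can be sketched with the standard `re` module. This assumes whole-word matching, which is what the expected result suggests (the "hi" inside "higher" is left alone); the rest of the requirement is cut off in the excerpt. The helper name `bold_keyword` is hypothetical.

```python
import re


def bold_keyword(text, keyword):
    """Wrap each whole-word, case-insensitive match of keyword in <b> tags,
    keeping the matched text's original case."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.sub(pattern, lambda m: "<b>" + m.group(0) + "</b>", text,
                  flags=re.IGNORECASE)


my_string = "HI there. You should higher that person for the job. Hi hi."
result = bold_keyword(my_string, "hi")
```

Using a function as the replacement is what preserves the original case: the wrapper is built from the matched text itself rather than from the lowercase keyword.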

Generating random sentences from custom text in Python's NLTK?

I'm having trouble with the NLTK under Python, specifically the .generate() method. generate(self, length=100) Print random text, generated using a trigram language model. Parameters: * length (int) - The length of text to generate (default=100) Here is a simplified version of what I am attempting. import nltk word...

Which word stemmer should I use in nltk?

My goal is to analyze some corpus (Twitter for now) for emotional content. Just today I realized it would make a bit of sense to search for word stems as opposed to having an exhaustive list of emotional word stems. And so I've been exploring nltk.stem, only to realize that there are 4 different stemmers. I'd like to ask the stackover...

Using the Python NLTK (2.0b5) on the Google App Engine

I have been trying to make the NLTK (Natural Language Toolkit) work on the Google App Engine. The steps I followed are: Download the installer and run it (a .dmg file, as I am using a Mac). Copy the nltk folder out of the Python site-packages directory and place it as a sub-folder in my project folder. Create a Python module in the fo...

What is the best artificial-intelligence library for Python?

I know of NLTK. What else is there that complements this library? Or can do AI? NLTK is great because I can learn it with the book that came out with it. Is there a library for AI just like this? ...

NLTK tagging in German

I am using NLTK to extract nouns from a text-string starting with the following command: tagged_text = nltk.pos_tag(nltk.Text(nltk.word_tokenize(some_string))) It works fine in English. Is there an easy way to make it work for German as well? (I have no experience with natural language programming, but I managed to use the python nl...

tokenizer errors with nltk

I'm very new to Python, and am trying to learn in conjunction with using nltk. I've been following some examples and testing things out, but it seems I am very limited in what I can do due to errors being returned by python. I know nltk is installed and importing fine, because this code works from nltk.sem import chat80 print chat8...

What is the default chunker for NLTK toolkit in Python?

I am using their default POS tagging and default tokenization, and it seems sufficient. I'd like their default chunker too. I am reading the NLTK book, but it does not seem like they have a default chunker? ...

chunking/text parsing using NLTK

I am trying to parse some text and diagram it, like you would a sentence. I am new to NLTK and am trying to find something in NLTK that will help me accomplish this. So far, I have seen nltk.ne_chunk and nltk.pos_tag. I find them to be not very helpful and I am not able to find any good online documentation. I have also tried to use the...

How to make words into a category. (NLP)

I love to eat chicken. Today I went running, swimming and played basketball. My objective is to return FOOD and SPORTS just by analyzing these two sentences. How can you do that? I am familiar with NLP and WordNet. But is there a more high-level/practical/modern technology? Is there anything that automatically categorizes w...

What is the true difference between lemmatization and stemming?

When do I use each? Also, is the NLTK lemmatization dependent upon parts of speech? Wouldn't it be more accurate if it was? ...

Python and .NET integration

I'm currently looking at Python because I really like the text parsing capabilities and the nltk library, but traditionally I am a .NET/C# programmer. I don't think IronPython is an integration point for me because I am using NLTK and presumably would need a port of that library to the CLR. I've looked a little at Python for .NET and w...

Can NLTK/pyNLTK work "per language" (i.e. non-English), and how?

How can I tell nltk to treat the text in a particular language? Background: once in a while I write a specialized NLP routine to do POS tagging, tokenizing etc. on a non-English (but still Indo-European) text domain. This question seems to address only different corpora, not the change in code / settings: http://stackoverflow.com/questions/16...

What is "entropy and information gain"?

I am reading this book (NLTK) and it is confusing. Entropy is defined as: "Entropy is the sum of the probability of each label times the log probability of that same label." How can I apply entropy and maximum entropy in terms of text mining? Can someone give me an easy, simple example (visual)? ...
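The definition quoted above can be made concrete with a few lines of standard-library Python. Note that the usual formula carries a minus sign, H = -sum(p * log2(p)), which is the same as sum(p * log2(1/p)); the sketch below uses the second form. The function name `label_entropy` and the example labels are made up for illustration.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of a list of labels.

    Equivalent to -sum(p * log2(p)) over the distinct labels.
    """
    counts = Counter(labels)
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


pure = label_entropy(["pos", "pos", "pos", "pos"])   # one label only
mixed = label_entropy(["pos", "neg", "pos", "neg"])  # evenly split labels
```

A set with a single label has zero entropy (nothing is uncertain), while an even two-way split has the maximum for two labels, one bit. Information gain for a feature is then the entropy of the whole set minus the weighted average entropy of the subsets that the feature splits it into.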

NLTK - how to find out what corpora are installed from within python?

I'm trying to load some corpora I installed with the NLTK installer but I got a: >>> from nltk.corpus import machado Traceback (most recent call last): File "<stdin>", line 1, in <module> ImportError: cannot import name machado But in the download manager (nltk.download()) the package machado is marked as installed a...

Training set - proportion of pos / neg / neutral sentences

Hello, I am hand-tagging Twitter messages as Positive, Negative, Neutral. I am trying to understand whether there is some logic one can use to decide what proportion of the messages in the training set should be positive, negative and neutral. For example, if I am training a Naive Bayes classifier with 1000 Twitter messages, should the proportion o...

How to choose a Feature Selection Algorithm? - advice

Is there a research paper/book that I can read which can tell me, for the problem at hand, what sort of feature selection algorithm would work best? I am trying to simply identify Twitter messages as pos/neg (to begin with). I started out with frequency-based feature selection (having started with the NLTK book) but soon realised that for a ...