nltk

Python NLTK code snippet to train a classifier (naive bayes) using feature frequency

Hello, I was wondering if anyone could help me through a code snippet that demonstrates how to train a Naive Bayes classifier using a feature frequency method as opposed to feature presence. I presume the code below, as shown in Chap 6, refers to creating a featureset using Feature Presence (FP) - def document_features(document): ...
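A minimal sketch of the distinction the question draws, assuming a tokenized document and a fixed vocabulary. The function names, the sample vocabulary, and the `count(...)` key format are hypothetical, chosen only for illustration. Feature presence records whether a word occurs; feature frequency records how often it occurs.

```python
from collections import Counter

def presence_features(words, vocabulary):
    """Feature presence: True/False for each vocabulary word."""
    word_set = set(words)
    return {"contains(%s)" % w: (w in word_set) for w in vocabulary}

def frequency_features(words, vocabulary):
    """Feature frequency: occurrence count for each vocabulary word."""
    counts = Counter(words)  # missing words count as 0
    return {"count(%s)" % w: counts[w] for w in vocabulary}

# Hypothetical toy data, for illustration only.
vocabulary = ["good", "bad", "plot"]
doc = ["good", "plot", "good", "acting"]

fp = presence_features(doc, vocabulary)
ff = frequency_features(doc, vocabulary)
```

Either dictionary can be paired with a label and passed to `nltk.NaiveBayesClassifier.train` as a list of `(featureset, label)` tuples. Note that this classifier treats each feature value as a discrete category, so a count of 2 and a count of 3 are simply different values rather than points on a scale; binning the counts may work better in practice.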

Difference between feature selection, feature extraction, feature weights ...

Hello, I am slightly confused as to what "feature selection / extractor / weights" mean and the difference between them. As I read the literature, I sometimes feel lost because the terms are used quite loosely. My primary concerns are: When people talk of Feature Frequency, Feature Presence - is it feature selection? When people talk ...

How can I create my own corpus in the Python Natural Language Toolkit?

I have recently expanded the names corpus in nltk and would like to know how I can turn the two files I have (male.txt, female.txt) into a corpus so I can access them using the existing nltk.corpus methods. Does anyone have any suggestions? Many thanks, James. ...

Installing numpy broke NLTK (OS X 10.6.2, Python 2.6)

I had a working installation of NLTK (py26-nltk) on my Mac (OS X 10.6.2). Then I installed numpy. Now when I try to import nltk, I get this: >>> import nltk Traceback (most recent call last): File "<stdin>", line 1, in <module> File "nltk/__init__.py", line 83, in <module> from collocations import * File "nltk/collocations.py"...

What is the difference between running a script from the command line and from exec() with PHP?

I'm trying to run a Python script using exec() from within PHP. My command works fine when I run it directly using a cmd window, but it produces an error when I run it from exec() in PHP. My Python script uses NLTK to find proper nouns. Example command: "C:\Python25\python.exe" "C:\wamp\projects\python\trunk\tests\find_proper_nouns.py"...

How to set pythonpath (python2.6) for tkinter on Ubuntu 9.04 (to use nltk)?

I'd like to use the nltk toolkit on my machine, which runs Ubuntu 9.04. I installed python 2.6.4 and several additional packages (numpy, scipy, matplotlib and of course nltk). I can import nltk, but calling a few methods gives various error messages, all containing "please install Tkinter library". Googling around I discovered from http://wi...

How to extract common / significant phrases from a series of text entries

I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (not the single most common phrase, and ideally, not enforcing word-for-word matching). My example is any review on Yelp.com, that shows 3 snippets from hundreds of reviews of a given restaurant, in the format: "Try ...

Python NLTK figure out tense

I have a web application that translates sentences into English; the user chooses options from drop downs that basically provide the context. Now I want to turn the word and the context into an English sentence. One case is that the user chooses 'who' and 'when', 'who' could be: I, you, you two, he, she, we, they. 'When' could be: 'did ...

Searching text for geonames

Hi, which part of the huge nltk package must I study and use if I need to mark geonames in text? ...

Natural language processing - Ideas for beginner's projects

Hi guys, I am a beginner in NLP and NLTK. I am very interested in NLP and hence joined a weekend course on AI at a local institution, which requires me to do a project for completion of the course, and I decided to do it in NLP. The problem is, the instructor is not good at all for this course (According to me she is just a charlatan)...

tag generation from a text content

Hello, I am curious whether an algorithm/method exists to generate keywords/tags from a given text, by using some weight calculations, occurrence ratio or other tools. Additionally, I would be grateful if you could point me to any Python-based solution / library for this. Thanks ...

tag generation from a small text content (such as tweets)

Hello, I have already asked a similar question earlier but I have noticed that I have a big constraint: I am working on small text sets such as user Tweets to generate tags (keywords). And it seems like the accepted suggestion (point-wise mutual information algorithm) is meant to work on bigger documents. With this constraint (working on ...

Text mining with PHP

Hi, I'm doing a project for a college class I'm taking. I'm using PHP to build a simple web app that classifies tweets as "positive" (or happy) and "negative" (or sad) based on a set of dictionaries. The algorithm I'm thinking of right now is a Naive Bayes classifier or a decision tree. However, I can't find any PHP library that helps me do...

Unable to import nltk in NetBeans

Hello all, I am trying to import NLTK in my Python code and I get this error: Traceback (most recent call last): File "/home/afs/NetBeansProjects/NER/getNE_followers.py", line 7, in <module> import nltk ImportError: No module named nltk I am using NetBeans: 6.7.1, Python 2.6 NLTK. My NLTK module is installed in /usr/local/lib/python2.6/d...

Sentiment analysis with NLTK python for sentences using sample data or webservice?

I am embarking upon an NLP project for sentiment analysis. I have successfully installed NLTK for python (seems like a great piece of software for this). However, I am having trouble understanding how it can be used to accomplish my task. Here is my task: I start with one long piece of data (let's say several hundred tweets on the subje...

Java or Python distributed compute job (on a student budget)?

I have a large dataset (c. 40G) that I want to use for some NLP (largely embarrassingly parallel) over a couple of computers in the lab, to which I do not have root access, and only 1G of user space. I experimented with Hadoop, but of course this was dead in the water: the data is stored on an external USB hard drive, and I can't load it...

How to identify ideas and concepts in a given text

I'm working on a project at the moment where it would be really useful to be able to detect when a certain topic/idea is mentioned in a body of text. For instance, if the text contained: Maybe if you tell me a little more about who Mr Jones is, that would help. It would also be useful if I could have a description of his appearance, ...

Text mining: when to use parser, tagger, NER tool?

I'm doing a project on mining blog contents and I need help deciding which tool to use. When do I use a parser, when do I use a tagger, and when do I need to use a NER tool? For instance, I want to find out the most talked about topics/subjects between several blogs; do I use a part-of-speech tagger to grab the nouns and do a...

Classifying Documents into Categories

I've got about 300k documents stored in a Postgres database that are tagged with topic categories (there are about 150 categories in total). I have another 150k documents that don't yet have categories. I'm trying to find the best way to programmatically categorize them. I've been exploring NLTK and its Naive Bayes Classifier. Seems li...

Cosine Similarity of Vectors of different lengths?

I'm trying to use TF-IDF to sort documents into categories. I've calculated the tf_idf for some documents, but now when I try to calculate the Cosine Similarity between two of these documents I get a traceback saying: #len(u)==201, len(v)==246 cosine_distance(u, v) ValueError: objects are not aligned #this works though: cosine_distan...
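Cosine similarity is only defined for two vectors over the same dimensions, so vectors of length 201 and 246 cannot be compared directly. A minimal sketch of one common workaround, assuming (this is not stated in the question) that each document's TF-IDF weights are kept as a term-to-weight dictionary: align both documents on the union of their terms and fill missing terms with 0.0. The function name and the sample weights are hypothetical, and the function returns a similarity, not a distance.

```python
import math

def cosine_similarity(weights_a, weights_b):
    """Cosine similarity of two sparse term -> weight mappings."""
    # Build a shared vocabulary so both vectors have the same length.
    terms = sorted(set(weights_a) | set(weights_b))
    u = [weights_a.get(t, 0.0) for t in terms]
    v = [weights_b.get(t, 0.0) for t in terms]

    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_u * norm_v)

# Hypothetical weights for two documents with different term sets.
doc_a = {"nltk": 0.5, "python": 0.5}
doc_b = {"python": 1.0, "corpus": 1.0}

similarity = cosine_similarity(doc_a, doc_b)
```

Building the vocabulary once over the whole collection, rather than per pair, gives every document a vector of the same length and avoids repeating the alignment step.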