nlp

Algorithms to detect phrases and keywords from text

I have around 100 megabytes of text, without any markup, divided into approximately 10,000 entries. I would like to automatically generate a 'tag' list. The problem is that there are word groups (i.e. phrases) that only make sense when they are grouped together. If I just count the words, I get a large number of really common words (is, t...
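One simple way to surface multi-word candidates for a problem like this is to count adjacent word pairs and skip any pair containing a common word. The sketch below is only an illustration under stated assumptions: the stopword list, the `candidate_phrases` helper and the sample entries are all hypothetical, and a real tag generator would likely need a larger stopword list and a statistical association measure.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real list would be much larger).
STOPWORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "to"}

def candidate_phrases(entries, min_count=2):
    """Count adjacent word pairs in which neither word is a stopword."""
    counts = Counter()
    for text in entries:
        words = re.findall(r"[a-z']+", text.lower())
        for first, second in zip(words, words[1:]):
            if first not in STOPWORDS and second not in STOPWORDS:
                counts[(first, second)] += 1
    # Keep only pairs seen at least min_count times, most frequent first.
    return [(" ".join(pair), n) for pair, n in counts.most_common() if n >= min_count]

# Hypothetical sample entries standing in for the real corpus.
entries = [
    "The neural network is trained on the data.",
    "A neural network learns a representation of the data.",
    "Training the neural network takes time.",
]
phrases = candidate_phrases(entries)
print(phrases)
```

Raising `min_count` trades recall for precision: rare pairs are dropped, which removes noise but can also hide genuine phrases that appear in only a few entries.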

Should I use LingPipe or NLTK for extracting names and places?

I'm looking to extract names and places from very short bursts of text, for example: "cardinals vs jays in toronto", "Daniel Nestor and Nenad Zimonjic play Jonas Bjorkman w/ Kevin Ullyett, paris time to be announced", "jenson button - pole position, brawn-mercedes - monaco". This data is currently in a MySQL database, and I (pretty much) ...

Interesting linguistics/nlp problems/projects

As far as I know, looking for a problem to solve (debugging, thinking up a theme for an article, whatever) is the most creative, interesting and difficult part of any problem-solving work. Or just the most difficult. But I have no idea what's going on in programming-related linguistics. I love languages and simple-for-babies-but-neither-unders...

What is the default chunker for the NLTK toolkit in Python?

I am using their default POS tagging and default tokenization, and they seem sufficient. I'd like their default chunker too. I am reading the NLTK book, but NLTK does not seem to have a default chunker? ...

chunking/text parsing using NLTK

I am trying to parse some text and diagram it, like you would a sentence. I am new to NLTK and am trying to find something in NLTK that will help me accomplish this. So far, I have seen nltk.ne_chunk and nltk.pos_tag. I find them to be not very helpful and I am not able to find any good online documentation. I have also tried to use the...

How to make words into a category. (NLP)

"I love to eat chicken." "Today I went running, swimming and played basketball." My objective is to return FOOD and SPORTS just by analyzing these two sentences. How can you do that? I am familiar with NLP and WordNet. But is there a more high-level, practical, or modern technology? Is there anything that automatically categorizes w...

Does WordNet have "levels"? (NLP)

For example: chicken is an animal. Burrito is a food. WordNet allows you to do "is-a" lookups (the hierarchy feature). However, how do I know when to stop travelling up the tree? I want a LEVEL that is consistent. For example, if presented with a bunch of words, I want WordNet to categorize all of them, but at a certain level, so it doesn'...

Dealing with integer-valued features for CRF in mallet

Hi, I am just starting to use the SimpleTagger class in mallet. My impression is that it expects binary features. The model that I want to implement has positive integer-valued features and I wonder how to implement this in mallet. Also, I heard that non-binary features need to be normalized if the model is to make sense. I would apprec...

Extracting pure content / text from HTML Pages by excluding navigation and chrome content

Hi, I am crawling news websites and want to extract the news title, news abstract (first paragraph), etc. I plugged into the WebKit parser code to easily navigate the webpage as a tree. To eliminate navigation and other non-news content, I take the text version of the article (minus the HTML tags; WebKit provides an API for this). Then I run th...

find some sentences

Hi, I'd like to find a good way to find some (say, two) sentences in some text. Which is better: a regexp or a split method? Your ideas? As requested by Jeremy Stein, here are some examples. Input: The first thing to do is to create the Comment model. We’ll create this in the normal way, but with one small diffe...
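The two options in the question can be combined: a regular expression passed to a split call. The sketch below is a minimal illustration, assuming sentences end in `.`, `!` or `?` followed by whitespace. The `first_sentences` helper and the sample text are hypothetical, and abbreviations such as "Dr." would be split incorrectly.

```python
import re

def first_sentences(text, count=2):
    """Return the first `count` sentences of `text`.

    Splits on whitespace that follows '.', '!' or '?'; the lookbehind
    keeps the punctuation attached to its sentence.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return sentences[:count]

# Hypothetical sample text.
sample = "This is one. Is this two? This is three!"
result = first_sentences(sample)
print(result)
```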

Natural language rendering

Do you know any frameworks that implement the natural language rendering concept? I've found several NLP-oriented frameworks like Antelope or OpenNLP, but they have only parsers, not renderers or builders. For example, I want to render a question about something. I'm constructing a sentence object, setting its properties, specifying its language...

Computing precision and recall in Named Entity Recognition

Hi, I am now about to report the results from Named Entity Recognition. One thing that I find a bit confusing is that my understanding of precision and recall was that one simply sums up true positives, true negatives, false positives and false negatives over all classes. But this seems implausible now that I think of it as each miscla...
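One common alternative to pooling counts over all classes is to score each entity type separately on exact span matches. The sketch below assumes entities are represented as `(start, end, entity_type)` tuples; that representation, the `per_class_scores` helper and the sample data are illustrative assumptions, not any particular toolkit's evaluation scheme.

```python
from collections import Counter

def per_class_scores(gold, predicted):
    """Exact-match (precision, recall) per entity type.

    `gold` and `predicted` are sets of (start, end, entity_type) tuples.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for span in predicted:
        if span in gold:
            tp[span[2]] += 1
        else:
            fp[span[2]] += 1  # predicted entity with no exact gold match
    for span in gold - predicted:
        fn[span[2]] += 1      # gold entity that was not predicted exactly
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        scores[label] = (
            tp[label] / p_den if p_den else 0.0,
            tp[label] / r_den if r_den else 0.0,
        )
    return scores

# Hypothetical data: the span (5, 6) is a LOC in gold but was predicted as ORG.
gold = {(0, 2, "PER"), (5, 6, "LOC"), (8, 9, "ORG")}
predicted = {(0, 2, "PER"), (5, 6, "ORG"), (8, 9, "ORG")}
scores = per_class_scores(gold, predicted)
print(scores)
```

In this scheme, a span labeled with the wrong type is counted twice: once as a false positive for the predicted type and once as a false negative for the gold type. True negatives are not needed for either measure.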

What is the true difference between lemmatization and stemming?

When do I use each? Also, is the NLTK lemmatization dependent upon parts of speech? Wouldn't it be more accurate if it were? ...

Python and .NET integration

I'm currently looking at Python because I really like the text parsing capabilities and the NLTK library, but traditionally I am a .NET/C# programmer. I don't think IronPython is an integration point for me because I am using NLTK and presumably would need a port of that library to the CLR. I've looked a little at Python for .NET and w...

Clustering text in Python

I need to cluster some text documents and have been researching various options. It looks like LingPipe can cluster plain text without prior conversion (to vector space etc), but it's the only tool I've seen that explicitly claims to work on strings. Are there any Python tools that can cluster text directly? If not, what's the best wa...

Ideas for Natural Language Processing project?

I have to do a final project for my computational linguistics class. We've been using OCaml the entire time, but I also have familiarity with Java. We've studied morphology, FSMs, collecting parse trees, CYK parsing, tries, pushdown automata, regular expressions, formal language theory, some semantics, etc. Here are some ideas I've come...

Can NLTK/pyNLTK work "per language" (i.e. non-English), and how?

How can I tell NLTK to treat the text as being in a particular language? Background: once in a while I write a specialized NLP routine to do POS tagging, tokenizing, etc. on a non-English (but still Indo-European) text domain. This question seems to address only different corpora, not the change in code / settings: http://stackoverflow.com/questions/16...

C/C++ NLP library

I am looking for an open source Natural Language Processing library for C/C++, and I am especially interested in part-of-speech tagging. ...

Java Stanford NLP: Find word frequency?

I'm using the Stanford NLP Parsing toolkit. Given a word in the lexicon, how can I find its frequency*? Or, given a frequency rank, how can I determine the corresponding word? *in the entire language, not just the text sample. This is a demo of the toolkit I'm using: class ParserDemo { public static void main(String[] args) { Le...

Java Stanford NLP: Part of Speech labels?

The Stanford NLP, demo'd here, gives an output like this: Colorless/JJ green/JJ ideas/NNS sleep/VBP furiously/RB ./. What do the Part of Speech tags mean? I am unable to find an official list. Is it Stanford's own system, or are they using universal tags? (What is JJ, for instance?) Also, when I am iterating through the sentences, lo...
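The tags in the sample output match the Penn Treebank tag set, which Stanford's English models are trained on. As a small sketch under that assumption, the snippet below splits the `word/TAG` output and looks up the handful of tags that occur in the sample. The `PENN_TAGS` table is deliberately partial and the `parse_tagged` helper is hypothetical.

```python
# Partial Penn Treebank tag table: only the tags in the sample sentence.
PENN_TAGS = {
    "JJ": "adjective",
    "NNS": "noun, plural",
    "VBP": "verb, non-3rd person singular present",
    "RB": "adverb",
    ".": "sentence-final punctuation",
}

def parse_tagged(line):
    """Split 'word/TAG' tokens on the last slash, so words containing '/' survive."""
    return [tuple(token.rsplit("/", 1)) for token in line.split()]

tagged = parse_tagged("Colorless/JJ green/JJ ideas/NNS sleep/VBP furiously/RB ./.")
for word, tag in tagged:
    print(f"{word:10} {tag:4} {PENN_TAGS.get(tag, 'unknown')}")
```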