nlp

Decision Trees For Document Classification

Hi, I wanted to know whether it is possible to use decision trees for document classification and, if so, how the data should be represented. I know how to use the R package party for decision trees. ...

Cosine Similarity of Vectors of different lengths?

I'm trying to use TF-IDF to sort documents into categories. I've calculated the tf_idf for some documents, but now when I try to calculate the Cosine Similarity between two of these documents I get a traceback saying: #len(u)==201, len(v)==246 cosine_distance(u, v) ValueError: objects are not aligned #this works though: cosine_distan...
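A likely cause of this kind of error is that each vector was built over its own document's vocabulary, so the two vectors have different dimensions. A minimal sketch, assuming the TF-IDF weights are kept as `{term: weight}` dictionaries (the function name and sample data below are hypothetical, not taken from the question), compares documents over their shared terms instead of by position:

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        # An empty or all-zero document has no direction to compare.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical TF-IDF weights for two documents with different vocabularies.
doc_a = {"nlp": 0.5, "parser": 0.2}
doc_b = {"nlp": 0.5, "tagger": 0.7, "corpus": 0.1}
similarity = cosine_similarity(doc_a, doc_b)
```

Because the vectors are keyed by term rather than by index, documents with 201 and 246 distinct terms can be compared directly. With dense arrays, the equivalent fix would be to build both vectors over one shared vocabulary.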

Does NLTK have a tool for dependency parsing?

Hey all, I'm building an NLP application and have been using the Stanford Parser for most of my parsing work, but I would like to start using Python. So far, NLTK seems like the best bet, but I cannot figure out how to parse grammatical dependencies. For instance, here is an example from the Stanford Parser. I want to be able to produce this i...

Word frequency counter

Do you know of a Java class that counts the word frequencies in a text, and perhaps also returns all the blocks of text where a word occurs? ...

How do I create my own training corpus for the Stanford tagger?

Hey guys, I have to analyze informal English text with lots of shorthand and local lingo, so I was thinking of training my own model for the Stanford tagger. How do I create my own labelled corpus for the Stanford tagger to train on? What is the syntax of the corpus, and how long should my corpus be in order to achieve a desir...

Unstructured Text to Structured Data

I am looking for references (tutorials, books, academic literature) on structuring unstructured text in a manner similar to the Google Calendar Quick Add button. I understand this may come under the NLP category, but I am interested only in the process of going from something like "Levi jeans size 32 A0b293" to: Brand: Levi, Si...

GATE Named Entity Recognition with ANNIE using IKVM in .NET

Hi, I am looking for some guidance on using GATE and ANNIE in a .NET environment. Has anyone converted GATE to a .NET DLL using IKVMC, and had much success running named entity recognition in .NET/C# using the converted DLL? Thanks in advance. ...

Identifying collocation in Stanford POS Tagger?

Hi guys, is the Stanford POS tagger able to detect collocations? If so, how do I use it? If I want to provide my own training file for the Stanford POS Tagger, do I have to tag the words in the same way as the WSJ corpus? Does this mean that I have to "bracket" the words into entities and collocations? If so, how do I find collocati...

Evaluating the "Value" Attribute

I'm attempting to use the OpenAmplify API to evaluate the content of a URI. The point is to draw out the topics that are truly relevant to the article. Unfortunately, the topical analysis I'm getting back is huge and varied. Neither quality is terribly useful for what I'm trying to do because the signal-to-noise ratio is being heavi...

Probabilistic Generation of Semantic Networks

I've studied some simple semantic network implementations and basic techniques for parsing natural language. However, I haven't seen many projects that try and bridge the gap between the two. For example, consider the dialog: "the man has a hat" "he has a coat" "what does he have?" => "a hat and coat" A simple semantic network, based...

How to get synonyms ordered by their occurrence probability from WordNet

I am searching WordNet for synonyms for a big list of words. The way I have done it, when a word has more than one synonym, the results are returned in alphabetical order. What I need is to have them ordered by their probability of occurrence, so that I can take just the top synonym. I have used the Prolog WordNet database an...

How to use NLP to separate unstructured text into distinct paragraphs?

The following unstructured text has three distinct themes -- Stallone, Philadelphia and the American Revolution. But which algorithm or technique would you use to separate this content into distinct paragraphs? Classifiers won't work in this situation. I also tried to use a Jaccard similarity analyzer to find the distance between successive ...

Stanford tagger - tagging speed

Hey guys, regarding the Stanford tagger: I've provided my own labelled corpus for training the model. However, I've realised that the tagging speed of my model is much slower than that of the default wsjleft3 tagger model. What might contribute to this? And how do I improve the speed of my model? (I'v...

Perl and NLP, parse Names out of Biographies

I'm pretty new to NLP in general, but getting really good at Perl, and I was wondering what kind of powerful NLP modules are out there. Basically, I have a file with a bunch of paragraphs, and some of them are people's biographies. So, first I need to look for a person's name, and that helps with the rest of the process later. So I was ...

Arch options in the Stanford tagger?

Hey guys, other than the standard arch options like left3words, left5words, bidirectional, and bi5words, what do the rest of the options mean? And what arguments do they need? I can't seem to find the documentation anywhere! ...

Adding documents to a scored TF-IDF collection?

I have a large collection of documents that already have their TF-IDF computed. I'm getting ready to add some more documents to the collection, and I am wondering if there is a way to add TF-IDF scores to the new documents without re-processing the entire database? ...
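One way to approach this, sketched here under the assumption that raw term counts and document frequencies can be stored instead of final scores, is to compute TF-IDF lazily at lookup time. IDF depends on the total number of documents, so every stored score would otherwise go stale whenever the collection grows. The class and method names below are hypothetical and are not part of any particular library:

```python
import math
from collections import Counter


class TfIdfIndex:
    """Keeps raw counts so new documents can be added without a full rebuild."""

    def __init__(self):
        self.doc_term_counts = {}  # doc_id -> Counter of raw term counts
        self.doc_freq = Counter()  # term -> number of documents containing it

    def add_document(self, doc_id, tokens):
        if doc_id in self.doc_term_counts:
            raise ValueError(f"document {doc_id!r} is already indexed")
        counts = Counter(tokens)
        self.doc_term_counts[doc_id] = counts
        # Each distinct term in the new document raises its document frequency by one.
        self.doc_freq.update(counts.keys())

    def tf_idf(self, doc_id, term):
        counts = self.doc_term_counts[doc_id]
        total = sum(counts.values())
        if total == 0 or term not in counts:
            return 0.0
        tf = counts[term] / total
        # IDF is computed from the current collection size at lookup time.
        idf = math.log(len(self.doc_term_counts) / self.doc_freq[term])
        return tf * idf


index = TfIdfIndex()
index.add_document("d1", ["tagger", "corpus"])
index.add_document("d2", ["tagger", "parser"])
index.add_document("d3", ["parser"])  # added later; no other document is re-processed
```

Adding a document costs time proportional to its own length, and the scores of older documents reflect the new collection size the next time they are looked up. The trade-off is that scores are recomputed on demand instead of being read from storage.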

Generating RDF From Natural Language

Are there any tools available for generating RDF from natural language? A list of RDFizers compiled by the SIMILE project only mentions one, the Monrai Cypher. Unfortunately, it seems to have been a proprietary tool developed by Monrai Technologies, which has since disappeared, and I can't find any download links. Has anyone seen anythin...

Does anyone know how many sentences there are in the original Penn Treebank?

I can't seem to find that in the documentation anywhere ...

Parsing bulk text with Hadoop: best practices for generating keys.

Hello, I have a 'large' set of line-delimited full sentences that I'm processing with Hadoop. I've developed a mapper that applies some of my favorite NLP techniques to it. There are several different techniques that I'm mapping over the original set of sentences, and my goal during the reducing phase is to collect these results into ...

Splitting a Domain name into constituent words (if possible)?

I want to break a domain name into its constituent words and numbers, e.g. iamadomain11.com = ['i', 'am', 'a', 'domain', '11']. How do I do this? I am aware that multiple segmentations may be possible; for now, I am fine with getting just one of them. ...
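A minimal sketch of one possible approach, assuming a word list is available: separate digit runs from letter runs with a regular expression, then use dynamic programming to segment each letter run into the fewest dictionary words. The tiny `WORDS` set and the function names are hypothetical; a real vocabulary, ideally with word frequencies, would be needed in practice.

```python
import re

# Hypothetical tiny vocabulary, for illustration only.
WORDS = {"i", "am", "a", "domain", "do", "main"}


def segment(text, words):
    """Return one split of text into the fewest dictionary words, or None."""
    # best[k] holds the best segmentation found for the prefix text[:k].
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in words:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]


def split_domain(domain, words=WORDS):
    """Split the first label of a domain name into words and numbers."""
    label = domain.lower().split(".")[0]
    parts = []
    for chunk in re.findall(r"[a-z]+|[0-9]+", label):
        if chunk.isdigit():
            parts.append(chunk)
        else:
            # Keep the chunk whole when no dictionary segmentation exists.
            parts.extend(segment(chunk, words) or [chunk])
    return parts


print(split_domain("iamadomain11.com"))  # ['i', 'am', 'a', 'domain', '11']
```

Preferring the fewest pieces is only one tie-breaking rule. Scoring candidates by word frequency instead would usually pick more natural splits when several segmentations are possible.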