nlp

Optimizing a Recursive Method In PHP

I'm writing a text tag parser and I'm currently using this recursive method to create tags of n words. Is there a way that it can be done non-recursively or at least be optimized? Assume that $this->dataArray could be a very large array. /** * A recursive function to add phrases to the tagTracker array * @param string $data * @param ...
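A minimal non-recursive sketch of the idea, written in Python rather than PHP. The original method is truncated above, so this is not the asker's code: the function name `count_phrases` and its parameters are hypothetical, and it assumes a "tag of n words" means n consecutive words from an already tokenized list.

```python
from collections import Counter


def count_phrases(words, max_n):
    """Count every phrase of 1..max_n consecutive words, using loops only."""
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n words across the list.
        for start in range(len(words) - n + 1):
            phrase = " ".join(words[start:start + n])
            counts[phrase] += 1
    return counts
```

Two nested loops replace the recursion, so the call depth stays constant however large the input array is. The work still grows with the number of words times `max_n`.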

Is there a better tool than opencalais?

OpenCalais lets you submit a string (via a REST API), and it will analyze that string and break it down into named entities, relationships, keywords, etc. Are there better tools than OpenCalais (both free and commercial)? ...

English Lexicon for Search Query Correction

I'm building a spelling corrector for search engine queries by implementing the method described in "Spelling correction as an iterative process that exploits the collective knowledge of web users". The high-level approach is as follows: for a given query, come up with possible correction candidates (words in the query log within a c...

Python NLTK code snippet to train a classifier (naive bayes) using feature frequency

Hello, I was wondering if anyone could help me with a code snippet that demonstrates how to train a Naive Bayes classifier using feature frequency as opposed to feature presence. I presume the code below, as shown in Chapter 6, creates a featureset using Feature Presence (FP): def document_features(document): ...
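To illustrate the distinction the question draws, here is a small plain-Python sketch with no NLTK dependency. The function names, the `vocabulary` parameter, and the feature-name formats are assumptions for illustration and are not taken from the truncated snippet above.

```python
from collections import Counter


def presence_features(tokens, vocabulary):
    """Feature presence: True/False for whether each word occurs at all."""
    seen = set(tokens)
    return {f"contains({word})": (word in seen) for word in vocabulary}


def frequency_features(tokens, vocabulary):
    """Feature frequency: how many times each word occurs."""
    counts = Counter(tokens)
    return {f"count({word})": counts[word] for word in vocabulary}
```

Both functions return a dictionary per document, which is the shape NLTK featuresets use. As far as I know, NLTK's Naive Bayes classifier treats each feature value as a discrete label, so raw counts may need to be bucketed (for example 0, 1, 2+) before training.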

Difference between feature selection, feature extraction, feature weights ...

Hello, I am slightly confused as to what "feature selection / extractor / weights" mean and the difference between them. As I read the literature I sometimes feel lost, because the terms are used quite loosely. My primary concerns are: When people talk of Feature Frequency and Feature Presence, is it feature selection? When people talk ...

How can I create my own corpus in the Python Natural Language Toolkit?

I have recently expanded the names corpus in nltk and would like to know how I can turn the two files I have (male.txt, female.txt) into a corpus so I can access them using the existing nltk.corpus methods. Does anyone have any suggestions? Many thanks, James. ...

Run a GATE pipeline from inside a Java program without the GUI. Build a Tomcat app with GATE

Hi, I have built some plugin components for GATE and, in combination with the ANNIE tools, I'm running a pipeline in the GATE platform. Does anyone know how I can run a pipeline from the console? I want to build a web application in Tomcat that will take plain text from the web page, pass it to the GATE pipeline I have built and do so...

English verb inflector

Does anybody know of an English verb inflector that I can use on a lexicon of verbs (in present-participle) that can give me other inflected forms of the verbs? For example:
I give it    I get
=========    ======================================
run          ran, running, runs
sing         sang, singing, sings
play         played, ...

Part of Speech Tagging - where to start?

Hello, I would like to know how to implement a solution to the following task: there is a 500 MB file of plain English text. I'd like to collect statistics about word frequencies, while also making sure that each word (or at least the majority of words) is recognized correctly. For instance, 'cry' in the sentence "she gave a loud CRY" ...

Python - letter frequency count and translation.

Hi, I am using Python 3.1, but I can downgrade if needed. I have an ASCII file containing a short story written in a language whose alphabet can be represented with upper and/or lower ASCII. I wish to: 1) Detect an encoding to the best of my abilities, get some sort of confidence metric (would vary depending on the len...

Machine Learning and Natural Language Processing

Assume you know a student who wants to study Machine Learning and Natural Language Processing. What introductory subjects would you recommend? Example: I'm guessing that knowing Prolog and Matlab might help him. He also might want to study Discrete Structures*, Calculus, and Statistics. *Graphs and trees. Functions: properties, recur...

Build a natural language model that fixes misspellings.

What are some books about how to build a natural language parsing program like this: input: "I got to TALL you", output: "I got to TELL you"; input: "Big RAT box", output: "Big RED box"; input: "hoo un thum zend three", output: "one thousand three". It must have a language model that can predict which words are misspelled! What are the best books on ...

How to automatically determine text quality?

A lot of Natural Language Processing (NLP) algorithms and libraries have a hard time working with random texts from the web, usually because they are presupposing clean, articulate writing. I can understand why that would be easier than parsing YouTube comments. My question is: given a random piece of text, is there a process to determi...

C++ - How to read Unicode characters (e.g. Hindi script) using C++, or is there a better way through some other programming language?

Hi :) I have a Hindi script file like this: 3. भारत का इतिहास काफी समृद्ध एवं विस्तृत है। I have to write a program which adds a position to each and every word in each sentence. Thus the numbering for every line for a particular word position should start off with 1 in parentheses. The output should be something like this. 3. भारत...

generate a list of english words containing consecutive consonant sounds

Start with this: [G|C] * [T] * Write a program that generates this: Cat Cut Cute City <-- NOTE: this one is wrong, because City has an "ESS" sound at the start. Caught ... Gate Gotti Gut ... Kit Kite Kate Kata Katie Another Example, This: [C] * [T] * [N] Should produce this: Cotton Kitten Where should I start my research as...

What is a good Java library for Parts-Of-Speech tagging?

I'm looking for a good open source POS Tagger in Java. Here's what I have come up with so far: LingPipe, Stanford, LBJ, FastTag. Anybody got any recommendations? ...

python - syntax error

Hi :) I am not able to figure out what the error in the program is. Could you please help me out with it? Thank you :) The input file contains the following: 3. भारत का इतिहास काफी समृद्ध एवं विस्तृत है। 57. जैसे आज के झारखंड प्रदेश से, उन दिनों, बहुत से लोग चाय बागानों में मजदूरी करने के उद्देश्य से असम आए। (it's basically sample se...

Python - appending word position nos. to Unicode text

Hi :) I have code that appends word positions to the words from the source file, but the output is not coming out as desired. The input file contains the following: 3. भारत का इतिहास काफी समृद्ध एवं विस्तृत है। 57. जैसे आज के झारखंड प्रदेश से, उन दिनों, बहुत से लोग चाय बागानों में मजदूरी करने के उद्देश्य से असम आए। The original sourc...

Text similarity algorithm

I have two subtitle files. I need a function that tells whether they represent the same or similar text. Sometimes there are comments like "The wind is blowing... the music is playing" in one file only, but 80% of the contents will be the same. In that case the function must return TRUE (the files represent the same text). And sometime...
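One way to sketch such a function in Python uses the standard library's `difflib.SequenceMatcher` on word lists. The function name `same_text` and the default `threshold` of 0.8 are assumptions. The threshold only loosely echoes the 80% figure in the question, since the matcher's ratio is a similarity score and not literally the share of common content.

```python
import difflib
import re


def same_text(text_a, text_b, threshold=0.8):
    """Return True when the two texts are similar enough, word by word."""
    # Lowercase and keep only word tokens so punctuation and case are ignored.
    words_a = re.findall(r"\w+", text_a.lower())
    words_b = re.findall(r"\w+", text_b.lower())
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    # ratio() is 2 * matches / total words, a value between 0.0 and 1.0.
    return matcher.ratio() >= threshold
```

Subtitle timestamps and cue numbers would need to be stripped before calling this, and `SequenceMatcher` can be slow on very long inputs, so this is a starting point, not a tuned solution.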

How do I tell what language a plain-text file is written in?

Suppose we have a text file with the content: "Je suis un beau homme ..." another with: "I am a brave man" and a third with a text in German: "Guten morgen. Wie geht's ?" How do we write a function that would tell us, with some probability, that the text in the first file is in French, the second in English, etc.? Links to books / ou...
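A very rough sketch of one common approach, counting hits against per-language function-word lists, in Python. The `STOPWORDS` lists here are tiny and purely illustrative, the name `guess_language` is hypothetical, and the returned numbers are relative scores, not calibrated probabilities.

```python
import re

# Tiny illustrative word lists; a real detector needs far larger ones.
STOPWORDS = {
    "english": {"i", "am", "a", "the", "is", "and", "of", "to"},
    "french": {"je", "suis", "un", "le", "la", "et", "est", "de"},
    "german": {"wie", "und", "der", "die", "das", "ist", "ich", "guten"},
}


def guess_language(text):
    """Score each language by its share of the stopword hits in the text."""
    # Keep runs of letters only (no digits, underscores or punctuation).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    hits = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    total = sum(hits.values())
    if total == 0:
        return {lang: 0.0 for lang in hits}
    return {lang: count / total for lang, count in hits.items()}
```

Calling `max(scores, key=scores.get)` on the result picks the best guess. Short texts give unreliable scores, which is why character n-gram models are often preferred in practice.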