nlp

Problem trimming Japanese string in Java.

I have the following string (Japanese) " ユーザー名", the first character looks like whitespace but its number in Unicode is 12288, so if I do " ユーザー名".trim() I get the same string (trim doesn't work). If I do trim in C++ it works ok. Does anyone know how to solve this issue in Java? Is there a special trim method for Unicode? ...
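
Code point 12288 is U+3000 (IDEOGRAPHIC SPACE), and Java's String.trim() only removes characters up to U+0020, which is why the string comes back unchanged. The question is about Java, but the underlying idea can be sketched in Python: test each end of the string against the Unicode whitespace property instead of an ASCII range. The function name `unicode_trim` is hypothetical and only for illustration.

```python
import unicodedata


def unicode_trim(text: str) -> str:
    """Strip leading and trailing characters that Unicode treats as whitespace."""
    start, end = 0, len(text)
    # Advance past whitespace at the front (str.isspace is Unicode-aware).
    while start < end and text[start].isspace():
        start += 1
    # Back up past whitespace at the end.
    while end > start and text[end - 1].isspace():
        end -= 1
    return text[start:end]


# "\u3000" is the ideographic space, code point 12288, category "Zs".
sample = "\u3000ユーザー名"
trimmed = unicode_trim(sample)
space_category = unicodedata.category(sample[0])
```

The same test-by-property approach should carry over to any language that exposes a Unicode whitespace check.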

Java library that finds sentence boundaries

Does anyone know of a Java library that handles finding sentence boundaries? I'm thinking that it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use. Here's my experience with BreakIterator: Using the example here: I have the following Japanese: 今日はパソコンを買った。高性能のマックは早...
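
As a rough illustration of the "smart tokenizer that knows the terminators" idea, here is a minimal Python sketch that splits on a small, assumed set of sentence terminators. The terminator set and the name `split_sentences` are hypothetical; this is not how BreakIterator works, and it will split wrongly on abbreviations and decimal numbers.

```python
import re

# Assumed terminator set for illustration only; real text needs a fuller list.
TERMINATORS = "。．.!?！？"
_CLASS = re.escape(TERMINATORS)
# One sentence = a run of non-terminators followed by any trailing terminators.
_SENTENCE = re.compile(rf"[^{_CLASS}]+[{_CLASS}]*")


def split_sentences(text):
    """Return the sentences found in text, with surrounding spaces removed."""
    pieces = (match.group().strip() for match in _SENTENCE.finditer(text))
    return [piece for piece in pieces if piece]


sentences = split_sentences("今日はパソコンを買った。Hello world. Is it ok?")
```

A production splitter would need language-specific rules, which is the gap the question is asking a library to fill.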

What's the best Scripting Language for Natural Language Processing?

My graduate research is in Arabic Speech Recognition. My work involves dealing with text a lot for different kinds of tasks such as: Cleaning up messy transcriptions, I work with diacritized text and it is very important that they are put in the right place. I use lots of Regular Expressions for that. Experimenting with search algorithm...

NLP classify sentences/paragraph as funny

Is there a way to classify a particular sentence/paragraph as funny? There are very few pointers as to where one should go further on this. ...

Building or Finding a "relevant terms" suggestion feature.

Given a few words of input, I want to have a utility that will return a diverse set of relevant terms, phrases, or concepts. A caveat is that it would need to have a large graph of terms to begin with, or else the feature would not be very useful. For example, submitting "baseball" would return ["shortstop", "Babe Ruth", "foul ball",...

How to get started on Information Extraction?

Could you recommend a training path to start and become very good in Information Extraction? I started reading about it to do one of my hobby projects and soon realized that I would have to be good at math (Algebra, Stats, Prob). I have read some of the introductory books on different math topics (and it's so much fun). Looking for some gu...

Sentiment analysis for Twitter in Python

I'm looking for an open source implementation, preferably in Python, of Textual Sentiment Analysis (http://en.wikipedia.org/wiki/Sentiment_analysis). Is anyone familiar with such an open source implementation I can use? I'm writing an application that searches Twitter for some search term, say "youtube", and counts "happy" tweets vs. "sad"...
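
The counting step described above can be sketched with a tiny word-list scorer. This is a toy baseline, not an existing open source library: the word lists, the sample tweets, and the names `label_tweet` and `tally` are all hypothetical, and fetching tweets is left out entirely.

```python
import re
from collections import Counter

# Hypothetical mini-lexicons; a real system would use a much larger resource.
POSITIVE = {"love", "great", "happy", "awesome"}
NEGATIVE = {"hate", "sad", "awful", "broken"}


def label_tweet(tweet):
    """Label a tweet by comparing counts of positive and negative words."""
    words = re.findall(r"[a-z']+", tweet.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "happy"
    if score < 0:
        return "sad"
    return "neutral"


def tally(tweets):
    """Count how many tweets fall under each label."""
    return Counter(label_tweet(t) for t in tweets)


counts = tally(["I love youtube", "youtube is broken again", "watching youtube"])
```

Word counting ignores negation and sarcasm, so it is only a starting point for comparing against a trained classifier.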

Binarization in Natural Language Processing

Binarization is the act of transforming colorful features of an entity into vectors of numbers, most often binary vectors, to make good examples for classifier algorithms. If we were to binarize the sentence "The cat ate the dog", we could start by assigning every word an ID (for example cat-1, ate-2, the-3, dog-4) and then simply r...
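
A minimal sketch of one common way to finish this step, using the IDs from the example (cat-1, ate-2, the-3, dog-4): map words to IDs, then mark which IDs occur in a fixed-length binary vector. The excerpt is cut off, so this may not be the exact scheme the question goes on to describe; the function names are hypothetical.

```python
# Word IDs taken from the example in the question.
VOCAB = {"cat": 1, "ate": 2, "the": 3, "dog": 4}


def to_ids(sentence):
    """Replace each known word with its ID, keeping word order."""
    return [VOCAB[w] for w in sentence.lower().split() if w in VOCAB]


def to_binary_vector(sentence):
    """Return a 0/1 vector whose position i-1 says whether ID i occurs."""
    present = set(to_ids(sentence))
    return [1 if word_id in present else 0 for word_id in range(1, len(VOCAB) + 1)]


ids = to_ids("The cat ate the dog")          # [3, 1, 2, 3, 4]
full = to_binary_vector("The cat ate the dog")  # [1, 1, 1, 1]
partial = to_binary_vector("the dog ate")       # [0, 1, 1, 1]
```

The binary vector discards word order and repetition, which is the usual trade-off of this representation.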

Stemming - code examples or open source projects?

Stemming is something that's needed in tagging systems. I use delicious, and I don't have time to manage and prune my tags. I'm a bit more careful with my blog, but it isn't perfect. I write software for embedded systems that would be much more functional (helpful to the user) if they included stemming. For instance: Parse Parser Par...
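
To show what stemming does to forms such as "Parse" and "Parser", here is a deliberately naive suffix-stripping sketch. The suffix list, the minimum stem length, and the name `naive_stem` are assumptions made for illustration; established algorithms such as the Porter stemmer apply far more careful rules.

```python
# Hypothetical suffix list, checked in order; longer suffixes come first.
SUFFIXES = ("ing", "ers", "er", "es", "ed", "s", "e")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem:
            return lowered[: -len(suffix)]
    return lowered


stems = {w: naive_stem(w) for w in ["Parse", "Parser", "Parsing"]}
```

All three sample words reduce to "pars", so tags built from them would collapse into one bucket.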

Strategies for recognizing proper nouns in NLP

I'm interested in learning more about Natural Language Processing (NLP) and am curious if there are currently any strategies for recognizing proper nouns in a text that aren't based on dictionary recognition? Also, could anyone explain or link to resources that explain the current dictionary-based methods? Who are the authoritative exper...

Finding related words (specifically physical objects) to a specific word

I am trying to find words (specifically physical objects) related to a single word. For example: Tennis: tennis racket, tennis ball, tennis shoe; Snooker: snooker cue, snooker ball, chalk; Chess: chessboard, chess piece; Bookcase: book. I have tried to use WordNet, specifically the meronym semantic relationship; however, this method is...

Online (preferably) lookup API of a word's class.

I have a list of words and I want to filter it down so that I only have the nouns from that list of words (using Java). To do this I am looking for an easy way to query a database of words for their type. My question is: does anybody know of a free, easy word lookup API that would enable me to find the class of a word, not necessarily i...

WordNet code for NLP

Hi there. Is there any code available to demonstrate natural language processing using WordNet? My problem statement is "Develop a query answering system. It takes a query string as input. Search for the relevant answer in the document which the user is reading. It's a desktop application; the document is already saved. Desired output i...

Natural language parsing, practical example

I am looking to use a natural language parsing library for a simple chat bot. I can get the part-of-speech (POS) tags, but I always wonder: what do you do with the POS tags? If I know the parts of speech, what then? I guess it would help with the responses. But what data structures and architecture could I use? ...

NLP: Morphological manipulations

I am trying to build an NLP system for an assignment, for which I am allowed to use external libraries. I am using parse trees to break down sentences into their constituent parts down to nouns, verbs, etc. I am looking for a library or software that would let me identify which lexical form a word is in, and possibly translate it to some...

Multi layer perceptron for OCR

Hi, I intend to use a multi layer perceptron network trained with backpropagation (one hidden layer, inputs served as 8x8 bit matrices containing the B/W pixels from the image). The following questions arise: Which type of learning should I use: batch or on-line? How could I estimate the right number of nodes in the hidden layer? I in...

Algorithm to classify a list of products?

I have a list representing products which are more or less the same. For instance, in the list below, they are all Seagate hard drives: Seagate Hard Drive 500Go; Seagate Hard Drive 120Go for laptop; Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s Hard Drive; New and shinny 500Go hard drive from Seagate; Seagate Barracuda...

Crawling The Internet

Hi All, I want to crawl for specific things, specifically events that are taking place like concerts, movies, art gallery openings, etc. Anything that one might spend time going to. How do I implement a crawler? I have heard of Grub (grub.org -> Wikia) and Heritrix (http://crawler.archive.org/). Are there others? What opinions do...

Match rows containing a word with permutations

Say you've got a big table that contains a varchar column. How would you match rows that contain the word 'preferred' in the varchar col BUT the data is somewhat noisy and contains occasional spelling errors, e.g.: ['$2.10 Cumulative Convertible Preffered Stock, $25 par value', '5.95% Preferres Stock', 'Class A Preffered', 'Series A Pe...
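
One way to tolerate the misspellings is to compare each word in a row against 'preferred' with a similarity ratio. The sketch below does this in Python after the rows have been fetched, rather than inside the database; the 0.8 threshold, the extra 'Common Stock' row, and the name `mentions_word` are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def mentions_word(row, target="preferred", threshold=0.8):
    """Return True if any word in row is close enough to target."""
    for token in re.findall(r"[a-z]+", row.lower()):
        # ratio() is 1.0 for identical strings and drops as they diverge.
        if SequenceMatcher(None, token, target).ratio() >= threshold:
            return True
    return False


rows = [
    "$2.10 Cumulative Convertible Preffered Stock, $25 par value",
    "5.95% Preferres Stock",
    "Class A Preffered",
    "Common Stock",  # hypothetical non-matching row added for contrast
]
matches = [row for row in rows if mentions_word(row)]
```

The threshold trades missed misspellings against false positives, so it would need tuning on the real column.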

Naive Bayes calculation in SQL

I want to use naive Bayes to classify documents into a relatively large number of classes. I'm looking to confirm whether a mention of an entity name in an article really is that entity, on the basis of whether that article is similar to articles where that entity has been correctly verified. Say, we find the text "General Motors" in a...
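
Before moving the arithmetic into SQL, it can help to see the calculation on its own. This is a minimal multinomial naive Bayes sketch with add-one smoothing, written in Python rather than SQL; the training sentences, the class labels "company" and "military", and the function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labeled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def classify(text, model):
    """Return the class with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing avoids log(0) for unseen words.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical verified articles, for illustration only.
model = train([
    ("company", "general motors reported quarterly car sales"),
    ("company", "general motors recalls trucks"),
    ("military", "the general ordered motors for army vehicles"),
])
prediction = classify("general motors car sales", model)
```

The per-class word counts and totals are the quantities a SQL version would precompute in tables and sum with a GROUP BY over classes.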