I have the following string (japanese) " ユーザー名" , the first character is "like" whitespace but its number in unicode is 12288, so if I do " ユーザー名".trim() I get the same string (trim doesn't work).
If i do trim in c++ it works ok.
Does anyone know how to solve this issue in java?
Is there a special trim method for unicode?
...
Does anyone know of a Java library that handles finding sentence boundaries? I'm thinking that it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use.
Here's my experience with BreakIterator:
Using the example here:
I have the following Japanese:
今日はパソコンを買った。高性能のマックは早...
My graduate research is in Arabic Speech Recognition. My work involves dealing with text alot for different kinds of tasks such as:
Cleaning up messy transcriptions, I work with diacritized text and it is very important that they are put in the right place. I use lots of Regular Expressions for that.
Experimenting with search algorithm...
Is there a way to classify a particular sentence/paragraph as funny. There are very few pointers as to where one should go further on this.
...
Given a few words of input, I want to have a utility that will return a diverse set of relevant terms, phrases, or concepts. A caveat is that it would need to have a large graph of terms to begin with, or else the feature would not be very useful.
For example, submitting "baseball" would return
["shortstop", "Babe Ruth", "foul ball",...
Could you recommend a training path to start and become very good in Information Extraction. I started reading about it to do one of my hobby project and soon realized that I would have to be good at math (Algebra, Stats, Prob). I have read some of the introductory books on different math topics (and its so much fun). Looking for some gu...
I'm looking for an open source implementation, preferably in python, of Textual Sentiment Analysis (http://en.wikipedia.org/wiki/Sentiment_analysis). Is anyone familiar with such open source implementation I can use?
I'm writing an application that searches twitter for some search term, say "youtube", and counts "happy" tweets vs. "sad"...
Binarization is the act of transforming colorful features of of an entity into vectors of numbers, most often binary vectors, to make good examples for classifier algorithms.
If we where to binarize the sentence "The cat ate the dog", we could start by assigning every word an ID (for example cat-1, ate-2, the-3, dog-4) and then simply r...
Stemming is something that's needed in tagging systems. I use delicious, and I don't have time to manage and prune my tags. I'm a bit more careful with my blog, but it isn't perfect. I write software for embedded systems that would be much more functional (helpful to the user) if they included stemming.
For instance:
Parse
Parser
Par...
I'm interested in learning more about Natural Language Processing (NLP) and am curious if there are currently any strategies for recognizing proper nouns in a text that aren't based on dictionary recognition? Also, could anyone explain or link to resources that explain the current dictionary-based methods? Who are the authoritative exper...
I am trying to find words (specifically physical objects) related to a single word. For example:
Tennis: tennis racket, tennis ball, tennis shoe
Snooker: snooker cue, snooker ball, chalk
Chess: chessboard, chess piece
Bookcase: book
I have tried to use WordNet, specifically the meronym semantic relationship; however, this method is...
I have a list of words and I want to filter it down so that I only have the nouns from that list of words (Using Java). To do this I am looking for an easy way to query a database of words for their type.
My question is does anybody know of a free, easy word lookup API that would enable me to find the class of a word, not necessarily i...
hi there
Is there any code available to demonstrate Natural language processing using Wordnet?
My problem statment is "Develop a Query answering system .
It takes query string as input.
Search for the revelent answer from the document which is the user is reading. Its a desktop application the document is already saved.
Desired output i...
I am looking to use a natural language parsing library for a simple chat bot. I can get the Parts of Speech tags, but I always wonder. What do you do with the POS. If I know the parts of the speech, what then?
I guess it would help with the responses. But what data structures and architecture could I use.
...
I am trying to build an NLP system for an assignment, for which I am allowed to use external libraries.
I am using parse trees to break down sentences into their constituent parts down to nouns, verbs, etc.
I am looking for a library or software that would let me identify which lexical form a word is in, and possibly translate it to some...
Hi,
I intend to use a multi layer perceptron network trained with backpropagation (one hidden layer, inputs served as 8x8 bit matrices containing the B/W pixels from the image). The following questions arise:
which type of learning should I use: batch or on-line?
how could I estimate the right number of nodes in the hidden layer? I in...
I have a list representing products which are more or less the same. For instance, in the list below, they are all Seagate hard drives.
Seagate Hard Drive 500Go
Seagate Hard Drive 120Go for laptop
Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s Hard Drive
New and shinny 500Go hard drive from Seagate
Seagate Barracuda...
Hi All,
I want to crawl for specific things. Specifically events that are taking place like concerts, movies, art gallery openings, etc, etc. Anything that one might spend time going to.
How do I implement a crawler?
I have heard of Grub (grub.org -> Wikia) and Heritix (http://crawler.archive.org/)
Are there others?
What opinions do...
Say you've got a big table that contains a varchar column.
How would you match rows that contain the word 'preferred' in the varchar col BUT the data is somewhat noisy and contains occasional spelling errors, e.g.:
['$2.10 Cumulative Convertible Preffered Stock, $25 par value',
'5.95% Preferres Stock',
'Class A Preffered',
'Series A Pe...
I want to use naive bayes to classify documents into a relatively large number of classes. I'm looking to confirm whether an mention of an entity name in an article really is that entity, on the basis of whether that article is similar to articles where that entity has been correctly verified.
Say, we find the text "General Motors" in a...