natural-language

How to identify people's names in text (Java)

I have some input text which contains one or more personal names. I do not have a dictionary of these names. Which Java library can help me identify names in my input text? I looked through OpenNLP, but did not find any example, guide, or even a description of how it can be applied in my code. (I saw javadoc, but it is pr...

I have a list of country codes and a list of language codes. How do I map from country code to language code?

When the user visits the site, I can get their country code. I want to use this to set the default language (which they can later modify if necessary, just a general guess as to what language they might speak based on what country they are in). Is there a definitive mapping from country codes to language codes that exists somewhere? ...
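A minimal sketch of the lookup-with-fallback idea, assuming two-letter ISO 3166-1 country codes and ISO 639-1 language codes. The table `DEFAULT_LANGUAGE_BY_COUNTRY` and the function `guess_language` are hypothetical names, and the handful of entries is illustrative only. Countries with several official languages (Switzerland, Belgium, and so on) have no single right answer, so any such table is a guess the user should be able to override.

```python
# Hypothetical, deliberately tiny table: ISO 3166-1 alpha-2 country code
# -> ISO 639-1 language code. A real table would need many more entries.
DEFAULT_LANGUAGE_BY_COUNTRY = {
    "US": "en",
    "GB": "en",
    "FR": "fr",
    "DE": "de",
    "JP": "ja",
    "BR": "pt",
}


def guess_language(country_code, fallback="en"):
    """Return a default language guess for a country code, or the fallback."""
    if not country_code:
        return fallback
    # Normalise case and whitespace before the lookup.
    return DEFAULT_LANGUAGE_BY_COUNTRY.get(country_code.strip().upper(), fallback)
```

Unknown or missing country codes fall through to `fallback`, so the site always has some language to start from.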

Extract Nouns from Text (Java)

Does anyone know the easiest way to extract only nouns from a body of text? I've heard about the TreeTagger tool and I tried giving it a shot but couldn't get it to work for some reason. Any suggestions? Thanks Phil EDIT: import org.annolab.tt4j.*; TreeTaggerWrapper tt = new TreeTaggerWrapper(); try { tt.setModel("/Nouns/english...

Training Hidden Markov Models without Tagged Corpus Data

For a linguistics course we implemented part-of-speech (POS) tagging using a hidden Markov model, where the hidden variables were the parts of speech. We trained the system on some tagged data, and then tested it and compared our results with the gold data. Would it have been possible to train the HMM without the tagged training set? ...

What is a good tool for Natural Language Detection in Java?

I need to do natural language detection (with confidence scores), preferably in Java; I'd really rather not introduce more platforms/technologies at this stage of the project. I have previously used the Google API for this in a PoC, but I now need to scale up to very large amounts of data, so any web-based solution won't cut it (also Google ar...

Intelligent text parsing and translation

What would be an intelligent way to store text so that it can be intelligently parsed and translated later on? For example: "The employee is outstanding as he can identify his own strengths and weaknesses and is comfortable with himself." The above could be the generic text which is shown to the user prior to evaluation. If the user is ...

How to recognize words in text with non-word tokens?

I am currently parsing a bunch of mails and want to get words and other interesting tokens out of them (even with spelling errors or combinations of digits and letters, like "zebra21" or "customer242"). But how can I know that "0013lCnUieIquYjSuIA" and "anr5Brru2lLngOiEAVk1BTjN" are not words and not relevant? How to extract words a...
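One rough heuristic, sketched here as an assumption rather than an established rule: treat a token as word-like if it is a single alphabetic run in a conventional casing (lowercase, Capitalized, or ALL CAPS), optionally followed by digits. The names `WORD_LIKE`, `looks_like_word`, and `word_like_tokens` are hypothetical.

```python
import re

# Heuristic (an assumption, not a standard rule): one alphabetic run in a
# conventional casing, optionally followed by a run of digits.
WORD_LIKE = re.compile(r"(?:[A-Z]?[a-z]+|[A-Z]+)\d*")


def looks_like_word(token):
    """True if the whole token matches the word-like pattern."""
    return WORD_LIKE.fullmatch(token) is not None


def word_like_tokens(text):
    """Split on whitespace, strip surrounding punctuation, keep word-like tokens."""
    tokens = (raw.strip(".,;:!?\"'()[]<>") for raw in text.split())
    return [t for t in tokens if t and looks_like_word(t)]
```

This accepts "zebra21" and "customer242" and rejects the two random-looking identifiers from the question, but it is crude: it also rejects legitimate mixed-case names such as "iPhone" and accepts any all-lowercase gibberish. Scoring tokens with character n-gram statistics from a reference vocabulary would likely be more robust.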

How does twitter's trending topics algorithm decide which words to extract from tweets?

I saw this question, which focuses on the "Britney Spears" problem. But I have a bit of a different question. How does the algorithm determine which words or phrases need to be ranked? For instance, if I send out a tweet that says "Michael Jackson died", how does it know to pull out "Michael Jackson" but not "died"? Or suppose that ...

How does Amazon's Statistically Improbable Phrases work?

How does something like Statistically Improbable Phrases work? According to amazon: Amazon.com's Statistically Improbable Phrases, or "SIPs", are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If ...
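The quoted description is cut off, so the following is not Amazon's algorithm. It is only a generic sketch of one way to score "distinctive" phrases: compare how often a two-word phrase occurs in one book with how often it occurs across the whole collection. The function names, the use of bigrams, the add-one smoothing, and the `min_count` threshold are all assumptions of this example.

```python
from collections import Counter


def bigrams(text):
    """Lowercased, whitespace-split two-word phrases."""
    words = text.lower().split()
    return list(zip(words, words[1:]))


def improbable_phrases(book, corpus, top_n=3, min_count=2):
    """Rank a book's bigrams by how over-represented they are versus the corpus.

    `corpus` is a list of texts that is assumed to include `book` itself.
    """
    book_counts = Counter(bigrams(book))
    corpus_counts = Counter()
    for text in corpus:
        corpus_counts.update(bigrams(text))
    book_total = sum(book_counts.values())
    corpus_total = sum(corpus_counts.values())

    scores = {}
    for phrase, count in book_counts.items():
        if count < min_count:
            continue  # ignore phrases too rare in the book to be meaningful
        book_rate = count / book_total
        # Add-one smoothing so a phrase missing from the corpus cannot divide by zero.
        corpus_rate = (corpus_counts[phrase] + 1) / (corpus_total + 1)
        scores[phrase] = book_rate / corpus_rate

    # Highest ratio first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return [" ".join(p) for p in ranked[:top_n]]
```

A phrase such as "on the" that is frequent everywhere gets a low ratio, while a phrase concentrated in one book gets a high one.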

Semantic similarity between sentences

Hi, this is Salma. I am working on a project and need an open-source tool or technique for finding the semantic similarity between sentences: the input is two sentences and the output is a score (i.e., their semantic similarity). Does anyone know of one? I hope to get a reply soon. Thank you all. ...

What statistics should a programmer (or computer scientist) know?

I'm a programmer with a decent background in math and computer science. I've studied computability, graph theory, linear algebra, abstract algebra, algorithms, and a little probability and statistics (through a few CS classes) at an undergraduate level. I feel, however, that I don't know enough about statistics. Statistics are increasin...

Natural Language Parsing tools: what is out there and what is not?

I'm looking for various NLP tools for a project I'm working on, and so far I've found the Stanford NLP projects most useful. Does anyone know if there are other tools out there that would be useful for a language understander? And more importantly, are there tools that are NOT out there? Most specifically, I'm looking fo...

Language related: what does "client-server application" mean?

Well, it's not a big question, obviously. But you see, an application that uses a database on a server and is installed on multiple clients is called a client/server application. So is an application that consists of two parts: a host (or server) part and a client part. They are both called client/server apps. How can we d...

Data structure/Algorithm for Streaming Data and identifying topics

Hi, I want to know which algorithms and data structures are effective for identifying the information below in streaming data. Consider a real-time data stream like Twitter. I am mainly interested in the queries below rather than in storing the actual data. I need my queries to run on the actual data but not on any duplicates. As I am not i...

Is there a better tool than opencalais?

OpenCalais lets you submit a string (via a REST API), and it will analyze that string and break it down into named entities, relationships, keywords, etc. Are there better tools than OpenCalais (both free and commercial)? ...

What does a single sentence consist of? How to name it?

Hi, I'm designing the architecture of a text parser. Example sentence: Content here, content here. The whole sentence is a... sentence, that's obvious. The, quick etc. are words; , and . are punctuation marks. But what are words and punctuation marks all together in general? Are they just symbols? I simply don't know how to name what a singl...

English Lexicon for Search Query Correction

I'm building a spelling corrector for search engine queries by implementing the method described in "Spelling correction as an iterative process that exploits the collective knowledge of web users". The high-level approach is as follows: for a given query, come up with possible correction candidates (words in the query log within a c...

Am I passing the string correctly to the Python library?

I'm using a Python library called Guess Language: http://pypi.python.org/pypi/guess-language/0.1 "justwords" is a string with Unicode text. I stick it in the package, but it always returns English, even though the web page is in Japanese. Does anyone know why? Am I not encoding correctly? ...

Is there some functionality in Cocoa to display time intervals in natural language?

What I am searching for is a Cocoa (or third-party) class that can display time intervals in natural language, e.g. "10 seconds ago", "1 hour ago", "2 days ago". Do you know anything that could help me achieve this without writing it myself and melting in if-else hell? ...
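Independent of any Cocoa class, the if-else chain can be avoided with a table of units walked from largest to smallest. The sketch below is in Python rather than Objective-C, purely to illustrate the table-driven idea; `UNITS` and `time_ago` are hypothetical names, and months and years are approximated as 30 and 365 days.

```python
# (unit name, seconds per unit), largest first. Months and years are
# approximated as 30 and 365 days, which is an assumption of this sketch.
UNITS = [
    ("year", 365 * 24 * 3600),
    ("month", 30 * 24 * 3600),
    ("day", 24 * 3600),
    ("hour", 3600),
    ("minute", 60),
    ("second", 1),
]


def time_ago(seconds):
    """Format an elapsed number of seconds as e.g. '2 days ago'."""
    seconds = int(seconds)
    if seconds <= 0:
        return "just now"
    for name, size in UNITS:
        if seconds >= size:
            count = seconds // size
            plural = "" if count == 1 else "s"
            return f"{count} {name}{plural} ago"
```

Adding a unit (weeks, for instance) means adding one row to the table rather than another branch.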

Part of Speech Tagging - where to start?

Hello, I would like to know how to implement a solution to the following task: there's a 500 MB file of plain English text. I'd like to collect statistics about the frequency of words, but additionally be sure that each word (or the majority of words) is recognized correctly. That is, 'cry' in the sentence "she gave a loud CRY" ...