nlp

How to ensure users submit only English text

I am building a project involving natural language processing. Since the NLP module currently only deals with English text, I have to make sure the user-submitted content (not long, only several words) is in English. Are there established ways to achieve this? A Python or JavaScript solution is preferred. ...

How can I use NLP to parse recipe ingredients?

I need to parse recipe ingredients into amount, measurement, item, and description as applicable to the line, such as "1 cup flour", "the peel of 2 lemons" and "1 cup packed brown sugar". What would be the best way of doing this? I am interested in using Python for the project, so I am assuming NLTK is the best bet, but I am open t...

What are good starting points for someone interested in natural language processing?

Question: I've recently come up with some new possible projects that would have to deal with deriving 'meaning' from text submitted and generated by users. Natural language processing is the field that deals with these kinds of issues, and after some initial research I found the OpenNLP Hub and university collaborations like the atte...

tf-idf and previously unseen terms

TF-IDF (term frequency - inverse document frequency) is a staple of information retrieval. It's not a proper model, though, and it seems to break down when new terms are introduced into the corpus. How do people handle it when queries or new documents have new terms, especially if they are high frequency? Under traditional cosine match...
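
One possible way to keep unseen terms from breaking the weighting is to smooth the IDF, so a term with document frequency 0 still gets a finite weight. The following is only a minimal sketch under that assumption; the function names (`build_idf`, `tfidf_vector`, `cosine`) and the add-one smoothing formula are illustrative choices, not something taken from the question.

```python
import math
from collections import Counter

def build_idf(docs):
    """Return a function giving a smoothed IDF for any term, seen or unseen.

    docs is a list of token lists. Assumed smoothing: log((1 + N) / (1 + df)) + 1,
    so an unseen term (df = 0) gets the largest, but still finite, weight.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    def idf(term):
        return math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0

    return idf

def tfidf_vector(tokens, idf):
    """Map each token to raw term frequency times smoothed IDF."""
    return {term: count * idf(term) for term, count in Counter(tokens).items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

corpus = [["cat", "sat"], ["dog", "sat"]]
idf = build_idf(corpus)
query = tfidf_vector(["cat", "zebra"], idf)  # "zebra" never appears in the corpus
scores = [cosine(query, tfidf_vector(doc, idf)) for doc in corpus]
```

In this sketch the unseen term adds nothing to the dot product but still enlarges the query norm, so it lowers every score uniformly; whether that is desirable depends on the application.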

Natural Language/Text Mining and Reddit/social news site

I think there is a wealth of natural language data associated with sites like Reddit, Digg, or news.google.com. I have done a little bit of research on text mining, but can't find how I could use those tools to parse something like Reddit. What kind of applications can you come up with? ...

How does Google's In Quotes work?

I find Google's In Quotes a really nifty application, and as a CS guy, I have to understand how it works. How do you think it turns news articles into a list of quotes attributed to specific persons? Sure, there are some mistakes, but their algorithm seems to be smarter than just a simple heuristic or multiple regular expressions. For ex...

(human) Language of a document

Is there a way (a program, a library) to approximately know which language a document is written in? I have a bunch of text documents (~500K) in mixed languages to import into an i18n-enabled CMS (Drupal). I don't need perfect matches, only a rough guess. ...

CORPUS resource

Hello friends! I am designing an automatic text summarizer. One of the major modules in this project requires a training corpus. Can someone please help me out by providing a training corpus or pointing me to a link where I can download one? Thanks in anticipation ...

Algorithm to determine how positive or negative a statement/text is

I need an algorithm to determine if a sentence, paragraph or article is negative or positive in tone... or better yet, how negative or positive. For instance: "Jason is the worst SO user I have ever witnessed" (-10); "Jason is an SO user" (0); "Jason is the best SO user I have ever seen" (+10); Jason is the be...

fuzzy string search in Java

I'm looking for a high-performance Java library for fuzzy string search. There are numerous algorithms for finding similar strings: Levenshtein distance, Daitch-Mokotoff Soundex, n-grams, etc. What Java implementations exist? What are their pros and cons? I'm aware of Lucene; is there any other solution, or is Lucene the best? I found these; does anyone have experien...

Algorithms recognizing physical address on a webpage

What are the best algorithms for recognizing structured data on an HTML page? For example, Google will recognize a home or company address in an email and offer a map to that address. ...

Story telling/building algorithms?

I'm working on a simple story generator and am looking for story-building algorithms and patterns to use in my design. Does anyone have any good recommendations? ...

natural language identification in PHP

I am looking for an automatic language identification tool written in PHP. The tool should receive a string as input and output the name of the (natural) language the string is written in. A Perl example is TextCat, ported to Java by Knallgrau New Media Solutions. Does anyone know of a PHP port? Or another similar PHP tool? ...

Is there open source software available that analyses a string and guesses the gender of the author?

I can't find anything other than closed-source web applications. Are there any active projects? I'd be interested in using the software in something I'm developing and getting involved. ...

About "AUTOMATIC TEXT SUMMARIZER (linguistic based)"

Hello, I have "AUTOMATIC TEXT SUMMARIZER (linguistic approach)" as my final-year project. I have collected enough research papers and gone through them. Still, I am not very clear about how to go about it. Basically, I found "AUTOMATIC TEXT SUMMARIZER (statistical based)" and found that it is much easier compared to my...

Detecting syllables in a word

I need to find a fairly efficient way to detect syllables in a word. E.g., invisible -> in-vi-sib-le. There are some syllabification rules that could be used: V, CV, VC, CVC, CCV, CCCV, CVCC (where V is a vowel and C is a consonant). E.g., pronunciation (5: Pro-nun-ci-a-tion; CV-CVC-CV-V-CVC). I've tried a few methods, among which were using...
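
The V/C notation used in the rules above can be sketched in a few lines of Python. This is only a rough illustration, not a real syllabifier: it assumes that only a, e, i, o, u count as vowels (so "y" is treated as a consonant), and the helper names `cv_pattern` and `count_vowel_groups` are hypothetical.

```python
VOWELS = set("aeiou")  # assumption: "y" is always treated as a consonant

def cv_pattern(word):
    """Translate a word into its consonant/vowel pattern, e.g. 'cat' -> 'CVC'."""
    return "".join("V" if ch in VOWELS else "C" for ch in word.lower() if ch.isalpha())

def count_vowel_groups(word):
    """Count runs of consecutive vowels as a crude syllable estimate."""
    groups = 0
    previous = "C"
    for symbol in cv_pattern(word):
        if symbol == "V" and previous != "V":
            groups += 1  # a new vowel run starts here
        previous = symbol
    return groups

pattern = cv_pattern("invisible")
estimate = count_vowel_groups("invisible")
```

Counting vowel runs gives 4 for "invisible", which matches in-vi-sib-le, but it gives 4 rather than 5 for "pronunciation" because adjacent vowels such as "i" and "a" are merged into one run. A dictionary-based or hyphenation-pattern approach would be needed for cases like that.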

Measuring the performance of classification algorithm

I've got a classification problem on my hands, which I'd like to address with a machine learning algorithm (Bayes or Markovian, probably; the question is independent of the classifier to be used). Given a number of training instances, I'm looking for a way to measure the performance of an implemented classifier, while taking data overf...
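
One widely used way to measure a classifier on data it was not trained on is k-fold cross-validation. The sketch below is a minimal, classifier-agnostic version under stated assumptions: `train_fn` and `predict_fn` are hypothetical callables standing in for whatever classifier is used, accuracy is the assumed metric, and the data is assumed to be shuffled beforehand.

```python
from collections import Counter

def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    start = 0
    for fold in range(k):
        # Spread the remainder over the first folds so sizes differ by at most 1.
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def cross_validate(samples, labels, k, train_fn, predict_fn):
    """Return mean accuracy over k folds; each fold is held out exactly once."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, samples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)

# Hypothetical baseline: always predict the most common training label.
def train_majority(train_samples, train_labels):
    return Counter(train_labels).most_common(1)[0][0]

def predict_majority(model, sample):
    return model

folds = list(k_fold_indices(10, 3))
baseline_accuracy = cross_validate(list(range(6)), ["a"] * 6, 3,
                                   train_majority, predict_majority)
```

Comparing a real classifier against a trivial baseline like the majority-label one above gives a quick sanity check: a large gap between accuracy on the training data and the cross-validated accuracy is a common sign of overfitting.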

Tool or methods for automatically creating contextual links within a large corpus of content?

Here's the basic scenario - I have a corpus of say 100,000 newspaper-like articles. Minimally they will all have a well-defined title, and some amount of body content. What I want to do is find runs of text in articles that ought to link to other articles. So, if article Foo has a run of text like "Students in 8th grade are being en...

Natural English language words

I need the most exhaustive English word list I can find for several types of language processing operations, but I could not find anything on the internet of good enough quality. There are 1,000,000 words in the English language including foreign and/or technical words. Can you please suggest such a source (or close to 500k w...

Natural language parser for dates (.NET)?

I want to let users enter dates (including recurring dates) using natural language (e.g. "next Friday", "every weekday"), much like the examples at http://todoist.com/Help/timeInsert. I found this post, but it's a bit old and offered only one solution that I'm not entirely content with. I thought I'd resurrect this question and ...