natural-language

How to detect nonsensical text in PHP?

I have comments enabled on my site and I require users to enter at least 30 characters to publish their comments (just to get some value, because they usually just submitted "I like it"). But some users now use a simple technique to get around this and enter e.g.: "I like it. asdsdf dfdsfsdf tt erretrt re" As you can see the rest of the text...

Looking for artificial intelligence (AI) cookbook reader research

I am looking for research (published) on AI techniques for reading cookbook recipes. Recipes are a very limited domain that might be doable in a natural language recognition engine with some degree of accuracy. I have in mind writing a program that would allow copy/pasting a recipe from a web browser into the AI and having it determine ...

Changing the words keeping its meaning intact...

Hi, we have a requirement in which we need to change the words or phrases in a sentence while keeping its meaning intact. This application is going to provide suggestions to users who are involved in copy-writing. I don't know where I should start... we have not yet finalized the technology but would like to do it in a Python o...

Topic modeling using mallet

Hey guys, I'm trying to use topic modeling with Mallet but have a question. How do I know when I need to rebuild the model? For instance, I have a number of documents I crawled from the web; using the topic modeling provided by Mallet I might be able to create the models and infer documents with it. But over time, with new data that I...

How do I get a list of the most common words in various languages?

Stack Overflow implemented its "Related Questions" feature by taking the title of the current question being asked and removing from it the 10,000 most common English words according to Google. The remaining words are then submitted as a full-text search to find related questions. How do I get such a list of the most common English words?...
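The filtering step described in this excerpt can be sketched in Python. This is only an illustration: `COMMON_WORDS` is a tiny hypothetical stand-in for a real frequency list (the question mentions 10,000 words), and `search_terms` is a name invented here, not part of any library.

```python
import re

# Hypothetical stand-in for a large list of common English words.
COMMON_WORDS = {"how", "do", "i", "get", "a", "of", "the", "in", "most", "to", "is"}

def search_terms(title, common_words=COMMON_WORDS):
    """Return the words of `title` that are not in `common_words`, in order."""
    words = re.findall(r"[a-z0-9']+", title.lower())
    return [w for w in words if w not in common_words]

print(search_terms("How do I get a list of the most common English words?"))
```

The remaining words would then be passed to whatever full-text search is available; that part is outside this sketch.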

natural language question creation

I am trying to build questions based on information available on about 10 variables, e.g. shape (square, circle, rectangle, parallelogram), length, width, circumference, area, diagonal length etc. E.g. if I want to set a question to calculate area based on shape, length and width, the question gets created stating: calculate area of 'rectang...

N-gram generation from a sentence

How do I generate the n-grams of a string like String Input="This is my car."? I want to generate the n-grams of this input. Input n-gram size = 3. Output should be: This, is, my, car, This is, is my, my car, This is my, is my car. Please give some idea of how to implement that in Java, or whether any library is available for it. I am trying to use this NGramTokenizer ...
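A minimal sketch of the requested output, written in Python for illustration even though the question asks about Java. It assumes whitespace tokenization and that trailing sentence punctuation should be dropped; `ngrams_up_to` is a hypothetical helper name, not a library function.

```python
def ngrams_up_to(text, max_n):
    """List all word n-grams of `text` for n = 1..max_n, shortest first."""
    # Assumption: strip trailing sentence punctuation, then split on whitespace.
    tokens = text.rstrip(".!?").split()
    result = []
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the sentence.
        for i in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[i:i + n]))
    return result

for gram in ngrams_up_to("This is my car.", 3):
    print(gram)
```

The same two nested loops translate directly to Java with `String.split` and `String.join`.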

Regular expression for counting sentences in a block of text.

Possible Duplicate: PHP - How to split a paragraph into sentences. I have a block of text that I would like to separate into sentences. What would be the best way of doing this? I thought of looking for '.', '!', '?' characters, but I realized there were some problems with this, such as when people use acronyms, or end a sentenc...
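One way to handle the abbreviation problem mentioned here is to split on terminal punctuation but skip tokens found in an exception list. The sketch below is in Python rather than PHP and is a heuristic only: `ABBREVIATIONS` is a small hypothetical list, and cases such as a sentence that genuinely ends in an abbreviation are not handled.

```python
# Hypothetical, deliberately short list of tokens that end in "." without ending a sentence.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Split `text` into sentences at '.', '!' or '?', ignoring known abbreviations."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep trailing text that has no terminal punctuation.
        sentences.append(" ".join(current))
    return sentences

parts = split_sentences("Dr. Smith arrived late. Was he tired? Yes!")
print(len(parts), parts)
```

Counting sentences is then just the length of the returned list.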

True definition of an English word?

What would be the best definition of an English word? What are the other cases of an English word than just \w+? Some may include \w+-\w+ or \w+'\w+; some may exclude cases like \b[0-9]+\b. But I haven't seen any general consensus on those cases. Do we have a formal definition of such? Can any of you clarify? (Edit: broaden the questi...
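To make the patterns in this excerpt concrete, here is a small Python sketch that combines them: `\w+` optionally extended by hyphen- or apostrophe-joined parts, with an optional filter for purely numeric tokens. It illustrates one possible convention, not a formal definition; `find_words` is a name invented here.

```python
import re

# \w+ followed by any number of -\w+ or '\w+ continuations.
WORD_PATTERN = re.compile(r"\w+(?:[-']\w+)*")

def find_words(text, exclude_numbers=False):
    """Return word-like tokens; optionally drop tokens made only of digits."""
    tokens = WORD_PATTERN.findall(text)
    if exclude_numbers:
        tokens = [t for t in tokens if not re.fullmatch(r"[0-9]+", t)]
    return tokens

sample = "A state-of-the-art parser isn't 42 words"
print(find_words(sample))
print(find_words(sample, exclude_numbers=True))
```

Note that `\w` also matches digits and underscores, which is one reason the plain `\w+` definition is debatable.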

Wikipedia: pages across multiple languages

Hi, I want to use a Wikipedia dump for my project. The information below is required for my project. For a Wikipedia entry, I want to know which other languages contain the page. I want downloadable data in CSV or another common format. Is there a way to get this data? Thanks Bala ...

NLP research :: Clustering of English language words?

Hi, Is there a partition of English words into high-level categories like, say, sports, basketball etc.? It's required for my project. Is this data available somewhere? I am okay with overlapping of words across categories. Thank you Bala ...

Wikipedia categories

Hi, I want to get a list of all the Wikipedia categories. I can find them here: http://en.wikipedia.org/wiki/Special:Categories Is there a way to download all of them in XML/CSV format? Thank you Bala ...

Natural Language Processing Algorithm for mood of an email

Hi, One simple question (but I haven't quite found an obvious answer in the NLP stuff I've been reading, which I'm very new to): I want to classify emails with a probability along certain dimensions of mood. Is there an NLP package out there specifically dealing with this? Is there an obvious starting point in the literature I start re...

Natural Language Processing

I have thousands of sentences in a file. I want to find only the valid, useful English words. Is this possible with natural language processing? Sample sentence: ~@^.^@~ tic but sometimes world good famous tac Zorooooooooooo I just want to extract only English words like tic world good famous. Any advice on how I can achieve this? Th...
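A simple starting point for this kind of filtering is a dictionary lookup: pull out alphabetic runs and keep those found in a word list. In the Python sketch below, `KNOWN_WORDS` is a toy vocabulary chosen only to mirror the output the asker describes; a real word list would also decide whether words such as "but" count as useful.

```python
import re

# Hypothetical toy vocabulary; in practice this would be loaded from a word list file.
KNOWN_WORDS = {"tic", "world", "good", "famous"}

def extract_known_words(sentence, vocabulary=KNOWN_WORDS):
    """Return the alphabetic tokens of `sentence` that appear in `vocabulary`."""
    candidates = re.findall(r"[A-Za-z]+", sentence)
    return [w for w in candidates if w.lower() in vocabulary]

sample = "~@^.^@~ tic but sometimes world good famous tac Zorooooooooooo"
print(extract_known_words(sample))
```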

Variations in spelling of first name

As part of a contact management system I have a large database of names. People frequently edit this and as a result we run into issues of the same person existing in different forms (John Smith and Jonathan Smith). I looked into word similarity but it's easy to think of name variations which are not similar at all (Richard vs Dick). I w...

How Can I Parse a Document and Replace the Content to Change Context from 1st or 2nd Person to 3rd Person?

Basically I need some text like: I have an ice cream cone. You are in trouble. You need a bath. And change it from 1st or 2nd person to 3rd person. He has an ice cream cone. He is in trouble. He needs a bath. I've started a js app, but it's super simple at the moment. Before I waste time reinventing the wheel, I figured I'd ask:...

OpenNLP vs Stanford NLP tools vs Berkeley

Hi, the aim is to parse a sizeable corpus like Wikipedia to generate the most probable parse tree, and to perform named entity recognition. Which is the best library to achieve this in terms of performance and accuracy? Has anyone used more than one of the above libraries? ...

Person names disambiguation

Hi, I am currently doing a project on person name disambiguation. The idea behind the project is that it will be able to identify the correct person when there are multiple people with the same name. I have used Wikipedia for this. I want to evaluate my project on some standard data. I am looking for some testing data. I am not familiar ...

Automated question answering (FAQ) in .NET

Hi, I would like to build a very simple application: an automated FAQ. I searched the internet and found some information about different approaches, but there is no .NET-specific example. Do you have some experience building such an application, or maybe know some .NET-specific examples? It would be very interesting to take a look at one. H...

How to separate words in a "sentence" with spaces?

Background: Looking to automate creating Domains in JasperServer. Domains are a "view" of data for creating ad hoc reports. The names of the columns must be presented to the user in a human-readable fashion. Problem: There are over 2,000 possible pieces of data that the organization could theoretically want to include on a report....