text-analysis

NLP: Qualitatively "positive" vs "negative" sentence

I need your help in determining the best approach for classifying industry-specific sentences (e.g. movie reviews) as "positive" or "negative". I've seen libraries such as OpenNLP before, but they're too low-level: they just give me the basic sentence composition. What I need is a higher-level structure: - hopefully with wordlists - hopefull...

How to find common phrases in a large body of text

Hi, I'm working on a project at the moment where I need to pick out the most common phrases in a huge body of text. For example, say we have three sentences like the following: The dog jumped over the woman. The dog jumped into the car. The dog jumped up the stairs. From the above example I would want to extract "the dog jumped" as i...
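One simple way to approach the example above is to count word n-grams that stay inside a single sentence. The following is only a minimal sketch: the function name `common_phrases`, the fixed phrase length `n`, and the `min_count` threshold are illustrative assumptions, not anything from the question, and a huge corpus would need streaming input and pruning of rare n-grams.

```python
import re
from collections import Counter


def common_phrases(text, n=3, min_count=2):
    """Return (phrase, count) pairs for n-word phrases seen at least min_count times.

    Hypothetical helper for illustration: phrases never cross sentence
    boundaries, and matching is case-insensitive.
    """
    counts = Counter()
    # Split on sentence-ending punctuation so phrases stay within one sentence.
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        # Slide a window of n words across the sentence.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return [(phrase, c) for phrase, c in counts.most_common() if c >= min_count]


text = ("The dog jumped over the woman. The dog jumped into the car. "
        "The dog jumped up the stairs.")
print(common_phrases(text))
```

Running this over several values of `n` and keeping the longest phrases that still pass the threshold would be one way to avoid fixing the phrase length in advance.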

term clustering library?

Hi, does anybody know an open-source/free library that does term clustering? Thanks, yaniv ...

Word lists for a lot of articles - document-term matrix

I have nearly 150k articles in Turkish. I will use the articles for natural language processing research. After processing the articles, I want to store the words and their frequencies per article. I'm storing them in an RDBMS now. I have 3 tables: Articles -> article_id,text Words -> word_id, type, word Words-Article -> id, word_id, article_id, frequ...

Any tutorial or code for TF-IDF in Java

Hi all, I am looking for a simple Java class that can compute TF-IDF. I want to do a similarity test on 2 documents. I found many big APIs that include a TF-IDF class, but I do not want to use a big jar file just to do my simple test. Please help! Or at least, if someone can tell me how to find TF and IDF, I will calculate the results...
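For readers wondering what the TF and IDF arithmetic looks like, here is a minimal sketch in Python (not the Java class the question asks for, though the same steps port directly). It assumes whitespace tokenization, TF as count divided by document length, and a smoothed IDF of log((1 + N) / (1 + df)) + 1. The smoothing is a deliberate choice here: with only two documents, the plain log(N / df) gives a weight of zero to every shared term. The names `tf_idf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document (illustrative helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))                              # TF
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)          # smoothed IDF
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


docs = ["the cat sat on the mat", "the cat lay on the rug"]
vec_a, vec_b = tf_idf_vectors(docs)
print(cosine(vec_a, vec_b))
```

A score near 1 means the two documents use the same terms in similar proportions, and 0 means they share no terms at all.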

Splitting string on probable English word boundaries

I recently used Adobe Acrobat Pro's OCR feature to process a Japanese kanji dictionary. The overall quality of the output is generally quite a bit better than I'd hoped, but word boundaries in the English portions of the text have often been lost. For example, here's one line from my file: softening;weakening(ofthemarket)8 CHANGE [tra...

How to extract common / significant phrases from a series of text entries

I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (not the single most common phrase, and ideally, not enforcing word-for-word matching). My example is any review on Yelp.com, that shows 3 snippets from hundreds of reviews of a given restaurant, in the format: "Try ...

Tag generation from small text content (such as tweets)

Hello, I have already asked a similar question earlier, but I have noticed that I have a big constraint: I am working on small text sets such as user tweets to generate tags (keywords). And it seems like the accepted suggestion (the pointwise mutual information algorithm) is meant to work on bigger documents. With this constraint (working on ...

Algorithm to suggest a list of tags to users

Given a free text, I need to analyse this text and suggest a list of tags from a pre-existing list. What algorithms are out there? Can they handle a case where, for example, the text has a phrase like "high cholesterol" and I would like it to suggest "heart disease", although "high cholesterol" might not exist...

Java text analysis libraries

I'm looking for a Java-driven solution to a requirement for analysing sentences to log whether a keyword was used positively or negatively. For example, the keyword might be 'cabbages' and the sentence 'I like cabbages but not peas', and I'd like a Java text analyser of some kind to log this as positive. Can the Lucene (Hibernate-Search) li...

How to wrap words or word sequences which have not already been wrapped?

I'm trying to wrap words and word sequences from a given list with preg_replace. It almost works, but there are some use cases where it doesn't, and I can't figure out how to fix them. For instance I do this: // sort by descending length usort($this->_keywords, function($a,$b){return(strlen($a)<strlen($b));}); // wrapper is -%string%- fo...