I'm looking for an existing library to summarize or paraphrase content (I'm aiming at blog posts) - any experience with existing natural language processing libraries?

I'm open to a variety of languages, so I'm more interested in the abilities & accuracy.

A: 

You're getting into a really far-out AI-type domain. I have done extensive work transforming text into machine knowledge, mainly using Attempto Controlled English (see: http://attempto.ifi.uzh.ch/site/). It is a natural language (English) that is completely computer-processable into several different ontology languages, such as OWL DL.

Seems like that would be way overkill though...

Is there a reason for not just taking the first few sentences of your blog post and then appending an ellipsis for your summary?
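For what it's worth, the "first few sentences plus an ellipsis" idea can be sketched in a few lines. This is only a rough illustration: the function name `naive_summary`, the `max_sentences` parameter, and the regex-based sentence splitter are my own assumptions, and the splitter will trip over abbreviations like "e.g.".

```python
import re


def naive_summary(text, max_sentences=2):
    """Return the first few sentences of text, plus an ellipsis if anything was cut."""
    # Naive splitter: break on whitespace that follows '.', '!' or '?'.
    # Assumption: good enough for a sketch, wrong for abbreviations.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    summary = " ".join(sentences[:max_sentences])
    if len(sentences) > max_sentences:
        summary += " ..."
    return summary


post = "First point. Second point! Third point? Fourth point."
print(naive_summary(post))
```

This only truncates; it does not condense or rank anything, so it is a teaser rather than a real summary.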

mmattax
A: 
Prakash
A: 

Thanks for those links. Looks like GROK is dead, but it may still work for my purposes.

2 more links:

Attempto Controlled English is an interesting concept, as it's a completely reversed way of looking at the problem. It's not really practical for what I am trying to do, though.

@mmattax As for the suggestion of taking a few sentences: I'm not trying to present a summary, otherwise that would be a nice judo solution. I'm looking to actually summarize the content so that I can use it for other evaluation purposes.

jeffreypriebe
A: 

You might want to try GATE, or the closed-source, proprietary and costly TextAnalyst COM API.

Josh
+2  A: 

I think he wants to generate blog posts by automatically paraphrasing whatever was in the blogs this system is monitoring.

This would be really interesting if you could combine 2 to 10 blog posts that are similar, but from different sources and then do a paraphrased "real" summary automatically (the size of 1 blog post).

It could also be great for homework. Unfortunately, it's not that easy to do.

The only way I could see is to be able to decompose every sentence into "meaning", and then randomly change the sentence structure and some words retaining the meaning.

These sentences mean the same:

  • I hate this guy, he is so dumb.
  • This guy is stupid, I hate him.
  • I despise this dumb guy.
  • He is dumb, I hate him.

It would be nontrivial to write a program to transform one of these sentences into the others, and these are simple sentences. Real sentences from blogs are much more complicated.
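To give a feel for why this is hard, here is a small sketch that compares the four example sentences by plain word overlap (Jaccard similarity on word sets). The helper names `words` and `jaccard` are my own, and word overlap is only a stand-in for "a program that just looks at the surface text", not a real semantic comparison.

```python
import re
from itertools import combinations

sentences = [
    "I hate this guy, he is so dumb.",
    "This guy is stupid, I hate him.",
    "I despise this dumb guy.",
    "He is dumb, I hate him.",
]


def words(sentence):
    """Lowercase the sentence and return its set of distinct words."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def jaccard(a, b):
    """Share of words two sets have in common (0.0 = none, 1.0 = identical sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Print the overlap score for every pair of sentences.
for s1, s2 in combinations(sentences, 2):
    print(f"{jaccard(words(s1), words(s2)):.2f}  {s1!r} vs {s2!r}")
```

Even though all four sentences mean the same thing, no pair comes close to sharing all of its words, so anything working purely on the surface text would not see them as equivalent.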

Osama ALASSIRY
+1  A: 

There was some discussion of Grok. This is now supported as OpenCCG, and will be reimplemented in OpenNLP as well.

You can find OpenCCG at http://openccg.sourceforge.net/. I would also suggest the Curran and Clark CCG parser available here: http://svn.ask.it.usyd.edu.au/trac/candc/wiki

Basically, for paraphrase, what you're going to need to do is write something that first parses the sentences of the blog posts, extracts their semantic meaning, then searches through the space of vocabulary words that will compositionally create the same semantic meaning, and finally picks a result that doesn't match the current sentence. This will take a long time, and the output might not make a lot of sense. Don't forget that in order to do this, you're going to need near-perfect anaphora resolution and the ability to pick up discourse-level inferences.

If you're just looking to make blog posts that don't have machine-identifiable duplicate content, you can always just use topic and focus transformations and WordNet synonyms. There have definitely been sites that have done this before and made money off of AdWords.
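As a rough sketch of the synonym-swapping part only (not the topic and focus transformations), something like the following would do. Note the assumptions: `TOY_SYNONYMS` is a hypothetical hand-written table standing in for a real WordNet lookup, and `swap_synonyms` is a name I made up for illustration.

```python
import re

# Hypothetical hand-written table; a real version would query WordNet instead.
TOY_SYNONYMS = {
    "hate": "despise",
    "dumb": "stupid",
    "guy": "man",
}


def swap_synonyms(sentence, synonyms=None):
    """Replace every word found in the synonym table, leaving all other text alone."""
    table = TOY_SYNONYMS if synonyms is None else synonyms

    def replace(match):
        word = match.group(0)
        substitute = table.get(word.lower())
        if substitute is None:
            return word
        # Keep a leading capital if the original word had one.
        return substitute.capitalize() if word[0].isupper() else substitute

    return re.sub(r"[A-Za-z]+", replace, sentence)


print(swap_synonyms("I hate this guy, he is so dumb."))
```

This is blind to word sense and context, so on real text it will sometimes pick a synonym that reads badly. That fits the caveat above: it may defeat duplicate detection, but it is not real paraphrasing.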

Robert Elwell