ansaurus

Question

Given a document, select a relevant snippet.

Answer 1

+5 A:

Automatic Text Summarization

It sounds like you're interested in automatic text summarization. For a nice overview of the problem, the issues involved, and the available algorithms, take a look at Das and Martins' paper "A Survey on Automatic Text Summarization" (2007).

Simple Algorithm

A simple but reasonably effective summarization algorithm is to just select a limited number of sentences from the original text that contain the most frequent content words (i.e., the most frequent ones not including stop list words).

Summarizer(originalText, maxSummarySize):
   // start with the raw freqs, e.g. [(10,'the'), (3,'language'), (8,'code')...]
   wordFrequencies = getWordCounts(originalText)
   // filter out stop words, e.g. [(3, 'language'), (8, 'code')...]
   contentWordFrequencies = filterStopWords(wordFrequencies)
   // sort by descending freq & drop counts, e.g. ['code', 'language'...]
   contentWordsSortedByFreq = sortByFreqThenDropFreq(contentWordFrequencies)

   // split the text into sentences
   sentences = getSentences(originalText)

   // select up to maxSummarySize sentences
   setSummarySentences = {}
   foreach word in contentWordsSortedByFreq:
      // earliest sentence in the text that contains the word
      firstMatchingSentence = search(sentences, word)
      setSummarySentences.add(firstMatchingSentence)
      if setSummarySentences.size() == maxSummarySize:
         break

   // construct the summary from the selected sentences, preserving original ordering
   summary = ""
   foreach sentence in sentences:
      if sentence in setSummarySentences:
         summary = summary + " " + sentence

   return summary
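
For anyone who wants to try the idea directly, here is a minimal Python sketch of the pseudocode above. It is not taken from any of the packages mentioned in this answer: the stop list, the regex tokenizer, and the naive sentence splitter are all simplifying assumptions, and the function names (`tokenize`, `split_sentences`, `summarize`) are made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; real systems use a much larger one).
STOP_WORDS = frozenset(
    "a an and are as at be by for from i in is it of on or that the "
    "there this to was with".split()
)


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(original_text, max_summary_size):
    """Pick the earliest sentence for each of the most frequent content words."""
    if max_summary_size <= 0:
        return ""

    sentences = split_sentences(original_text)
    sentence_tokens = [set(tokenize(s)) for s in sentences]

    # Count content words, i.e. every token that is not on the stop list.
    counts = Counter(w for w in tokenize(original_text) if w not in STOP_WORDS)

    selected = set()  # indices of the sentences chosen so far
    for word, _ in counts.most_common():  # most frequent word first
        # Find the earliest sentence that contains this word.
        for index, tokens in enumerate(sentence_tokens):
            if word in tokens:
                selected.add(index)
                break
        if len(selected) >= max_summary_size:
            break

    # Emit the chosen sentences in their original order.
    return " ".join(sentences[i] for i in sorted(selected))
```

Calling `summarize(text, 2)` returns at most two sentences from `text`, in the order they appear in the original. Words with equal counts are visited in order of first appearance, which is a property of this sketch rather than something the pseudocode specifies.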

Some open source packages that do summarization using this algorithm are:

Classifier4J (Java)

If you're using Java, you can use Classifier4J's SimpleSummariser class.

Using the example found here, let's assume the original text is:

Classifier4J is a java package for working with text. Classifier4J includes a summariser. A Summariser allows the summary of text. A Summariser is really cool. I don't think there are any other java summarisers.

As seen in the following snippet, you can easily create a simple one-sentence summary:

// Request a 1 sentence summary
String summary = summariser.summarise(longOriginalText, 1);

Using the algorithm above, this will produce "Classifier4J includes a summariser."
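
To see why that sentence is chosen, the short Python check below counts the content words in the example text and finds the earliest sentence containing the most frequent one. It is only a hand-rolled trace of the algorithm, not Classifier4J's code; the stop list and the tokenizing regex are assumptions chosen to fit this one example.

```python
import re
from collections import Counter

text = ("Classifier4J is a java package for working with text. "
        "Classifier4J includes a summariser. "
        "A Summariser allows the summary of text. "
        "A Summariser is really cool. "
        "I don't think there are any other java summarisers.")

# Illustrative stop list (an assumption) covering the function words in this text.
stop_words = {"a", "is", "for", "with", "the", "of", "i", "don't",
              "there", "are", "any", "other"}

# Lowercased word tokens, then counts of the non-stop (content) words.
words = re.findall(r"[a-z0-9']+", text.lower())
content_counts = Counter(w for w in words if w not in stop_words)
top_word, top_count = content_counts.most_common(1)[0]

# Split after each period and take the earliest sentence containing the top word.
sentences = re.split(r"(?<=\.)\s+", text)
first_match = next(
    s for s in sentences
    if top_word in re.findall(r"[a-z0-9']+", s.lower())
)

print(top_word, top_count)
print(first_match)
```

Here the most frequent content word is "summariser" (3 occurrences, since "summarisers" is counted as a different token), and the earliest sentence containing it is the second one, which matches the one-sentence summary above.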

NClassifier (C#)

If you're using C#, there's a port of Classifier4J to C# called NClassifier.

Tristan Havelick's Summarizer for NLTK (Python)

There's a work-in-progress Python port of Classifier4J's summarizer built with Python's Natural Language Toolkit (NLTK) available here.

dmcer 2010-05-14 00:40:12

I wonder if the C# version is fast enough to be used for this site?

BCS 2010-05-14 00:55:31

The algorithm is **dead simple**, so it really should be fast enough. It first determines the **most frequent content words** in the original text. It then iterates over them and, for each word, selects the **earliest sentence** in the original text that contains it. This continues until the desired number of sentences, N, has been selected.

dmcer 2010-05-14 03:01:34
