views:

137

answers:

1

When I ask a question here, the tool tips for the question returned by the auto search given the first little bit of the question, but a decent percentage of them don't give any text that is any more useful for understanding the question than the title. Does anyone have an idea about how to make a filter to trim out useless bits of a question?

My first idea is to trim any leading sentences that contain only words in some list (for instance, stop words, plus words from the title, plus words from the SO corpus that have very weak correlation with tags, that is that are equally likely to occur in any question regardless of it's tags)

+5  A: 

Automatic Text Summarization

It sounds like you're interested in automatic text summarization. For a nice overview of the problem, issues involved, and available algorithms, take a look at Das and Martin's paper A Survey on Automatic Text Summarization (2007).

Simple Algorithm

A simple but reasonably effective summarization algorithm is to just select a limited number of sentences from the original text that contain the most frequent content words (i.e., the most frequent ones not including stop list words).

Summarizer(originalText, maxSummarySize):
   // start with the raw freqs, e.g. [(10,'the'), (3,'language'), (8,'code')...]
   wordFrequences = getWordCounts(originalText)
   // filter, e.g. [(3, 'language'), (8, 'code')...]
   contentWordFrequences = filtStopWords(wordFrequences)
   // sort by freq & drop counts, e.g. ['code', 'language'...]
   contentWordsSortbyFreq = sortByFreqThenDropFreq(contentWordFrequences)

   // Split Sentences
   sentences = getSentences(originalText)

   // Select up to maxSummarySize sentences
   setSummarySentences = {}
   foreach word in contentWordsSortbyFreq:
      firstMatchingSentence = search(sentences, word)
      setSummarySentences.add(firstMatchingSentence)
      if setSummarySentences.size() = maxSummarySize:
         break

   // construct summary out of select sentences, preserving original ordering
   summary = ""
   foreach sentence in sentences:
     if sentence in setSummarySentences:
        summary = summary + " " + sentence

   return summary

Some open source packages that do summarization using this algorithm are:

Classifier4J (Java)

If you're using Java, you can use Classifier4J's module SimpleSummarizer.

Using the example found here, let's assume the original text is:

Classifier4J is a java package for working with text. Classifier4J includes a summariser. A Summariser allows the summary of text. A Summariser is really cool. I don't think there are any other java summarisers.

As seen in the following snippet, you can easily create a simple one sentence summary:

// Request a 1 sentence summary
String summary = summariser.summarise(longOriginalText, 1);

Using the algorithm above, this will produce Classifier4J includes a summariser..

NClassifier (C#)

If you're using C#, there's a port of Classifier4J to C# called NClassifier

Tristan Havelick's Summarizer for NLTK (Python)

There's a work-in-progress Python port of Classifier4J's summarizer built with Python's Natural Language Toolkit (NLTK) available here.

dmcer
I wonder if the C# version is fast enough to be used for this site?
BCS
The algorithm is **dead simple**, so it really should be fast enough. It first determines the **most frequent content words** in the original text. It then iterates over them and selects the **earliest sentence** in the original string that contains each word. This continues until the desired number of N many sentences are selected.
dmcer