I have implemented full-text search in a discussion forum database, and I want to display the search results the way Google does. Even for a very long HTML page, only two or three lines of text are displayed in the search result list. Usually these are the lines that contain the search terms.

What would be a good algorithm for extracting a few lines of text, based on the text itself and the search terms? I could think of something as simple as taking one line of text before the search term occurrence and one line after, but that seems too simple to work.

I would like to get a few directions, ideas, and insights.

Thank you.

A: 

Have you tried the "line before/after search term occurrence" approach in code, to see whether, for that small coding investment, the results are good enough for what you want? That might already be enough.
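To make that concrete, here is a minimal sketch of the line before/after idea. The function name `line_context_snippet` and the `context` parameter are made up for illustration, and it assumes the post is plain text with one "line" per newline and a single search term matched case-insensitively.

```python
def line_context_snippet(text, term, context=1):
    """Return the first line containing `term`, plus `context` lines on each side.

    Hypothetical helper: assumes plain text and a case-insensitive substring match.
    """
    # Drop blank lines so the neighbours are always real content.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    needle = term.lower()
    for i, line in enumerate(lines):
        if needle in line.lower():
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            return " ".join(lines[start:end])
    return ""  # the term does not occur in the text


post = "First line.\nSecond line mentions snippets.\nThird line.\nFourth line."
print(line_context_snippet(post, "snippets"))
```

It only looks at the first occurrence, which is probably the first thing you would want to improve if the results turn out not to be good enough.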

Otherwise, you could go for pieces of sentences: don't split only on lines, but on newlines, full stops, commas, spaced-out hyphens, etc. Then show the pieces that contain the search terms. You could separate each matching sentence piece with "..." or something similar.

If you get a lot of these pieces, you could prioritize them, sort by descending priority, and show only the first n. And/or cut each piece down to just the search term and a couple of words around it.
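A rough sketch of the sentence-piece idea, under a few assumptions: the text is already stripped of HTML, "priority" is simply the number of search term occurrences in a piece, and the names `SPLIT_RE`, `make_snippet` and `max_pieces` are invented for this example. The matching is a naive substring count, so "search" would also match inside "research".

```python
import re

# Split on newlines, whitespace after sentence-ending punctuation,
# commas/semicolons, and spaced-out hyphens.
SPLIT_RE = re.compile(r"\n+|(?<=[.!?])\s+|[,;]\s+|\s+-\s+")


def make_snippet(text, terms, max_pieces=3):
    """Join the highest-scoring sentence pieces with ' ... ' (hypothetical helper)."""
    terms = [t.lower() for t in terms]
    pieces = [p.strip() for p in SPLIT_RE.split(text) if p.strip()]

    scored = []
    for pos, piece in enumerate(pieces):
        low = piece.lower()
        score = sum(low.count(t) for t in terms)  # assumed priority: match count
        if score:
            scored.append((score, pos, piece))

    # Keep the best pieces (ties go to the earlier piece) ...
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:max_pieces]
    # ... then put them back in document order before joining.
    top.sort(key=lambda s: s[1])
    return " ... ".join(piece for _, _, piece in top)


text = ("Welcome to the forum. Full text search is handy, "
        "but search results need snippets - short excerpts around the terms. "
        "Thanks for reading.")
print(make_snippet(text, ["search", "snippets"]))
```

Restoring document order after ranking is a design choice: the snippet then reads like a condensed version of the post rather than a shuffled list of fragments.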

Just a couple of informal ideas that might get you started?

peSHIr
+2  A: 

If you are looking for something fancier than the 'line before/after' approach, a summarizer might do the trick.

Here's a Naive Bayes based system: http://classifier4j.sourceforge.net/

Bayes is the statistical system used by many spam filters - I researched Bayes summarizers a few years back, and found that they do a pretty good job of summarizing text, as long as there is a decent amount of text to process. I haven't actually tried the above library, though, so your mileage may vary.

Kevin Day
A: 

Concentrate on the beginning of the content. Think of where you would look when you visit a blog: the opening paragraph tells you whether the article is headed in the right direction. So it makes sense for your algorithm to reflect this.

Check for occurrences of the search term in headings (H1, H2, etc.) and give them higher priority.
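As a sketch of how both hints could be turned into a score: the code below assumes the page has already been parsed into `(kind, text)` blocks, where `kind` is `"heading"` or `"body"`. The function `rank_blocks` and the `heading_weight` and `lead_bonus` values are arbitrary illustrations, not tuned numbers.

```python
def rank_blocks(blocks, terms, heading_weight=3.0, lead_bonus=1.0):
    """Rank matching blocks, favouring headings and early positions.

    Hypothetical helper: `blocks` is a list of (kind, text) tuples in page order.
    """
    terms = [t.lower() for t in terms]
    ranked = []
    for pos, (kind, text) in enumerate(blocks):
        low = text.lower()
        hits = sum(low.count(t) for t in terms)
        if hits == 0:
            continue  # blocks without any search term are never shown
        score = hits * (heading_weight if kind == "heading" else 1.0)
        score += lead_bonus / (pos + 1)  # bonus shrinks further down the page
        ranked.append((score, text))
    ranked.sort(key=lambda r: -r[0])
    return [text for _, text in ranked]


blocks = [
    ("body", "Intro about forums and search."),
    ("heading", "Search snippets"),
    ("body", "A snippet shows search terms in context; search more."),
]
for text in rank_blocks(blocks, ["search"]):
    print(text)
```

With these weights a single hit in a heading outranks two hits in a later body paragraph; whether that is the right trade-off depends on your content.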

This should get you started.

Bobby Alexander