ansaurus

Question

Database Search Term Highlighting and Result Truncating

Answer 1

+1 A:

I would not keep the HTML formatting in the search results. That would make your results page very messy. It doesn't make sense to include headings, line breaks, images, paragraph margins, etc. in the result descriptions--especially if you're only going to be printing short excerpts of truncated content.
I think in most cases, a result that matches 100% of the search terms one time is going to be more relevant than a result that only matches 50% of the search terms repeated twice. But this also depends on the query.
That's the only viable option, unless you want to send the client all of the result pages at once.

Since you're using MySQL's built-in fulltext search function, you can't really show the user why the results are what they are--not without a detailed understanding of how the fulltext search determines relevance. What you can do is show the user excerpts from each page that may be relevant to their search and may help them make useful determinations of which results to look into.

I would first strip the page content of any markup using strip_tags(), then explode() the content into an array of individual sentences. Then you could iterate through the array to determine the relevance of each sentence and then simply display the most relevant sentence(s) to the user. If the most relevant sentence is too long, then truncate it at word boundaries.

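// Remove HTML markup so only plain text is scored and displayed.
$text = strip_tags($content);
// Split into sentences on a period followed by a space.
$sentences = explode('. ', $text);
$relevance = array();
foreach ($sentences as $i => $sentence) {
    $relevance[$i] = calcRel($sentence);
}
// Sort by relevance, highest first, keeping the sentence indexes as keys.
arsort($relevance);
// Take the two most relevant sentences (assumes there are at least two).
list($i, $j) = array_keys($relevance);
// explode() dropped the separators, so restore one between adjacent
// sentences and use an ellipsis when sentences in between are skipped.
$ellips = (abs($i - $j) > 1 ? ' ... ' : '. ');
// Show the two sentences in their original order.
if ($i < $j) {
    $description = $sentences[$i].$ellips.$sentences[$j];
} else {
    $description = $sentences[$j].$ellips.$sentences[$i];
}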

calcRel($sentence) would return a numeric value representing relevance calculated by:

Searching $sentence for the entire query string. For each occurrence, the relevance number would be increased by 2^n, where n is the number of words in the query string.
Searching for partial matches, again weighted by 2^n, with n being the number of words matched.
Search for individual query words, giving each match a weight of 1.
Lastly, in each of the above searches, the matching words/phrases should be removed from $sentence so they aren't counted more than once.
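
The steps above could be sketched roughly as follows. This is only one possible reading, not a definitive implementation: it treats a "partial match" as a contiguous run of two or more query words, and it takes the query as a second parameter ($query), which is an assumption, since the snippet above calls calcRel() with the sentence alone.

```php
<?php
// Hypothetical sketch of calcRel(). The $query parameter is an assumption;
// the original snippet calls calcRel($sentence) with a single argument.
function calcRel($sentence, $query)
{
    $words = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    $score = 0;

    // Longest phrases first: the full query, then shorter contiguous runs
    // of query words (assumed meaning of "partial match"), then single words.
    for ($n = $total; $n >= 1; $n--) {
        // Phrases of two or more words weigh 2^n; single words weigh 1.
        $weight = ($n > 1) ? pow(2, $n) : 1;
        for ($start = 0; $start + $n <= $total; $start++) {
            $phrase = implode(' ', array_slice($words, $start, $n));
            $pattern = '/\b' . preg_quote($phrase, '/') . '\b/i';
            // Remove each match so it is not counted again at a lower weight.
            $sentence = preg_replace($pattern, ' ', $sentence, -1, $count);
            $score += $count * $weight;
        }
    }
    return $score;
}
```

With the query "fulltext search", a sentence containing the whole phrase once scores 4, while a sentence containing only the word "search" scores 1.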

An alternate strategy could be just to scan the entire text for the search terms, recording the position of each match. Then using simple arithmetic, you can find the tightest cluster of search keywords and select your excerpt that way, truncating at word boundaries or sentence boundaries.
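
A minimal sketch of that cluster approach, under some assumptions: the function name clusterExcerpt(), its signature, and the default width of 150 bytes are all hypothetical. It records every match offset, finds the window of $width bytes that starts at a match and contains the most matches, and trims the end of the excerpt back to a word boundary.

```php
<?php
// Hypothetical helper; the name, signature, and default width are assumptions.
function clusterExcerpt($text, array $terms, $width = 150)
{
    // Record the byte offset of every case-insensitive term match.
    $positions = array();
    foreach ($terms as $term) {
        if ($term === '') {
            continue; // an empty needle would never advance the offset
        }
        $offset = 0;
        while (($pos = stripos($text, $term, $offset)) !== false) {
            $positions[] = $pos;
            $offset = $pos + strlen($term);
        }
    }
    if (!$positions) {
        return substr($text, 0, $width); // no matches: fall back to the start
    }
    sort($positions);

    // Two-pointer scan: for each match, count the matches that begin
    // within $width bytes of it, and remember the densest window.
    $bestStart = $positions[0];
    $bestCount = 0;
    $right = 0;
    $n = count($positions);
    for ($left = 0; $left < $n; $left++) {
        while ($right < $n && $positions[$right] - $positions[$left] < $width) {
            $right++;
        }
        if ($right - $left > $bestCount) {
            $bestCount = $right - $left;
            $bestStart = $positions[$left];
        }
    }

    // Cut the window, then trim back to the last space so the excerpt
    // does not end in the middle of a word.
    $excerpt = substr($text, $bestStart, $width);
    if ($bestStart + $width < strlen($text)) {
        $cut = strrpos($excerpt, ' ');
        if ($cut !== false) {
            $excerpt = substr($excerpt, 0, $cut);
        }
    }
    return $excerpt;
}
```

The sketch works on bytes, so multibyte text would probably need the mb_* string functions instead. Because stripos() matches substrings, the excerpt can also begin mid-word when a term occurs inside a longer word.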

Calvin 2009-05-01 03:58:53

Answer 2

A:

Try preg_match() together with preg_replace().
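
As a rough illustration of the regex route for highlighting, the sketch below wraps each matched search term in a tag. The helper name highlightTerms() and the choice of <strong> are assumptions. It uses preg_split() rather than preg_replace() so that each piece of text can be HTML-escaped separately before the tags are added.

```php
<?php
// Hypothetical helper; the name and the <strong> wrapper are assumptions.
function highlightTerms($excerpt, array $terms)
{
    // Escape each term so it is matched literally inside the pattern.
    $quoted = array();
    foreach ($terms as $term) {
        if ($term !== '') {
            $quoted[] = preg_quote($term, '/');
        }
    }
    if (!$quoted) {
        return htmlspecialchars($excerpt, ENT_QUOTES, 'UTF-8');
    }
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/i';

    // With one capture group, the split result alternates between
    // plain text (even indexes) and matched terms (odd indexes).
    $parts = preg_split($pattern, $excerpt, -1, PREG_SPLIT_DELIM_CAPTURE);
    $out = '';
    foreach ($parts as $k => $part) {
        $escaped = htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
        $out .= ($k % 2) ? '<strong>' . $escaped . '</strong>' : $escaped;
    }
    return $out;
}
```

Escaping after the split means a term can never be highlighted inside an HTML entity such as &amp;amp;.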

2009-05-09 20:24:10
