I am currently performing a full-text search on the "pages" in my database. Users get the results they want, but I am unable to show them why those particular results came up.

Specifications on what I am looking for:

  1. My data is HTML. If you search for a term such as "test" and the resulting page contains <b>here is some test</b> page, I should be able to highlight the term without adversely affecting the HTML code on the page.
  2. I only want to return a portion of the document, much like Google does, where the portion returned contains a good share of my search terms. How can I determine which portion contains the most terms? Is it best to pick the section with the most term occurrences overall, the section with the most distinct search terms, or a combination of both? Or should multiple snippets be included?
  3. I would like to do this server-side, if that is a viable option.

I am unsure of the best way to go about these things. I do know of one easily overlooked issue that needs to be taken into account:

a. Snipping off HTML at arbitrary points can completely ruin the page if you are not careful; for example, an unclosed div tag can throw my whole layout off. What are the best ways around this?

What are the best methods for achieving a search system like the one above?

A: 
  1. I would not keep the HTML formatting in the search results. That would make your results page very messy. It doesn't make sense to include headings, line breaks, images, paragraph margins, etc. in the result descriptions--especially if you're only going to print a short, truncated excerpt.
  2. I think in most cases, a result that matches 100% of the search terms once is going to be more relevant than a result that matches only 50% of the search terms twice. But this also depends on the query.
  3. Server-side is the only viable option, unless you want to send the client all of the result pages at once.

Since you're using MySQL's built-in fulltext search function, you can't really show the user why the results are what they are--not without a detailed understanding of how the fulltext search determines relevance. What you can do is show the user excerpts from each page that may be relevant to their search and may help them make useful determinations of which results to look into.

I would first strip the page content of any markup using strip_tags(), then explode() the content into an array of individual sentences. Then you could iterate through the array to determine the relevance of each sentence and then simply display the most relevant sentence(s) to the user. If the most relevant sentence is too long, then truncate it at word boundaries.

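$text = strip_tags($content);       // remove markup, as described above
$sentences = explode('. ', $text);  // naive sentence split on period + space
$relevance = array();
foreach ($sentences as $i => $sentence) {
    $relevance[$i] = calcRel($sentence);
}
arsort($relevance);                 // best score first; keys (sentence indexes) are kept
$top = array_slice(array_keys($relevance), 0, 2);
sort($top);                         // put the two best sentences back in document order
if (count($top) < 2) {
    $description = $sentences[$top[0]];   // only one sentence available
} else {
    list($i, $j) = $top;
    // Adjacent sentences get their period back; a gap is marked with an ellipsis.
    $glue = ($j - $i > 1) ? ' ... ' : '. ';
    $description = $sentences[$i] . $glue . $sentences[$j];
}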
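
For the "truncate it at word boundaries" part, something like the following might do. It is only a sketch: truncateAtWord(), the 160-character default and the '...' suffix are made up for this example, and it counts bytes, so for multibyte text you would want the mb_* equivalents.

```php
<?php
// Hypothetical helper (not part of the answer above): cut $text down to at
// most $limit bytes without splitting a word, then append $suffix.
function truncateAtWord(string $text, int $limit = 160, string $suffix = '...'): string
{
    if (strlen($text) <= $limit) {
        return $text;                       // already short enough
    }
    // Take one extra character so a space sitting exactly at the limit is seen.
    $cut = substr($text, 0, $limit + 1);
    $lastSpace = strrpos($cut, ' ');
    if ($lastSpace === false) {
        // One very long word with no space: fall back to a hard cut.
        return substr($text, 0, $limit) . $suffix;
    }
    return rtrim(substr($cut, 0, $lastSpace)) . $suffix;
}
```

You would call it on $description (or on a single long sentence) just before printing the excerpt.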

calcRel($sentence) would return a numeric value representing relevance calculated by:

  1. Search $sentence for the entire query string. For each occurrence, increase the relevance number by 2^n, where n is the number of words in the query string.
  2. Search for partial matches, again weighted by 2^n, where n is the number of words matched.
  3. Search for individual query words, giving each match a weight of 1.
  4. In each of the above searches, remove the matching words/phrases from $sentence so they aren't counted more than once.
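
One way the four steps could be put together is sketched below. Treat it as an assumption-laden example rather than the definitive calcRel(): here the query is passed in as a second parameter (the snippet above calls calcRel($sentence) only), "partial matches" is read as runs of consecutive query words, and matching is case-insensitive on word boundaries, so terms like "c++" would need extra care.

```php
<?php
// Sketch of calcRel(); the $query parameter is an addition made for this example.
function calcRel(string $sentence, string $query): int
{
    $words = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);
    $n = count($words);
    $score = 0;
    // Longest phrases first: the whole query (step 1), then shorter runs of
    // consecutive query words (step 2), then single words (step 3).
    for ($len = $n; $len >= 1; $len--) {
        for ($start = 0; $start + $len <= $n; $start++) {
            $parts = array_map(
                function ($w) { return preg_quote($w, '/'); },
                array_slice($words, $start, $len)
            );
            $pattern = '/\b' . implode('\s+', $parts) . '\b/i';
            // Single words weigh 1, unless the query itself is a single word.
            $weight = ($len === 1 && $n > 1) ? 1 : 2 ** $len;
            // Step 4: replace each match with a marker so that shorter
            // phrases cannot count the same words again.
            $sentence = preg_replace($pattern, ' | ', $sentence, -1, $count);
            $score += $count * $weight;
        }
    }
    return $score;
}
```

For the query "full text search", the sentence "Full text search makes full text easy to search." would score 8 for the full phrase, 4 for the second "full text" and 1 for the remaining "search", so 13 in total.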

An alternative strategy is to scan the entire text for the search terms, recording the position of each match. Then, using simple arithmetic, you can find the tightest cluster of search keywords and select your excerpt that way, truncating at word boundaries or sentence boundaries.
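
The cluster idea could look roughly like this. It is a sketch under my own assumptions: tightestCluster() is a made-up name, "tightest cluster" is taken to mean the shortest span that contains every query word occurring in the text at least once, and offsets are in bytes on already-stripped text.

```php
<?php
// Hypothetical helper: returns [start, end] byte offsets of the shortest span
// of $text containing every term that occurs in it, or null if none occur.
function tightestCluster(string $text, array $terms): ?array
{
    // Record every match as [offset, length, term index].
    $hits = [];
    foreach (array_values($terms) as $t => $term) {
        $pattern = '/\b' . preg_quote($term, '/') . '\b/i';
        if (preg_match_all($pattern, $text, $m, PREG_OFFSET_CAPTURE)) {
            foreach ($m[0] as $match) {
                $hits[] = [$match[1], strlen($match[0]), $t];
            }
        }
    }
    if (!$hits) {
        return null;
    }
    usort($hits, function ($a, $b) { return $a[0] <=> $b[0]; });
    $needed = count(array_unique(array_column($hits, 2)));  // distinct terms found

    // Sliding window over the sorted hits: extend to the right, and shrink
    // from the left whenever the window covers all needed terms.
    $best = null;
    $counts = [];
    $have = 0;
    $left = 0;
    foreach ($hits as $hit) {
        $counts[$hit[2]] = ($counts[$hit[2]] ?? 0) + 1;
        if ($counts[$hit[2]] === 1) {
            $have++;
        }
        while ($have === $needed) {
            $start = $hits[$left][0];
            $end = $hit[0] + $hit[1];
            if ($best === null || $end - $start < $best[1] - $best[0]) {
                $best = [$start, $end];
            }
            if (--$counts[$hits[$left][2]] === 0) {
                $have--;
            }
            $left++;
        }
    }
    return $best;
}
```

From the returned offsets you would widen the span by some context on each side and snap the cut points to word or sentence boundaries before displaying it.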

Calvin
A: 
Try preg_match() together with preg_replace().
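
As a rough illustration of the preg_* route for highlighting without breaking the markup, here is a sketch. The function name highlightTerms() and the choice of <mark> as the wrapper are assumptions for the example, and splitting tags with a regex is fragile (comments, scripts, or attribute values containing ">" will trip it up), so a real DOM parser may be the safer choice.

```php
<?php
// Hypothetical helper: wraps each query word in <mark> tags, touching only the
// text between tags so existing markup and attribute values are left alone.
function highlightTerms(string $html, array $terms): string
{
    $terms = array_filter($terms, 'strlen');   // ignore empty terms
    if (!$terms) {
        return $html;
    }
    $quoted = array_map(function ($t) { return preg_quote($t, '/'); }, $terms);
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/i';
    // With PREG_SPLIT_DELIM_CAPTURE the result alternates text, tag, text, ...
    // so even indexes are text and odd indexes are the captured tags.
    $chunks = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($chunks as $k => $chunk) {
        if ($k % 2 === 0) {
            $chunks[$k] = preg_replace($pattern, '<mark>$1</mark>', $chunk);
        }
    }
    return implode('', $chunks);
}
```

With the example from the question, highlightTerms('<b>here is some test</b> page', ['test']) gives '<b>here is some <mark>test</mark></b> page', and a "test" inside an href attribute is left untouched.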