I want to display the results of a search query on a website, each with a title and a short description. The short description should be a small part of the page that contains the search term. What I want to do is:

  1. Strip the tags from the page.

  2. Find the first position of the search term.

  3. From that position, go backwards to find the beginning (if there is one) of that sentence.

  4. Start at the position found in step 3 and display, e.g., 200 characters from there.

I need some help with step 3. I think I need a regex that finds the first capital letter or dot...

+2  A: 

The way I would do it is, I would parse the page...

  1. Skip over all the things starting with '<'

  2. When you encounter a "." or [A-Z], start putting characters into a buffer until you find another "."

  3. If the buffered string contains the search keyword, that's your string! Otherwise, start buffering again at the "." you encountered and repeat.
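A minimal PHP sketch of these three steps, assuming `strip_tags` is good enough for skipping markup and that a sentence simply ends at a "."; the function name `firstSentenceWith` is made up for illustration:

```php
<?php
// Hypothetical helper: return the first "sentence" that contains $term,
// where a sentence is naively defined as text ending in '.'.
function firstSentenceWith(string $html, string $term): ?string
{
    // Step 1: drop everything inside '<...>'.
    $text = strip_tags($html);

    // Step 2: split right after each '.', keeping the '.' with its sentence.
    $sentences = preg_split('/(?<=\.)\s*/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // Step 3: the first buffered sentence holding the keyword wins.
    foreach ($sentences as $sentence) {
        if (stripos($sentence, $term) !== false) {
            return trim($sentence);
        }
    }
    return null; // keyword not found
}
```

For "We went to Dr. Smith's office." and the term "office", this returns "Smith's office.", because the "." of an abbreviation looks exactly like the end of a sentence here.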

EDIT: As James Curran pointed out, this strategy would fail in some cases... So here's the solution:

What you can do is take the first X characters from the start of the page (after the tags) and then search for your keyword, buffering the 2 previous words. When you find it, output something like this: {X} ... {prev-2} {keyword} {next-2}

Example: This planet has - or rather had - a problem, which was this: most of the people living on it were unhappy for pretty much of the time. Many solutions were suggested for this problem, but most of these were largely concerned with the movement of small green pieces of paper, which was odd because on the whole it wasn't the small green pieces of paper that were unhappy.

Search Keyword: "suggested"

Result: This planet has - or rather had - a problem ... Many solutions were suggested for this problem...
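A rough PHP sketch of this layout, under a few assumptions: words are separated by whitespace, the keyword does not already fall inside the first X characters, and the text is single-byte (`substr` counts bytes). The function name `keywordSnippet` and its parameters are hypothetical:

```php
<?php
// Hypothetical helper: build "{X} ... {prev-2} {keyword} {next-2}...".
function keywordSnippet(string $html, string $term, int $headChars = 40, int $around = 2): ?string
{
    // Strip tags and collapse runs of whitespace into single spaces.
    $text  = trim(preg_replace('/\s+/', ' ', strip_tags($html)));
    $words = explode(' ', $text);

    foreach ($words as $i => $word) {
        if (stripos($word, $term) !== false) {
            // Take $around words before and after the first matching word.
            $from    = max(0, $i - $around);
            $context = implode(' ', array_slice($words, $from, $i - $from + $around + 1));
            return substr($text, 0, $headChars) . ' ... ' . $context . '...';
        }
    }
    return null; // keyword not found
}
```

Cutting the head at a fixed character count can split a word in half; trimming back to the previous space would be a natural refinement.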

Mostlyharmless
+4  A: 

Even that will ultimately fail. Given the sentence "We went to Dr. Smith's office", if your search term is "office", virtually any criterion you use will give you "Smith's office" as your sentence.

James Curran
I posted a slight change to the strategy... can you see any bug in that one?
Mostlyharmless
+1  A: 

For step 3: If you reverse the substring that ends where you want to search backward from, get the position of the first '.' and subtract that value from the position of your search string.

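// Reverse only the text before the match, so the nearest preceding '.' comes first.
$offset = strpos(strrev(substr($string, 0, $searchlocation)), '.');
// No '.' before the match: start at the beginning of the text.
$startloc = ($offset === false) ? 0 : $searchlocation - $offset;
$finalstring = ltrim(substr($string, $startloc, 200));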

That may be off by 1, but I think it'll get the job done. Seems like there should be a shorter way to do it.
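One possibly shorter route is `strrpos`, which finds the last occurrence of a character and so avoids reversing the string. The sketch below assumes a "." is the only sentence delimiter and that the text is single-byte; the function name `excerptFromSentenceStart` is hypothetical:

```php
<?php
// Hypothetical helper: excerpt that starts at the sentence containing $term.
function excerptFromSentenceStart(string $text, string $term, int $length = 200): ?string
{
    $pos = stripos($text, $term);
    if ($pos === false) {
        return null; // term not found
    }

    // Last '.' before the match; false means the match is in the first sentence.
    $dot   = strrpos(substr($text, 0, $pos), '.');
    $start = ($dot === false) ? 0 : $dot + 1;

    // Drop the whitespace that follows the '.' before returning.
    return ltrim(substr($text, $start, $length));
}
```

It has the same weakness with abbreviations: for "We went to Dr. Smith's office" and the term "office" it returns "Smith's office".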

acrosman
James Curran's answer also applies here; this would still fail for Dr. Smith's office.
acrosman
+1  A: 

I think instead of trying to find sentences, I'd think about how much context, in words, I need around the search term. Then go backwards some fraction of this number of words (or to the beginning) and forward the remaining number of words to select the rest of the context. In this way, you just split the entire corpus on whitespace, find the first occurrence of the term (perhaps using a fuzzy match to find subterms and account for punctuation), and apply the above algorithm. You could even be creative about introducing ellipses if the first non-selected term doesn't end in punctuation, etc.
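A small PHP sketch of this word-window idea. It makes some simplifying assumptions: the "fuzzy match" is just a case-insensitive substring test on each word, and an ellipsis is added whenever words were cut off, regardless of punctuation. The function name `wordWindow` and its parameters are hypothetical:

```php
<?php
// Hypothetical helper: $total words of context, a fraction of them before the match.
function wordWindow(string $text, string $term, int $total = 10, float $before = 0.4): ?string
{
    // Split the whole text on whitespace.
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    foreach ($words as $i => $word) {
        if (stripos($word, $term) === false) {
            continue; // not the term, keep scanning
        }
        // Go back a fraction of the window (or to the beginning) ...
        $start = max(0, $i - (int) floor($total * $before));
        // ... and take the rest of the window going forward.
        $slice = array_slice($words, $start, $total);
        $end   = $start + count($slice);

        $prefix = $start > 0 ? '... ' : '';
        $suffix = $end < count($words) ? ' ...' : '';
        return $prefix . implode(' ', $slice) . $suffix;
    }
    return null; // term not found
}
```

Because it never looks for sentence boundaries, "We went to Dr. Smith's office" with a four-word window and half of it before the match yields "... Dr. Smith's office".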

tvanfosson