ansaurus

Question

Answer 1

+2 A:

The way I would do it is to parse the page:

Skip over everything starting with '<' (the tags).
When you encounter a "." or a capital letter [A-Z], start putting characters into a buffer until you find another ".".
If the buffered string contains the search keyword, that's your string! Otherwise, start buffering again at the "." you just encountered and repeat.

EDIT: As James Curran pointed out, this strategy would fail in some cases, so here's a revised solution:

Take the first X characters of the page text (after the tags), then search for your keyword while buffering the 2 previous words. When you find it, output something like this: {X} ... {prev-2} {keyword} {next-2}

Example: This planet has - or rather had - a problem, which was this: most of the people living on it were unhappy for pretty much of the time. Many solutions were suggested for this problem, but most of these were largely concerned with the movement of small green pieces of paper, which was odd because on the whole it wasn't the small green pieces of paper that were unhappy.

Search Keyword: "suggested"

Result: This planet has - or rather had - a problem ... Many solutions were suggested for this problem...
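
The revised strategy could be sketched roughly as below. The function name keyword_excerpt and its parameters are made up for illustration, and the number of context words is a parameter because the example above shows three words on each side rather than two. It assumes single-byte text and does not handle the keyword falling inside the lead-in.

```php
<?php
// Hypothetical helper: the first $leadChars characters of the text, then the
// keyword with $contextWords words on either side, joined by " ... ".
function keyword_excerpt(string $html, string $keyword, int $leadChars, int $contextWords = 2): ?string
{
    // Drop markup and collapse whitespace so the text splits cleanly into words.
    $text = trim(preg_replace('/\s+/', ' ', strip_tags($html)));
    $words = explode(' ', $text);

    // Find the first word that contains the keyword (case-insensitive).
    $hit = null;
    foreach ($words as $i => $word) {
        if (stripos($word, $keyword) !== false) {
            $hit = $i;
            break;
        }
    }
    if ($hit === null) {
        return null; // keyword not present
    }

    // Words from $contextWords before the hit to $contextWords after it.
    $from = max(0, $hit - $contextWords);
    $context = implode(' ', array_slice($words, $from, $hit - $from + $contextWords + 1));

    // substr() counts bytes, so the lead-in may end in the middle of a word.
    $lead = rtrim(substr($text, 0, $leadChars));

    return $lead . ' ... ' . $context . '...';
}
```

Splitting on words instead of periods sidesteps the sentence-boundary problem entirely, at the cost of excerpts that may start or end mid-sentence.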

Mostlyharmless 2008-10-10 14:15:35

Answer 2

+4 A:

Even that will ultimately fail. Given the sentence "We went to Dr. Smith's office", if your search term is "office", virtually any criterion you use will give you "Smith's office" as your sentence.
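
To make the failure concrete, here is a minimal sketch of the naive "split on periods" approach. The function name naive_sentence is hypothetical; it only illustrates the problem and is not a recommended implementation.

```php
<?php
// Naive approach: treat every '.' as a sentence boundary and return the
// first piece that contains the search term.
function naive_sentence(string $text, string $term): ?string
{
    foreach (explode('.', $text) as $piece) {
        if (stripos($piece, $term) !== false) {
            return trim($piece);
        }
    }
    return null; // term not found
}

echo naive_sentence("We went to Dr. Smith's office.", 'office'), PHP_EOL;
// prints: Smith's office
```

The period in the abbreviation "Dr." is indistinguishable from a sentence end, so the result loses "We went to Dr".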

James Curran 2008-10-10 14:18:14

I posted a slight change to the strategy. Can you see any bug in that one?

Mostlyharmless 2008-10-10 14:31:07

Answer 3

+1 A:

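For step 3: take the substring that ends where you want to search backward from, reverse it, and find the position of the first '.'; then subtract that value from the position of your search string.

// Reverse the text before the match so the first '.' found is the closest one preceding it.
$offset = strpos(strrev(substr($string, 0, $searchlocation)), '.');
// No preceding '.' means the sentence starts at the beginning of the string.
$startloc = ($offset === false) ? 0 : $searchlocation - $offset;
$finalstring = substr($string, $startloc, 200);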

That may be off by 1, but I think it'll get the job done. Seems like there should be a shorter way to do it.
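
One possibly shorter way, sketched here as an assumption and not as the original author's code, is to let strrpos() search backward over the text before the match, so nothing has to be reversed. The helper name sentence_start is hypothetical.

```php
<?php
// Hypothetical helper: offset just past the last '.' before $searchlocation,
// or 0 when no '.' precedes it.
function sentence_start(string $string, int $searchlocation): int
{
    $dot = strrpos(substr($string, 0, $searchlocation), '.');
    return ($dot === false) ? 0 : $dot + 1;
}

$string = 'First sentence. Second one mentions the keyword here.';
$searchlocation = stripos($string, 'keyword');
// ltrim() drops the space that follows the period.
$finalstring = ltrim(substr($string, sentence_start($string, $searchlocation), 200));
```

This has the same weakness with abbreviations as any other approach that treats '.' as a sentence boundary.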

acrosman 2008-10-10 14:20:49

James Curran's answer also applies here: this would still fail for "Dr. Smith's office".

acrosman 2008-10-10 14:22:20

Answer 4

+1 A:

Instead of trying to find sentences, I'd think about how much context, in words, I need around the search term. Then go backwards some fraction of that number of words (or to the beginning) and forward the remaining number of words to select the rest of the context. This way, you just split the entire corpus on whitespace, find the first occurrence of the term (perhaps using a fuzzy match to find subterms and account for punctuation), and apply the above algorithm. You could even be creative about introducing ellipses if the first non-selected term doesn't end in punctuation, etc.
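
A rough sketch of this word-window idea follows. The function name word_window, the default of 20 context words, and the 0.5 back fraction are all assumptions chosen for illustration; the "fuzzy match" is reduced to a case-insensitive substring test, and the trailing-ellipsis rule is one possible reading of the suggestion above.

```php
<?php
// Hypothetical helper: $totalWords words of context around the first match,
// with $backFraction of them taken from before the match.
function word_window(string $text, string $term, int $totalWords = 20, float $backFraction = 0.5): ?string
{
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    // Substring match, so "office." still matches the term "office".
    $hit = null;
    foreach ($words as $i => $word) {
        if (stripos($word, $term) !== false) {
            $hit = $i;
            break;
        }
    }
    if ($hit === null) {
        return null; // term not found
    }

    // Go back a fraction of the budget (or to the beginning), then take
    // the matched word plus whatever remains of the budget going forward.
    $back = (int) floor($totalWords * $backFraction);
    $start = max(0, $hit - $back);
    $slice = array_slice($words, $start, $totalWords + 1);
    $end = $start + count($slice); // index one past the last selected word

    $excerpt = implode(' ', $slice);
    // Leading ellipsis when the word just before the window does not end a sentence.
    if ($start > 0 && !preg_match('/[.!?]$/', $words[$start - 1])) {
        $excerpt = '... ' . $excerpt;
    }
    // Trailing ellipsis when text remains and the window does not end a sentence.
    if ($end < count($words) && !preg_match('/[.!?]$/', $words[$end - 1])) {
        $excerpt .= ' ...';
    }
    return $excerpt;
}
```

Because the window is counted in words, abbreviations such as "Dr." no longer matter: they only influence whether an ellipsis is added.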

tvanfosson 2008-10-10 14:53:34


Find beginning of sentence in String
