I will have a string like this:

Bob is a boy. Bob is 1000 years old! <b>Bob loves you!</b> Do you love bob?

I want to parse it into an array, using the following delimiters to identify each array element:

.
!
?
<b> and </b>

So I will have an array with the following structure:

[0]Bob is a boy.
[1]Bob is 1000 years old!
[2]Bob loves you!
[3]Do you love bob?

Any ideas?

As you can see, I'd also like the text between <b> and </b> to be extracted. Previously I was using the following regexp to do that:

preg_match_all(":<b>(.*?)</b>:is", $text, $matches);
+1  A: 

If nobody provides a better solution, this almost works:

(?:<b>|[.!?]*)((?:[^<]+?)(?:[.!?]+|</b>))\s+

The only catch is that it returns Bob loves you!</b> as the third match, which can be cleaned up by applying strip_tags() to the results, I guess...
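A minimal sketch of that cleanup, using the sample string from the question. Two details here are my own assumptions rather than part of the pattern above: `~` is used as the delimiter so that `</b>` needs no escaping, and the trailing `\s+` is relaxed to `(?:\s+|$)` so the final sentence still matches when nothing follows it.

```php
<?php
$text = 'Bob is a boy. Bob is 1000 years old! <b>Bob loves you!</b> Do you love bob?';

// Same pattern as above, except the trailing \s+ is relaxed to (?:\s+|$)
// (an assumption) so the last sentence is not lost at the end of the string.
$pattern = '~(?:<b>|[.!?]*)((?:[^<]+?)(?:[.!?]+|</b>))(?:\s+|$)~';

preg_match_all($pattern, $text, $matches);

// The third capture still ends in "</b>"; strip_tags() removes it.
$sentences = array_map('strip_tags', $matches[1]);

print_r($sentences);
```

On the sample string this should leave four clean sentences in `$sentences`, with the stray closing tag gone from the third one.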

serg
Can't you move the closing brackets before the </b> so that it isn't captured?
Gazler
Not sure how to do that :)
serg
A: 

Divide and conquer?

Assume $myString is your string.

First grab your quoted stuff:

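// Anchor the pattern and make the last group greedy, otherwise it captures nothing.
preg_match('/^(.*?)<b>(.*?)<\/b>(.*)$/s', $myString, $m);

Now you have $m[1], $m[2], and $m[3] (the text before, inside, and after the bold tags).

// PREG_SPLIT_NO_EMPTY drops the empty piece left after the final punctuation mark.
$firstMatches = preg_split('/[.!?]/', $m[1], -1, PREG_SPLIT_NO_EMPTY);

$lastMatches = preg_split('/[.!?]/', $m[3], -1, PREG_SPLIT_NO_EMPTY);

Then get your punctuation back:

function addPunctuation($matches, $myString)
{
    $punctuatedResults = array();
    foreach ($matches as $match)
    {
        // Skip pieces that are only whitespace, such as the gap before <b>.
        if (trim($match) === '') continue;
        // $position is the offset of the start of the match (first occurrence only).
        $position = strpos($myString, $match);
        // The character right after the match is its punctuation mark.
        $punctMark = substr($myString, $position + strlen($match), 1);
        $punctuatedResults[] = trim($match) . $punctMark;
    }
    return $punctuatedResults;
}

$allMatches = addPunctuation($firstMatches, $myString);
$allMatches[] = $m[2];

$allMatches = array_merge($allMatches, addPunctuation($lastMatches, $myString));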
Zak
+1  A: 

I think this should accomplish what you're going for:

$string = 'Bob is a boy. Bob is 1000 years old! <b>Bob loves you!</b> Do you love bob?'; 

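// parser: split on ., !, ? and on the <b> / </b> tags, keeping each piece's offset
// (no | inside the character class: there it would be a literal pipe delimiter)
$array = preg_split('/[.!?]|\s*<b>|<\/b>\s*/', $string, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE);
// re-attach the character that follows each piece in the original string
foreach ($array as $key => $element) {
    $array[$key] = trim($element[0]) . substr($string, $element[1] + strlen($element[0]), 1);
}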

print_r($array);

It yields:

Array
(
    [0] => Bob is a boy.
    [1] => Bob is 1000 years old!
    [2] => Bob loves you!
    [3] => Do you love bob?
)

The first line of the parser grabs each of the strings of text between the delimiters and their offsets in the string. The second line adds the punctuation marks from the original string to the end of each element.
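To make the two steps easier to follow, here is a small sketch that prints the intermediate result of the split before any punctuation is re-attached. The variable names `$pieces`, `$offsets` and `$marks` are only illustrative, and the pattern is written without the `|` inside the character class.

```php
<?php
$string = 'Bob is a boy. Bob is 1000 years old! <b>Bob loves you!</b> Do you love bob?';

// With PREG_SPLIT_OFFSET_CAPTURE each element is a pair: [text, offset in $string].
$pieces = preg_split('/[.!?]|\s*<b>|<\/b>\s*/', $string, -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE);

$offsets = [];
$marks = [];
foreach ($pieces as $piece) {
    $offsets[] = $piece[1];
    // The delimiter character sits right after the piece: offset + length.
    $marks[] = substr($string, $piece[1] + strlen($piece[0]), 1);
    echo $piece[1], ': "', $piece[0], '" followed by "', end($marks), "\"\n";
}
```

Note that the second piece keeps its leading space at this stage, which is why the parser calls trim() before appending the punctuation mark.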

Alan Christopher Thomas