I'm trying to split the following text:

<word>test</word><word>test2</word>

and so on, using the following regex:

preg_split(":</?word>:is", $html);

I get test and test2 as the result, but what I need is to retain the <word> and </word> tags, so that instead of just test and test2 I get another 4 elements with the matching tags in them.

How can this be done?

+2  A: 

First of all: use a parser to modify XML (something like SimpleXML or DOM could suit you fine, depending on what you do with the result next).
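
For what it's worth, here is a minimal sketch of that parser route using DOM. It assumes the fragment is well-formed XML, and it wraps the fragment in a made-up <root> element (not part of the original input) because the fragment has no single root. $parts is just an illustrative name.

```php
<?php
$html = '<word>test</word><word>test2</word>';

// Assumption: wrap the fragment in a hypothetical <root> so it parses as one document.
$doc = new DOMDocument();
$doc->loadXML('<root>' . $html . '</root>');

$parts = array();
foreach ($doc->documentElement->childNodes as $node) {
    // saveXML($node) serialises just this node, tags included.
    $parts[] = $doc->saveXML($node);
}

print_r($parts);
```

Each entry then holds one whole element, tags and all. Any text sitting between the elements would come through as its own entry.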

However, for the sake of argument:

preg_split(":(</?word>):",
    "<word>test</word><word>test2</word>",
    0,
    PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);
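
To make the two flags concrete, here is a small runnable sketch of the call above ($parts is an illustrative variable name, not something from the question). PREG_SPLIT_DELIM_CAPTURE returns whatever the parenthesised group matched as an element of its own, and PREG_SPLIT_NO_EMPTY drops the empty strings that would otherwise show up before, between and after adjacent tags.

```php
<?php
$parts = preg_split(
    ":(</?word>):",
    "<word>test</word><word>test2</word>",
    0, // 0 (like -1) means "no limit"
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
);

// Six elements: '<word>', 'test', '</word>', '<word>', 'test2', '</word>'
var_export($parts);
```

That is the two text pieces from the question plus the four tag elements.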
Wrikken
What's with the `is` modifiers; I'd give a vote if they weren't just copy/pasted from the question.
salathe
Ah, yes, wholly unnecessary indeed. I'll edit them out. (I do remember when starting out with regexes years ago I typed `/six` almost per default :), at this moment I was just lazy c/p-ing of course... :P )
Wrikken
And here is your upvote, thank you for indulging a persnickety regex-author. :-)
salathe
And right you are to point it out ;)
Wrikken
A: 

First off, NEVER USE REGEX TO PARSE HTML.

But to solve your problem, look at the flags for preg_split()

// Keep the result: the loop further down reads it from $words
$words = preg_split(
    ":(</?word>):is",
    $html,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

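Now, for an input such as <word>test</word>, <word>test2</word> (note the ", " between the two elements), it'll split the string and give you this: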

array(7) {
  [0]=>
  string(6) "<word>"
  [1]=>
  string(4) "test"
  [2]=>
  string(7) "</word>"
  [3]=>
  string(2) ", "
  [4]=>
  string(6) "<word>"
  [5]=>
  string(5) "test2"
  [6]=>
  string(7) "</word>"
}

Still no good. But what we can do is loop over the array, attaching each <word> to the element that follows it and each </word> to the element before it:

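$buffer = '';
$newWords = array();
foreach ($words as $word) {
    if (strcasecmp($word, '<word>') === 0) {
        // Opening tag: hold it until the text that follows arrives
        $buffer .= $word;
    } elseif (strcasecmp($word, '</word>') === 0) {
        if ($buffer !== '' || empty($newWords)) {
            // Empty element or stray closing tag: start a new entry
            $newWords[] = $buffer . $word;
        } else {
            // Append the closing tag to the most recent entry
            end($newWords);
            $newWords[key($newWords)] .= $word;
        }
        $buffer = '';
    } else {
        // Plain text: prefix it with any pending opening tag
        $newWords[] = $buffer . $word;
        $buffer = '';
    }
}
if ($buffer !== '') {
    // An unclosed opening tag was left over at the end of the input
    $newWords[] = $buffer;
}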

Which would give you:

array(3) {
  [0]=>
  string(17) "<word>test</word>"
  [1]=>
  string(2) ", "
  [2]=>
  string(18) "<word>test2</word>"
}
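
If whole elements are all you're after, a different pattern might let you skip the regrouping loop altogether. This is a sketch of an alternative, not what the code above does: it captures an entire <word>...</word> pair as the delimiter. It assumes the elements are never nested and carry no attributes, and $parts is an illustrative name.

```php
<?php
$html = '<word>test</word>, <word>test2</word>';

// Capture a whole element; the lazy .*? stops at the nearest closing tag,
// and the s modifier lets the dot match newlines inside an element.
$parts = preg_split(
    '~(<word>.*?</word>)~is',
    $html,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

// Three elements: '<word>test</word>', ', ', '<word>test2</word>'
var_export($parts);
```

The same caveat about regex and markup applies here: anything more irregular than this input is a job for a parser.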
ircmaxell