I'm searching for a function in PHP that puts every paragraph-level element, such as <p>, <ul> and <ol>, into an array, so that I can manipulate the paragraphs, for example displaying the first two and hiding the others.

This function does the trick for the p element. How can I adjust the regexp to also match ul and ol? My attempt gives an error complaining that < is not an operator...

function aantalP($in){
    preg_match_all("|<p>(.*)</p>|U",
     $in,
        $out, PREG_PATTERN_ORDER);
    return $out;
}

//tryout:
    function aantalPT($in){
        preg_match_all("|(<p> | <ol>)(.*)(</p>|</o>)|U",
         $in,
            $out, PREG_PATTERN_ORDER);
        return $out;
    }

Can anyone help me?

+3  A: 

You can't do this reliably with regular expressions. Paragraphs are mostly OK because they generally aren't nested (although they can be). Lists, however, are routinely nested, and that's one area where regular expressions fall down.

PHP has multiple ways of parsing HTML and retrieving selected elements. Just use one of those. It'll be far more robust.

Start with Parse HTML With PHP And DOM.
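
As a rough illustration of the DOM route, here is a minimal sketch that collects the top-level <p>, <ul> and <ol> blocks with DOMDocument and DOMXPath. The function name blockElements() is made up for this example, and it assumes the input is an HTML fragment that libxml wraps in <html><body> on load.

```php
<?php
// Hypothetical helper: returns the outer HTML of each top-level
// <p>, <ul> and <ol> element, in document order.
function blockElements(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Collect parser warnings about sloppy markup instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $blocks = [];
    // Only direct children of <body>, so a nested list stays inside its parent.
    foreach ($xpath->query('//body/*[self::p or self::ul or self::ol]') as $node) {
        $blocks[] = $doc->saveHTML($node);
    }
    return $blocks;
}
```

With the blocks in an array, showing only the first two is something like array_slice(blockElements($html), 0, 2). Note that saveHTML() may normalise whitespace in the serialised markup.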

If you really want to go down the regex route, start with:

function aantalPT($in){
  preg_match_all('!<(p|ol)>(.*)</\1>!Us', $in, $out);
  return $out;
}

Note: PREG_PATTERN_ORDER is not required as it is the default value.

Basically, use a backreference to find the matching tag. That will fail for many reasons such as nested lists and paragraphs nested within lists. And no, those problems are not solvable (reliably) with regular expressions.
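
To make the nesting problem concrete, here is a small sketch that runs the pattern above on a made-up nested list. The sample markup is an assumption for illustration only.

```php
<?php
// Made-up input: an ordered list containing a nested ordered list.
$html = '<ol><li>one<ol><li>nested</li></ol></li><li>two</li></ol>';

preg_match_all('!<(p|ol)>(.*)</\1>!Us', $html, $out);

// The ungreedy match stops at the FIRST </ol>, which belongs to the inner
// list, so the outer list is cut short and "two" is never captured.
echo $out[0][0], "\n"; // <ol><li>one<ol><li>nested</li></ol>
```

Only one match is returned, and it ends in the middle of the outer list.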

Edit: as (correctly) pointed out, your original regex is also flawed in that it uses a pipe as the delimiter while also using a pipe character inside the pattern. I generally use ! as the delimiter because it doesn't normally occur in the pattern (not in my patterns anyway). Some use forward slashes, but they appear in this pattern too. Tilde (~) is another reasonably common choice.

cletus
All right. So the way to do it is not with regexps. I'm looking into the PHP and DOM method. Is the DOMDocument object a standard feature?
blub
Your regex is flawed, see my answer point 1.
OIS
Yes, it is a standard feature. http://www.php.net/manual/en/dom.installation.php
cletus
+2  A: 
  • First of all, you use | as the delimiter marking the beginning and end of the regular expression, but you also use | as the "or" sign inside it. I suggest you replace the first and last | with #.
  • Secondly, you should use a backreference so that the end tag must match the captured start tag, like this: <(p|ul)>(.*?)</\1>
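
Putting the two points together, a minimal sketch could look like the following. The function name blocksByRegex() is made up, ol is added to the alternation to cover the question, and the s modifier is an assumption so that . also matches newlines. It still cannot handle nested lists.

```php
<?php
// Hypothetical helper: # as delimiter, so | is free to mean "or",
// and \1 forces the closing tag to match the opening one.
function blocksByRegex(string $in): array
{
    preg_match_all('#<(p|ul|ol)>(.*?)</\1>#s', $in, $out);
    return $out[0]; // the full matches, tags included
}

print_r(blocksByRegex('<p>a</p><ul><li>b</li></ul><ol><li>c</li></ol>'));
```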
OIS