Hello there,

I'm trying to write my own BBCode parser for my website, and I'm looking for a way to apply htmlentities() to everything except the PRE tags and the code inside them.

For example:

<b>Hello world</b> (outputs &lt;b&gt;Hello world&lt;/b&gt;)
<pre>"This must not be converted to HTML entities"</pre> (outputs <pre>"This must not be converted to HTML entities"</pre>)

I really have no idea how to do this.

Any kind of help would be appreciated :)

Thanks.

A: 

Personally I would accomplish this with a simple state machine:

$text = <<<END
<b>Hello, world!</b>
<pre>Hello there<br/></pre>
END;

$segments = preg_split('/(<\/?pre>)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

// $state = 0 if outside of a pre
// $state = 1 if inside of a pre
$state = 0;
foreach ($segments as &$segment) {
    if ($state == 0) {
        if ($segment == '<pre>')
            $state = 1;
        else
            $segment = htmlentities($segment);
    } else if ($state == 1) {
        if ($segment == '</pre>')
            $state = 0;
    }
}
// break the reference left over from the foreach
unset($segment);

$entityText = implode($segments);

print $entityText;

Output:

&lt;b&gt;Hello, world!&lt;/b&gt;
<pre>Hello there<br/></pre>

Note that the above code does not handle nested pre tags. If you need that, track the nesting depth instead of a single state:

$segments = preg_split('/(<\/?pre>)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

// $depth = how many nested pres we're inside of.
$depth = 0;
foreach ($segments as &$segment) {
    if ($depth == 0 && $segment != '<pre>')
        $segment = htmlentities($segment);
    else if ($segment == '<pre>')
        $depth++;
    else if ($depth > 0 && $segment == '</pre>')
        $depth--;
}
// break the reference left over from the foreach
unset($segment);

$entityText = implode($segments);
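
If you want to reuse the depth-tracking loop, it can be wrapped in a function. This is only a sketch: the name escapeOutsidePre is made up for this example, and the "i" flag together with strcasecmp() is an assumption that you also want to accept <PRE> written in upper case.

```php
<?php
// Hypothetical helper (the name is not part of the code above): escapes
// everything except <pre> blocks, tracking the nesting depth.
function escapeOutsidePre($text)
{
    // Assumption: the "i" flag also splits on <PRE> and </PRE>.
    $segments = preg_split('/(<\/?pre>)/i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $depth = 0;
    foreach ($segments as $i => $segment) {
        $isOpen  = strcasecmp($segment, '<pre>') === 0;
        $isClose = strcasecmp($segment, '</pre>') === 0;

        if ($isOpen) {
            $depth++;
        } elseif ($isClose && $depth > 0) {
            $depth--;
        } elseif ($depth === 0) {
            // Outside of any pre (this includes a stray </pre>).
            $segments[$i] = htmlentities($segment);
        }
    }

    return implode('', $segments);
}

echo escapeOutsidePre("<b>Hello, world!</b>\n<pre>Hello there<br/></pre>");
```

Indexing into $segments instead of iterating by reference avoids having to unset() the loop variable afterwards.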
Sebastian P.
+1  A: 

If it's for practice, OK. But if you just want the feature, don't reinvent the wheel. Parsing is not an easy task, and there are plenty of mature parsers out there. I would look at the PEAR packages first; try HTML_BBCodeParser.

If you really want to do it yourself, you have two options:

  • regexp
  • state machines

Usually a mix of both is handy. But because tags can be nested and badly formed, this is really hard to code. At the very least, use generic parser code and define your own lexical fields; written from scratch, it will take as much time as coding the rest of the website.

Btw: using a BBCode parser does not free you from sanitizing the user input...
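
To illustrate that last point, here is a minimal sketch of the regexp approach that escapes the user input before turning BBCode into HTML. The function name bbcodeToHtml and the choice to support only [b] and [i] are assumptions made for this example; it does nothing about badly nested tags.

```php
<?php
// Hypothetical mini-converter, for illustration only: supports [b] and [i].
function bbcodeToHtml($input)
{
    // Escape first, so any HTML typed by the user is shown as text.
    $safe = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    // Then turn the known BBCode pairs into HTML tags.
    // Square brackets are not touched by htmlspecialchars(), so they survive.
    $patterns = array(
        '/\[b\](.*?)\[\/b\]/is' => '<b>$1</b>',
        '/\[i\](.*?)\[\/i\]/is' => '<i>$1</i>',
    );

    return preg_replace(array_keys($patterns), array_values($patterns), $safe);
}

echo bbcodeToHtml('[b]Hello[/b] <script>alert(1)</script>');
// prints: <b>Hello</b> &lt;script&gt;alert(1)&lt;/script&gt;
```

The order matters: escaping after the conversion would also escape the tags you just generated.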

EDIT: I'm in a good mood today, so here is a snippet showing how to use HTML_BBCodeParser:

// if you don't know how to use PEAR, you'd better learn that quickly
// add the PEAR directory to the include path
ini_set("include_path", ini_get("include_path") . PATH_SEPARATOR . "/usr/share/pear");
// include PEAR and the parser
require_once("PEAR.php");
require_once("HTML/BBCodeParser.php");

// you can tweak settings from an ini file
$config = parse_ini_file("BBCodeParser.ini", true);
$options = &PEAR::getStaticProperty("HTML_BBCodeParser", "_options");
$options = $config["HTML_BBCodeParser"];

// here starts the parsing
$parser = new HTML_BBCodeParser();
$parser->setText($the_mighty_BBCode);
$parser->parse();
$parsed = $parser->getParsed();

// don't forget to clean that
echo htmlspecialchars(strip_tags($parsed));
e-satis
A: 

You could escape everything first and then convert the &lt;pre&gt; … &lt;/pre&gt; back to <pre> … </pre>:

// convert everything
$str = htmlspecialchars($str);
// convert <pre> back
$str = preg_replace('/&lt;pre&gt;((?:[^&]+|&(?!lt;\\/pre&gt;))*)&lt;\\/pre&gt;/s', '<pre>$1</pre>', $str);
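
Note that with this replacement only the tags are restored; the text between them stays escaped. If the content of the pre block should come out untouched as well, one possible variant uses preg_replace_callback() and htmlspecialchars_decode() on the captured part. This is a sketch under assumptions: the function name escapeExceptPre is made up, and the lazy (.*?) pattern is a simplification of the regex above.

```php
<?php
// Hypothetical helper: escapes everything, then restores whole <pre> blocks.
function escapeExceptPre($str)
{
    // Explicit flags so that escaping and decoding are symmetric.
    $flags = ENT_QUOTES;
    $escaped = htmlspecialchars($str, $flags, 'UTF-8');

    return preg_replace_callback(
        // Lazy match up to the first escaped </pre>; "s" lets . span newlines.
        '/&lt;pre&gt;(.*?)&lt;\/pre&gt;/s',
        function ($m) use ($flags) {
            // Undo the escaping for the content of the pre block only.
            return '<pre>' . htmlspecialchars_decode($m[1], $flags) . '</pre>';
        },
        $escaped
    );
}

echo escapeExceptPre('<b>Hello world</b><pre>"Not converted" <br/></pre>');
// prints: &lt;b&gt;Hello world&lt;/b&gt;<pre>"Not converted" <br/></pre>
```

Like the regex above, this does not handle nested pre tags.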
Gumbo