I'd like to work on a BBCode filter for a PHP website. (I'm using CakePHP, so it would be a BBCode helper.) I have a few requirements.

BBCodes can be nested, so something like this is valid:

[block]  
    [block]  
    [/block]  
    [block]  
        [block]  
        [/block]  
    [/block]  
[/block]

BBCodes can have 0 or more parameters.

Example:
[video: url="url", width="500", height="500"]Title[/video]

BBCodes might have multiple behaviours.

Let's say [url]text[/url] would be transformed to [url:url="text"]text[/url], or the video BBCode would be able to choose between YouTube, Dailymotion, and so on.

I think that covers most of my needs. I have already done something with regex, but my biggest problem was matching parameters. I got nested BBCode and BBCode with 0 parameters to work, but when I added a regex match for parameters, it no longer matched nested BBCode correctly.

"\[($tag)(=.*)\"\](.*)\[\/\1\]" // It wasn't .* but the non-greedy matcher

I don't have the complete regex with me right now, but I had something that looked like the one above.

So is there a way to match BBCode efficiently with regex or something else? The only thing I can think of is to use the visitor pattern and split my text on each possible tag. This way I would have a bit more control over my text parsing, and I could probably validate my document, so that if the input text doesn't contain valid BBCode I could notify the user with an error before saving anything.
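
To illustrate the parameter problem in isolation, here is a minimal sketch that parses only the inside of one opening tag, separately from matching the closing tag. The function name `parseOpeningTag` is hypothetical, and it assumes parameters always look like key="value" with an optional colon after the tag name, as in the [video: ...] example above.

```php
<?php
// Hypothetical helper (not from any library): parses the inside of an
// opening tag, e.g. 'video: url="u", width="500"', into name + parameters.
function parseOpeningTag(string $inner): ?array
{
    // Tag name, then an optional ':' followed by the raw parameter list.
    if (!preg_match('/^(\w+)\s*(?::\s*(.*))?$/s', $inner, $m)) {
        return null; // not an opening tag (e.g. '/block')
    }

    $params = [];
    if (isset($m[2]) && $m[2] !== '') {
        // Assumption: each parameter is key="value"; separators are ignored.
        preg_match_all('/(\w+)\s*=\s*"([^"]*)"/', $m[2], $pairs, PREG_SET_ORDER);
        foreach ($pairs as $pair) {
            $params[$pair[1]] = $pair[2];
        }
    }

    return ['name' => $m[1], 'params' => $params];
}
```

Splitting the work this way keeps the nesting logic free of parameter syntax, which is where a single combined regex tends to break down.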

I would use sablecc to create my text parser. http://sablecc.org/

Any better ideas? Or anything that could lead to an efficient, flexible BBCode parser?

Thank you, and sorry for my bad English...

+2  A: 

Responding to: "Any better idea?" (and I'm assuming that this was an invite not just for improvement over bbcode-specific suggestions)

We recently looked at going the BBCode route and decided on using HTML Purifier instead. This decision was based in part on the (admittedly probably biased) comparisons between various methods listed by the HTML Purifier group here and the discussion of BBCode (again, by the HTML Purifier group) here

And for the record, I think your English was very good. I'm sure it's much better than I could do in your native language.

codemonkey
Ah, thank you. I'll probably include HTML Purifier. But because I'm not really a fan of things like FCKeditor, I'd say it will mostly be used to purify the HTML output. It looks very nice, though.
Sybiam
+7  A: 

There are several existing libraries for parsing BBCode; it may be easier to look into those than to roll your own.

Here are a couple; I'm sure there are more if you look around:
PECL bbcode
PEAR HTML_BBCodeParser

Chad Birch
A: 

There are both PECL and PEAR BBCode parsing libraries. Software's hard enough without reinventing years of work on your own.

If neither of those is an option, I'd concentrate on turning the BBCode into a valid XML string, and then using your favorite XML parsing routine on that. A very rough idea:

  1. Run the code through htmlspecialchars to escape any entities that need escaping

  2. Transform all [ and ] characters into < and > respectively

  3. Don't forget to account for the colon in cases like [tagname:

If the BBCode was nested properly, you should be all set to pass this string into an XML parsing object (SimpleXML, DOMDocument, etc.)
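
The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the function name `bbcodeToXmlString` is made up, parameters are assumed to be written as key="value", and only bracketed text that looks like a tag is converted (anything else stays as literal text). It is meant for pulling structure out of the BBCode, not for producing HTML to send back to a browser.

```php
<?php
// Hypothetical helper following the three steps above.
function bbcodeToXmlString(string $bbcode): string
{
    // Step 1: escape &, < and >, but keep quotes so parameters stay quoted.
    $escaped = htmlspecialchars($bbcode, ENT_NOQUOTES, 'UTF-8');

    // Steps 2 and 3: rewrite [tag], [tag: a="1", b="2"] and [/tag] as XML
    // tags, dropping the colon and the commas between parameters.
    $xml = preg_replace_callback(
        '/\[(\/?)(\w+)\s*:?\s*((?:\w+="[^"\]]*"\s*,?\s*)*)\]/',
        function (array $m): string {
            if ($m[1] === '/') {
                return '</' . $m[2] . '>';
            }
            // Rebuild the parameters as space-separated XML attributes.
            preg_match_all('/(\w+)="([^"]*)"/', $m[3], $pairs, PREG_SET_ORDER);
            $attrs = '';
            foreach ($pairs as $pair) {
                $attrs .= ' ' . $pair[1] . '="' . $pair[2] . '"';
            }
            return '<' . $m[2] . $attrs . '>';
        },
        $escaped
    );

    // XML allows only one top-level element, so wrap everything in a root.
    return '<root>' . $xml . '</root>';
}
```

The resulting string can then be handed to `simplexml_load_string()`, which returns `false` when the tags were not nested properly, so a failed parse can double as a validation error for the user.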

Alan Storm
That's a horrible idea. What would [script] ... [/script] do?
Charlie Somerville
Yeah, that's pretty awful if you're planning on outputting HTML back. What I wrote was assuming you're parsing the BBCode to pull out information. If you're using anything but the official BBCode parsers (mentioned in the first paragraph), you're bound to leave yourself open to an XSS attack.
Alan Storm
A: 

Use preg_split() with the PREG_SPLIT_DELIM_CAPTURE flag to split the source into tags and non-tags. Then iterate over the tags, keeping a stack of open blocks (i.e. when you see an opening tag, add it to an array; when you see a closing tag, remove elements from the end of the array until the closing tag matches an opening tag).
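
A minimal sketch of that approach, assuming tags look like [name ...] and [/name]. The function name `findUnbalancedTags` and the choice to report problems as a list of tag names are illustrative assumptions, not part of any library.

```php
<?php
// Hypothetical validator: returns the tags that were left unclosed, plus
// closing tags (prefixed with '/') that had no matching opener.
function findUnbalancedTags(string $source): array
{
    // The capturing group makes preg_split() keep the tags in the result.
    $tokens = preg_split(
        '/(\[\/?\w+[^\]]*\])/',
        $source,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $stack = [];
    $problems = [];
    foreach ($tokens as $token) {
        if (!preg_match('/^\[(\/?)(\w+)[^\]]*\]$/', $token, $m)) {
            continue; // plain text between tags
        }
        [, $slash, $name] = $m;

        if ($slash === '') {
            $stack[] = $name; // opening tag: push onto the stack
            continue;
        }
        if (!in_array($name, $stack, true)) {
            $problems[] = "/$name"; // closing tag with no opener
            continue;
        }
        // Pop until the matching opener; anything popped on the way
        // was opened but never closed.
        while (($top = array_pop($stack)) !== $name) {
            $problems[] = $top;
        }
    }

    // Whatever is still on the stack was never closed either.
    return array_merge($problems, array_reverse($stack));
}
```

An empty result means every tag was closed in order; the same loop could build a tree of nodes instead of a problem list if the goal is rendering rather than validation.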

porneL