views:

50

answers:

3

Hi,

I have this function to parse bbcode -> html:

  $this->text = preg_replace(array(
    '/\[b\](.*?)\[\/b\]/ms', 
    '/\[i\](.*?)\[\/i\]/ms',
    '/\[u\](.*?)\[\/u\]/ms',
    '/\[img\](.*?)\[\/img\]/ms',
    '/\[email\](.*?)\[\/email\]/ms',
    '/\[url\="?(.*?)"?\](.*?)\[\/url\]/ms',
    '/\[size\="?(.*?)"?\](.*?)\[\/size\]/ms',
    '/\[youtube\](.*?)\[\/youtube\]/ms',
    '/\[color\="?(.*?)"?\](.*?)\[\/color\]/ms',    
    '/\[quote](.*?)\[\/quote\]/ms',
    '/\[list\=(.*?)\](.*?)\[\/list\]/ms',
    '/\[list\](.*?)\[\/list\]/ms',
    '/\[\*\]\s?(.*?)\n/ms'
   ),array(
    '<strong>\1</strong>',
    '<em>\1</em>',
    '<u>\1</u>',
    '<img src="\1" alt="\1" />',
    '<a href="mailto:\1">\1</a>',
    '<a href="\1">\2</a>',
    '<span style="font-size:\1%">\2</span>',
    '<object width="450" height="350"><param name="movie" value="\1"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="\1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="450" height="350"></embed></object>',
    '<span style="color:\1">\2</span>',
    '<blockquote>\1</blockquote>',
    '<ol start="\1">\2</ol>',
    '<ul>\1</ul>',
    '<li>\1</li>'
   ),$original);

Problem is, how to unparse this, like html -> bbcode?

My regex skills are poor :(

Thanks.

+4  A: 

It's pretty safe to say it's nigh impossible to build a reliable way to convert html to bbcode with just a slew of regexes. Use a parser (DOMDocument for instance), remove invalid elements & attributes with xpath's & inspection and then recursively walk it creating a bbcode string on the way (or just ignore invalid tags / attributes on the way).

Wrikken
+7  A: 

Don't.

Instead, store both the original unparsed text and the processed parsed text. Yes, this doubles the storage requirement, but it also makes it blindingly easy to:

  1. Allow user edits of the original without parsing the BBCode back out
  2. Allow quotes of other user posts, again without parsing
  3. Change the HTML each BBCode generates (just re-parse every post)
  4. Switch BBCode engines down the line (again, just re-parse every post)
Charles
+1 If indeed the data has been available in BBCode format this would be far preferable.
Wrikken
Thanks, i think that's a better approach than what i originally thought.
Rodrigo
+3  A: 

If you know exactly that the HTML code you want to de-bbcode was en-bbcoded using your method, than do the following:

Switch the two array you pass to preg_replace.

In the array with the HTML code, do the following for every element: Prepend # to the string. Append #s. Replace \1 (and \2 aso) with (.*?).

For the array with the bbcodes do thefollowing with every element: Remove / at the beginning and /ms at end. Replace \s with . Remove all \. Remove all ?. Replace the first (.*) in the string with $1 and the second with $2.

This should do. If any problems: Ask ;)

nikic