I'm trying to find a way to strip tags from a user-supplied string, except for tags that are wrapped in the BBCode-style [code] [/code] tag.

For example, a user may enter this:

<script>alert("hacked");</script>
[code]<script>alert("hello");</script>[/code]

I would like the "hacked" alert to be removed, but not the "hello" alert.

I would like to remove ALL tags (PHP, HTML, CSS, JS) outside of the [code] blocks, but allow anything within them.

So far, I've got the following code to do the reverse of what I would like:

preg_replace('/\[code\](.*?)\[\/code\]/ise','strip_tags(\'$1\')',$code)
+1  A: 

This is where regular expressions are not ideal. They are superb when you can describe what you want to match, but much weaker when you can only describe what you don't want. My suggestion is to look for an alternative way of doing the same thing without regular expressions.

Deniz Dogan
Can you think of any way of achieving this without using regular expressions?
Jamie Bicknell
Using regular expressions to pull values from HTML is always a mistake. HTML syntax is a lot more complex than it may first appear, and it's very easy for a page to catch out even a very complex regular expression. This question comes up a lot on Stack Overflow, and I think it's because regular expressions seem perfect for this job, but they aren't. Use an HTML parser instead.
Dave Webb
@Jamie: I haven't ever done this and I don't know for sure, but my initial attempt would be to see if there are any occurrences of `"[code]"` in the text and if so, find out its position. Then I would try to find `"[/code]"` and the position of that and then work from there.
Deniz Dogan
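A minimal sketch of the position-based idea from the comment above, assuming the tags should be matched case-insensitively. The helper name `find_code_block` is hypothetical and only for illustration; `stripos`, `substr`, `strlen` and `strip_tags` are standard PHP functions.

```php
<?php

/**
 * Locate the first [code]...[/code] block at or after $offset.
 * Hypothetical helper, for illustration only.
 *
 * Returns array( start, end ) as byte offsets, where end points just past
 * the closing tag, or null when no complete block is found.
 */
function find_code_block( $text, $offset = 0 )
{
  $open  = '[code]';
  $close = '[/code]';

  // Position of the opening tag (case-insensitive)
  $start = stripos( $text, $open, $offset );
  if ( $start === false )
  {
    return null;
  }

  // Position of the closing tag, searched for only after the opening tag
  $closePos = stripos( $text, $close, $start + strlen( $open ) );
  if ( $closePos === false )
  {
    return null;
  }

  return array( $start, $closePos + strlen( $close ) );
}

// Usage: split the input into the text before, the block, and the text after
$text  = 'a <b>bold</b> [code]<i>kept</i>[/code] z';
$block = find_code_block( $text );
if ( $block !== null )
{
  list( $start, $end ) = $block;
  $before = strip_tags( substr( $text, 0, $start ) );  // tags removed
  $code   = substr( $text, $start, $end - $start );    // left untouched
  $after  = strip_tags( substr( $text, $end ) );       // tags removed
  echo $before . $code . $after, "\n";
}
```

This handles only the first block; handling several blocks means repeating the search from the end offset of the previous one.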
A: 

You want to use an HTML Parser for this job.

I don't know PHP but Google found this HTML Parser for PHP.

Dave Webb
A: 

Use a simple parser like this:

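// Scan $input (the raw user text), stripping tags only outside [code] blocks
$out = '';
$pos = 0;
$len = strlen($input);
while ($pos < $len) {
    // Next [code] opener, or end of input if there is none
    $start = stripos($input, '[code]', $pos);
    if ($start === false) { $start = $len; }
    $out .= strip_tags(substr($input, $pos, $start - $pos));   // tag-free part
    // Just past the matching [/code], or end of input if the block is unclosed
    $end = stripos($input, '[/code]', $start);
    $end = ($end === false) ? $len : $end + strlen('[/code]');
    $out .= substr($input, $start, $end - $start);              // preserved part
    $pos = $end;
}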
Ramkumar Ramachandra
+3  A: 

I'm not sure if this is the best algorithm, but here's an idea.

  1. Remove all the [code] blocks into an array
  2. Strip tags from the remaining string
  3. Re-insert the previously removed [code] blocks.
  4. Voila!

Here's a stab at that algorithm:

<?php

header( 'Content-Type: text/plain' );

$input = <<<BB
[code]<script>alert("hello");</script>[/code]
some text <script>alert("hacked");</script> some other text
[code]<script>alert("hello");</script>[/code]
some text <script>alert("hacked");</script> some other text
[code]<script>alert("hello");</script>[/code]
BB;

echo strip_custom( $input );

function strip_custom( $content )
{
  // The "s" modifier lets "." match newlines, so multi-line [code] blocks are found too
  $pattern = "#\\[code].*?\\[/code]#is";

  if ( preg_match_all( $pattern, $content, $codeBlocks ) )
  {
    return array_join( $codeBlocks[0], array_map( 'strip_tags', preg_split( $pattern, $content ) ) );
  }
  return strip_tags( $content );
}

function array_join( array $glue, array $pieces )
{
  $glue       = array_values( $glue );
  $pieces     = array_values( $pieces );
  $piecesSize = count( $pieces );

  if ( count( $glue ) + 1 != $piecesSize )
  {
    return false;
  }

  $joined = array();
  for ( $i = 0; $i < $piecesSize; $i++ )
  {
    $joined[] = $pieces[$i];
    if ( isset( $glue[$i] ) )
    {
      $joined[] = $glue[$i];
    }
  }
  return implode( '', $joined );
}
Peter Bailey
Works like a charm! Thank you very much Peter
Jamie Bicknell
In the strip_custom() function, I've changed "return $content;" to "return strip_tags( $content );" so that the stripped text is returned when no [code] block is present. Thanks again.
Jamie Bicknell
Good catch! Added to my answer for future-proofing =)
Peter Bailey
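The remove, strip, re-insert steps from the answer above can also be sketched in a single pass, assuming `preg_split()` with the `PREG_SPLIT_DELIM_CAPTURE` flag is acceptable. The function name `strip_outside_code` is hypothetical; this is an illustrative variant, not the code from the answer.

```php
<?php

/**
 * Strip tags everywhere except inside [code]...[/code] blocks.
 * Hypothetical single-pass variant, for illustration only.
 */
function strip_outside_code( $content )
{
  // The capturing group makes preg_split() return the delimiters (the code
  // blocks) as well; the "s" modifier lets "." match newlines inside a block.
  $parts = preg_split(
    '#(\[code\].*?\[/code\])#is',
    $content,
    -1,
    PREG_SPLIT_DELIM_CAPTURE
  );

  // preg_split() returns false on a PCRE error; fall back to stripping everything
  if ( $parts === false )
  {
    return strip_tags( $content );
  }

  // Even indexes hold the text between blocks, odd indexes hold the blocks
  foreach ( $parts as $i => $part )
  {
    if ( $i % 2 === 0 )
    {
      $parts[$i] = strip_tags( $part );
    }
  }
  return implode( '', $parts );
}

echo strip_outside_code( '<b>x</b>[code]<b>y</b>[/code]<i>z</i>' ), "\n";
```

Because the blocks and the surrounding text come back in one alternating array, no separate re-joining helper is needed.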