Hi,

I'm trying to write a regular expression that will strip all tag attributes except for the SRC attribute. For example:

<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>

Would be returned as:

<p>This is a paragraph with an image <img src="/path/to/image.jpg" /></p>

I have a regular expression to strip all attributes, but I'm trying to tweak it to leave in src. Here's what I have so far:

<?php preg_replace('/<([A-Z][A-Z0-9]*)(\b[^>]*)>/i', '<$1>', '<html><goes><here>');

I'm using PHP's preg_replace() for this.

Thanks! Ian

+4  A: 

You usually should not parse HTML using regular expressions.

Instead, you should call DOMDocument::loadHTML.
You can then recurse through the elements in the document and call removeAttribute.
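
A minimal sketch of that approach, assuming PHP 7+ with the DOM extension enabled. The helper name keep_only_src is made up for illustration, and it uses an XPath query in place of hand-written recursion to visit every element that has attributes:

```php
<?php
// Hypothetical helper (name invented for this sketch): drop every attribute except src.
function keep_only_src(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Select every element that carries at least one attribute.
    foreach ($xpath->query('//*[@*]') as $element) {
        // Collect the names first: removing from the live attribute map
        // while iterating over it can skip entries.
        $names = [];
        foreach ($element->attributes as $attr) {
            $names[] = $attr->nodeName;
        }
        foreach ($names as $name) {
            if (strcasecmp($name, 'src') !== 0) {
                $element->removeAttribute($name);
            }
        }
    }

    // loadHTML() wraps the fragment in <html><body>, so serialize only the body's children.
    $out = '';
    $body = $doc->getElementsByTagName('body')->item(0);
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return trim($out);
}

echo keep_only_src('<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>');
```

Note that the HTML serializer writes the image as <img src="/path/to/image.jpg">, without the self-closing slash.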

SLaks
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
fmark
You can parse HTML using regular expressions. Not all HTML. But if you know exactly what you're receiving you can use regular expressions. This is a religious war started by people who assume that infinite stacks and memory are available in all situations.
PP
Some people have a terrible habit of not answering the question and instead obsessing about mantras. This should have been downvoted, not upvoted by the religious right.
PP
Some people, when confronted with a problem, think "I know, I'll quote Jamie Zawinski." Now they have two problems. This really is the kind of problem that is best handled by a dedicated markup parser/processor, that's quite true. But regular expressions are a damn fine tool for many jobs, including some markup processing tasks, and it's foolish to outright dismiss them.
Weston C
I'm gonna have to agree with PP. Downvoted because of the dogmatic answer given. It IS possible to parse HTML with regular expressions, especially if you know exactly what you're going for. DOMDocument is great in some cases, but not all.
Ian Silber
+1  A: 

Unfortunately I'm not sure how to answer this question for PHP. If I were using Perl I would do the following:

use strict;
my $data = q^<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>^;

$data =~ s{
    <([^/> ]+)([^>]*)> # split into tag name, attribs
}{
    my $attribs = $2;
    my @parts = split( /\s+/, $attribs ); # separate by whitespace
    @parts = grep { m/^src=/i } @parts;   # retain just the src attribute
    if ( @parts ) {
        "<" . join( " ", $1, @parts ) . ">";
    } else {
        "<" . $1 . ">";
    }
}xseg;

print( $data );

which returns

<p>This is a paragraph with an image <img src="/path/to/image.jpg"></p>
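
For PHP, a rough port of the same idea might look like the sketch below, using preg_replace_callback() in place of Perl's s{}{}e. It is an untuned translation, not a hardened filter. It uses [^>]* for the attribute part so that tags without attributes, such as <br>, pass through intact, and like the Perl it splits on whitespace, so attribute values containing spaces would be cut apart:

```php
<?php
$data = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';

$result = preg_replace_callback(
    '~<([^/> ]+)([^>]*)>~',                      // $m[1] = tag name, $m[2] = attributes
    function (array $m): string {
        // Separate the attribute string by whitespace, dropping empty pieces.
        $parts = preg_split('/\s+/', $m[2], -1, PREG_SPLIT_NO_EMPTY);
        // Retain just the src attribute.
        $parts = preg_grep('/^src=/i', $parts);
        // Rebuild the tag from its name plus whatever survived.
        return '<' . implode(' ', array_merge([$m[1]], $parts)) . '>';
    },
    $data
);

echo $result;
// <p>This is a paragraph with an image <img src="/path/to/image.jpg"></p>
```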
PP
A: 

Alright, here's what I used that seems to be working well:

<([A-Z][A-Z0-9]*)(\b[^>src]*)(src\=[\'|"|\s]?[^\'][^"][^\s]*[\'|"|\s]?)?(\b[^>]*)>

Feel free to poke any holes in it.

Ian Silber
+2  A: 

This might work for your needs:

$text = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';

echo preg_replace("/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i",'<$1$2$3>', $text);

// <p>This is a paragraph with an image <img src="/path/to/image.jpg"/></p>

The RegExp broken down:

/              # Start Pattern
 <             # Match '<' at beginning of tags
 (             # Start Capture Group $1 - Tag Name
  [a-z]         # Match 'a' through 'z'
  [a-z0-9]*     # Match 'a' through 'z' or '0' through '9' zero or more times
 )             # End Capture Group
 (?:           # Start Non-Capture Group
  [^>]*         # Match anything other than '>', Zero or More Times
  (             # Start Capture Group $2 - ' src="...."'
   \s            # Match one whitespace
   src=          # Match 'src='
   ['"]          # Match ' or "
   [^'"]*        # Match anything other than ' or " 
   ['"]          # Match ' or "
  )             # End Capture Group 2
 )?            # End Non-Capture Group, match group zero or one time
 [^>]*?        # Match anything other than '>', Zero or More times, not-greedy (won't eat the /)
 (\/?)         # Capture Group $3 - '/' if it is there
 >             # Match '>'
/i            # End Pattern - Case Insensitive

Add some quoting and use the replacement text <$1$2$3>, and it should strip any non-src attributes from well-formed HTML tags.

Please note: this isn't necessarily going to work on ALL input, as the Anti-HTML + RegExp people are so cleverly noting below. There are a few pitfalls, most notably that <p style=">"> would end up as <p>">, plus a few other broken cases. I would recommend looking at Zend_Filter_StripTags as a more foolproof tag/attribute filter in PHP.
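
To make both the normal case and the caveat concrete, here is a small sketch that wraps the pattern in a function (the name strip_all_but_src is only for illustration) and runs it on the question's input and on the problematic <p style=">"> input:

```php
<?php
// Hypothetical wrapper around the pattern from this answer.
function strip_all_but_src(string $html): string
{
    $pattern = "/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i";
    // $1 = tag name, $2 = ' src="..."' (empty if absent), $3 = optional '/'
    return preg_replace($pattern, '<$1$2$3>', $html);
}

echo strip_all_but_src('<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>'), "\n";
// <p>This is a paragraph with an image <img src="/path/to/image.jpg"/></p>

// A '>' inside an attribute value ends the match early and leaves debris behind.
echo strip_all_but_src('<p style=">">text</p>'), "\n";
// <p>">text</p>
```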

gnarf
Unless `>` appears in an attribute value. Parsing evil HTML is _hard_. Plus, you forgot to escape `\ `.
SLaks
Which `\ ` did I forget to escape?
gnarf
A: 

As mentioned above, you shouldn't use regex to parse HTML or XML.

If the input is always exactly the same, I would handle your example with str_replace():

$str = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';

// The leading space is part of the search string so the result is <p>, not <p >
$str = str_replace(' id="paragraph" class="green"', "", $str);

$str = str_replace('width="50" height="75"',"",$str);
streetparade