I have an HTML string like so:

<img src="http://foo"><img src="http://bar">

What would be the regex pattern to split this into two separate img tags?

A: 

Shooting from the hip, something like:

(<img src=\".*?\"\s*/?>)

Applied globally, this should match each img tag separately, and you can then iterate through the matches. I also added notation for an optional XHTML closing slash (e.g. <img src="foo" />).
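As a rough sketch of how a pattern along these lines could be applied, here it is in Python. The language is an assumption, since the question does not name one, and the sample string adds a hypothetical " />" ending to the second tag to exercise the optional slash. It only handles tags whose sole attribute is a double-quoted src.

```python
import re

# One img tag whose only attribute is a double-quoted src.
# \s*/? permits an optional XHTML-style " />" ending.
IMG_RE = re.compile(r'<img src=".*?"\s*/?>')

html = '<img src="http://foo"><img src="http://bar" />'

# findall returns every non-overlapping match as a string.
tags = IMG_RE.findall(html)

for tag in tags:
    print(tag)
```

The non-greedy `.*?` is what keeps the first match from swallowing both tags.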

Brent Arias
There are more things in HTML and tags, Brent Arias, than are dreamt of in your philosophy.
tchrist
@tchrist: It's smarter to make comments that are presumptive and insulting.
Brent Arias
+3  A: 

Don't do it with regex. Use an HTML/XML parser. You can even run it through Tidy first to clean it up. Most languages have a Tidy library. What language are you using?
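For illustration, here is a minimal sketch of the parser approach using Python's standard-library `html.parser`. Python is an assumption here, since the language has not been stated, and the class name `ImgTagCollector` is made up for this example. It collects the raw text of each img start tag as the parser sees it.

```python
from html.parser import HTMLParser


class ImgTagCollector(HTMLParser):
    """Collects the raw source text of every <img> start tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; get_starttag_text() returns the
        # original text of the start tag currently being handled.
        if tag == "img":
            self.tags.append(self.get_starttag_text())


collector = ImgTagCollector()
collector.feed('<img src="http://foo"><img src="http://bar">')
collector.close()

tags = collector.tags
print(tags)
```

The `attrs` argument holds the parsed (name, value) pairs, so the same handler could pull out just the src values instead.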

Vivin Paliath
A: 

This will do it:

<img\s+src=\"[^\"]*?\">

Or you can do this to account for any additional attributes:

<img\s+[^>]*?\bsrc=\"[^\"]*?\"[^>]*>
orvado
That doesn't account for the "additional attributes" you say it does. Look at my solution for how to do this properly. Well, as properly as is possible without using an HTML-parsing class.
tchrist
A: 
<img src=\"https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?\">

PHP example:

$prom = '<img src="http://foo"><img src="http://bar">';

preg_match_all('|<img src=\"https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?\">|',$prom, $matches);

print_r($matches[0]);
XViD
+2  A: 

How sure are you that your string is exactly that? What about input like this:

<img alt=">"          src="http://foo"  >
<img src='http://bar' alt='<'           >

What programming language is this? Is there some reason you're not using a standard HTML-parsing class to handle this? Regexes are only a good approach when you have an extremely well-known set of inputs. They don't work for real HTML, only for rigged demos.
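To make the failure concrete, here is a small sketch that runs a naive pattern and a real parser over those two tags. Python and its standard-library `html.parser` are assumptions, since the language is still unknown, and `SrcCollector` is a name invented for this example.

```python
import re
from html.parser import HTMLParser

TRICKY = '''<img alt=">"          src="http://foo"  >
<img src='http://bar' alt='<'           >'''

# Naive pattern: stops at the first ">" it meets, quoted or not.
naive = re.findall(r'<img[^>]*>', TRICKY)


class SrcCollector(HTMLParser):
    """Collects the src value of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes removed.
        if tag == "img":
            self.srcs.append(dict(attrs).get("src"))


collector = SrcCollector()
collector.feed(TRICKY)
collector.close()

print(naive[0])        # first tag is cut off inside alt=">"
print(collector.srcs)  # the parser still finds both src values
```

The naive match for the first tag ends right after `alt=">`, so its src is lost, while the parser handles the quoted `>` and the single-quoted attributes.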

Even if you must use a regex, you should use a proper grammatical one. This is quite easy. I've tested the following programacita on a zillion web pages. It takes care of the cases I outline above — and one or two others, too.

#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;

my $img_rx = qr{

    # save capture in $+{TAG} variable
    (?<TAG> (?&image_tag) )

    # remainder is pure declaration
    (?(DEFINE)

        (?<image_tag>
            (?&start_tag)
            (?&might_white) 
            (?&attributes) 
            (?&might_white) 
            (?&end_tag)
        )

        (?<attributes>
            (?: 
                (?&might_white) 
                (?&one_attribute) 
            ) *
        )

        (?<one_attribute>
            \b
            (?&legal_attribute)
            (?&might_white) = (?&might_white) 
            (?:
                (?&quoted_value)
              | (?&unquoted_value)
            )
        )

        (?<legal_attribute> 
            (?: (?&required_attribute)
              | (?&optional_attribute)
              | (?&standard_attribute)
              | (?&event_attribute)
              # for LEGAL parse only, comment out next line 
              | (?&illegal_attribute)
            )
        )

        (?<illegal_attribute> \b \w+ \b )

        (?<required_attribute>
            alt
          | src
        )

        (?<optional_attribute>
            (?&permitted_attribute)
          | (?&deprecated_attribute)
        )

        # NB: The white space in string literals 
        #     below DOES NOT COUNT!   It's just 
        #     there for legibility.

        (?<permitted_attribute>
            height
          | is map
          | long desc
          | use map
          | width
        )

        (?<deprecated_attribute>
             align
           | border
           | hspace
           | vspace
        )

        (?<standard_attribute>
            class
          | dir
          | id
          | style
          | title
          | xml:lang
        )

        (?<event_attribute>
            on abort
          | on click
          | on dbl click
          | on mouse down
          | on mouse out
          | on key down
          | on key press
          | on key up
        )

        (?<unquoted_value> 
            (?&unwhite_chunk) 
        )

        (?<quoted_value>
            (?<quote>   ["']      )
            (?: (?! \k<quote> ) . ) *
            \k<quote> 
        )

        (?<unwhite_chunk>   
            (?:
                # (?! [<>'"] ) 
                (?! > ) 
                \S
            ) +   
        )

        (?<might_white>     \s *   )

        (?<start_tag>  
            < (?&might_white) 
            img 
            \b       
        )

        (?<end_tag>          
            (?&html_end_tag)
          | (?&xhtml_end_tag)
        )

        (?<html_end_tag>       >  )
        (?<xhtml_end_tag>    / >  )

    )

}six;

$/ = undef;
$_ = <>;   # read all input

# strip stuff we aren't supposed to look at
s{ <!    DOCTYPE  .*?         > }{}sx; 
s{ <! \[ CDATA \[ .*?    \]\] > }{}gsx; 

s{ <script> .*?  </script> }{}gsix; 
s{ <!--     .*?        --> }{}gsx;

my $count = 0;

while (/$img_rx/g) {
    printf "Match %d at %d: %s\n", 
            ++$count, pos(), $+{TAG};
} 

There you go. Nothing to it!

Gee, why would you ever want to use an HTML-parsing class, given how easily HTML can be dealt with in a regex? ☺

tchrist