views:

542

answers:

4

Is it possible to do a regex replace on all IMG tags that are unclosed? If so, how would I identify:

  <img src="..." alt="...">

...as a potential candidate to be replaced?

   = <img src="..." alt="..."/>

Update: We have hundreds of pages and thousands of image tags, most of which must be closed. I'm not stuck on regex; any other method, aside from manually updating all IMG tags, would suffice.

+4  A: 

In HTML the end tag for an <img> "must be omitted", so the start tag closes the element and you can't have an unclosed img.

If you want to convert your HTML to XHTML then use a real parser. Regular Expressions aren't a very good tool for this job.
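As a rough illustration of the parser route, here is a minimal Python sketch built on the standard library's html.parser. The class name ImgCloser and the function close_img_tags are made up for this example. It echoes the input back and only rewrites img start tags, so it is not a full HTML-to-XHTML converter, and end tags come back lowercased.

```python
from html.parser import HTMLParser


class ImgCloser(HTMLParser):
    """Echo the input, except that <img ...> becomes <img ... />."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()  # the start tag exactly as it appeared
        if tag == "img":  # tag names arrive lowercased
            raw = raw[:-1].rstrip() + " />"
        self.out.append(raw)

    def handle_startendtag(self, tag, attrs):
        # Already written as <tag ... />, so pass it through untouched.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def close_img_tags(html):
    """Return html with every non-self-closed img tag rewritten as <img ... />."""
    parser = ImgCloser()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(close_img_tags('<p><img src="a.png" alt="a > b"></p>'))
```

Because the parser tokenizes the attributes itself, a ">" inside a quoted value is not mistaken for the end of the tag.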

David Dorward
A: 

What exactly do you mean by "unclosed"?

 <img src="a1.jpg    <-- no closing quote and no closing angle bracket
 <img src="a1.jpg"   <-- no closing angle bracket
 <img src="a1.jpg">  <-- the tag does not self-close as should be done in XHTML

You can try to intelligently find such suspects, but the result is never guaranteed to be foolproof.

naivists
A: 

I have never tried this, but a closed img tag is a tag that begins with <img, has some attributes in it, and ends with />.

Here is something I tried in Perl:

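#!/usr/bin/env perl
use strict;
use warnings;

# Sample tags: two unclosed, one already self-closed, one with ">" in a value.
my @images = ('<img src="toto.jpg">',
              '<img src="truc/machin.jpg" title="pouet" >',
              '<img        src="pouet.jpg" alt="toto" />',
              '<img src="math/a-greater-than-b.png" alt="a > b">');

foreach (@images) {
    # $1 captures the double-quoted attributes; a trailing "/" prevents a match.
    if (/<img\s+(([a-z]+=".*?")+\s*)>/) {
        print "Match : <img $1 />\n";
    }
}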

Produces:

Match : <img src="toto.jpg" />
Match : <img src="truc/machin.jpg" title="pouet"  />
Match : <img src="math/a-greater-than-b.png" alt="a > b" />
Aif
And it breaks if attribute values aren't quoted (valid!), are quoted with single quotes (valid!), or if the attribute name contains non-letter characters (HTML5's data-foo) or uppercase characters.
David Dorward
Uppercase is easy to deal with. I thought single quotes were not allowed, but that's not the case. Again, easy to fix: replace " with ['"]. You're right about non-alphabetic characters, though. Again, I think it can be done, but the spec has to be more precise. Nevertheless, it's possible to automate this task, but (maybe) not with regexes only. Regexes are just a pretty good first filter. It may be enough if the URL scheme is always the same on his pages. Thanks for your comment anyway.
Aif
Replacing `"` with `["']` would cause it to break for `foo="bar 'baz' bar"`. HTML is **not** simple to parse with regex.
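To make that failure concrete, here is a small Python sketch. The names NAIVE, CAREFUL and attrs are hypothetical and only serve this illustration. NAIVE is the "swap the quote for a character class" idea, while CAREFUL gives double-quoted, single-quoted and unquoted values their own branch. Even CAREFUL is only a heuristic, not an HTML parser.

```python
import re

# Naive: any quote may open the value and any quote may close it.
NAIVE = re.compile(r"""([\w-]+)=['"](.*?)['"]""")

# More careful: a value must end with the same kind of quote that opened it,
# or be unquoted and contain no whitespace, quotes, or ">".
CAREFUL = re.compile(r"""([\w-]+)=(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))""")


def attrs(pattern, tag):
    """Return (name, value) pairs found in tag by pattern."""
    pairs = []
    for m in pattern.finditer(tag):
        # Exactly one of the value groups took part in the match.
        value = next(g for g in m.groups()[1:] if g is not None)
        pairs.append((m.group(1), value))
    return pairs


tag = """<img foo="bar 'baz' bar">"""
print(attrs(NAIVE, tag))    # value is cut off at the first inner quote
print(attrs(CAREFUL, tag))  # value is kept whole
```

The naive pattern stops at the first single quote inside the double-quoted value, which is exactly the breakage described above.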
David Dorward
+2  A: 
(<img[^>]+)(?<!/)>

will match an img tag that is not properly closed. It requires that the regex flavor you're using supports lookbehind (Ruby 1.8 and JavaScript engines before ES2018 don't, but most others do). Backreference no. 1 will contain the match, so if you search for this regex and replace with \1/> you should be good to go.

If you need to account for the possibility of > inside attributes, you could use

(<img("[^"]*"|[^>"])+)(?<!/)>

This will match, e.g.,

<img src="image.gif" alt="hey, look--->">
<img src="image/image.gif">

and leave

<img src="image/image.gif" />

alone.
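For anyone who wants to try this on real files, here is a hedged Python sketch of the search-and-replace (Python's re module supports lookbehind). The names UNCLOSED_IMG and close_img_tags are made up for this example. The pattern is a variant of the second regex that also accepts single-quoted values and keeps quote characters out of the catch-all branch, so a quoted value can only be consumed as a whole.

```python
import re

# Group 1: "<img" plus everything up to, but not including, the closing ">".
# Quoted values are consumed whole, so a ">" inside them is not a tag end.
# The lookbehind (?<!/) rejects tags that already end in "/>".
UNCLOSED_IMG = re.compile(
    r"""(<img\b(?:"[^"]*"|'[^']*'|[^>"'])*)(?<!/)>""",
    re.IGNORECASE,
)


def close_img_tags(html):
    """Rewrite <img ...> as <img .../>, leaving self-closed tags alone."""
    return UNCLOSED_IMG.sub(r"\1/>", html)


sample = (
    '<img src="image.gif" alt="hey, look--->">\n'
    '<img src="image/image.gif">\n'
    '<img src="image/image.gif" />\n'
    '<img\n  src="split.gif">\n'
)
print(close_img_tags(sample))
```

The last sample tag spans two lines and is still matched, because the negated character class also matches newlines. A tag with an unbalanced quote is simply not matched, so it is worth reviewing whatever the script leaves untouched.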

Tim Pietzcker
Does this assume the `img` element occupies a single text line?
Loadmaster
No, it doesn't.
Tim Pietzcker
It does however assume that the alt and title text do not contain a `>`. (Which you are not guaranteed, with hundreds of pages of code).
Sean Vieira
You're right. That's one of the reasons why regexes are not the best tool to handle HTML, to paraphrase bobince's legendary post. Of course, you can account for that (will edit my post).
Tim Pietzcker
A plain `<` is not allowed in an attribute value.
Gumbo
Of course. Thanks.
Tim Pietzcker
@Gumbo — Yes, it is.
David Dorward
@David Dorward: I’m not quite sure about SGML, but in XML a plain `<` is not allowed.
Gumbo
It's allowed in HTML.
David Dorward