Hi,

I am trying to write a pattern for extracting the path for files found in img tags in HTML.

String string = "<img src=\"file:/C:/Documents and Settings/elundqvist/My Documents/My Pictures/import dialog step 1.JPG\" border=\"0\" />";

My Pattern:

src\\s*=\\s*\"(.+)\"

The problem is that my pattern also includes the border="0" part of the img tag.

What pattern would match the URI path for this file without including border="0"?

Thanks

+2  A: 

Try this one:

src\s*=\s*"([^"]+)"
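
For reference, here is a minimal sketch of how that pattern might look once escaped for a Java string literal. The class name ImgSrcNegatedClass and the helper extractSrc are made up for illustration, and the sketch assumes the attribute value is always wrapped in double quotes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcNegatedClass {
    // [^"]+ matches one or more characters that are not a double quote,
    // so the capture group cannot run past the closing quote of src.
    static final Pattern SRC = Pattern.compile("src\\s*=\\s*\"([^\"]+)\"");

    // Hypothetical helper: returns the first src value, or null if none is found.
    static String extractSrc(String html) {
        Matcher m = SRC.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<img src=\"file:/C:/Documents and Settings/elundqvist/"
                + "My Documents/My Pictures/import dialog step 1.JPG\" border=\"0\" />";
        System.out.println(extractSrc(html));
    }
}
```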
Gumbo
+4  A: 

Your pattern should be (unescaped):

src\s*=\s*"(.+?)"

The important part is the added question mark, which makes the quantifier reluctant (non-greedy): the group matches as few characters as possible, so it stops at the first closing quote.
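
To see the difference, a small sketch that runs the original greedy pattern and the reluctant one against the same input. The class name GreedyVsReluctant, the helper firstGroup, and the shortened sample tag are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Hypothetical helper: compiles the regex and returns capture group 1
    // of the first match, or null if there is no match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<img src=\"a b.JPG\" border=\"0\" />";

        // Greedy: .+ grabs as much as it can, then backs off only as far as
        // the last double quote in the input.
        System.out.println(firstGroup("src\\s*=\\s*\"(.+)\"", html));

        // Reluctant: .+? grows one character at a time and stops at the
        // first double quote that lets the overall match succeed.
        System.out.println(firstGroup("src\\s*=\\s*\"(.+?)\"", html));
    }
}
```

With this input, the greedy version should capture everything up to the quote that closes the border attribute, while the reluctant version should capture only the src value.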

Sebastian Dietz
TY that worked :)
willcodejavaforfood
A: 

You want to play with the greedy form of group-capture. Something like

src\\s*=\\s*\"(.+)?\"

By default, the quantifier is greedy: it tries to match as much as possible.

oxbow_lakes
You need to put the question mark inside the parens, like Sebastian did.
Alan Moore
A: 

I am trying to write a pattern for extracting the path for files found in img tags in HTML.

Can we have an autoresponder for "Don't use regex to parse [X]HTML"?

The problem is that my pattern also includes the border="0" part of the img tag.

Not to mention any time 'src="' appears in plain text!

If you know in advance the exact format of the HTML you're going to be parsing (e.g. because you generated it yourself), you can get away with it. Otherwise, regex is entirely the wrong tool for the job.
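
As one possible alternative, here is a rough sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html) instead of a regex. It is only an illustration, not a recommendation of this particular parser: the class name ImgSrcParser and the helper imgSources are made up, the sketch assumes the java.desktop module is available, and this parser targets old HTML 3.2, so a dedicated HTML parsing library may well suit real pages better.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class ImgSrcParser {
    // Hypothetical helper: collects the src attribute of every img tag.
    static List<String> imgSources(String html) {
        List<String> sources = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // img has no closing tag, so it is reported as a simple tag.
                if (tag == HTML.Tag.IMG) {
                    Object src = attrs.getAttribute(HTML.Attribute.SRC);
                    if (src != null) {
                        sources.add(src.toString());
                    }
                }
            }
        };
        try {
            // The last argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sources;
    }

    public static void main(String[] args) {
        // The src=" inside the paragraph is plain text, not an attribute.
        String html = "<p>src=\"not-an-image\"</p>"
                + "<img src=\"file:/tmp/a b.jpg\" border=\"0\" />";
        System.out.println(imgSources(html));
    }
}
```

Because the parser only reports real attributes on real tags, text such as src=" in the body of the page is not picked up.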

bobince