I need to extract the src attribute from every img tag in an HTML document.

So the input is an HTML page and the output is a list of URLs pointing to images, for example http://www.google.com/intl/en_ALL/images/logo.gif

The following is what I came up with so far:

<img\s+src=""(http://.*?)

This does not work for tags where src isn't the first attribute after img, for example:

<img height="1px" src="spacer.gif">

Can someone help complete this regular expression? It's pretty easy, but I thought this may be a faster way to get an answer.

A: 

The following regexp snippet should work.

<img[^>]+src="([^">]+)"

It looks for text that starts with <img, followed by one or more characters that are not >, then src=". It then grabs everything between that point and the next " or >.
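To see the pattern applied to a whole page, here is a minimal JavaScript sketch. The function name extractImageSources and the sample markup are made up for illustration, and the g and i flags are additions to the pattern so that every tag is found and uppercase tags such as <IMG SRC="..."> also match.

```javascript
// Hypothetical helper: collect the src value of every double-quoted <img> tag.
function extractImageSources(html) {
  // Same pattern as above, plus "g" (all matches) and "i" (case-insensitive).
  const pattern = /<img[^>]+src="([^">]+)"/gi;
  const sources = [];
  for (const match of html.matchAll(pattern)) {
    sources.push(match[1]); // capture group 1 is the text between the quotes
  }
  return sources;
}

// Sample input built from the two tags mentioned in the question.
const sampleHtml =
  '<p><img src="http://www.google.com/intl/en_ALL/images/logo.gif">' +
  '<img height="1px" src="spacer.gif"></p>';

console.log(extractImageSources(sampleHtml)); // logs both src values
```

Both tags from the question are picked up, including the one where src is not the first attribute.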

But if at all possible, use a real HTML parser. It's more robust and will handle edge cases much better.

Anirvan
It won't work for single-quoted attributes, and remember that HTML doesn't actually require quotes around an attribute value unless it contains whitespace or certain other characters (quotes, =, <, > or a backtick).
Lucero
+5  A: 

You don't want to do that. Correctly parsing HTML is a very complex problem, and regular expressions are not a good tool for that.

See e.g. http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-regex

And here for a good solution:

http://stackoverflow.com/questions/218535/how-do-i-programatically-inspect-a-html-document

sleske
A: 

In any situation where you need to extract values from HTML markup, you should avoid regular expressions. I don't know enough VBScript to tell if there are any libraries available that you can use.

If you know that everything will be well-formed, valid markup, you can try:

<img[^>]+src="([^"]+)"[^>]*>

As I said, I don't know much VBScript, but this regexp should capture the value of the src attribute in the group formed by the parentheses above.

Of course the regular expression above won't work if the markup isn't well formed or uses different delimiters for attributes (such as single quotes, or no delimiter at all).
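If those other delimiters do need to be covered, one possible extension is to list the three quoting styles as alternatives. This is only a sketch in JavaScript: the function name extractImageSourcesLoose is hypothetical, and the pattern is still a heuristic rather than a parser (a > inside another attribute's value, for instance, will still confuse it).

```javascript
// Hypothetical helper: like the pattern above, but tolerant of quoting style.
function extractImageSourcesLoose(html) {
  // \ssrc requires whitespace before "src", so attributes like data-src are skipped.
  // The value is one of: "double quoted" | 'single quoted' | unquoted.
  const pattern =
    /<img\b[^>]*?\ssrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/gi;
  const sources = [];
  for (const match of html.matchAll(pattern)) {
    // Only one of the three capture groups participates in any given match.
    sources.push(match[1] ?? match[2] ?? match[3]);
  }
  return sources;
}

console.log(
  extractImageSourcesLoose(
    "<img src=\"a.gif\"><img height=1 src='b.gif'><IMG SRC=c.gif>"
  )
);
```

Each alternative has its own capture group, which is why the result is taken from whichever group is defined.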

Dan Herbert
+2  A: 

You could do this pretty easily with JavaScript. For example:

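var images = document.getElementsByTagName("img");

for (var i = 0; i < images.length; i++)
{
   // get image src (the .src property returns the resolved, absolute URL;
   // use images[i].getAttribute("src") for the value as written in the markup)
   var currImage = images[i].src;

   // do link creation here
}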
ryanulit