I have a string of text that contains HTML, and I need to extract each URL (most likely in img or a tags) into a generic list of strings. I only want the URLs that appear inside HTML tags, not those in the plain text. Is there an easy way to do this, or will I have to resort to regular expressions?
If I have to resort to regular expressions, would you mind helping me out with that as well? :)
UPDATE: To answer Seph, the input will be standard HTML, for example:
<p>This is some html text. my favourite website is <a href="http://www.google.com">google</a> and my favourite help site is <a href="http://www.stackoverflow.com">stackoverflow</a> and i check my email at <a href="http://www.gmail.com">gmail</a>. the url to my site is http://www.mysite.com. <img src="http://www.someserver.com/someimage.jpg" alt=""/></p>
And I want:
- http://www.google.com
- http://www.stackoverflow.com
- http://www.gmail.com
- http://www.someserver.com/someimage.jpg
The end result should be all URLs found inside any HTML tag, ignoring those that are "plain text".
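To make the expected behaviour concrete, here is a rough sketch of what I am after. It assumes Python and its standard `html.parser` module, purely for illustration, since I am not tied to a language here. The names `TagUrlCollector`, `extract_tag_urls`, and `URL_ATTRIBUTES` are made up for this example. Limiting the search to `href` and `src` is also my assumption about which attributes matter.

```python
from html.parser import HTMLParser

# Attributes assumed to carry URLs; extend this set if other attributes matter.
URL_ATTRIBUTES = {"href", "src"}


class TagUrlCollector(HTMLParser):
    """Collects URL attribute values from tags and ignores text content."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        # Self-closing tags such as <img ... /> also end up here, because the
        # default handle_startendtag delegates to handle_starttag.
        for name, value in attrs:
            if name in URL_ATTRIBUTES and value:
                self.urls.append(value)


def extract_tag_urls(html_text):
    """Return the URLs found in tag attributes, in document order."""
    collector = TagUrlCollector()
    collector.feed(html_text)
    collector.close()
    return collector.urls


if __name__ == "__main__":
    sample = (
        '<p>my favourite website is <a href="http://www.google.com">google</a>. '
        'the url to my site is http://www.mysite.com. '
        '<img src="http://www.someserver.com/someimage.jpg" alt=""/></p>'
    )
    # The plain-text URL is not inside a tag, so it is not collected.
    print(extract_tag_urls(sample))
```

Something along these lines, which walks the tags instead of pattern-matching the raw text, is the kind of non-regex approach I was hoping exists.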
UPPERDATE: Although he deleted his answer, I want to thank Jerry Bullard for bringing Regex Buddy (http://www.regexbuddy) to my attention. I wanted to upvote your answer, but it's gone. Bring it back and you get a vote!