ansaurus

Question

C# regular expression to match img src="*" type URLs.

Answer 1

A:

Regex is a bad idea for this; better to use an HTML parser. Here is a regex I used for parsing links, though:

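import java.util.regex.Matcher;
import java.util.regex.Pattern;

String body = "..."; // body of the page
// Group 2 captures an absolute http(s) URL, group 3 a relative path.
Matcher m = Pattern.compile("(?im)(?:(?:(?:href)|(?:src))[ ]*?=[ ]*?[\"'])(((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s\"]*))|((?:\\/{0,1}[\\w\\.]+)+))[\"']").matcher(body);
while (m.find()) {
  String absolute = m.group(2); // null when the link is relative
  String relative = m.group(3); // null when the link is absolute
}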
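
The pattern above matches both href and src on any tag. If only img src values are wanted, a narrower pattern is easier to read. The following is a minimal sketch, not a robust solution: the class name ImgSrcExtractor and the method extract are made up for illustration, and it only handles quoted src values in tags that contain no ">" before the src attribute. The pattern uses only common regex syntax, so it should carry over to .NET's Regex class, but that is untested here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
public class ImgSrcExtractor {
    // Matches an <img ...> tag and captures the quoted src value in group 2.
    // Group 1 remembers the opening quote so the same quote character must close the value.
    // Requiring whitespace before "src" keeps attributes such as data-src from matching.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\ssrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            urls.add(m.group(2));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<p><img alt='x' src=\"/images/logo.png\"> <a href=\"page.html\">link</a></p>";
        System.out.println(extract(html)); // prints [/images/logo.png]
    }
}
```

Unquoted src values, commented-out markup, and script-generated tags are all missed or mishandled by this pattern, which is the usual argument for a parser.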

It's a lot easier with a parser though, and lighter on resources. Here is a link showing what I eventually wrote when I switched to a parser:

http://notetodogself.blogspot.com/2007/11/extract-links-using-htmlparser.html

Probably not as helpful, since that was Java and you need C#.

mkoryak 2010-09-09 19:55:19

Answer 2

A:

First, I would try to skip all the manual parsing and use LINQ to HTML:

HDocument document = HDocument.Load("http://www.microsoft.com");

foreach (HElement element in document.Descendants("img"))
{
   Console.WriteLine("src = " + element.Attribute("src"));
}

If that didn't work, only then would I go back to manual parsing, and I'm sure one of the fine gentle-people here has already posted a working regex for your needs.

BioBuckyBall 2010-09-09 19:56:41

Do you know how LINQ to HTML compares to, let's say, HTML Agility Pack, in terms of how well it parses messed-up markup?

Jim Brissom 2010-09-09 19:59:27

+1 @Jim Brissom - just what I was about to ask :)

Oded 2010-09-09 20:03:38

@Jim Brissom Good point, I don't actually. I will add text to clarify.

BioBuckyBall 2010-09-09 20:06:46

@Lucas Heneks - The page you link to claims that it is _not_ based on Linq2Xml but is _like_ it and that it _does_ handle malformed HTML.

Oded 2010-09-09 20:08:09

@Oded I guess I should read my own links, shame on me.

BioBuckyBall 2010-09-09 20:09:10

lol... However, I would still like to know how it compares...

Oded 2010-09-09 20:09:54

Answer 3

+2 A:

Using RegEx to parse images in this way is a bad idea. See here for a good demonstration of why.

You can use an HTML parser such as the HTML Agility Pack to parse the HTML and query it using XPath syntax.
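
To give a feel for the XPath approach, here is a rough sketch in Java, the language of the snippet in Answer 1. It does not use HTML Agility Pack, which is a .NET library. It uses the XML parser and XPath engine that ship with the JDK, so it assumes the input is well-formed XHTML and will fail on typical messy HTML. The class name ImgSrcXPath and the method extract are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only. Assumes well-formed XHTML input.
public class ImgSrcXPath {
    public static List<String> extract(String xhtml) {
        try {
            // Parse the markup into a DOM tree (throws if the input is not well-formed XML).
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Select the src attribute of every img element, in document order.
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//img/@src", doc, XPathConstants.NODESET);
            List<String> urls = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                urls.add(nodes.item(i).getNodeValue());
            }
            return urls;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input could not be parsed as XML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><img src=\"/a.png\" alt=\"a\"/><p><img src=\"b.jpg\"/></p></body></html>";
        System.out.println(extract(xhtml)); // prints [/a.png, b.jpg]
    }
}
```

The query itself, //img/@src, is the part that matters: it selects every src attribute on every img element, and a tolerant HTML parser lets you run the same kind of query against real-world pages.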

Oded 2010-09-09 19:56:56

Answer 4

A:

I don't know what your program does, but I'm guessing this is an example of something you would do in 5 minutes from the command line in Linux. You can download Windows versions of many of the same tools (sed, for instance) and save yourself the hassle of writing all that code.

Kendrick 2010-09-09 19:58:35
