I have a regex in C# that I'm using to match image tags and pull out the URL. The code works in most situations. The code below "fixes" all relative image URLs by converting them to absolute URLs.

The issue is that the regex will not match the following:

<img height="150" width="202" alt="" src="../Image%20Files/Koala.jpg" style="border: 0px solid black; float: right;">

For example, it matches this one just fine:

<img height="147" width="197" alt="" src="../Handlers/SignatureImage.ashx?cid=5" style="border: 0px solid black;">

Any ideas on how to make it match would be great. I think the issue is the % but I could be wrong.

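// Flags combine with |, not &: IgnoreCase & IgnorePatternWhitespace evaluates to RegexOptions.None.
// IgnorePatternWhitespace is left out because it would drop the literal leading space from the pattern.
Regex rxImages = new Regex(" src=\"([^\"]*)\"", RegexOptions.IgnoreCase);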
mc = rxImages.Matches(html);
if (mc.Count > 0)
{
    Match m = mc[0];
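    // Skip the 6 characters of the leading ' src="' and drop the closing quote.
    string relativeUrl = html.Substring(m.Index + 6, m.Length - 7);
    // StartsWith avoids the exception Substring(0, 4) throws on URLs shorter than 4 characters.
    if (!relativeUrl.StartsWith("http", StringComparison.Ordinal))
    {
        Uri absoluteUri = new Uri(baseUri, relativeUrl);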
        ret += html.Substring(0, m.Index + 5);
        ret += absoluteUri.ToString();
        ret += html.Substring(m.Index + m.Length - 1, html.Length - (m.Index + m.Length - 1));
        ret = convertToAbsolute(URL, ret);
    }
}
A: 

Regex is a bad idea here; you are better off using an HTML parser. That said, here is a regex I used for parsing links:

String body = "..."; //body of the page
Matcher m = Pattern.compile("(?im)(?:(?:(?:href)|(?:src))[ ]*?=[ ]*?[\"'])(((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s\"]*))|((?:\\/{0,1}[\\w\\.]+)+))[\"']").matcher(body);
while(m.find()){
  String absolute = m.group(2);
  String relative = m.group(3);
}

It's a lot easier with a parser, though, and lighter on resources. Here is a link showing what I eventually wrote when I switched to a parser:

http://notetodogself.blogspot.com/2007/11/extract-links-using-htmlparser.html

It's probably not as helpful, since that was Java and you need C#.
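
For the narrower job in the question (rewriting only src values), a much shorter pattern may be enough. What follows is a minimal sketch in Java, the same language as the snippet above, and not a drop-in solution: the class name ImageSrcRewriter, the method toAbsolute, and the example.com base URL are made up for illustration. It captures whatever sits between the double quotes, so a character such as % needs no special handling, and it relies on java.net.URI.resolve to build the absolute URL. It assumes double-quoted attribute values, and it also rewrites src on tags other than img.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ImageSrcRewriter {
    // Group 1: the attribute prefix up to the opening quote.
    // Group 2: the URL itself. Group 3: the closing quote.
    private static final Pattern SRC = Pattern.compile(
            "(\\ssrc\\s*=\\s*\")([^\"]*)(\")", Pattern.CASE_INSENSITIVE);

    /** Rewrites every relative src value in html to an absolute URL under base. */
    static String toAbsolute(String html, URI base) {
        Matcher m = SRC.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String url = m.group(2);
            String fixed;
            try {
                // Absolute URLs resolve to themselves; relative ones are joined to base.
                fixed = base.resolve(url).toString();
            } catch (IllegalArgumentException e) {
                // Values that are not valid URIs (for example, containing a raw space) are left as-is.
                fixed = url;
            }
            // quoteReplacement keeps '$' and '\' in the URL from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + fixed + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical base URL, chosen only for this demonstration.
        URI base = URI.create("http://example.com/site/pages/index.html");
        String html = "<img alt=\"\" src=\"../Image%20Files/Koala.jpg\" style=\"float: right;\">";
        System.out.println(toAbsolute(html, base));
    }
}
```

Because every match is handled in a single pass, there is no need to re-scan the string after each replacement, and the second and later images are rewritten as well.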

mkoryak
A: 

First, I would try to skip all the manual parsing and use LINQ to HTML:

HDocument document = HDocument.Load("http://www.microsoft.com");

foreach (HElement element in document.Descendants("img"))
{
   Console.WriteLine("src = " + element.Attribute("src"));
}

If that didn't work, only then would I go back to manual parsing, and I'm sure one of the fine gentle-people here has already posted a working regex for your needs.

BioBuckyBall
Do you know how LINQ to HTML compares to, let's say, HTML Agility Pack, in terms of how well it parses messed-up markup?
Jim Brissom
+1 @Jim Brissom - just what I was about to ask :)
Oded
@Jim Brissom Good point, I don't actually. I will add text to clarify.
BioBuckyBall
@Lucas Heneks - The page you link to claims that it is _not_ based on Linq2Xml but is _like_ it, and that it _does_ handle malformed HTML.
Oded
@Oded I guess I should read my own links, shame on me.
BioBuckyBall
lol... However, I would still like to know how it compares...
Oded
+2  A: 

Using RegEx to parse images in this way is a bad idea. See here for a good demonstration of why.

You can use an HTML parser such as the HTML Agility Pack to parse the HTML and query it using XPath syntax.

Oded
A: 

I don't know what your program does, but I'm guessing this is an example of something you would do in 5 minutes from the command line in Linux. You can download Windows versions of many of the same tools (sed, for instance) and save yourself the hassle of writing all that code.

Kendrick