I have the following regex to match all link tags on a page generated by our custom CMS:

<a\s+((?:(?:\w+\s*=\s*)(?:\w+|"[^"]*"|'[^']*'))*?\s*href\s*=\s*(?<url>\w+|"[^"]*"|'[^']*')(?:(?:\s+\w+\s*=\s*)(?:\w+|"[^"]*"|'[^']*'))*?)>.+?</a>

We are using C# to loop through all matches of this and add an onclick event to each link (for tracking software) before rendering the page content. I need to parse the link and add a parameter to the onclick function, which is the "link name".

I was going to modify the regex to get the following subgroups:

  • The title attribute of the link
  • If the link contains an image tag get the alt text of the image
  • The text of the link

I can then check the match of each subgroup to acquire the relevant name of the link.

How would I modify the above regex to do this, or could I achieve the same thing using C# code?

+4  A: 

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

In particular you may be interested in the HTMLAgilityPack answer.

Chas. Owens
Yes, I can see that regex isn't particularly great at parsing HTML, which is why (as well as my lack of regex knowledge) I am struggling with this. Bearing in mind I can't guarantee this app has XHTML, can you recommend a good C# parser that would achieve the above?
Sheff
Sorry, I missed your link to the HTMLAgilityPack. I'll take a look, thanks.
Sheff
+1  A: 

Try this:

Regex reg = new Regex("<a[^>]*?title=\"([^\"]*?)\"[^>]*?>");

A couple of gotchas:

  • This match is case-sensitive; you may want to adjust that (for example with RegexOptions.IgnoreCase)
  • This expects that the title attribute both exists and is quoted
    • Of course, if the title attribute doesn't exist, you probably don't want the match anyway?

To extract the title, use the Groups collection:

reg.Match("<a href=\"#\" title=\"Hello\">Howdy</a>").Groups[1].Value
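
As a rough sketch of how the two gotchas might be handled, the snippet below matches case-insensitively and checks Match.Success before reading the group, so a link without a title yields null instead of an empty string. The class and method names (TitleExtractor, GetTitle) are made up for illustration, and it still assumes a double-quoted title attribute.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleExtractor
{
    // Title pattern in the same shape as the one above, matched case-insensitively.
    private static readonly Regex TitleRegex = new Regex(
        "<a[^>]*?title=\"([^\"]*?)\"[^>]*?>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the title of the first matching <a> tag,
    // or null when no link with a quoted title attribute is found.
    public static string GetTitle(string html)
    {
        Match match = TitleRegex.Match(html);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

A null result is the signal to fall back to something else, such as the image alt text or the link text.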
Goyuix
Unfortunately, I do want to match when no title attribute is present. The content in this particular CMS is very poor-quality HTML, so if the title isn't present I need to check the image alt and then the link text.
Sheff
A: 

Thanks to Chas. Owens for pointing me towards the HtmlAgilityPack library; it's great. In the end I used it to solve my problem, as shown below. I would definitely recommend this library to others.

    HtmlDocument htmldoc = new HtmlDocument();
    htmldoc.LoadHtml(content);
    HtmlNodeCollection linkNodes = htmldoc.DocumentNode.SelectNodes("//a[@href]");
    if (linkNodes != null)
    {
        foreach (HtmlNode linkNode in linkNodes)
        {
            string linkTitle = linkNode.GetAttributeValue("title", string.Empty);
            //If no title attribute exists check for an image alt tag
            if (linkTitle == string.Empty)
            {
                HtmlNode imageNode = linkNode.SelectSingleNode("img[@alt]");
                if (imageNode != null)
                {
                    linkTitle = imageNode.GetAttributeValue("alt", string.Empty);
                }
            }
            //If no image alt tag check for span with text
            if (linkTitle == string.Empty)
            {
                HtmlNode spanNode = linkNode.SelectSingleNode("span");
                if (spanNode != null)
                {
                    linkTitle = spanNode.InnerText;
                }
            }

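            //If nothing else was found, fall back to the link's own text.
            //(A link with text always has a child text node, so testing
            //!HasChildNodes here would skip every link that has text.)
            if (linkTitle == string.Empty)
            {
                linkTitle = linkNode.InnerText.Trim();
            }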

        }
    }
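
Once linkTitle is known, it still has to be turned into the onclick value. A minimal sketch of that step follows, assuming a JavaScript tracking function called trackLink (a placeholder name; the real tracking call is not shown in this thread). It escapes only the characters that would end the JavaScript string or the surrounding double-quoted attribute and does not attempt full HTML encoding.

```csharp
using System;

public static class LinkTracking
{
    // Hypothetical name of the JavaScript tracking function; replace with the real one.
    private const string TrackingFunction = "trackLink";

    // Builds the onclick value for a link name, escaping characters that would
    // end the single-quoted JavaScript string or the double-quoted HTML attribute.
    public static string BuildOnClick(string linkName)
    {
        string escaped = (linkName ?? string.Empty)
            .Trim()
            .Replace("\\", "\\\\")   // backslash first, so later escapes are not doubled
            .Replace("'", "\\'")     // would close the JavaScript string
            .Replace("\"", "\\x22"); // would close the HTML attribute
        return TrackingFunction + "('" + escaped + "');";
    }
}
```

Inside the foreach loop, something like linkNode.SetAttributeValue("onclick", LinkTracking.BuildOnClick(linkTitle)) should attach it, and htmldoc.DocumentNode.OuterHtml then gives the modified markup to render.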
Sheff