ansaurus

Question

Get "Title" attribute from html link using Regex

Answer 1

+4 A:

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

In particular you may be interested in the HTMLAgilityPack answer.

Chas. Owens 2009-05-12 15:37:17

Yes, I can see that regex isn't particularly good at parsing HTML, which is why (along with my lack of regex knowledge) I am struggling with this. Bearing in mind that I can't guarantee this app has XHTML, can you recommend a good C# parser that would achieve the above?

Sheff 2009-05-12 15:44:39

Sorry, I missed your link to the HTMLAgilityPack answer. I'll take a look, thanks.

Sheff 2009-05-12 15:48:50

Answer 2

+1 A:

Try this:

Regex reg = new Regex("<a[^>]*?title=\"([^\"]*?)\"[^>]*?>");

A couple of gotchas:

This match is case-sensitive; you may want to adjust that
This expects that the title attribute both exists and is quoted
- Of course, if the title attribute doesn't exist, you probably don't want the match anyway?

To extract the value, use the Groups collection:

string title = reg.Match("<a href=\"#\" title=\"Hello\">Howdy</a>").Groups[1].Value;
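
As a rough sketch of how the gotchas above might be handled, the snippet below uses a variant of this pattern (with the capture group's closing parenthesis in place), turns on case-insensitive matching, and checks Match.Success before reading the group. The class name LinkTitleRegex and the method TryExtractTitle are made up for illustration; only Regex, RegexOptions and Match come from System.Text.RegularExpressions.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class LinkTitleRegex
{
    // Case-insensitive so that <A TITLE="..."> is matched as well.
    private static readonly Regex TitlePattern = new Regex(
        "<a[^>]*?title=\"([^\"]*?)\"[^>]*?>",
        RegexOptions.IgnoreCase);

    // Returns true and sets title when a quoted title attribute is found;
    // otherwise returns false and sets title to an empty string.
    public static bool TryExtractTitle(string html, out string title)
    {
        Match match = TitlePattern.Match(html);
        title = match.Success ? match.Groups[1].Value : string.Empty;
        return match.Success;
    }
}
```

A caller could then fall back to other sources (image alt, link text) when TryExtractTitle returns false. Note that the pattern is still loose: it would also accept tags that merely start with "<a", such as <abbr title="...">, which is one more reason to prefer a real HTML parser.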

Goyuix 2009-05-12 15:41:43

Unfortunately, I do want a match when no title attribute is present. The content in this particular CMS is very poor quality HTML, so if the title isn't present I need to check the image alt and then the link text.

Sheff 2009-05-12 15:46:40

Answer 3

A:

Thanks to Chas. Owens for pointing me towards the HtmlAgilityPack library; it's great. In the end I used it to sort out my problem, as below. I would definitely recommend this library to others.

    // Requires the HtmlAgilityPack library (using HtmlAgilityPack;)
    HtmlDocument htmldoc = new HtmlDocument();
    htmldoc.LoadHtml(content);
    HtmlNodeCollection linkNodes = htmldoc.DocumentNode.SelectNodes("//a[@href]");
    if (linkNodes != null)
    {
        foreach (HtmlNode linkNode in linkNodes)
        {
            string linkTitle = linkNode.GetAttributeValue("title", string.Empty);
            // If no title attribute exists, check for an image alt attribute
            if (linkTitle == string.Empty)
            {
                HtmlNode imageNode = linkNode.SelectSingleNode("img[@alt]");
                if (imageNode != null)
                {
                    linkTitle = imageNode.GetAttributeValue("alt", string.Empty);
                }
            }
            // If no image alt attribute exists, check for a span with text
            if (linkTitle == string.Empty)
            {
                HtmlNode spanNode = linkNode.SelectSingleNode("span");
                if (spanNode != null)
                {
                    linkTitle = spanNode.InnerText;
                }
            }

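            // If still empty, fall back to the link's own text. Text nodes count as
            // child nodes in HtmlAgilityPack, so a HasChildNodes check would skip them.
            if (linkTitle == string.Empty)
            {
                linkTitle = linkNode.InnerText.Trim();
            }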

        }
    }

Sheff 2009-05-13 16:40:40
