I have a bunch of HTML and I need to get all the anchors and their values using a regular expression.

This is a sample of the HTML that I need to process:

<P align=center><SPAN style="FONT-FAMILY: Arial; FONT-SIZE: 10px"><SPAN style="COLOR: #666666">View the </SPAN><A href="http://www.google.com"&gt;&lt;SPAN style="COLOR: #666666">online version</SPAN></A><SPAN style="COLOR: #666666"> if you are having trouble <A name=hi>displaying </A>this <a name="msg">message</A></SPAN></SPAN></P>

So, I need to be able to get all <A name="blah"> tags.

Any help is greatly appreciated.

+3  A: 

As hundreds of other answers on Stack Overflow suggest, it's a bad idea to use regex for processing HTML. Use an HTML parser instead.

But if you still need a regex to find the href URLs, below is a regex you can use to match hrefs and extract their values:

\b(?<=(href="))[^"]*?(?=")
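A minimal sketch of how that pattern could be applied from C#, assuming quoted href values; the class and method names (`HrefExtractor`, `ExtractHrefs`) are made up for illustration and are not part of the answer:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // The lookbehind (?<=(href=")) positions the match right after href=",
    // [^"]*? consumes the value, and the lookahead (?=") stops before the closing quote.
    private static readonly Regex HrefPattern =
        new Regex(@"\b(?<=(href=""))[^""]*?(?="")", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns every quoted href value found in the markup.
    public static List<string> ExtractHrefs(string html)
    {
        var result = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            result.Add(m.Value);
        }
        return result;
    }

    public static void Main()
    {
        string html = "<A href=\"http://www.google.com\">online version</A> <a name=\"msg\">message</A>";
        foreach (string href in ExtractHrefs(html))
        {
            Console.WriteLine(href); // prints http://www.google.com
        }
    }
}
```

Note that the leading `\b` requires the value to start with a word character, so a relative URL beginning with `/` or `#` would not be matched by this pattern as written.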

If you want to get the contents between <A> and </A>, then regex is really a bad approach, because lookbehind in many regex flavors cannot contain variable-length patterns (the .NET engine is an exception).

Gopi
Thanks for your help!
mob1lejunkie
A: 

The pattern is <a.*?(?<attribute>href|name)="(?<value>.*?)".*?>

So your C# code will be

Regex expression = new Regex("<a.*?(?<attribute>href|name)=\"(?<value>.*?)\".*?>", RegexOptions.IgnoreCase);
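To read the results back, the named groups can be pulled out of each match. This is only a sketch: `AnchorAttributeDemo` and `Extract` are illustrative names, and the sample markup is a shortened version of the question's HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnchorAttributeDemo
{
    // Pattern copied from the answer; the named groups are "attribute" and "value".
    private static readonly Regex Expression = new Regex(
        "<a.*?(?<attribute>href|name)=\"(?<value>.*?)\".*?>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns (attribute, value) pairs in document order.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in Expression.Matches(html))
        {
            pairs.Add(new KeyValuePair<string, string>(
                m.Groups["attribute"].Value,
                m.Groups["value"].Value));
        }
        return pairs;
    }

    public static void Main()
    {
        string html = "<A href=\"http://www.google.com\">online version</A> this <a name=\"msg\">message</A>";
        foreach (var pair in Extract(html))
        {
            // Prints "href = http://www.google.com" and then "name = msg".
            Console.WriteLine("{0} = {1}", pair.Key, pair.Value);
        }
    }
}
```

Because the pattern requires a double-quoted value, an unquoted attribute such as `<A name=hi>` in the question's sample is not reported by it.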
ajay_whiz
I'd also recommend using `RegexOptions.Compiled`.
abatishchev
`.*?` is too permissive. If there were (for example) an `<abbr>` tag at the beginning of the OP's sample text, your regex would match all the way from there to the first anchor. I'd use `[^<>]*?` instead.
Alan Moore
@Alan which place are you talking about?
ajay_whiz
It's the first `.*?` that causes the problem I described, but you should replace all of them. `[^<>]*?` can never match beyond the tag's enclosing angle brackets, but `.*?` can, despite the reluctant quantifier.
Alan Moore
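The difference described in the comments above can be seen in a small comparison. This is a sketch only: the `<abbr>` markup is an invented example in the spirit of Alan Moore's comment, and `TightenedAnchorPattern` and `FirstMatch` are illustrative names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TightenedAnchorPattern
{
    // Original pattern from the answer.
    public static readonly Regex Permissive = new Regex(
        "<a.*?(?<attribute>href|name)=\"(?<value>.*?)\".*?>",
        RegexOptions.IgnoreCase);

    // Same pattern with every .*? replaced by [^<>]*?, as suggested in the comments,
    // so no part of the match can run past the tag's own angle brackets.
    public static readonly Regex Tightened = new Regex(
        "<a[^<>]*?(?<attribute>href|name)=\"(?<value>[^<>]*?)\"[^<>]*?>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: text of the first match, or an empty string if none.
    public static string FirstMatch(Regex pattern, string html)
    {
        Match m = pattern.Match(html);
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        string html = "<abbr title=\"x\">HTML</abbr> <a name=\"msg\">message</a>";

        // Starts at "<abbr" and runs through the anchor's opening tag.
        Console.WriteLine(FirstMatch(Permissive, html));

        // Matches only: <a name="msg">
        Console.WriteLine(FirstMatch(Tightened, html));
    }
}
```

Even the tightened version treats any tag whose name starts with `a` as a candidate, so an `<abbr name="...">` tag would still match; it only guarantees that a match stays inside a single tag.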
A: 

Don't forget to add a reference to Microsoft.mshtml.dll

using System;
using System.IO;
using System.Linq;
using System.Windows.Forms;

namespace WindowsFormsApplication1
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();

            string html = "<P align=center><SPAN style=\"FONT-FAMILY: Arial; FONT-SIZE: 10px\"><SPAN style=\"COLOR: #666666\">View the </SPAN><A href=\"http://www.google.com\"&gt;&lt;SPAN style=\"COLOR: #666666\">online version</SPAN></A><SPAN style=\"COLOR: #666666\"> if you are having trouble <A name=hi>displaying </A>this <a name=\"msg\">message</A></SPAN></SPAN></P>";
            // Write the markup to a uniquely named .html file in the temp directory.
            string fileName = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName() + ".html");
            System.IO.File.WriteAllText(fileName, html);

            var browser = new WebBrowser();
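            // DocumentCompleted fires once loading has finished; Navigated fires when loading has only begun.
            browser.DocumentCompleted += browser_DocumentCompleted;
            browser.Navigate(new Uri(fileName));
        }

        private void browser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)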
        {
            var browser = (WebBrowser)sender;
            // Links holds the document's hyperlinks; ToList() forces the lazy LINQ query to run.
            var links = browser
                        .Document
                        .Links
                        .OfType<HtmlElement>()
                        .Select(l => ((mshtml.HTMLAnchorElement)l.DomElement).href)
                        .ToList();
                        // result: { "http://www.google.com", .. }
        }
    }
}
abatishchev
This is just plain overkill
Jan Jongboom
@Jan: Absolutely not. A sane developer should never use regex to parse HTML: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 Also, the OP has a bunch of links; he cited just one example. So the best out-of-the-box .NET solution is the `System.Windows.Forms.WebBrowser` control. The best it can do is act as a wrapper around `mshtml`.
abatishchev
That's the same thing Gopi suggests, but creating a whole WebBrowser component, dumping a page to disk, then loading the page and extracting all the links is overkill. You can use any lightweight HTML parser for this.
Jan Jongboom
@Jan: An external parser is better, but it will do nearly the same and probably even more, I guess. Dumping to disk is ugly, I agree, but this sh** (WebBrowser) can't load a document any other way.
abatishchev