ansaurus

Question

What is the REGEX to match this pattern in a html document in C#?

Answer 1

+3 A:

As often stated before, regular expressions are usually not the right tool for parsing HTML, XML, and friends; consider using an HTML or XML parsing library instead. If you really want to or have to use regular expressions, the following will match the content of the tags in many cases, but it can still fail in others (for example, when the tag carries attributes other than href or the link text contains nested tags).

<a href="[^"]*">(?<data>[^<]*)</a>

This expression will match all links whose href does not start with http://, which is the only obvious difference I can see between the links.

<a href="(?!http://)[^"]*">(?<data>[^<]*)</a>
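For illustration, here is a minimal sketch of how the negative-lookahead expression above might be used from C#. The class name `RelativeLinkText`, the method name `Extract`, and the input string are hypothetical; the input is only modelled on the sample markup shown in Answer 5, so treat this as a sketch rather than a tested solution for the real page.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RelativeLinkText
{
    // Same pattern as above: reject hrefs that start with http://,
    // capture the link text in the named group "data".
    static readonly Regex LinkPattern = new Regex(
        @"<a href=""(?!http://)[^""]*"">(?<data>[^<]*)</a>");

    // Returns the text of every link whose href does not start with http://.
    public static List<string> Extract(string html)
    {
        var result = new List<string>();
        foreach (Match m in LinkPattern.Matches(html))
            result.Add(m.Groups["data"].Value);
        return result;
    }

    static void Main()
    {
        // Hypothetical input, modelled on the sample markup in Answer 5.
        string html =
            @"<a href=""http://variableLink"">Joe Bloggs</a> now using " +
            @"<a href=""/variableLink/"">Important Data</a> on " +
            @"<a href=""/variableLink"">Important data 2</a>";

        foreach (string text in Extract(html))
            Console.WriteLine(text);
    }
}
```

Note that this still picks up every relative link in the document, not only those inside one particular span.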

Daniel Brückner 2009-05-27 11:26:14

Won't that also match 'Joe Bloggs' and any other link in the html document?

Matthew Rathbone 2009-05-27 11:27:47

That picks up Joe Bloggs too. He said he only wants the 2 "Important" points.

Matthew Flaschen 2009-05-27 11:28:10

Yes, I noticed that. I just don't know how he wants to differentiate between these links. Matthew will have to elaborate on the difference.

Daniel Brückner 2009-05-27 11:33:36

That second one does match the links, but it matches all similar links in the HTML doc. How would I restrict it to links within that specific span?

Matthew Rathbone 2009-05-27 11:44:12

+1 for pointing out that regex is not the right tool for the job. Esp. when the HTML isn't marked up semantically.

PatrikAkerstrand 2009-05-27 12:02:55

I ended up using a couple of regexes to extract the data, and your example above was the key one. Thanks!

Matthew Rathbone 2009-05-27 14:21:39

Answer 2

A:

Look up look-behind and look-ahead syntax for .NET and use that to look for the anchor tags in the HTML. This site may help you. As an alternative to regular expressions, you might consider using a System.Xml.XPath.XPathNavigator to address those nodes directly.
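A rough sketch of the XPathNavigator alternative, assuming the page is well-formed XML (XPathDocument will throw on typical tag-soup HTML). The class name `LabelLinks`, the method name `Extract`, the XPath expression, and the sample markup are all assumptions borrowed from Answer 5, not something this answer specifies.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

static class LabelLinks
{
    // Returns the text of every link inside the second or later child span
    // of the span whose id is "label". Requires well-formed XML input.
    public static List<string> Extract(string xhtml)
    {
        var doc = new XPathDocument(new StringReader(xhtml));
        XPathNavigator nav = doc.CreateNavigator();

        var result = new List<string>();
        // Assumed XPath, taken from the HtmlAgilityPack example in Answer 5.
        XPathNodeIterator links =
            nav.Select("//span[@id='label']/span[position()>=2]/a");
        while (links.MoveNext())
            result.Add(links.Current.Value);
        return result;
    }

    static void Main()
    {
        // Hypothetical, well-formed sample modelled on Answer 5.
        string xhtml =
            @"<span id=""label"">" +
            @"<span><a href=""http://variableLink"">Joe Bloggs</a> now using</span>" +
            @"<span>'<a href=""/variableLink/"">Important Data</a>'</span>" +
            @"<span>on <a href=""/variableLink"">Important data 2</a></span>" +
            @"</span>";

        foreach (string text in Extract(xhtml))
            Console.WriteLine(text);
    }
}
```

Because the XPath addresses the nodes by position inside the "label" span, the first link is skipped without any pattern matching on the href.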

John M Gant 2009-05-27 11:31:07

Answer 3

A:

My Regex is a little rusty but something along the lines of the following may help (although it will probably need some fine-tuning):

(?<=\<a href="/variableLink[/]?"\>)(.*)+(?=</a>)

Dave_Stott 2009-05-27 11:33:34

Answer 4

A:

  <a\shref.*?"/variableLink/?">(.*)</a>

First group contains the Name of the anchors. Tested with Expresso. Works on the sample text you've provided.
Update: works with Snippy too.

Regex regex = new Regex(@"<a\shref.*?""/variableLink/?"">(.*)</a>", RegexOptions.Multiline);
foreach (Match everyMatch in regex.Matches(sText))
{
  Console.WriteLine("{0}", everyMatch.Groups[1]);
}

Outputs:

Important Data
Important data 2

Gishu 2009-05-27 11:44:16

Answer 5

+4 A:

The code below uses HtmlAgilityPack. It prints the text of every link found in the second or later child span of the element with id "label". Of course, it's relatively simple to modify the XPath to do something a little different.

    HtmlDocument doc = new HtmlDocument();
    doc.Load(new StringReader(@"<span id=""label"">
<span>
<a href=""http://variableLink"">Joe Bloggs</a>
now using
</span>
<span>
'
<a href=""/variableLink/"">Important Data</a>
'
</span>
<span>
on
<a href=""/variableLink"">Important data 2</a>
</span>
</span>
"));
    HtmlNode root = doc.DocumentNode;

    HtmlNodeCollection anchors;
    anchors = root.SelectNodes("//span[@id='label']/span[position()>=2]/a/text()");
    IList<string> importantStrings;
    if(anchors != null)
    {
        importantStrings = new List<string>(anchors.Count);
        foreach(HtmlNode anchor in anchors)
            importantStrings.Add(((HtmlTextNode)anchor).Text);
    }
    else
        importantStrings = new List<string>(0);

    foreach(string s in importantStrings)
        Console.WriteLine(s);

Matthew Flaschen 2009-05-27 11:51:11

I know he asked for regex, but I fully agree regex would suck for this type of thing. Not impossible mind you, but damn near impossible to maintain because regexes for html and similar usually end up with craploads of escapes. Especially if you're looking to go beyond a single tag.

Kris 2009-05-27 12:01:00
