I really can't work out how best to do this. I can write fairly simple regular expressions, but the more complex ones really stump me.

The following appears in specific HTML documents:

<span id="label">
<span>
<a href="http://variableLink">Joe Bloggs</a>
now using
</span>
<span>
'
<a href="/variableLink/">Important Data</a>
'
</span>
<span>
on
<a href="/variableLink">Important data 2</a>
</span>
</span>

I need to extract the two 'important data' points and could spend hours working out the regex to do it. (I'm using the .NET Regex library in C# 3.5.)

+3  A: 

As has often been stated before, regular expressions are usually not the right tool for parsing HTML, XML, and friends - think about using an HTML or XML parsing library instead. If you really want to or have to use regular expressions, the following will match the content of the tags in many cases, but may still fail in others.

<a href="[^"]*">(?<data>[^<]*)</a>

The following expression matches only links whose href does not start with http:// - this is the only obvious difference I can see between the links.

<a href="(?!http://)[^"]*">(?<data>[^<]*)</a>
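
As a rough sketch of how the second expression might be used from C#, the snippet below runs it over an abbreviated copy of the sample markup and reads the named group `data` from each match. The class name `LinkTextExtractor`, the helper `ExtractLinkText`, and the hard-coded input are made up for illustration. It picks up every non-http:// link in whatever text it is given, not only those inside the "label" span.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkTextExtractor
{
    // Anchors whose href does not start with "http://"; the link text is
    // captured in the named group "data".
    private static readonly Regex RelativeLink = new Regex(
        @"<a href=""(?!http://)[^""]*"">(?<data>[^<]*)</a>");

    // Hypothetical helper: returns the text of every matching link in html.
    public static List<string> ExtractLinkText(string html)
    {
        List<string> result = new List<string>();
        foreach (Match match in RelativeLink.Matches(html))
        {
            result.Add(match.Groups["data"].Value);
        }
        return result;
    }

    public static void Main()
    {
        // Abbreviated copy of the sample from the question.
        string html =
            @"<span id=""label"">
<span><a href=""http://variableLink"">Joe Bloggs</a> now using</span>
<span>'<a href=""/variableLink/"">Important Data</a>'</span>
<span>on <a href=""/variableLink"">Important data 2</a></span>
</span>";

        foreach (string text in ExtractLinkText(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

Run against the sample, this prints the two "Important" strings and skips the Joe Bloggs link, because its href starts with http://.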
Daniel Brückner
Won't that also match 'Joe Bloggs' and any other link in the html document?
Matthew Rathbone
That picks up Joe Bloggs too. He said he only wants the 2 "Important" points.
Matthew Flaschen
Yes, I noticed that. I just don't know how he wants to differentiate between these links. Matthew will have to elaborate on the difference.
Daniel Brückner
That second one does match the links, but it matches all similar links in the HTML doc. How would I restrict it to links within that specific span?
Matthew Rathbone
+1 for pointing out that regex is not the right tool for the job. Esp. when the HTML isn't marked up semantically.
PatrikAkerstrand
I ended up using a couple of regexes to extract the data, and your example above was the key one. Thanks!
Matthew Rathbone
A: 

Look up look-behind and look-ahead syntax for .NET and use that to look for the anchor tags in the HTML. This site may help you. As an alternative to regular expressions, you might consider using a System.Xml.XPath.XPathNavigator to address those nodes directly.
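
A minimal sketch of the XPathNavigator route, assuming the fragment is well-formed XML (XPathDocument throws on markup that is not, which real-world HTML often isn't). The class and method names are invented for the example, and the XPath expression is only one guess at how to single out the wanted links: it takes the anchors in the second and later inner spans of the "label" span.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class LabelLinkReader
{
    // Hypothetical helper: returns the text of the anchors found in the
    // second and later child spans of <span id="label">.
    public static List<string> ReadImportantLinks(string xml)
    {
        // XPathDocument requires well-formed XML and throws otherwise.
        XPathDocument document = new XPathDocument(new StringReader(xml));
        XPathNavigator navigator = document.CreateNavigator();

        XPathNodeIterator links = navigator.Select(
            "//span[@id='label']/span[position()>=2]/a");

        List<string> result = new List<string>();
        while (links.MoveNext())
        {
            // Value is the concatenated text content of the <a> element.
            result.Add(links.Current.Value);
        }
        return result;
    }

    public static void Main()
    {
        // Abbreviated copy of the sample from the question.
        string xml =
            @"<span id=""label"">
<span><a href=""http://variableLink"">Joe Bloggs</a> now using</span>
<span>'<a href=""/variableLink/"">Important Data</a>'</span>
<span>on <a href=""/variableLink"">Important data 2</a></span>
</span>";

        foreach (string text in ReadImportantLinks(xml))
        {
            Console.WriteLine(text);
        }
    }
}
```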

John M Gant
A: 

My Regex is a little rusty but something along the lines of the following may help (although it will probably need some fine-tuning):

(?<=\<a href="/variableLink[/]?"\>)(.*?)(?=</a>)
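
For what it's worth, here is a hedged sketch of how a look-around pattern of this kind could be driven from C#. Because the opening and closing tags sit inside the look-behind and look-ahead, the whole match is just the link text, so no capture group is needed. The pattern is a variant that uses `[^<]*` for the link text, and the names `LookaroundDemo` and `ExtractText` are invented for the example. Like the expression above, it only matches hrefs that literally read /variableLink.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // The look-behind requires the opening tag just before the match and the
    // look-ahead requires </a> just after it; neither is part of Match.Value.
    private static readonly Regex LinkText = new Regex(
        @"(?<=<a href=""/variableLink/?"">)[^<]*(?=</a>)");

    // Hypothetical helper: returns the text of every matching anchor.
    public static List<string> ExtractText(string html)
    {
        List<string> result = new List<string>();
        foreach (Match match in LinkText.Matches(html))
        {
            result.Add(match.Value);
        }
        return result;
    }

    public static void Main()
    {
        string html =
            @"<span><a href=""http://variableLink"">Joe Bloggs</a> now using</span>
<span>'<a href=""/variableLink/"">Important Data</a>'</span>
<span>on <a href=""/variableLink"">Important data 2</a></span>";

        foreach (string text in ExtractText(html))
        {
            Console.WriteLine(text);
        }
    }
}
```
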
Dave_Stott
A: 
  <a\shref.*?"/variableLink/?">(.*)</a>

The first group contains the text of each anchor. Tested with Expresso. Works on the sample text you've provided.
Update: works with Snippy too.

Regex regex = new Regex(@"<a\shref.*?""/variableLink/?"">(.*)</a>", RegexOptions.Multiline);
foreach (Match everyMatch in regex.Matches(sText))
{
  Console.WriteLine("{0}", everyMatch.Groups[1]);
}

Outputs:

Important Data
Important data 2
Gishu
+4  A: 

The code below uses HtmlAgilityPack. It prints the text of any link in the second or a later inner span within the element whose id is "label". Of course, it's relatively simple to modify the XPath to do something a little different.

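    // Requires: using System; using System.Collections.Generic; using System.IO; using HtmlAgilityPack;
    HtmlDocument doc = new HtmlDocument();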
    doc.Load(new StringReader(@"<span id=""label"">
<span>
<a href=""http://variableLink"">Joe Bloggs</a>
now using
</span>
<span>
'
<a href=""/variableLink/"">Important Data</a>
'
</span>
<span>
on
<a href=""/variableLink"">Important data 2</a>
</span>
</span>
"));
    HtmlNode root = doc.DocumentNode;

    HtmlNodeCollection anchors;
    anchors = root.SelectNodes("//span[@id='label']/span[position()>=2]/a/text()");
    IList<string> importantStrings;
    if(anchors != null)
    {
        importantStrings = new List<string>(anchors.Count);
        foreach(HtmlNode anchor in anchors)
            importantStrings.Add(((HtmlTextNode)anchor).Text);
    }
    else
        importantStrings = new List<string>(0);

    foreach(string s in importantStrings)
        Console.WriteLine(s);
Matthew Flaschen
I know he asked for regex, but I fully agree regex would suck for this type of thing. Not impossible mind you, but damn near impossible to maintain because regexes for html and similar usually end up with craploads of escapes. Especially if you're looking to go beyond a single tag.
Kris