I have a string that contains HTML. I want to get all the href values from its hyperlinks using C#.
Target string:
<a href="~/abc/cde" rel="new">Link1</a>
<a href="~/abc/ghq">Link2</a>

I want to get the values "~/abc/cde" and "~/abc/ghq".

+2  A: 

Use the HTML Agility Pack for parsing HTML. Right on their examples page they have an example of parsing some HTML for the href values:

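 HtmlDocument doc = new HtmlDocument();
 doc.LoadHtml(html);   // html is the string containing the markup

 // Note: SelectNodes returns null when no node matches the XPath.
 foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
 {
    HtmlAttribute att = link.Attributes["href"];

    // Do stuff with att.Value
 }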
womp
+2  A: 

Using a regex to parse HTML is not advisable (think of text in comments etc.).

That said, the following regex should do the trick, and also gives you the link HTML in the tag if desired:

Regex regex = new Regex(@"\<a\s[^\<\>]*?href=(?<quote>['""])(?<href>((?!\k<quote>).)*)\k<quote>[^\>]*\>(?<linkHtml>((?!\</a\s*\>).)*)\</a\s*\>", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture);
for (Match match = regex.Match(inputHtml); match.Success; match=match.NextMatch()) {
  Console.WriteLine(match.Groups["href"]);
}
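
To illustrate how the named groups fit together, here is a minimal sketch that runs the same pattern over the two links from the question. `(?<quote>['"])` captures the opening quote character (written as `""` inside the verbatim string), `\k<quote>` is a backreference that requires the same character again, and `(?<href>...)` collects everything in between. `RegexOptions.ExplicitCapture` stops the unnamed parentheses from creating numbered groups. The list names `hrefs` and `linkTexts` are only illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Sample input: the two links from the question.
string inputHtml =
    "<a href=\"~/abc/cde\" rel=\"new\">Link1</a>\n" +
    "<a href=\"~/abc/ghq\">Link2</a>";

Regex regex = new Regex(
    @"\<a\s[^\<\>]*?href=(?<quote>['""])(?<href>((?!\k<quote>).)*)\k<quote>[^\>]*\>(?<linkHtml>((?!\</a\s*\>).)*)\</a\s*\>",
    RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

var hrefs = new List<string>();      // illustrative names, not from the answer
var linkTexts = new List<string>();

foreach (Match match in regex.Matches(inputHtml))
{
    // Each group is looked up by the name given in (?<name>...).
    string href = match.Groups["href"].Value;
    string text = match.Groups["linkHtml"].Value;
    hrefs.Add(href);
    linkTexts.Add(text);

    // The quote group holds whichever quote character opened the value.
    Console.WriteLine("href={0} text={1} quote={2}",
        href, text, match.Groups["quote"].Value);
}
```

The statements can be run as a C# 9 (or later) top-level program, or pasted into a method body on older compilers.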
Lucero
That's exactly what I was looking for. How do the groups work?
coure06
I am trying the same thing for img src, but it's not working. Any idea? Regex srcs = new Regex(@"\<img\s[^\<\>]*?src=(?<quote>['""])(?<src>((?!\k<quote>).)*)\k<quote>[^\>]*\>(?<linkHtml>((?!\</img\s*\>).)*)\</img\s*\>", RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
coure06
The `img` tag is an empty tag, so you have no contents. Try this: `\<img\s[^\<\>]*?src=(?<quote>['""])(?<src>((?!\k<quote>).)*)\k<quote>[^\>]*\>`
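
A quick way to check that shortened pattern is sketched below. The two `<img>` tags in `sampleHtml` are made up for illustration (they are not from the question), and the variable names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sample: one double-quoted and one single-quoted src attribute.
string sampleHtml =
    "<img src=\"~/images/logo.png\" alt=\"Logo\" />\n" +
    "<img alt='Icon' src='~/images/icon.gif'>";

// Same pattern as in the comment above: no closing tag and no content group,
// because img is an empty element.
Regex srcRegex = new Regex(
    @"\<img\s[^\<\>]*?src=(?<quote>['""])(?<src>((?!\k<quote>).)*)\k<quote>[^\>]*\>",
    RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

var srcs = new List<string>();
foreach (Match m in srcRegex.Matches(sampleHtml))
{
    srcs.Add(m.Groups["src"].Value);
    Console.WriteLine(m.Groups["src"].Value);
}
```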
Lucero
+1  A: 

On my blog I wrote an article (C# Regex Linq: Extract an Html Node with Attributes of Varying Types) which might be of service to you. Here is a snippet of the regex (use the RegexOptions.IgnorePatternWhitespace option):

(?:<)(?<Tag>[^\s/>]+)       # Extract the tag name.
(?![/>])                    # Stop if /> is found
# -- Extract Attributes Key Value Pairs  --

((?:\s+)             # One to many spaces start the attribute
 (?<Key>[^=]+)       # Name/key of the attribute
 (?:=)               # Equals sign needs to be matched, but not captured.

(?([\x22\x27])              # If quotes are found
  (?:[\x22\x27])
  (?<Value>[^\x22\x27]+)    # Place the value into named Capture
  (?:[\x22\x27])
 |                          # Else no quotes
   (?<Value>[^\s/>]*)       # Place the value into named Capture
 )
)+                  # -- One to many attributes found!

This will give you every tag and you can filter out what is needed and target the attribute you want. HTH
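
One way that filtering might look for the original question is sketched here. It assumes the pattern is compiled with `RegexOptions.IgnorePatternWhitespace` and that, within each match, the `Key` and `Value` captures line up by index, which holds for simple markup like the question's sample. The names `tagRegex`, `sample`, and `hrefs` are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// The pattern from above; IgnorePatternWhitespace allows the layout and # comments.
string pattern = @"
(?:<)(?<Tag>[^\s/>]+)       # Extract the tag name.
(?![/>])                    # Stop if /> is found
((?:\s+)                    # One to many spaces start the attribute
 (?<Key>[^=]+)              # Name/key of the attribute
 (?:=)                      # Equals sign needs to be matched, but not captured.
 (?([\x22\x27])             # If quotes are found
   (?:[\x22\x27])
   (?<Value>[^\x22\x27]+)   # Place the value into named Capture
   (?:[\x22\x27])
  |                         # Else no quotes
   (?<Value>[^\s/>]*)       # Place the value into named Capture
 )
)+                          # One to many attributes found
";

Regex tagRegex = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);

// Sample input: the two links from the question.
string sample =
    "<a href=\"~/abc/cde\" rel=\"new\">Link1</a>\n" +
    "<a href=\"~/abc/ghq\">Link2</a>";

var hrefs = new List<string>();
foreach (Match m in tagRegex.Matches(sample))
{
    // Keep only anchor tags.
    if (!string.Equals(m.Groups["Tag"].Value, "a", StringComparison.OrdinalIgnoreCase))
        continue;

    // Every attribute adds one capture to Key and one to Value.
    CaptureCollection keys = m.Groups["Key"].Captures;
    CaptureCollection values = m.Groups["Value"].Captures;
    for (int i = 0; i < keys.Count && i < values.Count; i++)
    {
        if (string.Equals(keys[i].Value.Trim(), "href", StringComparison.OrdinalIgnoreCase))
            hrefs.Add(values[i].Value);
    }
}

foreach (string href in hrefs)
    Console.WriteLine(href);
```

Because `[^=]+` accepts almost any character, the `Key` group can run across text on messier input, so the result is worth checking against real pages before relying on it.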

OmegaMan