I have a string that contains HTML. I want to get all the href values from its hyperlinks using C#.
Target String
<a href="~/abc/cde" rel="new">Link1</a>
<a href="~/abc/ghq">Link2</a>
I want to get the values "~/abc/cde" and "~/abc/ghq".
views: 145
answers: 3
+2
A:
Use the HTML Agility Pack for parsing HTML. Right on their examples page they have an example of parsing some HTML for the href values:
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    // The XPath selects only <a> elements that carry an href attribute
    HtmlAttribute att = link.Attributes["href"];
    // Do stuff with attribute value (att.Value)
}
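For a fuller picture, here is a minimal sketch of the same approach applied to the string from the question. It assumes a recent HtmlAgilityPack release (where the document root is exposed as `DocumentNode`) is referenced through NuGet; the names `HrefExtractor` and `GetHrefs` are made up for illustration and are not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class HrefExtractor // hypothetical helper, not part of the library
{
    public static List<string> GetHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string rather than a file

        var hrefs = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            return hrefs;
        }

        foreach (HtmlNode link in links)
        {
            // The second argument is the fallback used if the attribute is missing
            hrefs.Add(link.GetAttributeValue("href", string.Empty));
        }
        return hrefs;
    }

    public static void Main()
    {
        string inputHtml =
            "<a href=\"~/abc/cde\" rel=\"new\">Link1</a>\n" +
            "<a href=\"~/abc/ghq\">Link2</a>";

        foreach (string href in GetHrefs(inputHtml))
        {
            Console.WriteLine(href);
        }
    }
}
```

Run against the two links from the question, this should print the two href values, one per line. The null check matters because iterating the result of `SelectNodes` directly throws when the document contains no matching links.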
womp
2010-04-12 16:56:06
+2
A:
Using a regex to parse HTML is not advisable (think of text in comments etc.).
That said, the following regex should do the trick. It also captures the link's inner HTML (in the linkHtml group) if you need it:
Regex regex = new Regex(@"\<a\s[^\<\>]*?href=(?<quote>['""])(?<href>((?!\k<quote>).)*)\k<quote>[^\>]*\>(?<linkHtml>((?!\</a\s*\>).)*)\</a\s*\>", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture);
for (Match match = regex.Match(inputHtml); match.Success; match=match.NextMatch()) {
Console.WriteLine(match.Groups["href"]);
}
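To show how the named groups fit together, here is a small self-contained sketch that runs the pattern above against the two links from the question. The class and method names (`HrefRegexDemo`, `ExtractLinks`) are hypothetical; only the pattern and options come from this answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefRegexDemo // hypothetical name, for illustration only
{
    // (?<quote>['"])      remembers which quote character opened the value
    // ((?!\k<quote>).)*   consumes characters until that same quote reappears
    // \k<quote>           is a backreference that matches the closing quote
    // ExplicitCapture     makes the unnamed (...) groups non-capturing
    private static readonly Regex LinkRegex = new Regex(
        @"\<a\s[^\<\>]*?href=(?<quote>['""])(?<href>((?!\k<quote>).)*)\k<quote>[^\>]*\>(?<linkHtml>((?!\</a\s*\>).)*)\</a\s*\>",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Returns (href, inner HTML) pairs for every link the pattern matches
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        foreach (Match match in LinkRegex.Matches(html))
        {
            links.Add(new KeyValuePair<string, string>(
                match.Groups["href"].Value,
                match.Groups["linkHtml"].Value));
        }
        return links;
    }

    public static void Main()
    {
        string inputHtml =
            "<a href=\"~/abc/cde\" rel=\"new\">Link1</a>\n" +
            "<a href=\"~/abc/ghq\">Link2</a>";

        foreach (KeyValuePair<string, string> link in ExtractLinks(inputHtml))
        {
            Console.WriteLine(link.Key + " -> " + link.Value);
        }
    }
}
```

Each `Match` exposes its named groups through `match.Groups["name"]`, so `href` holds the attribute value without the quotes and `linkHtml` holds whatever sits between the opening and closing tags.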
Lucero
2010-04-12 17:00:06
That's exactly what I was looking for. How do the groups work?
coure06
2010-04-12 18:09:18
I am trying the same thing for img src, but it's not working. Any idea? Regex srcs = new Regex(@"\<img\s[^\<\>]*?src=(?<quote>['""])(?<src>((?!\k<quote>).)*)\k<quote>[^\>]*\>(?<linkHtml>((?!\</img\s*\>).)*)\</img\s*\>", RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
coure06
2010-04-12 18:58:43
The `img` tag is an empty tag, so it has no content and no closing tag for the pattern to match. Try this: `\<img\s[^\<\>]*?src=(?<quote>['""])(?<src>((?!\k<quote>).)*)\k<quote>[^\>]*\>`
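As a rough sketch of how that shortened pattern could be used, the snippet below collects the src values from a string. The sample markup and the names `ImgSrcDemo` and `ExtractSrcs` are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImgSrcDemo // hypothetical name, for illustration only
{
    // Same idea as the link pattern, but it stops at the end of the <img ...> tag
    // because an img element has no inner content or closing tag to match.
    private static readonly Regex ImgRegex = new Regex(
        @"\<img\s[^\<\>]*?src=(?<quote>['""])(?<src>((?!\k<quote>).)*)\k<quote>[^\>]*\>",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    public static List<string> ExtractSrcs(string html)
    {
        var srcs = new List<string>();
        foreach (Match match in ImgRegex.Matches(html))
        {
            srcs.Add(match.Groups["src"].Value);
        }
        return srcs;
    }

    public static void Main()
    {
        // Made-up sample input: one self-closed tag and one single-quoted tag
        string inputHtml =
            "<img src=\"a.png\" alt=\"x\" />\n" +
            "<img alt='y' src='b.jpg'>";

        foreach (string src in ExtractSrcs(inputHtml))
        {
            Console.WriteLine(src);
        }
    }
}
```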
Lucero
2010-04-12 19:26:00
+1
A:
On my blog I wrote an article (C# Regex Linq: Extract an Html Node with Attributes of Varying Types) which might be of service to you. Here is a snippet of the regex (use the RegexOptions.IgnorePatternWhitespace option):
(?:<)(?<Tag>[^\s/>]+)        # Extract the tag name.
(?![/>])                     # Stop if /> is found
                             # -- Extract Attributes Key Value Pairs --
((?:\s+)                     # One to many spaces start the attribute
 (?<Key>[^=]+)               # Name/key of the attribute
 (?:=)                       # Equals sign needs to be matched, but not captured.
 (?([\x22\x27])              # If quotes are found
   (?:[\x22\x27])
   (?<Value>[^\x22\x27]+)    # Place the value into named Capture
   (?:[\x22\x27])
  |                          # Else no quotes
   (?<Value>[^\s/>]*)        # Place the value into named Capture
 )
)+                           # -- One to many attributes found!
This will give you every tag and you can filter out what is needed and target the attribute you want. HTH
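One possible way to do that filtering is sketched below. It assumes each pass through the attribute loop records exactly one Key capture and one Value capture, so the two capture collections can be walked in parallel. The names `TagAttributeDemo` and `GetAttributeValues` are hypothetical and do not come from the article.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagAttributeDemo // hypothetical name, for illustration only
{
    // The pattern from this answer; it needs RegexOptions.IgnorePatternWhitespace
    // so that the layout and the # comments are ignored.
    private const string Pattern = @"
(?:<)(?<Tag>[^\s/>]+)        # Extract the tag name.
(?![/>])                     # Stop if /> is found
((?:\s+)                     # One to many spaces start the attribute
 (?<Key>[^=]+)               # Name/key of the attribute
 (?:=)                       # Equals sign needs to be matched, but not captured.
 (?([\x22\x27])              # If quotes are found
   (?:[\x22\x27])
   (?<Value>[^\x22\x27]+)    # Place the value into named Capture
   (?:[\x22\x27])
  |                          # Else no quotes
   (?<Value>[^\s/>]*)        # Place the value into named Capture
 )
)+                           # -- One to many attributes found!";

    public static List<string> GetAttributeValues(string html, string tag, string attribute)
    {
        var values = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern, RegexOptions.IgnorePatternWhitespace))
        {
            // Keep only the tags we care about, e.g. "a"
            if (!string.Equals(m.Groups["Tag"].Value, tag, StringComparison.OrdinalIgnoreCase))
            {
                continue;
            }

            // A repeated group keeps every capture, in match order
            CaptureCollection keys = m.Groups["Key"].Captures;
            CaptureCollection vals = m.Groups["Value"].Captures;
            for (int i = 0; i < keys.Count && i < vals.Count; i++)
            {
                if (string.Equals(keys[i].Value.Trim(), attribute, StringComparison.OrdinalIgnoreCase))
                {
                    values.Add(vals[i].Value);
                }
            }
        }
        return values;
    }

    public static void Main()
    {
        string inputHtml =
            "<a href=\"~/abc/cde\" rel=\"new\">Link1</a>\n" +
            "<a href=\"~/abc/ghq\">Link2</a>";

        foreach (string href in GetAttributeValues(inputHtml, "a", "href"))
        {
            Console.WriteLine(href);
        }
    }
}
```

Calling `GetAttributeValues(inputHtml, "a", "href")` on the string from the question should yield the two href values; passing a different tag or attribute name reuses the same pattern for other lookups.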
OmegaMan
2010-04-12 17:18:44