How can I write a regular expression to replace links with no link text like this:

<a href="http://www.somesite.com"></a>

with

<a href="http://www.somesite.com">http://www.somesite.com</a>

?

This is what I was trying to do to capture the matches, and it isn't catching any. What am I doing wrong?

string pattern = "<a\\s+href\\s*=\\s*\"(?<href>.*)\">\\s*</a>";
+2  A: 

I could be wrong, but I think you simply need to change the quantifier within the href group to be lazy rather than greedy.

string pattern = @"<a\s+href\s*=\s*""(?<href>.*?)"">\s*</a>";

(I've also changed the type of the string literal to use @, for better readability.)

The rest of the regex appears fine to me. That you're not capturing any matches at all makes me think otherwise, but there could be a problem in the rest of the code (or even the input data - have you verified that?).
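To make the replacement step concrete, here is a minimal sketch of how the lazy pattern might be used with Regex.Replace. The class name EmptyLinkFiller, the method Fill, and the sample input are hypothetical and not from the question; the replacement string simply reuses the named group href.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmptyLinkFiller
{
    // Lazy quantifier (.*?): take as few characters as possible for the href value.
    private const string Pattern = @"<a\s+href\s*=\s*""(?<href>.*?)"">\s*</a>";

    // Hypothetical helper: rewrites empty links so the URL doubles as the link text.
    public static string Fill(string html)
    {
        // ${href} inserts the text captured by the named group.
        return Regex.Replace(html, Pattern, "<a href=\"${href}\">${href}</a>");
    }

    public static void Main()
    {
        string html = "<a href=\"http://www.somesite.com\"></a>";
        // prints: <a href="http://www.somesite.com">http://www.somesite.com</a>
        Console.WriteLine(Fill(html));
    }
}
```

One caveat: `.` also matches quote characters, so when a link that already has text precedes an empty one on the same line, `.*?` can still stretch across both tags. Using `[^"]*` for the href value instead would stop at the first closing quote.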

Noldorin
+1  A: 

I would suggest

string pattern = "(<a\\b[^>]*href=\"([^\"]+)\"[^>]*>)[\\s\\r\\n]*(</a>)";

This way, links whose href attribute is not the first attribute in the tag are matched as well.

Replace with

"$1$2$3"

The usual word of warning: HTML and regex are essentially incompatible. Use with caution; this might blow up.
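As a rough sketch of how the pattern and the replacement string fit together, assuming the HTML is already in a string (the class name EmptyAnchorRegex, the method Fill, and the sample input are made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmptyAnchorRegex
{
    // Group 1: the whole opening tag, group 2: the href value, group 3: the closing tag.
    private const string Pattern =
        "(<a\\b[^>]*href=\"([^\"]+)\"[^>]*>)[\\s\\r\\n]*(</a>)";

    // Hypothetical helper: puts the href value between the opening and closing tags.
    public static string Fill(string html)
    {
        return Regex.Replace(html, Pattern, "$1$2$3");
    }

    public static void Main()
    {
        // Hypothetical input where href is not the first attribute.
        string html = "<a class=\"ext\" href=\"http://www.somesite.com\"></a>";
        // prints: <a class="ext" href="http://www.somesite.com">http://www.somesite.com</a>
        Console.WriteLine(Fill(html));
    }
}
```

Because `[^>]*` cannot cross a `>`, an anchor that already has link text does not match and is left untouched.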

Tomalak
+8  A: 

I wouldn't use a regex - I'd use the Html Agility Pack, and a query like:

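// doc is an already-loaded HtmlAgilityPack.HtmlDocument.
var links = doc.DocumentNode.SelectNodes("//a[.='']");
if (links != null) {  // SelectNodes returns null when nothing matches
    foreach (HtmlNode link in links)
        link.InnerHtml = link.GetAttributeValue("href", "");
}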
Marc Gravell
+1 for my daily dose of learning something new.
womp
+1 for avoiding regex shallows.
Tomalak
A: 

Marc Gravell has the right answer: regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

Chas. Owens