ansaurus

Question

How can I write a regular expression to capture links with no link text?

Answer 1

+2 A:

I could be wrong, but I think you simply need to change the quantifier within the href group to be lazy rather than greedy.

string pattern = @"<a\s+href\s*=\s*""(?<href>.*?)"">\s*</a>";

(I've also switched to a verbatim string literal, prefixed with @, for better readability.)

The rest of the regex appears fine to me. That you're not capturing any matches at all suggests otherwise, though the problem could also lie in the rest of the code, or even in the input data (have you verified that?).
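As a minimal sketch of the difference, assuming the original pattern used a greedy .* inside the href group (the question's own pattern is not quoted here), the following compares the two quantifiers. The class name, the helper Hrefs, and the sample input are hypothetical and exist only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class EmptyLinkDemo
{
    // Returns the value of the named group "href" for every match of pattern in input.
    public static List<string> Hrefs(string pattern, string input)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            result.Add(m.Groups["href"].Value);
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical input: two empty links on one line.
        string html = @"<a href=""http://a.example""></a> <a href=""http://b.example""></a>";

        // Assumed original: greedy .* runs to the last "> that is followed by </a>.
        string greedy = @"<a\s+href\s*=\s*""(?<href>.*)"">\s*</a>";
        // Suggested fix: lazy .*? stops at the first position where the rest can match.
        string lazy = @"<a\s+href\s*=\s*""(?<href>.*?)"">\s*</a>";

        Console.WriteLine(Hrefs(greedy, html).Count); // 1: both links swallowed by one match
        Console.WriteLine(Hrefs(lazy, html).Count);   // 2: one match per link
    }
}
```

Note that the lazy version can still overshoot: if a link with text comes before an empty one, .*? keeps expanding past the first link until it reaches the empty one. Replacing .*? with [^""]* (anything except a double quote) would probably be the stricter choice.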

Noldorin 2009-05-09 20:34:11

Answer 2

+1 A:

I would suggest

string pattern = "(<a\\b[^>]*href=\"([^\"]+)\"[^>]*>)[\\s\\r\\n]*(</a>)";

This way, links whose href attribute is not the first attribute in the tag are captured as well.

Replace with

"$1$2$3"

The usual word of warning: HTML and regex are essentially incompatible. Use with caution, this might blow up.
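For illustration, here is a rough sketch of how the pattern and replacement above might be applied with Regex.Replace. The class name FillEmptyLinks, the method Fill, and the sample input are hypothetical; the pattern is the same one, only rewritten as a verbatim string.

```csharp
using System;
using System.Text.RegularExpressions;

static class FillEmptyLinks
{
    // Same pattern as above, written as a verbatim string:
    // $1 = opening <a ...> tag, $2 = href value, $3 = closing </a> tag.
    const string Pattern = @"(<a\b[^>]*href=""([^""]+)""[^>]*>)[\s\r\n]*(</a>)";

    // Inserts the href value as the link text of every empty link.
    public static string Fill(string html)
    {
        return Regex.Replace(html, Pattern, "$1$2$3");
    }

    public static void Main()
    {
        // Hypothetical input: an empty link with href as its second attribute,
        // followed by a link that already has text.
        string html = @"<a class=""x"" href=""http://a.example""></a> <a href=""http://b.example"">kept</a>";

        // The first link receives its href as text; the second is left untouched.
        Console.WriteLine(Fill(html));
    }
}
```

Because [^>]* cannot cross a closing angle bracket, a match stays within a single opening tag, so links that already have text are not affected.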

Tomalak 2009-05-09 20:34:19

Answer 3

+8 A:

I wouldn't use a regex - I'd use the Html Agility Pack, and a query like:

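// SelectNodes returns null when nothing matches, so guard before iterating.
var links = doc.DocumentNode.SelectNodes("//a[.='']");
if (links != null) foreach (HtmlNode link in links) {
    // Use the href value (or an empty string if it is missing) as the link text.
    link.InnerHtml = link.GetAttributeValue("href", "");
}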

Marc Gravell 2009-05-09 20:48:08

+1 for my daily dose of learning something new.

womp 2009-05-09 20:55:40

+1 for avoiding regex shallows.

Tomalak 2009-05-09 21:04:06

Answer 4

A:

Marc Gravell has the right answer: regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

Chas. Owens 2009-05-09 21:20:52
