ansaurus

Question

Matching a specific Tag using Regular Expression

Answer 1

+1 A:

I would advise you to use an HTML parser rather than regexps. It's going to be less error-prone for all but the simplest cases (HTML is not a regular language, so it is not a good fit for regular expressions).

Brian Agnew 2010-08-21 09:09:17

Brian, you are, of course, right. I've used the HTML Agility Pack parser in the past with great success: http://htmlagilitypack.codeplex.com/

Hristo Deshev 2010-08-21 09:10:22

I see. But can the HTML Agility Pack do the same task? I mean, is it as powerful as regex?

Joseph Ghassan 2010-08-21 09:15:35

@Joseph: Brace yourself for this news: a full-blown parser is _WAAAYYY_ more powerful than regex. _WAAAAYYYYYY_ more.

polygenelubricants 2010-08-21 09:22:23

Answer 2

A:

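You don't say whether the page contains other (unwanted) <a> tags, but to match every opening <a> tag you could try a regex like "<a[^>]*>". Be aware that this also matches other tags whose names start with "a", such as <abbr>; "<a\b[^>]*>" avoids that.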
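As a rough illustration of that suggestion, here is a minimal C# sketch. It uses "<a\b[^>]*>" rather than the pattern exactly as written, because without the \b word boundary the pattern would also match tags such as <abbr>. The class and method names (AnchorTagFinder, FindOpeningAnchors) are made up for this example, and the case-insensitive option is an assumption about what you want.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnchorTagFinder
{
    // Returns every opening <a ...> tag found in the input, in document order.
    // \b stops the pattern from matching tags like <abbr> or <address>.
    // IgnoreCase is an assumption, so that <A HREF=...> is picked up too.
    public static List<string> FindOpeningAnchors(string html)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(html, @"<a\b[^>]*>", RegexOptions.IgnoreCase))
        {
            results.Add(m.Value);
        }
        return results;
    }
}
```

Calling AnchorTagFinder.FindOpeningAnchors(pageSource) gives you the raw tag text only. It does not understand comments, scripts, or attribute values that contain ">", which is part of why a parser is the safer choice.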

Hristo Deshev 2010-08-21 09:09:41

Works great! Thanks. How can I do the same thing using the HTML Agility Pack? I mean, regex is more powerful, I guess.

Joseph Ghassan 2010-08-21 09:15:55

Hristo Deshev 2010-08-21 10:39:19

Answer 3

A:

Regex is not the best tool for the job, but you can in fact use it to match strings of this form:

<a href="News_ViewStory\.asp\?NewsID=\d{4}">

As a @-quoted C# string literal, this is:

@"<a href=""News_ViewStory\.asp\?NewsID=\d{4}"">"

The \d is the shorthand for the digit character class. {4} is exact finite repetition. Thus, \d{4} means "exactly 4 digits".

If you want to allow a different numeric pattern, you may use e.g. \d{2,6}. This allows anywhere between 2 and 6 digits, inclusive. You can also use \d+ to allow at least one digit, with no upper bound.
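To see how the three quantifiers differ in practice, here is a small sketch in C#. The helper names (NewsLinkPatterns, Build, Matches) are invented for illustration; only the pattern text comes from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NewsLinkPatterns
{
    // Builds the tag pattern around a digit sub-pattern such as \d{4}, \d{2,6} or \d+.
    // Inside an @-quoted string, "" stands for one literal double quote.
    public static Regex Build(string digits)
    {
        return new Regex(@"<a href=""News_ViewStory\.asp\?NewsID=" + digits + @""">");
    }

    // True when the given tag text contains a match for the built pattern.
    public static bool Matches(string digits, string tag)
    {
        return Build(digits).IsMatch(tag);
    }
}
```

With this helper, NewsLinkPatterns.Matches(@"\d{4}", tag) accepts a tag whose NewsID has exactly four digits and rejects three or five, \d{2,6} accepts anything from two to six digits, and \d+ accepts any non-empty run of digits.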

Note that the . and the ? are preceded by a backslash in the above pattern. That's because they are regex metacharacters with special meanings: the . matches (almost) any character, and the ? is the optional-repetition specifier. Escaping removes these special meanings, so they become a literal period and a literal question mark.
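A short sketch of what goes wrong without the backslashes, again in C#. The class and member names (EscapingDemo and so on) are hypothetical. Regex.Escape is the standard .NET helper that adds the backslashes for you when a piece of text is meant literally.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapingDemo
{
    // The pattern from the answer: . and ? are escaped, so they match literally.
    public const string Escaped = @"News_ViewStory\.asp\?NewsID=\d{4}";

    // Same text without the backslashes: . matches any character and
    // p? means "an optional p", so the literal ? in a real URL is never consumed.
    public const string Unescaped = @"News_ViewStory.asp?NewsID=\d{4}";

    public static bool EscapedMatches(string input)
    {
        return Regex.IsMatch(input, Escaped);
    }

    public static bool UnescapedMatches(string input)
    {
        return Regex.IsMatch(input, Unescaped);
    }

    // Escapes regex metacharacters in text that should be matched literally.
    public static string EscapeLiteral(string literal)
    {
        return Regex.Escape(literal);
    }
}
```

The unescaped pattern fails on the real URL "News_ViewStory.asp?NewsID=1234" yet happily matches a string like "News_ViewStoryXasNewsID=1234", which is the opposite of what you want.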

Whether or not the strings matching these patterns are exactly the HTML tags that you want is an entirely different issue. Parsing HTML with regex is generally not recommended.

polygenelubricants 2010-08-21 09:28:11
