How would I use regex to parse the following:

<b>HelloWorld</b>
<p>This is a test</p>
<a href="myUrl">Google</a>

All HTML tags need to be removed and the URLs extracted from hyperlink tags. The result should be:

HelloWorld
This is a test
myUrl
+1  A: 

You should use a parser for this. Regexes just won't do. You could use recursive regex patterns, but I don't think they're supported by the .NET regex engine.

Blixt
+7  A: 

I know that's not the answer you expect, but you shouldn't try parsing HTML with regular expressions. HTML is way too complicated to be parsed by regexes; all sorts of things can go wrong. It is very hard to write a regex that parses HTML reliably, and I'm not even sure it's possible.
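To illustrate one of the things that can go wrong, here is a minimal Python sketch. The function name `naive_strip_tags` and the second input string are made up for this illustration and are not from the question. It uses the tag-stripping pattern people usually reach for first, which works on the question's sample but breaks as soon as an attribute value contains a `>`:

```python
import re

# Naive pattern: "a '<', then anything that is not '>', then a '>'".
TAG_PATTERN = re.compile(r"<[^>]+>")


def naive_strip_tags(html):
    """Remove everything that looks like a tag (hypothetical helper)."""
    return TAG_PATTERN.sub("", html)


# Works on simple markup like the sample in the question.
print(naive_strip_tags("<b>HelloWorld</b>"))

# Breaks when a quoted attribute value contains '>': the match ends
# inside the title attribute and leftover markup leaks into the text.
print(repr(naive_strip_tags('<a title="a > b" href="x">link</a>')))
```

The second call leaves part of the tag behind in the output. Comments, script blocks and unquoted attributes cause similar trouble, and each one needs yet another special case in the pattern.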

Use something like Beautiful Soup (Python) or the HTML Agility Pack (.NET) instead. Or you can create your own parser with a parser generator.
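As a rough sketch of the parser approach, here is the question's sample run through Python's standard-library `html.parser`. It is only a stand-in for Beautiful Soup or the HTML Agility Pack, whose APIs differ. The class name `TextAndLinkExtractor` and the helper `extract` are invented for this example, and it assumes each link should be replaced by its `href` rather than its link text, as in the expected output of the question:

```python
from html.parser import HTMLParser


class TextAndLinkExtractor(HTMLParser):
    """Collects visible text, replacing each <a> element with its href."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._anchor_depth = 0  # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1
            href = dict(attrs).get("href")
            if href:
                self.items.append(href)

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1

    def handle_data(self, data):
        # Skip whitespace-only runs and the link text inside <a>...</a>.
        text = data.strip()
        if text and self._anchor_depth == 0:
            self.items.append(text)


def extract(html):
    """Return text fragments and link targets in document order."""
    parser = TextAndLinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


sample = """<b>HelloWorld</b>
<p>This is a test</p>
<a href="myUrl">Google</a>"""

print("\n".join(extract(sample)))
```

For the sample this prints HelloWorld, This is a test and myUrl on separate lines. The parser does the tokenizing, so the code only has to decide what to do with each tag and text node.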

DrJokepu
The futility of telling people "don't do HTML with regex" never ceases to amaze me. Stack Overflow is full of this advice, as is the rest of the internet. As if no-one ever reads or believes it. Anyway, you have my vote. :)
Tomalak
Tomalak: A lot of areas covered on Stack Overflow have these typical recurring questions, and that's why having per-tag FAQs on Stack Overflow would be great. http://stackoverflow.uservoice.com/pages/1722-general/suggestions/138261-allow-a-per-tag-home-faq-page
DrJokepu
Nobody reads an FAQ; that's more or less a fact. If people read or googled before asking, the number of questions per day would drop drastically.
Tomalak
Tomalak: That's true, but as far as I understand, the idea is to have a well-written piece of text we can direct the occasional question asker to, instead of having to explain it all every time or look up a similar question with a good answer.
DrJokepu
I guess that people would go for the rep rather than pointing the OP to an FAQ. If anyone in the thread solves the immediate problem of the OP, their answer will be accepted instead of a "boring" FAQ pointer, however correct it may be.
Tomalak
Maybe the solution is to have some sort of canned-faq answer? So instead of a pointer to the FAQ, there could be a way to import a specific answer just by typing (e.g.) [faq:html-parsing/why-not-regex]
Peter Boughton
I guess it would also help if there was a consistent HTML parser available on many platforms that would allow people to say "here's an xpath/selector/etc, use it with X", rather than having to refer to the assorted ones, and not being able to give a single answer as easily?
Peter Boughton
DrJokepu: We don't need SO to create FAQ sections for us. I have already started two questions to deal with this issue (http://stackoverflow.com/questions/701166 and http://stackoverflow.com/questions/773340) as well as a canned response that I modify for each user (http://tr.im/oh5k). My impression (I need to go back and check the facts) is that about half the people say "oh, yeah, that makes sense" and I get up-voted. The other half get defensive and I get down-voted. Luckily the +10/-2 difference in votes means I couldn't care less about people who want to stay ignorant.
Chas. Owens