I'm looking for a regular expression that takes HTML input and removes all attributes from each tag while leaving the tag itself intact. For example, I want this...

<p class="test" id="TestParagraph">This is some test text right here.</p>

To become this...

<p>This is some test text right here.</p>

Any help would be much appreciated.

+2  A: 

Hi,

HTML is not a regular language, so you will run into issues when trying to parse it with regular expressions. As Greg noted above, you might want to look at an HTML parser to do this work for you.

Enjoy!

Doug
+1 for connecting the dots between "regular language" and "regular expression"
azatoth
+5  A: 

You really don't want to use regex for this. HTML is not a regular language, and you cannot guarantee that your actual text won't mimic the tags and be stripped as well. Whatever expression you come up with, there will always be cases that break it.

I would suggest using the Html Agility Pack for any HTML manipulation that you need to do.
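
To give a rough idea of what the parser-based approach looks like, here is a minimal sketch. It is an illustration only and does not use the Html Agility Pack (a .NET library); it assumes Python and its standard-library `html.parser` module instead. The names `AttributeStripper` and `strip_attributes` are made up for this example.

```python
from html.parser import HTMLParser


class AttributeStripper(HTMLParser):
    """Re-emit the parsed HTML with every attribute dropped from start tags."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is ignored on purpose: only the tag name is written back.
        self.parts.append(f"<{tag}>")

    def handle_startendtag(self, tag, attrs):
        self.parts.append(f"<{tag} />")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text content is passed through untouched.
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def strip_attributes(html):
    """Hypothetical helper: return html with all tag attributes removed."""
    parser = AttributeStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_attributes(
    '<p class="test" id="TestParagraph">This is some test text right here.</p>'
))
```

Because the parser knows which characters are markup and which are text, content such as `id=something` inside a paragraph, or a tag that appears inside a comment, is left alone. Note that this particular parser lowercases tag names when it reports them.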

womp
Could you please elaborate on "you cannot guarantee that your actual text won't mimic the tags"?
Greg
The content might contain text of the form "id=something", which your regex may strip out, or it might contain an HTML comment. Ultimately it's likely that you could build a regex that will work 99.99% of the time, but I would argue that it is never the correct approach.
womp
I did a little research, downloaded the Html Agility Pack this morning, and am working with it now. Thanks for the input.
huffmaster
A: 

Apologies for not not answering the question.

You can start with this

<(\S+)[^>]+>

replace with

<$1>

Of course, this would be easy to break if the input contains scripts or CDATA sections, or in any number of other cases. But it may be close enough for your input set.
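
One case worth checking before relying on it: when the expression above is run through Python's `re` module (used here only as a convenient test bed), the `\S+` group can also swallow the `/` of a closing tag, so `</p>` comes out as `</>`. The sketch below shows that behaviour next to a slightly tightened variant that only matches opening tags. The tightened pattern and the name `strip_attributes_regex` are assumptions made for this illustration, not part of the original suggestion, and the variant is still easy to break.

```python
import re

# The expression from above, written as a Python raw string.
ORIGINAL = re.compile(r"<(\S+)[^>]+>")

# Tightened variant (an assumption for this sketch): the tag name must start
# with a letter, so closing tags and comments are never matched, and the
# attribute part may be empty.
OPENING_TAG = re.compile(r"<([A-Za-z][^\s/>]*)[^>]*>")


def strip_attributes_regex(html):
    """Hypothetical helper: rewrite every opening tag as just <name>."""
    return OPENING_TAG.sub(r"<\1>", html)


sample = '<p class="test" id="TestParagraph">This is some test text right here.</p>'

print(ORIGINAL.sub(r"<\1>", sample))   # closing tag becomes </>
print(strip_attributes_regex(sample))  # closing tag is left alone

# Still fragile: a ">" inside an attribute value ends the match too early.
print(strip_attributes_regex('<a title="a>b">link</a>'))
```

The variant also turns a self-closing tag such as `<br />` into `<br>`, and it has no notion of scripts, comments, or CDATA sections, so the caveats above still apply.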

harpo
If the OP decides to do the wrong thing, they should at least use a better expression than that. Drop the unwanted escapes and simplify the tag name, and you get `<(\S+)[^>]+>`, which is much more readable.
Peter Boughton
@Peter, all right then.
harpo