Hi,

I've tried to strip HTML tags from Word-generated HTML using a regex replace with the pattern "<[^>]*>". The HTML looks like this:

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:st1="urn:schemas-microsoft-com:office:smarttags" xmlns="http://www.w3.org/TR/REC-html40">

<head> <meta http-equiv=Content-Type content="text/html; charset=iso-8859-2"> <meta name=Generator content="Microsoft Word 11 (filtered medium)"> <!--[if !mso]> <style>

v\:* {behavior:url(#default#VML);}

o\:* {behavior:url(#default#VML);}

w\:* {behavior:url(#default#VML);}

.shape {behavior:url(#default#VML);}

</style> <![endif]--><o:SmartTagType namespaceuri="urn:schemas-microsoft-com:office:smarttags" name="place" downloadurl="http://www.5iantlavalamp.com/"/> <!--[if !mso]> <style>

st1\:*{behavior:url(#default#ieooui) }

</style> <![endif]--> <style> <!-- /* Font Definitions */ @font-face {font-family:Tahoma; panose-1:2 11 6 4 3 5 4 4 2 4;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {margin:0in; margin-bottom:.0001pt; font-size:12.0pt; font-family:"Times New Roman";} a:link, span.MsoHyperlink {color:blue; text-decoration:underline;} a:visited, span.MsoHyperlinkFollowed {color:purple; text-decoration:underline;} span.EmailStyle17 {mso-style-type:personal; font-family:Arial; color:windowtext;} span.EmailStyle18 {mso-style-type:personal-reply; font-family:Arial; color:navy;} @page Section1 {size:8.5in 11.0in; margin:1.0in 1.25in 1.0in 1.25in;} div.Section1 {page:Section1;} --> </style>

</head>

Everything works fine except for the bolded lines above. Does anybody have an idea how to match them too?

Thanks,

Aleksandar

A: 

You can't use regular expressions to parse HTML (or XML, for that matter).

Williham Totland
He's not trying to parse it, he's trying to get rid of it. I happen to agree with you -- the OP is going to run into the same kind of trouble -- but that link doesn't effectively make that point.
Etaoin
Beat me to it :D
James Westgate
Depends on the situation. If the OP just has to clean out a few HTML files in a text editor, a simple regex or two may do the job just fine.
Jan Goyvaerts
A: 

People generally advise the use of a parser instead of regex when dealing with HTML.

If you have to use a regex :) you could use:

<style>.*?</style>
Jordan Stewart
+1  A: 

Your regex does not take into account that comments can contain > characters that do not terminate the comment. Try this regex:

<!--.*?-->|<[^>]*>

You'll have to turn on the option that makes . match line breaks. How to do that depends on the application or programming language you're using this regex with. E.g. in Perl you'd use the /s flag. In .NET you'd set RegexOptions.Singleline.
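
As a rough illustration, here is how the pattern above might be applied in Python (an assumption on my part, since the question doesn't name a language). Python's counterpart to the options mentioned is `re.DOTALL`. The `sample` string and the `strip_tags` helper are made up for this sketch; `sample` is only a shortened stand-in for the Word output.

```python
import re

# The comment alternative comes first, so a '>' inside a comment
# cannot end the match early. re.DOTALL lets '.' cross line breaks.
TAG_OR_COMMENT = re.compile(r"<!--.*?-->|<[^>]*>", re.DOTALL)


def strip_tags(html: str) -> str:
    """Remove HTML comments and tags, leaving all other text untouched."""
    return TAG_OR_COMMENT.sub("", html)


# Shortened, hypothetical stand-in for the Word-generated markup.
sample = (
    "<head><!--[if !mso]> <style>\n"
    "v\\:* {behavior:url(#default#VML);}\n"
    "</style> <![endif]--></head><p>Hello</p>"
)

print(strip_tags(sample))  # prints "Hello"
```

With the original pattern "<[^>]*>" alone, the CSS rule between the conditional-comment markers would be left behind in the output, which appears to be the problem described in the question.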

Jan Goyvaerts
*Your* regex doesn't take into account that attribute values of HTML tags can contain '>', as in `<img alt="<enter text here>">`
Williham Totland
My answer only explains why Aleksandar's regex doesn't do what he expects and only provides a solution for that specific problem on his specific example. There are a lot of things my regex doesn't take into account. If MS Word did not put its `<style>` tags inside comments then my regex would have the same problem as Aleksandar's. If you want to take everything into account, then you need a full HTML parser and knowledge about the meaning of specific tags (e.g. `<style>` and `<script>` tags do not contain displayable content).
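
A minimal sketch of that parser route, assuming Python and its standard-library `html.parser` module (neither is mentioned in the question). The `TextExtractor` class and `visible_text` helper are hypothetical names for this example. It is not a complete HTML-to-text converter; it only shows how a parser copes with quoted `>` characters and how the `<style>`/`<script>` knowledge can be encoded.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping the contents of <style> and <script>."""

    SKIP = {"style", "script"}  # elements whose content is not displayable

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments never arrive here; the parser routes them to
        # handle_comment, which this class leaves as a no-op.
        if not self.skip_depth:
            self.parts.append(data)


def visible_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


html = '<img alt="<enter text here>"><p>Hi</p><style>p {color:red}</style>'
print(visible_text(html))  # prints "Hi"
```

On the same input, the regex `<!--.*?-->|<[^>]*>` would stop at the first `>` inside the `alt` value and leave the CSS rule in place.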
Jan Goyvaerts