I have a Python script that will look at an HTML file that has the following format:

<DOC>
<HTML>
...
</HTML>
</DOC>
<DOC>
<HTML>
...
</HTML>
</DOC>

How do I remove all HTML tags (replace them with '') except the opening and closing DOC tags, using regex in Python? Also, if I want to retain the alt text of an <img> tag, what should the regex look like?

+1  A: 

Search and replace with this regex. Search for: <.*?> Replace with: '' (the empty string). Note that, as written, this pattern also strips the <DOC> and </DOC> tags.
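
To keep the DOC tags and the alt text, one option is a negative lookahead plus a separate pass over <img> tags. The following is only a minimal sketch, not a robust HTML parser: the names NON_DOC_TAG, IMG_ALT and strip_tags are made up for illustration, and it assumes alt attributes are quoted and that no attribute value contains a literal '>'.

```python
import re

# Matches any tag except <DOC> and </DOC> (case-insensitive).
# The negative lookahead (?!/?DOC\b) rejects a match when the tag name is DOC.
NON_DOC_TAG = re.compile(r"<(?!/?DOC\b)[^>]*>", re.IGNORECASE)

# Matches an <img> tag and captures its quoted alt text in group 2.
IMG_ALT = re.compile(
    r"""<img\b[^>]*?\balt\s*=\s*(["'])(.*?)\1[^>]*>""",
    re.IGNORECASE,
)


def strip_tags(text):
    """Hypothetical helper: drop all tags except DOC, keeping img alt text."""
    # First replace each <img ... alt="..."> with its alt text,
    # then delete every remaining tag that is not <DOC> or </DOC>.
    text = IMG_ALT.sub(lambda m: m.group(2), text)
    return NON_DOC_TAG.sub("", text)


sample = '<DOC><HTML><p>Hello <b>World</b> <img src="a.png" alt="a cat"></p></HTML></DOC>'
print(strip_tags(sample))
```

An <img> tag without an alt attribute is simply removed by the second pass.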

ennuikiller
+1  A: 

Check out lxml, a really nice Python library for dealing with XML and HTML. You can use drop_tag to accomplish what you are looking for.

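from lxml import html

# Parse a fragment whose root element is <doc>
h = html.fragment_fromstring('<doc>Hello <b>World!</b></doc>')
# find('*') returns the first child element (<b> here); drop_tag() removes the tag but keeps its text
h.find('*').drop_tag()
# encoding='unicode' returns a str on Python 3 (Python 2 used encoding=unicode)
print(html.tostring(h, encoding='unicode'))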

<doc>Hello World!</doc>
moorej
+3  A: 

For what you are trying to accomplish I would use BeautifulSoup rather than regex.

http://www.crummy.com/software/BeautifulSoup/
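
To illustrate the parser-based approach, here is a rough sketch that uses the standard library's html.parser in place of BeautifulSoup, purely so the example has no third-party dependency. The class name DocTextExtractor and the function strip_tags_with_parser are hypothetical, and re-emitting the DOC tags in upper case is an assumption about the desired output.

```python
from html.parser import HTMLParser


class DocTextExtractor(HTMLParser):
    """Keeps text, DOC tags, and <img> alt text; drops every other tag."""

    def __init__(self):
        # convert_charrefs=True means entities such as &amp; arrive decoded.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so compare against "doc".
        if tag == "doc":
            self.parts.append("<DOC>")
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)

    def handle_endtag(self, tag):
        if tag == "doc":
            self.parts.append("</DOC>")

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags_with_parser(text):
    """Hypothetical helper: feed the text through the parser and join the pieces."""
    parser = DocTextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


sample = '<DOC><HTML><p>Hello <b>World</b> <img src="a.png" alt="a cat"></p></HTML></DOC>'
print(strip_tags_with_parser(sample))
```

Unlike a regex, the parser copes with attribute values that contain '>' and with unquoted attributes.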

gnibbler
+1. BeautifulSoup is the definite answer.
muhuk