I have a Python script that reads an HTML file with the following format:
<DOC>
<HTML>
...
</HTML>
</DOC>
<DOC>
<HTML>
...
</HTML>
</DOC>
How do I remove all HTML tags (replace them with '') except for the opening and closing DOC tags, using regex in Python? Also, if I want to retain the alt text of an <img> tag, what should the regex look like?
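
To make the goal concrete, here is a minimal sketch of the kind of thing I am after. It is only an illustration under some assumptions: the names `IMG_ALT`, `NON_DOC_TAG`, and `strip_tags` are made up for this example, alt values are assumed to be quoted, the tag whose alt text is kept is assumed to be `<img>`, and DOC tags are matched case-insensitively. Regex is not a full HTML parser, so malformed markup or a `>` inside an attribute value may not be handled correctly.

```python
import re

# Hypothetical pattern: an <img> tag with a quoted alt attribute.
# Group 2 captures the alt text; group 1 is the quote character used.
IMG_ALT = re.compile(
    r'<img\b[^>]*?\balt\s*=\s*(["\'])(.*?)\1[^>]*>',
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical pattern: any tag except <DOC> and </DOC>.
# The negative lookahead rejects "DOC>" and "/DOC>" right after the "<".
NON_DOC_TAG = re.compile(r'<(?!/?DOC\s*>)[^>]*>', re.IGNORECASE)


def strip_tags(text):
    """Keep <DOC>/</DOC>, replace <img> tags with their alt text, drop other tags."""
    # Replace images first, otherwise the second pattern would delete them.
    text = IMG_ALT.sub(lambda m: m.group(2), text)
    return NON_DOC_TAG.sub('', text)


sample = (
    '<DOC>\n<HTML>\n'
    '<p>Hello <b>world</b> <img src="a.png" alt="A cat"></p>\n'
    '</HTML>\n</DOC>'
)
print(strip_tags(sample))
```

The order of the two substitutions matters here: the alt text has to be pulled out before the general pattern removes every remaining tag. An `<img>` tag without an alt attribute is simply removed by the second pattern.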