Hello!

I'm working on a small Python script to clean up HTML documents. It works by accepting a list of tags to KEEP and then parsing through the HTML code, trashing tags that are not in the list. I've been using regular expressions to do it, and I've been able to match opening tags and self-closing tags but not closing tags. The pattern I've been experimenting with to match closing tags is </(?!a)>. This seems logical to me, so why is it not working? The (?!a) should match anything that is NOT an anchor tag (note that the "a" can be anything; it's just an example).

Edit: AGG! I guess the regex didn't show!

+1  A: 

Don't use regex to parse HTML. It will only give you headaches.

Use a proper HTML parser instead. Try BeautifulSoup or lxml.
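
If pulling in a third-party library is not an option, the same idea can be sketched with the standard library's html.parser module rather than the libraries named above. The class name TagFilter and the helper strip_tags are made up for this illustration. The sketch re-emits the document while skipping every tag that is not in the keep list; it also silently drops comments and doctype declarations, which may or may not be what you want.

```python
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Re-emit HTML, dropping any tag whose name is not in `keep` (hypothetical helper)."""

    def __init__(self, keep):
        # convert_charrefs=False so entities such as &amp; are passed through unchanged
        super().__init__(convert_charrefs=False)
        self.keep = {name.lower() for name in keep}
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep:
            # get_starttag_text() returns the tag exactly as it appeared in the input
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form, e.g. <br/>
        if tag in self.keep:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text content is always kept, even when its enclosing tag is dropped
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def strip_tags(html, keep):
    """Return `html` with every tag not listed in `keep` removed."""
    parser = TagFilter(keep)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(strip_tags('<div><a href="x">link</a> <b>bold</b><br/></div>', ["a"]))
```

Only the tags are removed here; the text inside a dropped tag stays in the output.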

NullUserException
I've seen BeautifulSoup but I'm also a minimalist, so I've preferred using only what ships with Python. I think my issue here is enough to make me reconsider it. Thanks!
kevin628
A: 
<TAG\b[^>]*>(.*?)</TAG> 

Matches the opening and closing pair of a specific HTML tag.

<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>

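Will match the opening and closing pair of any HTML tag. Because the pattern is written with A-Z, it needs case-insensitive matching turned on to match lowercase tags, and it will not handle a tag nested inside another tag of the same name.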
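
As a rough sketch of how this looks in Python, and of why the pattern in the question fails: (?!a) is a zero-width lookahead, so it consumes nothing, and </(?!a)> can only ever match the literal text </>. The tag name still has to be matched after the lookahead. The flags and the helper closing_tag_pattern below are assumptions made for this example, not part of the patterns above.

```python
import re

# The "any tag" pattern from above; the flags are an assumption of this sketch.
PAIR = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>", re.IGNORECASE | re.DOTALL)


def closing_tag_pattern(keep):
    """Build a regex matching closing tags whose name is NOT in `keep` (hypothetical helper)."""
    names = "|".join(re.escape(name) for name in keep)
    # The lookahead only rejects kept names; with an empty keep list it is left out
    guard = rf"(?!(?:{names})\b)" if names else ""
    # After the guard, the tag name itself must still be consumed
    return re.compile(rf"</{guard}[a-z][a-z0-9]*\s*>", re.IGNORECASE)


# The pattern from the question: nothing consumes the tag name, so "</b>" never matches
print(re.search(r"</(?!a)>", "</b>"))

# Remove every closing tag except </a>
pattern = closing_tag_pattern(["a"])
print(pattern.sub("", "<a>x</a><b>y</b></abbr>"))
```

The \b after the kept names is what stops a kept "a" from also protecting </abbr>.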

See here.

pavanlimo