There are so many HTML and XML libraries built into Python that it's hard to believe there's no support for real-world HTML parsing.
I've found plenty of great third-party libraries for this task, but this question is about the Python standard library.
Requirements:
- Use only Python standard library components (any 2.x version)
- DOM support
- Handle HTML entities (&nbsp;)
- Handle partial documents (like: Hello, <i>World</i>!)
Bonus points:
- XPATH support
- Handle unclosed/malformed tags (<big>does anyone here know <html ???)
Here's my 90% solution, as requested. This works for the limited set of HTML I've tried, but as everyone can plainly see, it isn't exactly robust. Since it took me 15 minutes of staring at the docs and one line of code, I hoped the Stack Overflow community could point me to a similar but better solution...
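from xml.etree.ElementTree import fromstring
# html holds the fragment to parse; &nbsp; is not defined in XML, so swap in its numeric form
DOM = fromstring("<html>%s</html>" % html.replace('&nbsp;', '&#160;'))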
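
For what it's worth, the same wrap-and-parse idea can probably be stretched to cover every named HTML entity rather than just one, by rewriting them as numeric character references from the standard library's entity table before parsing. This is only a rough sketch: parse_fragment and _to_numeric are names I made up for illustration, and it still assumes the markup is well-formed XML once wrapped, so it does nothing for unclosed or malformed tags.

```python
import re
from xml.etree.ElementTree import fromstring

try:
    from htmlentitydefs import name2codepoint  # Python 2.x
except ImportError:
    from html.entities import name2codepoint  # Python 3.x

# Entities that XML predefines itself; these are left untouched.
XML_PREDEFINED = frozenset(["amp", "lt", "gt", "quot", "apos"])


def _to_numeric(match):
    """Turn a named entity like &nbsp; into its numeric form (&#160;)."""
    name = match.group(1)
    if name in XML_PREDEFINED or name not in name2codepoint:
        # Unknown names are passed through unchanged; the XML parser
        # will still reject them as undefined entities.
        return match.group(0)
    return "&#%d;" % name2codepoint[name]


def parse_fragment(html):
    """Hypothetical helper: wrap a partial document in a root and parse it."""
    prepared = re.sub(r"&([A-Za-z][A-Za-z0-9]*);", _to_numeric, html)
    return fromstring("<html>%s</html>" % prepared)


dom = parse_fragment("Hello,&nbsp;<i>World</i>&copy;!")
italic = dom.find("i")  # ElementTree only supports a limited XPath subset
```

Here dom is the synthetic html root, and italic is the i element from the fragment, with the text after it available as italic.tail.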