There are so many HTML and XML libraries built into Python that it's hard to believe there's no support for real-world HTML parsing.

I've found plenty of great third-party libraries for this task, but this question is about the Python standard library.

Requirements:

  • Use only Python standard library components (any 2.x version)
  • DOM support
  • Handle HTML entities (&nbsp;)
  • Handle partial documents (like: Hello, <i>World</i>!)

Bonus points:

  • XPATH support
  • Handle unclosed/malformed tags (like: <big>does anyone here know <html ???)

Here's my 90% solution, as requested. This works for the limited set of HTML I've tried, but as everyone can plainly see, it isn't exactly robust. Since it took me only 15 minutes of staring at the docs and one line of code, I thought I could ask the Stack Overflow community for a similar but better solution...

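from xml.etree.ElementTree import fromstring
# `html` is a string holding the fragment to parse; &nbsp; is not defined in XML,
# so swap it for its numeric character reference before parsing.
DOM = fromstring("<html>%s</html>" % html.replace('&nbsp;', '&#160;'))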
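
One way to extend the one-liner above beyond `&nbsp;` is to rewrite every named HTML entity as a numeric character reference before handing the text to ElementTree. The following is only a sketch under assumptions: `parse_fragment` and `_to_numeric` are hypothetical helper names, the input must already be well-formed apart from its entities, and the import uses the Python 3 module name (`html.entities`; in Python 2.x the same table lives in `htmlentitydefs`).

```python
import re
from html.entities import name2codepoint  # Python 2.x: from htmlentitydefs import name2codepoint
from xml.etree.ElementTree import fromstring

# Matches named references such as &nbsp; or &copy; (not numeric ones like &#160;).
_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def _to_numeric(match):
    """Replace a known named entity with its numeric character reference."""
    name = match.group(1)
    if name in name2codepoint:
        return "&#%d;" % name2codepoint[name]
    return match.group(0)  # unknown name: leave it as-is for the XML parser

def parse_fragment(markup):
    """Hypothetical helper: parse a well-formed partial document into an Element."""
    cleaned = _NAMED_ENTITY.sub(_to_numeric, markup)
    # Wrap the fragment so the XML parser sees a single root element.
    return fromstring("<html>%s</html>" % cleaned)

root = parse_fragment("Hello, <i>World</i>!&nbsp;&copy;")
italic = root.find("i")            # DOM-style access
matches = root.findall(".//i")     # ElementTree's limited XPath subset
```

This covers the entity and partial-document requirements and gives a little XPath through `findall`, but it still raises a parse error on unclosed or malformed tags.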
+1  A: 

This doesn't fit your stdlib-only requirement, but BeautifulSoup is nice.

PW
That's one of the libraries that I referenced with this: "I've found plenty of great third-party libraries for this task, but this question is about the python standard library."
bukzor
+4  A: 

Take the source code of BeautifulSoup and copy it into your script ;-) I'm only sort of kidding... anything you could write that would do the job would more or less be duplicating the functionality that already exists in libraries like that.

If that's really not going to work, I have to ask, why is it so important that you only use standard library components?

David Zaslavsky
It's not so important. It's simply my question. As I said, there is a ton of HTML and XML support in the Python library. It seems like something there should support this. If not, that's an answer too, but I'm not convinced yet.
bukzor
Note that BeautifulSoup is no longer being maintained. I prefer lxml.html myself. Overall, this is a great answer.
Mike Graham
Where did you hear that? The BeautifulSoup website shows no evidence that it is no longer being maintained. In fact the most recent release was 11 days ago. (Of course, any other third-party HTML parser works just as well for the argument I was making in the answer)
David Zaslavsky
Maybe he was thinking BS 3.0 was only for Python 3.x? Their site indicates BS 3.0 is for Py 2.3-2.6, and BS 3.1 is for Py 3.x (though ironically the last BS 3.1 release is about a year old, versus a couple weeks for BS 3.0)
Nick T
@David, Richardson has said multiple times that he is trying his best to quit BS development, though it seems he does still do a little. See e.g. http://www.crummy.com/software/BeautifulSoup/3.1-problems.html
Mike Graham
@Mike Graham: Under that link I see this: "... you can use Element Soup to feed the HTML into Beautiful Soup once ElementTree has cleaned it up." Can anyone expand what he means by that? How do you clean up HTML with ElementTree?
bukzor
@bukzor, (It seems a bit odd to ask me about stuff found on a page I presented about why not to use a piece of software.) In any event, as I understand the element tree API, you would call `ElementSoup.parse(some_file).write(some_new_place)` to parse an HTML file then write the tree you got after reconciling everything less than kosher about it. http://effbot.org/zone/element-index.htm#documentation provides some information about ElementTree in its various incarnations (which include this and other HTML parsers). Feel free to open a question for a more complete answer.
Mike Graham
@Mike Graham: I just noticed that the quote said ElementSoup, not ElementTree. I was asking about it because it seemed to imply that I could use ElementTree independent of BeautifulSoup for HTML "cleaning".
bukzor
@bukzor, Cleaning HTML is the topic of another question. The snippet I provided should be the essence of doing it with an ElementTree HTML parser. I don't understand what you're referring to with "the only reference to html seems to be a side project that is unmaintained since 2007". If you're talking about the ElementTree docs I linked to, the material that doesn't apply to HTML directly is still relevant if you're interested in an ElementTree-based HTML parser, since the API is independent of the exact format being parsed or generated with ElementTree.
Mike Graham
@bukzor, ElementSoup is an implementation of ElementTree using BeautifulSoup for parsing. ElementTree is an API with many implementations for parsing XML and HTML.
Mike Graham
@Mike Graham: Thanks. I'm inferring that any HTML parsers implemented with ElementTree are not included in the stdlib. Do you know of a better-maintained etree-html parser than esoup?
bukzor
@bukzor, There are no general-purpose, robust HTML parsers of any kind in the stdlib. `lxml.html`, which I have mentioned in several places, provides an extended ElementTree API. `html5lib`, which others have mentioned, is compatible with a number of APIs, including multiple ElementTree implementations, as best I understand it.
Mike Graham
+1  A: 

I cannot think of any popular language with a good, robust, heuristic HTML parsing library in its stdlib. Python certainly does not have one, which is something I think you know.

Why the requirement of a stdlib module? Most of the time when I hear people make that requirement, they are being silly. For most major tasks, you will need a third party module or to spend a whole lot of work re-implementing one. Introducing a dependency is a good thing, since that's work you didn't have to do.

So what you want is lxml.html. Ship lxml with your code if that's an issue, at which point it becomes functionally equivalent to writing it yourself except in difficulty, bugginess, and maintainability.

Mike Graham
From my research, I was seeing that as the most common answer, but I don't know, and I'm still not convinced that there's no such capability in the stdlib. You'll have to admit that a script that uses no external library is much more likely to work correctly for novice users.
bukzor
@bukzor, Well, get convinced, since it's the case. =p And I do not have to admit that at all. ;)
Mike Graham
Parsing HTML is something people have only actually understood widely for a few years now; it's taken shockingly long. So it can be said quite definitively that there is nothing in the standard library: BeautifulSoup, html5lib, and lxml.html make a complete list.
Ian Bicking
@Ian Bicking: If you'd make that an answer, I'd check it. Am I getting downrated simply because my answer is no?
bukzor
+3  A: 

Your choices are to change your requirements or to duplicate all of the work done by the developers of third party modules.

Beautiful Soup consists of a single Python file with about 2000 lines of code. If that is too big a dependency, then go ahead and write your own; it won't work as well and probably won't be a whole lot smaller.

mikerobi
If it's really that compact (never really bothered to look :P ) and he's hell-bent on having a script work without any other dependencies, copy-paste sounds like a great plan.
Nick T
Literal copy-and-paste is a ridiculous way to add a dependency.
Mike Graham
+12  A: 

Parsing HTML reliably is a relatively modern development (weird though that may seem). As a result there is definitely nothing in the standard library. HTMLParser may appear to be a way to handle HTML, but it's not -- it fails on lots of very common HTML, and though you can work around those failures there will always be another case you haven't thought of (if you actually succeed at handling every failure you'll have basically recreated BeautifulSoup).
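
To make the "work around those failures" point concrete, here is a rough sketch of the kind of recovery logic one ends up writing on top of the stdlib parser. It is an illustration under assumptions, not a robust solution: `TreeHTMLParser`, `parse_html`, and the short `VOID_TAGS` list are hypothetical names, only three recovery rules are implemented (void elements, unclosed tags, stray close tags), and the imports use the Python 3 module name `html.parser` (the 2.x module is `HTMLParser`, which may reject input that the Python 3 version tolerates).

```python
from html.parser import HTMLParser  # Python 2.x: the HTMLParser module
from xml.etree.ElementTree import TreeBuilder

# Elements that never take a closing tag (partial list, for illustration only).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class TreeHTMLParser(HTMLParser):
    """Hypothetical helper: turn HTMLParser events into an ElementTree tree."""

    def __init__(self):
        HTMLParser.__init__(self, convert_charrefs=True)
        self._builder = TreeBuilder()
        self._open = []                      # stack of currently open tag names
        self._builder.start("html", {})      # synthetic root for partial documents

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; store them as empty strings.
        self._builder.start(tag, dict((k, v or "") for k, v in attrs))
        if tag in VOID_TAGS:
            self._builder.end(tag)           # close void elements immediately
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return                           # stray close tag: ignore it
        while self._open:                    # implicitly close anything left open inside
            top = self._open.pop()
            self._builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        self._builder.data(data)

    def close(self):
        HTMLParser.close(self)               # flush any buffered text first
        while self._open:                    # close whatever was never closed
            self._builder.end(self._open.pop())
        self._builder.end("html")
        return self._builder.close()

def parse_html(markup):
    parser = TreeHTMLParser()
    parser.feed(markup)
    return parser.close()

root = parse_html("Hello, <i>World</i>!&nbsp;<b>unclosed <br> text")
```

Each rule here is a guess about what the author of the page meant, and real pages need many more of them (implied `<p>` and `<li>` closes, table repair, misnested inline tags), which is where this approach starts to converge on a third-party parser.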

There are really only 3 reasonable ways to parse HTML (as it is found on the web): lxml.html, BeautifulSoup, and html5lib. lxml is the fastest by far, but can be a bit tricky to install (and impossible in an environment like App Engine). html5lib is based on how HTML 5 specifies parsing; though similar in practice to the other two, it is perhaps more "correct" in how it parses broken HTML (they all parse pretty-good HTML the same). They all do a respectable job at parsing broken HTML. BeautifulSoup can be convenient though I find its API unnecessarily quirky.

Ian Bicking
Great answer. Thanks! I don't have enough rep to uprate you. QQ I wish people weren't so touchy about hard questions. The good scientist seeks negative experiments as well..
bukzor
@Ian Bicking: finally got enough rep to bump you. Just to confirm, there's no known way to get ElementTree (as it exists in the stdlib) to parse real-world HTML?
bukzor
You can have BeautifulSoup (with ElementSoup) or html5lib parse the HTML and generate an ElementTree structure, but ElementTree itself definitely cannot parse HTML.
Ian Bicking
@Ian Bicking: With some finagling and a little bit of HTML-correction, I've gotten ElementTree to parse all of RosettaCode.org. The most annoying part is adding all the html entities to the parser by hand. There's even an option for this in the etree docs, but it's unimplemented for undocumented reasons. You can see the code here: http://bukzor.hopto.org/svn/software/python/rosetta_pylint.py
bukzor
A: 

Hi!

I have a similar problem: my hosting company doesn't allow me to install additional Python modules/libraries.

If there is no way to fetch an HTML page, transform it into XML, and use ElementTree to process it, is there any way to include the source code of an HTML parser (e.g. lxml or BeautifulSoup) in my script?

Thanks for your help, Ricardo

Ricardo
@Ricardo: BeautifulSoup is a single .py file by design. Stick it in the same directory as your script, import it, and you're good to go.
bukzor
@bukzor Thanks, it works well! :)
Ricardo
@Ricardo, this should really be a separate question and answer. That's why you have negative points here.
bukzor