ansaurus

Question

Is it possible to hook up a more robust HTML parser to Python mechanize?

Answer 1

+2 A:

Reading from the big example on the first page of the mechanize website:

# Sometimes it's useful to process bad headers or bad HTML:
response = br.response()  # this is a copy of response
headers = response.info()  # currently, this is a mimetools.Message
headers["Content-type"] = "text/html; charset=utf-8"
response.set_data(response.get_data().replace("<!---", "<!--"))
br.set_response(response)

So it looks entirely possible to preprocess the response with another parser that regenerates well-formed HTML, then feed the result back to mechanize for further processing.
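A minimal sketch of that idea, built only on the four calls shown in the snippet above (response(), get_data(), set_data(), set_response()). The helper names reparse_response and lxml_clean are made up for illustration and are not part of mechanize, and lxml is just one possible choice of repairing parser:

```python
def reparse_response(browser, clean):
    """Replace the browser's current response body with clean(body).

    `browser` is assumed to behave like mechanize.Browser; only
    response() and set_response() are called on it.
    `clean` is any callable that maps the raw body to repaired markup.
    """
    response = browser.response()       # mechanize hands back a copy
    if response is None:                # nothing has been opened yet
        return None
    response.set_data(clean(response.get_data()))
    browser.set_response(response)      # later parsing sees the repaired body
    return response


def lxml_clean(markup):
    """One possible `clean`: round-trip the markup through lxml's HTML parser."""
    from lxml import etree              # third-party, imported only when used
    root = etree.HTML(markup)           # lenient parse of the broken markup
    # Serialize back to HTML; the result is UTF-8 encoded.
    return etree.tostring(root, method="html", encoding="utf-8")
```

After br.open(...), calling reparse_response(br, lxml_clean) should leave mechanize working on the repaired markup. Since lxml_clean emits UTF-8, the Content-type header may also need overriding as in the snippet above if the page declared a different charset.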

Adrien Plisson 2009-11-23 15:13:27

Yeah, I guess I missed that. Then again, the mechanize docs aren't exactly the best organized ones around. Thanks!

int3 2009-11-24 13:47:40

Answer 2

+1 A:

What you're looking for can be done with lxml.etree, which is the xml.etree.ElementTree emulator (and replacement) provided by lxml.

First, take some malformed HTML:

% cat bad.html
<html>
<HEAD>
    <TITLE>this HTML is awful</title>
</head>
<body>
    <h1>THIS IS H1</H1>
    <A HREF=MYLINK.HTML>This is a link and it is awful</a>
    <img src=yay.gif>
</body>
</html>

(Observe the mismatched case between opening and closing tags, and the missing quotation marks.)

And then parse it:

>>> from lxml import etree
>>> bad = open('bad.html').read()
>>> html = etree.HTML(bad)
>>> print(etree.tostring(html, encoding='unicode'))
<html><head><title>this HTML is awful</title></head><body>
    <h1>THIS IS H1</h1>
    <a href="MYLINK.HTML">This is a link and it is awful</a>
    <img src="yay.gif"/></body></html>

Observe that the tags and the quoting have been corrected for us.

If you were having problems parsing the HTML before, this might be the answer you're looking for. As for the details of HTTP, that is another matter entirely.

jathanism 2009-11-23 15:14:38

I do know how to use lxml, actually.. :P I was just wondering how to get it to work with mechanize.

int3 2009-11-24 13:48:14

Well good luck to you then! Grab your ankles!

jathanism 2009-11-24 15:17:51

Answer 3

+1 A:

Check out twill.

Paul McGuire 2009-11-23 15:16:48

twill is a wrapper around Mechanize - does it actually use a different form parser?

Plumo 2009-11-25 00:16:01

Answer 4

A:

I am struggling with some ugly HTML too. Did you figure out a good way to preprocess the malformed HTML? Thanks.

Cygorger 2010-01-13 23:55:14
