ansaurus

Question

Python: how to search and correct HTML tags and attributes?

Answer 1

A:

Well, <img ...> is correct HTML; <img .../> is not. I don't know what HTML5 says, but XHTML was mostly dead before it was ever alive.

Nevertheless, I think the easiest thing would be a regular expression:

import re
re.sub(r"<img\b([^>]*?)\s*/?>", r"<img\1 />", html_code)  # [^>] keeps the match inside one tag

The other things are more difficult. I would parse the code, add the attributes to the img nodes, and write the HTML back out from the parse tree. Parsing should be possible with http://code.google.com/p/html5lib/. But to get valid height and width values you have to read the images themselves (use PIL), which is probably not worth the effort.

nils 2010-07-29 09:33:23
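
A minimal sketch of the parse, modify, and re-serialise route described in the answer, using only the standard library's html.parser rather than html5lib (a swapped-in parser, chosen so the example has no dependencies). The SIZES table, the ImgFixer class, and the fix_images helper are hypothetical names introduced here; in a real run the sizes would come from reading the image files, as the answer notes.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical lookup of image sizes keyed by src; in practice these would
# be read from the image files themselves (e.g. with PIL).
SIZES = {"logo.png": (120, 40)}

# Elements that never have a closing tag (a partial list, for illustration).
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class ImgFixer(HTMLParser):
    """Re-emit the document, self-closing void tags and sizing <img> tags."""

    def __init__(self, sizes):
        super().__init__(convert_charrefs=False)
        self.sizes = sizes
        self.out = []

    def _emit(self, tag, attrs, self_closing):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src") in self.sizes:
            width, height = self.sizes[attrs["src"]]
            # Only fill in what is missing; existing values are kept.
            attrs.setdefault("width", str(width))
            attrs.setdefault("height", str(height))
        rendered = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs.items()
        )
        self.out.append("<%s%s%s>" % (tag, rendered, " /" if self_closing else ""))

    def handle_starttag(self, tag, attrs):
        self._emit(tag, attrs, tag in VOID_TAGS)

    def handle_startendtag(self, tag, attrs):
        self._emit(tag, attrs, True)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def fix_images(html, sizes=SIZES):
    """Return html with void tags self-closed and known <img> tags sized."""
    fixer = ImgFixer(sizes)
    fixer.feed(html)
    fixer.close()
    return "".join(fixer.out)


print(fix_images('<p>Hi <img src="logo.png" alt="a>b"><br></p>'))
```

Because attribute values are re-escaped on output, a literal > inside a quoted value comes back out as &gt;, which sidesteps the question of whether the raw form is acceptable.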

It's normally best not to parse HTML with regular expressions - http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 - but if this is a one-off task you'll probably be OK.

Dave Webb 2010-07-29 09:47:21

It is perfectly possible to process (not parse!) an image tag this way with a regexp, because it can't contain other tags, nor < or >. BUT: it doesn't work if the HTML is crap (no > at the end of the img), but that's another story, one that the linked html5lib can perhaps handle.

nils 2010-07-29 09:51:40

Actually any tag may contain `>`: `<img title="a>b">` is perfectly valid. Anyhow *XHTML 2.0* is the one that never lived. XHTML 1.0, 1.1 and XHTML5 are very much alive.

bobince 2010-07-29 10:06:37

No, that's just plain wrong. You have to mask it as &gt;. Most browsers do accept <img title="a>b">, but it's perfectly invalid.

nils 2010-07-29 10:12:01
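
The disagreement in these comments can at least be checked against real tools. The sketch below (SAMPLE and AttrCollector are names made up for illustration) feeds the disputed tag to a regular expression that stops at the first > and to the standard library's html.parser. It shows only how these two tools treat the input, not what any specification says.

```python
import re
from html.parser import HTMLParser

SAMPLE = '<img title="a>b" src="x.png">'

# A pattern that ends the tag at the first ">" it sees, which is the
# assumption the regex approach in the answer relies on.
naive = re.search(r"<img(.*?)>", SAMPLE).group(0)


class AttrCollector(HTMLParser):
    """Record every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.append((tag, dict(attrs)))


collector = AttrCollector()
collector.feed(SAMPLE)
collector.close()

print(naive)            # <img title="a>
print(collector.attrs)  # [('img', {'title': 'a>b', 'src': 'x.png'})]
```

The regex cuts the tag short inside the title value, while the parser keeps the quoted value whole. So if such input can occur at all, the regex route needs either escaped input or a quote-aware pattern.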

Answer 2

+1 A:

For the sake of simplicity, I would outsource the potentially irritating issues around parsing (X)HTML to a dedicated library.

Here is a simple example with lxml.html:

import lxml.html

page = """<html>...</html>"""
page = lxml.html.document_fromstring(page)
lxml.html.tostring(page)

lxml.html has a really handy module, clean, designed to remove malicious code. It's simple to use as well:

from lxml.html.clean import clean_html
clean_html(page)

Tim McNamara 2010-07-29 10:03:20

How about the `width` and `height`? Do you have any idea? Thanks for your answer!

sfa 2010-07-29 10:18:28

If you have serialised the HTML, something like this could work:

html = serialised_html
for el in html.iterdescendants():
    if el.tag == 'img':
        el.attrib['height'] = x
        el.attrib['width'] = y

Tim McNamara 2010-07-29 10:37:39
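
The x and y in the snippet above have to come from somewhere. The first answer on this page suggests PIL. As a dependency-free alternative for PNG files only, the width and height can be read from the fixed-position IHDR header. This is a sketch, not a replacement for a real image library: png_size is a hypothetical helper, and other formats (JPEG, GIF) store their dimensions differently.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data):
    """Return (width, height) from the first 24 bytes of a PNG file.

    Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type
    ("IHDR"), then width and height as big-endian 32-bit integers.
    """
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG header")
    return struct.unpack(">II", data[16:24])


# Usage with a hypothetical file name:
#
#     with open("logo.png", "rb") as fh:
#         x, y = png_size(fh.read(24))
```

lxml attribute values must be strings, so the results would be assigned as str(x) and str(y).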
