views:

57

answers:

2

I have to fix all the closing tags of the <img> tag as shown in the text below. Instead of closing the <img> with a >, it should close with />.

Is there any easy way to search for all the <img> in this text and fix the > ?

(If it is closed with a /> already then there is no action required).

Other question, if there is no "width" or "height" to the <img> specified, what is the best way to solve the issue?

Download all the images and get the corresponding attributes of width and height, then add them back to the string?

The correct <img> tag is the one that closes with /> and have the valid width & height.

<a href="http://www.cultofmac.com/daily-deals749-mac-mini-1199-3-0ghz-imac-new-mac-pros/52674"&gt;&lt;img align="left" hspace="5" width="150" src="http://s3.dlnws.com/images/products/images/749000/749208-large" alt="" title=""></a>
Apple today unleashed a number of goodies, including giving iMacs and Mac Pros more oomph with new processors and increased storage options. We have those deals today, along with many more items for the Mac lover. Along with the refreshed line of iMacs and Mac Pros, we’ll also look at a number of software deals [...]
<p><a href="http://feedads.g.doubleclick.net/~a/DL_-gOGSR1JMzKDbErt1EG3re3I/0/da"&gt;&lt;img src="http://feedads.g.doubleclick.net/~a/DL_-gOGSR1JMzKDbErt1EG3re3I/0/di" border="0" ismap></a><br>
<a href="http://feedads.g.doubleclick.net/~a/DL_-gOGSR1JMzKDbErt1EG3re3I/1/da"&gt;&lt;img src="http://feedads.g.doubleclick.net/~a/DL_-gOGSR1JMzKDbErt1EG3re3I/1/di" border="0" ismap></a></p><img src="http://feeds.feedburner.com/~r/cultofmac/bFow/~4/Mq5iLOaT50k" height="1" width="1">

I really need to have width and height in the output, because it will be used as the input to other parser. And that parser says that the <img tag MUST close with a />. I am not using the output to view on the web page. Please suggest a simple solution to achieve this!

A: 

Well, <img ...> is correct HTML, <img .../> not. Dunno what HTML5 says, but XHTML is mostly dead before alive.

Nevertheless, I think the easiest thing would be a regular expression:

re.sub(r"<img(.*?)(?<!/)>", lambda m: "<img%s/>" % m.groups()[0],  html_code)

For the other things, well difficult. I would parse the code, add the tags to the img nodes and write the html from the ast. Parsing should be possible with http://code.google.com/p/html5lib/. But to have the valid height & width you have to read the images (use PIL) probably not worth the effort.

nils
It's normally best not to parse HTML with Regular Expressions - http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 - but if this is a one off task you'll probably be OK.
Dave Webb
It is perfectly possible to process (not parse!) an image-tag this way with a regexp because it can’t contain other tags nor < or >. BUT: dosn't work if html is crap (no > at ent of img) but that's another story that the linked html5lib perhaps can handle.
nils
Actually any tag may contain `>`: `<img title="a>b">` is perfectly valid. Anyhow *XHTML 2.0* is the one that never lived. XHTML 1.0, 1.1 and XHTML5 are very much alive.
bobince
No, that's just plain wrong. You have to mask it by >. Most browsers to accept <img title="a>b"> but it’s perfectly invalid.
nils
+1  A: 

For the sake of simplicity, I would outsource the potentially irritating issues around parsing (X)HTML to a dedicated library:

Here is a simple example with lxml.html:

import lxml.html

page = """<html>...</html>"""
page = lxml.html.document_fromstring(page)
lxml.html.tostring(page)

lxml.html has a really handy module clean, designed to remove malicious code. It's simple as well:

from lxml.html.clean import clean_html
clean_html(page)
Tim McNamara
How about the `width` and `height`? Do you have any idea? Thanks for your answer!
sfa
If you have serialised the HTML, something like this could work:html = serialised_htmlfor el in html.iterdescendants(): if el.tag == 'img': el.attrib['height'] = x el.attrib['width'] = y
Tim McNamara