ansaurus

Question

Finding all *rendered* images in an HTML file

Answer 1

+1 A:

The source code for a rendered img tag looks something like this:

<img src="img.jpg"></img>

If the img tag is displayed as text (not rendered), the HTML code would look like this:

 &lt;img src=&quot;styles/BWLogo.jpg&quot;&gt;&lt;/img&gt;

&lt; is the "<" character, and &gt; is the ">" character.

To match rendered img tags only, you can use a regex that matches img tags formed by < and >, not by &lt; and &gt;.

Img tags in comments also need to be ignored, by ignoring any characters between "<!--" and "-->".
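A minimal sketch of the regex approach described above, in Python. The helper name find_rendered_img_tags and both patterns are illustrative assumptions, not part of the original answer. The sketch strips comments first, then matches only tags written with a literal "<"; it will miss edge cases such as a ">" inside an attribute value.

```python
import re

# Illustrative patterns (assumptions): good enough for simple markup only.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)   # anything between <!-- and -->
IMG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)  # a literal "<img ...>" tag


def find_rendered_img_tags(html_text):
    """Return img tags written with literal < and >, ignoring comments."""
    without_comments = COMMENT_RE.sub("", html_text)
    # Escaped tags (&lt;img ...&gt;) contain no literal "<", so they never match.
    return IMG_RE.findall(without_comments)


sample = """
<img src="test.jpg">
&lt;img src="yay.jpg"&gt;
<!-- <img src="ohnoes.jpg"> -->
<img src="hurrah.jpg">
"""

print(find_rendered_img_tags(sample))
```

Only the first and last tags in the sample are returned; the escaped tag and the commented-out tag are skipped.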

wschenkai 2009-04-07 14:11:57

... except for markup in comments, I guess?

unwind 2009-04-07 14:12:47

Yeah, you are right. I think for comments, you can use a regex to ignore any characters between "<!--" and "-->".

wschenkai 2009-04-07 14:18:25

thanks guys! i'm using this solution as a first-try.

izuzak 2009-04-08 09:43:52

Answer 2

A:

As image tags might sit inside a <pre> or <xmp> tag, you probably have to walk the DOM (that is, convert the HTML to an XML/DOM tree and search through it) and find all the <img> nodes. There is an xml.dom package in the Python standard library: docs.python.org

You could do that on the client as well and report it back via Ajax (this would mean more load on the server, though).
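A rough sketch of the parse-then-search idea, using the standard library's html.parser instead of xml.dom, since xml.dom.minidom expects well-formed XML and typical HTML (for example an unclosed <img>) is not. The class name ImgCollector and the function rendered_img_sources are made up for illustration. The parser reports comments and escaped text separately from real tags, so only genuine <img> start tags are collected.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the src attribute of every real <img> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Comments and escaped text (&lt;img ...&gt;) never reach this method;
        # they go to handle_comment and handle_data respectively.
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


def rendered_img_sources(html_text):
    """Hypothetical helper: return the src values of parsed img tags."""
    collector = ImgCollector()
    collector.feed(html_text)
    collector.close()
    return collector.sources


doc = """
<html>
<body>
<img src="test.jpg">
&lt;img src="yay.jpg"&gt;
<!-- <img src="ohnoes.jpg"> -->
<img src="hurrah.jpg">
</body>
</html>
"""

print(rendered_img_sources(doc))
```

This sketch makes no attempt to treat <xmp> content specially; if that matters, the collector would need to track whether it is currently inside such an element.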

2009-04-07 18:28:30

Answer 3

+2 A:

Use BeautifulSoup. It is an HTML/XML parser for Python that provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It probably won't be fooled by fake img tags.

nosklo 2009-04-07 18:31:30

Answer 4

+2 A:

Sounds like a job for BeautifulSoup:

>>> from BeautifulSoup import BeautifulSoup
>>> doc = """
... <html>
... <body>
... <img src="test.jpg">
... &lt;img src="yay.jpg"&gt;
... <!-- <img src="ohnoes.jpg"> -->
... <img src="hurrah.jpg">
... </body>
... </html>
... """
>>> soup = BeautifulSoup(doc)
>>> soup.findAll('img')
[<img src="test.jpg" />, <img src="hurrah.jpg" />]

As you can see, BeautifulSoup is smart enough to ignore comments and displayed HTML.

EDIT: I'm not sure what you mean by the RSS feed escaping ALL images, though. I wouldn't expect BeautifulSoup to figure out which are meant to be shown if they are all escaped. Can you clarify?

Paolo Bergantino 2009-04-07 18:51:01

thanks! i'll give it a go. the scenario is actually a bit more complex - i'm parsing RSS content snippets which have *all* '<' and '>' escaped. so i'm wondering how the parser distinguishes between rendered img tags and non-rendered img tags, since both are escaped...hm?

izuzak 2009-04-08 09:42:54
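One possible way to handle the fully escaped RSS case, sketched under an assumption that is not confirmed anywhere in this thread: the feed escapes the whole snippet exactly once, so tags meant to be rendered appear as &lt;img ...&gt;, while tags meant to be displayed as text are escaped twice (&amp;lt;img ...&amp;gt;). Unescaping once with html.unescape and then parsing would separate the two. The names ImgCollector and img_sources_in_escaped_snippet are hypothetical.

```python
import html
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the src attribute of every real <img> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


def img_sources_in_escaped_snippet(snippet):
    """Undo ONE level of escaping (assumed to be the feed's), then parse."""
    unescaped = html.unescape(snippet)  # &lt; -> <, but &amp;lt; -> &lt;
    collector = ImgCollector()
    collector.feed(unescaped)
    collector.close()
    return collector.sources


# Made-up snippet: one tag meant to render, one meant to be shown as text.
snippet = (
    '&lt;p&gt;&lt;img src="shown.jpg"&gt; To embed it, write '
    '&amp;lt;img src="example.jpg"&amp;gt;&lt;/p&gt;'
)

print(img_sources_in_escaped_snippet(snippet))
```

If the feed does not double-escape displayed markup, the two kinds of tag are indistinguishable after unescaping and this approach cannot tell them apart.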
