Hi all,

I need a way to find only the rendered IMG tags in an HTML snippet. I can't just run a regex over the snippet to find all IMG tags, because that would also match IMG tags that are shown as text in the HTML (not rendered).

I'm using Python on AppEngine.

Any ideas?

Thanks, Ivan

+1  A: 

The source code for a rendered img tag looks something like this:

<img src="img.jpg"></img>

If the img tag is displayed as text (not rendered), the HTML code looks like this:

 &lt;img src=&quot;styles/BWLogo.jpg&quot;&gt;&lt;/img&gt;

&lt; is the "<" character and &gt; is the ">" character.

To match rendered img tags only, you can use a regex that matches img tags formed by < and >, not by &lt; and &gt;.

Img tags in comments also need to be ignored, by skipping any characters between "<!--" and "-->".
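
A minimal sketch of that idea in Python 3 (an assumption; the question only says "Python on AppEngine"). The function name and the sample file name hidden.jpg are made up for illustration. It strips comments first, then matches only tags that start with a literal "<". It is a first try rather than a full HTML parser: a ">" inside an attribute value, for example, would cut the match short.

```python
import re

# Comments are removed first so that markup inside them is never matched.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Matches a literal "<img ...>" opening tag; escaped "&lt;img" contains no "<".
IMG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def find_rendered_img_tags(html_text):
    """Return the opening <img> tags that are real markup (hypothetical helper)."""
    without_comments = COMMENT_RE.sub("", html_text)
    return IMG_RE.findall(without_comments)


snippet = (
    '<img src="img.jpg"></img>\n'
    '&lt;img src=&quot;styles/BWLogo.jpg&quot;&gt;&lt;/img&gt;\n'
    '<!-- <img src="hidden.jpg"> -->\n'
)
print(find_rendered_img_tags(snippet))  # ['<img src="img.jpg">']
```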

wschenkai
... except for markup in comments, I guess?
unwind
Yeah, you are right. For comments, I think you can use a regex to ignore any characters between "<!--" and "-->".
wschenkai
thanks guys! i'm using this solution as a first-try.
izuzak
A: 

As image tags might sit inside a <pre> or <xmp> tag, you probably have to walk the DOM (that is, convert the HTML to an XML/DOM tree and search through it) and find all the <img> nodes. There is an xml.dom package in the Python standard library: docs.python.org

You could do that on the client as well and report it back via Ajax (this would mean more load on the server though).
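
A rough sketch of the "parse instead of regex" idea, assuming Python 3. Note that it swaps in the standard library's html.parser rather than xml.dom, since xml.dom expects well-formed XML and real-world HTML often is not. The class and function names are made up for illustration. The parser reports comments and escaped text through separate callbacks, so only real <img> start tags are collected. How <xmp> content is treated is not checked here.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the src attribute of every real <img> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


def rendered_img_sources(html_text):
    collector = ImgCollector()
    collector.feed(html_text)
    collector.close()
    return collector.sources


doc = """
<html><body>
<img src="test.jpg">
&lt;img src="yay.jpg"&gt;
<!-- <img src="ohnoes.jpg"> -->
<img src="hurrah.jpg" />
</body></html>
"""
print(rendered_img_sources(doc))  # ['test.jpg', 'hurrah.jpg']
```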

+2  A: 

Use BeautifulSoup. It is an HTML/XML parser for Python that provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It probably won't be fooled by fake img tags.

nosklo
+2  A: 

Sounds like a job for BeautifulSoup:

>>> from BeautifulSoup import BeautifulSoup
>>> doc = """
... <html>
... <body>
... <img src="test.jpg">
... &lt;img src="yay.jpg"&gt;
... <!-- <img src="ohnoes.jpg"> -->
... <img src="hurrah.jpg">
... </body>
... </html>
... """
>>> soup = BeautifulSoup(doc)
>>> soup.findAll('img')
[<img src="test.jpg" />, <img src="hurrah.jpg" />]

As you can see, BeautifulSoup is smart enough to ignore comments and escaped HTML that is only displayed as text.

EDIT: I'm not sure what you mean by the RSS feed escaping ALL images, though. I wouldn't expect BeautifulSoup to figure out which are meant to be shown if they are all escaped. Can you clarify?

Paolo Bergantino
thanks! i'll give it a go. the scenario is actually a bit more complex - i'm parsing RSS content snippets which have *all* '<' and '>' escaped. so i'm wondering how the parser distinguishes between rendered img tags and non-rendered img tags, since both are escaped...hm?
izuzak
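
One possible way to handle the fully escaped RSS case, sketched in Python 3 under a loud assumption: the feed escapes the whole snippet exactly once, so tags meant to render arrive as &lt;img ...&gt; while tags meant to be shown as text arrive double-escaped as &amp;lt;img ...&amp;gt;. If the feed does not follow that convention, this will not separate the two. The names and sample file names below are made up for illustration. The sketch undoes one layer of escaping and then parses the result.

```python
import html
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects src values of real <img> start tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


def img_sources_from_escaped(escaped_snippet):
    # Assumption: the snippet was escaped exactly once by the feed,
    # so a single unescape restores the original markup.
    markup = html.unescape(escaped_snippet)
    collector = ImgCollector()
    collector.feed(markup)
    collector.close()
    return collector.sources


escaped = (
    '&lt;img src="real.jpg"&gt; '
    'Use &amp;lt;img src="shown.jpg"&amp;gt; in your page.'
)
print(img_sources_from_escaped(escaped))  # ['real.jpg']
```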