Hi,

I want to remove all images from an HTML page (actually TinyMCE user input) that do not meet certain criteria (class = "int" or class = "ext"), and I'm struggling to find the correct approach. This is what I have so far:

hbody = Hpricot(input)
@internal_images = hbody.search("//img[@class='int']")
@external_images = hbody.search("//img[@class='ext']")

But I don't know how to find images where the class has the wrong value (not "int" or "ext").

I also have to loop over the elements to check other attributes that are not standard HTML (I use them for internal values such as the DB id, which I store in the attribute dbsrc). Can I access these attributes too, and is there a way to remove certain elements (which are in the Hpricot search result) when they don't meet my criteria?

Thanks for your help!

+3  A: 
>> doc = Hpricot.parse('<html><img src="foo" class="int" /><img src="bar" bar="42" /><img src="foobar" class="int"></html>')
=> #<Hpricot::Doc {elem <html> {emptyelem <img class="int" src="foo">} {emptyelem <img src="bar" bar="42">} {emptyelem <img class="int" src="foobar">} </html>}>
>> doc.search("img")[1][:bar]
=> "42"
>> doc.search("img") - doc.search("img.int")
=> [{emptyelem <img src="bar" bar="42">}]

Once you have the results of a search, you can use normal array operations on them. Non-standard attributes are accessible through `[]`.

Ben Hughes
wow, pretty easy, so I can use "collection_one - collection_two" to remove all elements from collection_one which are in collection_two? Thanks!
ole_berlin
Important note, after `z = x - y`, x won't change, z will just contain all of x that isn't in y.
rampion
yep, and + to add collections, e.g. `all_images - (internal_images + external_images)`.
Ben Hughes
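
To illustrate the two comments above, here is a minimal sketch using plain Ruby arrays. The strings are hypothetical stand-ins for the `<img>` elements; it assumes, as the answer states, that Hpricot search results support the normal array operations.

```ruby
# Plain strings stand in for the <img> elements (hypothetical data).
all_images      = ["foo.png", "bar.png", "foobar.png", "baz.png"]
internal_images = ["foo.png"]
external_images = ["foobar.png"]

# `+` concatenates two arrays; `-` returns a NEW array containing the
# receiver's items that do not appear in the argument.
unwanted = all_images - (internal_images + external_images)

puts unwanted.inspect   # the images that are neither internal nor external
puts all_images.length  # still 4: `-` did not modify all_images
```

Since `-` never changes its receiver, the result has to be assigned to a variable (or used directly) to have any effect.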
+2  A: 

Check out the `:not` CSS selector.

(hbody/"img:not(.int)")
(hbody/"img:not(.ext)")

Unfortunately, it doesn't seem that you can combine several `:not` expressions in one selector. You might want to fetch all img nodes and remove those whose class includes neither .int nor .ext. Alternatively, you could use the difference operator to find the elements that are in neither collection.
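
The "fetch everything and filter" idea boils down to one decision per image: does its class attribute contain an allowed class? A minimal sketch of that check in plain Ruby follows. The names `ALLOWED_CLASSES` and `allowed_image_class?` are hypothetical helpers, not part of Hpricot, and the sketch assumes the attribute value may be missing (nil) or contain several space-separated classes.

```ruby
# Hypothetical whitelist of class names that mark an image as valid.
ALLOWED_CLASSES = %w[int ext].freeze

# Returns true when the class attribute value contains at least one
# allowed class. Accepts nil (attribute missing) and multi-class values.
def allowed_image_class?(class_attr)
  (class_attr.to_s.split & ALLOWED_CLASSES).any?
end

puts allowed_image_class?("int")       # true
puts allowed_image_class?("ext left")  # true
puts allowed_image_class?("other")     # false
puts allowed_image_class?(nil)         # false
```

With Hpricot, the value to check would presumably come from `img[:class]` for each element returned by `hbody.search("img")`; that part is left out here because it has not been tested against a live document.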

Use the .remove method to remove nodes or elements: Hpricot Altering documentation.

Simone Carletti
I could finally achieve what I wanted by chaining the searches: `hbody.search("img").search(":not(.int)").search(":not(.ext)")` returns all images where the class is not "int" and not "ext". After removing them I can check for the other attributes and do basically the same. Thanks again!
ole_berlin