I am trying to scrape some data from a page with a table-based layout. To get some of the data, I need something like the 3rd table inside the 2nd table inside the 5th table inside the 1st table inside the body. I am trying to use enlive, but cannot figure out how to use nth-of-type and the other selector steps. To make matters worse, the page in question has a single top-level table inside the body, but (select data [:body :> :table]) returns 6 results for some reason. What the hell am I doing wrong?

+3  A: 

For nth-of-type, does the following example help?

user> (require '[net.cgrand.enlive-html :as html])
user> (def test-html
           "<html><head></head><body><p>first</p><p>second</p><p>third</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html))
                   [[:p (html/nth-of-type 2)]])
({:tag :p, :attrs nil, :content ["second"]})
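
The double brackets in that selector matter. As a rough sketch of the difference: an inner vector groups predicates that must all hold for a single element, while the steps of the outer vector are read as a descendant chain. The sample markup and the names sample, second-p, and inside-p below are made up for illustration.

```clojure
(require '[net.cgrand.enlive-html :as html])

;; Hypothetical sample markup, not taken from the question.
(def sample
  (html/html-resource
    (java.io.StringReader.
      "<html><head></head><body><p>first</p><p>second</p><div><span>a</span><span>b</span></div></body></html>")))

;; Inner vector: a single step, matching an element that is a <p>
;; AND the second element of its type among its siblings.
(def second-p
  (html/select sample [[:p (html/nth-of-type 2)]]))

;; No inner vector: two steps, matching any second-of-its-type
;; element located somewhere inside a <p>.
(def inside-p
  (html/select sample [:p (html/nth-of-type 2)]))

(println (map html/text second-p))
(println (count inside-p))
```

With this markup, the first selection should contain only the second paragraph, and the second should come back empty, since neither <p> has any element children.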

No idea about the second issue. Your approach seems to work with a naive test:

user> (def test-html "<html><head></head><body><div><p>in div</p></div><p>not in div</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html)) [:body :> :p])
({:tag :p, :attrs nil, :content ["not in div"]})

Any chance of looking at your actual HTML?

Update: (in response to the comment)

Here's another example where "the second <p> inside the <div> inside the second <div> inside whatever" is returned:

user> (def test-html "<html><head></head><body><div><p>this is not the one</p><p>nor this</p><div><p>or for that matter this</p><p>skip this one too</p></div></div><span><p>definitely not this one</p></span><div><p>not this one</p><p>not this one either</p><div><p>not this one, but almost</p><p>this one</p></div></div><p>certainly not this one</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html))
                   [[:div (html/nth-of-type 2)] :> :div :> [:p (html/nth-of-type 2)]])
({:tag :p, :attrs nil, :content ["this one"]})
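
The same pattern might carry over to nested tables roughly as follows. This is only a sketch: the markup and the names nested-tables, nodes, and inner-second are invented, and it assumes the parser keeps this simple, well-formed input as written. Because an inner table sits inside a <td> rather than directly under the outer table, the second step is a plain descendant step with no :> in front of it.

```clojure
(require '[net.cgrand.enlive-html :as html])

;; Hypothetical markup: two top-level tables; the second one
;; holds two inner tables inside a single cell.
(def nested-tables
  (str "<html><head></head><body>"
       "<table><tr><td>outer 1</td></tr></table>"
       "<table><tr>"
       "<td><table><tr><td>inner 1</td></tr></table>"
       "<table><tr><td>inner 2</td></tr></table></td>"
       "</tr></table>"
       "</body></html>"))

(def nodes
  (html/html-resource (java.io.StringReader. nested-tables)))

;; Step 1: the second <table> that is a direct child of <body>.
;; Step 2: any <table> below it that is the second of its type
;;         among its own siblings.
(def inner-second
  (html/select nodes
               [:body :> [:table (html/nth-of-type 2)]
                [:table (html/nth-of-type 2)]]))

(println (map html/text inner-second))
```

On a real page with messy markup, the parser may rearrange tables while repairing the HTML, so it is worth checking each step of the chain separately before combining them.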
Michał Marczyk
Seems like the second problem might be due to bad HTML. Can I combine nth-of-type with other selectors? If I need to find the second table inside the second table, can I do something like [:table (nth-of-type 2) :> :table (nth-of-type 2)]?
Mad Wombat
Yes, you can. I've edited in a new example. HTH.
Michał Marczyk
Ah! [] are intersections! The enlightenment is near!
Mad Wombat