Hey guys.

I am working on some code that scrapes a page for elements with either of two CSS classes. I am simply using the Hpricot search method for this, like so:

webpage.search("body").search("div.first_class | div.second_class")

...for each item found I create an object and put it into an array. This works great except for one thing.

The search goes through the entire HTML page and adds an object to the array every time it comes across '.first_class', and then it goes through the document again looking for '.second_class'. The final array therefore contains the items in the wrong order, i.e. all of the '.first_class' objects followed by all of the '.second_class' objects.
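The grouping described here is what you would expect if each selector in the union is evaluated in its own pass and the results are concatenated. A minimal pure-Ruby sketch of that effect follows. It is not Hpricot's actual implementation; the `PAGE` list and the `search_union` helper are hypothetical stand-ins used only to illustrate the ordering.

```ruby
# Hypothetical stand-in for the divs on a page, listed in document order.
# Each entry is [id, css_class]; none of this is Hpricot API.
PAGE = [
  ["a", "first_class"],
  ["b", "second_class"],
  ["c", "first_class"],
  ["d", "second_class"]
].freeze

# Assumed behaviour of a union selector: run one full pass per class
# and append each pass's matches to the result.
def search_union(page, *classes)
  classes.flat_map do |wanted|
    page.select { |_id, css_class| css_class == wanted }
  end
end

grouped = search_union(PAGE, "first_class", "second_class").map(&:first)
# grouped is ["a", "c", "b", "d"]: grouped by class, not in page order
```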

Is there a way I can get this to search the document in one go and add an object to the array each time it comes across either of the specified classes, giving me an array of items in the order they appear on the page I am scraping?

Any help much appreciated. Thanks

+1  A: 

See the section here on "Checking for a Few Attributes":

http://wiki.github.com/why/hpricot/hpricot-challenge

You should be able to stack the elements in the same way as you do attributes. This feature is apparently available in Hpricot versions after 2006 Mar 17. The example given there, which stacks attributes, is:

doc.search("[@href][@type]")
Jon
A: 

Thanks for the tip. I hadn't spotted that in the documentation, and I also found another page I hadn't seen either. I have fixed this with the following line:

webpage.search("body").search("[@class~='first_class']|[@class~='second_class']")

This now adds an object into the array each time it comes across one of the above classes in the document. Brilliant!

zoltarSpeaks
A: 

OK, so it turned out I was mistaken and this didn't do anything different from what I previously had. However, I have come up with a solution; whether it is the most suitable or not I am not sure. It seems like a fairly straightforward fix for an annoying problem, though.

I still perform the search for the two classes as I mentioned above:

webpage.search("body").search("[@class~='first_class']|[@class~='second_class']")

However, this still returned an array containing all the divs with a class of 'first_class' first, followed by all the divs with a class of 'second_class'. So to fix this and get an array of all the items in the order they appear on the page, I simply chain the 'add_class' method onto the search, passing my own custom class, e.g. 'foo_bar'. This then allows me to perform another search on the page for all divs with just this one class, which returns an array of all the items I am after, in the order they appear on the page.

webpage.search("body").search("[@class~='first_class']|[@class~='second_class']").add_class("foo_bar")

webpage.search("body").search("[@class~='foo_bar']")
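For comparison, here is a rough sketch of the "one pass in document order" idea without the marker class. It is only an illustration under assumptions: `Node` and `collect_in_order` are hypothetical names, and the tiny hand-built tree stands in for a parsed page rather than using Hpricot itself. A single search for all divs, filtered afterwards in Ruby, should behave the same way, though that is an assumption not tested here.

```ruby
# Hypothetical minimal node type standing in for a parsed HTML element.
Node = Struct.new(:name, :classes, :children)

# Depth-first, pre-order walk: nodes are visited in the order they
# appear in the markup, so matches come out in page order.
def collect_in_order(node, wanted, found = [])
  found << node if node.name == "div" && (node.classes & wanted).any?
  node.children.each { |child| collect_in_order(child, wanted, found) }
  found
end

# Small hand-built tree used only for this example.
body = Node.new("body", [], [
  Node.new("div", ["first_class"], []),
  Node.new("div", ["wrapper"], [
    Node.new("div", ["second_class", "extra"], []),
    Node.new("div", ["first_class"], [])
  ]),
  Node.new("div", ["second_class"], [])
])

ordered = collect_in_order(body, ["first_class", "second_class"])
ordered_classes = ordered.map { |n| n.classes.first }
# ordered_classes is
# ["first_class", "second_class", "first_class", "second_class"]
```

The check uses a set intersection on the class list, so an element carrying extra classes alongside one of the wanted ones still matches, much like the `~=` attribute selector above.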
zoltarSpeaks