I'm trying to use Hpricot to get the value within a span with a class name I don't know. I know that it follows the pattern "foo_[several digits]_bar".

Right now, I'm getting the entire containing element as a string and using a regex to parse the string for the tag. That solution works, but it seems really ugly.

doc = Hpricot(open("http://scrape.example.com/search?q=#{ticker_symbol}"))
elements = doc.search("//span[@class='pr']").inner_html
string = ""
elements.each do |attr|
  if(attr =~ /foo_\d+_bar/)
    string = attr
  end
end
# get rid of the span tags, just get the value
string.sub!(/<\/span>/, "")
string.sub!(/<span.+>/, "")

return string

It seems like there should be a better way to do that. I'd like to do something like:

elements = doc.search("//span[@class='" + /foo_\d+_bar/ + "']").inner_html

But that doesn't run. Is there a way to search with a regular expression?

A: 

One could modify the incoming HTML before parsing.

html = open("http://scrape.example.com/search?q=#{ticker_symbol}").read
html.gsub!(/class="(foo_\d+_bar)"/){ |s| "class=\"foo_bar #{$1}\"" }
doc = Hpricot(html)

After that you can identify the elements using the foo_bar class. This is far from elegant or general but could prove to be more efficient.

anshul
Thanks for the suggestion. That would have worked, except it returns a string. I'd rather get an hpricot Element object back.
AaronM
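For what it's worth, the rewrite step above only changes the markup; once the modified HTML is handed to Hpricot, a search should still give back element objects rather than strings. A minimal sketch of the idea, where tag_dynamic_classes is a hypothetical helper name and the sample markup is made up for illustration:

```ruby
# Hypothetical helper (not part of Hpricot): adds a fixed "foo_bar" class
# next to every dynamic foo_<digits>_bar class so that a plain class
# selector can find those elements later.
def tag_dynamic_classes(html)
  html.gsub(/class="(foo_\d+_bar)"/) { "class=\"foo_bar #{$1}\"" }
end

# Made-up sample markup for illustration.
sample = '<span class="foo_12345_bar">42.17</span>'
puts tag_dynamic_classes(sample)
# prints: <span class="foo_bar foo_12345_bar">42.17</span>

# Assumed follow-up with Hpricot (requires the hpricot gem):
#   doc = Hpricot(tag_dynamic_classes(html))
#   doc.search("span.foo_bar")   # elements, not strings
```

Note that the regex only fires when the class attribute is double-quoted and holds exactly one class, so markup like class="pr foo_1_bar" would be left untouched.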
A: 

This should do:

doc.search("span[@class^='foo'][@class$='bar']")
Nakul
This looks like what I want. I'll give it a try and see how that goes.
AaronM
Worked perfectly! That's exactly what I wanted.
AaronM
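One caveat on the prefix/suffix selector: ^= and $= only test how the whole attribute value starts and ends, so they would also accept something like "foo_bar" or "football_bar". If the digits matter, one option is to select all spans and filter them with the regex in Ruby. A rough sketch, assuming a hypothetical helper named matching_class? and that Hpricot elements expose their attributes through attributes["class"]:

```ruby
# Pattern from the question, anchored so the whole class token must match.
CLASS_PATTERN = /\Afoo_\d+_bar\z/

# Hypothetical helper (not part of Hpricot): true when any
# whitespace-separated class token matches the pattern.
# A nil attribute (element without a class) is treated as no match.
def matching_class?(class_attr, pattern = CLASS_PATTERN)
  class_attr.to_s.split.any? { |token| token =~ pattern }
end

# Assumed usage with an Hpricot document `doc` (requires the hpricot gem):
#   spans  = doc.search("span").select { |el| matching_class?(el.attributes["class"]) }
#   values = spans.map { |el| el.inner_html }
```

This walks every span, so it is slower than the selector, but it keeps the match as strict as the original regex and still returns elements.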