I think I need a combination of Hpricot and regex here. I need to find 'a' tags whose 'href' attribute starts with '/abc/', and return the text that follows, up to the next forward slash '/'.

So, given:

<a href="/abc/12345/xyz123/">One</a>
<a href="/abc/67890/xyzabc/">Two</a>

I need to get back: '12345' and '67890'

Can anyone lend a hand? I've been struggling with this.

A: 

What about splitting the string by /?

(I don't know Hpricot, but according to the docs):

ids = doc.search("a[@href]").map do |a|
    # index 2, because the string starts with '/' and the first piece is empty
    a.attributes["href"].split("/")[2]
end
Time Machine
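
The split-by-slash idea above can be tried without any HTML parser, on plain href strings. This is only a minimal sketch: `segment_after_abc` is a hypothetical helper name, and the third href is made-up input added to show what happens when the prefix does not match.

```ruby
# Hypothetical helper: returns the path segment right after "/abc/",
# or nil when the href does not start with that prefix.
def segment_after_abc(href)
  return nil unless href.start_with?("/abc/")
  href.split("/")[2] # ["", "abc", "12345", ...] -> index 2
end

# The last entry is invented input that should be skipped.
hrefs = ["/abc/12345/xyz123/", "/abc/67890/xyzabc/", "/other/111/"]
ids = hrefs.map { |h| segment_after_abc(h) }.compact
p ids # => ["12345", "67890"]
```

The `start_with?` check matters because `split("/")[2]` alone would happily return "111" for the last href.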
A: 

or use regex:

s = '<a href="/abc/12345/xyz123/">One</a>'
s =~ /abc\/([^\/]*)/
puts $1 # prints 12345
vurte
`<a lol="/abc/blablabla/screwit" href="/abc/12345/xyz123/">One</a>`
Time Machine
s =~ /href="\/abc\/([^\/]*)/
vurte
href = '/abc/12345/spacesandsinglequotesandtabsandnewlines'
Time Machine
Never parse HTML with RegEx.
Time Machine
Of course, you should know your code and the possible inputs ... in this case: s =~ /[Hh][Rr][Ee][Ff]=["']\/abc\/([^\/]*)/
vurte
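
For what it's worth, the pattern the comments above converge on can be checked against the tricky inputs they mention. This is a sketch, not a recommendation: the variable name `href_abc` and the second, uppercase, single-quoted tag are assumptions made up for the test. It is still fragile (an attribute such as `data-href` would also match), which is the point of the "never parse HTML with regex" comment.

```ruby
# Assumed pattern: case-insensitive href, either quote style, then /abc/<segment>.
href_abc = /\bhref\s*=\s*["']\/abc\/([^\/"']*)/i

html = <<~HTML
  <a lol="/abc/blablabla/screwit" href="/abc/12345/xyz123/">One</a>
  <A HREF='/abc/67890/xyzabc/'>Two</A>
HTML

# scan returns one array per match (one entry per capture group), so flatten it.
ids = html.scan(href_abc).flatten
p ids # => ["12345", "67890"]
```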
A: 

You don't need regex, but you can use it. Here are two examples, one with regex and one without, using Nokogiri, which should be compatible with Hpricot for your purposes and uses CSS selectors:

require 'nokogiri'

html = %q[
  <a href="/abc/12345/xyz123/">One</a>
  <a href="/abc/67890/xyzabc/">Two</a>
]

doc = Nokogiri::HTML(html)
# Select only links whose href starts with "/abc/"
doc.css('a[href^="/abc/"]').map { |h| h['href'][/(\d+)/, 1] }   # => ["12345", "67890"]
doc.css('a[href^="/abc/"]').map { |h| h['href'].split('/')[2] } # => ["12345", "67890"]
Greg
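
One caveat about the regex variant above: `(\d+)` takes the first run of digits anywhere in the href, so it depends on the links having been filtered first. A possible tightening, shown on plain strings so it runs without Nokogiri, is to anchor the pattern to the "/abc/" prefix. The third href is invented input to show the difference.

```ruby
# The last entry is made-up input that does not start with "/abc/".
hrefs = ["/abc/12345/xyz123/", "/abc/67890/xyzabc/", "/v2/abc/999/"]

# (\d+) grabs the first digit run anywhere in the string.
loose = hrefs.map { |h| h[/(\d+)/, 1] }

# Anchoring at the start ties the capture to the "/abc/" prefix;
# non-matching hrefs give nil and are dropped by compact.
strict = hrefs.map { |h| h[%r{\A/abc/([^/]+)}, 1] }.compact

p loose  # => ["12345", "67890", "2"]
p strict # => ["12345", "67890"]
```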