+1  Q: 

Ruby Regex Help

I know a little bit of regex, but not much. What is the best way to get just the number out of the following HTML? (I want to have 32 returned.) The values of width, rowspan, and size are all different throughout this horrible HTML page. Any help?

<td width=14 rowspan=2 align=right><font size=2 face="helvetica">32</font></td>
+2  A: 

How about

>(\d+)<

Or, if you desperately want to avoid using capturing groups at all:

(?<=>)\d+(?=<)
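
For illustration, here is a minimal Ruby sketch of both patterns applied to the cell from the question. The variable names (`html`, `with_group`, `with_lookaround`) are hypothetical, and it assumes a Ruby version whose regex engine supports lookbehind (1.9 or later).

```ruby
# The sample cell from the question; `html` is a hypothetical variable name.
html = '<td width=14 rowspan=2 align=right><font size=2 face="helvetica">32</font></td>'

# Capturing group: the whole match is ">32<", but group 1 holds only the digits.
# String#[] with a regex and a group index returns that group (or nil on no match).
with_group = html[/>(\d+)</, 1]

# Lookbehind/lookahead: the angle brackets are asserted but not consumed,
# so the match itself is only the digits.
with_lookaround = html[/(?<=>)\d+(?=<)/]

puts with_group.to_i
puts with_lookaround.to_i
```

Both forms return a string, so `to_i` is needed if an integer is wanted.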
Joey
This returns >32<, but I guess I could just do string.match(/>(\d+)</)[0].match(/\d+/)[0]
bunnyBEARZ
@bun: Well, you'll find the `32` in the first capturing group ... I edited the answer to include an example which doesn't need the group, though.
Joey
Awesome, thanks a lot.
bunnyBEARZ
A: 

Maybe

<td[^>]*><font[^>]*>\d+</font></td>
Arkadiy
This will certainly match the string above, but won't do anything to extract the `32`.
Joey
Well, if Ruby's regexp syntax is borrowed from Perl, then you need to put \d+ in parentheses and then use match()[1].
Arkadiy
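
As a sketch of what that comment describes, the stricter pattern with `\d+` wrapped in parentheses could be used like this in Ruby. The names `cell`, `cell_pattern`, and `cell_number` are hypothetical, and the nil check is an added assumption for cells that do not match.

```ruby
# The sample cell from the question; `cell` is a hypothetical variable name.
cell = '<td width=14 rowspan=2 align=right><font size=2 face="helvetica">32</font></td>'

# Same pattern as in the answer, with \d+ in a capturing group.
# %r{...} avoids having to escape the forward slashes in the closing tags.
cell_pattern = %r{<td[^>]*><font[^>]*>(\d+)</font></td>}

# String#match returns a MatchData object, or nil when nothing matches.
m = cell.match(cell_pattern)

# Index 1 is the first capturing group; guard against a failed match.
cell_number = m ? m[1].to_i : nil

puts cell_number
```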
+2  A: 

Please, do yourself a favor:

#!/usr/bin/env ruby
require 'nokogiri'

require 'test/unit'
class TestExtraction < Test::Unit::TestCase
  def test_that_it_extracts_the_number_correctly
    doc = Nokogiri::HTML('<td width=14 rowspan=2 align=right><font size=2 face="helvetica">32</font></td>')
    assert_equal [32], (doc / '//td/font').map {|el| el.text.to_i }
  end
end
Jörg W Mittag
I agree. Going after HTML content with regex is a lot more error-prone over the long term than using a parser.
Greg