Using Nokogiri, I need to parse a block given:

<div class="some_class">
  12 AB / 4+ CD
  <br/>
  2,600 Dollars
  <br/> 
</div>

So I need to get the AB, CD and Dollars values (if they exist).

ab = p.css(".some_class").text[....some regex....]
cd = p.css(".some_class").text[....some regex....]
dollars = p.css(".some_class").text[....some regex....]

Is that correct? If so, could someone please help me with the regexes to parse the ab, cd and dollars values?

Thanks!

+4  A: 

To get a better answer you would have to clarify exactly what format the AB, CD and Dollars values take, but here is a solution based on the example given. It uses a regexp group, written with (), to capture the information we're interested in (see the bottom of the answer for more details).

text = p.css(".some_class").text

# one or more digits followed by a space followed by AB, capture the digits
ab = text.match(/(\d+) AB/).captures[0] # => "12"

# one or more digits followed by a literal +, then a space and CD; capture the digits and the +
cd = text.match(/(\d+\+) CD/).captures[0] # => "4+"

# digits or commas followed by "Dollars"
dollars = text.match(/([\d,]+) Dollars/).captures[0] # => "2,600"

Note that if there is no match then String#match returns nil, so if the values might not exist you need a check, e.g.

if match = text.match(/([\d,]+) Dollars/)
  dollars = match.captures[0]
end
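
If a missing value should simply come back as nil, one alternative is String#[] with a regexp and a capture index, which returns nil instead of raising when nothing matches. The sketch below is only an illustration: the sample text is hard-coded as an assumption in place of p.css(".some_class").text, and the input without a Dollars line is made up.

```ruby
# Sample text standing in for p.css(".some_class").text (assumed, hard-coded here)
text = "\n  12 AB / 4+ CD\n  \n  2,600 Dollars\n  \n"

# String#[] with a regexp and a capture index returns the captured group,
# or nil when the pattern does not match
ab      = text[/(\d+) AB/, 1]
cd      = text[/(\d+\+) CD/, 1]
dollars = text[/([\d,]+) Dollars/, 1]

# Hypothetical input without a Dollars line
no_dollars = "12 AB / 4+ CD"
missing = no_dollars[/([\d,]+) Dollars/, 1]   # nil, no exception raised

# Safe navigation (Ruby 2.3+) gives the same effect with match/captures
missing_too = no_dollars.match(/([\d,]+) Dollars/)&.captures&.first
```

Either form avoids calling captures on nil, so no separate if check is needed when nil is an acceptable result.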

Additional explanation of captures

To match the amount of AB we need a pattern /\d+ AB/ to identify the right part of the text. However, we're only interested in the numeric part, so we surround that with parentheses so that we can extract it, e.g.

irb(main):027:0> match = text.match(/(\d+) AB/)
=> #<MatchData:0x2ca3440>           # the match method returns MatchData if there is a match, nil if not
irb(main):028:0> match.to_s         # match.to_s gives us the entire text that matched the pattern
=> "12 AB"
irb(main):029:0> match.captures     
=> ["12"]
# match.captures gives us an array of the parts of the pattern that were enclosed in ()
# in our example there is just 1 but there could be multiple
irb(main):030:0> match.captures[0]
=> "12"                             # the first capture - the bit we want
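
As a small sketch of the "there could be multiple" case, a single pattern can carry two groups, and captures then returns both in order. The combined pattern and the shortened sample string are assumptions based on the example block in the question, not something the original layout guarantees.

```ruby
# Shortened sample text (assumed) covering only the AB / CD line
text = "12 AB / 4+ CD"

# Two groups in one pattern: captures returns them in left-to-right order
match = text.match(/(\d+) AB \/ (\d+\+) CD/)
ab, cd = match.captures          # destructure the array of captures

# Named groups can be read by name instead of by position
named = text.match(/(?<ab>\d+) AB \/ (?<cd>\d+\+) CD/)
named_ab = named[:ab]
named_cd = named[:cd]
```

A combined pattern is stricter than three separate ones: if either half is missing, the whole match is nil, so separate patterns remain the safer choice when values are optional.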

Take a look at the documentation for MatchData, in particular the captures method for more details.

mikej
Thanks! nil is OK for me. So everything works fine.
TJY
Note: if there is no match, the issue is not that your variables, e.g. `dollars`, will be `nil`. It is that the code would try to call the `captures` method on `nil`, which fails. That's why you might want a check.
mikej
Ah, OK, I understand. So what is the captures method? It works fine for me now without captures.
TJY
I have added some more explanation of captures to the answer. I hope that helps.
mikej
Perfect. Thank you!
TJY