In my Ruby on Rails app, I am trying to build a parser to extract some metadata from a string.

Let's say the sample string is:

The quick red fox (frank,10) jumped over the lazy brown dog (ralph, 20).

I want to extract the substring inside the last occurrence of ( ).

So, I want to get "ralph, 20" no matter how many ( ) are in the string.

Is there a best way to do this string extraction in Ruby, for example with a regexp?

Thanks,

John

+1  A: 

I would try this (here my regex assumes the first value is alphanumeric and the second value is a digit, adjust accordingly). Here the scan gets all occurrences as an array and the -1 tells us to grab just the last one, which seems to be just what you're asking for:

>> foo = "The quick red fox (frank,10) jumped over the lazy brown dog (ralph, 20)."
=> "The quick red fox (frank,10) jumped over the lazy brown dog (ralph, 20)."
>> foo.scan(/\(\w+, ?\d+\)/)[-1]
=> "(ralph, 20)"
Chris Bunch
Awesome! ... I ended up changing it to foo.scan(/\(.*,*.*\)/)[-1] since I don't really need to restrict it to the example character types. Thanks
Streamline
Wouldn't s.scan(/\(.*?\)/)[-1]; be easier?
Chas. Owens
Or s.scan(/\([^)]*\)/)[-1] if you don't like non-greedy matches.
Chas. Owens
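
For comparison, here is a minimal sketch of how the patterns from these comments behave on the sample string. The variable names are mine, and the capture-group variant at the end is an addition that nobody posted above; it is just one way to drop the parentheses directly.

```ruby
s = "The quick red fox (frank,10) jumped over the lazy brown dog (ralph, 20)."

# Greedy .* runs from the first "(" to the last ")", so scan finds one long match.
greedy = s.scan(/\(.*,*.*\)/)[-1]
# => "(frank,10) jumped over the lazy brown dog (ralph, 20)"

# Non-greedy .*? stops at the first ")" after each "(".
non_greedy = s.scan(/\(.*?\)/)[-1]
# => "(ralph, 20)"

# A negated character class gives the same result without a non-greedy quantifier.
negated = s.scan(/\([^)]*\)/)[-1]
# => "(ralph, 20)"

# With a capture group, scan returns an array of capture arrays, so the
# first capture of the last match is the text inside the parentheses.
# (&. needs Ruby 2.3+; it yields nil when there is no match at all.)
inner = s.scan(/\(([^)]*)\)/).last&.first
# => "ralph, 20"

puts greedy, non_greedy, negated, inner
```

Note that the greedy pattern only looks right when the string contains a single pair of parentheses.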
A: 

It looks like you want a sexeger. They work by reversing the string, running a reversed regex against the string, and then reversing the results. Here is an example (pardon the code, I don't really know Ruby):

#!/usr/bin/ruby

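s = "The quick red fox (frank,10) jumped over the lazy brown dog (ralph, 20)."

# Reverse the string so the last "(...)" pair comes first.
reversed_s = s.reverse
# In the reversed string the pair reads ")...(", so match ")" first, then "(".
reversed_s =~ /^.*?\)(.*?)\(/
# $1 holds the captured text backwards; reverse it to restore the original order.
result = $1.reverse
puts result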

The fact that this is getting no up votes tells me nobody clicked through to read why you want to use a sexeger, so here are the results of a benchmark:

do they all return the same thing?
ralph, 20
ralph, 20
ralph, 20
ralph, 20
                        user     system      total        real
scan greedy         0.760000   0.000000   0.760000 (  0.772793)
scan non greedy     0.750000   0.010000   0.760000 (  0.760855)
right index         0.760000   0.000000   0.760000 (  0.770573)
sexeger non greedy  0.400000   0.000000   0.400000 (  0.408110)

And here is the benchmark:

#!/usr/bin/ruby

require 'benchmark'

def scan_greedy(s)
    result = s.scan(/\([^)]*\)/x)[-1]
    result[1 .. result.length - 2]
end

def scan_non_greedy(s)
    result = s.scan(/\(.*?\)/)[-1]
    result[1 .. result.length - 2]
end

def right_index(s)
    s[s.rindex('(') + 1 .. s.rindex(')') -1]
end

def sexeger_non_greedy(s)
    s.reverse =~ /^.*?\)(.*?)\(/
    $1.reverse
end

s = "The quick red fox (frank,10) jumped over the lazy brown dog (ralph, 20).";

puts "do they all return the same thing?", 
    scan_greedy(s), scan_non_greedy(s), right_index(s), sexeger_non_greedy(s)

n = 100_000
Benchmark.bm(18) do |x|
    x.report("scan greedy")        { n.times do; scan_greedy(s); end }
    x.report("scan non greedy")    { n.times do; scan_non_greedy(s); end }
    # Note: the timings posted above were taken with scan_greedy(s) on this line.
    x.report("right index")        { n.times do; right_index(s); end }
    x.report("sexeger non greedy") { n.times do; sexeger_non_greedy(s); end }
end
Chas. Owens
Interesting (and thorough!)... does the benchmark show how long each approach takes to get to the answer?
Streamline
Yes, specifically the time to run the function 100,000 times. If you want to see an impressive difference, change s to be "(foo)(foo)(foo)(foo)(foo)(bar)" and n to 10000. The sexeger is an order of magnitude faster.
Chas. Owens
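
The variation suggested in that last comment could be set up roughly as follows. This is a trimmed-down sketch of the benchmark above, not the original script: it keeps only two of the functions, and my version of the sexeger uses \A and String#match instead of ^ and $1 so that it returns nil rather than raising when nothing matches. No timings are shown here since they depend on the machine and Ruby version.

```ruby
require 'benchmark'

def scan_non_greedy(s)
  # Take the last "(...)" match and strip the surrounding parentheses.
  s.scan(/\(.*?\)/)[-1][1..-2]
end

def sexeger_non_greedy(s)
  # Match against the reversed string, then reverse the capture back.
  m = s.reverse.match(/\A.*?\)(.*?)\(/)
  m && m[1].reverse
end

s = "(foo)(foo)(foo)(foo)(foo)(bar)"
n = 10_000

Benchmark.bm(18) do |x|
  x.report("scan non greedy")    { n.times { scan_non_greedy(s) } }
  x.report("sexeger non greedy") { n.times { sexeger_non_greedy(s) } }
end
```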
+1  A: 

A simple solution without regular expressions:

string = "The quick red fox (frank,10) jumped over the lazy brown dog (ralph, 20)."
string[string.rindex('(')..string.rindex(')')]

Example:

irb(main):001:0> string =  "The quick red fox (frank,10) jumped over the lazy brown dog (ralph, 20)."
=> "The quick red fox (frank,10) jumped over the lazy brown dog (ralph, 20)."
irb(main):002:0> string[string.rindex('(')..string.rindex(')')]
=> "(ralph, 20)"

And without the parentheses:

irb(main):007:0> string[string.rindex('(')+1..string.rindex(')')-1]
=> "ralph, 20"
Gdeglin
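
One caveat with the rindex approach: rindex returns nil when the character is not found, so the one-liner raises NoMethodError on a string without parentheses. A minimal sketch of a nil-safe variant follows; the helper name last_parenthesized is hypothetical and not from the answer above. It also looks for the "(" only to the left of the last ")", which is my own choice for handling unbalanced input.

```ruby
# Hypothetical helper: returns the text inside the last "(...)" pair,
# or nil when the string contains no such pair.
def last_parenthesized(string)
  close_idx = string.rindex(')')
  return nil if close_idx.nil? || close_idx.zero?

  # Search backwards for "(" starting just before the last ")".
  open_idx = string.rindex('(', close_idx - 1)
  return nil if open_idx.nil?

  # Exclusive range: everything between the two parentheses.
  string[(open_idx + 1)...close_idx]
end

string = "The quick red fox (frank,10) jumped over the lazy brown dog (ralph, 20)."
puts last_parenthesized(string)             # prints: ralph, 20
puts last_parenthesized("no parens").inspect # prints: nil
```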