views:

151

answers:

7

Hello,

I have a problem where I am trying to search for a substring in string. That sub string may or may not be in the string.

str = "hello how are you?"

substr = "how are"

Two ways which I know if can be done are:
1. string.indexOf("how are")
2. regex

But, is there any other "optimized" way? What would have you done?

Can ruby provide a better answer, since we use jruby - answer can be in ruby or java?

A: 

the best way is indexof, regex is slower

sirmak
+3  A: 

In Ruby, use the String#include? method, e.g.

str = "hello how are you?"

substr = "how are"

str.include? substr 

returns true

Jas
A: 

To my knowledge there's no "magic" way to search for substrings really fast, unless you're ready to build some kind of search metadata (think index) beforehand. Doing so will most likely waste more time than it saves you unless you search through the same string a lot.

As the search pattern is simple, I would steer clear of regex.

Martin Törnwall
+1  A: 

If you want only check if substring is in string, you can use: str[substr]

It returns substring or nil.

Nakilon
ruby?...............
name
ruby...........
Nakilon
thanks........... ;)
name
+1  A: 

For an overview of "other ways", you can start from the Wikipedia article on string searching algorithms:

http://en.wikipedia.org/wiki/String_searching_algorithm

Indexing strings is one very obvious way of speeding things up, as mentioned by Martin, which only is appropriate if you are doing several searches over the same string:

http://en.wikipedia.org/wiki/Substring_index

Hostile Fork
A: 

Try KMP

Stas
+2  A: 

"What would have you done?"

I'd do a benchmark and try comparing different ways of accomplishing the same thing to learn which is fastest.

In older versions of Ruby we'd see the regex-based searches running more slowly. The new engine in 1.9.2, which I'm using for the benchmark, makes a big difference. In particular, unanchored searches used to be a lot slower than anchored. Now it's a wash whether you use regex or a fixed string searches for the most part. Using match() without precompiling the regex is a painful hit for speed so if you're doing a lot of loops with the same pattern then it makes sense to assign the pattern to a variable and reference the variable.

I'm currently running a backup, so my results might not be as accurate or stable as they should be.

The times displayed are how long each test took to perform "n" (750,000) iterations, so lower numbers are better.

require 'benchmark'

LOREM = %q{Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut et convallis purus. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Cras interdum nibh et nunc pellentesque vestibulum. Donec elementum felis malesuada urna vehicula consectetur commodo urna accumsan. Phasellus bibendum euismod tincidunt. Sed pellentesque cursus faucibus. Etiam bibendum tincidunt nibh eget ultrices. Fusce imperdiet, felis id consequat imperdiet, justo est ultrices elit, sed vestibulum dui nibh vel felis. Sed feugiat, libero quis consequat semper, magna tellus facilisis enim, rutrum adipiscing eros mauris commodo metus. Sed lobortis aliquet augue ac sodales. Quisque pharetra odio vel augue tempus porttitor.}

REGEX1 = %r{/porttitor\.$/}
REGEX2 = %r{/porttitor\./}

n = 750_000
puts "word in string"
Benchmark.bm(15) do |x|
  x.report('string[""]:')  { n.times { LOREM['porttitor.']          } }
  x.report('string[//]:')  { n.times { LOREM[/porttitor\./]         } } # unanchored regex
  x.report('string[/$/]:') { n.times { LOREM[/porttitor\.$/]        } } # anchored regex
  x.report('index():')     { n.times { LOREM.index('porttitor.')    } }
  x.report('include?():')  { n.times { LOREM.include?('porttitor.') } }
  x.report('match($):')    { n.times { LOREM.match(/porttitor\.$/)  } }
  x.report('match():')     { n.times { LOREM.match(/porttitor\./)   } }
  x.report('match2($):')   { n.times { LOREM.match(REGEX1)          } } # compiled regex w/ anchor
  x.report('match2():')    { n.times { LOREM.match(REGEX2)          } } # compiled report w/out anchor
end

puts
puts "word not in string"
Benchmark.bm(15) do |x|
  x.report('string[""]:')  { n.times { LOREM['porttit0r.']          } }
  x.report('string[//]:')  { n.times { LOREM[/porttit0r\./]         } } # unanchored regex
  x.report('string[/$/]:') { n.times { LOREM[/porttit0r\.$/]        } } # anchored regex
  x.report('index():')     { n.times { LOREM.index('porttit0r.')    } }
  x.report('include?():')  { n.times { LOREM.include?('porttit0r.') } }
  x.report('match($):')    { n.times { LOREM.match(/porttit0r\.$/)  } }
  x.report('match():')     { n.times { LOREM.match(/porttit0r\./)   } }
end
# >> word in string
# >>                      user     system      total        real
# >> string[""]:      0.790000   0.020000   0.810000 (  0.956823)
# >> string[//]:      0.680000   0.000000   0.680000 (  0.690771)
# >> string[/$/]:     0.680000   0.000000   0.680000 (  0.693121)
# >> index():         0.740000   0.010000   0.750000 (  0.770159)
# >> include?():      0.700000   0.010000   0.710000 (  0.751336)
# >> match($):        1.490000   0.020000   1.510000 (  1.559802)
# >> match():         1.490000   0.020000   1.510000 (  1.638388)
# >> match2($):       0.730000   0.000000   0.730000 (  0.769904)
# >> match2():        0.740000   0.010000   0.750000 (  0.835605)
# >> 
# >> word not in string
# >>                      user     system      total        real
# >> string[""]:      0.730000   0.010000   0.740000 (  0.743445)
# >> string[//]:      0.710000   0.010000   0.720000 (  0.784549)
# >> string[/$/]:     0.690000   0.010000   0.700000 (  0.834742)
# >> index():         0.770000   0.010000   0.780000 (  0.833254)
# >> include?():      0.740000   0.020000   0.760000 (  0.890391)
# >> match($):        0.790000   0.010000   0.800000 (  0.905252)
# >> match():         0.790000   0.020000   0.810000 (  0.952930)

For reference, here's some numbers using Ruby 1.8.7, which is the default for Snow Leopard:

word in string
                     user     system      total        real
string[""]:      1.220000   0.010000   1.230000 (  1.223386)
string[//]:      1.260000   0.010000   1.270000 (  1.281833)
string[/$/]:     1.270000   0.000000   1.270000 (  1.268758)
index():         1.150000   0.010000   1.160000 (  1.164530)
include?():      1.140000   0.000000   1.140000 (  1.146797)
match($):        1.440000   0.000000   1.440000 (  1.437902)
match():         1.430000   0.000000   1.430000 (  1.432468)
match2($):       0.690000   0.000000   0.690000 (  0.690565)
match2():        0.620000   0.010000   0.630000 (  0.632283)

word not in string
                     user     system      total        real
string[""]:      1.130000   0.000000   1.130000 (  1.132296)
string[//]:      0.580000   0.000000   0.580000 (  0.582498)
string[/$/]:     0.580000   0.010000   0.590000 (  0.591405)
index():         1.150000   0.020000   1.170000 (  1.164159)
include?():      1.130000   0.010000   1.140000 (  1.141589)
match($):        0.670000   0.000000   0.670000 (  0.680749)
match():         0.680000   0.000000   0.680000 (  0.678540)
Greg
Nice job! But it remains be strange for me, that the str[substr] is slower than regexps. In theory, str[substr] is the easiest method to implement in the way of good perfomance. I hope, once japans will optimize it.
Nakilon
I haven't looked at the source, but I suspect that str[substr] really treats every substring inside the brackets as a regex, forcing a recompile of the pattern, or skipping to some other code that does the fixed-string search. Doing a fixed-string search in C/C++ is easy so there must be some other processing going on. Or, I did the benchmark wrong but I don't think so.
Greg