ansaurus

Question

Best way to find a sub string in a string

Answer 1

A:

the best way is indexof, regex is slower

sirmak 2010-10-06 19:41:47

Answer 2

+3 A:

In Ruby, use the String#include? method, e.g.

str = "hello how are you?"

substr = "how are"

str.include? substr

returns true

Jas 2010-10-06 19:45:55

Answer 3

A:

To my knowledge there's no "magic" way to search for substrings really fast, unless you're ready to build some kind of search metadata (think index) beforehand. Doing so will most likely waste more time than it saves you unless you search through the same string a lot.

As the search pattern is simple, I would steer clear of regex.

Martin Törnwall 2010-10-06 19:46:42

Answer 4

+1 A:

If you want only check if substring is in string, you can use: str[substr]

It returns substring or nil.

Nakilon 2010-10-06 19:46:57

ruby?...............

name 2010-10-06 19:47:38

ruby...........

Nakilon 2010-10-06 19:49:04

thanks........... ;)

name 2010-10-06 19:52:51

Answer 5

+1 A:

For an overview of "other ways", you can start from the Wikipedia article on string searching algorithms:

http://en.wikipedia.org/wiki/String_searching_algorithm

Indexing strings is one very obvious way of speeding things up, as mentioned by Martin, which only is appropriate if you are doing several searches over the same string:

http://en.wikipedia.org/wiki/Substring_index

Hostile Fork 2010-10-06 19:47:53

Answer 6

A:

Try KMP

Stas 2010-10-06 20:16:50

Answer 7

+2 A:

"What would have you done?"

I'd do a benchmark and try comparing different ways of accomplishing the same thing to learn which is fastest.

In older versions of Ruby we'd see the regex-based searches running more slowly. The new engine in 1.9.2, which I'm using for the benchmark, makes a big difference. In particular, unanchored searches used to be a lot slower than anchored. Now it's a wash whether you use regex or a fixed string searches for the most part. Using match() without precompiling the regex is a painful hit for speed so if you're doing a lot of loops with the same pattern then it makes sense to assign the pattern to a variable and reference the variable.

I'm currently running a backup, so my results might not be as accurate or stable as they should be.

The times displayed are how long each test took to perform "n" (750,000) iterations, so lower numbers are better.

require 'benchmark'

LOREM = %q{Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut et convallis purus. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Cras interdum nibh et nunc pellentesque vestibulum. Donec elementum felis malesuada urna vehicula consectetur commodo urna accumsan. Phasellus bibendum euismod tincidunt. Sed pellentesque cursus faucibus. Etiam bibendum tincidunt nibh eget ultrices. Fusce imperdiet, felis id consequat imperdiet, justo est ultrices elit, sed vestibulum dui nibh vel felis. Sed feugiat, libero quis consequat semper, magna tellus facilisis enim, rutrum adipiscing eros mauris commodo metus. Sed lobortis aliquet augue ac sodales. Quisque pharetra odio vel augue tempus porttitor.}

REGEX1 = %r{/porttitor\.$/}
REGEX2 = %r{/porttitor\./}

n = 750_000
puts "word in string"
Benchmark.bm(15) do |x|
  x.report('string[""]:')  { n.times { LOREM['porttitor.']          } }
  x.report('string[//]:')  { n.times { LOREM[/porttitor\./]         } } # unanchored regex
  x.report('string[/$/]:') { n.times { LOREM[/porttitor\.$/]        } } # anchored regex
  x.report('index():')     { n.times { LOREM.index('porttitor.')    } }
  x.report('include?():')  { n.times { LOREM.include?('porttitor.') } }
  x.report('match($):')    { n.times { LOREM.match(/porttitor\.$/)  } }
  x.report('match():')     { n.times { LOREM.match(/porttitor\./)   } }
  x.report('match2($):')   { n.times { LOREM.match(REGEX1)          } } # compiled regex w/ anchor
  x.report('match2():')    { n.times { LOREM.match(REGEX2)          } } # compiled report w/out anchor
end

puts
puts "word not in string"
Benchmark.bm(15) do |x|
  x.report('string[""]:')  { n.times { LOREM['porttit0r.']          } }
  x.report('string[//]:')  { n.times { LOREM[/porttit0r\./]         } } # unanchored regex
  x.report('string[/$/]:') { n.times { LOREM[/porttit0r\.$/]        } } # anchored regex
  x.report('index():')     { n.times { LOREM.index('porttit0r.')    } }
  x.report('include?():')  { n.times { LOREM.include?('porttit0r.') } }
  x.report('match($):')    { n.times { LOREM.match(/porttit0r\.$/)  } }
  x.report('match():')     { n.times { LOREM.match(/porttit0r\./)   } }
end
# >> word in string
# >>                      user     system      total        real
# >> string[""]:      0.790000   0.020000   0.810000 (  0.956823)
# >> string[//]:      0.680000   0.000000   0.680000 (  0.690771)
# >> string[/$/]:     0.680000   0.000000   0.680000 (  0.693121)
# >> index():         0.740000   0.010000   0.750000 (  0.770159)
# >> include?():      0.700000   0.010000   0.710000 (  0.751336)
# >> match($):        1.490000   0.020000   1.510000 (  1.559802)
# >> match():         1.490000   0.020000   1.510000 (  1.638388)
# >> match2($):       0.730000   0.000000   0.730000 (  0.769904)
# >> match2():        0.740000   0.010000   0.750000 (  0.835605)
# >> 
# >> word not in string
# >>                      user     system      total        real
# >> string[""]:      0.730000   0.010000   0.740000 (  0.743445)
# >> string[//]:      0.710000   0.010000   0.720000 (  0.784549)
# >> string[/$/]:     0.690000   0.010000   0.700000 (  0.834742)
# >> index():         0.770000   0.010000   0.780000 (  0.833254)
# >> include?():      0.740000   0.020000   0.760000 (  0.890391)
# >> match($):        0.790000   0.010000   0.800000 (  0.905252)
# >> match():         0.790000   0.020000   0.810000 (  0.952930)

For reference, here's some numbers using Ruby 1.8.7, which is the default for Snow Leopard:

word in string
                     user     system      total        real
string[""]:      1.220000   0.010000   1.230000 (  1.223386)
string[//]:      1.260000   0.010000   1.270000 (  1.281833)
string[/$/]:     1.270000   0.000000   1.270000 (  1.268758)
index():         1.150000   0.010000   1.160000 (  1.164530)
include?():      1.140000   0.000000   1.140000 (  1.146797)
match($):        1.440000   0.000000   1.440000 (  1.437902)
match():         1.430000   0.000000   1.430000 (  1.432468)
match2($):       0.690000   0.000000   0.690000 (  0.690565)
match2():        0.620000   0.010000   0.630000 (  0.632283)

word not in string
                     user     system      total        real
string[""]:      1.130000   0.000000   1.130000 (  1.132296)
string[//]:      0.580000   0.000000   0.580000 (  0.582498)
string[/$/]:     0.580000   0.010000   0.590000 (  0.591405)
index():         1.150000   0.020000   1.170000 (  1.164159)
include?():      1.130000   0.010000   1.140000 (  1.141589)
match($):        0.670000   0.000000   0.670000 (  0.680749)
match():         0.680000   0.000000   0.680000 (  0.678540)

Greg 2010-10-07 04:23:04

Nice job! But it remains be strange for me, that the str[substr] is slower than regexps. In theory, str[substr] is the easiest method to implement in the way of good perfomance. I hope, once japans will optimize it.

Nakilon 2010-10-07 17:38:33

I haven't looked at the source, but I suspect that str[substr] really treats every substring inside the brackets as a regex, forcing a recompile of the pattern, or skipping to some other code that does the fixed-string search. Doing a fixed-string search in C/C++ is easy so there must be some other processing going on. Or, I did the benchmark wrong but I don't think so.

Greg 2010-10-07 18:07:11

ansaurus

tags:

views:

answers:

Best way to find a sub string in a string

related questions