Back when I was writing Perl code all the time, I used to reach for regular expressions for all of my string manipulation. Then one day I needed some code that searched and parsed string content, so I wrote a benchmark comparing regex against standard index-based string searches. The index-based search blew away the regex. It isn't as sophisticated, but we don't need sophisticated when dealing with simple problems.
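To make "index-based search" concrete, here is a minimal sketch in Ruby. The input line and the `price=` key are made up for illustration; this is not the Perl code from that original benchmark, just the same idea of locating a substring by position instead of by pattern.

```ruby
# Hypothetical input: a simple "key=value;key=value" line.
line = 'name=widget;price=42'
key  = 'price='

# Index-based: find where the value starts, then where it ends.
start = line.index(key) + key.length
stop  = line.index(';', start) || line.length
by_index = line[start...stop]

# Regex-based: capture everything after the key up to the next ';'.
by_regex = line[/price=([^;]*)/, 1]

p by_index
p by_regex
```

Both approaches pull out the same value. The index-based version is more verbose, but it only does plain substring scans.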
Rather than instantly grabbing a regex, consider that String#squeeze(' ') can handle compressing the repeated spaces a lot faster. Here is the output of a benchmark:
#!/usr/bin/env ruby
require 'benchmark'
# Sample string containing runs of repeated spaces (run lengths are illustrative).
asdf = 'There  is   a  lot    of  white     space.'
asdf.squeeze(' ') # => "There is a lot of white space."
asdf.gsub(/ +/, ' ') # => "There is a lot of white space."
asdf.gsub(/ {2,}/, ' ') # => "There is a lot of white space."
asdf.gsub(/\s\s+/, ' ') # => "There is a lot of white space."
asdf.gsub(/\s{2,}/, ' ') # => "There is a lot of white space."
n = 500000
Benchmark.bm(8) do |x|
  x.report('squeeze:') { n.times { asdf.squeeze(' ') } }
  x.report('gsub1:')   { n.times { asdf.gsub(/ +/, ' ') } }
  x.report('gsub2:')   { n.times { asdf.gsub(/ {2,}/, ' ') } }
  x.report('gsub3:')   { n.times { asdf.gsub(/\s\s+/, ' ') } }
  x.report('gsub4:')   { n.times { asdf.gsub(/\s{2,}/, ' ') } }
end
puts
puts "long strings"
n = 1000
str_x = 1000
Benchmark.bm(8) do |x|
  x.report('squeeze:') { n.times { (asdf * str_x).squeeze(' ') } }
  x.report('gsub1:')   { n.times { (asdf * str_x).gsub(/ +/, ' ') } }
  x.report('gsub2:')   { n.times { (asdf * str_x).gsub(/ {2,}/, ' ') } }
  x.report('gsub3:')   { n.times { (asdf * str_x).gsub(/\s\s+/, ' ') } }
  x.report('gsub4:')   { n.times { (asdf * str_x).gsub(/\s{2,}/, ' ') } }
end
# >>               user     system      total        real
# >> squeeze:  1.050000   0.000000   1.050000 (  1.055833)
# >> gsub1:    3.700000   0.020000   3.720000 (  3.731957)
# >> gsub2:    3.960000   0.010000   3.970000 (  3.980328)
# >> gsub3:    4.520000   0.020000   4.540000 (  4.549919)
# >> gsub4:    4.840000   0.010000   4.850000 (  4.860474)
# >>
# >> long strings
# >>               user     system      total        real
# >> squeeze:  0.310000   0.180000   0.490000 (  0.485224)
# >> gsub1:    3.420000   0.130000   3.550000 (  3.554505)
# >> gsub2:    3.850000   0.120000   3.970000 (  3.974213)
# >> gsub3:    4.880000   0.130000   5.010000 (  5.015750)
# >> gsub4:    5.310000   0.150000   5.460000 (  5.461797)
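The tests are based on letting squeeze(' ') or gsub() strip the duplicated spaces. As I expected, squeeze(' ') blows away the regex: it is roughly 3.5 to 4.6 times faster on the short string and roughly 7 to 11 times faster on the long strings. Regexes using a literal space character are faster than the corresponding patterns using \s.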
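One caveat when comparing these: the patterns are only interchangeable when the input contains nothing but ordinary spaces. A small sketch of where they diverge follows; the sample string is invented for illustration and is not part of the benchmark.

```ruby
# Illustrative input: a run of spaces, a run of tabs, and a space/newline/space mix.
sample = "a  b\t\tc \n d"

only_spaces = sample.squeeze(' ')         # collapses runs of the space character only
space_regex = sample.gsub(/ +/, ' ')      # same effect as squeeze(' ') on this input
any_ws      = sample.gsub(/\s{2,}/, ' ')  # also replaces runs of tabs/newlines with one space

p only_spaces
p space_regex
p any_ws
```

squeeze(' ') and / +/ leave the tabs and the newline alone, while the \s pattern rewrites them into single spaces. If the data can contain tabs or newlines, that difference matters more than the timing.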
Of course, regexes are more flexible, but thinking about whether a regex is actually needed can make a big difference in the processing speed of your code.
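As a rough illustration of how far the non-regex methods stretch before a regex becomes necessary, here is a sketch using String#squeeze with a character set and String#tr_s. The input is made up, and none of these variants were timed in the benchmark above, so no speed claim is implied.

```ruby
# Illustrative input mixing runs of letters, spaces, and tabs.
messy = "aa  bb\t\tcc \t dd"

# Squeeze runs of spaces and runs of tabs (each run of an identical character).
spaces_and_tabs = messy.squeeze(" \t")

# Translate spaces and tabs to a space, then collapse each translated run to one.
to_single_space = messy.tr_s(" \t", ' ')

# With no argument, squeeze collapses every run of a repeated character.
all_repeats = messy.squeeze

p spaces_and_tabs
p to_single_space
p all_repeats
```

Once the rule depends on context (for example, "two or more whitespace characters of any kind, but only between words"), a regex is the better fit.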