ansaurus

Question

Answer 1

+9 A:

You're close. After trimming whitespace off the left and right,

str.strip.gsub(/\s{2,}/, ' ')

replace any sets of multiple spaces with a single space. This, of course, assumes you're only dealing with actual spaces.
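As a rough sketch of what the one-liner does, here it is wrapped in a helper. The `normalize` name and the sample string are made up for illustration and are not part of the original question.

```ruby
# Hypothetical helper wrapping the one-liner from the answer.
def normalize(str)
  # strip removes leading and trailing whitespace; gsub then collapses each
  # run of two or more whitespace characters into a single space.
  str.strip.gsub(/\s{2,}/, ' ')
end

sample = "   too    many   spaces  here   "
puts normalize(sample)
```

A string that already has single spaces between words passes through unchanged, because `{2,}` never matches a lone space.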

Matchu 2010-07-08 15:49:05

Oh man! :D so close!!!

Trip 2010-07-08 15:50:23

This answer works. It is clearly superior to the trivial `s.gsub(/\s+/, ' ')`.

Steven Xu 2010-07-08 15:51:28

@Steven Xu - though replacing a single space with a single space can be fun, too!

Matchu 2010-07-08 15:53:13

I don't understand why this is better than using `/\s+/`.

MikeD 2010-07-08 15:56:51

@MikeD - not an expert on regex internals, but replacing a space with a space seems like a waste of the computer's time, since that regex will match single spaces, as well. Since `{2,}` isn't exactly that much more complicated, and much better specifies what you're trying to do while avoiding frivolous replacements, it seems like a better idea overall.
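To make the "frivolous replacements" point concrete, here is a small sketch with an invented sample string. It only counts matches and makes no claim about which pattern is faster.

```ruby
text = "one two   three    four"

# On space-only input both patterns produce the same string.
plus_result  = text.gsub(/\s+/, ' ')
range_result = text.gsub(/\s{2,}/, ' ')

# scan returns every match, so its length equals the number of
# replacements gsub performs with the same pattern.
plus_matches  = text.scan(/\s+/).length
range_matches = text.scan(/\s{2,}/).length

puts plus_result == range_result
puts "\\s+ matches: #{plus_matches}, \\s{2,} matches: #{range_matches}"
```

Here `\s+` matches three times (including the lone space after "one"), while `\s{2,}` matches only the two real runs.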

Matchu 2010-07-08 16:07:45

"This, of course, assumes you're only dealing with actual spaces." -- A `\s` will match anything considered whitespace, including tabs/newlines/etc. If the OP is dealing with multiline content and doesn't want that then either just use a literal space, or `(?!\n)\s` to do whitespace without newlines. (With the `{2,}` or `+` or whatever added on - the `\s+` to ` ` method will replace single tabs with single space, so not a waste of time if that's desired.)

Peter Boughton 2010-07-08 17:28:40

@Peter Boughton - it's true. If I were doing it from scratch, I probably would have replaced the `\s` with ` `, but since he started with `\s`, I followed suit. If he's only working with spaces, this works fine, but the formatting switch would probably be wise.
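The variants discussed in these comments behave differently once tabs or newlines are involved. A small sketch, using an invented sample string, shows the differences:

```ruby
text = "a\tb  c\n\nd"

# Runs of two or more of any whitespace; the lone tab is left alone.
any_ws_runs = text.gsub(/\s{2,}/, ' ')
# One or more of any whitespace; the lone tab becomes a space too.
any_ws      = text.gsub(/\s+/, ' ')
# Literal spaces only; tabs and newlines are untouched.
spaces_only = text.gsub(/ {2,}/, ' ')
# Whitespace other than newlines, via the negative lookahead (?!\n).
no_newlines = text.gsub(/(?:(?!\n)\s)+/, ' ')

p any_ws_runs
p any_ws
p spaces_only
p no_newlines
```

Which one is right depends on whether tabs and line breaks in the input should survive.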

Matchu 2010-07-08 20:09:25

Answer 2

+2 A:

Back when I was writing Perl code all the time, I used to reach for regular expressions for all my string manipulation. Then one day, while writing code that searched and parsed string content, I wrote a benchmark comparing some regexes against standard index-based string searches. The index-based search blew away the regex. It isn't as sophisticated, but we don't always need sophistication for simple problems.

Rather than instantly grabbing a regex, note that String#squeeze(' ') can compress the repeated spaces a lot faster. Consider the output of this benchmark:

#!/usr/bin/env ruby

require 'benchmark'

asdf = 'There                   is             a lot           of           white space.'

asdf.squeeze(' ') # => "There is a lot of white space."
asdf.gsub(/  +/, ' ') # => "There is a lot of white space."
asdf.gsub(/ {2,}/, ' ') # => "There is a lot of white space."
asdf.gsub(/\s\s+/, ' ') # => "There is a lot of white space."
asdf.gsub(/\s{2,}/, ' ') # => "There is a lot of white space."

n = 500000
Benchmark.bm(8) do |x|
  x.report('squeeze:') { n.times{ asdf.squeeze(' ') } }
  x.report('gsub1:') { n.times{ asdf.gsub(/  +/, ' ') } }
  x.report('gsub2:') { n.times{ asdf.gsub(/ {2,}/, ' ') } }
  x.report('gsub3:') { n.times{ asdf.gsub(/\s\s+/, ' ') } }
  x.report('gsub4:') { n.times{ asdf.gsub(/\s{2,}/, ' ') } }
end

puts
puts "long strings"
n     = 1000
str_x = 1000
Benchmark.bm(8) do |x|
  x.report('squeeze:') { n.times{(asdf * str_x).squeeze(' ') }}
  x.report('gsub1:') { n.times{(asdf * str_x).gsub(/  +/, ' ') }}
  x.report('gsub2:') { n.times{(asdf * str_x).gsub(/ {2,}/, ' ') }}
  x.report('gsub3:') { n.times{(asdf * str_x).gsub(/\s\s+/, ' ') }}
  x.report('gsub4:') { n.times{(asdf * str_x).gsub(/\s{2,}/, ' ') }}
end
# >>               user     system      total        real
# >> squeeze:  1.050000   0.000000   1.050000 (  1.055833)
# >> gsub1:    3.700000   0.020000   3.720000 (  3.731957)
# >> gsub2:    3.960000   0.010000   3.970000 (  3.980328)
# >> gsub3:    4.520000   0.020000   4.540000 (  4.549919)
# >> gsub4:    4.840000   0.010000   4.850000 (  4.860474)
# >> 
# >> long strings
# >>               user     system      total        real
# >> squeeze:  0.310000   0.180000   0.490000 (  0.485224)
# >> gsub1:    3.420000   0.130000   3.550000 (  3.554505)
# >> gsub2:    3.850000   0.120000   3.970000 (  3.974213)
# >> gsub3:    4.880000   0.130000   5.010000 (  5.015750)
# >> gsub4:    5.310000   0.150000   5.460000 (  5.461797)

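The tests time how long squeeze(' ') and gsub() take to strip the duplicated spaces. As I expected, squeeze(' ') blows away the regex: in the run above it is roughly 3.5 to 4.6 times faster on the short string and roughly 7 to 11 times faster on the long strings. Regexes using a literal space character are also faster than the equivalent patterns using \s.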

Of course, regexes are more flexible, but thinking about whether a regex is actually needed can make a big difference in the processing speed of your code.
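One caveat worth sketching: squeeze is not a drop-in replacement for every regex above. The sample string below is invented for illustration.

```ruby
text = "aaa   bbb\t\tccc  "

# With ' ' as the argument, only runs of spaces are collapsed.
only_spaces = text.squeeze(' ')
# With no argument, runs of ANY repeated character are collapsed.
everything  = text.squeeze
# With a character set, each listed character is squeezed on its own;
# a mixed run such as " \t " has no adjacent duplicates and is left as-is.
ws_set      = text.squeeze(" \t")
mixed       = "a \t b".squeeze(" \t")
# squeeze never trims, so pair it with strip if the edges matter.
trimmed     = text.strip.squeeze(' ')

p only_spaces
p everything
p ws_set
p mixed
p trimmed
```

So squeeze(' ') fits the plain-spaces case well, while input with tabs or newlines that must become spaces still calls for gsub or an extra step.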

Greg 2010-07-09 07:50:11


Regex pop quiz of the day :D
