I'm not sure if "token replace" is the right phrase, but here is what I'm trying to do:

In a string, if I find two or more consecutive whitespace characters (\s: spaces, new lines, tabs, etc.), I want to replace the run with a single instance of the character it matched.

Example:

a   b   b 

would become

a b b

and:

a


b


c

Would become:

a

b

c

Can this be done using .NET regex?

A: 

Yes it can. Use System.Text.RegularExpressions.Regex.Replace:

using System.Text.RegularExpressions;

string str = "a   b   b";
// Matches runs of one or more spaces only; tabs and new lines are left untouched.
Regex rexReplace = new Regex(" +");
str = rexReplace.Replace(str, " ");
Reinderien
A: 

For posterity, my solution from this question:

Regex 
    regex_select_all_multiple_whitespace_chars = 
        new Regex(@"\s+",RegexOptions.Compiled);

var cleanString=
    regex_select_all_multiple_whitespace_chars.Replace(dirtyString.Trim(), " ");

Regex is NOT the best way to do this. Brute force methods seem to be much faster. Take a read of the link above...
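
As one possible non-regex take on the same clean-up (every run of whitespace becomes a single space, with the ends trimmed), here is a minimal sketch using String.Split and String.Join. It is an illustration only, not necessarily the approach described in the linked question, and it is not benchmarked here. WhitespaceNormalizer and CollapseToSingleSpaces are made-up names.

```csharp
using System;

public static class WhitespaceNormalizer
{
    // Hypothetical helper: collapses every run of whitespace into one space
    // and drops leading/trailing whitespace, without using Regex.
    public static string CollapseToSingleSpaces(string dirtyString)
    {
        // Passing a null separator array makes Split treat all whitespace
        // characters as delimiters; RemoveEmptyEntries discards the empty
        // strings produced between consecutive delimiters.
        string[] words = dirtyString.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }

    public static void Main()
    {
        Console.WriteLine(CollapseToSingleSpaces("  a \t b\r\n\r\nc  ")); // prints "a b c"
    }
}
```

Like the regex above, this turns tabs and new lines into spaces, so it only fits when that is acceptable.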

spender
I think you need a longer variable name.
Callum Rogers
The # of Regex declarations in the class this code came from makes the name useful. Regular expressions are pretty opaque, so we chose to eliminate some of the opaqueness via explicitly descriptive variable names. I'd usually avoid this, but the language within a language thing occasionally makes this approach handy.
spender
A: 
string str = "a  b  c       a\r\n\r\nb\r\n\r\nc";

// \u0020 is the space character
string newstr = Regex.Replace(str, "(\u0020)+", " ");

newstr = Regex.Replace(newstr, "(\t)+", "\t");

newstr = Regex.Replace(newstr, "(\r\n)+", "\r\n");
JohnB
Wouldn't this replace a carriage return with a space? If you look at my examples, this is not what I'm looking for.
Abe Miessler
Sorry, ok I fixed it.
JohnB
A: 

You'll need to use this if you want to correctly replace double new-lines as well as spaces:

string input = @"a


b


c  d  e";

string result = Regex.Replace(input, @"(\r\n|\s)\1", "$1");

The \1 will look for the character(s) matched by the group (\r\n|\s), and the $1 in the replacement string will replace the match with just a single instance of the group.
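
To see what the backreference does, here is a minimal sketch built around the same pattern. The class and method names (BackreferenceDemo, CollapsePairs) are made up for illustration. It shows that only a repeat of the same token is collapsed, so a space followed by a tab is left alone.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // Hypothetical helper: replaces two identical, adjacent whitespace
    // tokens (a "\r\n" pair counts as one token) with a single copy.
    public static string CollapsePairs(string input)
    {
        // (\r\n|\s) captures one token; \1 requires the same text again.
        return Regex.Replace(input, @"(\r\n|\s)\1", "$1");
    }

    public static void Main()
    {
        Console.WriteLine(CollapsePairs("c  d"));                 // prints "c d"
        Console.WriteLine(CollapsePairs("a \tb") == "a \tb");     // prints "True": space + tab is not a repeat
        Console.WriteLine(CollapsePairs("a\r\n\r\nb") == "a\r\nb"); // prints "True"
    }
}
```

Because \1 has no quantifier here, each match consumes exactly two copies of the token, so a run of three spaces comes out as two.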

If you want to replace more than one duplicate (e.g. 3 in a row) with a single instance, you'll need to use @"(\r\n|\s)\1+" as the pattern, but a side effect of this will be:

a


b


c

will be reduced to:

a
b
c
Bennor McCarthy
+1: Finally, good answer. Captures group then uses it again and deals with annoying `\r\n` on Windows
Callum Rogers
A: 

This is possible with a regex, but it gets a bit unwieldy after adding more than a few choices. Here's a sample of the regex which handles only spaces and tabs.

public static string ShrinkWhitespace(string input)
{
    // Each alternative captures one character as "t" and then matches its repeats.
    return Regex.Replace(input, @"(((?<t> ) +)|((?<t>\t)\t+))", "${t}");
}

I find tasks like this are much easier to follow and maintain if they are instead coded as simple methods. For example:

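// Requires: using System.Text;
// Note: "\r\n" pairs are not collapsed, because '\r' is not handled in the switch.
public string ShrinkWhitespace(string input) {
  var builder = new StringBuilder();
  var i = 0;
  while ( i < input.Length ) {
    var current = input[i];
    builder.Append(current);  // always keep the first character of a run
    switch ( current ) {
      case '\t':
      case ' ':
      case '\n':
        // Skip any further repeats of the same whitespace character.
        i++;
        while ( i < input.Length && input[i] == current ) {
          i++;
        }
        break;
      default:
        i++;
        break;
    }
  }

  return builder.ToString();
}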
JaredPar