Hello,

I don't have much experience with RegEx so I am using many chained String.Replace() calls to remove unwanted characters -- is there a RegEx I can write to streamline this?

string messyText = GetText();
messyText = messyText.Trim().ToUpper().Replace(",", "").Replace(":", "").Replace(".", "").Replace(";", "").Replace("/", "").Replace("\\", "").Replace("\n", "").Replace("\t", "").Replace("\r", "").Replace(Environment.NewLine, "").Replace(" ", "");

Thanks

+2  A: 

Character classes to the rescue!

string messyText = GetText();
string cleanText = Regex.Replace(messyText.Trim().ToUpper(), @"[,:.;/\\\n\t\r ]+", "");
kevingessner
This is not equivalent to the code in the question.
quantumSoup
@quantumSoup: What'd I miss?
kevingessner
@kevingessner: use @"..." or your \t \r \n will get turned into their whitespace equivalents by .NET. Or escape them, but I think @ is more readable.
Dinah
@kevin Your code doesn't replace backslashes
quantumSoup
@quantumSoup, @Dinah: You don't need to escape them or use `@` - the whitespace equivalents will match just fine. Although generally it *is* a good idea to use verbatim strings with regexes. But not necessary here. And of course his code *does* replace backslashes (the only character he (correctly) *did* escape).
Tim Pietzcker
@Tim It does now, but not without the verbatim string.
quantumSoup
@Tim Pietzcker: good point; my mistake. @kevingessner: since you're replacing all of the spaces, you don't need `Trim()`. Also, space, \n, \t, and \r can be collectively replaced with \s as Rogue did.
Dinah
@Tim Yes: http://www.ideone.com/2cIBZ
quantumSoup
@quantumSoup: You're right, the `\\` wouldn't have worked in a non-verbatim string. It's late here, and I should be in bed...
Tim Pietzcker
@Tim If he didn't want to use verbatim strings, he could have used one more backslash, ie: `"[,:.;/\\\\n\t\r ]+"`, then we have 3 backslashes to match a single literal backslash, but that's just too damn ugly.
quantumSoup
@quantumSoup: He would even have needed five backslashes - the fifth one for `\n`...
Tim Pietzcker
@quantumSoup: good catch. It took me a second to figure it out. For anyone else who couldn't figure it out: you need the 2 backslashes to be passed to the regex engine so the engine recognizes them as an escaped backslash. In the original version, .NET interpreted `\\` as an escaped backslash (i.e. 1 backslash) before passing it to the regex engine. To use the non-@ version, 4 would have been needed just for the backslash match.
Dinah
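The escaping discussion above can be checked with a small sketch. The class name `EscapingDemo`, the helper `Clean`, and the sample input are made up for illustration; only `Regex.Replace` and the two patterns come from the thread. It assumes the goal is the same as in the question: strip the listed punctuation and whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapingDemo
{
    // Verbatim string: the regex engine receives [,:.;/\\\n\t\r ]+ exactly as typed.
    public const string VerbatimPattern = @"[,:.;/\\\n\t\r ]+";

    // Regular string: the C# compiler resolves escapes first, so the regex
    // engine only sees an escaped backslash if the source has four of them.
    // The fifth backslash belongs to \n, which C# turns into a newline character.
    public const string RegularPattern = "[,:.;/\\\\\n\t\r ]+";

    // Hypothetical helper: upper-case the input and strip everything the pattern matches.
    public static string Clean(string input, string pattern)
    {
        return Regex.Replace(input.ToUpper(), pattern, "");
    }

    public static void Main()
    {
        // Contains a backslash, comma, space, colon, tab, period, CR LF, semicolon and slash.
        string messy = "a\\b, c:\td.\r\ne;/f";
        Console.WriteLine(Clean(messy, VerbatimPattern)); // ABCDEF
        Console.WriteLine(Clean(messy, RegularPattern));  // ABCDEF
    }
}
```

Both patterns end up as the same character class once they reach the regex engine, which is why the verbatim form is usually the easier one to read.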
+5  A: 

Try this regex:

Regex regex = new Regex(@"[\s,:.;/\\]+");
string cleanText = regex.Replace(messyText, "").ToUpper();

\s is a shorthand character class that matches whitespace, roughly equivalent to [ \t\r\n].


If you just want to preserve alphanumeric characters, instead of adding every non-alphanumeric character in existence to the character class, you could do this:

Regex regex = new Regex(@"[\W_]+");
string cleanText = regex.Replace(messyText, "").ToUpper();

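Where \W is any non-word character (for ASCII text, anything not in [a-zA-Z0-9_]). The underscore counts as a word character, which is why it is listed separately in the class.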
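A small sketch of the second pattern, for anyone who wants to try it. The class name `WordCharDemo`, its methods, and the sample input are invented for illustration. One thing worth knowing: by default .NET treats `\w` as Unicode-aware, so accented letters survive; `RegexOptions.ECMAScript` restricts it to `[a-zA-Z0-9_]`. `ToUpperInvariant()` is used here instead of `ToUpper()` only to keep the output independent of the machine's culture.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordCharDemo
{
    // Default behavior: \W is Unicode-aware, so letters such as é are kept.
    public static string KeepWordChars(string input)
    {
        return Regex.Replace(input, @"[\W_]+", "").ToUpperInvariant();
    }

    // ECMAScript mode: \W means anything outside [a-zA-Z0-9_], so é is removed.
    public static string KeepAsciiWordChars(string input)
    {
        return Regex.Replace(input, @"[\W_]+", "", RegexOptions.ECMAScript).ToUpperInvariant();
    }

    public static void Main()
    {
        string messy = "Caf\u00E9_au-lait 42!";
        Console.WriteLine(KeepWordChars(messy));      // CAFÉAULAIT42
        Console.WriteLine(KeepAsciiWordChars(messy)); // CAFAULAIT42
    }
}
```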

Rogue
`\s` also contains `\v` and `\f`, but those aren't that commonly used, so it shouldn't be a problem.
Tim Pietzcker
do you need RegexOptions.Multiline or will your regex handle it?
Preet Sangha
@Preet I believe `RegexOptions.Multiline` only affects the behavior of start and end of string anchors `^` and `$`, but I could be wrong.
Rogue
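A quick way to check the point about `RegexOptions.Multiline` is to run the pattern with and without it. This is only a sketch; the class name `MultilineDemo`, its methods, and the sample strings are made up. It suggests the option changes where `^` can match but has no effect on a plain character class.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MultilineDemo
{
    // Strips the same characters as the answer's pattern, with caller-chosen options.
    public static string Clean(string input, RegexOptions options)
    {
        return Regex.Replace(input, @"[\s,:.;/\\]+", "", options);
    }

    // Counts how many times a word character appears at a position matched by ^.
    public static int CountLineStarts(string input, RegexOptions options)
    {
        return Regex.Matches(input, @"^\w", options).Count;
    }

    public static void Main()
    {
        string messy = "one, two\nthree; four";
        Console.WriteLine(Clean(messy, RegexOptions.None));      // onetwothreefour
        Console.WriteLine(Clean(messy, RegexOptions.Multiline)); // onetwothreefour

        // The anchor is where the option makes a difference.
        Console.WriteLine(CountLineStarts("a\nb", RegexOptions.None));      // 1
        Console.WriteLine(CountLineStarts("a\nb", RegexOptions.Multiline)); // 2
    }
}
```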
turns out the second option is what I'm *really* looking for
erash
A: 

You would probably want to use a whitelist approach: there is an ocean of odd characters whose effects, depending on how they are combined, may not be easy to predict.

A simple regex that removes everything but the allowed characters could look like this:

messyText = Regex.Replace(messyText, @"[^a-zA-Z0-9\x7C\x2C\x2E_]", "");

The ^ is there to invert the selection. Apart from the alphanumeric characters, this regex allows | , . and _ (the first three written as the hex escapes \x7C, \x2C and \x2E). You can add and remove characters and character sets as needed.
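As a rough illustration of the whitelist idea, the sketch below runs the pattern above next to a variant that spells the allowed punctuation literally. The class name `WhitelistDemo`, the constant names, and the sample input are assumptions made for the example; inside a character class `|`, `,` and `.` need no escaping, so the two patterns should behave the same.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitelistDemo
{
    // Pattern from the answer: hex escapes for | , and .
    public const string HexPattern = @"[^a-zA-Z0-9\x7C\x2C\x2E_]";

    // Same whitelist with the punctuation written literally.
    public const string LiteralPattern = @"[^a-zA-Z0-9|,._]";

    // Removes every character that is not on the whitelist.
    public static string KeepAllowed(string input, string pattern)
    {
        return Regex.Replace(input, pattern, "");
    }

    public static void Main()
    {
        string messy = "a|b, c.d_e!f#";
        Console.WriteLine(KeepAllowed(messy, HexPattern));     // a|b,c.d_ef
        Console.WriteLine(KeepAllowed(messy, LiteralPattern)); // a|b,c.d_ef
    }
}
```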

eBusiness