I'm completely incapable of regular expressions, and so I need some help with a problem that I think would best be solved by using regular expressions.

I have a list of strings in C#:

using System.Collections.Generic;
using System.Text.RegularExpressions;

List<string> lstNames = new List<string>();
lstNames.Add("TRA-94:23");
lstNames.Add("TRA-42:101");
lstNames.Add("TRA-109:AD");

foreach (string n in lstNames) {
  // logic goes here that somehow uses regex to remove all special characters
  string regExp = "NO_IDEA";
  string tmp = Regex.Replace(n, regExp, "");
}

I need to be able to loop over the list and return each item without any special characters. For example, item one would be "TRA9423", item two would be "TRA42101" and item three would be "TRA109AD".

Is there a regular expression that can accomplish this for me?

Also, the list contains more than 4000 items, so I need the search and replace to be efficient and quick if possible.

Thanks in advance for any help that I receive on this.

EDIT: Sorry, I should have specified that any character besides a-z, A-Z and 0-9 is special in my circumstance.

+5  A: 

This should do it:

[^a-zA-Z0-9]

Basically it matches all non-alphanumeric characters.
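As a rough sketch of how this pattern could be dropped into the loop from the question: the example below assumes C# 9 or later (top-level statements), and the names `nonAlphanumeric` and `cleaned` are made up for illustration. Building the `Regex` once and reusing it for every item is one reasonable way to keep a few thousand replacements cheap, though it is not the only option.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// One Regex instance, constructed once and reused for every item.
// (The variable name is hypothetical, not from the original post.)
var nonAlphanumeric = new Regex("[^a-zA-Z0-9]");

var lstNames = new List<string> { "TRA-94:23", "TRA-42:101", "TRA-109:AD" };
var cleaned = new List<string>();

foreach (string n in lstNames)
{
    // Replace every character outside a-z, A-Z and 0-9 with nothing.
    cleaned.Add(nonAlphanumeric.Replace(n, ""));
}

Console.WriteLine(string.Join(", ", cleaned));
// prints: TRA9423, TRA42101, TRA109AD
```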

Daniel Egeberg
+1 for guessing the definition of special characters correctly!
Mark Byers
+3  A: 

It really depends on your definition of special characters. I find that a whitelist rather than a blacklist is the best approach in most situations:

tmp = Regex.Replace(n, "[^0-9a-zA-Z]+", "");

You should be careful with your current approach because the following two items will be converted to the same string and will therefore be indistinguishable:

"TRA-12:123"
"TRA-121:23"
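To make the collision concrete, here is a small sketch (assuming C# 9 or later for top-level statements; the variable names `a` and `b` are illustrative only) that runs both strings through the same replacement:

```csharp
using System;
using System.Text.RegularExpressions;

// Both inputs differ only in where the separators sit.
string a = Regex.Replace("TRA-12:123", "[^0-9a-zA-Z]+", "");
string b = Regex.Replace("TRA-121:23", "[^0-9a-zA-Z]+", "");

Console.WriteLine(a);      // TRA12123
Console.WriteLine(b);      // TRA12123
Console.WriteLine(a == b); // True
```

Once the separators are gone, nothing in the result records where they were.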
Mark Byers
The `+` quantifier is redundant. Without it, each character in a consecutive run of special characters is simply matched and removed one at a time, so the result is the same.
Daniel Egeberg
@Daniel Egeberg: It is an optimization in the case that there are multiple symbols in a row, but you are correct that it is probably not going to help if the example input is representative.
Mark Byers
@Daniel, I'd expect the `+` to make the operation considerably faster. Of course, it won't really matter unless you're processing something huge.
Paul Creasey
It doesn't matter that both items would end up being the same, because I'm doing a fuzzy match and I expect to have multiple items returned:

List<PdfAndXml> lstPax = lstReports.FindAll(delegate(PdfAndXml o) { return (o.Packed.Contains(findTxt)); });

Packed is the property where I'm using the regex to manipulate a certain string attribute of the PdfAndXml class.
Jagd
A: 

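[^a-zA-Z0-9] is a character class that matches any non-alphanumeric character.

Alternatively, [^\w\d] does almost the same thing. Note that \w already includes digits and the underscore, so this version leaves underscores in place.

Usage:

string regExp = @"[^\w\d]";
string tmp = Regex.Replace(n, regExp, "");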
MikeD
A: 

Depending on your definition of "special character", I think "[^a-zA-Z0-9]" would probably do the trick. That would find anything that is not a small letter, a capital letter, or a digit.

Jay
Oh, I notice a pattern developing in the answers.
Jay
Is the pattern regular?
MikeD
A: 
tmp = Regex.Replace(n, @"\W+", "");

\w matches letters, digits, and underscores; \W is the negated version.
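A short sketch of what that means in practice, assuming C# 9 or later (top-level statements). The input string with an underscore is a hypothetical value chosen for illustration, not one from the question:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input containing an underscore.
string input = "TRA_94-23";

// \W does not match the underscore, so it survives the replacement.
string viaNonWord = Regex.Replace(input, @"\W+", "");

// The explicit whitelist treats the underscore as special and removes it.
string viaWhitelist = Regex.Replace(input, "[^a-zA-Z0-9]+", "");

Console.WriteLine(viaNonWord);   // TRA_9423
Console.WriteLine(viaWhitelist); // TRA9423
```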

Paul Creasey
Since you define _ as special you should go with one of the other answers :)
Paul Creasey
A: 

You can use:

string regExp = "\\W";

This is equivalent to Daniel's "[^a-zA-Z0-9]"

\W matches any nonword character. Equivalent to the Unicode categories [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].
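Because the default \W is defined over Unicode categories, non-ASCII letters count as word characters too. The sketch below (C# 9 or later assumed; the input string is a made-up example) compares the default behavior with `RegexOptions.ECMAScript`, under which \w is restricted to [a-zA-Z_0-9]:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: "\u00C4" is the letter A with diaeresis (category Lu).
string input = "TR\u00C4-94:23";

// Default: the non-ASCII letter is a word character, so it is kept.
string unicodeAware = Regex.Replace(input, @"\W", "");

// ECMAScript mode: \W behaves like [^a-zA-Z_0-9], so the letter is removed.
string ecmaScript = Regex.Replace(input, @"\W", "", RegexOptions.ECMAScript);

// Explicit ASCII whitelist: the letter is removed as well.
string whitelist = Regex.Replace(input, "[^a-zA-Z0-9]", "");

Console.WriteLine(unicodeAware == "TR\u00C49423"); // True
Console.WriteLine(ecmaScript);                     // TR9423
Console.WriteLine(whitelist);                      // TR9423
```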

Dan Diplo
also matches _ so not quite perfect here.
Paul Creasey
Ummm, you're right - wouldn't have thought so from the description. Well spotted.
Dan Diplo