tags:

views:

261

answers:

4

Consider an algorithm that needs to determine if a string contains any characters outside the whitelisted characters.

The whitelist looks like this:

'-.abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÖÜáíóúñÑÀÁÂÃÈÊËÌÍÎÏÐÒÓÔÕØÙÚÛÝßãðõøýþÿ

Note: spaces and apostrophes are needed to be included in this whitelist.

Typically this will be a static method, but it will be converted to an extension method.

private bool ContainsAllWhitelistedCharacters(string input)
{
  string regExPattern="";// the whitelist
  return Regex.IsMatch(input, regExPattern);
}

Considerations:

Thanks for the performance comments to all the answerers. Performance is not an issue. Quality, readability and maintainability is! Less code = less chance for defects, IMO.

Question:

What should this whitelist regex pattern be?

+1  A: 

You could pattern match using the following:

^([\-\.a-zA-Z ÇüéâäàåçêëèïîíìÄÅÉæÆôöòûùÖÜáíóúñÑÀÁÂÃÈÊËÌÍÎÏÐÒÓÔÕØÙÚÛÝßãðõøýþÿ]+)$

Make it an extension method with:

public static bool IsValidCustom(this string value)
{
    string regExPattern="^([\-\.a-zA-Z ÇüéâäàåçêëèïîíìÄÅÉæÆôöòûùÖÜáíóúñÑÀÁÂÃÈÊËÌÍÎÏÐÒÓÔÕØÙÚÛÝßãðõøýþÿ]+)$";
    return Regex.IsMatch(input, regExPattern);
}

I can't think of an easy way to do a maintainable range with extended characters since the order of the characters is not obvious.

Kelsey
Attention with charsets.
Havenard
I think the above pattern will require a slight tweak to allow blank values.Also, consider using the built-in character class "\w" instead of trying to enumerate the alphabet of all supported languages. That will bring in underscores and digits, which would then require a second RegEx to exclude, but would save a lot of potential maintenance down the road trying to re-invent the character class.
richardtallent
I have tested it and it works for spaces. It's a hardcoded list of values, the space is included after the Z. THe only ranges are the characters from a-z and A-Z
Kelsey
+4  A: 

Why does it have to be a regex?

private bool ContainsAllWhitelistedCharacters(string input)
{
  string whitelist = "abcdefg...";
  foreach (char c in input) {
    if (whitelist.IndexOf(c) == -1)
      return false;
  }
  return true;
}

No need to jump straight into regexes if you aren't sure how to implement the one you need and you haven't profiled that section of code and found out you need the extra performance.

Mark Rushakoff
Umm... this doesn't work so why all the upvotes? It will not catch if there is invalid characters in the string.
Kelsey
Thanks Mark. Unfortunately this suggestion `IndexOfAny` doesn't return filter out un-whitelisted characters like '_' or '5'.
p.campbell
Yeah, I just realized that and took it out -- that really shouldn't have gotten any upvotes. No way anybody who upvoted it even read it :X
Mark Rushakoff
A: 

I don't know how the regex backend is implemented, but it might be the most efficient to use the following to match for anything besides your list:

private bool ContainsAllWhitelistedCharacters(string input)
{
   Regex r = new Regex("[^ your list of chars ]");
   return !r.IsMatch(test)
}
Mark Synowiec
Double negatives aren't not confusing.
Thomas G. Mayfield
I agree, I suggested doing it this way *assuming* it's a more efficient regex in c#. Does anyone know if this is true or not? I'm interested in the answer.
Mark Synowiec
A: 

Note that I do not recommend this unless performance is really a problem but I thought I would point out that, even including precompiling the regex, you can do quite a bit faster:

compare:

static readonly Regex r = new Regex(
  @"^(['\-\.a-zA-Z ÇüéâäàåçêëèïîíìÄÅÉæÆôöòûùÖÜáíóúñÑ"+
   "ÀÁÂÃÈÊËÌÍÎÏÐÒÓÔÕØÙÚÛÝßãðõøýþÿ]+)$");

public bool IsValidCustom(string value)
{
  return r.IsMatch(value);
}

with:

private bool ContainsAllWhitelistedCharacters(string input)
{
    foreach (var c in input)
    {
        switch (c)
        {
            case '\u0020': continue; 
            case '\u0027': continue; 
            case '\u002D': continue; 
            case '\u002E': continue; 
            case '\u0041': continue; 
            case '\u0042': continue; 
            case '\u0043': continue; 
            case '\u0044': continue; 
            case '\u0045': continue; 
            case '\u0046': continue; 
            case '\u0047': continue; 
            case '\u0048': continue; 
            case '\u0049': continue; 
            case '\u004A': continue; 
            case '\u004B': continue; 
            case '\u004C': continue; 
            case '\u004D': continue; 
            case '\u004E': continue; 
            case '\u004F': continue; 
            case '\u0050': continue; 
            case '\u0051': continue; 
            case '\u0052': continue; 
            case '\u0053': continue; 
            case '\u0054': continue; 
            case '\u0055': continue; 
            case '\u0056': continue; 
            case '\u0057': continue; 
            case '\u0058': continue; 
            case '\u0059': continue; 
            case '\u005A': continue; 
            case '\u0061': continue; 
            case '\u0062': continue; 
            case '\u0063': continue; 
            case '\u0064': continue; 
            case '\u0065': continue; 
            case '\u0066': continue; 
            case '\u0067': continue; 
            case '\u0068': continue; 
            case '\u0069': continue; 
            case '\u006A': continue; 
            case '\u006B': continue; 
            case '\u006C': continue; 
            case '\u006D': continue; 
            case '\u006E': continue; 
            case '\u006F': continue; 
            case '\u0070': continue; 
            case '\u0071': continue; 
            case '\u0072': continue; 
            case '\u0073': continue; 
            case '\u0074': continue; 
            case '\u0075': continue; 
            case '\u0076': continue; 
            case '\u0077': continue; 
            case '\u0078': continue; 
            case '\u0079': continue; 
            case '\u007A': continue; 
            case '\u00C0': continue; 
            case '\u00C1': continue; 
            case '\u00C2': continue; 
            case '\u00C3': continue; 
            case '\u00C4': continue; 
            case '\u00C5': continue; 
            case '\u00C6': continue; 
            case '\u00C7': continue; 
            case '\u00C8': continue; 
            case '\u00C9': continue; 
            case '\u00CA': continue; 
            case '\u00CB': continue; 
            case '\u00CC': continue; 
            case '\u00CD': continue; 
            case '\u00CE': continue; 
            case '\u00CF': continue; 
            case '\u00D0': continue; 
            case '\u00D1': continue; 
            case '\u00D2': continue; 
            case '\u00D3': continue; 
            case '\u00D4': continue; 
            case '\u00D5': continue; 
            case '\u00D6': continue; 
            case '\u00D8': continue; 
            case '\u00D9': continue; 
            case '\u00DA': continue; 
            case '\u00DB': continue; 
            case '\u00DC': continue; 
            case '\u00DD': continue; 
            case '\u00DF': continue; 
            case '\u00E0': continue; 
            case '\u00E1': continue; 
            case '\u00E2': continue; 
            case '\u00E3': continue; 
            case '\u00E4': continue; 
            case '\u00E5': continue; 
            case '\u00E6': continue; 
            case '\u00E7': continue; 
            case '\u00E8': continue; 
            case '\u00E9': continue; 
            case '\u00EA': continue; 
            case '\u00EB': continue; 
            case '\u00EC': continue; 
            case '\u00ED': continue; 
            case '\u00EE': continue; 
            case '\u00EF': continue; 
            case '\u00F0': continue; 
            case '\u00F1': continue; 
            case '\u00F2': continue; 
            case '\u00F3': continue; 
            case '\u00F4': continue; 
            case '\u00F5': continue; 
            case '\u00F6': continue; 
            case '\u00F8': continue; 
            case '\u00F9': continue; 
            case '\u00FA': continue; 
            case '\u00FB': continue; 
            case '\u00FC': continue; 
            case '\u00FD': continue; 
            case '\u00FE': continue; 
            case '\u00FF': continue;        
        }
        return false;     
    }    return true; // empty string is true    
}

In very quick testing on a corpus of words with about 60% pass rate I get about a factor of 8 speed up with this approach.

It's not actually that much less readable than the regex without the escape characters either!

ShuggyCoUk