ansaurus

Question

Answer 1

+1 A:

You could pattern match using the following:

^([\-\.a-zA-Z ÇüéâäàåçêëèïîíìÄÅÉæÆôöòûùÖÜáíóúñÑÀÁÂÃÈÊËÌÍÎÏÐÒÓÔÕØÙÚÛÝßãðõøýþÿ]+)$

Make it an extension method with:

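// Extension methods must live in a static class; needs "using System.Text.RegularExpressions;".
public static bool IsValidCustom(this string value)
{
    // Verbatim string (@"...") so that \- and \. reach the regex engine unchanged.
    string regExPattern = @"^([\-\.a-zA-Z ÇüéâäàåçêëèïîíìÄÅÉæÆôöòûùÖÜáíóúñÑÀÁÂÃÈÊËÌÍÎÏÐÒÓÔÕØÙÚÛÝßãðõøýþÿ]+)$";
    return Regex.IsMatch(value, regExPattern);
}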

I can't think of an easy way to do a maintainable range with extended characters since the order of the characters is not obvious.

Kelsey 2009-08-20 16:18:54

Be careful with charsets.

Havenard 2009-08-20 16:20:39

I think the above pattern will require a slight tweak to allow blank values. Also, consider using the built-in character class "\w" instead of trying to enumerate the alphabet of every supported language. That will bring in underscores and digits, which you would then need a second regex to exclude, but it would save a lot of potential maintenance down the road trying to reinvent the character class.

richardtallent 2009-08-20 16:31:53

I have tested it and it works for spaces. It's a hardcoded list of values; the space is included after the Z. The only ranges are the characters a-z and A-Z.

Kelsey 2009-08-20 16:35:04
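
The "\w" suggestion in the comments above could be sketched roughly as follows. This is only an illustration, not the answer's method: the class name `NameValidator` and both method names are hypothetical, `\p{L}` accepts every Unicode letter (a much wider set than the hardcoded list), and `\w` minus digits and underscore still lets through a few non-letter characters such as combining marks. The `*` quantifier is what allows a blank value to pass.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; none of these names come from the original answer.
public static class NameValidator
{
    // Any Unicode letter, plus space, dot and hyphen. '*' also accepts "".
    private static readonly Regex LettersOnly =
        new Regex(@"^[\p{L} .\-]*$");

    // \w with digits and underscore removed via .NET character class
    // subtraction, alternated with space, dot and hyphen.
    private static readonly Regex WordMinusDigits =
        new Regex(@"^(?:[\w-[\d_]]|[ .\-])*$");

    public static bool IsLettersOnly(string value)
    {
        return LettersOnly.IsMatch(value);
    }

    public static bool IsWordMinusDigits(string value)
    {
        return WordMinusDigits.IsMatch(value);
    }
}
```

With this sketch, `NameValidator.IsLettersOnly("abc_5")` is rejected because of the underscore and the digit, while an empty string is accepted by both methods.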

Answer 2

+4 A:

Why does it have to be a regex?

private bool ContainsAllWhitelistedCharacters(string input)
{
  string whitelist = "abcdefg...";
  foreach (char c in input) {
    if (whitelist.IndexOf(c) == -1)
      return false;
  }
  return true;
}

No need to jump straight into regexes if you aren't sure how to implement the one you need and you haven't profiled that section of code and found out you need the extra performance.

Mark Rushakoff 2009-08-20 16:22:21

Umm... this doesn't work, so why all the upvotes? It will not catch invalid characters in the string.

Kelsey 2009-08-20 16:27:42

Thanks Mark. Unfortunately, the suggested `IndexOfAny` doesn't filter out un-whitelisted characters like '_' or '5'.

p.campbell 2009-08-20 16:28:35

Yeah, I just realized that and took it out -- that really shouldn't have gotten any upvotes. No way anybody who upvoted it even read it :X

Mark Rushakoff 2009-08-20 16:30:32
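
For a longer whitelist, the same loop idea could be sketched with a `HashSet<char>`, so each lookup does not rescan the whitelist string. This is an assumption-laden variant rather than the answer's code: the class name `WhitelistChecker` is hypothetical, and the whitelist below is a short placeholder, not the real character list.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; the whitelist is a placeholder for illustration only.
public static class WhitelistChecker
{
    private static readonly HashSet<char> Whitelist = new HashSet<char>(
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ -.");

    public static bool ContainsAllWhitelistedCharacters(string input)
    {
        // All() returns true for an empty string, like the foreach loop above.
        return input.All(c => Whitelist.Contains(c));
    }
}
```

Whether the set lookup is actually faster than `IndexOf` for a list this size would need to be measured.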

Answer 3

A:

I don't know how the regex backend is implemented, but it might be most efficient to match anything outside your list, as follows:

private bool ContainsAllWhitelistedCharacters(string input)
{
   // The negated class matches any single character that is NOT in the list.
   Regex r = new Regex("[^ your list of chars ]");
   return !r.IsMatch(input);
}

Mark Synowiec 2009-08-20 16:30:55

Double negatives aren't not confusing.

Thomas G. Mayfield 2009-08-20 16:49:51

I agree. I suggested doing it this way *assuming* it's a more efficient regex in C#. Does anyone know whether this is true? I'm interested in the answer.

Mark Synowiec 2009-08-20 17:32:06
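
The two regex forms discussed in this thread are not quite interchangeable, which a small side-by-side sketch makes visible. The class name `WhitelistRegexForms` is hypothetical and the character list is shortened for illustration. This does not settle the efficiency question; only a measurement on real input would.

```csharp
using System.Text.RegularExpressions;

// Hypothetical comparison helper with a shortened, illustrative character list.
public static class WhitelistRegexForms
{
    // Anchored positive form: one or more whitelisted characters.
    private static readonly Regex Positive = new Regex(@"^[a-zA-Z .\-]+$");

    // Negated form: finds any single character outside the whitelist.
    private static readonly Regex Negated = new Regex(@"[^a-zA-Z .\-]");

    public static bool IsValidPositive(string input)
    {
        return Positive.IsMatch(input);
    }

    public static bool IsValidNegated(string input)
    {
        return !Negated.IsMatch(input);
    }
}
```

The forms differ at the edges. An empty string fails the positive form (because of `+`) but passes the negated one. A trailing newline passes the positive form, because `$` in .NET also matches just before a final `\n`, but fails the negated form; using `\z` instead of `$` closes that gap.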

Answer 4

A:

Note that I do not recommend this unless performance is really a problem, but I thought I would point out that, even allowing for precompiling the regex, you can go quite a bit faster:

compare:

static readonly Regex r = new Regex(
  @"^(['\-\.a-zA-Z ÇüéâäàåçêëèïîíìÄÅÉæÆôöòûùÖÜáíóúñÑ"+
   "ÀÁÂÃÈÊËÌÍÎÏÐÒÓÔÕØÙÚÛÝßãðõøýþÿ]+)$");

public bool IsValidCustom(string value)
{
  return r.IsMatch(value);
}

with:

private bool ContainsAllWhitelistedCharacters(string input)
{
    foreach (var c in input)
    {
        switch (c)
        {
            case '\u0020': continue; 
            case '\u0027': continue; 
            case '\u002D': continue; 
            case '\u002E': continue; 
            case '\u0041': continue; 
            case '\u0042': continue; 
            case '\u0043': continue; 
            case '\u0044': continue; 
            case '\u0045': continue; 
            case '\u0046': continue; 
            case '\u0047': continue; 
            case '\u0048': continue; 
            case '\u0049': continue; 
            case '\u004A': continue; 
            case '\u004B': continue; 
            case '\u004C': continue; 
            case '\u004D': continue; 
            case '\u004E': continue; 
            case '\u004F': continue; 
            case '\u0050': continue; 
            case '\u0051': continue; 
            case '\u0052': continue; 
            case '\u0053': continue; 
            case '\u0054': continue; 
            case '\u0055': continue; 
            case '\u0056': continue; 
            case '\u0057': continue; 
            case '\u0058': continue; 
            case '\u0059': continue; 
            case '\u005A': continue; 
            case '\u0061': continue; 
            case '\u0062': continue; 
            case '\u0063': continue; 
            case '\u0064': continue; 
            case '\u0065': continue; 
            case '\u0066': continue; 
            case '\u0067': continue; 
            case '\u0068': continue; 
            case '\u0069': continue; 
            case '\u006A': continue; 
            case '\u006B': continue; 
            case '\u006C': continue; 
            case '\u006D': continue; 
            case '\u006E': continue; 
            case '\u006F': continue; 
            case '\u0070': continue; 
            case '\u0071': continue; 
            case '\u0072': continue; 
            case '\u0073': continue; 
            case '\u0074': continue; 
            case '\u0075': continue; 
            case '\u0076': continue; 
            case '\u0077': continue; 
            case '\u0078': continue; 
            case '\u0079': continue; 
            case '\u007A': continue; 
            case '\u00C0': continue; 
            case '\u00C1': continue; 
            case '\u00C2': continue; 
            case '\u00C3': continue; 
            case '\u00C4': continue; 
            case '\u00C5': continue; 
            case '\u00C6': continue; 
            case '\u00C7': continue; 
            case '\u00C8': continue; 
            case '\u00C9': continue; 
            case '\u00CA': continue; 
            case '\u00CB': continue; 
            case '\u00CC': continue; 
            case '\u00CD': continue; 
            case '\u00CE': continue; 
            case '\u00CF': continue; 
            case '\u00D0': continue; 
            case '\u00D1': continue; 
            case '\u00D2': continue; 
            case '\u00D3': continue; 
            case '\u00D4': continue; 
            case '\u00D5': continue; 
            case '\u00D6': continue; 
            case '\u00D8': continue; 
            case '\u00D9': continue; 
            case '\u00DA': continue; 
            case '\u00DB': continue; 
            case '\u00DC': continue; 
            case '\u00DD': continue; 
            case '\u00DF': continue; 
            case '\u00E0': continue; 
            case '\u00E1': continue; 
            case '\u00E2': continue; 
            case '\u00E3': continue; 
            case '\u00E4': continue; 
            case '\u00E5': continue; 
            case '\u00E6': continue; 
            case '\u00E7': continue; 
            case '\u00E8': continue; 
            case '\u00E9': continue; 
            case '\u00EA': continue; 
            case '\u00EB': continue; 
            case '\u00EC': continue; 
            case '\u00ED': continue; 
            case '\u00EE': continue; 
            case '\u00EF': continue; 
            case '\u00F0': continue; 
            case '\u00F1': continue; 
            case '\u00F2': continue; 
            case '\u00F3': continue; 
            case '\u00F4': continue; 
            case '\u00F5': continue; 
            case '\u00F6': continue; 
            case '\u00F8': continue; 
            case '\u00F9': continue; 
            case '\u00FA': continue; 
            case '\u00FB': continue; 
            case '\u00FC': continue; 
            case '\u00FD': continue; 
            case '\u00FE': continue; 
            case '\u00FF': continue;
        }
        return false;
    }
    return true; // empty string is true
}

In very quick testing on a corpus of words with about a 60% pass rate, I get a speedup of about a factor of 8 with this approach.

It's not actually that much less readable than the regex without the escape characters either!

ShuggyCoUk 2009-08-20 17:57:23
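
The same set of code points as the switch above can also be written as a handful of range checks, which is shorter to read and audit. This is a sketch and not the benchmarked code: the class name `RangeWhitelist` and the helper `IsAllowed` are hypothetical, and no timing claim is made for this form.

```csharp
// Hypothetical range-based variant of the switch; accepts the same code points.
public static class RangeWhitelist
{
    public static bool ContainsAllWhitelistedCharacters(string input)
    {
        foreach (char c in input)
        {
            if (!IsAllowed(c))
                return false;
        }
        return true; // empty string passes, as in the switch version
    }

    private static bool IsAllowed(char c)
    {
        return c == '\u0020' || c == '\u0027' || c == '\u002D' || c == '\u002E'
            || (c >= 'A' && c <= 'Z')
            || (c >= 'a' && c <= 'z')
            || (c >= '\u00C0' && c <= '\u00D6')   // U+00D7 (multiplication sign) excluded
            || (c >= '\u00D8' && c <= '\u00DD')   // U+00DE excluded, as in the switch
            || (c >= '\u00DF' && c <= '\u00F6')   // U+00F7 (division sign) excluded
            || (c >= '\u00F8' && c <= '\u00FF');
    }
}
```

For example, `RangeWhitelist.ContainsAllWhitelistedCharacters("O'Brien-Smith Jr.")` passes, while any input containing a digit or `\u00D7` is rejected.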

.NET Regex for whitelisted characters
