views:

591

answers:

8

I need to do alot of high-performance case-insensitive string comparisons and realized that my way of doing it .ToLower().Trim() was really stupid due do all the new strings being allocated

So I digged around a little and this way seems preferable:

String.Compare(txt1,txt2, StringComparison.OrdinalIgnoreCase)

The only problem here is that I want to ignore leading or trailing spaces, ie Trim() but if I use Trim I have the same problem with string allocations. I guess I could check each string and see if it StartsWith(" ") or EndsWith(" ") and only then Trim. Either that or figure out the index,length for each string and pass to string.Compare override

public static int Compare
(
    string strA,
    int indexA,
    string strB,
    int indexB,
    int length,
    StringComparison comparisonType
)

but that seems rather messy and I probably have to to use some integers if I dont make a really big if-else statement for every combination of trailing and leading blanks on both strings... so any ideas of an elegant solution?

Here's my current proposal:

public bool IsEqual(string a, string b)
    {
        return (string.Compare(a, b, StringComparison.OrdinalIgnoreCase) == 0);
    }

    public bool IsTrimEqual(string a, string b)
    {
        if (Math.Abs(a.Length- b.Length) > 2 ) // if length differs by more than 2, cant be equal
        {
            return  false;
        }
        else if (IsEqual(a,b))
        {
            return true;
        }
        else 
        {
            return (string.Compare(a.Trim(), b.Trim(), StringComparison.OrdinalIgnoreCase) == 0);
        }
    }
+2  A: 

First make sure that you really need to optimize this code. Maybe creating copies of the strings won't noticeably affect your program.

If you really need to optimize, you can try to process the strings when you first store them instead of when you compare them (assuming it happens in different stages of the program). For example, store trimmed and lowercase versions of the strings, so that when you compare them you can use simply check for equivalence.

Amnon
Well, there is nothing wrong with using a more efficient method in this case. Using String.Compare is not some "clever" hack, it is a built in way to compare strings that is also more efficient than calling ToUpper().ToLower(). It is also more clear in intent, so I don't think that you can make a valid 'premature-optimization" case in this instance/
Ed Swangren
I take it you meant Trim().ToLower()
Vinko Vrsalovic
Errr, yes :-) [15chars]
Ed Swangren
+2  A: 

Can't you just trim (and possibly make it lowercase) each string exactly once (when obtaining it)? Doing more sounds like premature optimization....

Fried Hoeben
Of course in some cases I could do that, just interested to see if one could come up with a optimized general purpose method for doing this
MattiasK
+2  A: 

I would use the code you have

String.Compare(txt1,txt2, StringComparison.OrdinalIgnoreCase)

and add any .Trim() calls as you need them. This will save your initial option 4 strings most of the time (.ToLower().Trim(), and two strings all of the time (.ToLower()).

If you are suffering performance problems after this, then your "messy" option is likely the best bet.

Nate Bross
This is interesting. Mattias: if the majority of your strings don't need the trim() call, then you could generally do it this way, and if the strings don't match, fall-back and try with a trim() call next, then "really" return that they're not matching.
Mike Atlas
Hmm, in that case I guess I should run some tests to see if the needed IsPrefix()/IsSuffix() (four of them) takes more or less performance than simply doing Trim
MattiasK
Ah of course! first do the compare, then do the trim compare (or messy metehod) if not 0, nice
MattiasK
A: 
  1. The thing is, if it needs to be done, it needs to be done. I don't think that any of your different solutions will make a difference. In each case there needs to be a number of comparisons to find the whitespace or removing it.

    Apparently, removing whitespace is part of the problem, so you shouldn't worry about that.

  2. And lowercasing a string before comparing, is a bug if your working with unicode characters and possibly slower than copying a string.

Peter Stuifzand
A: 

The warnings are about premature optimization are correct, but I'll assume you've tested this and found that a lot of time is being wasted copying strings. In that case I would try the following:

int startIndex1, length1, startIndex2, length2;
FindStartAndLength(txt1, out startIndex1, out length1);
FindStartAndLength(txt2, out startIndex2, out length2);

int compareLength = Math.Max(length1, length2);
int result = string.Compare(txt1, startIndex1, txt2, startIndex2, compareLength);

FindStartAndLength is a function that finds the starting index and length of the "trimmed" string (this is untested but should give the general idea):

static void FindStartAndLength(string text, out int startIndex, out int length)
{
    startIndex = 0;
    while(char.IsWhiteSpace(text[startIndex]) && startIndex < text.Length)
        startIndex++;

    length = text.Length - startIndex;
    while(char.IsWhiteSpace(text[startIndex + length - 1]) && length > 0)
        length--;
}
Aaron
+1  A: 

Something like this should do it:

public static int TrimCompareIgnoreCase(string a, string b) {
   int indexA = 0;
   int indexB = 0;
   while (indexA < a.Length && Char.IsWhiteSpace(a[indexA])) indexA++;
   while (indexB < b.Length && Char.IsWhiteSpace(b[indexB])) indexB++;
   int lenA = a.Length - indexA;
   int lenB = b.Length - indexB;
   while (lenA > 0 && Char.IsWhiteSpace(a[indexA + lenA - 1])) lenA--;
   while (lenB > 0 && Char.IsWhiteSpace(b[indexB + lenB - 1])) lenB--;
   if (lenA == 0 && lenB == 0) return 0;
   if (lenA == 0) return 1;
   if (lenB == 0) return -1;
   int result = String.Compare(a, indexA, b, indexB, Math.Min(lenA, lenB), true);
   if (result == 0) {
      if (lenA < lenB) result--;
      if (lenA > lenB) result++;
   }
   return result;
}

Example:

string a = "  asdf ";
string b = " ASDF \t   ";

Console.WriteLine(TrimCompareIgnoreCase(a, b));

Output:

0

You should profile it against a simple Trim and Compare with some real data, to see if there really is any difference for what you are going to use it for.

Guffa
Interesting, thanks! I'll do some comparisons with the different methods and see which one comes up on top
MattiasK
A: 

You could implement your own StringComparer. Here's a basic implementation :

public class TrimmingStringComparer : StringComparer
{
    private StringComparison _comparisonType;

    public TrimmingStringComparer()
        : this(StringComparison.CurrentCulture)
    {
    }

    public TrimmingStringComparer(StringComparison comparisonType)
    {
        _comparisonType = comparisonType;
    }

    public override int Compare(string x, string y)
    {
        int indexX;
        int indexY;
        int lengthX = TrimString(x, out indexX);
        int lengthY = TrimString(y, out indexY);

        if (lengthX <= 0 && lengthY <= 0)
            return 0; // both strings contain only white space

        if (lengthX <= 0)
            return -1; // x contains only white space, y doesn't

        if (lengthY <= 0)
            return 1; // y contains only white space, x doesn't

        if (lengthX < lengthY)
            return -1; // x is shorter than y

        if (lengthY < lengthX)
            return 1; // y is shorter than x

        return String.Compare(x, indexX, y, indexY, lengthX, _comparisonType);
    }

    public override bool Equals(string x, string y)
    {
        return Compare(x, y) == 0;
    }

    public override int GetHashCode(string obj)
    {
        throw new NotImplementedException();
    }

    private int TrimString(string s, out int index)
    {
        index = 0;
        while (index < s.Length && Char.IsWhiteSpace(s, index)) index++;
        int last = s.Length - 1;
        while (last >= 0 && Char.IsWhiteSpace(s, last)) last--;
        return last - index + 1;
    }
}

Remarks :

  • it is not extensively tested and might contain bugs
  • performance is yet to be evaluated (but it's probably better than calling Trim and ToLower anyway)
  • the GetHashCode method is not implemented, so don't use it as a key in a dictionary
Thomas Levesque
A: 

I notice that your first suggestion only compares for equality rather than sorting, that allows some further efficiency savings.

public static bool TrimmedOrdinalIgnoreCaseEquals(string x, string y)
{
    //Always check for identity (same reference) first for
    //any comparison (equality or otherwise) that could take some time.
    //Identity always entails equality, and equality always entails
    //equivalence.
    if(ReferenceEquals(x, y))
        return true;
    //We already know they aren't both null as ReferenceEquals(null, null)
    //returns true.
    if(x == null || y == null)
        return false;
    int startX = 0;
    //note we keep this one further than the last char we care about.
    int endX = x.Length;
    int startY = 0;
    //likewise, one further than we care about.
    int endY = y.Length;
    while(startX != endX && char.IsWhiteSpace(x[startX]))
        ++startX;
    while(startY != endY && char.IsWhiteSpace(y[startY]))
        ++startY;
    if(startX == endX)      //Empty when trimmed.
        return startY == endY;
    if(startY == endY)
        return false;
    //lack of bounds checking is safe as we would have returned
    //already in cases where endX and endY can fall below zero.
    while(char.IsWhiteSpace(x[endX - 1]))
        --endX;
    while(char.IsWhiteSpace(y[endY - 1]))
        --endY;
    //From this point on I am assuming you do not care about
    //the complications of case-folding, based on your example
    //referencing the ordinal version of string comparison
    if(endX - startX != endY - startY)
        return false;
    while(startX != endX)
    {
        //trade-off: with some data a case-sensitive
        //comparison first
        //could be more efficient.
        if(
            char.ToLowerInvariant(x[startX++])
            != char.ToLowerInvariant(y[startY++])
        )
            return false;
    }
    return true;
}

Of course, what is an equality checker without a matching hashcode producer:

public static int TrimmedOrdinalIgnoreCaseHashCode(string str)
{
    //Higher CMP_NUM (or get rid of it altogether) gives
    //better hash, at cost of taking longer to compute.
    const int CMP_NUM = 12;
    if(str == null)
        return 0;
    int start = 0;
    int end = str.Length;
    while(start != end && char.IsWhiteSpace(str[start]))
        ++start;
    if(start != end)
        while(char.IsWhiteSpace(str[end - 1]))
            --end;

    int skipOn = (end - start) / CMP_NUM + 1;
    int ret = 757602046; // no harm matching native .NET with empty string.
    while(start < end)
    {
            //prime numbers are our friends.
        ret = unchecked(ret * 251 + (int)(char.ToLowerInvariant(str[start])));
        start += skipOn;
    }
    return ret;
}
Jon Hanna