tags:

views:

548

answers:

4

I've got a simple procedure to strip a string of all characters illegal in XML:

string SanitizeXml(string xml)
{
    return string.Concat
        (xml.ToCharArray().Where(c => IsLegalXmlChar(c)).ToArray());
}

It's nice and terse. But I'm concerned about its performance. The same thing could easily be accomplished using a simple for loop:

string SanitizeXml(string xml)
{
    var buffer = new StringBuilder();

    foreach(char c in xml)
    {
     if (IsLegalXmlChar(c))
     {
      buffer.Append(c);
     }
    }

    return buffer.ToString();
}

What sticks out to me is that, in the second example, xml is converted to a char[], and Where()'s IEnumerable<char> back to a char[]. I seem to do this a lot with LINQ--change between arrays and enumerables.

Should I be concerned about this? What kind of performance hit am I going to get, in general, for relying on LINQ extension methods when there a clear alternative that may be a little more verbose.

Perhaps this is too broad of a question.

A: 

Personally, I wouldn't use LINQ / extension methods in this scenario; LINQ is a powerful tool, but it shouldn't be used for every problem.

The existing LINQ extension methods using Ienumerable<T> and ToArray etc will, by definition, add some overhead to your scenario. The question is: is it significant to your scenario? For example, if you are about to do a network transmit of the data, a few picoseconds here won't matter. But if you are precessing xml in a tight loop, it might.

A better fix would be to encode it directly with framework code... I'll see if I can find the easiest option for this...

Note: another micro-optimisation here would be to pre-init the StringBuilder with the length of the existing string.

Marc Gravell
Does loading XML data into XText() sanitize it automatically, perhaps?
Chris
Yes (for example: `string s = new XText("x `) - probably **reasonably** efficiently.
Marc Gravell
What's a good "illegal" xml character to use in test?
chakrit
+5  A: 

Well, you don't need the first call to ToCharArray() to start with - string implements IEnumerable<char>. However, I agree that a StringBuilder and a loop would probably be more appropriate in this case.

I'm not sure what string.Concat(char[]) does offhand, btw - why aren't you just using the string constructor which takes a char array? In other words, after these modifications:

static string SanitizeXml(string xml)
{
    return new string (xml.Where(c => IsLegalXmlChar(c)).ToArray());
}

I still prefer a StringBuilder solution, but that could be improved for the common case (where there are few illegal characters) by giving an appropriate capacity to start with:

string SanitizeXml(string xml)
{
    var buffer = new StringBuilder(xml.Length);

    foreach(char c in xml)
    {
        if (IsLegalXmlChar(c))
        {
                buffer.Append(c);
        }
    }

    return buffer.ToString();
}

One alternative I hadn't thought of before might be an extension method on StringBuilder:

// Can't just call it Append as otherwise StringBuilder.Append(object) would
// be used :(
public static StringBuilder AppendSequence(this StringBuilder builder,
                                           IEnumerable<char> sequence)
{
    foreach (char c in sequence)
    {
        builder.Append(c);
    }
    return builder;
}

Then you could use it like this:

xml = new StringBuilder(xml.Length)
            .AppendSequence(xml.Where(IsLegalXmlChar)
            .ToString();

(You could have other overloads for AppendSequence to take IEnumerable etc, if you wanted.)

EDIT: Another alternative might be to avoid calling Append so often, using instead the overload which appends a substring. You could then again build an extension method for StringBuilder, something like (completely untested, I'm afraid - I haven't even tried compiling it):

public static StringBuilder AppendWhere(this StringBuilder builder,
                                        string text,
                                        Func<char, bool> predicate)
{
    int start = 0;
    bool lastResult = false;
    for (int i=0; i < text.Length; i++)
    {
        if (predicate(text[i]))
        {
            if (!lastResult)
            {
                start = i;
                lastResult = true;
            }
        }
        else
        {
            if (lastResult)
            {
                builder.Append(text, start, i-start);
                lastResult = false;
            }
        }
    }
    if (lastResult)
    {
         builder.Append(text, start, text.Length-start);
    }
    return builder;
}

Usage for the example:

xml = new StringBuilder(xml.Length).AppendWhere(xml, IsLegalXmlChar)
                                   .ToString();

Another alternative would be to change it to be an extension method on String, create the StringBuilder lazily, and if you get to the end with start=0, just return the original string.

Jon Skeet
Good points. I've been doing so many very similar operations this evening that I've gotten a little copy-and-paste happy.
Chris
Where(IsLegalXmlChar) might be **marginally** more efficient (I doubt the JIT can inline the anon-method, but at least it should be a static/cached delegate since no captures)
Marc Gravell
I like the AppendSequence extenion method - very neat and tidy.
Marc Gravell
Yes, but I'm not terribly keen on the name now. AppendAll? AppendFrom? Sorry for not thinking of the fact that IsLegalXmlChar is an appropriate method group to start with. Doh!
Jon Skeet
@Jon_Skeet try resharper, it'll yell at you everytime you make some unnecessary method group. :-) I got that a lot!
chakrit
@chakrit: I have R#, but when I wrote the above code I was on a machine without even .NET installed...
Jon Skeet
+2  A: 

For a simple foreach loop, the two version is essentially the same.

Just think about why we have IEnumerable type in the first place, it's for use with foreach loop! If you use a foreach loop then your string gets converted to an IEnumerable behind the scenes and so the logic would essentially be the same as in the LINQ version.

Unless you throw in some optimization, i.e. using StringBuilder, performance won't be too different.

Here's a profiling code: http://pastebin.com/f125a9a46

Credit @Chris, @Marc_Garvell, @Jon_Skeet

This is the result on my machine:

Simple LINQ version                      : 43270ms
For-each-loop version w/ StringBuilder   : 35875ms
For-each-loop version w/ List            : 37595ms
For-index loop w/ StringBuilder          : 37589ms
Jon Skeet's AppendWhere version          : 28980ms

Here's the result with code optimization enabled:

Simple LINQ version                      : 27814ms
For-each-loop version w/ StringBuilder   : 23453ms
For-each-loop version w/ List            : 21374ms
For-index loop w/ StringBuilder          : 22308ms
Jon Skeet's AppendWhere version          : 10884ms

4.3 seconds difference between LINQ and foreach loop doesn't really justify 400,000,000 characters when you also take in the fact that you have utilized the StringBuilder while LINQ have the overhead of rebuilding from a char array.

chakrit
What do you mean by "LINQ/Enumerable version" exactly? In particular, did you get rid of the redundant ToCharArray call?
Jon Skeet
return new string(xml.Where(IsLegalXmlChar).ToArray());
chakrit
you can check the code I use in the pastebin link but my point was that using a "foreach" loop is not so different from just using LINQ.
chakrit
Testing my StringBuilder.AppendWhere now...
Jon Skeet
Just tried AppendWhere - it just about beats LINQ/Enumerable with the initial "1 in 4 illegal" rate; when you put it down to 1 in 20, it wins hands down - finishes in half the time of LINQ/Enumerable.
Jon Skeet
good thing, will try that out :-)
chakrit
A: 

I would use regex to clean the string. It looks much cleaner than the options that are being discussed. Can someone run the performance test with regex and let me know if I'm doing things wrong?

Thanks!

mson