I'm reading a CSV file and the records are recorded as a string[]. I want to take each record and convert it into a custom object.

T GetMyObject<T>();

Currently I'm doing this through reflection, which is really slow. I'm testing with a 515 MB file with several million records. It takes under 10 seconds to parse. It takes under 20 seconds to create the custom objects using manual conversions with Convert.ToSomeType, but around 4 minutes to do the conversion to the objects through reflection.

What is a good way to handle this automatically?

It seems a lot of time is spent in the PropertyInfo.SetValue method. I tried caching each property's setter MethodInfo and invoking that instead, but it was actually slower.

I have also tried converting that into a delegate like the great Jon Skeet suggested here: http://stackoverflow.com/questions/1027980/improving-performance-reflection-what-alternatives-should-i-consider, but the problem is I don't know what the property type is ahead of time. I'm able to get the delegate:

var myObject = Activator.CreateInstance<T>();
foreach( var property in typeof( T ).GetProperties() )
{
    var d = Delegate.CreateDelegate(
        typeof( Action<,> ).MakeGenericType( typeof( T ), property.PropertyType ),
        property.GetSetMethod() );
}

The problem here is I can't cast the delegate into a concrete type like Action<T, int>, because the property type of int isn't known ahead of time.

+1  A: 

You should make a DynamicMethod or an expression tree and build statically typed code at runtime.

This will incur a rather large setup cost, but no per-object overhead at all.
However, it's somewhat difficult to do, and will result in complicated code that is difficult to debug.
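
As a rough illustration of the DynamicMethod route, here is a minimal sketch that emits a setter once per property and returns it as a cacheable delegate. The `SetterFactory` class and `CreateSetter` method are hypothetical names, not part of any library. It assumes a public instance property, and it still boxes value types through `object`, so it is cheaper than `PropertyInfo.SetValue` but not free of per-call cost.

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

// Hypothetical helper for illustration; not part of the original post.
static class SetterFactory
{
    // Emits IL equivalent to: (target, value) => target.Property = (TProperty)value
    // Assumes a public, non-static property declared on a reference type.
    public static Action<T, object> CreateSetter<T>(PropertyInfo property)
        where T : class
    {
        MethodInfo setMethod = property.GetSetMethod();
        if (setMethod == null)
            throw new ArgumentException("Property has no public setter.", "property");

        var method = new DynamicMethod(
            "Set_" + property.Name,
            typeof(void),
            new[] { typeof(T), typeof(object) },
            typeof(SetterFactory).Module,
            true); // skip JIT visibility checks

        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);                          // push the target object
        il.Emit(OpCodes.Ldarg_1);                          // push the value as object
        il.Emit(OpCodes.Unbox_Any, property.PropertyType); // unbox value types, cast reference types
        il.Emit(OpCodes.Callvirt, setMethod);              // call the property setter
        il.Emit(OpCodes.Ret);

        return (Action<T, object>)method.CreateDelegate(typeof(Action<T, object>));
    }
}
```

The delegate would be created once per property (for example `SetterFactory.CreateSetter<UriBuilder>(typeof(UriBuilder).GetProperty("Port"))`) and then reused for every record.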

SLaks
A: 

Take a look at this article which might help you improve performance.

Darin Dimitrov
+6  A: 

The first thing I'd say is write some sample code manually that tells you what the absolute best case you can expect is - see if your current code is worth fixing.
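
A hand-written baseline of that kind could look like the sketch below. The `SampleRow` type, its column order, and the `Baseline` helper are assumptions made up for illustration; the point is only to time a conversion with no reflection in it, so there is a target to compare the automatic version against.

```csharp
using System;
using System.Diagnostics;
using System.Globalization;

// Hypothetical record type; the column order (Id, DateOfBirth, Name) is assumed.
class SampleRow
{
    public int Id { get; set; }
    public DateTime DateOfBirth { get; set; }
    public string Name { get; set; }
}

static class Baseline
{
    // Hand-written conversion with no reflection at all.
    public static SampleRow FromRecord(string[] record)
    {
        return new SampleRow
        {
            Id = int.Parse(record[0], CultureInfo.InvariantCulture),
            DateOfBirth = DateTime.Parse(record[1], CultureInfo.InvariantCulture),
            Name = record[2]
        };
    }

    // Times the given number of conversions of the same record.
    public static TimeSpan Measure(string[] record, int iterations)
    {
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            FromRecord(record);
        }
        watch.Stop();
        return watch.Elapsed;
    }
}
```

Running `Baseline.Measure` over a representative record a few million times gives a number that any generated code can be compared with.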

If you are using PropertyInfo.SetValue etc, then absolutely you can make it quicker, even with just object - HyperDescriptor might be a good start (it is significantly faster than raw reflection, but without making the code any more complicated).

For optimal performance, dynamic IL methods are the way to go (precompiled once); in 2.0/3.0, maybe DynamicMethod, but in 3.5 I'd favor Expression (with Compile()). Let me know if you want more detail?


Implementation using Expression and CsvReader, that uses the column headers to provide the mapping (it invents some data along the same lines); it uses IEnumerable<T> as the return type to avoid having to buffer the data (since you seem to have quite a lot of it):

using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Linq.Expressions;
using System.Reflection;
using LumenWorks.Framework.IO.Csv;
class Entity
{
    public string Name { get; set; }
    public DateTime DateOfBirth { get; set; }
    public int Id { get; set; }

}
static class Program {

    static void Main()
    {
        string path = "data.csv";
        InventData(path);

        int count = 0;
        foreach (Entity obj in Read<Entity>(path))
        {
            count++;
        }
        Console.WriteLine(count);
    }
    static IEnumerable<T> Read<T>(string path)
        where T : class, new()
    {
        using (TextReader source = File.OpenText(path))
        using (CsvReader reader = new CsvReader(source,true,delimiter)) {

            string[] headers = reader.GetFieldHeaders();
            Type type = typeof(T);
            List<MemberBinding> bindings = new List<MemberBinding>();
            ParameterExpression param = Expression.Parameter(typeof(CsvReader), "row");
            MethodInfo method = typeof(CsvReader).GetProperty("Item",new [] {typeof(int)}).GetGetMethod();
            Expression invariantCulture = Expression.Constant(
                CultureInfo.InvariantCulture, typeof(IFormatProvider));
            for(int i = 0 ; i < headers.Length ; i++) {
                MemberInfo member = type.GetMember(headers[i]).Single();
                Type finalType;
                switch (member.MemberType)
                {
                    case MemberTypes.Field: finalType = ((FieldInfo)member).FieldType; break;
                    case MemberTypes.Property: finalType = ((PropertyInfo)member).PropertyType; break;
                    default: throw new NotSupportedException();
                }
                Expression val = Expression.Call(
                    param, method, Expression.Constant(i, typeof(int)));
                if (finalType != typeof(string))
                {
                    val = Expression.Call(
                        finalType, "Parse", null, val, invariantCulture);
                }
                bindings.Add(Expression.Bind(member, val));
            }

            Expression body = Expression.MemberInit(
                Expression.New(type), bindings);

            Func<CsvReader, T> func = Expression.Lambda<Func<CsvReader, T>>(body, param).Compile();
            while (reader.ReadNextRecord()) {
                yield return func(reader);
            }
        }
    }
    const char delimiter = '\t';
    static void InventData(string path)
    {
        Random rand = new Random(123456);
        using (TextWriter dest = File.CreateText(path))
        {
            dest.WriteLine("Id" + delimiter + "DateOfBirth" + delimiter + "Name");
            for (int i = 0; i < 10000; i++)
            {
                dest.Write(rand.Next(5000000));
                dest.Write(delimiter);
                dest.Write(new DateTime(
                    rand.Next(1960, 2010),
                    rand.Next(1, 13),
                    rand.Next(1, 28)).ToString(CultureInfo.InvariantCulture));
                dest.Write(delimiter);
                dest.Write("Fred");
                dest.WriteLine();
            }
            dest.Close();
        }
    }
}

Second version (see comments) that uses TypeConverter rather than Parse:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Linq.Expressions;
using System.Reflection;
using LumenWorks.Framework.IO.Csv;
class Entity
{
    public string Name { get; set; }
    public DateTime DateOfBirth { get; set; }
    public int Id { get; set; }

}
static class Program
{

    static void Main()
    {
        string path = "data.csv";
        InventData(path);

        int count = 0;
        foreach (Entity obj in Read<Entity>(path))
        {
            count++;
        }
        Console.WriteLine(count);
    }
    static IEnumerable<T> Read<T>(string path)
        where T : class, new()
    {
        using (TextReader source = File.OpenText(path))
        using (CsvReader reader = new CsvReader(source, true, delimiter))
        {

            string[] headers = reader.GetFieldHeaders();
            Type type = typeof(T);
            List<MemberBinding> bindings = new List<MemberBinding>();
            ParameterExpression param = Expression.Parameter(typeof(CsvReader), "row");
            MethodInfo method = typeof(CsvReader).GetProperty("Item", new[] { typeof(int) }).GetGetMethod();

            var converters = new Dictionary<Type, ConstantExpression>();
            for (int i = 0; i < headers.Length; i++)
            {
                MemberInfo member = type.GetMember(headers[i]).Single();
                Type finalType;
                switch (member.MemberType)
                {
                    case MemberTypes.Field: finalType = ((FieldInfo)member).FieldType; break;
                    case MemberTypes.Property: finalType = ((PropertyInfo)member).PropertyType; break;
                    default: throw new NotSupportedException();
                }
                Expression val = Expression.Call(
                    param, method, Expression.Constant(i, typeof(int)));
                if (finalType != typeof(string))
                {
                    ConstantExpression converter;
                    if (!converters.TryGetValue(finalType, out converter))
                    {
                        converter = Expression.Constant(TypeDescriptor.GetConverter(finalType));
                        converters.Add(finalType, converter);
                    }
                    val = Expression.Convert(Expression.Call(converter, "ConvertFromInvariantString", null, val),
                        finalType);
                }
                bindings.Add(Expression.Bind(member, val));
            }

            Expression body = Expression.MemberInit(
                Expression.New(type), bindings);

            Func<CsvReader, T> func = Expression.Lambda<Func<CsvReader, T>>(body, param).Compile();
            while (reader.ReadNextRecord())
            {
                yield return func(reader);
            }
        }
    }
    const char delimiter = '\t';
    static void InventData(string path)
    {
        Random rand = new Random(123456);
        using (TextWriter dest = File.CreateText(path))
        {
            dest.WriteLine("Id" + delimiter + "DateOfBirth" + delimiter + "Name");
            for (int i = 0; i < 10000; i++)
            {
                dest.Write(rand.Next(5000000));
                dest.Write(delimiter);
                dest.Write(new DateTime(
                    rand.Next(1960, 2010),
                    rand.Next(1, 13),
                    rand.Next(1, 28)).ToString(CultureInfo.InvariantCulture));
                dest.Write(delimiter);
                dest.Write("Fred");
                dest.WriteLine();
            }
            dest.Close();
        }
    }
}
Marc Gravell
You can't do assignment in expressions.
adrianm
For new objects you can (which is what we are doing here) - how do you think lambdas such as `new {Name = x.Name, Id = x.Id}` work?
Marc Gravell
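
To see the point about initializers in isolation, here is a small sketch using a hypothetical named class, `Customer`, and a hypothetical `MemberInitDemo` helper. When the C# compiler turns an object-initializer lambda on such a class into an expression tree, the body is a `MemberInitExpression` holding one binding per assigned member, which is the same node the answer builds by hand with `Expression.MemberInit`.

```csharp
using System;
using System.Linq.Expressions;

// Hypothetical type, for illustration only.
class Customer
{
    public string Name { get; set; }
    public int Id { get; set; }
}

static class MemberInitDemo
{
    // The compiler converts the object initializer into a MemberInit node.
    public static Expression<Func<Customer, Customer>> Copy()
    {
        return x => new Customer { Name = x.Name, Id = x.Id };
    }

    // Number of member bindings found in the lambda body.
    public static int BindingCount()
    {
        var init = (MemberInitExpression)Copy().Body;
        return init.Bindings.Count;
    }
}
```

Inspecting `MemberInitDemo.Copy().Body.NodeType` in a debugger shows `ExpressionType.MemberInit`.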
I see how using DynamicMethod would work from Darin Dimitrov's link. How would you use Expression to do this? I can see creating a mapping file where you say what field is mapped to what property by doing Map( m => m.FirstName, "FirstName" ) like FluentNHibernate does. Is that what you were thinking, or something else? I'd really rather not create another file for this. If that's the case, using DynamicMethod would be better.
Josh Close
I did some sample code manually which I mentioned was under 20 seconds, so that would be my goal.
Josh Close
And how does it compare?
Marc Gravell
If you notice, there is lots of code there, but it only does the complex stuff *once*, compiling it into a `Func<,>` which it re-uses for the rows.
Marc Gravell
@Marc Gravell Yes, I see. I like it. A lot more readable than DynamicMethod emitting too. Doing this had very fast results also. I'm actually using TypeConverters to do the type conversion, which seems to be the slowdown now. I may need to re-think that portion.
Josh Close
I'm very familiar with TypeConverter; that would be fine if "good enough" is good enough, perhaps with HyperDescriptor (since both use boxing). But if you need the fastest possible it is better to bypass these (albeit minor) overheads, and use things like `Parse`. Or mix and match ;-p
Marc Gravell
Yeah, I'm going to try and eliminate it where I can. Currently I'm just using it for everything, just to get everything working properly. Obviously, if the type is a string, then no conversion is needed, which I'm currently not even handling. I'm actually looking for TypeConverterAttribute on the properties and if one is specified, use that.
Josh Close
Ok. I have it implemented and it's pretty fast. I'm using Parse if Parse is available, otherwise grabbing the default type converter for the type, and nothing if string. The only problem I'm having now is if the type is Guid. I'm calling TypeDescriptor.GetConverter( property.PropertyType ) to get the converter, then getting the method with typeConverter.GetType().GetMethod( "ConvertFrom", new[] { typeof( string ) } ), then passing that into Expression.Call( Expression.New( typeConverter.GetType() ), convertFromMethod, fieldExpression ). When binding this I get the error "Argument types do not match".
Josh Close
In this case I would use the Guid ctor that accepts a string: `ConstructorInfo guidCtor = typeof(Guid).GetConstructor(new[] {typeof(string)});`, and use (for `Guid`) `val = Expression.New(guidCtor, val);`
Marc Gravell
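
A self-contained sketch of the constructor suggestion above, compiled into a standalone `Func<string, Guid>`. The `GuidParsing` class name is hypothetical; in the CSV reader the `text` parameter would instead be the expression that reads the field.

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection;

// Hypothetical helper for illustration.
static class GuidParsing
{
    // Builds and compiles: text => new Guid(text)
    public static Func<string, Guid> Compile()
    {
        ParameterExpression text = Expression.Parameter(typeof(string), "text");
        ConstructorInfo guidCtor = typeof(Guid).GetConstructor(new[] { typeof(string) });
        Expression body = Expression.New(guidCtor, text);
        return Expression.Lambda<Func<string, Guid>>(body, text).Compile();
    }
}
```

Because `Expression.New` is already typed as `Guid`, the result can be passed straight to `Expression.Bind` without a conversion node.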
I found that no type converters work. They all have the same "Argument types do not match" issue. What is the proper way to create a type converter and call ConvertFrom on it using Expressions?
Josh Close
Added `TypeConverter` example (in this case *all* type-converter, but you could mix and match easily enough)
Marc Gravell
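
One way the mix-and-match idea might be sketched: use the string as-is, prefer a static `Parse(string, IFormatProvider)` when the target type has one, and otherwise fall back to the type's `TypeConverter`. The `ConversionBuilder` class and its method names are hypothetical. Note the `Expression.Convert` around the converter call: `ConvertFromInvariantString` returns `object`, and binding an `object`-typed expression to a `Guid` or `int` member is the kind of mismatch that would produce "Argument types do not match".

```csharp
using System;
using System.ComponentModel;
using System.Globalization;
using System.Linq.Expressions;
using System.Reflection;

// Hypothetical helper for illustration.
static class ConversionBuilder
{
    // Returns an expression that converts `text` (a string-typed expression) to targetType.
    public static Expression Build(Expression text, Type targetType)
    {
        // Strings need no conversion.
        if (targetType == typeof(string))
            return text;

        // Prefer a static Parse(string, IFormatProvider) when the type offers one.
        MethodInfo parse = targetType.GetMethod(
            "Parse",
            BindingFlags.Public | BindingFlags.Static,
            null,
            new[] { typeof(string), typeof(IFormatProvider) },
            null);
        if (parse != null)
        {
            return Expression.Call(
                parse,
                text,
                Expression.Constant(CultureInfo.InvariantCulture, typeof(IFormatProvider)));
        }

        // Otherwise call the type's converter; it returns object, so cast the result.
        TypeConverter converter = TypeDescriptor.GetConverter(targetType);
        MethodInfo convert = typeof(TypeConverter).GetMethod(
            "ConvertFromInvariantString", new[] { typeof(string) });
        return Expression.Convert(
            Expression.Call(Expression.Constant(converter, typeof(TypeConverter)), convert, text),
            targetType);
    }

    // Convenience wrapper that compiles a standalone string-to-T converter.
    public static Func<string, T> Compile<T>()
    {
        ParameterExpression text = Expression.Parameter(typeof(string), "text");
        return Expression.Lambda<Func<string, T>>(Build(text, typeof(T)), text).Compile();
    }
}
```

In the reader, `Build` would be called once per column while assembling the bindings, so the choice between the three paths is made at setup time rather than per row.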
Thanks a lot Marc! Is there a good place to learn Expressions? The comments on MSDN and intellisense don't really help explaining what things do. You seem to know them pretty well, which may just be from using them. :P
Josh Close
I've blogged about it a bit, including tricks for learning them: http://marcgravell.blogspot.com/search/label/expression (read bottom up) - but apparently the MSDN documentation for 4.0 is better.
Marc Gravell
@Marc Gravell You can view my implementation here http://github.com/JoshClose/CsvHelper/blob/master/src/CsvHelper/CsvReader.cs
Josh Close