Hi,

I have several XML files, each containing data for ‘root objects’, which I parse using LINQ to XML and then turn into actual root objects that I persist using NHibernate and the Sharp Architecture repository. I have started to optimise the data insert and can now add 30,000 objects to the database in about 1 hour and 40 minutes. However, this is still too slow.

I think one bottleneck is the lookup of objects in the database, which requires IO: objects have to be looked up so they can be reused.

The root object has several authors:

public virtual IList<Author> Authors { get; set; }

Authors have this structure:

public class Author : Entity
{
    public virtual Initials Initials { get; set; }
    public virtual ForeName ForeName { get; set; }
    public virtual LastName LastName { get; set; }
}

I have achieved a great speed-up by using a typed Id (something I wouldn't normally do):

public class LastName : EntityWithTypedId<string>, IHasAssignedId<string>
{
    public LastName()
    {
    }

    public LastName(string id)
    {
        SetAssignedIdTo(id);
    }

    public virtual void SetAssignedIdTo(string assignedId)
    {
        Id = assignedId;
    }
}

I look these up (and create them if necessary) like this:

LastName lastName = LastNameRepository.Get(TLastName);

if (lastName == null)
{
    lastName = LastNameRepository.Save(new LastName(TLastName));
    LastNameRepository.DbContext.CommitChanges();
}

Author.LastName = lastName;

I am looking authors up like this:

propertyValues = new Dictionary<string, object>();
propertyValues.Add("Initials", Author.Initials);
propertyValues.Add("ForeName", Author.ForeName);
propertyValues.Add("LastName", Author.LastName);

Author TAuthor = AuthorRepository.FindOne(propertyValues);

if (TAuthor == null)
{
    AuthorRepository.SaveOrUpdate(Author);
    AuthorRepository.DbContext.CommitChanges();
    Root.Authors.Add(Author);
}
else
{
    Root.Authors.Add(TAuthor);
}

Can I improve this? Should I use stored procedures, HQL, raw SQL, or ICriteria to perform the lookup instead? Could I use some form of caching to speed up the lookup and reduce IO? The CommitChanges calls seem to be necessary, or should I wrap everything in a transaction instead?

I already flush my session etc. every 10 root objects.
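
To illustrate, the kind of pattern I have in mind looks roughly like this (a simplified sketch; the session and rootObjects names are placeholders, not my exact code):

// Simplified sketch: one transaction around the whole import, flushing and
// clearing the session every N root objects.
// "session" and "rootObjects" are illustrative placeholders.
using (var tx = session.BeginTransaction())
{
    int i = 0;
    foreach (var root in rootObjects)
    {
        session.SaveOrUpdate(root);

        if (++i % 10 == 0)
        {
            session.Flush();  // push this batch of statements to the database
            session.Clear();  // stop the first-level cache from growing unbounded
        }
    }
    tx.Commit();
}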

Any feedback would be very much welcome. Many thanks in advance.

Best wishes,

Christian

A: 

The first thing I would do is simplify the Author entity, as I don't think you need Initials, ForeName, and LastName to be separate entities. I think using plain strings would be more efficient:

public class Author : Entity
{   
    public virtual string Initials { get; set; }
    public virtual string ForeName { get; set; }
    public virtual string LastName { get; set; }
}

Dan
Thanks - this is a remnant from when I tried to use lookup tables for names, where I had an int rather than the actual string as the primary key. I then realised that looking up the int takes a lot of time, so I paid for insert efficiency with memory.
csetzkorn
+1  A: 

In all honesty, I would say that you shouldn't even be using SA/NHibernate for something like this. It's a bulk data import from XML, and an ETL tool like SSIS would be a better choice. Even a hand-cranked process on the DB server would work better: step 1, load the XML into a table; step 2, do the UPSERT. Incidentally, SQL 2008 introduced the MERGE command for UPSERT operations, which might be of use.

I would also agree with Dan's comment - is it really necessary to treat initials, forename and surname as separate entities? Treating them as simple strings would boost performance. What in your domain model specifies that they are entities in their own right?

If you really must continue using SA/NHibernate, have a read of this: http://www.lostechies.com/blogs/jimmy_bogard/archive/2010/06/24/bulk-processing-with-nhibernate.aspx

The suggestion in Jimmy's blog about batching SELECTs should help quite a lot. If you plan to process a batch of 250 records at once, do all the SELECTs as a single NH command, process all the data, then do all the updates as another single batch (which I believe your use of EntityWithTypedId and the adonet.batch_size config setting will help achieve).
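
As a rough illustration of that batched-SELECT idea (a sketch only, assuming the simplified Author with plain string properties from Dan's answer; session and parsedAuthors are made-up names, not code from the question):

using System.Linq;
using NHibernate.Criterion; // for Restrictions

// Sketch: fetch all candidate authors for the current batch in one round trip,
// then resolve each parsed author against an in-memory dictionary instead of
// issuing a SELECT per author. "session" and "parsedAuthors" are placeholders.
var lastNames = parsedAuthors.Select(a => a.LastName).Distinct().ToList();

var existing = session.CreateCriteria<Author>()
    .Add(Restrictions.In("LastName", lastNames))
    .List<Author>();

var lookup = existing.ToDictionary(
    a => a.Initials + "|" + a.ForeName + "|" + a.LastName);

foreach (var author in parsedAuthors)
{
    var key = author.Initials + "|" + author.ForeName + "|" + author.LastName;
    Author found;
    var resolved = lookup.TryGetValue(key, out found) ? found : author;
    // ... attach "resolved" to its root object as before ...
}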

Finally, regarding the statement "which I parse using LINQ to XML": is that really the best way of doing it? I'm guessing that it might be, given the size of your input files, but are you aware of the approach of simply deserializing the XML file into an object graph? SO won't let me post the link to a page describing this because I haven't earned enough reputation yet, but if you want to read up on it, Google "don't parse that xml" and the first article will explain it.
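
For example, a minimal sketch of the deserialization approach (RootObjectCollection stands in for classes generated from your XML schema, e.g. with xsd.exe, and the file name is just an example):

using System.IO;
using System.Xml.Serialization;

// Sketch only: deserialize the whole file into an object graph in one call.
// "RootObjectCollection" is a placeholder for your generated root type.
var serializer = new XmlSerializer(typeof(RootObjectCollection));
RootObjectCollection data;
using (var stream = File.OpenRead("roots.xml"))
{
    data = (RootObjectCollection)serializer.Deserialize(stream);
}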

Hope this helps.

Jon
I ended up using xsd2code. Make sure that, if your XML contains DTD information, you use: BlaClass BlaClass = ((BlaClass)(serializer.Deserialize(System.Xml.XmlReader.Create(new XmlTextReader(filepath), new System.Xml.XmlReaderSettings() { ProhibitDtd = false }))));
csetzkorn