ansaurus

Question

Is there a way to speed up this code that finds data changes in two XML files?

Answer 1

+1 A:

Have you considered using XmlDiff?

Mitch Wheat 2010-01-08 13:47:09

Yes, I'm experimenting with that as well. If possible, I wanted a pure code example that doesn't involve installing anything in the running application.

Edward Tanguay 2010-01-08 13:55:05

BTW, there's a bug in XmlDiff that causes an index-out-of-range exception for blank but non-empty attributes (such as padding=" "). I posted fix code at cowboycoder.wordpress.com a couple of months ago.

GalacticCowboy 2010-01-08 14:04:51

Answer 2

+8 A:

You should be able to get all the elements that have different values using a LINQ join on the element name.

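// The root element name identifies the entity; the first <Id> element identifies the object.
var name = xdoc1.Root.Name.ToString();
// Casting to string yields null instead of throwing when no <Id> element exists.
var id = (string)xdoc1.Descendants("Id").FirstOrDefault();

// Pair up elements by name and keep only the pairs whose values differ.
var diff =  from o in xdoc1.Root.Elements()
        join n in xdoc2.Root.Elements() on o.Name equals n.Name
        where o.Value != n.Value
        select new HistoryFieldChange() {
                EntityName = name,
                FieldName = o.Name.ToString(),
                KindOfChange = "fieldDataChange",
                ObjectReference = id,
                ValueBefore = o.Value,
                ValueAfter = n.Value,
        };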
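To try the join on its own, here is a minimal self-contained sketch. The HistoryFieldChange class is a hypothetical shape inferred from the properties the query assigns, and the sample documents are made up for illustration. Note that a join only pairs elements that exist in both documents, so an element that was added or removed is not reported.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical shape, inferred from the properties assigned in the query.
public class HistoryFieldChange
{
    public string EntityName { get; set; }
    public string FieldName { get; set; }
    public string KindOfChange { get; set; }
    public string ObjectReference { get; set; }
    public string ValueBefore { get; set; }
    public string ValueAfter { get; set; }
}

public static class JoinDiffSketch
{
    public static IEnumerable<HistoryFieldChange> GetHistoryFieldChanges(XDocument xdoc1, XDocument xdoc2)
    {
        var name = xdoc1.Root.Name.ToString();
        // Casting an XElement to string yields null when the element is missing.
        var id = (string)xdoc1.Descendants("Id").FirstOrDefault();

        // Elements are matched by name, so their order in the documents does not matter.
        return from o in xdoc1.Root.Elements()
               join n in xdoc2.Root.Elements() on o.Name equals n.Name
               where o.Value != n.Value
               select new HistoryFieldChange
               {
                   EntityName = name,
                   FieldName = o.Name.ToString(),
                   KindOfChange = "fieldDataChange",
                   ObjectReference = id,
                   ValueBefore = o.Value,
                   ValueAfter = n.Value,
               };
    }

    // Made-up sample data: City changes, FirstName moves, Phone exists only in the second document.
    public static List<HistoryFieldChange> Demo()
    {
        var before = XDocument.Parse(
            "<Customer><Id>7</Id><FirstName>Jim</FirstName><City>Berlin</City></Customer>");
        var after = XDocument.Parse(
            "<Customer><Id>7</Id><City>Paris</City><FirstName>Jim</FirstName><Phone>123</Phone></Customer>");
        return GetHistoryFieldChanges(before, after).ToList();
    }
}
```

Calling JoinDiffSketch.Demo() reports the City change only; the reordered FirstName is matched by name and the new Phone element is skipped because it has no partner in the first document.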

One of the advantages of this method is that it's easy to parallelize for multicore machines: just use PLINQ and the AsParallel extension method. Call AsParallel on both sequences; if only the inner one is parallel, the query binds to the ordinary sequential Join.

var diff =  from o in xdoc1.Root.Elements().AsParallel()
        join n in xdoc2.Root.Elements().AsParallel() on o.Name equals n.Name
        where o.Value != n.Value
        ...

Voilà: if the query can be parallelized on your machine, PLINQ will handle it automatically. This would speed up large documents, but if your documents are small you may get a better speedup by parallelizing the outer loop that calls GetHistoryFieldChanges, using something like Parallel.For.
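Parallelizing the outer loop could look roughly like the sketch below. It assumes the document pairs are already loaded into two arrays and that each pair is independent; ChangedFields and CompareAll are hypothetical names, and the per-pair comparison is simplified to return strings rather than HistoryFieldChange objects.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using System.Xml.Linq;

public static class ParallelOuterLoopSketch
{
    // Sequential comparison of one document pair; returns "Field: before -> after" strings.
    public static string[] ChangedFields(XDocument before, XDocument after)
    {
        return (from o in before.Root.Elements()
                join n in after.Root.Elements() on o.Name equals n.Name
                where o.Value != n.Value
                select $"{o.Name}: {o.Value} -> {n.Value}").ToArray();
    }

    // Compares pair i of the two arrays on whichever worker thread picks it up.
    public static string[][] CompareAll(XDocument[] befores, XDocument[] afters)
    {
        int count = Math.Min(befores.Length, afters.Length);
        var results = new string[count][];

        // Each iteration writes only to its own slot, so no locking is needed.
        Parallel.For(0, count, i =>
        {
            results[i] = ChangedFields(befores[i], afters[i]);
        });
        return results;
    }
}
```

ToArray forces each query to run inside the parallel loop; returning the lazy query instead would defer the work to whichever thread enumerates it later.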

Another advantage is that you can simply return IEnumerable from GetHistoryFieldChanges. There is no need to waste time allocating a List: the items are returned as they're enumerated, and the LINQ query is not executed until then.

IEnumerable<HistoryFieldChange> GetHistoryFieldChanges(...)
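The deferred execution described above can be observed with a small counter. This is only an illustrative sketch: ChangedFieldNames, Differs, and the Comparisons counter are hypothetical names, and the query is reduced to returning field names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class DeferredExecutionSketch
{
    // Counts how many element pairs have actually been compared so far.
    public static int Comparisons;

    public static IEnumerable<string> ChangedFieldNames(XDocument a, XDocument b)
    {
        return from o in a.Root.Elements()
               join n in b.Root.Elements() on o.Name equals n.Name
               where Differs(o, n)
               select o.Name.ToString();
    }

    static bool Differs(XElement o, XElement n)
    {
        Comparisons++;
        return o.Value != n.Value;
    }

    // Returns the counter after building the query, after First(), and after ToList().
    public static int[] Demo()
    {
        Comparisons = 0;
        var a = XDocument.Parse("<E><A>1</A><B>2</B><C>3</C></E>");
        var b = XDocument.Parse("<E><A>9</A><B>2</B><C>4</C></E>");

        var query = ChangedFieldNames(a, b);
        int afterBuild = Comparisons;   // building the query compares nothing

        query.First();
        int afterFirst = Comparisons;   // only the first pair was needed

        query.ToList();
        int afterToList = Comparisons;  // enumerating again re-runs the query from the start

        return new[] { afterBuild, afterFirst, afterToList };
    }
}
```

Demo() returns 0, 1, 4. The last value shows the flip side of laziness: every enumeration repeats the work, so call ToList() once if the results are needed more than once.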

Here are times for 1M iterations of the original, Yannick's in-order, and my non-parallel LINQ-only implementations, run on my 2.8 GHz laptop with this code.

Implementation   Elapsed
Original         3262 ms
All LINQ         1761 ms
In order only    2383 ms

One interesting thing I noticed: run the code in debug mode and then in release mode; it's amazing how much the compiler can optimize the pure LINQ version. I think returning IEnumerable helps the compiler a lot here.
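A timing harness for this kind of comparison might look like the sketch below. It is not the code used for the numbers above; Time and CountChanges are hypothetical helpers. The important detail is that the measured work must force enumeration (here with Count()), otherwise a lazily returned query is timed without ever running.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Xml.Linq;

public static class TimingSketch
{
    // Runs the work once as a warm-up, then measures the given number of iterations.
    public static TimeSpan Time(int iterations, Func<int> work)
    {
        work();  // warm-up so JIT compilation is not part of the measurement

        int sink = 0;
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            sink += work();
        }
        stopwatch.Stop();

        GC.KeepAlive(sink);  // keep the accumulated result observable
        return stopwatch.Elapsed;
    }

    // Count() enumerates the whole query, so the join really executes.
    public static int CountChanges(XDocument a, XDocument b)
    {
        return (from o in a.Root.Elements()
                join n in b.Root.Elements() on o.Name equals n.Name
                where o.Value != n.Value
                select o).Count();
    }
}
```

For example, TimingSketch.Time(1000000, () => TimingSketch.CountChanges(doc1, doc2)) measures one million comparisons of two already loaded documents. Build in release mode and run without a debugger attached when comparing implementations.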

joshperry 2010-01-08 13:50:54

Answer 3

+3 A:

Edit:

I must admit that I didn't expect the LINQ implementation to be faster than the naive iterative approach with in-order data. But that goes to show that the performance bottleneck isn't always in the obvious places.

As I have said, I am assuming you are comparing in order for a reason, so perhaps this implementation can still be of use. Granted it is not as legible as joshperry's implementation. But in terms of performance it should take the crown.

// Compares the two documents position by position and yields one change per differing pair.
public static IEnumerable<HistoryFieldChange> GetHistoryFieldChanges2(XDocument xdoc1, XDocument xdoc2)
{
  // Casting to string yields null instead of throwing when no <Id> element exists.
  string id = (string)xdoc1.Descendants("Id").FirstOrDefault();
  string name = xdoc1.Root.Name.ToString();

  // Dispose both enumerators when iteration ends or the caller stops early.
  using (IEnumerator<XElement> enumerator1 = xdoc1.Root.Elements().GetEnumerator())
  using (IEnumerator<XElement> enumerator2 = xdoc2.Root.Elements().GetEnumerator())
  {
    // Elements are paired by position, not by name; the loop stops at the end of the shorter document.
    while (enumerator1.MoveNext() && enumerator2.MoveNext())
    {
      XElement element1 = enumerator1.Current;
      XElement element2 = enumerator2.Current;

      if (element1.Value != element2.Value)
      {
        yield return new HistoryFieldChange
        {
          EntityName = name,
          FieldName = element1.Name.ToString(),
          KindOfChange = "fieldDataChange",
          ObjectReference = id,
          ValueBefore = element1.Value,
          ValueAfter = element2.Value,
        };
      }
    }
  }
}
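The same positional pairing can also be sketched with Enumerable.Zip (available from .NET 4). This is an alternative formulation rather than the code above: InOrderChanges is a hypothetical name and it returns tuples instead of HistoryFieldChange objects. Like the enumerator loop, Zip stops at the end of the shorter sequence.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class InOrderZipSketch
{
    // Pairs the i-th child of each root and keeps the pairs whose values differ.
    public static IEnumerable<(string Field, string Before, string After)> InOrderChanges(
        XDocument xdoc1, XDocument xdoc2)
    {
        return xdoc1.Root.Elements()
            .Zip(xdoc2.Root.Elements(),
                 (e1, e2) => (Field: e1.Name.ToString(), Before: e1.Value, After: e2.Value))
            .Where(change => change.Before != change.After);
    }
}
```

With <E><A>1</A><B>2</B></E> and <E><B>2</B><A>1</A></E> this reports two changes, because the elements are compared by position; a join on the element name would report none. That difference is the in-order constraint in practice.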

Yannick M. 2010-01-08 13:53:24

Feel free to vote me down, but please let me know why, so I might alter my response.

Yannick M. 2010-01-08 14:10:51

"Is there a faster way to do this in LINQ..." from the question. This simply didn't meet the OP's criteria and IMHO the in-order constraint seriously hampers the usefulness of the technique. You also claim that Linq brings unnecessary overhead, which is obviously not the case.

joshperry 2010-01-08 14:41:14

I wouldn't call treating the order of tags as significant "naïve." While XML defines the order of attributes as insignificant, it makes no such definition for tags. Hence, one must presume that the order of tags is significant to the user.

Craig Stuntz 2010-01-08 15:14:40

@joshperry: You are absolutely right about the LINQ part, but then, my advice was against using LINQ. The in-order constraint indeed limits the usefulness, but we have no way of knowing what the exact constraints are. Considering his initial implementation treated the nodes in-order, there was reason to assume the input data is pretty uniform.

Yannick M. 2010-01-08 15:41:34

@Craig Stuntz, like you said, the XML spec doesn't say anything about the order of elements; it only states that elements constrained to a DTD list must show up in order. Given a DTD or Schema (barring a list definition), a set of defined elements will validate in any order. So elements have order only because of the lexical requirement for order in a document; I don't think we can assume anything.

joshperry 2010-01-08 15:58:21

@Yannick M. Yes, and I felt that the advice to not use Linq was unfounded and misguided, hence the downvote.

joshperry 2010-01-08 16:00:37
