tags:

views:

51

answers:

2

I have two documents with a simple schema that I need to compare:

current doc:

<Sections>
  <Section Number="1"/>
  <Section Number="2"/>
  <Section Number="4"/>
  <Section Number="5"/>
</Sections>

previous doc:

<Sections>
  <Section Number="1"/>
  <Section Number="2"/>
</Sections>

The result of the comparison will be a list sections that have been added to the current doc...ie sections in the current doc that are not in the previous doc. In this example section 4 and 5 are new.

The current and previous doc can have upwards of 20,000 records. The following approach produces the results I need but seems like the wrong approach as it passes over the data sets multiple times, and takes a while to run.

get a list of the sections:

List<XElement> currenList = currentDoc.Descendants("Section").ToList();

get attributes in previous list

List<string> previousString = //get the attribute values...
//get the new sections...
var newSections = (from nodes in currentList
                   let att = nodes.Attribute("Number").Value
                   where !previousList.Contains(att)
                   select nodes)

What is a better approach that would involve fewer passes/conversions of the datasets?

A: 

You should look at Except:

IEnumerable<int> currentList = currentDoc.Descendants("Section")
                      .Select(e => (int)e.Attribute("Number"));
IEnumerable<int> previousList = previousDoc.Descendants("Section")
                      .Select(e => (int)e.Attribute("Number"));

IEnumerable<int> newSections = currentList.Except(previousList);
Alex Bagnolini
If you need the `pick Attribute "Number" inside Sections` piece of code in more part of your code, you should think about making it an Extension Method.
Alex Bagnolini
How is Except implemented? If it checks for existence by passing through the whole list with each element, it's still doing and N^2 algorithm.
santosc
Does that really matters? Until you don't have a performance problem, don't bother about improving it.
Alex Bagnolini
he said "it takes a while to run" so I assumed he was concerned for the runtime.
santosc
A: 

You can use a sorted set to keep track.

SortedSet<string> sections = new Sort...
List<XElement> diff = new Li...

foreach (var node in previousList)
    sections.Add(node.Attribute("Section").Value());

foreach (var node in currentList)
    if (!sections.Contains(node.Attribute("Section").Value());
        diff.Add(node);

this uses a little extra memory with SortedSet, but it should run n*log(n) since the Contains() on a sorted set will run log(n).

in the end, diff should contain the list you're looking for.

santosc