Clearly I need to either (a) convert both strings to canonical XML or (b) compare their parse trees. The following doesn't work, because the document object that Nokogiri returns doesn't define a sensible ==:

Nokogiri.XML(doc_a) == Nokogiri.XML(doc_b)

Nor does the following, because Nokogiri's to_xml leaves some internal whitespace:

Nokogiri.XML(doc_a).to_xml == Nokogiri.XML(doc_b).to_xml

This is a reasonable approximation of equality (and will work for most cases), but it's not quite right:

Nokogiri.XML(doc_a).to_xml.squeeze(' ') == Nokogiri.XML(doc_b).to_xml.squeeze(' ')

I'm already using Nokogiri, so I'd prefer to stick with it, but I'll use whatever library works.

+1  A: 

Converting them to strings won't be very successful. For example, if an element has two attributes, does their order really matter? In most cases, no. Does the order of a node's children matter? That depends on what you're doing. But if the answer to either of those questions is "no", then a simple string comparison is a kludge at best.

There isn't anything in Nokogiri to do it for you; you'll have to build it yourself. Aaron Patterson describes some of the issues:

As far as the XML document is concerned, no two nodes are ever equal. Every node in a document is different. Every node has many attributes to compare:

  1. Is the name the same?
  2. How about attributes?
  3. How about the namespace?
  4. What about number of children?
  5. Are all the children the same?
  6. Is its parent node the same?
  7. What about its position relative to sibling nodes?

Think about adding two nodes to the same document. They can never have the same position relative to sibling nodes, therefore two nodes in a document cannot be "equal".

You can however compare two different documents. But you need to answer those 7 questions yourself as you're walking the two trees. Your requirements for sameness may differ from others.

That's your best bet: walk the trees and make those comparisons.

Pesto
I'm pretty sure canonical XML (http://www.w3.org/TR/xml-c14n) takes care of all seven of those issues.
James A. Rosen