Hiya,

I have an XML file, and I want to find nodes that have duplicate CDATA. Are there any tools that exist that can help me do this?

I'd be fine with a tool that does this generally for text documents.

A: 

I've never heard of anything like that, but it might be an interesting task to write such a program based on a dictionary coder, as used in archivers.

lImbus
A: 

The description of the problem is too general.

Could you please provide a specific example: the source XML document and the wanted result?

Cheers,

Dimitre Novatchev

Dimitre Novatchev
A: 

Not easily. My first thought is XSLT, but it's hard to implement. You'd have to go through each node and then do an XPath select for every node with the same data. That would find them, but you'd also end up processing all the nodes with the same data again later (i.e., there's no way to keep track of which node data you've already processed and ignore it). You could do it with a real programming language, but that's outside of my experience.

Stephen Friederichs
A: 

You could write a simple C# app that uses LINQ to read all the nodes twice as separate entities, then finds all values that are equal.

ck
+1  A: 

Here is a first attempt, written in Python and using only standard libraries. You can improve it in many ways (trimming leading and trailing whitespace, computing a hash of the text to decrease memory requirements, displaying the elements better, with their line numbers, etc.):

# Written for Python 3
import sys
import xml.etree.ElementTree as ElementTree

def print_elem(element):
    # Render an element as its opening tag, e.g. <bar>
    return "<%s>" % element.tag

if len(sys.argv) != 2:
    print("Usage: %s filename" % sys.argv[0], file=sys.stderr)
    sys.exit(1)
filename = sys.argv[1]
tree = ElementTree.parse(filename)
root = tree.getroot()
# Map each text value to the list of elements that contain it
chunks = {}
# .//* matches every descendant of the root, at any depth
for element in root.findall('.//*'):
    if element.text is None:
        continue  # skip elements that have no text at all
    chunks.setdefault(element.text, []).append(element)
for text, elements in chunks.items():
    if len(elements) > 1:
        print('"%s" is a duplicate: found in %s'
              % (text, [print_elem(e) for e in elements]))

If you give it this XML file:

<foo>
<bar>Hop</bar><quiz>Gaw</quiz>
<sub>
<und>Hop</und>
</sub>
</foo>

it will output:

"Hop" is a duplicate: found in ['<bar>', '<und>']
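
Two of the improvements mentioned above (trimming whitespace and keying on a hash of the text) could look roughly like the sketch below. The function name find_duplicate_text and the SAMPLE string are made up for illustration, and SHA-1 is just one possible choice of hash. Note that the whole tree is still held in memory here, so the hash only shrinks the dictionary keys.

```python
import hashlib
import xml.etree.ElementTree as ElementTree
from collections import defaultdict

def find_duplicate_text(xml_string):
    """Return {stripped text: [tags]} for text shared by two or more elements.

    Hypothetical helper: not part of the original script.
    """
    root = ElementTree.fromstring(xml_string)
    groups = defaultdict(list)  # digest of stripped text -> elements
    for element in root.findall('.//*'):
        # Treat missing text as empty and ignore surrounding whitespace
        text = (element.text or '').strip()
        if not text:
            continue  # skip elements with no text or only whitespace
        digest = hashlib.sha1(text.encode('utf-8')).hexdigest()
        groups[digest].append(element)
    duplicates = {}
    for elements in groups.values():
        if len(elements) > 1:
            # Recover the text from the first element of the group
            duplicates[elements[0].text.strip()] = [e.tag for e in elements]
    return duplicates

# Same shape as the sample file, with extra spaces around one "Hop"
SAMPLE = "<foo>\n<bar>Hop</bar><quiz>Gaw</quiz>\n<sub>\n<und> Hop </und>\n</sub>\n</foo>"

if __name__ == "__main__":
    for text, tags in find_duplicate_text(SAMPLE).items():
        print('"%s" is a duplicate: found in %s' % (text, tags))
```

Because the text is stripped before hashing, " Hop " and "Hop" land in the same group, and whitespace-only elements such as <sub> are ignored.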
bortzmeyer
That's cool, and I appreciate the extra effort! It looks like this would only work for root-level nodes, though, wouldn't it?
duma
Certainly not. Because of the XPath expression .//* it should process every element.
bortzmeyer
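
A quick way to check the .//* claim is to compare it with *, which only matches direct children. This is a small sketch using the standard library; the helper names descendant_tags and child_tags and the SAMPLE string are invented for the example.

```python
import xml.etree.ElementTree as ElementTree

# Hypothetical sample with one element nested two levels deep
SAMPLE = "<foo><bar>Hop</bar><sub><und>Hop</und></sub></foo>"

def descendant_tags(xml_string):
    # .//* selects every descendant of the root, at any depth
    root = ElementTree.fromstring(xml_string)
    return [element.tag for element in root.findall('.//*')]

def child_tags(xml_string):
    # * selects only the direct children of the root
    root = ElementTree.fromstring(xml_string)
    return [element.tag for element in root.findall('*')]

if __name__ == "__main__":
    print(descendant_tags(SAMPLE))
    print(child_tags(SAMPLE))
```

The nested <und> element shows up in the first list but not in the second. The root element itself is in neither, since both expressions are evaluated relative to it.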