ansaurus

Question

Tool to find duplicate sections in a text (XML) file?

Answer 1

A:

I've never heard of anything like that, but it might be an interesting task to write such a program based on a dictionary coder, as used in archivers.

lImbus 2008-10-29 21:18:43

Answer 2

A:

The description of the problem is too general.

Could you please provide a specific example: the source XML document and the desired result?

Cheers,

Dimitre Novatchev

Dimitre Novatchev 2008-11-15 18:32:05

Answer 3

A:

Not easily. My first thought is XSLT, but it's hard to implement. You'd have to go through each node and then do an XPath select on every node with the same data. That would find them, but you'd end up processing all of the nodes with the same data again later (i.e., there's no way to keep track of which node data you've already processed and skip it). You could do it with a real programming language, but that's outside my experience.
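For what it's worth, the bookkeeping described above is straightforward in a general-purpose language. The following is only a rough Python sketch of that idea, not a tested tool: the function name find_duplicates_seen and the sample document are made up for illustration, and "same data" is assumed to mean "same element text after trimming whitespace".

```python
import xml.etree.ElementTree as ET

def find_duplicates_seen(xml_string):
    """Return a list of (text, [tags]) for text shared by several elements.

    Hypothetical helper: "same data" is assumed to mean equal element text
    once surrounding whitespace is stripped.
    """
    root = ET.fromstring(xml_string)
    elements = list(root.iter())   # the root and all of its descendants
    seen = set()                   # text values that were already handled
    result = []
    for element in elements:
        text = (element.text or "").strip()
        if not text or text in seen:
            continue               # empty, or already reported: skip it
        seen.add(text)
        # Equivalent of "select every node with the same data"
        matches = [e.tag for e in elements if (e.text or "").strip() == text]
        if len(matches) > 1:
            result.append((text, matches))
    return result

sample = "<foo><bar>Hop</bar><quiz>Gaw</quiz><sub><und>Hop</und></sub></foo>"
print(find_duplicates_seen(sample))   # [('Hop', ['bar', 'und'])]
```

The seen set is what keeps each value from being reported once per occurrence. The inner scan still compares every element against every other, so this is quadratic in the number of elements.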

Stephen Friederichs 2009-01-23 15:50:55

Answer 4

A:

You could write a simple C# app that uses LINQ to read all the nodes twice as separate entities, then finds all values that are equal.

ck 2009-01-23 15:53:22

Answer 5

+1 A:

Here is a first attempt, written in Python and using only the standard library. You can improve it in many ways (trimming leading and trailing whitespace, computing a hash of the text to decrease memory requirements, displaying the elements better, with their line numbers, etc.):

import sys
import xml.etree.ElementTree as ElementTree

def print_elem(element):
    # Render an element as its opening tag, e.g. "<bar>"
    return "<%s>" % element.tag

if len(sys.argv) != 2:
    print("Usage: %s filename" % sys.argv[0], file=sys.stderr)
    sys.exit(1)
filename = sys.argv[1]
tree = ElementTree.parse(filename)
root = tree.getroot()
# Map each text value to the list of elements that contain it
chunks = {}
# './/*' selects every descendant of the root, at any depth
for element in root.findall('.//*'):
    chunks.setdefault(element.text, []).append(element)
for text, elements in chunks.items():
    if len(elements) > 1:
        print("\"%s\" is a duplicate: found in %s" %
              (text, [print_elem(e) for e in elements]))

If you give it this XML file:

<foo>
<bar>Hop</bar><quiz>Gaw</quiz>
<sub>
<und>Hop</und>
</sub>
</foo>

it will output:

"Hop" is a duplicate: found in ['<bar>', '<und>']
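Two of the improvements mentioned above (trimming whitespace and keeping a hash of the text instead of the text itself) could look roughly like this. It is only a sketch: the function name find_duplicates_hashed is made up for this example, SHA-1 is an arbitrary choice of hash, and line numbers are not covered.

```python
import hashlib
import xml.etree.ElementTree as ElementTree

def find_duplicates_hashed(xml_string):
    """Return lists of tags whose trimmed text has the same SHA-1 digest."""
    root = ElementTree.fromstring(xml_string)
    chunks = {}   # hex digest of the trimmed text -> elements with that text
    for element in root.findall('.//*'):
        text = (element.text or "").strip()
        if not text:
            continue   # skip elements with no text or only whitespace
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        chunks.setdefault(digest, []).append(element)
    return [[e.tag for e in elems]
            for elems in chunks.values() if len(elems) > 1]

sample = ("<foo>\n<bar> Hop </bar><quiz>Gaw</quiz>\n"
          "<sub>\n<und>Hop</und>\n</sub>\n</foo>")
print(find_duplicates_hashed(sample))   # [['bar', 'und']]
```

Because only the digest is kept as the key, the original text is no longer available for display, and two different texts could in principle collide. Comparing the actual text of the grouped elements before reporting them would rule that out.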

bortzmeyer 2009-01-23 16:33:24

That's cool, and I appreciate the extra effort! It looks like this would only work for root-level nodes, though, wouldn't it?

duma 2009-03-12 15:12:06

Certainly not. Because of the XPath expression .//* it should process every element.
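A quick way to check this is to compare '*' (direct children only) with './/*' (all descendants) on a small, made-up document:

```python
import xml.etree.ElementTree as ElementTree

# Made-up sample: <und> is nested one level below the root's children
root = ElementTree.fromstring(
    "<foo><bar>Hop</bar><sub><und>Hop</und></sub></foo>")

direct_children = [e.tag for e in root.findall('*')]
all_descendants = [e.tag for e in root.findall('.//*')]

print(direct_children)   # ['bar', 'sub']
print(all_descendants)   # ['bar', 'sub', 'und']
```

The nested und element only shows up with './/*'. The root element itself is not part of either result.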

bortzmeyer 2009-03-12 21:13:06
