ansaurus

Question

How can I anonymise XML data for selected tags?

Answer 1

+5 A:

Bottom line: don't parse XML using regex.

Use your language's DOM parsing libraries instead, and if you know the elements you need to anonymize, grab them using XPath and hash their contents by setting their innerText/innerHTML properties (or whatever your language calls them).

Welbog 2009-02-19 15:35:02

A 50 MB file might be a bit much for DOM processing, depending on the expansion factor of the data structure in memory. At the very least it might be a while before any result comes out of it. Stream or pull processing might be a better idea.

mirod 2009-02-20 13:33:24
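
To illustrate the pull-processing idea from the comment above, here is a minimal Python sketch using `xml.etree.ElementTree.iterparse`. It is not from the thread. The function name `anonymise_records` is hypothetical, and the sketch assumes the file is a flat list of record elements under a single root, uses no namespaces, and that the selected tags hold plain text only. Whitespace between records and any text directly inside the root are not preserved.

```python
import hashlib
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def anonymise_records(source, sink, tags):
    """Hash the text of `tags`, writing each child of the root once it is complete."""
    depth = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if root is None:
                # First start event is the root: emit its opening tag by hand.
                root = elem
                attrs = "".join(
                    " %s=%s" % (key, quoteattr(value))
                    for key, value in root.attrib.items()
                )
                sink.write("<%s%s>" % (root.tag, attrs))
            continue
        depth -= 1
        if elem.tag in tags and elem.text:
            # md5 needs bytes, so encode the text first.
            elem.text = hashlib.md5(elem.text.encode("utf-8")).hexdigest()
        if depth == 1:
            # A direct child of the root just closed: write it and drop it,
            # so only one record is held in memory at a time.
            elem.tail = None
            sink.write(ET.tostring(elem, encoding="unicode"))
            root.remove(elem)
    if root is not None:
        sink.write("</%s>" % root.tag)
```

Because every finished record is removed from the root, memory use should stay roughly proportional to the largest single record rather than to the whole file.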

Answer 2

+6 A:

You can do something like the following in Python.

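import hashlib
import xml.etree.ElementTree as xml  # or lxml or whatever

theDoc = xml.parse("sample.xml")
for alphaTag in theDoc.findall("xpath/to/tag"):
    print(alphaTag, alphaTag.text)
    # md5 needs bytes in Python 3; an empty element has text None, so fall back to ""
    alphaTag.text = hashlib.md5((alphaTag.text or "").encode("utf-8")).hexdigest()
xml.dump(theDoc)  # writes the modified document to stdout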

S.Lott 2009-02-19 15:47:59

Answer 3

+3 A:

As Welbog said, don't try to parse XML with a regex. You'll regret it eventually.
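
To make the risk concrete, here is a small Python sketch (not from the thread). The sample documents and the pattern are made up for illustration. Each document carries the same data as far as an XML parser is concerned, but a typical naive regex only handles the first spelling.

```python
import re
import xml.etree.ElementTree as ET

# Four spellings of the same data; all are made-up samples.
VARIANTS = [
    "<person><name>Ann</name></person>",
    "<person><name >Ann</name></person>",             # whitespace inside the tag
    '<person><name lang="en">Ann</name></person>',    # an attribute appears
    "<person><name><![CDATA[Ann]]></name></person>",  # text wrapped in CDATA
]

NAME_RE = re.compile(r"<name>(.*?)</name>")  # a typical naive pattern


def names_by_regex(doc):
    """Return whatever the naive pattern captures."""
    return NAME_RE.findall(doc)


def names_by_parser(doc):
    """Return the text of every <name> element, as seen by a real parser."""
    return [el.text for el in ET.fromstring(doc).iter("name")]


for doc in VARIANTS:
    print(names_by_regex(doc), names_by_parser(doc))
```

The parser reports the same name for all four documents. The regex misses the second and third entirely and captures the raw CDATA markup in the fourth, so a regex-based anonymiser could silently leave data untouched.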

Probably the easiest way to do this is using XML::Twig. It can process XML in chunks, which lets you handle very large files.

Another possibility would be using SAX, especially with XML::SAX::Machines. I've never really used that myself, but it's a stream-oriented system, so it should be able to handle large files. The downside is that you'll probably have to write more code to collect the text inside each tag that you care about (where XML::Twig will collect that text for you).

cjm 2009-02-19 16:27:18
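
For comparison, the stream-oriented approach can be sketched in Python with the standard `xml.sax` module. This is an illustration of the general SAX idea, not of XML::SAX::Machines. The names `AnonymisingFilter` and `anonymise` are hypothetical, and the sketch assumes the selected tags contain plain text only (no child elements). It also shows the extra work mentioned above: the text of an element may arrive in several chunks, so the filter has to collect it itself.

```python
import hashlib
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class AnonymisingFilter(XMLFilterBase):
    """Pass SAX events through, replacing the text of selected tags with its MD5."""

    def __init__(self, parent, tags):
        super().__init__(parent)
        self._tags = set(tags)
        self._buffer = None  # list of text chunks while inside a selected tag

    def startElement(self, name, attrs):
        if name in self._tags:
            self._buffer = []
        super().startElement(name, attrs)

    def characters(self, content):
        if self._buffer is not None:
            # Hold the text back; SAX may deliver it in several pieces.
            self._buffer.append(content)
        else:
            super().characters(content)

    def endElement(self, name):
        if self._buffer is not None and name in self._tags:
            text = "".join(self._buffer)
            self._buffer = None
            if text:
                digest = hashlib.md5(text.encode("utf-8")).hexdigest()
                super().characters(digest)
        super().endElement(name)


def anonymise(source, sink, tags):
    """Stream `source` to `sink`, hashing the text of the given tags."""
    reader = AnonymisingFilter(xml.sax.make_parser(), tags)
    reader.setContentHandler(XMLGenerator(sink, encoding="utf-8"))
    reader.parse(source)
```

Nothing larger than the text of one selected element is buffered, so the size of the input file should not matter much for memory use.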

A good tutorial on Twig is at http://xmltwig.com/

Hynek -Pichi- Vychodil 2009-02-19 16:58:01

Answer 4

+2 A:

Using regexps is indeed dangerous unless you know the exact format of the file, that format is easy to parse with regexps, and you are sure it will not change in the future.

Otherwise you could indeed use XML::Twig, as below. An alternative would be XML::LibXML, although the file might be too big to load entirely into memory (then again, maybe not; memory is cheap these days), so you might have to use its pull mode, which I don't know much about.

Compact XML::Twig code:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;
use Digest::MD5 'md5_base64';

my @tags_to_anonymize= qw( name surname address email phone);

# the handler for each element ($_) sets its content with the md5 and then flushes
my %handlers= map { $_ => sub { $_->set_text( md5_base64( $_->text))->flush } } @tags_to_anonymize;

XML::Twig->new( twig_roots => \%handlers, twig_print_outside_roots => 1)
         ->parsefile( "my_big_file.xml")
         ->flush;

mirod 2009-02-19 19:32:37

I was just about to suggest XML::Twig. It's incredibly easy to transform only the parts of an XML tree you need to touch while leaving the rest alone. :)

brian d foy 2009-02-19 21:35:29
