My question is as follows:

I have to read a large XML file (50 MB) and anonymise the tags/fields that hold private data, such as name, surname, address, email, and phone number.

I know exactly which tags in XML are to be anonymised.

 s|<a>alpha</a>|MD5ed(alpha)|e;
 s|<h>beta</h>|MD5ed(beta)|e;

where alpha and beta stand for whatever text appears inside the tag; that text is what gets hashed, probably with an algorithm like MD5.

I will only convert the tag values, not the tags themselves.

I hope I have described my problem clearly enough. How do I achieve this?

+5  A: 

Bottom line: don't parse XML using regex.

Use your language's DOM parsing libraries instead, and if you know the elements you need to anonymize, grab them using XPath and hash their contents by setting their innerText/innerHTML properties (or whatever your language calls them).

Welbog
A 50 MB file might be a bit much for DOM processing, depending on the expansion factor of the data structure in memory. At the very least it might be a while before any result comes out of it. Stream or pull processing might be a better idea.
mirod
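To illustrate the pull-processing idea from the comment above, here is a minimal Python sketch built on ElementTree's iterparse. It is not a definitive implementation: it assumes the file is a flat list of record elements directly under the root, and the record name ("person"), the tag names, and the function names are all hypothetical placeholders.

```python
import hashlib
import io
import xml.etree.ElementTree as ET

# Hypothetical tag names; replace them with the ones from your own file.
TAGS_TO_ANONYMISE = {"name", "surname", "address", "email", "phone"}


def md5_hex(text):
    """Return the hex MD5 digest of a text string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def anonymise_records(source, out, record_tag="person", tags=TAGS_TO_ANONYMISE):
    """Write an anonymised copy of source to out, one record at a time.

    Assumes every record_tag element is a direct child of the root.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    out.write("<%s>" % root.tag)
    for event, elem in context:
        if event != "end":
            continue
        if elem.tag in tags and elem.text:
            elem.text = md5_hex(elem.text)
        if elem.tag == record_tag:
            # The record is complete: serialise it, then drop it from memory.
            out.write(ET.tostring(elem, encoding="unicode"))
            root.clear()
    out.write("</%s>" % root.tag)


# Usage with an in-memory document; a real run would pass file objects.
sample = io.BytesIO(b"<people><person><name>Alice</name></person></people>")
result = io.StringIO()
anonymise_records(sample, result)
```

Because each finished record is cleared from the root, memory use stays roughly proportional to one record rather than the whole file. The trade-off in this sketch is fidelity: attributes on the root element and any text or whitespace between records are not reliably carried over.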
+6  A: 

You could do something like the following in Python.

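import xml.etree.ElementTree as xml  # or lxml or whatever
import hashlib

theDoc = xml.parse("sample.xml")
# findall() takes a path relative to the root element
for alphaTag in theDoc.findall("xpath/to/tag"):
    if alphaTag.text is None:  # empty element, nothing to hash
        continue
    print(alphaTag, alphaTag.text)
    # md5() needs bytes, so encode the text first
    alphaTag.text = hashlib.md5(alphaTag.text.encode("utf-8")).hexdigest()
xml.dump(theDoc)  # writes the modified document to stdout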
S.Lott
+3  A: 

As Welbog said, don't try to parse XML with a regex. You'll regret it eventually.

Probably the easiest way to do this is using XML::Twig. It can process XML in chunks, which lets you handle very large files.

Another possibility would be using SAX, especially with XML::SAX::Machines. I've never really used that myself, but it's a stream-oriented system, so it should be able to handle large files. The downside is that you'll probably have to write more code to collect the text inside each tag that you care about (where XML::Twig will collect that text for you).

cjm
A good tutorial for Twig is at http://xmltwig.com/
Hynek -Pichi- Vychodil
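For comparison, the SAX route mentioned in the answer above can be sketched with Python's standard xml.sax module rather than XML::SAX::Machines. This is only an illustration under stated assumptions: the class and function names are hypothetical, and the target elements are assumed to contain plain text with no child elements. It also shows the extra bookkeeping the answer warns about, since the filter has to collect the text of each element itself.

```python
import hashlib
import io
from xml.sax import make_parser
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class AnonymisingFilter(XMLFilterBase):
    """Pass SAX events through, replacing the text of chosen elements."""

    def __init__(self, parent, tags):
        super().__init__(parent)
        self._tags = set(tags)
        self._buffer = None  # list of text chunks while inside a target element

    def startElement(self, name, attrs):
        if name in self._tags:
            self._buffer = []  # start collecting this element's text
        super().startElement(name, attrs)

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)  # hold the text back for hashing
        else:
            super().characters(content)

    def endElement(self, name):
        if self._buffer is not None and name in self._tags:
            text = "".join(self._buffer)
            self._buffer = None
            if text:
                digest = hashlib.md5(text.encode("utf-8")).hexdigest()
                super().characters(digest)
        super().endElement(name)


def anonymise(source, out, tags):
    """Stream source through the filter and write the result to out."""
    filt = AnonymisingFilter(make_parser(), tags)
    filt.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    filt.parse(source)


# Usage with an in-memory document; a real run would pass file objects.
sample = io.BytesIO(b"<people><person><email>a@example.com</email></person></people>")
result = io.StringIO()
anonymise(sample, result, ["email"])
```

Nothing is held in memory beyond the text of the element currently being hashed, so file size should not matter much with this approach.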
+2  A: 

Using regexps is indeed dangerous unless you know the exact format of the file, that format is easy to parse with regexps, and you are sure it will not change in the future.

Otherwise you could indeed use XML::Twig, as below. An alternative would be XML::LibXML, although the file might be a bit big to load entirely into memory (then again, maybe not; memory is cheap these days), so you might have to use its pull mode, which I don't know much about.

Compact XML::Twig code:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;
use Digest::MD5 'md5_base64';

my @tags_to_anonymize= qw( name surname address email phone);

# the handler for each element ($_) sets its content with the md5 and then flushes
my %handlers= map { $_ => sub { $_->set_text( md5_base64( $_->text))->flush } } @tags_to_anonymize;

XML::Twig->new( twig_roots => \%handlers, twig_print_outside_roots => 1)
         ->parsefile( "my_big_file.xml")
         ->flush;
mirod
I was just about to suggest XML::Twig. It's incredibly easy to transform only the parts of an XML tree you need to touch while leaving the rest alone. :)
brian d foy