ansaurus

Question

How can I strip invalid XML characters from strings in Perl?

Answer 1

+3 A:

If you use an XML library to build your XML (as opposed to string concatenation, simple templates, etc), then it should take care of that for you. There is no point in reinventing the wheel.

David Dorward 2009-06-19 08:38:16

@David: do these libraries simply strip the control characters from the incoming string?

AnthonyWJones 2009-06-19 08:46:04

As far as I'm aware, XML::LibXML doesn't do anything to text node content apart from reject it if it contains invalid characters. I'd be surprised if the other libraries did anything either.

Nic Gibson 2009-06-19 15:06:09

newt, that's the point of using an XML library in the first place.

Leonardo Herrera 2009-06-19 16:33:42

Of course it is, but he was asking about how to ensure that he didn't get this problem by ensuring that the text content didn't contain invalid characters.

Nic Gibson 2009-06-19 19:30:27

@newt: I'm not completely sure what you mean by "this problem". I see XML::LibXML stripping out the "illegal" characters, except for nul, which it treats as the end of the data :(

ysth 2009-06-19 20:40:13

Answer 2

A:

You could use a regular expression to remove control characters. For example, \cH and \cL (equivalently \x08 and \x0C) match backspace and form feed respectively.

AnthonyWJones 2009-06-19 08:45:20

Answer 3

A:

You can use a simple regex to find all control characters in your chunk of text, either replacing them with a space or removing them altogether:

# Replace all control characters with a space
$text =~ s/[[:cntrl:]]/ /g;

# or remove them
$text =~ s/[[:cntrl:]]//g;

muteW 2009-06-19 09:15:51

...which also strips linefeeds - so not very useful :)

AndrewR 2009-06-19 09:23:19

Ouch, didn't think about the linefeeds. newt's answer seems ok then for what you're trying to do.

muteW 2009-06-19 10:57:21
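One way to keep the POSIX class while sparing the whitespace that XML allows is a negative lookahead. This is only a minimal sketch and not something posted in the thread; the sample string and variable names are made up for illustration. Note that [[:cntrl:]] also matches DEL (\x7F), which XML 1.0 does permit, so this removes slightly more than strictly necessary.

```perl
use strict;
use warnings;

# Illustrative input: a backspace and a form feed mixed in with real whitespace.
my $text = "line one\x08\n\tline two\x0C\r\n";

# Remove control characters, but never tab, linefeed or carriage return:
# the lookahead rejects those three before [[:cntrl:]] gets to match them.
(my $clean = $text) =~ s/(?![\t\n\r])[[:cntrl:]]//g;

print $clean;
```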

Answer 4

+5 A:

As almost everyone else has said, use a regular expression. It's honestly not complex enough to be worth adding to a library. Preprocess your text with a substitution.

Your comment about linefeeds above suggests that the formatting is of some importance to you so you will possibly have to decide exactly what you want to replace some characters with.

The list of invalid characters is clearly defined in the XML spec (here - http://www.w3.org/TR/REC-xml/#charsets - for example). Within the ASCII range, the disallowed characters are the control characters \x00-\x1F, bar carriage return, linefeed and tab. So you are looking at a 29-character regular expression character class. That's not too bad, surely.

Something like:

$text =~ s/[\x00-\x08\x0B\x0C\x0E-\x1F]//g;

should do it.
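If the formatting matters, the choice of replacement can be made a parameter. The following is a rough sketch under that assumption: scrub_controls is a hypothetical helper name, and the class covers \x00-\x1F minus tab, linefeed and carriage return (the 29 characters mentioned above).

```perl
use strict;
use warnings;

# Hypothetical helper: replace each XML 1.0-illegal ASCII control character
# with $replacement. The default is the empty string, i.e. delete it.
sub scrub_controls {
    my ($text, $replacement) = @_;
    $replacement = '' unless defined $replacement;
    $text =~ s/[\x00-\x08\x0B\x0C\x0E-\x1F]/$replacement/g;
    return $text;
}

my $deleted = scrub_controls("a\x08b\tc\n");        # control char removed
my $spaced  = scrub_controls("a\x08b\tc\n", ' ');   # control char becomes a space
```

Replacing with a space keeps words from running together when a control character was acting as a separator; deleting is the better fit when the characters are pure noise.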

Nic Gibson 2009-06-19 09:46:49

Yep. This is pretty much what I ended up doing.

AndrewR 2009-06-19 12:43:39

I must admit that I only posted after I'd searched CPAN because I was convinced that RE must be in Regexp::Common somewhere!

Nic Gibson 2009-06-19 14:39:10

Answer 5

+2 A:

Translate is a lot faster than regex substitution, especially if all you want to do is delete characters. Using newt's set:

$string_to_clean =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F//d;

A test like this:

use Benchmark qw(cmpthese);

cmpthese 1_000_000
       , { translate => sub { 
               my $copy = $text; 
               $copy =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F//d; 
           }
           , substitute => sub { 
               my $copy = $text; 
               $copy =~ s/[\x00-\x08\x0B\x0C\x0E-\x1F]//g; 
           }
         };

yielded:

                Rate substitute  translate
substitute  287770/s         --       -86%
translate  2040816/s       609%         --

And the more characters I needed to delete the faster tr got in relation.
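A side benefit of tr is that it returns the number of characters it matched, so you can tell whether anything was stripped without a second pass. A small sketch, with an invented sample string:

```perl
use strict;
use warnings;

# Illustrative input containing a NUL and a unit separator (\x1F).
my $string_to_clean = "ab\x00cd\x1Fef\n";

# tr/// returns the number of characters it matched (and, with /d, deleted).
my $removed = ($string_to_clean =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F//d);

print "stripped $removed control character(s)\n" if $removed;
```

The trade-off is that the character list in tr is fixed at compile time; it does not interpolate variables, so a set that varies at run time still needs a regex.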

Axeman 2009-06-19 14:21:45

Absolutely true - I generally don't use tr// because it's so limited but this is certainly an appropriate use.

Nic Gibson 2009-06-19 14:40:31

Me too. I practically never have the need for the pared-down abilities of tr. But if I don't care about where the character occurs, I'm going to use it from now on--although, I'm not sure how likely I am to run into that case.

Axeman 2009-06-19 15:33:27

Yes, it's a lot faster, but 287770/s is plenty fast.

ysth 2009-06-19 15:53:20

Answer 6

A:

I haven't done a lot of work with XML containing "invalid" characters before, but it seems to me you have two completely separate problems here.

First, there are characters in your data that you may not want. You should decide what those are and how you want to remove/replace them independent of any XML restrictions. For instance, you may have things like x^H_y^H_z^H_ where you decide you want to strip both the backspace and the following character. Or it's possible that you in fact don't want to adjust your data but feel forced to by the need to represent it in XML.
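As a rough illustration of that first kind of clean-up, here is one way the overstrike case could be handled, assuming you really do want to drop each backspace together with the character that follows it. The sample data and variable names are invented for the example.

```perl
use strict;
use warnings;

# "x^H_y^H_z^H_": each letter is followed by backspace + underscore.
my $text = "x\x08_y\x08_z\x08_";

# Drop every backspace together with the character after it.
# /s lets "." match a newline too, in case one follows a backspace.
(my $plain = $text) =~ s/\x08.//gs;

# A backspace at the very end has nothing after it; remove any stragglers.
$plain =~ tr/\x08//d;

print "$plain\n";
```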

Update: I've preserved the following paragraphs for posterity, but they are based on a misunderstanding: I thought you could include any character in XML data so long as you encoded it properly, but it seems there are some characters that are outright verboten, even encoded? XML::LibXML strips these out (at least the current version does so), except for the nul character, which it treats as the end of the string, discarding it and anything that follows :(

Second, you may have characters in your data that you've kept that need encoding in XML. Ideally, whatever XML module you use would do this for you, but if it isn't, you should be able to do it manually, with something like:

use HTML::Entities "encode_entities_numeric";
$encoded_string = encode_entities_numeric( $string, "\x00-\x08\x0B\x0C\x0E-\x19");

But that's really just a stopgap measure. Use a proper XML module; see for instance this answer.

ysth 2009-06-19 16:14:00

Answer 7

+3 A:

Okay, this seems to be already answered, but what the hey. If you want to author XML documents, you must use an XML library.

#!/usr/bin/perl
use strict;
use XML::LibXML;

my $doc = XML::LibXML::Document->createDocument('1.0');
$doc->setURI('http://example.com/myuri');
$doc->setDocumentElement($doc->createElement('root-node'));

$doc->documentElement->appendTextChild('text-node',<<EOT);
    This node contains &, ñ, á, <, >...
EOT

print $doc->toString;

This produces the following:

$ perl test.pl
<?xml version="1.0"?>
<root-node><text-node>    This node contains &amp;, &#xF1;, &#xE1;, &lt;, &gt;...
</text-node></root-node>

Edit: I now see that you are already using XML::LibXML. This should do the trick.

Leonardo Herrera 2009-06-19 16:38:25

Thanks for the example; I was a little shocked at the comment that claimed XML::LibXML didn't handle this for you.

ysth 2009-06-19 18:03:26

Of course it does. But the original question was about removing the characters that will cause XML::LibXML to reject the content (characters below ASCII space bar the whitespace chars). This is not quite the same thing.

Nic Gibson 2009-06-19 19:32:51

"use strict" is nice, but warnings are even more important. Don't forget -w or "use warnings"!

ysth 2009-06-19 20:24:18

hmmm ... just came across this ... XML::LibXML does not handle this if you use $node->appendText( $str ) ... but does if you use $parent->appendTextChild( 'node', $str ) ... weirdness

derby 2009-10-19 14:18:35

Answer 8

+3 A:

The complete regex for removal of invalid xml-1.0 characters is:

# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
$str =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go;

for xml-1.1 it is:

# allowed: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
$str =~ s/[^\x01-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go;
# restricted:[#x1-#x8][#xB-#xC][#xE-#x1F][#x7F-#x84][#x86-#x9F]
$str =~    s/[\x01-\x08\x0B-\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]//go;
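If you need the XML 1.0 class in more than one place, it can be compiled once with qr// and reused for both detection and removal. This is just a sketch: the sub names are hypothetical, and it assumes $str holds decoded characters rather than raw UTF-8 bytes.

```perl
use strict;
use warnings;

# Everything outside the XML 1.0 Char production, compiled once.
my $INVALID_XML10 =
    qr/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/;

# Hypothetical helper: true if the string contains any disallowed character.
sub has_invalid_xml10 {
    my ($str) = @_;
    return $str =~ $INVALID_XML10 ? 1 : 0;
}

# Hypothetical helper: return a copy with disallowed characters removed.
sub strip_invalid_xml10 {
    my ($str) = @_;
    $str =~ s/$INVALID_XML10//g;
    return $str;
}

my $dirty = "ok\x00\x0B\x1F\tstill ok\n";
my $clean = strip_invalid_xml10($dirty);
```

Checking first with has_invalid_xml10 is handy if you would rather log or reject bad input than silently alter it.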

Heiko 2009-09-25 13:34:42

Answer 9

A:

I don't have points to comment on the answers above, but they're not working!!

$ perl -e 'print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<root>\x{A0}\x{A0}</root>"' > invalid.xml
$ perl -e 'use XML::Simple; XMLin("invalid.xml")'
invalid.xml:2: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xA0 0xA0 0x3C 0x2F
$ perl -ne 's/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go; print' invalid.xml > valid.xml
$ perl -e 'use XML::Simple; XMLin("valid.xml")'
invalid.xml:2: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xA0 0xA0 0x3C 0x2F

In fact, the two files invalid.xml and valid.xml are identical.

The thing is that the range "\x20-\x{D7FF}" matches valid representations of those unicode characters, but not e.g. the invalid character sequence "\x{A0}\x{A0}". Does anybody have a clue on how to actually solve this problem?

Juan Antonio 2010-06-11 19:13:43

Answer 10

A:

I've found a solution, but it uses the iconv command instead of perl.

$ iconv -c -f UTF-8 -t UTF-8 invalid.utf8 > valid.utf8
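Something similar can probably be done in Perl itself with the core Encode module. The sketch below is an approximation of iconv -c, not a drop-in equivalent: repair_utf8 is a hypothetical name, and the approach relies on decode() substituting U+FFFD for malformed bytes, then deletes those substitutes. That also deletes any genuine U+FFFD already present in the input.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: take raw bytes, return well-formed UTF-8 bytes.
sub repair_utf8 {
    my ($bytes) = @_;

    # With the default CHECK, malformed sequences decode to U+FFFD.
    my $chars = decode('UTF-8', $bytes);

    # Drop the substitution characters (and any real U+FFFD, as a side effect).
    $chars =~ tr/\x{FFFD}//d;

    return encode('UTF-8', $chars);
}

my $fixed = repair_utf8("<root>\xA0\xA0</root>");
```

Once the data has been decoded like this, the character-class substitutions from the earlier answers operate on characters instead of bytes, which is what they expect.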

Juan Antonio 2010-06-11 19:31:20
