I encountered a problem involving UTF-8, XML and Perl. The following is the smallest piece of code and data I could find that reproduces it.

Here's an XML file that needs to be parsed:

<?xml version="1.0" encoding="utf-8"?>
<test>
  <words>בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת</words>
  <words>בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת</words>
  <words>בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת</words>

  [<words> .... </words> repeated 148 times]

  <words>בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת</words>
  <words>בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת</words>
</test>

The parsing is done with this Perl script:

use warnings;
use strict;

use XML::Parser;
use Data::Dump;

my $in_words = 0;

my $xml_parser = XML::Parser->new(Style => 'Stream');

$xml_parser->setHandlers (
   Start   => \&start_element,
   End     => \&end_element,
   Char    => \&character_data,
   Default => \&default);

open OUT, '>out.txt' or die "Cannot open out.txt: $!";
binmode (OUT, ":utf8");
open XML, 'xml_test.xml' or die "Cannot open xml_test.xml: $!";
$xml_parser->parse(*XML);
close XML;
close OUT;


sub start_element {
  my($parseinst, $element, %attributes) = @_;

  if ($element eq 'words') {
    $in_words = 1;
  }
  else {
    $in_words = 0;
  }
}

sub end_element {
  my($parseinst, $element) = @_;   # the End handler receives no attributes

  if ($element eq 'words') {
    $in_words = 0;
  }
}

sub default {
  # nothing to see here;
}

sub character_data {
  my($parseinst, $data) = @_;

  # Write each chunk of character data found inside <words> on its own line.
  if ($in_words) {
    print OUT "$data\n";
  }
}

When the script is run, it produces the file out.txt. The problem is on line 147 of that file: the 22nd character (which in UTF-8 consists of the bytes \xd6 \xb8) is split by a newline between d6 and b8. This should not happen.
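
For anyone trying to reproduce this, one way to locate such a split is to read out.txt as raw bytes and flag every line that is not valid UTF-8 on its own. The following is only a sketch: the helper name `invalid_utf8_lines` is made up for illustration, and the default path out.txt is taken from the script above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the 1-based numbers of the lines in $path
# that cannot be decoded as strict UTF-8 when taken on their own.
sub invalid_utf8_lines {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my @bad;
    while (my $line = <$fh>) {
        # decode() with a CHECK value may modify its argument; $line is a
        # fresh copy on every iteration, so that is harmless here.
        my $ok = eval { Encode::decode('UTF-8', $line, Encode::FB_CROAK); 1 };
        push @bad, $. unless $ok;
    }
    close($fh);
    return @bad;
}

# Assumption: the file to check is out.txt unless a path is given.
my $path = @ARGV ? $ARGV[0] : 'out.txt';
if (-e $path) {
    print "line $_ is not valid UTF-8\n" for invalid_utf8_lines($path);
}
else {
    print "$path not found\n";
}
```

A character split by a newline should show up as two consecutive line numbers: the line ending in the lead byte d6 and the line starting with the continuation byte b8.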

I would like to know whether anyone else has seen this problem or can reproduce it, and why it happens. I am running this script on Windows:

C:\temp>perl -v

This is perl, v5.10.0 built for MSWin32-x86-multi-thread
(with 5 registered patches, see perl -V for more detail)

Copyright 1987-2007, Larry Wall

Binary build 1003 [285500] provided by ActiveState http://www.ActiveState.com
Built May 13 2008 16:52:49
+1  A: 

I do not observe this with

C:\Temp> perl -v

This is perl, v5.10.1 built for MSWin32-x86-multi-thread
(with 2 registered patches, see perl -V for more detail)

Copyright 1987-2009, Larry Wall

Binary build 1006 [291086] provided by ActiveState http://www.ActiveState.com
Built Aug 24 2009 13:48:26
C:\Temp> perl -MXML::Parser -e "print $XML::Parser::VERSION"
2.36
Sinan Ünür
This is interesting. I have updated ActiveState Perl to v5.10.1, and now it works as expected.
René Nyffenegger
Yes, any software with a zero as the point release is not really production ready. :)
brian d foy
+2  A: 

What happens when you open your input file with an explicit UTF-8 encoding?

 open XML, '<:utf8', 'xml_test.xml' or die;

Never trust anything to get an encoding correct by guessing. Whenever you can, explicitly add the encoding yourself.

Also, are you sure that the input is correct? Does it pass validation with another tool, such as xmllint? I know XML::Parser should catch that sort of thing, but let's verify it.

Also, can you put just the problematic input into a string and print it again without a problem? What happens when you remove just that part of the XML file? Does the same error pop up for another record?
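
The string round trip suggested above could look roughly like this. It is a sketch under stated assumptions, not the exact data from the question: `$sample` is a made-up five-character Hebrew string that contains U+05B8 (bytes d6 b8), and the output goes to a temporary file rather than out.txt.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Hypothetical sample: bet, qamats (U+05B8), resh, qamats, alef.
my $sample = "\x{05D1}\x{05B8}\x{05E8}\x{05B8}\x{05D0}";

# Write the string through an explicit UTF-8 layer instead of relying on
# any default encoding.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode($fh, ':encoding(UTF-8)') or die "binmode failed: $!";
print {$fh} "$sample\n";
close($fh) or die "close failed: $!";

# Read the file back as raw bytes so the encoded form can be inspected.
open(my $in, '<:raw', $path) or die "Cannot open $path: $!";
my $bytes = do { local $/; <$in> };
close($in);

# Print the bytes in hex; the pair "d6 b8" should appear unbroken.
my $hex = join ' ', map { sprintf '%02x', ord } split //, $bytes;
print "$hex\n";

# Decode again and compare with the original string.
my $round_trip = decode('UTF-8', $bytes);
$round_trip =~ s/\r?\n\z//;    # strip the line ending (CRLF on Windows)
print $round_trip eq $sample ? "round trip ok\n" : "round trip FAILED\n";
```

If this prints "round trip ok" on the affected machine, the plain I/O layers are probably not the culprit, and the parser becomes the more likely place to look.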

brian d foy
When I open it with `open XML, '<:utf8'...` AND remove the `binmode (OUT, ":utf8")`, the file `out.txt` is written as I expect. However, the script crashes after writing the file with an `Out of memory!` error.
René Nyffenegger
I have just checked the input, and the XML is well-formed.
René Nyffenegger
How big is this file?
brian d foy