This is a really weird problem. It's taken me practically all day to whittle it down to a small executable script that demonstrates the problem fully.

Problem Summary: I'm using XML::Twig to pull a data snippet from an XML file, then inserting that snippet into the middle of another piece of data; let's call that the parent data. The parent data starts with a strange non-printable character. It's vendor-supplied data, so I cannot control it. My problem is that after I insert the snippet into the parent data, the final product has a new non-printable character at its beginning, in addition to the one it started with. This new character was in neither the parent data nor the child snippet. I don't know where it's coming from or how it's getting into my data.

I doubt that it is an XML::Twig bug, because the string corruption occurs while reading a line from a filehandle in a while loop. However, I have been unable to reproduce the problem once I remove the XML::Twig code from my script, so I had to leave it in.

This is my first experience with non-printable characters in strings that I'm trying to process. Do I need to handle them specially rather than treating them as ordinary strings?

I'm using ActiveState Perl 5.10.1 and XML::Twig 3.32 (latest) and the Eclipse 3.5.1 IDE on Windows XP.

Here is a script that demonstrates the problem:

use strict; 
use warnings; 
use XML::Twig; 

my $FALSE = 0;
my $TRUE = 1;
my $name = 'KurtsProgram';
my $task = 'MainTask';
my $hidden_char = "\xBF";
my $data = $hidden_char . 
'(*********************************************
  Data-File-Header-Junk
**********************************************)

    PROGRAM MainProgram ()
    END_PROGRAM

    TASK SecondaryTask ()
    END_TASK

    TASK MainTask ()
        MainProgram;
    END_TASK
';
my $new_data = insertProgram( $name, $task, $data );

# test to see if results start out as expected
if ( $new_data =~ m/^\Q$hidden_char\E/ ) {
    print "SUCCESS\n";
}
else {
    print STDERR "ERROR: What happened?\n";
    print STDERR "ORIGINAL: \n$data\n";
    print STDERR "MODIFIED: \n$new_data\n";
}

sub insertProgram {
    my ( $local_name, $local_task, $local_data ) = @_;

    # get program section from XML template
    my $twig = XML::Twig->new;
    $twig->parse( '<?xml version="1.0"?>
<TemplateSet>
    <PROGRAM>PROGRAM <Name>ProgramNameGoesHere</Name> ()
    END_PROGRAM</PROGRAM>
    <TASK>TASK <Name>TaskNameGoesHere</Name> ()
    END_TASK</TASK>
</TemplateSet>
' );   
    my $program = $twig->root->first_child('PROGRAM');

    # replace program name in XML template
    $program->first_child('Name')->set_text($local_name);
    my $insert = $program->text();

    # stick modified program into data
    if ( $local_data =~ s/(\s+PROGRAM\s+[^\s]+\s+\()/\n\n    $insert $1/ ) {
        # found it and inserted new program
    }
    else {
        # not found
        return;
    }

    # add program name to task list
    my $added_program_to_task = $FALSE;
    my $found_start = $FALSE;
    my $found_end = $FALSE;
    my $new_data = "";
    # open string as a filehandle for line by line processing
    my $filehandle;
    open( $filehandle, '<', \$local_data )
        or die("Can't open string as a filehandle: $!");
    while (defined (my $line = <$filehandle>)) {
        # look for start of our task
        if ( 
               ( !$found_start ) &&
               ( $line =~ m/\s+TASK\s+\Q$local_task\E\s+\(/ )
            ) {
            # found the task!
            $found_start = $TRUE;
        }

        # look for end of our task
        if (
                ( $found_start ) && ( !$found_end ) &&
                ( $line =~ m/\s+END_TASK/ )
            )
        {
            # found the end tag for the task section!
            $found_end = $TRUE;

            # add the program name to the bottom of the list
            $line = "        " . $local_name . ";\n" . $line;
            $added_program_to_task = $TRUE;
        }

        # compile new data from processed line or original line
        $new_data = $new_data . $line;
    }
    close($filehandle);

    if ($added_program_to_task) {
        # success
    }
    else {
        # unable to find task
        return;
    }

    return $new_data;
}

When I run this script, I get the following output:

ERROR: What happened?
ORIGINAL: 
¿(*********************************************
      Data-File-Header-Junk
    **********************************************)

        PROGRAM MainProgram ()
        END_PROGRAM

        TASK SecondaryTask ()
        END_TASK

        TASK MainTask ()
            MainProgram;
        END_TASK

MODIFIED: 
¿(*********************************************
      Data-File-Header-Junk
    **********************************************)

        PROGRAM KurtsProgram ()
        END_PROGRAM 

        PROGRAM MainProgram ()
        END_PROGRAM

        TASK SecondaryTask ()
        END_TASK

        TASK MainTask ()
            MainProgram;
            KurtsProgram;
        END_TASK

You can see the extra character that was added to the front of the data right under the M in MODIFIED.

+6  A: 

It has done an ISO-8859-1 to UTF-8 encoding conversion on the character: \xBF -> \xC2\xBF.
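
The conversion itself can be reproduced in isolation with the core Encode module. This is only a minimal sketch of the byte-level effect, not the code path XML::Twig actually runs, and the variable names are illustrative:

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character, code point 0xBF (the inverted question mark in ISO-8859-1).
my $char = "\xBF";

# Encoding that single character as UTF-8 yields two bytes: 0xC2 0xBF.
my $utf8_bytes = encode( 'UTF-8', $char );

printf "characters: %d, UTF-8 bytes: %d\n", length($char), length($utf8_bytes);
printf "bytes: %s\n", join ' ', map { sprintf '%02X', ord } split //, $utf8_bytes;
```

The extra \xC2 is the "new" non-printable character that shows up in front of the original \xBF.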

XML::Twig converts all its input to UTF-8 (see the XML::Twig documentation).

You could tell Twig to keep the input encoding using the `keep_encoding` option (see also the XML::Twig FAQ entry "My XML documents/data are produced by tools that do not grok Unicode, will XML::Twig help me there?").
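
As a rough sketch, the option is passed to the constructor. The template below is a cut-down copy of the one in the question, and the final check is an assumption about what `keep_encoding` should produce rather than something verified against the asker's setup:

```perl
use strict;
use warnings;
use XML::Twig;

# keep_encoding asks XML::Twig to leave character data in its original
# encoding instead of converting it to UTF-8.
my $twig = XML::Twig->new( keep_encoding => 1 );
$twig->parse( '<?xml version="1.0"?>
<TemplateSet>
    <PROGRAM>PROGRAM <Name>ProgramNameGoesHere</Name> ()
    END_PROGRAM</PROGRAM>
</TemplateSet>
' );

my $program = $twig->root->first_child('PROGRAM');
$program->first_child('Name')->set_text('KurtsProgram');
my $insert = $program->text();

# Assumption: with keep_encoding the extracted text is a plain byte string,
# so it should no longer force the parent data into UTF-8 when merged.
print "UTF-8 flag: ", ( utf8::is_utf8($insert) ? "on" : "off" ), "\n";
```

If the flag is reported as off, substituting `$insert` into the vendor data should leave the leading \xBF byte untouched.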

But perhaps it would be better to keep the UTF-8, or perhaps silently drop the character, depending on what exactly you're going to do with it.

mercator
The data is going right back into a vendor's application, so the special nonprintable character at the front of the data needs to remain exactly as it was originally.
Kurt W. Leucht
In that case `keep_encoding` should do the job.
mercator
A: 

I can't really make sense of your code; it is still too complex to debug quickly. But maybe the problem has to do with a BOM (see the Unicode BOM FAQ), which would be ignored at the beginning of an XML document but not if you copy it into the middle of another one. I'm just guessing here because of the \xBF value, which is part of the BOM for a UTF-8 document (\xEF\xBB\xBF).
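
One way to test the BOM guess is to look at the first three bytes of the data. The sketch below assumes the data is held as a byte string; `starts_with_utf8_bom` is a hypothetical helper name, not part of any module:

```perl
use strict;
use warnings;

# The UTF-8 encoding of the byte order mark is the three bytes EF BB BF.
my $UTF8_BOM = "\xEF\xBB\xBF";

# Hypothetical helper: returns 1 if a byte string begins with a UTF-8 BOM,
# 0 otherwise.
sub starts_with_utf8_bom {
    my ($bytes) = @_;
    return substr( $bytes, 0, 3 ) eq $UTF8_BOM ? 1 : 0;
}

# A string with a full BOM, and one with only a lone \xBF as in the question.
my $with_bom    = starts_with_utf8_bom("\xEF\xBB\xBF(* header *)");
my $with_xbf    = starts_with_utf8_bom("\xBF(* header *)");

print "full BOM: $with_bom, lone \\xBF: $with_xbf\n";
```

A lone \xBF at the start of the data, as in the script from the question, does not count as a complete BOM under this check.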

mirod