So I need to edit some text in a Word document. I created a Word document and saved it as XML. It is saved correctly (I can open the XML file in MS Word and it looks exactly like the docx original).

So then I use PHP DOM to edit some text in the file (just two lines). (EDIT: below is the already fixed, working version.)

<?php

$firstName = 'Richard';
$lastName = 'Knop';

$xml = file_get_contents('template.xml');

$doc = new DOMDocument();
$doc->loadXML($xml);
$doc->preserveWhiteSpace = false;

$wts = $doc->getElementsByTagNameNS('http://schemas.openxmlformats.org/wordprocessingml/2006/main', 't');

$c1 = 0; $c2 = 0;
foreach ($wts as $wt) {

    if (1 === $c1) {
        $wt->nodeValue .= ' ' . $firstName;
        $c1++;
    }

    if (1 === $c2) {
        $wt->nodeValue .= ' ' . $lastName;
        $c2++;
    }

    if ('First Name' === substr($wt->nodeValue, 0, 10)) {
        $c1++;
    }

    if ('Last Name' === substr($wt->nodeValue, 0, 9)) {
        $c2++;
    }

}

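// Serialize the modified DOM back to a string; otherwise the unmodified input would be written
$xml = $doc->saveXML();

// Replace UNIX with DOS line endings
$xml = str_replace("\n", "\r\n", $xml);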

$fp = fopen('final-xml.xml', 'w');
fwrite($fp, $xml);
fclose($fp);

This gets executed properly (no errors). These two lines:

<w:t>First Name:</w:t>
<w:t>Last Name:</w:t>

Get replaced with these:

<w:t>First Name: Richard</w:t>
<w:t>Last Name: Knop</w:t>

However, when I try to open the final-xml.xml file in MS Word, it doesn't open (Word freezes). Any suggestions?

EDIT:

I tried comparing the two files with levenshtein():

$xml = file_get_contents('template.xml');
$xml2 = file_get_contents('final-xml.xml');

$str = str_split($xml, 255);
$str2 = str_split($xml2, 255);

$i = 0;
foreach ($str as $s) {
    $dist = levenshtein($s, $str2[$i]);
    if (0 <> $dist) {
        echo $dist, '<br />';
    }
    $i++;
}

This printed nothing, which is weird: when I open the final-xml.xml file in Notepad, I can clearly see that those two lines have changed.
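
A chunked edit distance does not show where two files diverge. A more direct check is to look for the first differing byte and to count the line-ending styles in each file. The following is only a minimal sketch; the helper names `firstDifference` and `countLineEndings` are hypothetical and not part of the original code:

```php
<?php

// Hypothetical helper: offset of the first differing byte, or -1 if the strings are identical.
function firstDifference($a, $b)
{
    $max = min(strlen($a), strlen($b));
    for ($i = 0; $i < $max; $i++) {
        if ($a[$i] !== $b[$i]) {
            return $i;
        }
    }
    // Either the strings are equal or one is a prefix of the other.
    return strlen($a) === strlen($b) ? -1 : $max;
}

// Hypothetical helper: counts DOS (\r\n) and bare UNIX (\n) line endings.
function countLineEndings($text)
{
    $dos = substr_count($text, "\r\n");
    $unix = substr_count($text, "\n") - $dos;
    return array('dos' => $dos, 'unix' => $unix);
}

// Small inline samples; in practice these would come from file_get_contents().
$a = "<w:t>First Name:</w:t>\r\n";
$b = "<w:t>First Name: Richard</w:t>\n";

var_dump(firstDifference($a, $b));
print_r(countLineEndings($a));
print_r(countLineEndings($b));
```

Running the same two helpers on the contents of template.xml and final-xml.xml would show both where the text changed and whether the line endings differ between the files.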

EDIT2:

Here is the template.xml file: http://uploading.com/files/61b2922b/template.xml/

+4  A: 

This is a problem related to DOS vs. UNIX line endings. Word 2007 does not tolerate \n line endings and requires \r\n, whereas Word 2010 is more tolerant and accepts both.

To fix the problem, make sure that you replace all UNIX line breaks with DOS ones before saving the output file:

$xml = str_replace("\n", "\r\n", $xml); 
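
One caveat: if the string already contains \r\n, a plain str_replace() turns it into \r\r\n. The following is a minimal sketch of a normalization that handles mixed input, assuming a hypothetical helper name `toDosLineEndings` that is not part of the code above:

```php
<?php

// Hypothetical helper: converts every line ending (\r\n, lone \r, or lone \n) to \r\n.
function toDosLineEndings($text)
{
    // \r\n is listed first so that existing DOS endings are matched as a unit and not doubled.
    return preg_replace('/\r\n|\r|\n/', "\r\n", $text);
}

// Mixed input: one UNIX ending and one DOS ending.
var_dump(toDosLineEndings("a\nb\r\nc") === "a\r\nb\r\nc"); // bool(true)
```

For output produced by saveXML() the simple str_replace() is normally sufficient; the regular expression only matters when the input may already contain DOS line endings.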

Full sample:

<?php

$firstName = 'Richard';
$lastName = 'Knop';

$xml = file_get_contents('template.xml');

$doc = new DOMDocument();
$doc->loadXML($xml);
$doc->preserveWhiteSpace = false;

$wts = $doc->getElementsByTagNameNS('http://schemas.openxmlformats.org/wordprocessingml/2006/main', 't');

foreach ($wts as $wt) {
    // Debug output: print the text of each <w:t> element
    echo $wt->nodeValue;

    if ('First Name:' === $wt->nodeValue) {
        $wt->nodeValue = 'First Name: ' . $firstName;
    }

    if ('Last Name:' === substr($wt->nodeValue, 0, 10)) {
        $wt->nodeValue = 'Last Name: ' . $lastName;
    }
}

$xml = $doc->saveXML();

// Replace UNIX with DOS line endings
$xml = str_replace("\n", "\r\n", $xml); 

$fp = fopen('final-xml.xml', 'w');
fwrite($fp, $xml);
fclose($fp);
?>
0xA3
Great! You are a genius. Thanks!
Richard Knop
A: 

As far as I recall, Word XML files have certain checksums stored near the top of the DOM. You may have to change these, such as the size or the general checksum itself.

I know this was my problem when I was dumb enough to make an HTML file in Word and save it. It had thousands of useless things in it that only served to make editing worse.

Charles Broughton
These are not checksums; they are just metadata that Word updates once the document is saved again.
0xA3