I have an array of XML files. I need to iterate through them, find certain 'unrecognized' hexadecimal characters, and replace each one with normal UTF-8 text or some kind of placeholder.

I've tried iterating through the files and replacing the hex codes using both str_replace and preg_replace, with no luck. My ultimate problem is that I'm receiving errors about 'non-UTF characters' when trying to open these files with SimpleXML.

Here's what I have so far:

class HexadecimalConverter {

    public $filenames = array();

    public function __construct($filenames) {

        $this->filenames = $filenames;
        $this->removeHex();

    }

    public function removeHex() {

        foreach ($this->filenames as $key => $value) {

            $contents = file_get_contents($value);

            $contents = preg_replace("/\x96/", '–', $contents);
            $contents = preg_replace("/\x97/", '—', $contents);
            $contents = preg_replace("/\x85/", "...", $contents);
            $contents = preg_replace("/\xBA/", "", $contents);

            file_put_contents($value, $contents);

        }

    }

}

Here is the error I'm trying to fix: Warning: simplexml_load_file() [function.simplexml-load-file]: ./04R_P455_S1157.xml:5: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0x97 0x0D 0x0A 0x69 in C:\xampp\htdocs\hint_updater\libraries\hint_updater_classes.php on line 130

Still no luck. I've tried everything suggested in this thread, but preg_replace doesn't appear to be replacing all instances of the hex codes.

A: 

You should first read the preg_replace docs. They clearly state that the function returns the modified string, so you will have to change every preg_replace line in your code to $contents = preg_replace(...); to make your replacements work. Right now you're doing the replacement but throwing the resulting string away, so in the end you write the original string back to the file.
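
A minimal sketch of the difference, using a made-up three-byte string rather than one of the real XML files. The variable names are illustrative only:

```php
<?php
// Raw bytes: "a", 0x96 (the en dash in Windows-1252), "b".
$contents = "a\x96b";

// Result discarded: preg_replace does not modify its argument in place.
preg_replace("/\x96/", "-", $contents);
$unchanged = $contents;

// Result assigned: the replacement is actually kept.
$contents = preg_replace("/\x96/", "-", $contents);

var_dump($unchanged === "a\x96b"); // bool(true)
var_dump($contents === "a-b");     // bool(true)
```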

wimvds
A: 

preg_replace returns the new string.

Try $contents = preg_replace("/\x96/", '–', $contents); and the like.

Borealid
Sorry, that was a typo... I had just reinserted the preg_replace. Even with the proper $contents = before each preg_replace, it still doesn't seem to go through and replace all of the instances of these hex codes.
ThinkingInBits
Are you sure you didn't want `foreach ($this->filenames as $value)`? That's the only other thing I can think to be wrong with this code.
Borealid
Shouldn't matter... This just gives me the index along with the value.
ThinkingInBits
Attempting to change this anyways... since it's my last hope :)
ThinkingInBits
Still no luck with this
ThinkingInBits
Are you sure your code *isn't* working? Remember that there are several different ways to produce a grave accent... For instance, using a combining accent character, or typing a letter with the accent already in place. Check the hex output coming from the program - it might already have removed the special characters you specified. At the very least, explain how it's still not working...
Borealid
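
One way to check the hex output, sketched under the assumption that it is enough to list every byte above 0x7F together with its offset. findHighBytes is a hypothetical helper, not part of the original class, and the sample string mimics the bytes quoted in the parser error:

```php
<?php
/**
 * Hypothetical helper: map byte offset => hex value for every byte >= 0x80.
 * Note that valid multi-byte UTF-8 sequences are listed too, not only stray bytes.
 */
function findHighBytes(string $contents): array
{
    $found = [];
    // Without the /u modifier PCRE matches single bytes;
    // PREG_OFFSET_CAPTURE records where each match starts.
    preg_match_all('/[\x80-\xFF]/', $contents, $matches, PREG_OFFSET_CAPTURE);
    foreach ($matches[0] as [$byte, $offset]) {
        $found[$offset] = strtoupper(bin2hex($byte));
    }
    return $found;
}

// Sample input: "it", 0x97, CR, LF, "i".
print_r(findHighBytes("it\x97\r\ni")); // reports byte 97 at offset 2
```

Running this on the string that is about to be written back shows whether any of the targeted bytes survived the replacements.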
When I attempt to open the XML files with simpleXML following the string replace, I'm getting the error specified in my main post.
ThinkingInBits
What I meant was to save the post-replacement string and check if it still contains invalid characters. If it doesn't, they weren't your problem to begin with.
Borealid
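
The check described above can be sketched as follows. It assumes that any input which is not valid UTF-8 is Windows-1252, which fits the bytes in the question (0x96, 0x97 and 0x85 are the en dash, em dash and ellipsis in that code page) but has not been confirmed for these files. toUtf8 is a hypothetical helper, and the iconv extension is assumed to be available:

```php
<?php
/**
 * Hypothetical helper: return $contents as UTF-8.
 * Assumption: anything that is not already valid UTF-8 is Windows-1252.
 */
function toUtf8(string $contents): string
{
    // An empty pattern with the /u modifier only matches if the subject is valid UTF-8.
    if (preg_match('//u', $contents) === 1) {
        return $contents; // already valid, leave untouched
    }

    $converted = iconv('Windows-1252', 'UTF-8', $contents);
    if ($converted === false) {
        throw new RuntimeException('Input could not be converted from Windows-1252');
    }
    return $converted;
}

// 0x97 is a single byte in Windows-1252 and three bytes (E2 80 94) in UTF-8.
var_dump(toUtf8("a\x97b") === "a\xE2\x80\x94b"); // bool(true)
```

Converting the whole string avoids listing every problem byte by hand. If individual replacements are still preferred, spelling the replacement as byte escapes such as "\xE2\x80\x93" may be safer than a literal dash, because a script saved in a non-UTF-8 encoding could write the same single byte back.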