Morning,

I am importing data from an XML file into my database, but I have an issue with German words (which are in the XML by mistake).

For example, the word für appears in my XML as fÃ¼r and thus appears the same in my database.

I know I could do a simple search/replace for that exact phrase, but I was wondering if there was a smarter way to do it as I can't predict if any other German words may one day appear in the XML?

ADDING SOME MORE DETAIL

The XML source says:

<?xml version="1.0" encoding="UTF-8" ?> 

and in my PHP I have

$domString = utf8_encode($dom->saveXML($element));

If I look into the XML file before I start reading it, it has -

 <title> - <![CDATA[ CoPilot Live v8 Europa für Android 8.0.0.644 ]]> </title> 

Thanks.

Greg

+1  A: 

Use the same encoding everywhere and there will be no such problems. And if you have to choose an encoding: use UTF-8!

If you can't change it (for whatever reason), you have to use utf8_decode to get the correct values.

oezi
This is partially correct, but it is not the reason why this is happening. If you cannot change the encoding, then you have to dig a bit deeper.. :)
danp
I believe I am using UTF-8 everywhere, I've added some more detail to my question...
kitenski
+1  A: 

This normally happens when UTF-8 data is decoded as ISO-8859-1, for example. In UTF-8 the German umlaut ü is represented by two bytes; in ISO-8859-1 it is one byte. The two bytes get decoded one by one, resulting in an Ã and a ¼. Your task would be this:

  • read the XML's bytes
  • decode them using UTF-8
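To make the two-byte explanation concrete, here is a minimal sketch that reproduces the garbling and then reverses it. It assumes the mbstring extension is available and that the string was mis-decoded exactly once; the variable names are illustrative only.

```php
<?php
// "ü" encoded as UTF-8 is the two bytes 0xC3 0xBC.
$utf8 = "f\xC3\xBCr";

// Misreading those bytes as ISO-8859-1 and converting to UTF-8 again
// turns each byte into its own character: 0xC3 -> "Ã", 0xBC -> "¼".
$mojibake = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// Reversing that single wrong step restores the original bytes.
$restored = mb_convert_encoding($mojibake, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";      // 66c3bc72
echo bin2hex($mojibake), "\n";  // 66c383c2bc72
echo bin2hex($restored), "\n";  // 66c3bc72
```

The reversal only works if you know how many times the data was converted, which is why avoiding the extra conversion in the first place is the safer route.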

Check http://www.utf8-zeichentabelle.de/ for byte values.

However, all in all, trying to repair this after the fact is a pretty bad idea. You end up guessing the encoding, not to mention that wrongly encoded/decoded characters may get encoded/decoded yet again... good luck!

Tim Büthe
A: 

Don't forget that if you are using DOMDocument, then no matter what encoding your script is in, it converts everything internally to UTF-8.
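A small sketch of what that means for the code in the question: since utf8_encode assumes its input is ISO-8859-1, applying it to a string that DOMDocument has already produced as UTF-8 would encode it a second time. The sample document below is hypothetical (only the title text is taken from the question), and the check assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical sample input; "\xC3\xBC" is "ü" as UTF-8 bytes.
$xml = '<?xml version="1.0" encoding="UTF-8"?>'
     . "<title><![CDATA[ CoPilot Live v8 Europa f\xC3\xBCr Android 8.0.0.644 ]]></title>";

$dom = new DOMDocument();
$dom->loadXML($xml);
$element = $dom->documentElement;

// Use the serialized node as-is, without wrapping it in utf8_encode().
$domString = $dom->saveXML($element);

// The umlaut is still the original two UTF-8 bytes.
$hasUmlaut = strpos($domString, "f\xC3\xBCr") !== false;
$isUtf8    = mb_check_encoding($domString, 'UTF-8');

var_dump($hasUmlaut, $isUtf8);
```

If the database connection is also set to UTF-8, the string should then be stored unchanged.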

Also if you are using htmlentities, unless you specifically tell it to, it will use ISO-8859-1 encoding by default. Took me a while to figure this out!
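As a sketch of telling htmlentities which encoding to use: the ISO-8859-1 default applies to PHP versions before 5.4, and from 5.4 onward the default is UTF-8, so passing the charset explicitly keeps the behaviour the same either way. The variable names here are illustrative.

```php
<?php
$text = "f\xC3\xBCr";  // "für" as UTF-8 bytes

// Naming the charset explicitly avoids depending on the version default.
$escaped = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $escaped, "\n";  // f&uuml;r
```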

Useful comment here, also from a German-language perspective.

danp