I am using PHP's SimpleXML to work with XML files on my server. I only need to read the contents of the XML (I have no need to modify it), so I stuck with the simple and easy-to-use SimpleXML. But SimpleXML is having problems reading a certain XML file because it contains some very strange characters. I get the following errors:

Warning: simplexml_load_file() [function.simplexml-load-file]: data/data.xml:348: parser error : PCDATA invalid Char value 3 in C:\xampp\htdocs\VMP\xintel\analyzer.php on line 54

Warning: simplexml_load_file() [function.simplexml-load-file]: Jardin al fte. Hall de recepcion, amplio living comedor. ocina comedor diario c in C:\xampp\htdocs\VMP\xintel\analyzer.php on line 54

I have no control over what goes into the XML file, so I can't stop these characters from being added to it, and I don't know how to solve this issue. The file is supposed to be encoded in UTF-8, so I tried things like converting from UTF-8 to ISO-8859-1 and the reverse, but it made no difference.

Can somebody help me out? Should I try to change the encoding? Should I try to remove those characters? Anything?

Edit: The strange characters are all box-drawing characters (see: http://en.wikipedia.org/wiki/Box-drawing%5Fcharacters)

+3  A: 

I have an app that receives XML from untrusted sources, many of which send me unencoded ampersands. To solve the problem, I have an intermediate filter that does a single linear pass and gets rid of / encodes characters where necessary. I don't know if that is possible for you but I think it's a pretty reasonable solution.
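
As a rough illustration of such a filter (a sketch, not the code the answer actually uses), the function below escapes ampersands that do not start an entity or character reference. The name escape_bare_ampersands is made up for this example, and the pattern is not CDATA-aware: inside a CDATA section a bare ampersand is already legal, so this only suits text outside CDATA.

```php
<?php
// Hypothetical helper: escapes ampersands that are not already part of
// an entity reference (&name;) or a character reference (&#38; / &#x26;).
function escape_bare_ampersands(string $xml): string
{
    // The negative lookahead leaves existing references untouched.
    $pattern = '/&(?!(?:[A-Za-z_][\w.-]*|#\d+|#x[0-9A-Fa-f]+);)/';

    return preg_replace($pattern, '&amp;', $xml);
}
```

The cleaned string can then be handed to simplexml_load_string() instead of loading the file directly.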

danben
Ok, I understand what you are saying. But I am not sure how to do that. Does your app change the encoding or does it actually replace problematic characters with character encodings?
VinkoCM
danben
I think that is what I will do. What would be best, though, is a way to re-encode the whole XML file so that in the future the script does not crash when it encounters a character I have not checked for.
VinkoCM
Of course, this filter step doesn't have to be done entirely in memory. You could overwrite the XML on disk.
danben
A: 

Normally all characters in an XML file are interpreted by the parser unless they are inside a CDATA section.

If that is not the case, your XML is invalid.

Patrick
The question is how to handle such an invalid XML file when it's not under your control.
ceejayoz
I can say for sure that all the text in the xml is placed in CDATA blocks. So all these characters are found within CDATA.
VinkoCM
Do you have a sample XML file?
Patrick
Well, yes, but like I said in a previous comment, I can't even paste in the strange characters.
VinkoCM
Can't you provide the file somewhere?
Patrick
A: 

I had a similar problem in a similar configuration (where I had no control whatsoever over the content of the file). I solved it by preprocessing the XML, replacing the extraneous characters with spaces.

I used string substitution, but for more complex needs you could use a regex.
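
A minimal sketch of the regex variant, assuming the offending bytes are control characters: the "Char value 3" in the warning refers to code point 3, and XML 1.0 forbids such control characters even inside CDATA. The function name and the sample document are invented for illustration.

```php
<?php
// Hypothetical helper: replaces the control characters that XML 1.0
// forbids with a space. Tab (0x09), LF (0x0A) and CR (0x0D) are legal
// and left alone. These bytes never occur inside multi-byte UTF-8
// sequences, so a byte-wise pattern is safe for UTF-8 input.
function replace_invalid_xml_chars(string $xml): string
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', ' ', $xml);
}

// Made-up sample with a 0x03 byte inside a CDATA section.
$raw = "<ad><text><![CDATA[living\x03comedor]]></text></ad>";

$doc = simplexml_load_string(replace_invalid_xml_chars($raw));
if ($doc !== false) {
    echo (string) $doc->text, "\n";
}
```

For the file in the question, this would mean reading data/data.xml with file_get_contents() and passing the cleaned string to simplexml_load_string() rather than calling simplexml_load_file().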

mac
So could I do something like $content = str_replace("", "", $content); How do I properly specify the box-drawing characters?
VinkoCM
It appears that stackoverflow would not let me put the box-drawing character in the previous comment.
VinkoCM
+1  A: 

Maybe you could pass the input through Tidy to make it well-formed. One simple step of pre-processing before you feed the file to SimpleXML.

For example, tidy::repairFile looks promising.
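
A sketch of what that call might look like, assuming the tidy extension is installed. Whether Tidy actually removes the characters that break this particular file is untested here and would need to be checked. The wrapper name load_repaired_xml is made up; input-xml, output-xml and wrap are standard Tidy options.

```php
<?php
// Hypothetical wrapper: repairs the file with Tidy, then parses the result.
// Returns a SimpleXMLElement on success or false on failure.
function load_repaired_xml(string $path)
{
    if (!extension_loaded('tidy')) {
        throw new RuntimeException('The tidy extension is not available.');
    }

    $config = [
        'input-xml'  => true, // treat the input as XML rather than HTML
        'output-xml' => true, // emit well-formed XML
        'wrap'       => 0,    // do not re-wrap long lines
    ];

    // The third argument names the encoding in Tidy's own notation.
    $repaired = tidy::repairFile($path, $config, 'utf8');
    if ($repaired === false) {
        return false;
    }

    return simplexml_load_string($repaired);
}
```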

Tomalak