Hello,

I have a file in XML format (it consists of just the root start and end tags and the children of the root). The text of the child elements contains the ampersand symbol &. A bare & is not allowed in XML if the document is to be well-formed, and when I tried to process the file using the DOM API in Java and an XML parser, I got parsing errors. Therefore, I replaced & with &amp;, and I processed the file successfully: I had to extract the values of the text elements into separate plain text files.

When I opened these newly created text files, I expected to see &amp;, but there was & instead. Why is this? I stored the text in files without any extension (my original file in XML format did not have an .xml extension either), and I see just & in the text of the new file, no matter how I open it: as a txt or as an xml file (these are some of the options in my XML editor). What exactly happens? Does Java (?) convert &amp; to & automatically? Or is there some default encoding? Well, &amp; stands for &, and I suppose there is some "invisible" automatic conversion, but I am confused about when and how this happens. Here are examples of my original file and of the extracted file that I get after processing the original file with Java:

This is my "negative.review" file in XML format:

<review>
<review_text>
I will not wear it as it is too big &amp; looks funny on me. 
</review_text>
</review>

This is my extracted file "negative_1":

I will not wear it as it is too big & looks funny on me.

For me it is important to keep the original data as it is (without any conversions/replacements), so I thought I would have to process the extracted file "negative_1" and convert &amp; back to &. As you see, it seems I don't have to do this. But I don't understand why :(.

Thank you in advance!

+2  A: 

Any XML parser will implicitly translate entity references such as &amp;, &lt;, and &gt; into the corresponding characters as part of parsing the file.
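
To see this with the DOM API from the question, here is a minimal sketch. The class name EntityDemo and the helper rootText are made up for illustration, and the XML is fed in as a string rather than read from the "negative.review" file; the parser calls themselves are the standard JAXP ones.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class EntityDemo {
    // Hypothetical helper: parses an XML string and returns the text
    // content of the root element (including the text of its children).
    static String rootText(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // The source text contains the escaped form &amp; ...
        String xml = "<review><review_text>too big &amp; looks funny"
                + "</review_text></review>";
        // ... but the DOM hands back the character it stands for.
        System.out.println(rootText(xml)); // too big & looks funny
    }
}
```

So the string you write to your text file never contained &amp; in the first place; the parser had already resolved it.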

Alex Martelli
+4  A: 

The reason is simple: The XML file really contains an "&" character.

It is just represented differently (i.e. it is "escaped"), because a real "&" on its own breaks XML files, as you've seen. Read the relevant section in the XML 1.0 spec: "2.4 Character Data and Markup". It's just a few lines, but it explains the issue quite well.

XML is a representation of data (!). Don't think of it as a text file. Example:

You want to store the string "17 < 20" in an XML file. As written, you can't, since the "<" is reserved as the opening tag bracket. So this would not be well-formed:

<xml>17 < 20</xml>

Solution: You escape the special/reserved character, purely to keep the file well-formed:

<xml>17 &lt; 20</xml>

For all practical purposes the above snippet contains the following data (in JSON representation this time):

{
  "xml": "17 < 20"
}

This is why you see the real "&" in your post-processing. It had been escaped in just the same way, but its meaning stayed the same all the time.

The above example also explains why the "&" must be treated specially: It is itself part of the XML escaping mechanism. It marks the start of an escape sequence, like in "&lt;". Therefore it must be escaped itself (with "&amp;", like you've done).
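
If you do the escaping by hand, as in your replace step, a rough sketch could look like the following. XmlEscape and escapeText are names I made up, and this only covers character data (text between tags), not attribute values. Note the order: "&" has to be handled first, otherwise the "&" of a freshly inserted "&lt;" would get escaped a second time.

```java
public class XmlEscape {
    // Hypothetical helper: escapes the characters that are significant
    // in XML character data. Attribute values would also need quotes escaped.
    static String escapeText(String s) {
        return s.replace("&", "&amp;")  // first, so later escapes stay intact
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println(escapeText("17 < 20 & 3 > 2"));
        // 17 &lt; 20 &amp; 3 &gt; 2
    }
}
```

In practice you rarely need this: if you build the document through the DOM and let a serializer write it out, the escaping is done for you on the way out, just as the unescaping is done on the way in.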

Tomalak
Fabulous answer... as usual! +1
Cerebrus