I'm reading an XML file with dom4j. The file looks like this:

...
<Field>&#13;&#10; hello, world...</Field>
...

I read the file with SAXReader into a Document. When I call getText() on the node, I obtain the following String:

\r\n hello, world...

I do some processing and then write another file using asXML(). But the characters are not escaped as they were in the original file, which causes an error in the external system that consumes the file.

How can I escape the special characters so that &#13;&#10; appears when writing the file?

+1  A: 

You cannot easily. Those aren't 'escapes', they are 'character entities'. They are a fundamental part of XML. Xerces has some very complex support for 'unparsed entities', but I doubt that it applies to these, as opposed to the species that are defined in a DTD.

bmargulies
OK, so maybe you know how to escape the newline into '&#13;&#10;' when writing a new XML file? I am using org.dom4j.Document.asXML().
woezelmann
Only by post-processing the XML. I'm very rusty on dom4j.
bmargulies
+1  A: 

It depends on what you're getting and what you want (see my previous comment.)

The SAX reader is doing nothing wrong: your XML is giving you a literal newline character. If you control this XML, then instead of the newline characters, you will need to insert a \ (backslash) character followed by the "r" or "n" characters (or both).

If you do not control this XML, then you will need to do a literal conversion of the newline character to "\r\n" after you've gotten your string back. In C# it would be something like:

myString = myString.Replace("\r\n", "\\r\\n");
Andy Shellam
My problem is that I am reading an XML file containing '&#13;&#10;', doing some conversion, and then writing a new XML file. In this new XML file I would like to have '&#13;&#10;' again. I don't want something like "\r\n" or "\\r\\n".
woezelmann
So why are you worried about escaping them then? I believe with Xerces (certainly in the C++ version) if you encode the actual literal newline character, it will come out as you had previously. If you escape them before you re-encode it, then you'll get the characters "\r\n" in your XML instead of &#13;&#10;. Incidentally, a double backslash in C# does come out as a single backslash in a string; it's a way of telling the compiler not to treat it as an escape sequence.
Andy Shellam
+1  A: 

XML entities are abstracted away in DOM. Content is exposed as a String without the need to bother about the encoding, which in most cases is what you want.

But SAX has some support for controlling how entities are processed. You could try to create an XMLReader with a custom EntityResolver#resolveEntity and pass it as a parameter to the SAXReader. But I fear it may not work:

The Parser will call this method before opening any external entity except the top-level document entity (including the external DTD subset, external entities referenced within the DTD, and external entities referenced within the document element)

Otherwise you could try to configure a LexicalHandler for SAX so that you are notified when an entity is encountered. The Javadoc for LexicalHandler#startEntity says:

Report the beginning of some internal and external XML entities.

You will not be able to change the resolving, but that may still help.
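To see what the parser actually reports for the question's input, here is a minimal probe using only the JDK's built-in SAX parser (not dom4j). The class name EntityProbe and its fields are made up for illustration. It collects the text passed to characters() and the names passed to startEntity(). I would not expect numeric character references such as &#13; to show up in the entity list with a default parser configuration, but that is an assumption worth checking on your own parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper: records element text and any entity boundaries reported.
class EntityProbe extends DefaultHandler2 {
    final StringBuilder text = new StringBuilder();
    final List<String> entities = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one text node, so accumulate.
        text.append(ch, start, length);
    }

    @Override
    public void startEntity(String name) {
        entities.add(name);
    }

    static EntityProbe parse(String xml) throws Exception {
        EntityProbe probe = new EntityProbe();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(probe);
        // Standard SAX2 property id for registering a LexicalHandler.
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", probe);
        reader.parse(new InputSource(new StringReader(xml)));
        return probe;
    }

    public static void main(String[] args) throws Exception {
        EntityProbe probe = parse("<Field>&#13;&#10; hello, world...</Field>");
        // Make the control characters visible when printing.
        System.out.println(probe.text.toString().replace("\r", "\\r").replace("\n", "\\n"));
        System.out.println("entities reported: " + probe.entities);
    }
}
```

The text arrives with the real carriage return and line feed characters already decoded, which is why the original spelling has to be re-created at write time rather than preserved at read time.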

EDIT

You must read and write XML with the SAXReader and XMLWriter provided by dom4j. See "reading an XML file" and "writing an XML file". Don't call asXML() and then dump the file yourself.

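import java.io.FileOutputStream;
import org.dom4j.io.OutputFormat;
import org.dom4j.io.XMLWriter;

FileOutputStream fos = new FileOutputStream("simple.xml");
OutputFormat format = OutputFormat.createPrettyPrint();
XMLWriter writer = new XMLWriter(fos, format);
writer.write(doc);  // doc: the Document read with SAXReader
writer.flush();
writer.close();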
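If the external system really needs &#13;&#10; in the output, the escaping itself is simple. Below is a rough, dom4j-free sketch of it; CharRefEscaper and escapeText are names I made up. An XML parser normalizes a literal CR LF to a single LF when reading, so a carriage return only survives the round trip if it is written as a character reference.

```java
// Hypothetical helper, not part of dom4j: escapes element text so that
// CR and LF are written as numeric character references.
class CharRefEscaper {
    static String escapeText(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;"); break;
                case '<':  sb.append("&lt;");  break;
                case '>':  sb.append("&gt;");  break;
                case '\r': sb.append("&#13;"); break;
                case '\n': sb.append("&#10;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeText("\r\n hello, world..."));
    }
}
```

As far as I remember, dom4j's XMLWriter has a protected escapeElementEntities(String) method, so a subclass overriding it to call something like escapeText above might be the cleanest hook. Treat that as an assumption and check it against your dom4j version. Also note that a pretty-print OutputFormat may trim or normalize whitespace in text, which could drop the leading newline before it ever gets escaped.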
ewernli
OK, I edited my question; maybe you know how to do this. It would also solve my problem.
woezelmann