views:

434

answers:

3
A: 

While &euro; is a valid XHTML entity, it is not a valid XML one.

Unfortunately, I don't know anything about JDOM, but if it is possible, you could try adding DTD entity declarations such as <!ENTITY euro "&#8364;">. You could also put all XHTML tags into their proper namespace (<parentNode xmlns:x="http://www.w3.org/1999/xhtml"><x:b>...</x:b></parentNode>).
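A minimal sketch of the entity-declaration idea, assuming the standard JAXP DOM parser rather than JDOM (which the question uses and which may need different wiring). The class and method names are hypothetical, and only &euro; is declared; any other named entity would need its own <!ENTITY ...> line in the internal subset.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class InternalSubsetDemo {

    // Hypothetical helper: wraps an XHTML fragment in <parentNode> and
    // declares the named entities it uses in an internal DTD subset.
    static Document parseFragment(String fragment) throws Exception {
        String xml =
              "<!DOCTYPE parentNode [\n"
            + "  <!ENTITY euro \"&#8364;\">\n"   // assumption: only euro is needed
            + "]>\n"
            + "<parentNode>" + fragment + "</parentNode>";

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        // The parser resolves &euro; through the declaration above.
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

Calling InternalSubsetDemo.parseFragment("<b>M&amp;A</b> &euro;") should yield a document whose root text content contains the euro sign instead of failing on an undeclared entity.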

drdaeman
That solution was considered; however, we would have to do this for all possible HTML (XHTML?) entities - http://www.cookwood.com/html/extras/entities.html
Taras
+2  A: 

I guess you can use JTidy to transform named entities to numbered ones. After that, the XHTML is also valid XML.

Tomalak
This is what I ended up doing:

* Parse the input XHTML fragment as HTML into a DOM using JTidy
* Extract all child nodes of body using XPath (/html/body/node())
* Insert the extracted nodes into the target XML DOM

The only caveat was that &apos; is a valid XHTML entity, yet not a valid HTML one. This meant that the first step wouldn't treat the sequence &apos; as an apostrophe, but rather as 6 individual characters. I fixed this by replacing all instances of &apos; with the numeric reference (bit of a hack, but it works).
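The &apos; workaround can be sketched as a small pre-processing pass that runs before the markup is handed to the HTML parser. This is plain Java, not part of JTidy; the class name, method name, and the three-entry table are illustrative assumptions, and names missing from the table (including &amp;) are deliberately left untouched.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedEntityRewriter {

    // Illustrative subset only; a real table would list every entity needed.
    private static final Map<String, Integer> CODE_POINTS = Map.of(
        "apos", 39,
        "euro", 8364,
        "nbsp", 160);

    private static final Pattern NAMED = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Replaces known named references with numeric ones, e.g. &apos; -> &#39;.
    static String toNumericReferences(String markup) {
        Matcher m = NAMED.matcher(markup);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Integer codePoint = CODE_POINTS.get(m.group(1));
            // Unknown names are copied through unchanged.
            String replacement = (codePoint == null) ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Numeric references are understood by both HTML and XML parsers, which is why the rewritten string no longer depends on the parser's entity table.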
Taras
I am sure there is a way to tell JTidy to replace all named entity references with numbered ones. On the command line this is "-n". There is also a switch to make it produce valid XML. I would think that the Java library can do the same thing.
Tomalak
Sorry, the spacing got a bit messed up above. I did find the -n property in JTidy; however, I couldn't find an option to make it parse XHTML instead of HTML. It parses the input as HTML, which means that it doesn't recognise the &apos; entity. I actually had a look at the source to see if I could add an entity, but no luck. In fact, I found the source code responsible for defining the entities (EntityTable) and discovered that &apos; was not defined (the other 252 HTML entities were).
Taras
A: 

Create a string containing

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>

+

your XHTML content, in this case <b>M&amp;A</b> &euro;

+

</html>

and then parse this string to obtain a document. Then get all the content inside the root element; that will be your XHTML content, which you can place inside your parentNode element. You may need to take into account that the content comes from a different document, so the nodes have to be imported into the target document first.
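One way this could look with the standard JAXP DOM API (not JDOM) is sketched below. Two things here are assumptions rather than part of the answer: the EntityResolver, which serves a tiny stand-in DTD so that the parser does not fetch the real XHTML DTD over the network, and the class and method names. In practice the resolver would return a locally stored copy of the XHTML DTD and its entity files instead of the one-line stand-in.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class WrapAndImportDemo {

    private static final String DOCTYPE =
          "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n"
        + "  \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n";

    // Assumption: stand-in for a locally cached copy of the XHTML DTD.
    private static final String LOCAL_DTD = "<!ENTITY euro \"&#8364;\">";

    static Element buildParentNode(String fragment) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Answer every external-entity request (here: the DTD) from memory.
        builder.setEntityResolver((publicId, systemId) ->
            new InputSource(new StringReader(LOCAL_DTD)));

        // Wrap the fragment exactly as described: DOCTYPE + <html> ... </html>.
        Document source = builder.parse(new InputSource(
            new StringReader(DOCTYPE + "<html>" + fragment + "</html>")));

        Document target = builder.newDocument();
        Element parent = target.createElement("parentNode");
        target.appendChild(parent);

        // importNode copies each child so that it belongs to the target document.
        NodeList children = source.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            parent.appendChild(target.importNode(children.item(i), true));
        }
        return parent;
    }
}
```

Whether &euro; resolves depends on the parser actually reading the DTD that declares it; if the DTD cannot be loaded, the reference is undeclared and parsing fails.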

George Bina
I tried this approach and ran into a problem: when you try to parse the string into a document, because &euro; is not an XML entity, the string essentially contains an unescaped ampersand, which is invalid XML.
Taras