Hi everyone

I found an interesting bug and wanted to know what you think. Brief background: I've written a custom DTD and an example XML file (both UTF-8). I have now implemented a SAX parser in Java which I want to test. I got a SAXException complaining "An invalid XML character (Unicode: 0x7e) was found in the public identifier". Now, the URL of my DTD does contain a tilde character (Unicode 0x7e). If I move the DTD file to another URL which does not contain a tilde, then my example XML file parses without causing a SAXException.

So I have a work-around for this problem, but I am interested to know: why does this happen? Is this a bug? If so, is it with UTF-8, Java (1.6.0_18 x86), Windows (Server 2008 R2 x86_64) or what? Or is this one of those little obscure nuances of the XML 1.0 specification?

+1  A: 

It's an obscure nuance of the XML 1.0 specification. I like the phrase!

I believe "production 13" in Extensible Markup Language (XML) 1.0 (Fifth Edition)

[13] PubidChar ::= #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%]

defines the character set allowed here.
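As a rough illustration of that production, here is a minimal Java sketch that tests a string against the PubidChar set. The class and method names (PubidCheck, isValidPublicId) are made up for this example, and the regular expression is my own transcription of production 13, not anything taken from a parser.

```java
import java.util.regex.Pattern;

class PubidCheck {
    // Character class transcribed by hand from production [13] PubidChar:
    // space, CR, LF, ASCII letters and digits, and -'()+,./:=?;!*#@$_%
    private static final Pattern PUBID_CHARS =
            Pattern.compile("[ \\r\\na-zA-Z0-9\\-'()+,./:=?;!*#@$_%]*");

    // Hypothetical helper: true if every character is an allowed PubidChar.
    static boolean isValidPublicId(String s) {
        return PUBID_CHARS.matcher(s).matches();
    }

    public static void main(String[] args) {
        // A typical formal public identifier uses only allowed characters.
        System.out.println(isValidPublicId("-//W3C//DTD XHTML 1.0 Strict//EN"));
        // The tilde (0x7e) is not in the set, so a URL like this one fails.
        System.out.println(isValidPublicId("http://www.example.com/~foo/x.dtd"));
    }
}
```

The tilde is simply absent from the list, which would explain the error message if the URL ended up in the public identifier position.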

Now that I've seen T.J. Crowder's comment, I'm unsure if this answer is correct. The section he cited does not seem to reference this rule.

This spec is indeed difficult to untangle.

Don Roby
+3  A: 

You wouldn't normally put a URI (containing ~ or not) in the public identifier. The system identifier is the one that's commonly a URI.

I suspect you're saying:

<!DOCTYPE PUBLIC "http://www.example.com/~foo/x.dtd">

when you mean:

<!DOCTYPE SYSTEM "http://www.example.com/~foo/x.dtd">
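
If you want to see the difference for yourself, the following is a small sketch using the JDK's built-in SAX parser. The root element name "root", the second literal "x.dtd" in the PUBLIC form, and the helper tryParse are assumptions added for illustration (a complete DOCTYPE needs a root name, and PUBLIC takes a system literal after the public one). The entity resolver hands back an empty DTD so nothing is fetched from the network.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class DoctypeDemo {
    // Hypothetical helper: returns null if the document parses,
    // otherwise the message of the SAXException that was thrown.
    static String tryParse(String xml) {
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Supply an empty DTD instead of fetching the real one.
                return new InputSource(new StringReader(""));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (Exception e) {
            // Configuration or I/O problems are not what this demo is about.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/~foo/x.dtd";
        // URL in the public identifier position (first literal after PUBLIC).
        String asPublic = "<!DOCTYPE root PUBLIC \"" + url + "\" \"x.dtd\"><root/>";
        // URL in the system identifier position.
        String asSystem = "<!DOCTYPE root SYSTEM \"" + url + "\"><root/>";
        System.out.println("PUBLIC: " + tryParse(asPublic));
        System.out.println("SYSTEM: " + tryParse(asSystem));
    }
}
```

I would expect the PUBLIC variant to report an error about the public identifier and the SYSTEM variant to go through, but the exact message depends on the parser implementation.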
bobince
Ah, thank you very much!
phantom-99w