Hello, everyone!

I am referring to the XML 1.1 spec.

Look at the definition of NameStartChar:

NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]

If I interpret this correctly, the last range (#x10000-#xEFFFF) goes beyond what a single Java char (a 16-bit UTF-16 code unit) can hold. So these must be full code points, as in UTF-32, right? That would mean I need to check pairs of chars (surrogate pairs) against this range instead of single chars, right?

My questions are:

  • How do I check for such character ranges using standard Java methods?
  • How is it possible to define such ranges in JavaCC?
    • JavaCC complains about \u10000 and \uEFFFF

Thank you!

NOTE: Don't worry, I am not trying to write my own XML parser.
EDIT: I am writing a parser that checks whether text input from miscellaneous (non-XML) text formats matches valid XML names.

+2  A: 

Have a look at Character.toCodePoint(char, char), which converts a surrogate pair into a single code point in the full Unicode range. String.codePointAt(int) may well be useful to you, too.
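As a rough illustration of those two methods, here is a minimal sketch. The class name SurrogateDemo and its helper methods are made up for this example; only Character.toCodePoint, String.codePointAt and String.codePointCount are standard library calls.

```java
public class SurrogateDemo {
    // Combine a high/low surrogate pair into one code point.
    static int fromPair(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    // Read the code point that starts at the given char index.
    static int at(String s, int index) {
        return s.codePointAt(index);
    }

    public static void main(String[] args) {
        // 'a' followed by U+10000, which Java stores as the surrogate pair D800 DC00.
        String s = "a\uD800\uDC00";

        System.out.println(Integer.toHexString(fromPair('\uD800', '\uDC00')));
        System.out.println(Integer.toHexString(at(s, 1)));

        // length() counts chars (UTF-16 code units), codePointCount counts code points.
        System.out.println(s.length());
        System.out.println(s.codePointCount(0, s.length()));
    }
}
```

The point of the last two lines is that a string containing a supplementary character has more chars than code points, so any range check against #x10000-#xEFFFF has to work on int code points rather than on char values.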

There's a lot of other surrogate support within Character and String. To know exactly which methods to call, we'd need to know the exact details of your situation.
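For the use case described in the question's EDIT (checking whether text is a valid XML name), one possible approach is to walk the string by code point and compare each one against the ranges. This is only a sketch under assumptions: the class XmlNameCheck and its method names are hypothetical, the NameStartChar ranges are copied from the production quoted in the question, and the extra NameChar ranges are taken from the XML 1.1 spec rather than from the question, so they should be checked against the spec before relying on them.

```java
public class XmlNameCheck {
    // Inclusive ranges from the NameStartChar production quoted in the question.
    private static final int[][] NAME_START = {
        {':', ':'}, {'A', 'Z'}, {'_', '_'}, {'a', 'z'},
        {0xC0, 0xD6}, {0xD8, 0xF6}, {0xF8, 0x2FF}, {0x370, 0x37D},
        {0x37F, 0x1FFF}, {0x200C, 0x200D}, {0x2070, 0x218F},
        {0x2C00, 0x2FEF}, {0x3001, 0xD7FF}, {0xF900, 0xFDCF},
        {0xFDF0, 0xFFFD}, {0x10000, 0xEFFFF}
    };

    // Additional ranges allowed after the first character (NameChar).
    // Assumption: taken from the XML 1.1 spec, not from the question.
    private static final int[][] NAME_EXTRA = {
        {'-', '-'}, {'.', '.'}, {'0', '9'}, {0xB7, 0xB7},
        {0x300, 0x36F}, {0x203F, 0x2040}
    };

    private static boolean inRanges(int cp, int[][] ranges) {
        for (int[] r : ranges) {
            if (cp >= r[0] && cp <= r[1]) {
                return true;
            }
        }
        return false;
    }

    static boolean isNameStartChar(int cp) {
        return inRanges(cp, NAME_START);
    }

    static boolean isNameChar(int cp) {
        return isNameStartChar(cp) || inRanges(cp, NAME_EXTRA);
    }

    // True if s is non-empty, starts with a NameStartChar,
    // and every following code point is a NameChar.
    static boolean isName(String s) {
        if (s.isEmpty() || !isNameStartChar(s.codePointAt(0))) {
            return false;
        }
        int i = Character.charCount(s.codePointAt(0));
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (!isNameChar(cp)) {
                return false;
            }
            // Advance by 2 chars for a surrogate pair, 1 otherwise.
            i += Character.charCount(cp);
        }
        return true;
    }
}
```

An unpaired surrogate is returned by codePointAt as its own value (somewhere in #xD800-#xDFFF), which falls outside every range above, so such input is rejected without special handling.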

Jon Skeet
Thank you. OK, I clarified my intentions at the bottom of my question (see **EDIT**).
java.is.for.desktop