My database (10gR2) is single-byte (NLS_CHARACTERSET = WE8DEC).

I have a Unicode XML file that I would like to parse. If I read the file into a CLOB and try to convert it to an XMLType, Oracle chokes when the XML contains special characters (in this case Norwegian characters such as "øæå").

ORA-31011: XML parsing failed
ORA-19202: Error occurred in XML processing
LPX-00216: invalid character 184 (0xB8)

If I read the file into an NCLOB and then explicitly convert it to a CLOB using TO_CLOB, the XMLType constructor succeeds. However, this conversion produces "ugly" results. For example,

bølle gjær

becomes

bÿlle gjÿr

Is there any way I can convert a Unicode NCLOB to a single-byte CLOB and still keep the special characters intact? (I am especially interested in proper conversion of just the three Norwegian characters "øæå"; other special symbols and characters are not that important in this case.)

+1  A: 

It may be possible to re-encode the characters that do not fit into one byte as numeric character references. This is done by looking up the character's Unicode code point and placing it in a reference. For instance, A would look like &#65; and ø would look like &#248;.
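
A minimal sketch of this idea, written in Python as a preprocessing step outside the database. The language choice and the function name `to_ascii_with_char_refs` are assumptions made for illustration, not part of the original suggestion. It relies on Python's built-in `xmlcharrefreplace` error handler, which replaces every character that does not fit into ASCII with a decimal numeric character reference:

```python
def to_ascii_with_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference.

    Hypothetical helper for illustration: the result contains only 7-bit
    ASCII characters, so no byte in it depends on the database character set.
    """
    # 'xmlcharrefreplace' turns e.g. U+00F8 into the six characters "&#248;".
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    # prints: <navn>b&#248;lle gj&#230;r</navn>
    print(to_ascii_with_char_refs("<navn>bølle gjær</navn>"))
```

An XML parser resolves such references back to the original characters while parsing, so the stored text itself can stay pure ASCII. Whether the parsed values then display correctly still depends on the database character set being able to represent them.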

Adam Hawkes
+1  A: 

TO_CLOB is supposed to convert from the national character set to the database character set correctly. You should not have any problem as long as every character can be mapped.

I therefore suspect that your problem occurs in the "read the file into an NCLOB" step. "Unicode" is rather vague information:

  • XML files are very often encoded in the UTF-8 character set (with or without Byte Order Mark).
  • National character set is set to UTF-16 (AL16UTF16) on Oracle by default.

An explicit conversion is needed to go from one to the other. You should first make sure that the NCLOB holding your XML file actually contains the correct characters.
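
To illustrate why the file's actual encoding matters, here is a small Python sketch. Python and the helper name `sniff_unicode_encoding` are assumptions for illustration, not something taken from the question. It checks the byte order mark to tell UTF-8 from UTF-16 and prints the raw bytes of the sample text. In UTF-8, "ø" is the two bytes 0xC3 0xB8, and the second byte matches the 0xB8 reported by LPX-00216, which is consistent with UTF-8 bytes being read as single-byte characters.

```python
import codecs


def sniff_unicode_encoding(raw: bytes) -> str:
    """Guess the Unicode encoding of raw file bytes from the byte order mark.

    Hypothetical helper: without a BOM it simply assumes UTF-8, which is the
    XML default but still only a guess.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # UTF-8 with BOM; this codec strips the BOM on decode
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"  # the utf-16 codec reads the BOM to pick the byte order
    return "utf-8"


if __name__ == "__main__":
    text = "bølle gjær"
    for encoding in ("utf-8", "utf-16"):
        raw = text.encode(encoding)
        # Show the raw bytes and what the sniffer makes of them.
        print(encoding, raw.hex(), "->", sniff_unicode_encoding(raw))
```

Decoding the file with the detected encoding before loading it should show whether the characters were already damaged before TO_CLOB was ever called.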

Mac
A: 

I don't know the exact answer to your question, but this technique may be a useful starting point.

Here is a query I use to convert from one character set to another.

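-- CONVERT(value, destination_charset, source_charset):
-- here NAME is converted from WE8DEC to WE8ISO8859P1.
SELECT CONVERT(NAME, 'WE8ISO8859P1', 'WE8DEC')
  FROM my_table  -- placeholder table name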

Try with one of these as the destination character set:

NE8ISO8859P10 ISO 8859-10 North European

NEE8ISO8859P4 ISO 8859-4 North and North-East European

This page lists the Oracle 8i NLS settings.
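
One way to check whether a candidate single-byte character set can hold the three Norwegian letters is sketched below in Python. This is only an approximation made for illustration. It uses Python's `latin-1`, `iso8859_10` and `iso8859_4` codecs as assumed stand-ins for WE8ISO8859P1, NE8ISO8859P10 and NEE8ISO8859P4. Python has no codec for WE8DEC, and the function name `can_encode` is hypothetical.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)  # raises UnicodeEncodeError on an unmappable character
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # Python codec names used as assumed equivalents of the Oracle character sets.
    for encoding in ("latin-1", "iso8859_10", "iso8859_4"):
        print(encoding, can_encode("øæå", encoding))
```

A character set that reports False for any of the letters cannot serve as a lossless conversion target for them.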

Jean-Philippe Martin