I'm trying to read an XML file from a VBScript script. The XML is encoded in UTF-8 and has the appropriate header.

In the script I use the Microsoft XMLDOM parser to read the XML:

Dim objXMLDoc
Set objXMLDoc = CreateObject( "Microsoft.XMLDOM" )
objXMLDoc.load("vbs_strings.xml")

Inside the XML I write a character by its code using the &#nnn; notation. Then I read this character from VBScript and try to get its code using the Asc() function. For some characters this works fine and the code I read equals the one I wrote. But for other characters Asc() always returns 63. What could be the cause?

Examples:

If the XML contains <section>&#195;</section> and in the script I have a Section variable representing this XML node, then the code:

Asc(Section.Text)

will return value 195 and it's ok.

If the XML contains <section>&#110;</section> then the code:

Asc(Section.Text)

will return value 110 and it's ok.

But if the XML contains <section>&#130;</section> or <section>&#156;</section> or <section>&#140;</section>

Asc(Section.Text)

will return value 63 and it's definitely not good.

Do you know why?

+1  A: 

The decimal code points 130, 156, and 140 do not correspond to any printable character in Unicode: code points 128-159 are C1 control characters. The default code page mapping that Asc uses has no equivalent for them, so it substitutes ?, which is character 63. What characters do you think these code points map to?

I suspect that the codes you want are: &#8218; &#339; and &#338;
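
A minimal sketch of that suggestion, assuming the system's ANSI code page is Windows-1252 (on other locales Asc may return something else). It uses loadXML with an inline string rather than the asker's vbs_strings.xml, purely for illustration:

```vbscript
Option Explicit
Dim doc, ch
Set doc = CreateObject("Microsoft.XMLDOM")
doc.async = False

' &#8218; is U+201A, the character Windows-1252 stores as byte 130 (0x82)
If doc.loadXML("<section>&#8218;</section>") Then
    ch = doc.documentElement.text
    WScript.Echo "AscW: " & AscW(ch)   ' the Unicode code point, 8218
    WScript.Echo "Asc:  " & Asc(ch)    ' ANSI code; 130 is expected only under Windows-1252 (assumption)
Else
    WScript.Echo "Parse error: " & doc.parseError.reason
End If
```

Run with cscript so the output goes to the console instead of message boxes.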

AnthonyWJones
You're probably right, so up-voting. I updated my answer.
bkail
Thank you! I understand now why it happens. I need these codes because I'm trying to store Russian and Japanese UTF-8 characters. For example, the Russian letter 'М' is represented by &#208;&#156;. P.S. I know that I could store these letters directly in the UTF-8 XML, but I need them as bytes.
vkjr
@vkjr: Glad you've got it sorted, however I'm not really sure I understand the difference between "letters" and "bytes" ;)
AnthonyWJones
If I write the Russian word "Мар" in a UTF-8 XML file, it is represented by 6 bytes. When I read this word from VBScript using the Microsoft XMLDOM parser, I get a 3-character string, whereas what I actually need is the byte representation: a string of 6 ANSI characters. That's why I store the string as a set of independent characters, one for each byte. I'm not sure whether my explanation is understandable :) Sorry if not.
vkjr
@vkjr: Sounds like something in your system is doing things wrong, otherwise you wouldn't need this. My guess is that ASP is involved here and Response.Codepage/Response.CharSet are not being set correctly. It smells very much like sending UTF-8 to something that thinks it's getting ANSI. What you really need is to let the other end know "you are getting UTF-8"; then you can stop all this awkwardness.
AnthonyWJones
+1  A: 

Use AscW instead:

http://msdn.microsoft.com/en-us/library/zew1e4wc%28VS.80%29.aspx

EDIT: That said, AnthonyWJones is likely correct that your document is either using the wrong character references or has misdeclared the input encoding.
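
A short sketch comparing the two functions on the code points from the question. ChrW is used here to build each character directly, standing in for the text read from the XML node. The Asc column depends on the system's ANSI code page, so no particular values are claimed for it:

```vbscript
Option Explicit
Dim codes, code, ch
codes = Array(110, 195, 130, 156, 140)

For Each code In codes
    ch = ChrW(code)   ' the character with this Unicode code point
    ' AscW returns the code point unchanged; Asc first maps it to the ANSI code page
    WScript.Echo code & ": Asc=" & Asc(ch) & "  AscW=" & AscW(ch)
Next
```

AscW echoes back each input value, because no code page conversion takes place.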

bkail
This will return the correct numerical value, but the encoding in the XML is still incorrect. AscW works because it does not attempt to map the code point from one code page to another; it merely returns the value found.
AnthonyWJones
That's the point of character references: to reference characters that otherwise cannot be represented in the input document (either carriage returns, single/double quotes in the wrong context, or because the input encoding cannot represent the character).
bkail
Thanks, pointing me to the AscW() function helped a lot!
vkjr