Hey, I'm trying to use VTD-XML to parse XML given to it as a String, but I can't find how to do it. Any help would be appreciated.

http://vtd-xml.sourceforge.net

+1  A: 

It seems the VTD-XML library reads its input from a byte array. In that case, I'd suggest converting the String to bytes using the correct encoding.

If an encoding is declared at the beginning of the XML string:

<?xml version="1.0" encoding="UTF-8"?>

Then use that:

myString.getBytes("UTF-8")

If no encoding is declared, add one so that VTD-XML knows how to decode the bytes:

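import com.ximpleware.VTDGen;

// Prepend a declaration so the parser knows how the bytes are encoded.
String withHeader = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + myString;
byte[] bytes = withHeader.getBytes("UTF-8");
VTDGen vg = new VTDGen();
vg.setDoc(bytes);
vg.parse(true); // true = namespace-aware; throws ParseException on malformed input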

Note that in the latter case you can use any valid encoding, as long as the declaration and the getBytes call name the same one, because the string you have in memory is encoding-agnostic (internally it's UTF-16, but it is converted when you ask for the bytes).
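
The two cases above (declaration present, declaration absent) can be folded into one helper. This is only a sketch using the standard library: the class name `XmlStringBytes` and its methods are hypothetical, the regular expression is a simplification rather than a full XML declaration parser, and falling back to UTF-8 when nothing is declared is an assumption based on UTF-8 being the XML default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of VTD-XML.
final class XmlStringBytes {
    // Simplified match for the encoding pseudo-attribute of a declaration at the very start of the text.
    private static final Pattern DECLARED_ENCODING = Pattern.compile(
            "^<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Returns the charset named in the declaration, or UTF-8 when none is declared. */
    static Charset declaredCharset(String xml) {
        Matcher m = DECLARED_ENCODING.matcher(xml);
        // Charset.forName throws UnsupportedCharsetException for names this JVM does not know.
        return m.find() ? Charset.forName(m.group(1)) : StandardCharsets.UTF_8;
    }

    /** Encodes the string so that the bytes agree with what the declaration claims. */
    static byte[] toBytes(String xml) {
        return xml.getBytes(declaredCharset(xml));
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>\u00e9</a>";
        byte[] bytes = toBytes(xml);
        System.out.println(declaredCharset(xml) + ", " + bytes.length + " bytes");
    }
}
```

The resulting array is what you would hand to `vg.setDoc(bytes)` in the snippet above.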

helios
What method do I then use to load it? setDoc?
Concept
Yes, setDoc works after the conversion.
vtd-xml-author
Got it working. Thanks! Yeah, it's a Java String object. It's a really fast parser, and I wasn't happy with the block of if statements that SAX requires. The whole token layout is really handy.
Concept
I'll add the setDoc call to the answer for documentation purposes.
helios
+1  A: 

VTD-XML doesn't accept a String because a String implies UTF-16 encoding, which means it is not really an XML document. As defined by the spec, XML is usually encoded in UTF-8, ASCII, ISO-8859-1, UTF-16LE or UTF-16BE. Does my answer make sense?

vtd-xml-author
Not really... You declare the encoding of the XML file in the <?...?> header. A String is encoded in memory as UTF-16, but you can transform it to match the required encoding.
helios
If by string you mean Java's String object, then I stand by my answer. If by string you mean an array of bytes, then you are right about using <? ?> to decide the encoding. I feel the question is really about Java's String object, but I could be wrong.
vtd-xml-author
Does your answer make sense? No. It's possible that the string may contain a prolog which declares an encoding, as helios's answer suggested. So to convert the string to bytes which are suitable for the parser to use, you would have to extract that encoding first, as helios said. But normally it's the parser's job to determine the encoding. All of the parsers I regularly use can accept a Reader as input, which means the parser can ignore the encoding issues because it already gets chars. So if VTD-XML doesn't have a way of parsing from a Reader then it isn't "advanced and powerful".
Paul Clapham
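
To illustrate the point about Readers, here is a small sketch that parses a String through a Reader. It uses the JDK's built-in JAXP DOM parser, not VTD-XML, and the class name `ReaderParseDemo` is hypothetical. Because the parser receives characters rather than bytes, the ISO-8859-1 declaration in the sample is not used to decode anything.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses the JDK's JAXP parser, not VTD-XML.
final class ReaderParseDemo {
    /** Parses XML held in a String via a character stream and returns the root element name. */
    static String rootName(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // An InputSource backed by a Reader supplies chars, so no byte decoding takes place.
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTagName();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><greeting>h\u00e9llo</greeting>";
        System.out.println(rootName(xml));
    }
}
```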
@Paul: thanks for the comment. I think we should agree on what a string is first. The prolog tells the parser what the encoding is, so that the byte-to-char conversion can happen properly. An XML document is an array of bytes. A Reader is just one way to look at it, but not the only one, right? So using a Reader to judge the merit of a parser sounds like a weak argument...
vtd-xml-author
I don't think there's any debate about what a string is. And I agree with your unstated argument that it's kind of peculiar to declare the encoding of something which isn't encoded, but it does happen and I don't think it's unusual. But I don't think it should be especially hard for an XML parser to deal with a Reader, and I do think that a parser which makes grandiose claims for itself should be able to do that little thing.
Paul Clapham
If we agree on what a string is, then the string representation of XML is no longer well-formed XML. That is the point I am trying to make... Grandiose or not, VTD-XML parses an XML document and has its own traits and characteristics... Hope you understand.
vtd-xml-author
To add to this very abstract discussion: an XML document can be the "abstract XML" (without encoding) or its representation encoded in bytes (and including a <?header?>). So a String containing <elements>...</elements> is, for me, valid enough XML (because it's the abstract ideal). As for why the parser can't parse Strings, I think that for optimization it works with the original representation and offsets. Two mutually exclusive options arise: using the byte[] or using the String. The more basic one is byte[] (because of files), so Strings must be converted first (they could provide a converter anyway).
helios
But I agree that it could receive a String and 1) call getBytes(), 2) "pretend" that <?xml encoding="UTF-16"?> was read. It would only be a convenience method, given that String.getBytes always creates a byte[] you could create yourself.
helios
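
As a rough illustration of step 1 in that idea, this is what Java produces when a String is asked for its UTF-16 bytes. The class name `Utf16BytesDemo` is hypothetical, only the standard library is used, and whether a given parser accepts such bytes without a matching declaration is not something this sketch shows.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing the bytes behind String.getBytes(UTF_16).
final class Utf16BytesDemo {
    /** Encodes with Java's "UTF-16" charset, which writes a big-endian byte-order mark first. */
    static byte[] utf16WithBom(String xml) {
        return xml.getBytes(StandardCharsets.UTF_16);
    }

    /** Renders bytes as uppercase hex so the BOM is easy to spot. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints FEFF003C0061002F003E: the BOM (FE FF), then two bytes per character
        System.out.println(hex(utf16WithBom("<a/>")));
    }
}
```

The leading FE FF is the byte-order mark a byte-oriented parser can use to recognize UTF-16, so any declaration in the text would need to say UTF-16 as well, not UTF-8.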
Oh, I didn't realize I was arguing with the author of VTD! (wow).
helios
@helios: thanks for the suggestion. In our view, well-formedness applies to the byte representation of XML, as defined by the XML spec, and converting a string into a byte array removes any ambiguity. As for your comment on pretending UTF-16 was read, very interesting idea! I will have to think about it...
vtd-xml-author