I have an XML document in a specific format that will be pushed to me. The document will always be of the same type, so the format is very strict.

I need to parse this so that I can convert it into JSON (well, a slightly bastardized version so someone else can use it with Dojo).

My question is: should I use a very fast, lightweight XML parser (no need for SAX, etc.; any ideas?) or write my own, basically reading the document into a StringBuffer and spinning through the characters? Under the covers, I assume all XML parsers spin through the string (or memory buffer) and parse it, producing output on the way through.

Thanks

//edit

Thanks for the responses so far :)

The XML will be between 3 or 4 lines and about 50 lines at the extreme.

+1  A: 

You can use Dom4j or XStream to read the XML into an equivalent Java model and then use Json-lib to convert it into JSON.

Teja Kantamneni
+ Dom4j has a SAX-like API but is easier to use.
Ondra Žižka
+5  A: 

No, you should not try to write your own XML parser for this.

SAX itself is very lightweight and fast, so I'm not sure why you think it's too much. Also, using a string buffer would actually be much less scalable than using SAX, because SAX doesn't require you to load the whole XML file into memory. I've used SAX to parse multigigabyte XML files, which you wouldn't be able to do using string buffers on a 32-bit machine.

If you have small files and you don't need to worry about performance, look into using the DOM. Java's implementation can be kind of annoying to use (you create a document by using a DocumentBuilder, which comes from a DocumentBuilderFactory).
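To illustrate the SAX approach for the XML-to-JSON case in the question, here is a minimal sketch using only the JDK's built-in SAX parser. The `<items>`/`<item id="...">` layout, the JSON field names, and the class and method names are all made up for illustration, since the real document format isn't shown, and the escaping only covers quotes and backslashes.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxToJson {
    // Minimal JSON string escaping (backslash and double quote only);
    // control characters are NOT handled in this sketch.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Converts <items><item id="..">text</item>...</items> (hypothetical format)
    // into a JSON array, writing output as SAX events arrive.
    static String convert(String xml) throws Exception {
        final StringBuilder json = new StringBuilder("[");
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private String id = "";
            private boolean first = true;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("item".equals(qName)) {
                    String value = atts.getValue("id");
                    id = (value == null) ? "" : value;
                    text.setLength(0); // start collecting this item's text
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, so accumulate.
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("item".equals(qName)) {
                    if (!first) {
                        json.append(',');
                    }
                    first = false;
                    json.append("{\"id\":\"").append(escape(id))
                        .append("\",\"name\":\"").append(escape(text.toString()))
                        .append("\"}");
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return json.append(']').toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<items><item id=\"1\">Apple</item><item id=\"2\">Pear</item></items>";
        System.out.println(convert(xml));
    }
}
```

Because the handler appends to the output as each element closes, nothing beyond the current element's text is held in memory, which is the property that makes SAX scale to very large files.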

The code to create a document from a file looks like this:

Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new FileInputStream("file.xml"));

(note that keeping a reference to your document builder will speed things up if you need to parse multiple files)

Then you use the methods in org.w3c.dom.Document to read or manipulate the contents. For example, getElementsByTagName() returns all the Elements with a certain tag name.
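As a rough sketch of the DOM steps described above, the following parses a small in-memory document and reads it with getElementsByTagName(). The `<item id="...">` elements and the class and method names are hypothetical, chosen only to have something to parse.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DomSketch {
    // Returns "id=text" for every <item> element, in document order.
    static List<String> itemSummaries(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        NodeList nodes = doc.getElementsByTagName("item");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element item = (Element) nodes.item(i);
            result.add(item.getAttribute("id") + "=" + item.getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<items><item id=\"1\">Apple</item><item id=\"2\">Pear</item></items>";
        System.out.println(itemSummaries(xml));
    }
}
```

For documents of a few dozen lines, as in the question, holding the whole tree in memory like this should not be a concern.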

Chad Okere
I suspect that by "lightweight", Joe means "easy to use"; SAX's callback-oriented API is not the most user-friendly.
Michael Borgwardt
I would have +'ed this up more if I could. SAX is about the most efficient way possible to read XML in Java. You'd be hard pressed to write a better correct XML parser. It should be possible to write the callback to produce the JSON directly, I would think. If there is little translation then it may be extremely tiny.
PSpeed
@Michael Borgwardt: I think using the DOM would be easier than writing your own parser :)
Chad Okere
But DOM is _definitely_ not light-weight. For this sort of translation from one format to another, SAX is ideal. Do it right and you could handle files that would never fit in memory. (You wouldn't need it in this case, but that's not the point.:))
PSpeed
@PSpeed: IMHO SAX is not ideal, because its event-driven approach is harder to understand and use than the pull-parsing approach (of the kXML parser or similar).
WildWezyr
Yes, JSON does have a toXML and you can make JSON.XMLtoJSON, but I need to add extra bits and change a few things around to satisfy the Dojo requirements. As the quick bursts will be very strict in format, and typically 3 or 4 lines long (50 at the most, as a recurring set of 3- or 4-line elements), holding them in memory will not be much of an issue. Thanks again for the comments so far.
joe90
PSpeed
push + dispatch is nice (for example) when you are ignoring large portions of the input.
PSpeed
How about vtd-xml's light version? It has all the benefits of VTD (performance, memory, etc.) and is very lightweight (a 64k jar).
vtd-xml-author
A: 

Use a real XML parser. If you don't, you will probably get bitten when something changes. The document may be "very strict", but in two years' time something will probably get refactored, and its structure will change in a way that still parses to the same data structure with an XML parser but breaks a homebrew string parser.

David Dorward
I see your point, but already in different areas (i.e. the next step in the chain) they have changed bits from pure JSON to satisfy their requirements.
joe90
So the not-really-JSON parser is set up to take a fall, but there is no need to compound the issue by introducing the same problem by using a not-really-XML parser.
David Dorward
A: 

Try Piccolo for a fast XML parser.

Try XSLT to convert an XML document from one format to another.

Dave Jarvis
+7  A: 

It really depends on the type of XML you're parsing. I wouldn't write your own parser when there's something already there to do the job for you.

The choice of SAX/DOM is really based on what you're trying to parse. See this for how to decide which one to use:

http://geekexplains.blogspot.com/2009/04/sax-vs-dom-differences-between-dom-and.html

Even if you don't use SAX/DOM there are still simple options available to you, take a look at Simple : )

http://simple.sourceforge.net/

You may also want to consider StAX.
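For comparison with SAX and DOM, here is a small sketch of pull parsing with the StAX API that ships with the JDK (javax.xml.stream). The `<item>` element name and the class and method names are assumptions for illustration only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class StaxSketch {
    // Collects the text of every <item> element by pulling events from the reader.
    static List<String> itemTexts(String xml) throws Exception {
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> texts = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "item".equals(reader.getLocalName())) {
                    // getElementText() reads up to the matching end tag;
                    // it only works for elements that contain text, not child elements.
                    texts.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<items><item id=\"1\">Apple</item><item id=\"2\">Pear</item></items>";
        System.out.println(itemTexts(xml));
    }
}
```

Unlike SAX, the calling code drives the loop and asks for the next event, so there is no handler state to manage between callbacks.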

Jon
Thanks, I will have a look at simple
joe90
+1  A: 

Maybe you should look at kXML 2, a small XML pull parser specially designed for constrained environments, to access, parse, and display XML files on Java 2 Micro Edition-enabled devices. It works well with Java SE/EE too ;-). As it is designed for Micro Edition, it is really lightweight (small footprint) and IMHO really easy to use (much easier than SAX, DOM, and the like).

From my own experience with kXML 2: I used it to parse XML files larger than 1 GB (Wikipedia dumps), and I was very happy with the performance and memory consumption.

At last ;-) - link: http://kxml.sourceforge.net/kxml2/

WildWezyr
Thanks, I will have a look at that :) as we will need a mobile version at some point too.
joe90
A: 

Use Xstream

Adriaan Koster
A: 

Parsing on the backend and exposing JSON is probably the right way to go, so that you have general-purpose JSON data that you can easily integrate with other sources. But if you have a simple message and this is the only place you think you'd be using JSON, you could try to do the parsing client-side. Dojo has an experimental client-side XML parser.

peller
A: 

Do you have to use XML?

I found that my own custom text format was much faster than either XML or JSON with any of the off the shelf packages - they were fast, but by controlling my own format and just doing String parsing I was able to cut the time in half against the fastest XML implementation.

Obviously this only works if you're fully in charge of formats and may not be appropriate to your situation, but for any others in this situation: don't think XML is the absolute fastest option you have. It's not.

Brian
A: 

Do you really need to parse or manipulate any of the data in the XML document? If not, you could just use an XSLT stylesheet. Really simple, really fast.
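A minimal sketch of the XSLT route, using the JDK's javax.xml.transform API with an inline stylesheet that emits JSON as plain text. The `<items>`/`<item id="...">` input format, the JSON field names, and the class name are invented for the example, and the stylesheet does no JSON escaping, so it assumes values contain no quotes or backslashes.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltSketch {
    // XSLT 1.0 stylesheet with text output; it writes one JSON object per <item>.
    // Assumption: values need no JSON escaping.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/items'>"
      + "<xsl:text>[</xsl:text>"
      + "<xsl:for-each select='item'>"
      + "<xsl:if test='position() != 1'><xsl:text>,</xsl:text></xsl:if>"
      + "<xsl:text>{\"id\":\"</xsl:text><xsl:value-of select='@id'/>"
      + "<xsl:text>\",\"name\":\"</xsl:text><xsl:value-of select='.'/>"
      + "<xsl:text>\"}</xsl:text>"
      + "</xsl:for-each>"
      + "<xsl:text>]</xsl:text>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the given XML and returns the text output.
    static String convert(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<items><item id=\"1\">Apple</item><item id=\"2\">Pear</item></items>";
        System.out.println(convert(xml));
    }
}
```

If the same stylesheet is applied to many incoming documents, compiling it once with TransformerFactory.newTemplates() and reusing the resulting Templates object would likely avoid repeated compilation cost.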

Bal