ansaurus

Question

Java XML Parsing

Answer 1

+1 A:

Id use Test Driven Development here

I would write tests like

assertEquals("test", parse("<test />"));

assertEquals("test", parse("<test></test>"));

assertEquals("test\n\tfun", parse("<test><fun></fun></test>"));

etc. and make them all pass.

remember to handle corner cases and bad xml.

parsing you can do:

1) with String.indexOf find first(<) and last(>) xml element in a loop and handle what is inside. using recursion will get you a better grade.

2) you could also use String.split. you could split on first and last xml element or use regex to get what is inside of the xml elements.

2a is the easiest to do, 2b might be the fastest to write if you know regex, but 1 is best because its faster(you might not see it, but it sounds faster) and many teachers only care about performance.

01 2009-09-20 21:38:27

Answer 2

+2 A:

I'm not sure what level is bothering you.

First of all you will be iterating over all the XML characters.

Personally I'd use StringTokenizer. I know it's a little outdated, but you can easily have it parse on the two angle-bracket characters you are interested in, and you can set it up to return "Tokens".

Each token you get will therefore be left-angle, right-angle or some string without angles.

In XML, angle's don't nest, that actually saves you quite a bit of work.

so let's say your XML looked like this (Outside the code blocks I'm converting angle-brackets to parens because I don't feel like typing all the ampersand crap, anyone uncomfortable with this is encouraged to edit my post and fix it):

<tag var="meh">between</tag>

Your first token is "<", then "tag var="meh", then ">", then "between", then "<", then "/tag", then ">"

So far so good.

You need to know if you are inside angle-brackets or outside to know what to do with your strings so track that as a boolean. Since there is no nesting, a single boolean should suffice.

You also need to know about nesting levels of your tags. That should automatically make your brain flash the word "Stack". Stacks are how you implement nesting, period.

So your process is, read a token. If it's "<" set the inside boolean to true and continue, if it's ">" set the boolean to false and continue. Got those two out of the way! You're half done!

Next if you are inside, I would instantiate an object with the following properties:

String tagName
HashMap<String, String> attributes 
// Why the heck do angle-brackets correctly display here SO?!?
String value

and push it onto your stack. (Note, LinkedList has a Stack API)

So you would store the name in "tagName" and the variables in the HashMap and continue. If it were me, this little class would have a constructor that took a string, and the string it took would be the string between the angle-brackets. It would parse the string itself and store the tagName and attributes before returning.

If you did that, then your loop, you just say if you are "inside=true" and the value is not an angle-bracket (you know it's not, those cases were already removed!), then simply:

push(new XMLObject(token));

(whoops, if the "token" at this point starts with a "/", you have to do something different, keep on reading).

Very concise and readable, no?

Finally, the one case remaining is when you are outside and your string is a token. In this case, you know the most recent stack entry was an XMLObject which was filled out except for the "value". The Value should be the token you have right now.

You're done except for printing.

Oh, I forgot, when you hit your closing tag, you need to pop the value off the stack and print it (push for open tag, pop for closing tag)

You can get your indent level by checking the size of the stack (stack.size() should determine how many tabs to print).

If you actually want to divide the responsibilities, I'd have XMLObject implement a toString that formatted the data the way you wanted it printed, then when you reach the end tag, all you have to do is:

Pop the XMLObject.
Print out stack.size() tabs (use print not println!)
Print XMLObject.toString();

That's actually your entire project (aside from the legwork of actually coding it).

Learn to go through an analysis process of examining your data and developing a simple program like this should quickly become second nature.

Good luck with your budding career!

This got a little complicated so I made this list of possibilities. Iterate over it for each token, and handle each case. I'm pretty sure this covers every possibility, and if you put in the error checking where I've specified, it'll actually be production quality (I think).

Note: you can't actually do this as a case. Consider if/if else statements.

case token = "<" //note, if already "inside", this is an error
  if(inside) throw exception // Indicates < tagname <
  inside=true; 
case token = ">" 
  if(!inside) throw exception // Indicates > text or not >
  inside=false;
case inside = false
  stack.peek().setValue(token)
case inside (inside = true)      
  sub-case token startsWith "!-- & endsWith "--"
    comment, ignore
  sub-case token startsWith "/" and endsWith "/" and is more than 1 char long
    create the object, then act as though you found a close tag
  sub-case token = "" 
    (throw exception/print error/bail! // inside must have a tag!)
  sub-case token.startsWtih("/")         
    create & push
  sub-case token doesn't .startWith "/"
    sub-sub-case length=1 (just the slash, lazy close tag)
      pop & print
    sub-sub-case length > 1
      print an error unless (stack.peek().getTagName().equals(tagWithoutSlash))
      pop & print

That's about it. I think each of these cases were 1 line, maybe 2 or 3 if you don't nest things like stack.push(new XMLObject(token)).

I wouldn't have provided this much, but it got more complicated than I thought since I'm including all the error checking and stuff.

I've tried to avoid actual java wherever possible, so use your own style! The XMLObject class as described is a really good idea though. Get in the habit of making new classes wherever possible (I've literally never seen code with too many classes).

This is way too much help, but it's so much fun I may write an XML parser just for the hell of it.

Bill K 2009-09-20 22:19:25

fyi: angle-brackets are only painful to use in Markdown outside of code literals. Inside code literals they are straightforward.

Jason S 2009-09-20 23:02:25

p.s. I'd change your HashMap's name from "variables" to "attributes" since attributes are what they're called.

Jason S 2009-09-20 23:05:48

Thank you Jason, fixed! Note: In case it isn't obvious, you can't use = or ==, you have to use .equals for strings. Also, you probably want to call .trim on a lot of stuff (your tags for instance). to remove leading and trailing spaces.

Bill K 2009-09-21 00:55:47

Answer 3

+1 A:

Slightly tricky cases to be aware of:

comments e.g. 
empty tags (/> ending the tag e.g. <element foo='1' bar='2' baz='3' />)

Jason S 2009-09-20 23:04:52

Ug +1 good point, I forgot both of those! Guess if he likes my answer he'll have to figure that part out for himself--not too hard to add though.

Bill K 2009-09-21 01:02:25

Answer 4

+1 A:

Just go through the XML file character by character and do what you have to do.

Set up some state variable (with values like waiting for opening tag, reading element name, waiting for attribute name, reading attribute value, reading element content -- waiting for closing tag etc.).
Set up depth counter (because you need that for pretty printing).
Use state variable and current character to decide how to change the state or add the new character to what you have read so far.
Assume that your XML is well-formed.

Your actual task is much easier than 'XML parser' sounds.

zilupe 2009-09-20 23:14:09

not that much easier than an XML parser

Jason S 2009-09-21 13:23:25

ansaurus

tags:

views:

answers:

Java XML Parsing

related questions