I want to parse a document that is not pure XML. For example:

my name is <j> <b> mike</b>  </j>

Example 2:

 my name is  <mytag1 attribute="val" >mike</mytag1> and yours is <mytag2> john</mytag2>

This means my input is not pure XML. It is similar to HTML, but the tags are not HTML tags. How can I parse it in Java?

+4  A: 

Your examples are well-formed XML, except for the lack of a document (root) element. If you know this will always be the case, you could just wrap a pair of dummy tags around the whole thing and use a standard parser (SAX, DOM, ...).
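A minimal sketch of the wrapping idea, using the DOM parser that ships with the JDK. The class name `FragmentParser`, the method `parseFragment`, and the dummy element name `root` are made up for this illustration; turning parse errors into an `IllegalArgumentException` is just one possible choice.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class FragmentParser {
    // Wraps the fragment in a dummy <root> element (an arbitrary name chosen
    // for this sketch) so that a standard XML parser will accept it.
    static Document parseFragment(String fragment) {
        String wrapped = "<root>" + fragment + "</root>";
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(wrapped)));
        } catch (SAXException e) {
            // The fragment itself was not well-formed (e.g. mismatched tags).
            throw new IllegalArgumentException(
                "Fragment is not well-formed: " + e.getMessage(), e);
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseFragment(
            "my name is <mytag1 attribute=\"val\" >mike</mytag1>"
            + " and yours is <mytag2> john</mytag2>");
        Element first = (Element) doc.getElementsByTagName("mytag1").item(0);
        // Prints the attribute value and the text inside the first custom tag.
        System.out.println(first.getAttribute("attribute") + " / " + first.getTextContent());
    }
}
```

The parser does not need to know the tag names in advance; anything that is well-formed inside the dummy element ends up in the DOM tree.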

On the other hand, if you get something uglier (e.g. tags that don't match up, or tags that overlap), you'll have to write something custom, which means deciding on a number of rules that will be unique to your application (e.g. How do I handle an opening tag that has no close? What do I do if the closing tag is outside the parent?).

Greg Harman
I don't know in advance how, or in which order, the tags will be nested. For example, I can have data <b>dat2 </b>data3 dada4 <b><t><i> datan </i> </t> <b> . If the tags are not properly nested, meaning an inner tag is not closed before its outer tag closes, I need the parser to throw an exception.
Provided that there is a document element, a generic XML parser will handle unknown tags and will check that tags are balanced and correctly nested.
Stephen C
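To illustrate the comment above, here is a rough sketch of a nesting check built on the JDK's SAX parser. The names `NestingCheck` and `isWellFormed` and the dummy `root` element are assumptions made for this example; it returns a boolean, but you could just as well let the `SAXException` propagate if you want an exception.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class NestingCheck {
    // Returns true if the fragment, wrapped in a dummy <root> element
    // (hypothetical name), is well-formed XML; false if the parser rejects it.
    static boolean isWellFormed(String fragment) {
        String wrapped = "<root>" + fragment + "</root>";
        try {
            // DefaultHandler ignores all content events; its fatalError
            // method rethrows, so badly nested tags surface as SAXException.
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(wrapped)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unknown tags, correctly nested: accepted.
        System.out.println(isWellFormed("data <b>dat2 </b>data3 <b><t><i> datan </i> </t> </b>"));
        // Inner tag closed after its outer tag: rejected.
        System.out.println(isWellFormed("data <b><i> datan </b> </i>"));
    }
}
```

The first call prints `true` and the second prints `false`, since `</b>` appears while `<i>` is still open.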
+1 for an excellent answer
KLE
A: 

There are a few parsers that take HTML that is not well-formed and turn it into well-formed XML. Here is a comparison with examples that includes the most popular ones, except maybe HTMLParser. That is probably what you need.

serg