Hi, given this HTML:

<html>
<head>
</head>
<body>
 text
  <div>
  text2
    <div>
    text3
    </div>
  </div>
</body>
</html>

How can I get the content of body, i.e. text <div> text2 <div> text3 </div> </div>, with a DOM parser in Java? The getTextContent method returns text text2 text3, i.e. without the tags.

This is possible with SAX, but is it possible with DOM, too?

+1  A: 

getTextContent is behaving as I would expect: it returns the textual content of the HTML fragment. Can you check the API docs for the DOM parser and see if there's a similar method with a name like getHtmlContent?

Richard Ev
I agree; you can treat the entire thing as a String and use String.indexOf(..) and substring(..) to extract everything inside the body tag.
Samuh
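A rough sketch of the string-based approach from the comment above. It assumes the markup contains a single lowercase body element; the class name BodySubstringDemo and the method bodyContent are made up for illustration, and this is plain text slicing, not parsing.

```java
class BodySubstringDemo {
    /** Returns the raw markup between <body ...> and </body>, or null if either tag is missing. */
    static String bodyContent(String html) {
        int open = html.indexOf("<body");          // start of the opening tag
        if (open < 0) return null;
        int start = html.indexOf('>', open);       // end of the opening tag (skips attributes)
        if (start < 0) return null;
        int end = html.lastIndexOf("</body>");     // start of the closing tag
        if (end < start) return null;
        return html.substring(start + 1, end);
    }

    public static void main(String[] args) {
        String html = "<html><head></head><body> text <div> text2 </div></body></html>";
        System.out.println(bodyContent(html));
    }
}
```

This keeps the tags exactly as written in the source, but it would be fooled by things like "<body" appearing inside a comment or script, so it is only reasonable for input you control.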
+1  A: 

You would need to parse the document into a DOM and serialise only the portion of the DOM you wanted. Using the DOM Level 3 LS interfaces you can serialise the outer-XML of a single node with:

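// implementation is a DOMImplementationLS, e.g. (DOMImplementationLS) document.getImplementation()
LSSerializer serializer = implementation.createLSSerializer();
String html = serializer.writeToString(node);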

To get the inner-XML you would need to call writeToString on each child node in turn (e.g. appending the results into a StringBuffer).
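A minimal sketch of that inner-XML loop, assuming the input is well-formed XML (XHTML) and that the DOM implementation returned by the JDK's default DocumentBuilder supports the Load/Save interfaces, which is not guaranteed for every implementation. The class name InnerXmlDemo and the helpers parse and innerXml are hypothetical names used only for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

class InnerXmlDemo {
    /** Parses a well-formed XML string into a DOM Document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    /** Serialises each child of the node in turn and concatenates the results. */
    static String innerXml(Node node) {
        // Assumption: the implementation supports DOM Level 3 LS; the cast fails otherwise.
        DOMImplementationLS ls =
                (DOMImplementationLS) node.getOwnerDocument().getImplementation();
        LSSerializer serializer = ls.createLSSerializer();
        // Do not emit an <?xml ...?> declaration in front of the fragments.
        serializer.getDomConfig().setParameter("xml-declaration", Boolean.FALSE);

        StringBuilder sb = new StringBuilder();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            sb.append(serializer.writeToString(children.item(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<html><head/><body>text<div>text2<div>text3</div></div></body></html>");
        Node body = doc.getElementsByTagName("body").item(0);
        System.out.println(innerXml(body));
    }
}
```

A StringBuilder is used here instead of a StringBuffer only because no synchronisation is needed; either works. The exact whitespace and empty-tag style of the output depend on the serialiser in use.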

Depending on which DOM implementation you are using, there may be alternative non-standard methods. There may also be risks in serialising HTML as XML, if that's what you're doing: e.g. a standard XML serialiser may output a self-closing tag for an empty element, which can confuse browsers parsing the output as legacy HTML.

bobince