How can I change the HTML content of a tag in Java? For example:

before:

<html>
    <head>
    </head>
    <body>
        <div>text<div>**text**</div>text</div>
    </body>
</html>

after:

<html>
    <head>
    </head>
    <body>
        <div>text<div>**new text**</div>text</div>
    </body>
</html>

I tried JTidy, but it doesn't support getTextContent. Is there any other solution?

+2  A: 

Provided that your HTML is well-formed XML (if it is not, you can use JTidy to tidy it up), you can parse it with a DOM or SAX parser. DOM is probably easier if your document is not huge.

Something like this will do the trick if your text is the only child of a node with id="id":

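import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;

Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);
// getElementById only matches attributes declared as type ID (e.g. in a DTD); otherwise it returns null.
Element e = d.getElementById("id");
Node text = e.getFirstChild();  // getFirstChild() returns a Node, not an Element
text.setNodeValue(process(text.getNodeValue()));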

You may save d afterwards to a file.
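
To make the parse, modify and save steps concrete, here is a minimal sketch using only the JDK's built-in DOM and Transformer APIs. It assumes the input is well-formed XML, as stated above. The class name DomTextEdit and the method replaceText are hypothetical names chosen for illustration, and instead of looking up an id it simply matches any element whose only child is the text being replaced. It serializes to a string; passing a StreamResult built from a File would write to a file instead.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class DomTextEdit {

    // Replaces the text of every element whose only child is a text node equal to oldText,
    // then serializes the whole document back to a string.
    static String replaceText(String xhtml, String oldText, String newText) throws Exception {
        Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        NodeList all = d.getElementsByTagName("*");  // "*" matches every element
        for (int i = 0; i < all.getLength(); i++) {
            Node first = all.item(i).getFirstChild();
            if (first != null
                    && first.getNodeType() == Node.TEXT_NODE
                    && first.getNextSibling() == null
                    && oldText.equals(first.getNodeValue())) {
                first.setNodeValue(newText);
            }
        }

        // An identity transform copies the DOM tree to the output unchanged.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        // Force XML output; the default for an <html> root is the HTML output method.
        t.setOutputProperty(OutputKeys.METHOD, "xml");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(d), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String before = "<html><head></head><body>"
                + "<div>text<div>**text**</div>text</div></body></html>";
        System.out.println(replaceText(before, "**text**", "**new text**"));
    }
}
```

Only the inner div is touched here, because the outer div's first text node has siblings. With the XML output method, empty elements such as head may be written in self-closing form.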

Dmitry
A: 

There are a bunch of open-source Java HTML parsers listed here.

I'm not sure what's most commonly used, but this one (just called HTML parser) will probably do what you want. It has functions to modify your tree and write it back out.

Chad Okere
+2  A: 

Unless you are absolutely sure that the HTML will be valid and well-formed, I'd strongly recommend using an HTML parser, something like TagSoup, Jericho, NekoHTML, HTML Parser, etc., the first two being especially good at parsing any kind of crap :)

For example, with HTML Parser (because the implementation is very easy), you can use a visitor by providing your own NodeVisitor:

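import org.htmlparser.Text;
import org.htmlparser.visitors.NodeVisitor;

public class MyNodeVisitor extends NodeVisitor {
    // Called once for every text node encountered during the traversal.
    @Override
    public void visitStringNode(Text string) {
        // Replace only text nodes whose content matches exactly.
        if (string.getText().equals("**text**")) {
            string.setText("**new text**");
        }
    }
}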

Then, create a Parser, parse the HTML string and visit the returned node list:

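import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;

// parse() and visitAllNodesWith() can throw org.htmlparser.util.ParserException.
Parser parser = new Parser(htmlString);
NodeList nl = parser.parse(null);  // null filter: keep all nodes
nl.visitAllNodesWith(new MyNodeVisitor());
System.out.println(nl.toHtml());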

This is just one way to implement this, and it's pretty straightforward.

Pascal Thivent
A: 

Thanks. I want to parse HTML that is not well-formed. I tried TagSoup, but when I have this code:

<body>
sometext <div>text</div>
</body>

and I want to change 'sometext' to 'someAnotherText', calling {bodyNode}.getTextContent() gives me: sometext text. And when I call setTextContent("someAnotherText"+{bodyNode}.getTextContent()) and serialize the structure, the result is someAnotherText sometext text, so without the tags... and this is a PROBLEM for me.

bugisoft
Please don't post this as an answer. Either update your original question or (better solution here IMHO) post a new question. Comments are not appropriate to solve your new problem
Pascal Thivent
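
On the TagSoup follow-up above: in the W3C DOM, getTextContent() concatenates the text of all descendants, and setTextContent() replaces every child of the node with a single text node, which would explain why the div disappears. A possible way around it is to change only the text node that is a direct child of body. The sketch below assumes the document is already available as a W3C DOM tree; for the sake of a self-contained example it uses the JDK parser on a well-formed snippet rather than TagSoup, and the names MixedContentEdit, parseRoot and replaceLeadingText are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class MixedContentEdit {

    // Parses a well-formed snippet and returns its root element.
    static Element parseRoot(String xml) throws Exception {
        Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return d.getDocumentElement();
    }

    // Rewrites only the first non-blank text node that is a direct child of parent.
    // Child elements (and their own text) are left untouched.
    static void replaceLeadingText(Element parent, String newText) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.TEXT_NODE && !n.getNodeValue().trim().isEmpty()) {
                n.setNodeValue(newText);
                return;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Element body = parseRoot("<body>sometext <div>text</div></body>");
        replaceLeadingText(body, "someAnotherText ");
        // The div is still a child element of body after the edit.
        System.out.println(body.getElementsByTagName("div").getLength());
        System.out.println(body.getTextContent());
    }
}
```

The design choice is to walk the direct children and edit a text node in place, rather than calling setTextContent() on the parent, so the element structure survives serialization.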