ansaurus

Question

Answer 1

+2 A:

Provided that your HTML is well-formed XML (if it is not, you can use JTidy to clean it up), you can parse it with a DOM or SAX parser. DOM is probably easier if your document is not huge.

Something like this will do the trick if your text is the only child of a node with id="id":

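import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;
Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);
// Note: getElementById returns null unless "id" is declared as an ID attribute (e.g. by a DTD)
Element e = d.getElementById("id");
Node text = e.getFirstChild(); // getFirstChild() returns a Node, not an Element
text.setNodeValue(process(text.getNodeValue()));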
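
If the document has no DTD, getElementById will not find the element, so a lookup by attribute value is one way around that. The following is only a sketch under stated assumptions: it uses XPath from the JDK instead of getElementById, the class name TextByIdSketch is made up for illustration, and process() here is a hypothetical stand-in that simply upper-cases the text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TextByIdSketch {
    // Hypothetical stand-in for the process() call used in the answer.
    static String process(String s) {
        return s.toUpperCase();
    }

    // Returns the element's new text, or null if no suitable node was found.
    static String replaceText(String xhtml, String id) {
        try {
            Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Attribute lookup via XPath works without a DTD, unlike getElementById.
            // The id is spliced into the expression, so it must not contain quotes.
            Node e = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate("//*[@id='" + id + "']", d, XPathConstants.NODE);
            if (e == null) {
                return null;
            }
            Node text = e.getFirstChild();
            if (text == null || text.getNodeType() != Node.TEXT_NODE) {
                return null; // the first child is not plain text
            }
            text.setNodeValue(process(text.getNodeValue()));
            return e.getTextContent();
        } catch (Exception ex) {
            throw new IllegalStateException("could not parse or update document", ex);
        }
    }
}
```

For instance, calling TextByIdSketch.replaceText with a well-formed page whose paragraph carries id="id" and the text hello should hand back HELLO.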

You may save d afterwards to a file.
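
Saving the document can be done with the JDK's identity Transformer. This is a minimal sketch, not the answerer's own code: the class name SaveDomSketch and the file name out.html in the comment are hypothetical, and the example serializes to a string so that it stays self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SaveDomSketch {
    // Serializes a DOM document back to markup.
    static String toXmlString(Document d) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            // To write a file instead, pass new StreamResult(new java.io.File("out.html")).
            t.transform(new DOMSource(d), new StreamResult(out));
            return out.toString();
        } catch (Exception ex) {
            throw new IllegalStateException("could not serialize document", ex);
        }
    }

    // Parses well-formed markup and writes it straight back out.
    static String roundTrip(String xml) {
        try {
            Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return toXmlString(d);
        } catch (Exception ex) {
            throw new IllegalStateException("could not parse document", ex);
        }
    }
}
```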

Dmitry 2009-12-19 22:02:13

Answer 2

A:

There are a bunch of open-source Java HTML parsers listed here.

I'm not sure what's most commonly used, but this one (just called HTML parser) will probably do what you want. It has functions to modify your tree and write it back out.

Chad Okere 2009-12-19 22:10:29

Answer 3

+2 A:

Unless you are absolutely sure that the HTML will be valid and well-formed, I'd strongly recommend using an HTML parser such as TagSoup, Jericho, NekoHTML, or HTML Parser. The first two are especially good at parsing any kind of crap :)

For example, with HTML Parser (because the implementation is very easy), you can use a visitor by providing your own NodeVisitor:

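import org.htmlparser.Text;
import org.htmlparser.visitors.NodeVisitor;

public class MyNodeVisitor extends NodeVisitor {
    public MyNodeVisitor() {
    }

    // Called once for every text node in the parsed tree
    @Override
    public void visitStringNode(Text string) {
        if (string.getText().equals("**text**")) {
            string.setText("**new text**");
        }
    }
}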

Then, create a Parser, parse the HTML string and visit the returned node list:

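import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
// Both the constructor and parse() can throw org.htmlparser.util.ParserException
Parser parser = new Parser(htmlString);
NodeList nl = parser.parse(null); // a null filter returns all top-level nodes
nl.visitAllNodesWith(new MyNodeVisitor());
System.out.println(nl.toHtml());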

This is just one way to implement this, and it is pretty straightforward.

Pascal Thivent 2009-12-19 22:50:35

Answer 4

A:

Thanks. I want to parse HTML that is not well-formed. I tried TagSoup, but when I have this code:

<body>
sometext <div>text</div>
</body>

I want to change 'sometext' to 'someAnotherText'. When I use {bodyNode}.getTextContent(), it gives me: sometext text. And when I use setTextContent("someAnotherText" + {bodyNode}.getTextContent()) and serialize the structure, the result is someAnotherText sometext text, so the tags are gone, and this is a PROBLEM for me.
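
The tags disappear because setTextContent replaces all children of a node with a single text node. One possible way around it, sketched here under assumptions, is to edit only the text nodes that are direct children of the element. The class and method names are hypothetical, and a well-formed snippet is parsed with the JDK parser to keep the example self-contained; the same loop should apply to a DOM tree built by another parser.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class MixedContentSketch {
    // Replaces 'from' with 'to' in the root element's own text nodes only,
    // leaving child elements such as <div> untouched.
    static String replaceDirectText(String xml, String from, String to) {
        try {
            Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node root = d.getDocumentElement();
            for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.TEXT_NODE) {
                    n.setNodeValue(n.getNodeValue().replace(from, to));
                }
            }
            // Serialize the modified tree back to markup.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(d), new StreamResult(out));
            return out.toString();
        } catch (Exception ex) {
            throw new IllegalStateException("could not update document", ex);
        }
    }
}
```

With the body snippet above as input, replacing sometext with someAnotherText this way should keep the div element in the serialized result.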

bugisoft 2009-12-20 16:54:13

Please don't post this as an answer. Either update your original question or (the better solution here, IMHO) post a new question. Comments are not appropriate for solving your new problem.

Pascal Thivent 2009-12-20 17:23:34


How to change HTML tag content in Java?
