Hi,

I've been running into problem after problem trying to use a third-party HTML editor to do what (I hoped) was a simple operation. Because of these problems, I'm looking for recommendations for an alternative HTML parser I could use to perform the operation.

Here's my situation: I have span tags in my HTML (each with an id attribute to identify it), and I simply want to replace their contents when a value is updated in another area of my client. For example:

<html>
    <body>
        <p>Hello <span id="1">name</span> you are <span id="2">age</span></p>
    </body>
</html>

I've been trying to use the HTMLDocument class in javax.swing.text.html like this:

// document is the javax.swing.text.html.HTMLDocument supplied by the third party
Element e;
e = document.getElement(document.getDefaultRootElement(), HTML.Attribute.ID, "1");
document.setInnerHTML(e, "John");
e = document.getElement(document.getDefaultRootElement(), HTML.Attribute.ID, "2");
document.setInnerHTML(e, "99");

but the element returned is a leaf element, and setInnerHTML won't allow the inner HTML of a leaf to be set. Unfortunately, the document, reader and parser are all supplied by a third party, so I can't really modify them.

So, I was hoping that someone else has had a similar problem and could recommend an alternative library to do this.

Thanks in advance, B.

A: 

Have you tried HTML Parser? It is a robust, open source HTML parsing library for Java.

kgiannakakis
A: 

HTMLParser is a great library but is LGPL, which might not be suitable for some commercial projects.

If your HTML is well-formed, you can use Dom4J to traverse the nodes. If it is not well-formed, you can use Tidy in conjunction with Dom4J.

Ram
A: 

I'm having good luck on my current project with TagSoup.

Steven Huwig
+1  A: 

Can you really not accomplish that with javax.swing.text.html.HTMLDocument?

I have never tried this, but reading through the API, something along the lines of

document.replace(e.getStartOffset(), e.getEndOffset()-e.getStartOffset(), "John", null)

instead of using setInnerHTML() could work.

HerdplattenToni
DaddyB
A: 

I used JTidy very successfully. It takes in HTML and cleans out the crap, so you end up with a proper DOM object, and then you simply use XPath to alter your targets.
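
A minimal sketch of the parse-then-XPath idea, under one big assumption: the HTML is already well-formed, so the JTidy clean-up step is skipped here and the JDK's built-in XML parser stands in for it. The class name SpanUpdater and the helpers parse and setSpanText are hypothetical names made up for this illustration, not part of JTidy or any other library.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class SpanUpdater {

    // Parses a well-formed (X)HTML string into a W3C DOM document.
    // Assumption: malformed HTML would have to be cleaned up first (e.g. by Tidy).
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    // Replaces the text of the span with the given id.
    // Returns false if no such span exists.
    // Assumption: the id contains no quote characters, since it is
    // concatenated directly into the XPath expression.
    static boolean setSpanText(Document dom, String id, String text) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Node span = (Node) xpath.evaluate(
                "//span[@id='" + id + "']", dom, XPathConstants.NODE);
        if (span == null) {
            return false;
        }
        span.setTextContent(text);
        return true;
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>Hello <span id=\"1\">name</span>"
                + " you are <span id=\"2\">age</span></p></body></html>";
        Document dom = parse(html);
        setSpanText(dom, "1", "John");
        setSpanText(dom, "2", "99");
        // Prints the text of the whole document: Hello John you are 99
        System.out.println(dom.getDocumentElement().getTextContent());
    }
}
```

With a real-world page you would presumably feed the parser's output from Tidy into the same setSpanText step, since the XPath lookup only needs a DOM Document to work on.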

stwissel