Is there any utility (or sample source code) that truncates HTML (for preview) in Java? I want to do the truncation on the server and not on the client.

I'm using HTMLUnit to parse HTML.

UPDATE:
I want to be able to preview the HTML, so the truncator would maintain the HTML structure while stripping out the elements after the desired output length.

+1  A: 

I think you're going to need to write this yourself on top of an XML parser. Pull out the body node, add its child nodes one at a time while the serialized byte length stays below some fixed size, and then rebuild the document. If HTMLUnit doesn't produce well-formed XHTML, I'd recommend TagSoup to clean it up first.

If you need an XML parser/handler, I'd recommend XOM.
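
A rough sketch of that idea, in case it helps. To keep it self-contained it uses the JDK's built-in DOM rather than XOM, and it budgets visible text characters instead of bytes, since that is usually what a preview cares about. It assumes the input is already well-formed XHTML (for example after a TagSoup pass); the class name `HtmlPreview` and the method `truncate` are made up for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: cuts a well-formed XHTML fragment down to a text budget.
final class HtmlPreview {

    /** Keeps at most maxChars characters of text and leaves every tag balanced. */
    static String truncate(String xhtml, int maxChars) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            prune(doc.getDocumentElement(), new int[] { maxChars });

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Assumption: callers treat malformed input as a bad argument.
            throw new IllegalArgumentException("could not truncate input", e);
        }
    }

    // Walks the tree in document order; remaining[0] is the unspent text budget.
    private static void prune(Node node, int[] remaining) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();   // grab before any removal
            if (remaining[0] <= 0) {
                node.removeChild(child);          // budget spent: drop what follows
            } else if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getNodeValue();
                int keep = Math.min(text.length(), remaining[0]);
                if (keep < text.length()) {
                    child.setNodeValue(text.substring(0, keep));
                }
                remaining[0] -= keep;
            } else {
                prune(child, remaining);          // descend into elements
            }
            child = next;
        }
    }
}
```

Calling `HtmlPreview.truncate(html, 200)` should give back markup whose text is cut at 200 characters, with the enclosing elements still closed properly. It cuts mid-word and counts whitespace as text, so you may want to tweak that.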

Stefan Kendall
I guess that is what I have to do. Wanted to see if there was something else out there already ...
sammichy
I've never heard of anyone needing to do this before, so I guess that's why there's no (easy-to-find, at least) solution out there.
Stefan Kendall
@Stefan - Thank you.
sammichy
Also, with XOM at least, you can check the length of your graph pretty easily. root.toXML().getBytes().length will return the number of bytes for the string representation of the current XML tree. If you build your tree incrementally, you can check the byte count at each step and roll back the last addition once it exceeds the desired size.
Stefan Kendall
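
For what it's worth, the incremental build-and-check approach from the comment above might look roughly like this. It is only a sketch: it uses the JDK DOM in place of XOM so that it runs without extra jars, the class `ByteBudgetTruncator` is a made-up name, and re-serializing after every node is quadratic, so it only suits small documents. Text nodes are kept or dropped whole, and the root element itself is always kept even if it alone exceeds the limit.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: rebuilds a document node by node under a byte limit.
final class ByteBudgetTruncator {

    static String truncate(String xhtml, int maxBytes) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document src = builder.parse(new InputSource(new StringReader(xhtml)));
            Document dst = builder.newDocument();
            // Shallow copy of the root: the element and its attributes, no children.
            Node root = dst.importNode(src.getDocumentElement(), false);
            dst.appendChild(root);
            copyWhileFits(src.getDocumentElement(), root, dst, maxBytes);
            return serialize(dst);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not truncate input", e);
        }
    }

    // Copies children in document order; returns false once the limit is hit.
    private static boolean copyWhileFits(Node from, Node to, Document dst, int maxBytes)
            throws Exception {
        for (Node c = from.getFirstChild(); c != null; c = c.getNextSibling()) {
            Node copy = dst.importNode(c, false);
            to.appendChild(copy);
            if (serialize(dst).getBytes(StandardCharsets.UTF_8).length > maxBytes) {
                to.removeChild(copy);             // roll back the last addition
                return false;
            }
            if (!copyWhileFits(c, copy, dst, maxBytes)) {
                return false;
            }
        }
        return true;
    }

    private static String serialize(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, "xml");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

One thing to watch for: an element can be copied just before the limit is reached, which leaves it empty in the output. Pruning empty trailing elements afterwards would tidy that up.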
A: 

I can offer you a Python script I wrote to do this: http://www.ellipsix.net/ext-tmp/summarize.txt. Unfortunately I don't have a Java version, but feel free to translate it yourself and modify it to suit your needs if you want. It's not very complicated, just something I hacked together for my website, but I've been using it for a little more than a year and it generally seems to work pretty well.

If you want something robust, an XML (or SGML) parser is almost certainly a better idea than what I did.

David Zaslavsky
@David - Thank you, I'll check it out.
sammichy