views:

7890

answers:

4

Hello all, I would like to convert some HTML characters back to text using Java Standard Library. I was wondering whether any library would achieve my purpose?

/**
 * @param args the command line arguments
 */
public static void main(String[] args) {
    // TODO code application logic here

    // "Happy & Sad" in HTML form.
    String s = "Happy & Sad";
    System.out.println(s);

    try {
        // Change to "Happy & Sad". DOESN'T WORK!
        s = java.net.URLDecoder.decode(s, "UTF-8");
        System.out.println(s);
    } catch (UnsupportedEncodingException ex) {

    }
}
+1  A: 

I'm not aware of any way to do it using the standard library. But I do know and use this class that deals with html entities.

"HTMLEntities is an Open Source Java class that contains a collection of static methods (htmlentities, unhtmlentities, ...) to convert special and extended characters into HTML entitities and vice versa."

http://www.tecnick.com/public/code/cp_dpage.php?aiocp_dp=htmlentities

rogeriopvl
A: 

java.net.URLDecoder deals only with the application/x-www-form-urlencoded MIME format (e.g. "%20" represents space), not with HTML character entities. I don't think there's anything on the Java platform for that. You could write your own utility class to do the conversion, like this one.

Zach Scrivena
+1  A: 

The URL decoder should only be used for decoding strings from the urls generated by html forms which are in the "application/x-www-form-urlencoded" mime type. This does not support html characters.

After a search I found a Translate class within the HTML Parser library.

Rich
+11  A: 

I think the Jakarta Commons Lang library's StringEscapeUtils.escapeHtml() and unescapeHtml() methods are what you are looking for. See http://commons.apache.org/lang/api-release/org/apache/commons/lang/StringEscapeUtils.html.

Bill.D
I prefer to use Jakarta Commons Lang library, as its large user base give me a better confidence on its quality.
Yan Cheng CHEOK