This is what I'm doing:
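    // Uses org.apache.commons.lang.StringEscapeUtils
    public static String htmlToText(String inString)
    {
        // Decode HTML entities; tags are left untouched
        String noentity = StringEscapeUtils.unescapeHtml(inString);
        return noentity;
    }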
This is where I'm invoking it:
    String html = "<html><body>string 1<br />&mdash;<p>string 2</p></body></html>";
    String nohtml = Utility.htmlToText(html);
    Log.i("NON HTML STRING:", nohtml);
And this is the output in the log:
10-13 12:38:12.121: INFO/NON HTML STRING:(300): <html><body>string 1<br />â<p>string 2</p></body></html>
According to the reference at http://www.w3.org/TR/html4/sgml/entities.html, &mdash; should be replaced by "—" (which is the output I expect), not by "â".
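One way to tell whether the decoding or the log display is at fault is to print the code points of the returned string instead of the string itself. A minimal sketch in plain Java follows; `CodePointDump` and `dump` are hypothetical names introduced only for illustration, and the sample input is a hand-written string rather than real output from the code above:

```java
class CodePointDump {
    // Hypothetical helper: returns each code point of s as U+XXXX, space-separated.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u2014 is the em dash that &mdash; is expected to decode to
        System.out.println(dump("1\u2014<")); // prints "U+0031 U+2014 U+003C"
    }
}
```

If the dump of nohtml contains U+2014, the entity was decoded correctly and the "â" comes from how the log output is rendered, not from the unescaping.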
At first I was using JSoup and saw the same behavior. Thinking it was a bug, I switched to org.apache.commons.lang, but the result is identical.
Anyone else know what's going on here? Am I missing something obvious?