I am using an HTML parser called jsoup to load and parse HTML files. The problem is that the web page I'm scraping is encoded in ISO-8859-1, while Android uses UTF-8 (?). As a result, some characters show up as question marks.

So now I guess I should convert the string to UTF-8 format.

I have found a class called CharsetEncoder in the Android SDK, which I guess could help me. But I can't figure out how to use it in practice, so I wonder if I could get some help in the form of a practical example. Thanks.

UPDATE: Code to read data (Jsoup)

url = new URL("http://www.example.com");
Document doc = Jsoup.parse(url, 4000);
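
The symptom described in the question can be reproduced without any network access. The following is only a minimal sketch, assuming the page bytes really are ISO-8859-1 and are being decoded as UTF-8 somewhere along the way. The class and method names (CharsetMismatch, decodeWrongly, decodeCorrectly) are made up for illustration, and StandardCharsets requires Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

public class CharsetMismatch {

    // Decodes ISO-8859-1 bytes as if they were UTF-8 (the suspected mismatch).
    public static String decodeWrongly(byte[] latin1Bytes) {
        return new String(latin1Bytes, StandardCharsets.UTF_8);
    }

    // Decodes the same bytes with the charset they were actually written in.
    public static String decodeCorrectly(byte[] latin1Bytes) {
        return new String(latin1Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "café" encoded the way an ISO-8859-1 server would send it.
        byte[] latin1Bytes = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // The lone 0xE9 byte is not valid UTF-8, so the decoder substitutes
        // a replacement character, which many fonts render as a question mark.
        System.out.println(decodeWrongly(latin1Bytes));
        System.out.println(decodeCorrectly(latin1Bytes));
    }
}
```

If this is what is happening, the text is already damaged by the time it becomes a String, so the fix is to pick the right charset while decoding the bytes rather than to re-encode the String afterwards.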
+2  A: 

You can let Android do the work for you by reading the page into a byte[] and then using the Jsoup methods that parse String objects.

Don't forget to specify the encoding (here ISO-8859-1) when you create the String from the data read from the server, by using the String constructor that takes a charset.
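
A rough sketch of that approach, using only the standard library so it can be run on its own. The names PageDecoder, readAll and decode are hypothetical, and the byte array in main is a stand-in for the real HTTP response body. StandardCharsets assumes Java 7 or later; on older runtimes the charset name "ISO-8859-1" can be passed as a string instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PageDecoder {

    // Reads the whole stream into memory without interpreting the bytes.
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    // Turns the raw bytes into a String using the charset the server used.
    public static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the response body: "<p>é</p>" encoded as ISO-8859-1.
        byte[] body = {'<', 'p', '>', (byte) 0xE9, '<', '/', 'p', '>'};
        InputStream in = new ByteArrayInputStream(body);

        String html = decode(readAll(in), StandardCharsets.ISO_8859_1);
        System.out.println(html);
        // With Jsoup on the classpath, html could now be passed to Jsoup.parse(html).
    }
}
```

In a real app the InputStream would come from the connection to the server (for example url.openStream()), and the resulting String would be handed to Jsoup for parsing.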

Al Sutton
A: 

http://java.sun.com/docs/books/tutorial/i18n/text/string.html

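import java.io.UnsupportedEncodingException;

public class StringConverter {

    // Prints each byte of the array as a two-digit hex value.
    public static void printBytes(byte[] array, String name) {
        for (int k = 0; k < array.length; k++) {
            System.out.println(name + "[" + k + "] = "
                    + String.format("0x%02x", array[k] & 0xff));
        }
    }

    public static void main(String[] args) {
        // Show the default encoding reported by the runtime.
        System.out.println(System.getProperty("file.encoding"));
        String original = "A" + "\u00ea" + "\u00f1" + "\u00fc" + "C";

        System.out.println("original = " + original);
        System.out.println();

        try {
            // Encode the string explicitly as UTF-8 and with the default charset.
            byte[] utf8Bytes = original.getBytes("UTF-8");
            byte[] defaultBytes = original.getBytes();

            // Decode with the same charset that produced the bytes.
            String roundTrip = new String(utf8Bytes, "UTF-8");
            System.out.println("roundTrip = " + roundTrip);

            System.out.println();
            printBytes(utf8Bytes, "utf8Bytes");
            System.out.println();
            printBytes(defaultBytes, "defaultBytes");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
    } // main
}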
droidgren