How can I get query results from Jena (Java) in UTF-8? My code:

Query query = QueryFactory.create(queryString);
// Pass the parsed query to the remote endpoint
QueryExecution qexec = QueryExecutionFactory.sparqlService("http://lod.openlinksw.com/sparql", query);
ResultSet results = qexec.execSelect();
List<QuerySolution> list = ResultSetFormatter.toList(results);
for (int i = 0; i < list.size(); i++) {
    System.out.println(list.get(i).get("churchname"));
}
A:

I assume this is related to http://stackoverflow.com/questions/2671284/utf-8-formatting-in-sparql/2673973#2673973?

Having looked at it, here's what happened:

  • Importer took input 'Chodovská tvrz' encoded in utf-8.
  • In utf-8 that's: '43 68 6f 64 6f 76 73 6b c3 a1 20 74 76 72 7a' (c3 a1 is 'á' in utf-8)
  • Importer read those bytes instead as unicode characters.
  • So instead of 'á' you get the two characters c3 a1, which are 'Ã' and '¡'.
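
The steps above can be reproduced in a few lines. This is only a sketch, assuming the importer decoded the UTF-8 bytes as ISO-8859-1; the class and method names (MojibakeDemo, garble) are made up for illustration and are not part of Jena or the importer.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode as UTF-8, then (wrongly) decode the same bytes as ISO-8859-1.
    public static String garble(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "Chodovsk\u00E1 tvrz"; // U+00E1 is a-acute
        String garbled = garble(original);
        // The single a-acute became two chars: U+00C3 and U+00A1.
        System.out.println(garbled.length() - original.length());        // 1
        System.out.println(garbled.equals("Chodovsk\u00C3\u00A1 tvrz")); // true
    }
}
```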

You can reverse that by converting the string's characters into a byte array, then decoding a new string from those bytes as UTF-8. I'm sure there must be a simpler way, but here's an example:

public class Convert
{
    public static void main(String... args) throws Exception {
        String in = "Chodovsk\u00C3\u00A1 tvrz";
        char[] chars = in.toCharArray();
        // make a new string by treating chars as bytes
        String out = new String(fix(chars), "utf-8");
        System.err.println("Got: " + out); // Chodovská tvrz
    }

    public static byte[] fix(char[] a) {
        byte[] b = new byte[a.length];
        for (int i = 0; i < a.length; i++) b[i] = (byte) a[i];
        return b;
    }
}

Using this on list.get(i).get("churchname").toString() (which is what you are printing) will fix those names.

Edit:

Or just use:

String churchname = list.get(i).get("churchname").toString();
String out2 = new String(churchname.getBytes("iso-8859-1"), "utf-8");

Which is much simpler.
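
If some names might already be correct, re-decoding them blindly would damage them. Here is a hedged sketch of a guarded version: it only repairs a string when every char fits in one byte and those bytes form valid UTF-8. The class and method names (Utf8Repair, repair) are hypothetical, and the heuristic is an assumption on my part, not something Jena provides.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Repair {
    /** Returns the repaired string, or the input unchanged if it does not look mis-decoded. */
    public static String repair(String s) {
        // Chars above U+00FF cannot have come from an ISO-8859-1 decode.
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) return s;
        }
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        // A strict decoder throws instead of substituting replacement chars.
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strict.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return s; // not valid UTF-8, so leave it alone
        }
    }

    public static void main(String[] args) {
        // Garbled input is repaired.
        System.out.println(repair("Chodovsk\u00C3\u00A1 tvrz").equals("Chodovsk\u00E1 tvrz")); // true
        // Already-correct input is returned as-is.
        System.out.println(repair("Chodovsk\u00E1 tvrz").equals("Chodovsk\u00E1 tvrz"));       // true
    }
}
```

Using StandardCharsets also avoids the checked UnsupportedEncodingException that the string-named overloads declare.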

I think you mean "Importer read those bytes instead as iso-8859-* characters."
Christoffer Hammarström
Does iso-8859-* correspond to the lower unicode codepoints? Ah, it does! That simplifies things.
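
For anyone who wants to check that claim for ISO-8859-1 specifically, here is a small sketch comparing the (byte) cast loop with getBytes for every code point from U+0000 to U+00FF. The class and method names (Latin1Check, castToBytes, sameAsLatin1) are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1Check {
    // Narrowing cast per char, as in the fix(char[]) approach.
    public static byte[] castToBytes(String s) {
        byte[] b = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) b[i] = (byte) s.charAt(i);
        return b;
    }

    // True when the cast and the ISO-8859-1 encoder produce the same bytes.
    public static boolean sameAsLatin1(String s) {
        return Arrays.equals(castToBytes(s), s.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        StringBuilder all = new StringBuilder();
        for (int cp = 0; cp <= 0xFF; cp++) all.append((char) cp);
        System.out.println(sameAsLatin1(all.toString())); // true
        // Above U+00FF the two differ: the cast truncates, the encoder emits '?'.
        System.out.println(sameAsLatin1("\u0161"));       // false
    }
}
```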