ansaurus

Question

Answer 1

+2 A:

Try constructing the Scanner with an explicit character set:

public Scanner(InputStream source, String charsetName)

The constructor that takes only an InputStream does not do this. Its documentation says:

Bytes from the stream are converted into characters using the underlying platform's default charset.

Scanner on java.sun.com
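A minimal sketch of that constructor in use, with an in-memory stream standing in for the network connection. The class name ScannerCharsetDemo, the helper readAll, and the sample text are made up for illustration, and "UTF-8" is only an assumption about what the server sends:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

final class ScannerCharsetDemo {

    // Reads the whole stream as one string, decoding bytes with the named charset
    // instead of the platform default.
    static String readAll(InputStream source, String charsetName) {
        try (Scanner scanner = new Scanner(source, charsetName)) {
            scanner.useDelimiter("\\Z"); // treat the entire input as a single token
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample text with a non-ASCII character, encoded as UTF-8.
        byte[] utf8Bytes = "citt\u00e0 pubblico".getBytes(StandardCharsets.UTF_8);
        String decoded = readAll(new ByteArrayInputStream(utf8Bytes), "UTF-8");
        System.out.println(decoded.equals("citt\u00e0 pubblico"));
    }
}
```

With a real connection, connection.getInputStream() would take the place of the ByteArrayInputStream.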

parkerfath 2009-02-11 21:58:08

Answer 2

+1 A:

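Try using a Reader instead of an InputStream. I think it works something like this:

connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
// The charset is a parameter of the Content-Type header, e.g. "text/html; charset=UTF-8".
String ctype = connection.getContentType();
// getContentType() may return null, so guard before searching it.
int csi = (ctype == null) ? -1 : ctype.indexOf("charset=");
Scanner scanner;
if (csi >= 0)
    // Decode with the charset the server declared.
    scanner = new Scanner(new InputStreamReader(connection.getInputStream(), ctype.substring(csi + 8).trim()));
else
    // No charset declared: this falls back to the platform default charset.
    scanner = new Scanner(new InputStreamReader(connection.getInputStream()));
scanner.useDelimiter("\\Z");  // read the whole response as a single token
content = scanner.next();
if (word.equals("pubblico"))
    System.out.println(content);
System.out.println("Doing: " + word);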

You could also just pass the charset to the Scanner constructor directly as indicated in another answer.

David Zaslavsky 2009-02-11 22:02:35

Don't use the content-encoding. It specifies the compression used, and has nothing to do with the character encoding.

erickson 2009-02-11 22:07:33

oops, yeah, my mistake... I'll edit to fix that

David Zaslavsky 2009-02-11 22:47:27

Answer 3

+1 A:

You need to use a URLConnection so that you can read the content-type header of the response. This should tell you the character encoding to use when you create your Scanner.

Specifically, look at the "charset" parameter of the content-type header.
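As a rough sketch of pulling that parameter out of the header value: the class ContentTypeCharset and the method charsetOf are hypothetical names, and the fallback charset is whatever the caller decides to assume when the server declares nothing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class ContentTypeCharset {

    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=UTF-8", or the fallback when none is declared
    // or the declared name is not supported by this JVM.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Strip optional double quotes around the value.
                String name = param.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
    }
}
```

The returned Charset can be passed to new InputStreamReader(stream, charset), or its name() to the Scanner constructor that takes a charset name.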

To inhibit gzip compression, set the accept-encoding header to "identity". See the HTTP specification for more information.
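A small sketch of setting that request header. IdentityRequest and openUncompressed are illustrative names, not part of any library. Note that openConnection() only creates the connection object and nothing is sent until the response is read:

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;

final class IdentityRequest {

    // Prepares a connection that asks the server not to compress the body.
    // The header must be set before the connection is actually opened,
    // i.e. before getInputStream() or any response header is requested.
    static URLConnection openUncompressed(String url) throws IOException {
        URLConnection connection = URI.create(url).toURL().openConnection();
        connection.setRequestProperty("Accept-Encoding", "identity");
        return connection;
    }
}
```

A server may still ignore the request, so checking getContentEncoding() on the response remains a reasonable safeguard.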

erickson 2009-02-11 22:03:41

Answer 4

A:

connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
connection.addRequestProperty("Accept-Encoding", "");
System.out.println(connection.getContentEncoding());
Scanner scanner = new Scanner(new InputStreamReader(connection.getInputStream()));
scanner.useDelimiter("\\Z");
content = new String(scanner.next());

The encoding doesn't change. Why?

2009-02-12 16:14:44

Answer 5

A:

connection =  new URL("http://it.wikipedia.org/wiki/"+word).openConnection();
//connection.addRequestProperty("Accept-Encoding","");
//System.out.println(connection.getContentEncoding());

InputStream resultingInputStream = null;              // stream the downloaded page is read from
String encoding = connection.getContentEncoding();    // content encoding of the response (identity, gzip, deflate); may be null
// Choose the appropriate decompressor for reading the source
if ("gzip".equals(encoding)) {
    resultingInputStream = new GZIPInputStream(connection.getInputStream());
}
else if ("deflate".equals(encoding)) {
    // Inflater(true) expects raw deflate data without the zlib header and checksum
    resultingInputStream = new InflaterInputStream(connection.getInputStream(), new Inflater(true));
}
else {
    resultingInputStream = connection.getInputStream();
}

// Scanner that pulls the whole page out of the stream into a string
Scanner scanner = new Scanner(resultingInputStream);
scanner.useDelimiter("\\Z");
content = scanner.next();

This works!
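For reference, decompression and an explicit charset can be combined in one step. The following is only a sketch under assumptions: the body is gzip-compressed in memory to stand in for the server response, UTF-8 is assumed as the page charset, and GzipDecodeDemo and its methods are invented names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipDecodeDemo {

    // Wraps the raw body according to its Content-Encoding value
    // (null or anything other than gzip is passed through unchanged).
    static InputStream decompress(InputStream raw, String contentEncoding) throws IOException {
        if ("gzip".equalsIgnoreCase(contentEncoding)) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    // Decompresses first, then decodes the resulting bytes with the named charset.
    static String readAll(InputStream raw, String contentEncoding, String charsetName) throws IOException {
        try (Scanner scanner = new Scanner(decompress(raw, contentEncoding), charsetName)) {
            scanner.useDelimiter("\\Z");
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    // Test helper: produces a gzip-compressed UTF-8 body, simulating a server response.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] body = gzip("perch\u00e9 pubblico");
        System.out.println(readAll(new ByteArrayInputStream(body), "gzip", "UTF-8"));
    }
}
```

The order matters: Content-Encoding describes the compression applied to the bytes, while the charset describes how the decompressed bytes map to characters.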

2009-02-12 22:37:04

java.util.Scanner and Wikipedia