Hi, I'm trying to use java.util.Scanner to fetch Wikipedia pages and use their contents for word-based searches. It mostly works, but reading the pages for some words gives me errors. After looking at the code and running some checks, it turned out that for some words the encoding does not seem to be recognized, and the content is no longer readable. This is the code used to fetch the page:

// -Start-

try {
  connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
  Scanner scanner = new Scanner(connection.getInputStream());
  // "\\Z" matches the end of input, so next() returns the whole page as one token
  scanner.useDelimiter("\\Z");
  content = scanner.next();
//    if(word.equals("pubblico"))
//     System.out.println(content);
  System.out.println("Doing: "+ word);
//End

The problem arises with words such as "pubblico" on the Italian Wikipedia. The result of the println for the word pubblico looks like this (truncated): ï¿ï¿½]Ksr>�~E �1A���E�ER3tHZ�4v��&PZjtc�¿½ï¿½D�7_|����=8��؋}

Do you have any idea why? I have already looked at the page source, and the headers are the same, with the same encoding...

It turned out that the content is gzipped. Can I tell Wikipedia not to send me its pages compressed, or is decompressing them the only way? Thank you.

+2  A: 

Try using the Scanner with a specified character set:

public Scanner(InputStream source, String charsetName)

For the constructor that takes only an InputStream, the documentation says:

Bytes from the stream are converted into characters using the underlying platform's default charset.

Scanner on java.sun.com
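
A minimal sketch of the two-argument constructor in use. The class name CharsetScannerSketch, the helper readAll and the sample word are made up for illustration, and a ByteArrayInputStream stands in for connection.getInputStream() so the example runs without a network connection. It only helps once the stream has been decompressed, if compression is involved.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical helper class, not part of the original question.
final class CharsetScannerSketch {
    // Reads the whole stream as one token, decoding bytes with the named charset.
    static String readAll(InputStream in, String charsetName) {
        try (Scanner scanner = new Scanner(in, charsetName)) {
            scanner.useDelimiter("\\Z"); // end of input: the page becomes a single token
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for connection.getInputStream(); the sample text is made up.
        byte[] utf8 = "pubblicit\u00e0".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(utf8), "UTF-8"));
    }
}
```

Decoding the same bytes with a different charset name (for example "ISO-8859-1") yields a different, garbled string, which is why relying on the platform default is fragile.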

parkerfath
+1  A: 

Try using a Reader instead of an InputStream - I think it works something like this:

connection =  new URL("http://it.wikipedia.org/wiki/"+word).openConnection();
String ctype = connection.getContentType();
// The charset is a parameter of the Content-Type header, e.g. "text/html; charset=UTF-8"
int csi = (ctype == null) ? -1 : ctype.indexOf("charset=");
Scanner scanner;
if (csi >= 0)
    scanner = new Scanner(new InputStreamReader(connection.getInputStream(), ctype.substring(csi + 8).trim()));
else
    scanner = new Scanner(new InputStreamReader(connection.getInputStream()));
scanner.useDelimiter("\\Z");
content = scanner.next();
if(word.equals("pubblico"))
    System.out.println(content);
System.out.println("Doing: "+ word);

You could also just pass the charset to the Scanner constructor directly as indicated in another answer.

David Zaslavsky
Don't use the content-encoding. It specifies the compression used, and has nothing to do with the character encoding.
erickson
oops, yeah, my mistake... I'll edit to fix that
David Zaslavsky
+1  A: 

You need to use a URLConnection, so that you can read the content-type header in the response. This should tell you the character encoding to use when you create your Scanner.

Specifically, look at the "charset" parameter of the content-type header.
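
As a rough sketch of that lookup: the helper below pulls the "charset" parameter out of a Content-Type value and falls back to a caller-supplied default when it is missing or unknown. The class name ContentTypeCharsetSketch and the method charsetOf are hypothetical, and the parsing is deliberately simple rather than a full implementation of the header grammar.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class ContentTypeCharsetSketch {
    // Returns the charset named in a Content-Type value, or the fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) return fallback;
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Case-insensitive check for a leading "charset="
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // unknown or malformed charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // With a real connection the argument would be connection.getContentType().
        System.out.println(charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
    }
}
```

The resulting Charset's name() can then be passed to the Scanner or InputStreamReader constructor.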


To inhibit gzip compression, set the accept-encoding header to "identity". See the HTTP specification for more information.
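
A minimal sketch of setting that request header, assuming the property is set before the connection is actually opened. The class name IdentityEncodingSketch and the method openUncompressed are made up for illustration. A server is not obliged to honour the request, so checking getContentEncoding() on the response is still worthwhile.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper class, not part of the original answer.
final class IdentityEncodingSketch {
    // Prepares (but does not yet connect) a URLConnection that asks for an uncompressed body.
    static URLConnection openUncompressed(String url) throws IOException {
        URLConnection connection = URI.create(url).toURL().openConnection();
        // Request properties must be set before the connection is established.
        connection.setRequestProperty("Accept-Encoding", "identity");
        return connection;
    }

    public static void main(String[] args) throws IOException {
        URLConnection c = openUncompressed("http://it.wikipedia.org/wiki/pubblico");
        // Shows the request header that would be sent; no network traffic happens here.
        System.out.println(c.getRequestProperty("Accept-Encoding"));
    }
}
```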

erickson
A: 
connection =  new URL("http://it.wikipedia.org/wiki/"+word).openConnection();
      connection.addRequestProperty("Accept-Encoding","");
      System.out.println(connection.getContentEncoding());
      Scanner scanner = new Scanner(new InputStreamReader(connection.getInputStream()));
      scanner.useDelimiter("\\Z");
      content = new String(scanner.next());

The encoding doesn't change. Why?

A: 
connection =  new URL("http://it.wikipedia.org/wiki/"+word).openConnection();
//connection.addRequestProperty("Accept-Encoding","");
//System.out.println(connection.getContentEncoding());

InputStream resultingInputStream = null;       // Stream through which the downloaded page flows
String encoding = connection.getContentEncoding();    // Content encoding used by the server (identity, gzip, deflate)
// Choose the appropriate decompressor to read the source
if (encoding != null && encoding.equals("gzip")) {
    resultingInputStream = new GZIPInputStream(connection.getInputStream());
}
else if (encoding != null && encoding.equals("deflate")) {
    resultingInputStream = new InflaterInputStream(connection.getInputStream(), new Inflater(true));
}
else {
    resultingInputStream = connection.getInputStream();
}

// Scanner that extracts the page from the stream and puts it into a string
Scanner scanner = new Scanner(resultingInputStream);
scanner.useDelimiter("\\Z");
content = new String(scanner.next());

This way it works!!!
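
For anyone who wants to try the idea without hitting Wikipedia, here is a self-contained sketch that gzips a string in memory and reads it back through the same kind of Content-Encoding check. The names ContentEncodingSketch, decode, readAll and gzip are invented for this example. It covers only gzip (deflate is left out for brevity) and assumes the page text is UTF-8, which in real code should come from the Content-Type header instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of the original answer.
final class ContentEncodingSketch {
    // Wraps the raw stream in a gzip decompressor when the Content-Encoding says so.
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if ("gzip".equalsIgnoreCase(contentEncoding)) return new GZIPInputStream(raw);
        return raw; // null, "identity" or anything else: pass through unchanged
    }

    // Reads the whole stream as one string, assuming UTF-8 text.
    static String readAll(InputStream in) {
        try (Scanner scanner = new Scanner(in, StandardCharsets.UTF_8.name())) {
            scanner.useDelimiter("\\Z");
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    // Test helper: compresses text the way a gzip-encoding server would.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for connection.getInputStream() and connection.getContentEncoding()
        InputStream body = decode(new ByteArrayInputStream(gzip("pubblico")), "gzip");
        System.out.println(readAll(body));
    }
}
```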