Hi, I'm trying to use java.util.Scanner to fetch Wikipedia pages and use their contents for word-based searches. It mostly works, but for some words I get errors. After looking at the code and running some checks, it turned out that for some words the encoding does not seem to be recognized (or something similar), and the content is no longer readable. This is the code I use to fetch the page:
// -Start-
try {
connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
Scanner scanner = new Scanner(connection.getInputStream());
scanner.useDelimiter("\\Z");
content = scanner.next();
// if(word.equals("pubblico"))
// System.out.println(content);
System.out.println("Doing: "+ word);
} catch (IOException e) {
e.printStackTrace();
}
//End
The problem arises with words such as "pubblico" on the Italian Wikipedia. The result of the println for the word "pubblico" looks like this (truncated): ï¿ï¿½]Ksr>�~E �1A���E�ER3tHZ�4v��&PZjtc�¿½ï¿½D�7_|����=8��Ø}
Do you have any idea why? I have already looked at the page source, and the headers are the same, with the same encoding...
It turned out that the content is gzipped. Can I tell Wikipedia not to send me its pages gzipped, or is unzipping them the only way? Thank you.
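For reference, here is a rough sketch of the two options I can think of, not a definitive solution. It asks the server for an uncompressed body with an "Accept-Encoding: identity" request header, and it also unzips the body with GZIPInputStream whenever the response's Content-Encoding header says "gzip", since I am assuming the server may not honor the request header. The class name GzipAwareFetch and the methods readBody and fetch are made up for this example, and UTF-8 is assumed as the page charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipAwareFetch {

    // Reads the whole stream as UTF-8 text (assumed charset), unzipping it
    // first when the Content-Encoding value says the body is gzipped.
    public static String readBody(InputStream in, String contentEncoding) throws IOException {
        InputStream body = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(in)
                : in;
        try (Scanner scanner = new Scanner(body, "UTF-8")) {
            scanner.useDelimiter("\\Z"); // read up to the end of input
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    // Hypothetical helper: asks for an uncompressed response, but still
    // checks the Content-Encoding header in case the server ignores that.
    public static String fetch(String pageUrl) throws IOException {
        URLConnection connection = new URL(pageUrl).openConnection();
        connection.setRequestProperty("Accept-Encoding", "identity");
        return readBody(connection.getInputStream(), connection.getContentEncoding());
    }

    public static void main(String[] args) throws IOException {
        // Offline check: gzip a short string in memory and read it back.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write("pubblico".getBytes(StandardCharsets.UTF_8));
        }
        String decoded = readBody(new ByteArrayInputStream(buffer.toByteArray()), "gzip");
        System.out.println(decoded); // prints "pubblico"
    }
}
```

In my code this would be called as fetch("http://it.wikipedia.org/wiki/" + word). Checking the response header instead of trusting the request header means the same code path works whether or not the page comes back compressed.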