I'm trying to fetch data from a website that is encoded in UTF-8 and insert it into a MySQL database. The database is also encoded in UTF-8.

This is the method I use to download data from a specific site.

public String download(String url) throws java.io.IOException {
    java.io.InputStream s = null;
    java.io.InputStreamReader r = null;
    StringBuilder content = new StringBuilder();
    try {
        s = (java.io.InputStream) new URL(url).getContent();

        r = new java.io.InputStreamReader(s, "UTF-8");

        char[] buffer = new char[4 * 1024];
        int n = 0;
        while (n >= 0) {
            n = r.read(buffer, 0, buffer.length);
            if (n > 0) {
                content.append(buffer, 0, n);
            }
        }
    }
    finally {
        if (r != null) r.close();
        if (s != null) s.close();
    }
    return content.toString();
}

If the encoding is set to 'UTF-8' (r = new java.io.InputStreamReader(s, "UTF-8");), the data inserted into the database looks OK, but when I try to display it I get something like C�te d'Ivoire instead of Côte d'Ivoire.

All my websites are encoded in UTF-8.

Please help.

If the encoding is set to 'windows-1252' (r = new java.io.InputStreamReader(s, "windows-1252");), everything works fine and I get Côte d'Ivoire on my website, but in Java this title looks like 'C?´te d'Ivoire', which breaks other things, for example links. What does this mean?

+2  A: 

Java

The problem seems to lie in the HttpServletResponse, if you have a servlet or JSP page. Make sure to set your HttpServletResponse encoding to UTF-8.

In a JSP page, or in the doGet or doPost of a servlet, before any content is sent to the response, just do:

response.setCharacterEncoding("UTF-8");

PHP

In PHP, try using the utf8_encode function (it converts ISO-8859-1 text to UTF-8) after retrieving the data from the database.

subtenante
I am using PHP/Apache, and yes, I set the encoding to UTF-8: header('Content-Type: text/html; charset=UTF-8');
Martin
Be careful: setting the header does not mean setting the encoding. You should specify in your question that you are using PHP/Apache, because your Java code makes this ambiguous.
subtenante
You need to define the encoding when you write the output as well. I don't know how this works in PHP, but what you're setting in the comment is just an instruction on how the client should interpret the content stream.
Tomas
+1  A: 

I would consider using Commons IO; it has a function that does what you want to do.

That is, replace your code with this:

// Requires: import java.net.URL; import org.apache.commons.io.IOUtils;
public String download(String url) throws java.io.IOException {
    java.io.InputStream s = null;
    String content = null;
    try {
        s = (java.io.InputStream) new URL(url).getContent();
        // Read the whole stream and decode it as UTF-8 in one call.
        content = IOUtils.toString(s, "UTF-8");
    }
    finally {
        if (s != null) s.close();
    }
    return content;
}

If that doesn't do it, start checking whether you can store the text to a file correctly, to eliminate the possibility that your DB isn't set up correctly.
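A minimal sketch of that file check, assuming plain JDK classes only. The class and method names (Utf8FileCheck, writeAndReadBack, toHex) are made up for illustration and are not part of Commons IO. The idea is to write the downloaded string to a temporary file as UTF-8 and look at the raw bytes: a correctly decoded "ô" should show up as c3 b4, while ef bf bd (the replacement character) would suggest the text was already damaged before it reached the database.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8FileCheck {
    // Writes the text to a temporary file as UTF-8 and returns the raw bytes read back from disk.
    static byte[] writeAndReadBack(String text) {
        try {
            Path file = Files.createTempFile("download-check", ".txt"); // hypothetical file name
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return Files.readAllBytes(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Formats bytes as space-separated lowercase hex pairs, e.g. "43 c3 b4".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // In real use, pass the result of download(url) instead of this literal.
        String text = "C\u00f4te d'Ivoire";
        System.out.println(toHex(writeAndReadBack(text)));
    }
}
```

If the bytes in the file look right, the download step is probably fine and the database or the display side becomes the more likely suspect.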

Tomas
Database encoding: UTF-8 Unicode (utf8); all tables are in UTF-8 (ENGINE=MyISAM DEFAULT CHARSET=utf8;)
Martin
Try using Commons IO (http://commons.apache.org/io/) instead of the conversion you're doing in the first post. You get a one-liner for doing that.
Tomas
+1  A: 

Is your database encoding set to UTF-8 for the server, client, and connection, and have the tables been created with that encoding? Check 'show variables' and 'show create table <one-of-the-tables>'.

Confusion
character set client: utf8; character set connection: utf8; character set database: latin1; character set filesystem: binary; character set results: utf8; character set server: latin1; character set system: utf8;
Martin
Well, there you have it. Your server stores the data as 'latin1' (unless you specifically set 'utf8' when creating the tables). You need to set the server 'character set' (it's actually a character encoding, but let's not go into that now) to utf8 as well.
Confusion
+1  A: 

If encoding is set to 'UTF-8' (r = new java.io.InputStreamReader(s, "UTF-8"); ) data inserted into database seems to look OK, but when I try to display it, I am getting something like this: C�te d'Ivoire, instead of Côte d'Ivoire.

Thus, the encoding used during display is wrong. How are you displaying it? As per the comments, it's a PHP page? If so, then you need to take two things into account:

  1. Write the data to the HTTP response output using the same encoding, thus UTF-8.
  2. Set the charset in the Content-Type header to UTF-8 so that the web browser knows which encoding to use to display the text.

As per the comments, you have apparently already done 2. That leaves 1: in PHP you need to install mbstring and set mbstring.http_output to UTF-8 as well. I have found this cheatsheet very useful.
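To make the two symptoms from the question concrete, here is a small Java sketch of what a charset mismatch does to the bytes. It is only an illustration under the assumption that the text itself is intact and only the decoding differs; the class name MojibakeDemo and the helper decode are hypothetical. Single-byte (windows-1252) data read as UTF-8 yields the replacement character, while UTF-8 data read as windows-1252 turns "ô" into two characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Decodes bytes with the given charset; malformed input is replaced with U+FFFD.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "C\u00f4te d'Ivoire"; // "Côte d'Ivoire"
        Charset cp1252 = Charset.forName("windows-1252");

        // Case 1: windows-1252 bytes interpreted as UTF-8.
        // The lone byte 0xF4 is not valid UTF-8, so it becomes U+FFFD.
        String readAsUtf8 = decode(original.getBytes(cp1252), StandardCharsets.UTF_8);

        // Case 2: UTF-8 bytes interpreted as windows-1252.
        // The two bytes 0xC3 0xB4 become the two characters U+00C3 and U+00B4.
        String readAsCp1252 = decode(original.getBytes(StandardCharsets.UTF_8), cp1252);

        System.out.println(readAsUtf8.contains("\uFFFD"));
        System.out.println(readAsCp1252.equals("C\u00c3\u00b4te d'Ivoire"));
    }
}
```

Case 1 matches the C�te d'Ivoire seen in the browser: the page is declared as UTF-8, but somewhere along the way the bytes being sent are not UTF-8.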

BalusC