I'm making a KSSN (Korean ID Number) checker in PHP using a MySQL database. I check if it is working by using a file_get_contents call to an external site.

The problem is that the requests (with Hangul/Korean characters in them) are using the wrong charset. When I echo the string, the Korean characters just get replaced by question marks.

How can I make it use Korean? Should I change anything in the database too? Which charset should I use?

PHP Source and SQL Dump: http://www.multiupload.com/RJ93RASZ31

NOTE: I'm using Apache (HTML), not CLI.

+1  A: 

I don't know the charset, but if you are using HTML to show the results, you should set the charset of the HTML page:

     <META http-equiv="Content-Type" content="text/html; charset=EUC-KR">

You can also use PHP's iconv functions to convert the string from one charset to another: http://php.net/manual/en/book.iconv.php

And last but not least, check your database encoding for the tables.

But I guess that in your case you will only have to change the meta tag.

aviv
Actually, the meta tag can do nothing. It must be an **HTTP** header, not the http-equiv surrogate.
Col. Shrapnel
@Col: ? You very much *can* change the charset the browser uses from a `<meta http-equiv>`. That's the whole point. Sending an accurate `Content-Type` header *as well* is a good idea though.
bobince
`<meta http-equiv>` is only used if the real HTTP header is *missing*.
David Dorward
+1  A: 

Basically all charset problems stem from charsets being mixed and/or misinterpreted.

A string (text) is a sequence of bytes in a specific order. The string is encoded using some specific charset, which in itself is neither right nor wrong. The problem arises when you try to read the string, the sequence of bytes, assuming the wrong charset. Bytes encoded using, for example, KS X 1001 just don't make sense when you read them as UTF-8; that's where the question marks come from.
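To make the point concrete, here is a minimal sketch of the same two Hangul characters as bytes in two encodings. It assumes PHP 7 or later (for the `\u{...}` escapes) and an iconv build that knows EUC-KR; the `$isValidUtf8` helper is only for illustration.

```php
<?php
// "한글" written with Unicode escapes, so the example does not depend on
// how this source file itself is saved.
$utf8 = "\u{D55C}\u{AE00}";

// The same two characters, re-encoded as EUC-KR.
$eucKr = iconv('UTF-8', 'EUC-KR', $utf8);

echo bin2hex($utf8), "\n";   // six bytes
echo bin2hex($eucKr), "\n";  // four bytes

// preg_match() with the /u modifier fails when the subject is not valid UTF-8.
$isValidUtf8 = static function (string $bytes): bool {
    return preg_match('//u', $bytes) === 1;
};

var_dump($isValidUtf8($utf8));   // the UTF-8 bytes parse as UTF-8
var_dump($isValidUtf8($eucKr));  // the EUC-KR bytes do not
```

Neither byte sequence is wrong on its own. The second one only looks broken when something reads it as UTF-8.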

The site you're getting the text from sends it to you in some specific character set, let's assume KS X 1001. Let's assume your own site uses UTF-8. Embedding a stream of bytes representing KS X 1001 encoded text in the middle of UTF-8 encoded text and telling the browser to interpret the whole site as UTF-8 leads to the KS X 1001 encoded text not making sense to the UTF-8 parser.

UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
KSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKS
UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU

will be rendered as

Hey, this is UTF-8 encoded text, awesome!
???????I?have?no?idea?what?this?is???????
Hey, this is UTF-8 encoded text, awesome!

To solve this problem, convert the fetched text into UTF-8 (or whatever encoding you're using on your site). Look at the Content-Type header of that other site; it should tell you what encoding the site is in. If it doesn't, take a guess.
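A rough sketch of that conversion step follows. The helper names `charsetFromContentType` and `toUtf8` are made up for this example, and the EUC-KR fallback is only a guess based on this thread, not something the remote site is known to guarantee.

```php
<?php
// Pull the charset parameter out of a Content-Type header value, if present.
function charsetFromContentType(string $contentType): ?string
{
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Convert fetched bytes to UTF-8. $fallbackCharset is the "take a guess" case.
function toUtf8(string $body, ?string $charset, string $fallbackCharset = 'EUC-KR'): string
{
    $from = $charset ?? $fallbackCharset;
    if ($from === 'UTF-8') {
        return $body; // already in the target encoding
    }
    $converted = iconv($from, 'UTF-8', $body);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $from to UTF-8");
    }
    return $converted;
}

// Stand-in for a fetched response: four EUC-KR bytes and a header value.
$body = "\xC7\xD1\xB1\xDB";
echo toUtf8($body, charsetFromContentType('text/html; charset=euc-kr')), "\n";
```

After a real `file_get_contents()` call over HTTP, the response headers are available in `$http_response_header`, so the `Content-Type` line can be picked out of that array and passed to `charsetFromContentType()`.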

deceze
+1  A: 

You need to:

  1. tell the browser what encoding you wish to receive in the form submission, by setting Content-Type by header or <meta> as in aviv's answer.

  2. tell the database what encoding you're sending it bytes in, using mysql_set_charset().

Currently you are using EUC-KR in the database so presumably you want to use that encoding in both the above points. In this century I would suggest instead using UTF-8 throughout for all web apps/databases, as the East Asian multibyte encodings are an anachronistic unpleasantness. (With potential security implications: if mysql_real_escape_string doesn't know the correct encoding, a multibyte sequence containing ' or \ can let an SQL injection sneak through.)
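The two steps might look roughly like this. Note the swap: the sketch uses the mysqli extension instead of `mysql_set_charset()`, because the old mysql_* functions were removed in PHP 7. The helper names, credentials and database name are placeholders, and `utf8mb4` is MySQL's name for full UTF-8 (`euckr` would be the one for EUC-KR).

```php
<?php
// Step 1: declare the page encoding in the real HTTP header.
function htmlContentType(string $charset): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

if (!headers_sent()) {
    header(htmlContentType('UTF-8')); // must run before any output
}

// Step 2: tell MySQL which encoding the connection uses, so that escaping
// and storage agree with the bytes PHP actually sends.
function useConnectionCharset(mysqli $db, string $charset): void
{
    if (!$db->set_charset($charset)) {
        throw new RuntimeException('Could not set connection charset: ' . $db->error);
    }
}

// Usage (placeholder credentials, not run here):
// $db = new mysqli('localhost', 'user', 'password', 'kssn');
// useConnectionCharset($db, 'utf8mb4');
// $safeName = $db->real_escape_string($name); // escapes using the charset set above
```

Setting the charset through the API rather than with a `SET NAMES` query matters for the security point above, since it is what lets the escaping function know the connection encoding.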

However, if enpang.com are using EUC-KR for the encoding of the Name URL parameter you would need either to stick with EUC-KR, or to transcode the name value from UTF-8 to EUC-KR for that purpose using iconv(). (It's not clear to me what encoding enpang.com are using for URL parameters to their name check service; I always get the same results anyway.)
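If the service does turn out to expect EUC-KR, building the request could look something like the sketch below. The function name `buildNameCheckUrl` and the `example.com` base URL are placeholders, since the real endpoint is not documented here. The value is also passed through `urlencode()`, which query-string parameters need in any case.

```php
<?php
// Transcode a UTF-8 name to the charset the remote service expects,
// then percent-encode the resulting bytes for use in a query string.
function buildNameCheckUrl(string $baseUrl, string $nameUtf8, string $targetCharset = 'EUC-KR'): string
{
    $encoded = iconv('UTF-8', $targetCharset, $nameUtf8);
    if ($encoded === false) {
        throw new RuntimeException("Name cannot be represented in $targetCharset");
    }
    return $baseUrl . '?Name=' . urlencode($encoded);
}

// "한글" as Unicode escapes (PHP 7+); the base URL is hypothetical.
echo buildNameCheckUrl('http://example.com/check', "\u{D55C}\u{AE00}"), "\n";
```

Passing `'UTF-8'` as the third argument leaves the bytes unchanged, which makes it easy to try both encodings against a known-used and a known-unused name.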

bobince
Well, that's the problem. I don't know which encoding they are using either..
lesderid
Is the web service documented anywhere?
bobince
I don't think so. However, it's of course used on their register page: http://join.enpang.com/member/joinStep1.asp I just checked and that page is using euc-kr.
lesderid
Ah well, you can only try with a known-used/unused username, I guess. (I can't read Hangul other than merely phonetically, so I can't immediately see how to use the site.) Note that when you are creating the URL query string you should use `urlencode` on the parameters to turn them into `%nn` sequences.
bobince