views:

92

answers:

4

Dear all,Now i have this question in my java program,I think it should be classified as URL problem,but not 100% sure.If you think I am wrong,feel free to recategorize this problem,thanks.

I would state my problem as simply as possible. I did a search on the famouse Chinese search engine baidu.com for a Chinese key word "奥巴马" (Obama in English),and the way I do that is to pass a URL (in a Java Program)to the browser like:

http://news.baidu.com/ns?word=奥巴马

and it works perfectly just like I input the "奥巴马”keyword in the text field on baidu.com.

However,now my advisor wants another thing.Since he can not read the Chinese webpages,but he wants to make sure the webpages I got from Baidu.com is related to "Obama",he asked me to google translate it back,i.e,using google translate and translate the Chinese webpage to English one.

This sounds straightforward.However,I met my problem here.

If I simply pass the URL "http://news.baidu.com/ns?word=奥巴马" into Google Translate and tick "Chinese to English" translating option,the result looks awful.(I don't know the clue here,maybe related to Chinese character encoding).

Alternatively,if now my browser opens ""http://news.baidu.com/ns?word=奥巴马" webpage,but I click on the "百度一下" button (that simply means "search"),you will notice the URL will get changed,now if I pass this URL into the Google translate and do the same thing,the result works much better.

I hope I am not making this problem sound too complicated,and I appologize for some Chinese words invovled,but I really need your guys' help here.Becasue I did all this in a Java program,I couldn't figure out how to realize that "百度一下"(pressing search button) step then get the new URL.If I could get that new URL,things are easy,I could just call Google translate in my Java code,and pops out the new window to show my advisor.

Please share any of your idea or thougts here.Thanks a lot.

Robert

+1  A: 

Try calling

URLEncoder.encode("http://news.baidu.com/ns?word=奥巴马", "utf-8")

(or utf-16; I'm not quite familiar with the Chinese characters representation)

Bozho
+2  A: 

You could use

URLEncoder.encode("http://news.baidu.com/ns?word=奥巴马", "utf-8")

then pass the resulting URL to Google Translate like:

http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&sl=zh-CN&tl=en&u=YOUR_URL

Cheers

lunohodov
Thanks,Lundodov,but the second statement always open the empty google translate page,what do I go wrong?
Robert
Can you post the code you use for generating the URL as also opening it?
lunohodov
+1  A: 

When you press the search button, the browser encodes the search term into %E5%A5%A5%E5%B7%B4%E9%A9%AC, which is the UTF-8 encoding for 奥巴马. It does this because UTF-8 is the default encoding for HTML forms.

Java uses a UTF-16 encoding internally, so it’s possible that the URL library builds a request in that encoding if you do not specify anything.

However, I could not reproduce your problem with Google translate — pasting that URL appeared to work correctly no matter how I did it.

jleedev
+1  A: 

URLs can contain only ASCII characters. All other characters must be converted to bytes then %-encoded in ASCII. However there is no mandate on what charset is used to convert chars to bytes. UTF-8 is recommended, but not required. As long as a server expresses its preference on charset, the client should respect that and use the same charset for encoding.

You can see from page info that baidu uses gb2312 encoding. The characters 奥巴马 in a form on its page will be converted to bytes in gb2312: B0C2 B0CD C2ED, then %-encoded to %B0%C2%B0%CD%C2%ED. That is what actually sent to baidu server, http://www.baidu.com/s?wd=%B0%C2%B0%CD%C2%ED

Your OS happens to be configured to use gb2312 by default, therefore when you paste http://news.baidu.com/ns?word= 奥巴马 to the browser, browser does the same thing, and baidu gets the correct chars. When I paste that URL in my browser, it screws up, because my OS uses UTF-8, and the browser encodes these chinese characters in UTF-8, not something baidu expectes. (when entering a URL directly in a browser, the browser may not have communicated to the server and does not know the charset the server prefers, therefore the browser uses platform default charset)

Now, Google uses UTF-8. That's why if you paste the URL to google form, it will screw up just like on my OS. The chars are encoded in UTF-8, and baidu will try to parse it as gb2312, and gets totally wrong words.

Solution is easy. Just encode the parameter in the way that the server expects:

"http://news.baidu.com/ns?word=" + URLEncoder.encode("奥巴马", "gb2312")
irreputable