ansaurus

Question

URL encoding for latin characters in Java

Answer 1

+2 A:

Works fine here (returns a 403, it's at least not a 404):

URL url = new URL("http://www.shefinds.com/files/Christian-Louboutin-Décolleté-100-pumps.jpg");
URLConnection connection = url.openConnection();
InputStream input = connection.getInputStream();

When I fix it so that it doesn't return a 403, the picture is correctly retrieved:

URL url = new URL("http://www.shefinds.com/files/Christian-Louboutin-Décolleté-100-pumps.jpg");
URLConnection connection = url.openConnection();
connection.setRequestProperty("User-Agent", "Mozilla/4.0");
InputStream input = connection.getInputStream();
OutputStream output = new FileOutputStream("/pic.jpg");
for (int data = 0; (data = input.read()) != -1;) {
    output.write(data);
}

So your problem lies somewhere else. Converting is actually not needed. The initial URL is valid.

Maybe you're obtaining the actual URL from some binary source using the wrong character encoding? The change from é to Ã© suggests that the original source was UTF-8 encoded and that the code incorrectly read it using ISO-8859-1 instead of UTF-8.
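
As a rough illustration of that é to Ã© transition, the sketch below encodes a string as UTF-8 and then deliberately decodes the bytes as ISO-8859-1. The class and method names (`MojibakeDemo`, `misread`) are made up for this example and do not come from the question.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Hypothetical helper: encodes the text as UTF-8, then decodes those
    // bytes with ISO-8859-1, which is the mistake suspected above.
    static String misread(String original) {
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The accented e is written as an escape so the source encoding does not matter.
        String garbled = misread("D\u00e9collet\u00e9");
        // Print code points instead of characters so the console encoding does not matter either.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

Each é (one character, two UTF-8 bytes) comes back as the two characters U+00C3 and U+00A9, which display as Ã©.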

Update: or maybe you've actually hardcoded it in the Java source code and saved the source file itself using the wrong encoding. I've configured my editor (Eclipse) to save files using UTF-8, and -Dfile.encoding also defaults to UTF-8, which would explain why it works on my machine ;)

Update 2: as per the comments, in a nutshell, everything should work fine if the encoding used to save the source file matches the default -Dfile.encoding of the runtime platform (and the character encoding in question supports the é). To avoid those unforeseen clashes when you distribute the code, it's indeed better to replace hardcoded non-ASCII chars with Unicode escapes.

BalusC 2010-03-14 17:16:32

Small addition: If you actually need to convert from URI to URL, you may want to use: `url = new URI(url.getProtocol(), url.getHost(), url.getPath(), url.getQuery(), null).toURL();` Otherwise, query parameters won't work.

Chris Lercher 2010-03-14 17:19:56
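
A minimal sketch of the five-argument `URI` constructor mentioned in the comment above. The host `www.example.com`, the sample path and query, and the names `UriEncodingDemo` and `toAsciiUrl` are placeholders for illustration. The call to `toASCIIString()` is an addition that the comment does not use: as far as I know, the constructor only quotes illegal characters such as spaces, while `toASCIIString()` also percent-encodes non-ASCII characters as UTF-8.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriEncodingDemo {
    // Hypothetical helper: builds a URI from its parts and returns the
    // pure-ASCII form, with non-ASCII characters percent-encoded as UTF-8.
    static String toAsciiUrl(String scheme, String host, String path, String query) {
        try {
            URI uri = new URI(scheme, host, path, query, null);
            return uri.toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URI: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        // Placeholder host and path; the accented e is written as an escape.
        String url = toAsciiUrl("http", "www.example.com",
                "/files/D\u00e9collet\u00e9 100.jpg", "size=large");
        System.out.println(url);
    }
}
```

Passing the query as its own argument keeps the `?` separator from being treated as part of the path, which is the point of the comment above.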

Converting is needed. Given this code, `URL` will contain `?`s instead of non-us-ascii characters.

axtavt 2010-03-14 17:34:30

The URL to URI conversion works for me. From the Javadoc: "Note, the `java.net.URI` class does perform escaping of its component fields in certain circumstances. The recommended way to manage the encoding and decoding of URLs is to use `java.net.URI`, and to convert between these two classes using `toURI()` and `URI.toURL()`."

Chris Lercher 2010-03-14 17:42:32

@axtavt: I think I see the problem. I've configured my editor to save source files as UTF-8. You (and probably also the OP) apparently have configured the editor to save the files using another encoding. I'm using Eclipse: *Window > Preferences > General > Workspace > Text File Encoding > Other > UTF-8* should do. This affects "plain vanilla" strings in Java code as well.

BalusC 2010-03-14 17:44:50

@BalusC: No, the source encoding is OK. The problem is that your solution depends on system encoding. With `-Dfile.encoding=UTF-8` it encodes `é` as `0xC3 0xA9`, and it works. With `-Dfile.encoding=latin1` it produces `0xE9`, which fails. In other encodings it produces `?`, which fails too.

axtavt 2010-03-14 17:57:49
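
The byte values quoted in the comment above can be checked with a small sketch like the following. It passes each charset explicitly instead of relying on `-Dfile.encoding`, and it uses US-ASCII as a stand-in for "other encodings" that cannot represent é. The names `CharsetBytesDemo` and `hex` are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetBytesDemo {
    // Hypothetical helper: returns the bytes of the text in the given
    // charset as an uppercase hex string, two digits per byte.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // the accented e, written as an escape
        System.out.println("UTF-8:      " + hex(eAcute, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1: " + hex(eAcute, StandardCharsets.ISO_8859_1));
        // US-ASCII cannot represent the character, so getBytes substitutes '?' (0x3F).
        System.out.println("US-ASCII:   " + hex(eAcute, StandardCharsets.US_ASCII));
    }
}
```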

@BalusC: Advice: to eliminate dependency on source encoding when debugging encoding problems, replace all non-us-ascii characters with Unicode escapes (`é` -> `\u00e9`)

axtavt 2010-03-14 18:07:21
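
A short sketch of that advice, applied to the URL from the answer. With escapes the source file is pure ASCII, so the string has the same content whatever encoding the editor or compiler assumes. The class name `UnicodeEscapeDemo` and the helper `countNonAscii` are made up for this example.

```java
public class UnicodeEscapeDemo {
    // Same URL as in the answer, with each accented e written as an escape sequence.
    static final String IMAGE_URL =
            "http://www.shefinds.com/files/Christian-Louboutin-D\u00e9collet\u00e9-100-pumps.jpg";

    // Hypothetical helper: counts the chars outside the US-ASCII range (above 0x7F).
    static int countNonAscii(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The two escapes become two real accented characters at compile time.
        System.out.println("Non-ASCII chars in URL: " + countNonAscii(IMAGE_URL));
    }
}
```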

@axtavt: The `-Dfile.encoding` would fail as well if the source file is saved in the wrong encoding. Mine is indeed `UTF-8` as well. In theory, everything should work fine if the encoding used to save the source file matches the default `-Dfile.encoding` of the runtime platform (and the character encoding in question supports the `é`). To avoid those unforeseen clashes when you distribute the code, it's indeed better to replace hardcoded non-ASCII chars with Unicode escapes.

BalusC 2010-03-14 18:17:46

@chris_l22 - Thank you for the correction w.r.t. the query parameters.

sammichy 2010-03-15 15:41:25

Answer 2

A:

I think the technical answer is "you can't." Non-ASCII characters can't be used in a URL according to the standard, and even some ASCII characters must be escaped with "%XX" syntax, where XX is the ASCII value of the character.

If anything, you can escape 'é' with '%E9' but this relies on the server interpreting this as an encoding of the character according to ISO-8859-1. While this isn't technically allowed, I believe many servers will do it.

Sean Owen 2010-03-14 17:19:53
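
To make the "%XX" escaping above concrete, here is a sketch using `java.net.URLEncoder` with two different charsets. Note that `URLEncoder` implements HTML form encoding (for example, it turns spaces into `+`), so it is only an approximation for path segments; the names `PercentEncodingDemo` and `encode` are made up for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class PercentEncodingDemo {
    // Hypothetical helper: percent-encodes the text using the named charset
    // and converts the checked exception into an unchecked one.
    static String encode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // the accented e, written as an escape
        // One byte in ISO-8859-1, so a single escape triplet.
        System.out.println(encode(eAcute, "ISO-8859-1"));
        // Two bytes in UTF-8, so two escape triplets.
        System.out.println(encode(eAcute, "UTF-8"));
    }
}
```

Which form a given server accepts depends on how it decodes request paths, as the answer above points out.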

@Sean Owen: *"where XX is the ASCII value of the character"* is not correct: there's no such thing as an ASCII character above 0x7F (ASCII goes from 0 to 127).

Webinator 2010-03-14 19:00:16

Not sure I understand -- ASCII values range from 0x00 to 0x7F, yes. Their encodings go from %00 to %7F. What does the fact that 0x80 is not an ASCII character value have to do with it?

Sean Owen 2010-03-14 19:16:33

Answer 3

A:

The encoding of your source file is to blame. Using your IDE, set it to UTF-8, and then repaste the URL.

Beau Martínez 2010-03-15 21:26:31
