I'm trying to read an image from a URL. As mentioned in the Java documentation, I tried converting the URL to a URI like this:

String imageURL = "http://www.shefinds.com/files/Christian-Louboutin-Décolleté-100-pumps.jpg";
URL url = new URL(imageURL);
url = new URI(url.getProtocol(), url.getHost(), url.getFile(), null).toURL();  
URLConnection conn = url.openConnection();
InputStream is = conn.getInputStream();

I get a java.io.FileNotFoundException for http://www.shefinds.com/files/Christian-Louboutin-Décolleté-100-pumps.jpg

What am I doing wrong and what is the right way to encode this URL?

Update:
I'm using ROME to read RSS feeds. Following BalusC's suggestions, I printed the raw input at different stages, and it seems that the ROME RSS parser is using ISO-8859-1 instead of UTF-8.

+2  A: 

Works fine here (returns a 403, it's at least not a 404):

URL url = new URL("http://www.shefinds.com/files/Christian-Louboutin-Décolleté-100-pumps.jpg");
URLConnection connection = url.openConnection();
InputStream input = connection.getInputStream();

When I fix it so that it doesn't return a 403, the picture is correctly retrieved:

URL url = new URL("http://www.shefinds.com/files/Christian-Louboutin-Décolleté-100-pumps.jpg");
URLConnection connection = url.openConnection();
// Send a browser-like User-Agent; this is what avoided the 403 here.
connection.setRequestProperty("User-Agent", "Mozilla/4.0");
InputStream input = connection.getInputStream();
OutputStream output = new FileOutputStream("/pic.jpg");
// Copy the response body to the file one byte at a time.
for (int data = 0; (data = input.read()) != -1;) {
    output.write(data);
}
output.close();
input.close();

So your problem lies somewhere else. Converting is actually not needed. The initial URL is valid.

Maybe you're obtaining the actual URL from some binary source using the wrong character encoding? The transition of é to Ã© suggests that the original source was UTF-8 encoded and that the code incorrectly read it using ISO-8859-1 instead of UTF-8.
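
To make that transition concrete, here is a minimal sketch that decodes the UTF-8 bytes of a string as ISO-8859-1 and then reverses the mistake. The class and method names (MojibakeDemo, misreadAsLatin1, repair) are made up for illustration and are not part of ROME or any library.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {

    // Encode as UTF-8, then decode the same bytes as ISO-8859-1.
    // Each non-ASCII character turns into two garbage characters.
    static String misreadAsLatin1(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo the mis-read: recover the raw bytes and decode them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this example independent of the source file encoding.
        String original = "D\u00e9collet\u00e9";
        String garbled = misreadAsLatin1(original);
        System.out.println(garbled);
        System.out.println(repair(garbled).equals(original));
    }
}
```

The repair step only works while the garbled text has not been damaged further, so fixing the decoding at the point where the bytes are first read is the better option.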

Update: or maybe you've actually hardcoded it in the Java source code and saved the source file itself using the wrong encoding. I've configured my editor (Eclipse) to save files using UTF-8, and -Dfile.encoding also defaults to UTF-8 here, which would explain why it works on my machine ;)

Update 2: as per the comments, in a nutshell, everything should work fine if the encoding used to save the source file matches the default -Dfile.encoding of the runtime platform (and the character encoding in question supports the é). To avoid unforeseen clashes when you distribute the code, it's indeed better to replace hardcoded non-ASCII chars with Unicode escapes.
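
As a rough sketch of that advice, the example below writes the path with Unicode escapes and lets java.net.URI produce an ASCII-only form via toASCIIString(), which percent-encodes non-ASCII characters as UTF-8. The class and method names (EscapedUrlDemo, toAsciiUrl) are hypothetical, and whether a given server expects UTF-8 percent-encoding is an assumption you would need to check.

```java
import java.net.URI;
import java.net.URISyntaxException;

class EscapedUrlDemo {

    // Build a URI from its parts and return an ASCII-only string.
    // toASCIIString() percent-encodes non-ASCII characters as UTF-8.
    static String toAsciiUrl(String scheme, String host, String path) {
        try {
            return new URI(scheme, host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid URI parts", e);
        }
    }

    public static void main(String[] args) {
        // \u00e9 is the Unicode escape for the e with acute accent,
        // so the source file encoding no longer matters.
        String path = "/files/Christian-Louboutin-D\u00e9collet\u00e9-100-pumps.jpg";
        System.out.println(toAsciiUrl("http", "www.shefinds.com", path));
    }
}
```

The resulting string contains only ASCII characters, so it can be passed to new URL(...) without depending on the platform default encoding.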

BalusC
Small addition: If you actually need to convert from URI to URL, you may want to use: `url = new URI(url.getProtocol(), url.getHost(), url.getPath(), url.getQuery(), null).toURL();` Otherwise, query parameters won't work.
Chris Lercher
Converting is needed. Given this code, the `URL` will contain `?`s instead of non-US-ASCII characters.
axtavt
The URL to URI conversion works for me. From the Javadoc: "Note, the java.net.URI class does perform escaping of its component fields in certain circumstances. The recommended way to manage the encoding and decoding of URLs is to use java.net.URI, and to convert between these two classes using toURI() and URI.toURL()."
Chris Lercher
@axtavt: I think I see the problem. I've configured my editor to save source files as UTF-8. You (and probably also the OP) have apparently configured the editor to save the files using another encoding. I'm using Eclipse: *Window > Preferences > General > Workspace > Text File Encoding > Other > UTF-8* should do. This affects "plain vanilla" strings in Java code as well.
BalusC
@BalusC: No, the source encoding is OK. The problem is that your solution depends on system encoding. With `-Dfile.encoding=UTF-8` it encodes `é` as `0xC3 0xA9`, and it works. With `-Dfile.encoding=latin1` it produces `0xE9`, which fails. In other encodings it produces `?`, which fails too.
axtavt
@BalusC: Advice: to eliminate the dependency on source encoding when debugging encoding problems, replace all non-US-ASCII characters with Unicode escapes (`é` -> `\u00e9`).
axtavt
@axtavt: The `-Dfile.encoding` would fail as well if the source file is saved in the wrong encoding. Mine is indeed `UTF-8` as well. In theory, everything should work fine if the encoding used to save the source file matches the default `-Dfile.encoding` of the runtime platform (and the character encoding in question supports the `é`). To avoid unforeseen clashes when you distribute the code, it's indeed better to replace hardcoded non-ASCII chars with Unicode escapes.
BalusC
@chris_l22 - Thank you for the correction w.r.t. the query parameters.
sammichy
A: 

I think the technical answer is "you can't." Non-ASCII characters can't be used in a URL according to the standard, and even some ASCII characters must be escaped with "%XX" syntax, where XX is the ASCII value of the character.

If anything, you can escape 'é' with '%E9' but this relies on the server interpreting this as an encoding of the character according to ISO-8859-1. While this isn't technically allowed, I believe many servers will do it.
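
For illustration, here is a small sketch showing how the same character percent-encodes differently depending on the charset, using java.net.URLEncoder. The class and method names (PercentEncodingDemo, percentEncode) are invented for this example. Note that URLEncoder targets form data (it turns spaces into '+'), so it is used here only to show the escapes, not as a general way to build URL paths.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class PercentEncodingDemo {

    // Percent-encode a string using the named charset.
    static String percentEncode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // e with acute accent
        // One byte in ISO-8859-1, so a single escape.
        System.out.println(percentEncode(eAcute, "ISO-8859-1")); // %E9
        // Two bytes in UTF-8, so two escapes.
        System.out.println(percentEncode(eAcute, "UTF-8"));      // %C3%A9
    }
}
```

Which of the two forms a server accepts depends on how that server decodes request paths, so neither can be assumed to work everywhere.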

Sean Owen
@Sean Owen: *"where XX is the ASCII value of the character"* is not correct: there's no such thing as an ASCII character above 0x7F (ASCII goes from 0 to 127).
Webinator
Not sure I understand -- ASCII values range from 0x00 to 0x7F, yes. Their encodings go from %00 to %7F. What does the fact that 0x80 is not an ASCII character value have to do with it?
Sean Owen
A: 

The encoding of your source file is to blame. Using your IDE, set it to UTF-8, and then repaste the URL.

Beau Martínez