My Java standalone application gets a URL (which points to a file) from the user and I need to hit it and download it. The problem I am facing is that I am not able to encode the HTTP URL address properly...

Example:

URL:  http://search.barnesandnoble.com/booksearch/first book.pdf

java.net.URLEncoder.encode(url.toString(), "ISO-8859-1");

returns me:

http%3A%2F%2Fsearch.barnesandnoble.com%2Fbooksearch%2Ffirst+book.pdf

But, what I want is

http://search.barnesandnoble.com/booksearch/first%20book.pdf

(space replaced by %20)

I guess URLEncoder is not designed to encode HTTP URLs... The JavaDoc says "Utility class for HTML form encoding"... Is there any other way to do this?

+7  A: 

Yes, URLEncoder encodes the whole string so that it can be passed safely as a parameter inside another URL. For example, you could not have http://xxx.com?url=http://yyy.com as it stands; URL-encoding the parameter value fixes that.
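To make the parameter case concrete, here is a minimal sketch. The class and method names are made up for illustration, and UTF-8 is an assumed charset; only the parameter value is encoded, not the URL that carries it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class ParamEncodingSketch {
    // Builds "http://xxx.com?url=<encoded target>"; only the value is encoded.
    static String withUrlParam(String target) {
        try {
            return "http://xxx.com?url=" + URLEncoder.encode(target, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(withUrlParam("http://yyy.com"));
        // prints http://xxx.com?url=http%3A%2F%2Fyyy.com
    }
}
```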

So I have two choices for you:

  1. Do you have access to the path separate from the domain? If so, you may be able to simply URL-encode the path. If not, option 2 may be for you.

  2. Get commons-httpclient-3.1. This has a class URIUtil:

    System.out.println(URIUtil.encodePath("http://xxx.com/x y", "ISO-8859-1"));

This will output exactly what you are looking for, as it will only encode the path part of the URI.

FYI, you'll need commons-codec and commons-logging for this method to work at runtime.
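Option 1 could be sketched roughly as follows, assuming you already have the path on its own. The helper name `encodePath` here is hypothetical (it is not the commons-httpclient method), and UTF-8 is an assumed charset. Each segment goes through URLEncoder, and the "+" it emits for a space is swapped for "%20"; a literal "+" in the input has already become "%2B" at that point, so the swap only touches spaces.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class PathSegmentEncoder {
    // Encodes each "/"-separated segment of a path and leaves the slashes alone.
    static String encodePath(String path) {
        StringBuilder out = new StringBuilder();
        // A limit of -1 keeps trailing empty segments, so a trailing "/" survives.
        String[] segments = path.split("/", -1);
        for (int i = 0; i < segments.length; i++) {
            if (i > 0) {
                out.append('/');
            }
            try {
                out.append(URLEncoder.encode(segments[i], "UTF-8").replace("+", "%20"));
            } catch (UnsupportedEncodingException e) {
                // UTF-8 is a required charset on every Java platform.
                throw new IllegalStateException(e);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String host = "http://search.barnesandnoble.com";
        System.out.println(host + encodePath("/booksearch/first book.pdf"));
        // prints http://search.barnesandnoble.com/booksearch/first%20book.pdf
    }
}
```

Note that URLEncoder follows the HTML form rules, so it escapes a few characters that a path would be allowed to keep; the result is still a valid path.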

Nathan Feger
+1  A: 

URLEncoder.encode() encodes everything, including the forward slashes. These 2 threads may be of interest to you in finding a solution:

http://stackoverflow.com/questions/665354/whats-wrong-with-my-url-encoding

http://stackoverflow.com/questions/591694/url-encoded-slash-in-url

John T
Those threads discuss .NET. This question is about Java.
vocaro
@vocaro the principles are the same.
John T
+4  A: 

URLEncoder can encode HTTP URLs just fine, as you've unfortunately discovered. The string you passed in, "http://search.barnesandnoble.com/booksearch/first book.pdf", was correctly and completely encoded into a URL-encoded form. You could pass that entire long string of gobbledygook that you got back as a parameter in a URL, and it could be decoded back into exactly the string you passed in.
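A small round-trip sketch of that claim, using the charset from the question; the class and method names are only illustrative:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class RoundTripSketch {
    // Encodes the string as form data and decodes it again.
    static String roundTrip(String s) {
        try {
            String encoded = URLEncoder.encode(s, "ISO-8859-1");
            return URLDecoder.decode(encoded, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            // ISO-8859-1 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "http://search.barnesandnoble.com/booksearch/first book.pdf";
        System.out.println(original.equals(roundTrip(original)));
        // prints true
    }
}
```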

It sounds like you want to do something a little different than passing the entire URL as a parameter. From what I gather, you're trying to create a search URL that looks like "http://search.barnesandnoble.com/booksearch/whateverTheUserPassesIn". The only thing that you need to encode is the "whateverTheUserPassesIn" bit, so perhaps all you need to do is something like this:

String url = "http://search.barnesandnoble.com/booksearch/" + 
       URLEncoder.encode(userInput,"UTF-8");

That should produce something rather more valid for you.

CaptainAwesomePants
That would replace the spaces in userInput with "+". The poster needs them replaced with "%20".
vocaro
+3  A: 

Nitpicking: a string containing a whitespace character by definition is not a URI. So what you're looking for is code that implements the URI escaping defined in Section 2.1 of RFC 3986.
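As a rough sketch of what that escaping amounts to (not a complete RFC 3986 implementation): every byte of the UTF-8 form of the text becomes %HH unless it is an unreserved character or a delimiter you choose to keep. The class name, the `keep` parameter, and the choice of UTF-8 are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

class PercentEncoder {
    // The "unreserved" set from RFC 3986, Section 2.3.
    private static final String UNRESERVED =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    // Percent-encodes every UTF-8 byte of s that is neither unreserved
    // nor one of the ASCII characters listed in keep (for example "/").
    static String encode(String s, String keep) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            char c = (char) (b & 0xFF);
            // b >= 0 means the byte is plain ASCII.
            if (b >= 0 && (UNRESERVED.indexOf(c) >= 0 || keep.indexOf(c) >= 0)) {
                out.append(c);
            } else {
                out.append('%')
                   .append(Character.toUpperCase(Character.forDigit((b >> 4) & 0xF, 16)))
                   .append(Character.toUpperCase(Character.forDigit(b & 0xF, 16)));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("/booksearch/first book.pdf", "/"));
        // prints /booksearch/first%20book.pdf
    }
}
```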

Julian Reschke
+1. This is the real problem. You don't want an algorithm to URL-encode, you want a means of fixing up a broken URL.
bobince
Good point. And how to do that efficiently in Java?
Jan
+12  A: 

The java.net.URI class can help; in the documentation of URL you find

Note, the URI class does perform escaping of its component fields in certain circumstances. The recommended way to manage the encoding and decoding of URLs is to use URI

Use one of the Constructors with more than one argument, like:

URI uri = new URI(
    "http", 
    "search.barnesandnoble.com", 
    "/booksearch/first book.pdf",
    null);
URL url = uri.toURL();
// or: String request = uri.toString();

(the single-argument constructor of URI does NOT escape illegal characters)
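If the address arrives as one string from the user, one possible sketch is to let java.net.URL split it into components and hand those to the seven-argument URI constructor. The class and method names below are hypothetical, and the sketch assumes the input is not already percent-encoded.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

class UrlFixer {
    // Splits the raw string with java.net.URL, then lets java.net.URI
    // escape the illegal characters in each component.
    @SuppressWarnings("deprecation") // new URL(String) is deprecated on recent JDKs
    static String escape(String raw) {
        try {
            URL url = new URL(raw);
            URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(),
                    url.getPort(), url.getPath(), url.getQuery(), url.getRef());
            return uri.toString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("Not a usable URL: " + raw, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(escape("http://search.barnesandnoble.com/booksearch/first book.pdf"));
        // prints http://search.barnesandnoble.com/booksearch/first%20book.pdf
    }
}
```

Be aware that the multi-argument constructors always escape "%", so an input that already contains "%20" would come out as "%2520".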


EDIT: added the fully qualified class name to avoid confusion with other URI classes (like the one in Apache's HttpClient)

Carlos Heuberger
Great! This way it works without adding a bunch of commons libraries... Thanks
Sudhakar R
Please note, the URI class mentioned here is "org.apache.commons.httpclient.URI", not "java.net.URI". The "java.net.URI" class doesn't accept illegal characters unless you use the constructors that build the URL from its components, as in Matt's comment below.
Mohamed Faramawi
@Mohamed: the class I mentioned and used for testing **actually is** `java.net.URI`: it worked perfectly (Java 1.6). I would mention the fully qualified class name if it was not the standard Java one and the link points to the documentation of `java.net.URI`. And, by the comment of Sudhakar, it solved the problem without including any "commons libraries"!
Carlos Heuberger
A: 

Please be warned that most of the answers above are INCORRECT.

The URLEncoder class, despite its name, is NOT what is needed here. It's unfortunate that Sun named this class so annoyingly. URLEncoder is meant for passing data as parameters, not for encoding the URL itself.

In other words, "http://search.barnesandnoble.com/booksearch/first book.pdf" is the URL. Parameters would be, for example, "http://search.barnesandnoble.com/booksearch/first book.pdf?parameter1=this&param2=that". The parameters are what you would use URLEncoder for.

The following two examples highlight the differences between the two.

The following produces the wrong parameters, according to the HTTP standard. Note the ampersand (&) and plus (+) are encoded incorrectly.

URI uri = new URI("http", null, "www.google.com", 80,
    "/help/me/book name+me/", "MY CRZY QUERY! +&+ :)", null);

// URI: http://www.google.com:80/help/me/book%20name+me/?MY%20CRZY%20QUERY!%20+&+%20:)

The following will produce the correct parameters, with the query properly encoded. Note the spaces, ampersands, and plus marks.

URI uri = new URI("http", null, "www.google.com", 80,
    "/help/me/book name+me/",
    URLEncoder.encode("MY CRZY QUERY! +&+ :)", "UTF-8"), null);

// URI: http://www.google.com:80/help/me/book%20name+me/?MY+CRZY+QUERY%2521+%252B%2526%252B+%253A%2529

-Matt

Matt
+1  A: 

The answer from "Matt" is incorrect (though it doesn't have any votes for or against at present).

The following produces the correct URL Encoded output

URI uri = new URI("http", null, "www.google.com", 80,
    "/help/me/book name+me/", "MY CRZY QUERY! +&+ :)", null);

// URI: http://www.google.com:80/help/me/book%20name+me/?MY%20CRZY%20QUERY!%20+&+%20:)

You would not want the "&" character to be encoded (as Matt suggests), because '&' characters are considered to be query-parameter separators, and are required to be left alone in order to properly produce an encoded URL.
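The remaining case is a parameter value that itself contains "&". A hedged sketch of one way to handle it: let URI escape the path, encode each name and value separately with URLEncoder, and append the query as text. The multi-argument URI constructors always escape "%", which is why feeding them URLEncoder output gives doubled sequences such as "%2521". The class name, method name, and parameter names below are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

class QueryEncodingSketch {
    // URI escapes the path; the single name=value pair is encoded on its own
    // and appended afterwards, so it is never escaped a second time.
    static String build(String path, String name, String value) {
        try {
            URI base = new URI("http", null, "www.google.com", 80, path, null, null);
            return base.toString() + "?"
                    + URLEncoder.encode(name, "UTF-8") + "="
                    + URLEncoder.encode(value, "UTF-8");
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Bad path: " + path, e);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("/help/me/book name+me/", "q", "this & that"));
        // prints http://www.google.com:80/help/me/book%20name+me/?q=this+%26+that
    }
}
```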

sappenin
A: 

There is still a problem if you have got an encoded "/" (%2F) in your URL.

RFC 3986 - Section 2.2 says: "If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed."
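A small sketch of why the encoded slash matters, using java.net.URI and a placeholder host (example.com and the class name are assumptions): the raw path keeps "%2F" as data inside one segment, while the decoded path can no longer tell it apart from a real separator.

```java
import java.net.URI;

class EncodedSlashSketch {
    // Path exactly as written in the URI, escapes intact.
    static String rawPath(String uri) {
        return URI.create(uri).getRawPath();
    }

    // Path with percent-escapes decoded.
    static String decodedPath(String uri) {
        return URI.create(uri).getPath();
    }

    public static void main(String[] args) {
        String uri = "http://example.com/files/a%2Fb";
        System.out.println(rawPath(uri));     // prints /files/a%2Fb
        System.out.println(decodedPath(uri)); // prints /files/a/b
    }
}
```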

But there is an issue with Tomcat:

http://tomcat.apache.org/security-6.html - Fixed in Apache Tomcat 6.0.10

important: Directory traversal CVE-2007-0450

Tomcat permits '\', '%2F' and '%5C' [...] .

The following Java system properties have been added to Tomcat to provide additional control of the handling of path delimiters in URLs (both options default to false):

  • org.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH: true|false
  • org.apache.catalina.connector.CoyoteAdapter.ALLOW_BACKSLASH: true|false

Due to the impossibility to guarantee that all URLs are handled by Tomcat as they are in proxy servers, Tomcat should always be secured as if no proxy restricting context access was used.

Affects: 6.0.0-6.0.9

So if you have got a URL with the %2F character, Tomcat returns: "400 Invalid URI: noSlash"

You can switch off the bugfix in the Tomcat startup script:

set JAVA_OPTS=%JAVA_OPTS% %LOGGING_CONFIG% -Dorg.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH=true
simonox