About the system

I have URLs of this format in my project:

http://project_name/browse_by_exam/type/tutor_search/keyword/class/new_search/1/search_exam/0/search_subject/0

Here the keyword/class pair means a search for the keyword "class".

I have a common index.php file which executes for every module in the project. The only rewrite rule is one that removes index.php from the URL:

RewriteCond $1 !^(index\.php|resources|robots\.txt)
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php [L,QSA]

I am using urlencode() while preparing the search URL and urldecode() while reading the search URL.

Problem

Only the forward slash character breaks the URLs, causing a 404 "page not found" error. For example, if I search for one/two, the URL is

http://project_name/browse_by_exam/type/tutor_search/keyword/one%2Ftwo/new_search/1/search_exam/0/search_subject/0/page_sort/

How do I fix this? I need to keep index.php hidden in the URL. If that were not required, the forward slash would cause no problem and I could use this URL:

http://project_name/index.php?browse_by_exam/type/tutor_search/keyword/one%2Ftwo/new_search/1/search_exam/0/search_subject/0

Thanks,

Sandeepan

+5  A: 

By default, Apache rejects (with a 404) all URLs that have %2F in the path part, for security reasons: scripts normally (i.e. without rewriting) cannot tell the difference between %2F and /, because the PATH_INFO environment variable is automatically URL-decoded. (That is stupid, but it is a long-standing part of the CGI specification, so nothing can be done about it.)
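To see why the decoded path is ambiguous, here is a minimal PHP sketch. The path string is a shortened, hypothetical version of the URL from the question, and the decoding is done by hand with rawurldecode() to imitate what the server does before a script sees PATH_INFO.

```php
<?php
// Path as sent by the client, with the keyword "one/two" encoded.
// (Shortened, hypothetical version of the URL in the question.)
$rawPath = '/keyword/one%2Ftwo/new_search/1';

// Imitates what a script sees once the whole path has been URL-decoded.
$decodedPath = rawurldecode($rawPath);

// Splitting the decoded path yields a spurious extra segment ("one", "two").
$fromDecoded = explode('/', trim($decodedPath, '/'));

// Splitting the raw path first, then decoding each segment, keeps "one/two" whole.
$fromRaw = array_map('rawurldecode', explode('/', trim($rawPath, '/')));

print_r($fromDecoded);
print_r($fromRaw);
```

After decoding, nothing in the first array says that "one" and "two" were ever a single value, which is the confusion the server is guarding against.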

You can turn this feature off using the AllowEncodedSlashes directive. Note, however, that other web servers will still disallow it (with no option to turn that off), that other characters may also be taboo (e.g. %5C), and that %00 in particular will always be blocked by both Apache and IIS. So if your application relies on being able to have %2F or other such characters in a path part, you are limiting your compatibility and deployment options.
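As a rough sketch, the directive could be set like this. It is only valid in the main server configuration or a virtual host, not in .htaccess. The virtual host and its ServerName are assumptions taken from the example URLs in the question, and the NoDecode value requires Apache 2.3.12 or later.

```apache
# httpd.conf (or a vhost file); AllowEncodedSlashes is not permitted in .htaccess
<VirtualHost *:80>
    # Assumed host name, taken from the example URLs in the question
    ServerName project_name

    # Accept %2F in the path and leave it encoded rather than decoding it to "/".
    # "On" would also accept such URLs but decodes the slash like any other escape.
    AllowEncodedSlashes NoDecode
</VirtualHost>
```

With NoDecode the application still has to split the path on literal slashes first and decode each segment afterwards.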

I am using urlencode() while preparing the search URL

You should use rawurlencode(), not urlencode(), for escaping path parts. urlencode() is misnamed: it is actually for application/x-www-form-urlencoded data, such as the query string or the body of a POST request, and not for other parts of the URL.

The difference is that + doesn't mean space in path parts. rawurlencode() will correctly produce %20 instead, which will work both in form-encoded data and other parts of the URL.
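A short illustration of the difference, using a hypothetical search term that contains both a space and a slash. Note that both functions encode the slash as %2F; only the handling of the space differs.

```php
<?php
// Hypothetical search term with a space and a slash.
$keyword = 'one two/three';

// urlencode() targets form data: the space becomes "+".
$formEncoded = urlencode($keyword);     // one+two%2Fthree

// rawurlencode() follows the URL RFC: the space becomes "%20".
$pathEncoded = rawurlencode($keyword);  // one%20two%2Fthree

echo $formEncoded, "\n", $pathEncoded, "\n";

// The decoder must match the encoder: rawurldecode() leaves a literal "+"
// alone, whereas urldecode() turns it into a space.
$plusKept = rawurldecode('a+b');  // "a+b"
$plusLost = urldecode('a+b');     // "a b"

echo $plusKept, "\n", $plusLost, "\n";
```

If path parts are built with rawurlencode(), they should be read back with rawurldecode() rather than urldecode(), otherwise a keyword containing a real "+" would come back with a space in its place.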

bobince
Ah, so that is why the slash is denied. Perfect diagnosis and treatment.
Pekka
+1 I tried explaining some of this in one of his other questions, but you did it far more coherently than I was able to.
Tim Stone
Hi Bobince, `rawurlencode()` also converts forward slashes to `%2F`, which still breaks my URL. I do not understand how `rawurlencode()` fixes my problem.
sandeepan
It doesn't; that's a side issue to do with `+` vs. `%20`. The fix is `AllowEncodedSlashes`, although relying on that reduces your deployment possibilities (i.e. you can't deploy it on IIS, and other users, if there are any, won't be able to deploy it if they are on shared hosting with no access to the `httpd.conf`). Some tools or spiders might also get confused by it. Although using `%2F` to mean `/` in a path part is correct as per the standard, most of the web avoids it.
bobince
OK, I got it! In default LAMP URLs, encoded forward slashes are allowed in the query string part, and I am trying to allow them in the path part of my URL.
sandeepan
Yes, any sequence of encoded bytes must be allowed in the query string. Whilst any encoded byte is technically valid in a path component as per the URL RFC, servers have trouble with some of them due to the path part traditionally being used for filenames. Apart from `%00`, `%2F` and `%5C`, IIS will also give you trouble with non-ASCII byte sequences in the path that are not valid UTF-8 sequences.
bobince