I am seeing strange redirect behavior with URLs that contain percent-encoded characters. For example, the following two URLs differ ONLY in the case of the "e/E" in the first encoded byte (i.e. "%e2" versus "%E2").

URL 1: http://youlookfab.com/welookfab/topic/your-favourite-80%e2%80%99s-music-bands

  • "200 OK" HTTP status, page loads fine

URL 2: http://youlookfab.com/welookfab/topic/your-favourite-80%E2%80%99s-music-bands

  • causes a "302 Found" redirect

  • in a browser, the page redirects to the correct URL above (lowercase "e")

  • using web-sniffer.net, the content length is zero
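For reference, the two URLs should be equivalent as far as the URI spec is concerned: RFC 3986 treats the hex digits in a percent-escape as case-insensitive. A minimal Python sketch to confirm that both forms decode to the same path (the helper name `decoded_path` is just for illustration):

```python
from urllib.parse import unquote, urlsplit

URL_LOWER = "http://youlookfab.com/welookfab/topic/your-favourite-80%e2%80%99s-music-bands"
URL_UPPER = "http://youlookfab.com/welookfab/topic/your-favourite-80%E2%80%99s-music-bands"


def decoded_path(url: str) -> str:
    """Return the URL path with percent-escapes decoded as UTF-8."""
    return unquote(urlsplit(url).path)


# The raw strings differ, but the decoded paths are identical.
same_raw = URL_LOWER == URL_UPPER
same_decoded = decoded_path(URL_LOWER) == decoded_path(URL_UPPER)
print(same_raw, same_decoded)
```

So any code that compares the raw request string against a stored URL will see a mismatch, even though both strings name the same resource.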

I originally started looking into this because Google Webmaster Tools (GWT) was showing crawl errors ("Redirect error", to be specific) on a bunch of pages whose URLs contain encoded characters. Although my sitemap file specifies these characters in lowercase hex, GWT is showing them in uppercase.

I can't see any reason in .htaccess for uppercase URL-encoded characters to redirect to lowercase. The site is based on bbPress, and I don't see any reason in the bbPress code for this to happen either.

Could mod_rewrite be doing something strange? I know there was a bug in the past where URL encoded characters were handled incorrectly.

Any insight you have would be much appreciated.

[This is an integrated bbPress/WPMU installation running LAMP, hosted on a MediaTemple (dv) server]

A: 

After digging deeper I found that the redirect was actually happening inside bbPress, which detects the uppercase hex in the incoming URL and sees that as a discrepancy with the "correct" permalink (which has lowercase hex).
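To illustrate the kind of check involved, here is a rough Python sketch. It is not bbPress's actual code (bbPress is PHP), and the function names are made up for the example. It contrasts a plain string comparison against the permalink with one that lowercases the hex digits of each percent-escape first:

```python
import re

# Matches one percent-escape such as %E2 or %e2.
_PCT = re.compile(r"%[0-9A-Fa-f]{2}")


def lower_hex(path: str) -> str:
    """Lowercase only the hex digits of percent-escapes; leave the rest untouched."""
    return _PCT.sub(lambda m: m.group(0).lower(), path)


def needs_redirect_naive(requested: str, permalink: str) -> bool:
    """Hypothetical check: any byte-level difference triggers a redirect."""
    return requested != permalink


def needs_redirect_normalized(requested: str, permalink: str) -> bool:
    """Hypothetical fix: compare after normalizing the case of the escapes."""
    return lower_hex(requested) != lower_hex(permalink)


permalink = "/welookfab/topic/your-favourite-80%e2%80%99s-music-bands"
requested = "/welookfab/topic/your-favourite-80%E2%80%99s-music-bands"

print(needs_redirect_naive(requested, permalink))       # True
print(needs_redirect_normalized(requested, permalink))  # False
```

Only the escapes are normalized, so a request that differs in any other way (a different slug, for instance) would still be redirected.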

I have written this up in a little more detail, along with a simple bbPress plugin to address the issue, at http://theblogeasy.com/2009/12/26/bbpress-and-encoded-urls-with-uppercase-hex/

Regarding the Google crawl error... my theory is that this is caused when the crawler (which converts encoded URLs to uppercase hex) and bbPress (which redirects them to lowercase hex) get into an infinite loop. The crawler probably detects this condition when it gets the same URL back repeatedly and throws an error.
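That theory can be modelled with a toy simulation. Everything here is an assumption for illustration: the `server` function stands in for the bbPress redirect, the `normalize` argument stands in for whatever the crawler does to URLs before requesting them, and the hop limit of 5 is arbitrary. It says nothing about how Google's crawler is really implemented.

```python
import re

_PCT = re.compile(r"%[0-9A-Fa-f]{2}")


def upper_hex(url: str) -> str:
    """Uppercase the hex digits of every percent-escape."""
    return _PCT.sub(lambda m: m.group(0).upper(), url)


def lower_hex(url: str) -> str:
    """Lowercase the hex digits of every percent-escape."""
    return _PCT.sub(lambda m: m.group(0).lower(), url)


def server(url: str):
    """Toy server: 302 to the lowercase-hex permalink unless the request already matches it."""
    canonical = lower_hex(url)
    if url == canonical:
        return 200, None
    return 302, canonical


def crawl(url: str, normalize, max_hops: int = 5):
    """Follow redirects, applying the client's URL normalization before each request."""
    hops = 0
    while hops <= max_hops:
        status, location = server(normalize(url))
        if status == 200:
            return "ok", hops
        url = location
        hops += 1
    return "redirect error", hops


start = "/welookfab/topic/your-favourite-80%E2%80%99s-music-bands"
print(crawl(start, upper_hex))       # client that always re-uppercases the escapes
print(crawl(start, lambda u: u))     # client that requests the Location header as given
```

In this model, a client that re-uppercases every URL never reaches the 200 response and gives up at the hop limit, while a client that follows the `Location` value verbatim (as a browser does) succeeds after one redirect.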

Greg