We have a PHP app with a dynamic URL scheme that requires characters to be percent-encoded, even "unreserved characters" like parentheses or apostrophes, which aren't actually required to be encoded. URLs that the app deems to be encoded the "wrong" way are canonicalized and then redirected to the "right" encoding.
But Google and other user agents will canonicalize percent-encoding/decoding differently, meaning when Googlebot requests the page it will ask for the "wrong" URL, and when it gets back a redirect to the "right" URL, Googlebot will refuse to follow the redirect and will refuse to index the page.
Yes, this is a bug on our end. The HTTP specs require that servers treat percent-encoded and non-percent-encoded unreserved characters identically. But fixing the problem in the app code is not straightforward right now, so I was hoping to avoid a code change by using an Apache rewrite rule which would ensure that URLs are encoded "properly" from the point of view of the app, meaning that apostrophes, parentheses, etc. are all percent-encoded and that spaces are encoded as + and not %20.
Here's one example, where I want to rewrite the first and end up with the second form:
- www.splunkbase.com/apps/All/4.x/Add-On/app:OPSEC+LEA+for+Check+Point+(Linux)
- www.splunkbase.com/apps/All/4.x/Add-On/app:OPSEC+LEA+for+Check+Point+%28Linux%29
Here's another:
- www.splunkbase.com/apps/All/4.x/app:Benford's+Law+Fraud+Detection+Add-on
- www.splunkbase.com/apps/All/4.x/app:Benford%27s+Law+Fraud+Detection+Add-on
Here's another:
- www.splunkbase.com/apps/All/4.x/app:Benford%27s%20Law%20Fraud%20Detection%20Add-on
- www.splunkbase.com/apps/All/4.x/app:Benford%27s+Law+Fraud+Detection+Add-on
If the app sees only the second form of these URLs, then it won't send any redirects and Google will be able to index the page.
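To pin down exactly what transformation I'm after, here is a minimal Python sketch of it (not a rewrite rule, just a specification of the mapping). The function name canonicalize_path is made up for illustration. It assumes that "+" in a path always means a space in our app, and that ":" plus letters, digits, "-", "_", "." and "~" are the only characters the app leaves unencoded; both assumptions are inferred from the examples above, not from the app code.

```python
from urllib.parse import quote_plus, unquote_plus


def canonicalize_path(path: str) -> str:
    """Return the path in the encoding the app treats as "right".

    Hypothetical helper: each segment is fully decoded, then re-encoded
    so that spaces become "+" and characters such as "'", "(" and ")"
    become %27, %28 and %29.
    """
    canonical_segments = []
    for segment in path.split("/"):
        # Decode %XX escapes and treat "+" as a space (assumption: the
        # app never uses a literal plus sign in these paths).
        decoded = unquote_plus(segment)
        # Re-encode: space -> "+", everything else percent-encoded except
        # letters, digits, "_", ".", "-", "~" and ":" (assumed safe set).
        canonical_segments.append(quote_plus(decoded, safe=":"))
    return "/".join(canonical_segments)
```

Splitting on "/" before decoding keeps an encoded slash (%2F) inside a segment from turning into a path separator. A rewrite rule, or a RewriteMap calling out to something like this, would need to produce the same output for the three examples above and leave already-canonical URLs unchanged.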
I'm a newbie with rewrite rules, and it was clear from my reading of the mod_rewrite documentation that mod_rewrite does some automatic encoding/decoding of its own. I'm not sure whether that will help or hurt what I want to do.
Any advice on rewrite rules to handle the above cases? I'm OK with one rule per special character since there aren't many of them, but a single rule (if possible) would be ideal.