We have a PHP app with a dynamic URL scheme which requires characters to be percent-encoded, even "unreserved characters" like parentheses or apostrophes which aren't actually required to be encoded. URLs which the app deems to be encoded the "wrong" way are canonicalized and then redirected to the "right" encoding.

But Google and other user agents will canonicalize percent-encoding/decoding differently, meaning when Googlebot requests the page it will ask for the "wrong" URL, and when it gets back a redirect to the "right" URL, Googlebot will refuse to follow the redirect and will refuse to index the page.

Yes, this is a bug on our end. The HTTP specs require that servers treat percent-encoded and non-percent-encoded unreserved characters identically. But fixing the problem in the app code is not straightforward right now, so I was hoping to avoid a code change by using an Apache rewrite rule which would ensure that URLs are encoded "properly" from the app's point of view, meaning that apostrophes, parentheses, etc. are all percent-encoded and that spaces are encoded as + and not %20.

Here's one example, where I want to rewrite the first and end up with the second form:

  • www.splunkbase.com/apps/All/4.x/Add-On/app:OPSEC+LEA+for+Check+Point+(Linux)
  • www.splunkbase.com/apps/All/4.x/Add-On/app:OPSEC+LEA+for+Check+Point+%28Linux%29

Here's another:

  • www.splunkbase.com/apps/All/4.x/app:Benford's+Law+Fraud+Detection+Add-on
  • www.splunkbase.com/apps/All/4.x/app:Benford%27s+Law+Fraud+Detection+Add-on

Here's another:

  • www.splunkbase.com/apps/All/4.x/app:Benford%27s%20Law%20Fraud%20Detection%20Add-on
  • www.splunkbase.com/apps/All/4.x/app:Benford%27s+Law+Fraud+Detection+Add-on

If the app sees only the second form of these URLs, then it won't send any redirects and Google will be able to index the page.
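To pin down the mapping I'm after, here is a minimal Python sketch of the canonical form, independent of Apache. The function name `canonicalize` and the choice to leave `/` and `:` unescaped are my assumptions, inferred from the example URLs above rather than taken from the app's code.

```python
from urllib.parse import quote_plus, unquote_plus

# Assumption: the app leaves "/" and ":" unescaped, as in the examples above.
SAFE = "/:"

def canonicalize(path):
    """Decode any existing escapes, then re-encode in the form the app expects."""
    # "%20" and "+" both become a space; "%27" becomes "'" and so on.
    decoded = unquote_plus(path)
    # Space -> "+", "(" -> "%28", ")" -> "%29", "'" -> "%27".
    # Caveat: an escaped slash ("%2F") would come back as a plain "/".
    return quote_plus(decoded, safe=SAFE)

print(canonicalize("/apps/All/4.x/app:Benford%27s%20Law%20Fraud%20Detection%20Add-on"))
```

Any rewrite rule that turns the first URL of each pair into the second would have to agree with this function on those inputs.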

I'm a newbie with rewrite rules, and it was clear from my reading of the mod_rewrite documentation that mod_rewrite does some automatic encoding/decoding, which may help or hurt what I want to do, although I'm not sure which.

Any advice for rewrite rules to handle the above cases? I'm OK with a rule for each special character since there's not many of them, but a single rule (if possible) would be ideal.

+1  A: 

mod_rewrite is not the best tool for this kind of work, because it can only replace a fixed number of occurrences at a time. But it is possible:

RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?\ ]*)%20([^?\ ]*)
RewriteRule ^ /%1+%2 [R=301,NE]
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?'\ ]*)'([^?'\ ]*)
RewriteRule ^ /%1\%27%2 [R=301,NE]
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?(\ ]*)\(([^?(\ ]*)
RewriteRule ^ /%1\%28%2 [R=301,NE]
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?)\ ]*)\)([^?)\ ]*)
RewriteRule ^ /%1\%29%2 [R=301,NE]

Each of these rules replaces one %20, ', (, or ) at a time and responds with a 301 redirect. So if a URL path contains 10 characters that need to be replaced, it takes 10 redirects to do so.
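The cost of the one-replacement-per-redirect approach can be illustrated with a small Python simulation. This is only a simplified model: the regular expressions mirror the four rules above but run against the bare path instead of %{THE_REQUEST}, and the helper names `next_location` and `count_redirects` are made up for the example.

```python
import re

# (pattern, replacement) pairs mirroring the four rules above.
RULES = [
    (re.compile(r"^([^? ]*)%20([^? ]*)"), r"\1+\2"),
    (re.compile(r"^([^?' ]*)'([^?' ]*)"), r"\1%27\2"),
    (re.compile(r"^([^?( ]*)\(([^?( ]*)"), r"\1%28\2"),
    (re.compile(r"^([^?) ]*)\)([^?) ]*)"), r"\1%29\2"),
]

def next_location(path):
    """Return the redirect target for one request, or None if nothing matches."""
    for pattern, replacement in RULES:
        if pattern.search(path):
            # Exactly one occurrence is fixed per response.
            return pattern.sub(replacement, path, count=1)
    return None

def count_redirects(path):
    """Follow redirects until the path is clean; return (hops, final_path)."""
    hops = 0
    target = next_location(path)
    while target is not None:
        path = target
        hops += 1
        target = next_location(path)
    return hops, path

print(count_redirects("/app:Benford's%20Law%20(x)"))
```

In this model the number of hops equals the number of offending characters in the path, which is the behaviour described above.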

Since this might not be the best solution, it is possible to do all replacements except the last one internally using the N flag, and only the last replacement externally with a redirect:

RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^?%\ ]|%(2[1-9a-fA-F]|[013-9][0-9a-fA-F]))*)%20(([^?%\ ]|%(2[1-9a-fA-F]|[013-9][0-9a-fA-F]))*%20[^?\ ]*)
RewriteRule ^ /%1+%4 [N,NE]
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?\ ]*)%20([^?\ ]*)[?\ ]
RewriteRule ^ /%1+%2 [R=301,NE]
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?'\ ]*)'([^?'\ ]*'[^?\ ]*)
RewriteRule ^ /%1\%27%2 [N,NE]
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?'\ ]*)'([^?'\ ]*)[?\ ]
RewriteRule ^ /%1\%27%2 [R=301,NE]
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?(\ ]*)\(([^?(\ ]*\([^?\ ]*)
RewriteRule ^ /%1\%28%2 [N,NE]
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?(\ ]*)\(([^?(\ ]*)[?\ ]
RewriteRule ^ /%1\%28%2 [R=301,NE]
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?)\ ]*)\)([^?)\ ]*\)[^?\ ]*)
RewriteRule ^ /%1\%29%2 [N,NE]
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?)\ ]*)\)([^?)\ ]*)[?\ ]
RewriteRule ^ /%1\%29%2 [R=301,NE]

But using the N flag can be dangerous as it doesn’t increment the internal recursion counter and thus can easily lead to infinite recursion.

Gumbo
Hmmm. At the level of complexity above, it's probably easier to ask the dev team to rewrite their redirection code. :-) The other answer looks to be simpler, so I'll accept it. But I like your general idea of repeating rules-- it may not be the solution I'd want here, but may be useful in other circumstances. Thanks! +1
Justin Grant
@Justin Grant: Yes, probably.
Gumbo
+1  A: 

The solution may actually be fairly simple, though it will only work in Apache 2.2 and later because it relies on the B flag. I'm not sure whether it handles every case correctly (admittedly, I'm a bit skeptical that it doesn't take more work than this), but the source code leads me to believe it should.

Keep in mind too that the value of REQUEST_URI is not updated by mod_rewrite transformations, so if your application relies on that value to determine the requested URL, the changes you make won't be visible anyway.

The good news is that this can be done in .htaccess, so you have the option of leaving the main configuration untouched if that works better for you.

RewriteEngine On

# Make sure this is only done once to avoid escaping the escapes...
RewriteCond %{ENV:REDIRECT_STATUS} ^$
# Check if we have anything to bother escaping (likely unnecessary...)
RewriteCond $0 [^\w]+
# Rewrite the entire URL by escaping the backreference
RewriteRule ^.*$ $0 [B]

So, why is there a need to use the B flag instead of letting mod_rewrite escape the rewritten URL automatically? When mod_rewrite automatically escapes the URL, it uses ap_escape_uri (which apparently has been turned into a macro for ap_os_escape_path for some reason...), a function that escapes a limited subset of characters. The B flag, however, uses an internal module function called escape_uri, which is modeled on PHP's urlencode function.

The implementation of escape_uri in the module suggests that alphanumeric characters and underscores are left as-is, spaces are converted to +, and everything else is converted to its escaped equivalent. This seems to be the behaviour that you want, so presumably it should work.
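As a rough illustration of that description, here is a Python sketch of the escaping behaviour. It is written from the prose above, not from the module source, so treat it as an approximation; the function name `escape_like_b_flag` and the use of uppercase hex digits are assumptions.

```python
def escape_like_b_flag(text):
    """Approximate the escaping described above, byte by byte."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch.isascii() and (ch.isalnum() or ch == "_"):
            out.append(ch)             # ASCII letters, digits and "_" stay as-is
        elif ch == " ":
            out.append("+")            # spaces become "+"
        else:
            out.append("%%%02X" % byte)  # everything else is percent-escaped
    return "".join(out)

print(escape_like_b_flag("Benford's Law (x)"))
```

Note that under this description characters such as / and : would be escaped as well, so it is worth checking the result against the URLs in the question before relying on it.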

If not, you do have the option of setting up an external program RewriteMap that could manipulate your incoming URLs into the correct format. This requires manipulating the Apache configuration though, and a renegade script could cause problems for the server on the whole, so I don't consider it an ideal solution if it can be avoided.
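For what it's worth, a map program for the prg: map type is just a loop that reads one lookup key per line on standard input and writes one result per line on standard output, without buffering. A minimal Python sketch might look like the following; the function name `serve` and the particular canonical form (decode, then re-encode while keeping / and :) are assumptions for illustration, not something the app is known to require.

```python
from urllib.parse import quote_plus, unquote_plus

def serve(stream_in, stream_out):
    """Answer one lookup per input line, one result line per key."""
    for line in stream_in:
        key = line.rstrip("\n")
        # Hypothetical canonical form: decode, then re-encode keeping "/" and ":".
        value = quote_plus(unquote_plus(key), safe="/:")
        stream_out.write(value + "\n")
        # Flush after every reply so the server is never left waiting on a buffer.
        stream_out.flush()
```

In an actual map program the loop would be started with serve(sys.stdin, sys.stdout) after importing sys. Because Apache keeps the program running and sends every lookup through it, a crash or hang here affects all requests, which is the risk mentioned above.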

Tim Stone
This looks like the answer I was looking for. Nice!
Justin Grant