I saw the post http://stackoverflow.com/questions/2565864/validating-utf-8-in-htaccess-rewrite-rule and I think it is great, but I am having a more fundamental problem first.
I need to expand my application to handle UTF-8 characters in query string parameters, directory names, file names, text displayed to users, etc.
I configured Apache with AddDefaultCharset utf-8, and PHP as well, if that matters. My original rewrite rule filtered out everything except plain A-Za-z, underscore, and hyphen, and it worked: anything else gave a 404 (which is what I want!). Now, however, it seems that everything matches, including things I don't want. Yet although the rule seems to match, the captured value doesn't end up in the query string unless it is a plain A-Za-z_- string.
I find this confusing, because the rule says to put whatever was matched into the query string.
Here is the original rule:
RewriteRule ^/puzzle/([A-Za-z_-]+)$ /puzzle.php?g=$1 [NC]
and here is the revised rule:
RewriteRule ^/puzzle/(\w+)$ /puzzle.php?g=$1 [NC]
I made the change because I read somewhere that \w matches ALL alphabetic characters, whereas A-Z etc. only matches the ones without accents.
It doesn't seem to matter which of those rules I use. Here is what happens:
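As a sanity check on that claim, here is a small Python sketch. It uses Python's re module rather than the PCRE library that Apache uses, so it is only an analogy, and it assumes that Apache compares the pattern against the raw UTF-8 bytes of the decoded path. In that byte-oriented mode \w covers only ASCII letters, digits, and underscore, so the two bytes of é do not match. Note also that \w never matches a hyphen, which the original rule allowed.

```python
import re

name = "México"
raw = name.encode("utf-8")  # b'M\xc3\xa9xico': the é becomes two bytes

# Byte-oriented matching: \w is limited to [A-Za-z0-9_].
byte_match = re.fullmatch(rb"\w+", raw)

# Unicode-aware matching: \w also covers accented letters.
text_match = re.fullmatch(r"\w+", name)

# \w includes digits and underscore, but not the hyphen.
hyphen_match = re.fullmatch(r"\w+", "jig-saw")

print(byte_match is not None)    # False
print(text_match is not None)    # True
print(hyphen_match is not None)  # False
```

If Apache behaves like the byte-oriented case, switching to \w would not help with accented names and would additionally let digits through.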
In the application I have this:
echo $_GET['g'];
If I feed it a URL like http://mydomain.com/puzzle/USA it echoes "USA" and works fine.
If I feed it a URL like http://mydomain.com/puzzle/México it echoes nothing, warns me that index g is not defined, and of course doesn't fetch the resources for Mexico.
If I feed it a URL like http://mydomain.com/puzzle/fuzzle/buzzle/j.qle it does the same thing.
This last case should be a 404!
And it does this no matter which of the above rules I use. I configured a rewrite log:
RewriteLogLevel 5
RewriteLog /opt/local/apache2/logs/puzzles.httpd.rewrite
but it is empty.
Here is an excerpt from the regular access log (both requests get a status of 200):
[26/May/2010:11:21:42 -0700] "GET /puzzle/M%C3%A9xico HTTP/1.1" 200 342
[26/May/2010:11:21:54 -0700] "GET /puzzle/M/l.foo HTTP/1.1" 200 342
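The %C3%A9 in the first log line is the percent-encoded UTF-8 form of é. To make the behaviour I am after concrete, here is a minimal Python sketch that decodes such a path segment and accepts it only if it consists of letters (accented ones included), underscore, or hyphen. The function name is_valid_name is hypothetical, and this is only an illustration of the intended rule, not how Apache or PHP do it.

```python
from urllib.parse import unquote


def is_valid_name(segment: str) -> bool:
    """Return True if the percent-decoded segment holds only letters, '_' or '-'."""
    try:
        decoded = unquote(segment, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        return False  # the escaped bytes are not valid UTF-8
    return bool(decoded) and all(ch.isalpha() or ch in "_-" for ch in decoded)


print(unquote("M%C3%A9xico"))        # México
print(is_valid_name("M%C3%A9xico"))  # True
print(is_valid_name("M/l.foo"))      # False
```

Under this rule the second request in the log would be rejected, which is the case I want to turn into a 404.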
What can I do to get these $%#$@(*#@!!! characters, but not slash, dot, or other non-alphabetic characters, into my program, and once they are there, will it decode them correctly? Would POSIX character classes work any better? Is there anything else I need to configure?