I saw the post http://stackoverflow.com/questions/2565864/validating-utf-8-in-htaccess-rewrite-rule and I think it is great, but I am stuck on a more fundamental problem first.

I need to handle UTF-8 characters in query string parameters, in the names of directories and files, and in text displayed to users.

I configured my Apache with DefaultCharset utf-8, and also my PHP, if that matters. My original rewrite rule filtered out everything except plain A-Za-z, underscore, and hyphen, and it worked: anything else gave a 404 (which is what I want!). Now, however, everything seems to match, including stuff I don't want. And although it seems to match, the value doesn't go into the query string unless it is a plain A-Za-z_- string.

I find this confusing, because the rule says to put whatever was matched into the query string.

Here is the original rule:

RewriteRule ^/puzzle/([A-Za-z_-]+)$ /puzzle.php?g=$1 [NC]

and here is the revised rule:

RewriteRule ^/puzzle/(\w+)$ /puzzle.php?g=$1 [NC]

I made the change because I read somewhere that \w matches ALL the alphabetic characters, whereas A-Z etc. only matches the ones without accents and the like.

It doesn't seem to matter which of those rules I use. Here is what happens:

In the application I have this:

echo $_GET['g'];

If I feed it a URL like http://mydomain.com/puzzle/USA, it echoes "USA" and works fine.
If I feed it a URL like http://mydomain.com/puzzle/México, it echoes nothing, warns me that index g is not defined, and of course doesn't load the resources for Mexico.
If I feed it a URL like http://mydomain.com/puzzle/fuzzle/buzzle/j.qle, it does the same thing.
This last case should be a 404!

And it does this no matter which of the above rules I use. I configured a rewrite log

   RewriteLogLevel 5
   RewriteLog /opt/local/apache2/logs/puzzles.httpd.rewrite

but it is empty.

Here is an excerpt from the regular access log (it gives a status of 200):

[26/May/2010:11:21:42 -0700] "GET /puzzle/M%C3%A9xico HTTP/1.1" 200 342
[26/May/2010:11:21:54 -0700] "GET /puzzle/M/l.foo HTTP/1.1" 200 342

What can I do to get these $%#$@(*#@!!! characters but not slash, dot or other non-alpha into my program, and once there, will it decode them correctly??? Would posix char classes work any better? Is there anything else I need to configure?

A: 

On...

RewriteRule ^/puzzle/(\w+)$ /puzzle.php?g=$1 [NC]

Someone correct me if I'm wrong, but wouldn't this mean get requests asking for subdirectories simply bypass this rule?

Also, a lazy way to solve this is to include the '%' character in the group as well. As far as I know, all you're allowed to work with in any URL path is URL encoding. See: http://www.blooberry.com/indexdot/html/topics/urlencoding.htm

I'm sure there are more advanced and better ways to do this, but that should solve your immediate problem.

bob_the_destroyer
Sorry, I think I gave inaccurate info. According to the Apache docs (http://httpd.apache.org/docs/2.2/mod/mod_rewrite.html), the engine unescapes the URL before processing, and then escapes the rewritten output. Try playing with the B flag to keep non-alpha characters escaped within backreferences during processing. This should let you use the pattern "RewriteRule ^/puzzle/([a-zA-Z0-9_-]|%[a-fA-F0-9]{2})+$ [NC,B]". Otherwise, you'll have to include searches for "\xFF" values within ranges covering only irregular alphabetical characters.
bob_the_destroyer
From that point, use PHP to translate ['g'] or simply return '404' header if you don't like the value.
bob_the_destroyer
Ugh! I meant 'RewriteRule ^/puzzle/(([a-zA-Z0-9_-]|%[a-fA-F0-9]{2})+)$ /puzzle.php?g=$1 [NC,B]' so the entire repeated group is captured. Maybe I should just step away from the computer for a while instead.
bob_the_destroyer
Thanks. I'm still completely confused about the big picture but I do see what your second rewrite rule is trying to do, so I'm testing it back on the Mac localhost. I had all kinds of problems when I tried to do some tests on Linux.
It acts the same. What should I be feeding it: http://mydomain.com/puzzle/México or http://mydomain.com/puzzle/M%E9xico or something else? BTW neither worked.
A: 

This is a response to the destroyer's answer but it got too long.

I'm down with URL-encoding the Unicode because it's easy enough to decode it for display. So maybe that's the basic problem. Eventually I'll just use urlencode in PHP for this, but I thought I'd try an online encoder just to test things. I went to http://www.opinionatedgeek.com/dotnet/tools/urlencode/Encode.aspx, tried to encode México, and it came out as M%c3%a9xico. I went to the site you indicated, tried it there, and it came out as M%E9xico. Different!! Which is it??? I guess I'd have to accept whatever the PHP function actually gives me. But both of those have a 9 in them, which means I have to accept digits as well as %. Is that ALL I'd have to include?
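To see the two encodings side by side, here is a minimal PHP sketch. It assumes PHP 7 or later and spells the bytes out by hand so the result does not depend on how the source file is saved; the helper name percentEncodeBothWays is made up for this illustration.

```php
<?php
// Hypothetical helper for illustration only: percent-encode "México" once
// from its UTF-8 bytes and once from its ISO-8859-1 bytes.
function percentEncodeBothWays(): array
{
    $utf8   = "M\xC3\xA9xico"; // é is the two bytes C3 A9 in UTF-8
    $latin1 = "M\xE9xico";     // é is the single byte E9 in ISO-8859-1

    return [
        'utf8'   => rawurlencode($utf8),
        'latin1' => rawurlencode($latin1),
    ];
}

$encoded = percentEncodeBothWays();
echo $encoded['utf8'], "\n";   // M%C3%A9xico
echo $encoded['latin1'], "\n"; // M%E9xico

// Decoding does not care about the case of the hex digits, so %c3%a9 and
// %C3%A9 give back the same bytes.
var_dump(rawurldecode('M%c3%a9xico') === "M\xC3\xA9xico"); // bool(true)
```

Both outputs are valid percent-encodings; they simply encode different byte sequences. The access log in the question shows the browser sending M%C3%A9xico, which is the UTF-8 form. Percent-encoding itself only ever introduces % plus the hex digits 0-9 and A-F (in either case).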

I would hope that requests for genuine subdirectories would NOT match this rule, if that's what you mean by bypassing it. I'd rather they actually render the static pages in those subdirectories. That's why I really want to exclude /, which I thought I did. But the rule seems to match anything after the /, including nested subdirectories, and sends it to the puzzle.php file.

Here is what I tried, but no joy. I used this rule: RewriteRule ^/puzzle/([A-Za-z0-9_%-]+)$ /puzzle.php?g=$1 [NC] (as you can see, I added % and 0-9 to the group). Do I need to escape the % or something? I read that only \ needs escaping inside square brackets. I hope that's what you meant. Would these be the only additional characters you could get by encoding any possible Unicode string?

Then I passed in the two different URL-encoded versions of Mexico. For M%E9xico I now get a 404 with this message: The requested URL /puzzle/México was not found on this server. For M%c3%a9xico I get this message on the 404: The requested URL /puzzle/México was not found on this server. And for nonexistent subdirectories it now gives a 404, as it should. So now it is just the rewrite rule not working. That's progress.

Also, the rewrite log started getting entries. Here are some; I will google how to read these logs:

kidd108d-mac3:logs tpdick$ cat puzzles.httpd.rewrite 
::1 - - [26/May/2010:15:54:37 --0700] [puzzles.net/sid#886b00][rid#904858/initial] (2) init rewrite engine with requested uri /puzzle/M?xico
::1 - - [26/May/2010:15:54:37 --0700] [puzzles.net/sid#886b00][rid#904858/initial] (3) applying pattern '^/puzzle/([A-Za-z0-9_%-]+)$' to uri '/puzzle/M?xico'
::1 - - [26/May/2010:15:54:37 --0700] [puzzles.net/sid#886b00][rid#904858/initial] (1) pass through /puzzle/M?xico
::1 - - [26/May/2010:15:54:37 --0700] [puzzles.net/sid#886b00][rid#910858/subreq] (3) [perdir /Users/tpdick/Sites/puzzles/] add path info postfix: /Users/tpdick/Sites/puzzles/puzzle.php -> /Users/tpdick/Sites/puzzles/puzzle.php/M?xico
::1 - - [26/May/2010:15:54:37 --0700] [puzzles.net/sid#886b00][rid#910858/subreq] (3) [perdir /Users/tpdick/Sites/puzzles/] strip per-dir prefix: /Users/tpdick/Sites/puzzles/puzzle.php/M?xico -> puzzle.php/M?xico
::1 - - [26/May/2010:15:54:37 --0700] [puzzles.net/sid#886b00][rid#910858/subreq] (3) [perdir /Users/tpdick/Sites/puzzles/] applying pattern '^(.*)/GeoP-Test/puzzle/(.*)$' to uri 'puzzle.php/M?xico'
::1 - - [26/May/2010:15:54:37 --0700] [puzzles.net/sid#886b00][rid#910858/subreq] (1) [perdir /Users/tpdick/Sites/puzzles/] pass through /Users/tpdick/Sites/puzzles/puzzle.php
::1 - - [26/May/2010:15:54:37 --0700] [puzzles.net/sid#886b00][rid#904858/initial] (3) [perdir /Users/tpdick/Sites/puzzles/] add path info postfix: /Users/tpdick/Sites/puzzles/puzzle.php -> /Users/tpdick/Sites/puzzles/puzzle.php/M?xico
::1 - - [26/May/2010:15:54:37 --0700] [puzzles.net/sid#886b00][rid#904858/initial] (3) [perdir /Users/tpdick/Sites/puzzles/] strip per-dir prefix: /Users/tpdick
::1 - - [26/May/2010:15:54:37 --0700] [puzzles.net/sid#886b00][rid#904858/initial] (2) init rewrite engine with requested uri /puzzle/México
::1 - - [26/May/2010:15:54:37 --0700] [puzzles.net/sid#886b00][rid#904858/initial] (3) applying pattern '^/puzzle/([A-Za-z0-9_%-]+)$' to uri '/puzzle/México'
::1 - - [26/May/2010:15:54:37 --0700] [puzzles.net/sid#886b00][rid#904858/initial] (1) pass through /puzzle/México
::1 - - [26/May/2010:15:54:37 --0700] [puzzles.net/sid#886b00][rid#910858/subreq] (3) [perdir /Users/tpdick/Sites/puzzles/] add path info postfix: /Users/tpdick/Sites/puzzles/puzzle.php -> /Users/tpdick/Sites/puzzles/puzzle.php/México
::1 - - [26/May/2010:15:54:37 --0700] [puzzles.net/sid#886b00][rid#910858/subreq] (3) [perdir /Users/tpdick/Sites/puzzles/] strip per-dir prefix: /Users/tpdick/Sites/puzzles/puzzle.php/México -> puzzle.php/México
::1 - - [26/May/2010:15:54:37 --0700] [puzzles.net/sid#886b00][rid#910858/subreq] (3) [perdir /Users/tpdick/Sites/puzzles/] applying pattern '^(.*)/GeoP-Test/puzzle/(.*)$' to uri 'puzzle.php/México'
::1 - - [26/May/2010:15:54:37 --0700] [puzzles.net/sid#886b00][rid#910858/subreq] (1) [perdir /Users/tpdick/Sites/puzzles/] pass through /Users/tpdick/Sites/puzzles/puzzle.php
::1 - - [26/May/2010:15:54:37 --0700] [puzzles.net/sid#886b00][rid#904858/initial] (3) [perdir /Users/tpdick/Sites/puzzles/] add path info postfix: /Users/tpdick/Sites/puzzles/puzzle.php -> /Users/tpdick/Sites/puzzles/puzzle.php/México

Now what??

I failed to mention that this is a Mac and I've heard Macs are inconsistent about unicode. It's almost like the MacOS is saying oh I know that, it's unicode and then reencoding it before feeding it to apache or something. I'm gonna go play around on a linux box I'm lucky enough to have access to and see if I get better results.
A: 

I'd suggest you activate MultiViews and forget mod_rewrite. Add to your apache configuration in the relevant Directory/VirtualHost section:

Options +MultiViews
#should already be set to this, but it doesn't hurt:
AcceptPathInfo Default

Now you can always omit the extension, as long as the client includes the corresponding MIME type in its Accept header.

Now a request for /puzzle/whatever will map to /puzzle.php and $_SERVER['PATH_INFO'] will be filled with /whatever.
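On the PHP side, reading that value could look roughly like the sketch below. It assumes PHP 7.1 or later; the function name puzzleNameFromPathInfo and the decision to treat any further slash as a 404 are assumptions made for this illustration, not something MultiViews does for you.

```php
<?php
// Hypothetical helper: turn a PATH_INFO value such as "/whatever" into a
// puzzle name, or return null when the request should get a 404.
function puzzleNameFromPathInfo(?string $pathInfo): ?string
{
    // PATH_INFO is absent when nothing follows the script name.
    if ($pathInfo === null || $pathInfo === '' || $pathInfo[0] !== '/') {
        return null;
    }
    // Drop the leading slash; the cast covers PHP versions before 8.0,
    // where substr() can return false instead of an empty string.
    $name = (string) substr($pathInfo, 1);
    // Reject an empty name and anything that looks like a subdirectory.
    if ($name === '' || strpos($name, '/') !== false) {
        return null;
    }
    return $name;
}

$name = puzzleNameFromPathInfo($_SERVER['PATH_INFO'] ?? null);
if ($name === null) {
    http_response_code(404);
    echo "Not found\n";
} else {
    // Escape before display; the name is still untrusted input.
    echo htmlspecialchars($name, ENT_QUOTES, 'UTF-8'), "\n";
}
```

With this in place, /puzzle/USA would yield "USA", while /puzzle/fuzzle/buzzle/j.qle would be answered with a 404 from PHP.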


If you want to do it with mod_rewrite it's also possible. The test string for RewriteRule is unescaped (the %xx portions are converted to the actual bytes they represent). You can get the original escaped string using %{REQUEST_URI} or %{THE_REQUEST} (the last one also contains the HTTP method and version).

By convention, web browsers use UTF-8 encoding in URLs. This means that "México" will be URL-encoded to M%C3%A9xico, not M%E9xico, which is what you would expect if browsers used ISO-8859-1. Also, [a-zA-Z] will not match é. However, this should work:

RewriteCond %{REQUEST_URI} ^/puzzle/[^/]*$
RewriteRule ^/puzzle/(.*)$ /puzzle.php?g=$1 [B,L]

You need B to escape the backreference because you're using it in a query string, in which the set of characters that are allowed is smaller than for the rest of the URI.
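As an analogy for why the backreference has to be escaped, here is a small PHP sketch. It is not how Apache implements the B flag; it only shows, with a made-up capture value and a hypothetical helper parseQuery, what happens when unescaped text is pasted into a query string.

```php
<?php
// Hypothetical helper: parse a query string into an associative array.
function parseQuery(string $query): array
{
    $params = [];
    parse_str($query, $params);
    return $params;
}

$name = 'a&b=c'; // made-up capture containing query-string metacharacters

// Pasted in unescaped, the capture spills over into a second parameter.
print_r(parseQuery('g=' . $name));
// g => a, b => c

// Escaped first, it arrives as a single value.
print_r(parseQuery('g=' . rawurlencode($name)));
// g => a&b=c
```

The same reasoning applies to the raw bytes of é: once escaped, they travel through the query string as %C3%A9 and are decoded again when PHP fills $_GET.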

The thing you should be aware of is that RewriteRule is not unicode aware. Anything other than .* can give (potentially) incorrect results. Even [^/] may not work because the / "character" (read: byte) may be part of a multi-byte character sequence. If RewriteRule were unicode aware, your solution with \w should work.

Since you do not want to match subdirectories, and RewriteRule ^/puzzle/[^/]* is not an option, that check is deferred to a RewriteCond that uses the (escaped) %{REQUEST_URI}.

Artefacto
A: 

This solution is based on: http://www.dracos.co.uk/code/apache-rewrite-problem/

Try these rewrite rules:

AddDefaultCharset UTF-8
RewriteEngine On
RewriteCond %{THE_REQUEST} /puzzle/([^?\ /]+)
RewriteRule ^puzzle/(.*)$ puzzle.php/%1 [L]

How to get the query param:

<?php
// Get query param
$g = substr($_SERVER['PATH_INFO'], 1); 
echo "<p>g: $g</p>";

// Test if '/' is present in URL for 404's
$g2 = substr($_SERVER['REQUEST_URI'], 8); 
if (strpos($g2, '/') === false) {
    // do stuff
} else {
    // Send 404 header here
    echo "<p>404</p>";
}
?>

With this solution, you have to send the 404 from PHP yourself.
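Since the filtering ends up in PHP anyway, the "letters only, no slash or dot" rule from the question could be checked there with a Unicode-aware pattern. This is a minimal sketch, assuming the value arrives as UTF-8; the function name isValidPuzzleName and the exact character set are assumptions for illustration.

```php
<?php
// Hypothetical validator: accept only letters (in any script), digits,
// underscore, and hyphen. The /u modifier makes PCRE treat the subject as
// UTF-8; on invalid UTF-8 preg_match() does not return 1, so the input
// is rejected.
function isValidPuzzleName(string $name): bool
{
    return preg_match('/\A[\p{L}\p{N}_-]+\z/u', $name) === 1;
}

var_dump(isValidPuzzleName('USA'));           // bool(true)
var_dump(isValidPuzzleName("M\xC3\xA9xico")); // bool(true), UTF-8 é
var_dump(isValidPuzzleName("M\xE9xico"));     // bool(false), lone ISO-8859-1 byte
var_dump(isValidPuzzleName('fuzzle/buzzle')); // bool(false)
var_dump(isValidPuzzleName('j.qle'));         // bool(false)
```

When the check fails, the script can call http_response_code(404) before producing any output. Note that this pattern does not accept combining marks, so a name sent in decomposed form (e followed by a combining accent) would be rejected unless \p{M} is added to the class.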

flo