I am struggling with a very basic regex problem in my .htaccess file that I hope someone may be able to shed some light on. The basic premise is that I would like to teach Apache to switch any .html extension into a .var extension. I had thought that the rule would be positively trivial:

RewriteRule ^([^.]+)\.html$ $1.var

But the [^.] part simply doesn't work. Bizarrely, the following does work:

RewriteRule ^([^A-Z]+)\.html$ $1.var

I do not understand why this latter rule works. Assume I am looking for a file called "index.html": then $1 should match "index." and the ".html" bit should actually fail to match.

To widen the scope of the question slightly, I am actually racking my brain on how to implement a multi-lingual site. I don't like Apache's MultiViews option because it forces upon me a flat directory structure with file extensions that aren't recognizable to many development tools. I could go the .var type-map route but am finding that the default config for Apache doesn't support this all that well either (hence my excursions into regex land). So while I am using mod_rewrite, I am thinking that I might go the whole hog: whenever a request for a name.html file is received and this file does not exist, check whether there exists an XX/name.html file instead, where "XX" is the language code according to the user's preferences.

This would give me a neater directory structure, though it would perhaps not behave as well as the .var approach in a situation where the language preference of the user's browser is not supported by my site (in which situation .var would substitute EN or similar).
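The fallback idea described above can be sketched outside Apache to make the lookup order explicit. This is only a minimal Python illustration, not a mod_rewrite configuration: the function names `preferred_languages` and `resolve`, the upper-case directory codes, and the "EN" default are all assumptions made for the example.

```python
import os

def preferred_languages(accept_language):
    """Parse an Accept-Language header into upper-case primary codes, best first."""
    weighted = []
    for position, item in enumerate(accept_language.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip()
        if not tag or tag == "*":
            continue
        quality = 1.0
        for param in parts[1:]:
            key, _, value = param.strip().partition("=")
            if key == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality <= 0:
            continue  # q=0 means "not acceptable"
        # Keep only the primary subtag: "de-AT" -> "DE".
        weighted.append((-quality, position, tag.split("-")[0].upper()))
    ordered = []
    for _, _, code in sorted(weighted):
        if code not in ordered:
            ordered.append(code)
    return ordered

def resolve(name, accept_language, exists=os.path.exists, default="EN"):
    """Return the path to serve for `name`, or None if nothing suitable exists."""
    if exists(name):
        return name
    # Try each preferred language directory, then the assumed site default.
    for code in preferred_languages(accept_language) + [default]:
        candidate = f"{code}/{name}"
        if exists(candidate):
            return candidate
    return None
```

Appending the default code to the end of the candidate list is one way to get the "substitute EN or similar" behaviour that the type-map approach provides when none of the browser's languages is available.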

Any thoughts? Thanks.

A: 

Why don't you just use ^(.*)\.html$? This will match any string that ends in .html. After all, filenames can contain more than one dot.

[^A-Z]+ matches index if the regex is applied case-sensitively. Perhaps that's the reason? Why [^.]+ should fail is beyond me, though.
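To compare the three patterns from this thread side by side, here is a small Python sketch. It assumes that Python's re module treats these particular constructs the same way as the PCRE engine behind mod_rewrite; the helper name `captured` and the sample file names are made up for the illustration.

```python
import re

# The three patterns discussed in the thread.
PATTERNS = {
    "any": r"^(.*)\.html$",
    "no_dot": r"^([^.]+)\.html$",
    "no_upper": r"^([^A-Z]+)\.html$",
}

def captured(pattern_name, path):
    """Return what $1 would hold for `path`, or None if the rule does not match."""
    match = re.match(PATTERNS[pattern_name], path)
    return match.group(1) if match else None

for name in ("index.html", "my.page.html", "Index.html"):
    print(name, {key: captured(key, name) for key in PATTERNS})
```

In this sketch all three patterns capture "index" from "index.html". They only differ on names with an extra dot (where the `no_dot` pattern does not match) or an upper-case letter (where the `no_upper` pattern does not match).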

Tim Pietzcker
Ok, I worked out that I have been an idiot. Your answer is quite correct. Alas, I did not consider the implications of my (unconditional) RewriteRule: Once I had morphed index.html into index.var, Apache's type map jumped into action and looked inside the index.var file for a resource to map. And it pulled DE/index.html out of the hat. THEN Apache subjected DE/index.html to yet another rewrite process which ended up mangling that name to DE/index.var. And THAT file then did not exist. Isn't computing wonderful :-)))
Ollie2893
Oops :) Nice detective work.
Tim Pietzcker
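The chain of events described in the comments above can be traced with a toy model. This is a Python simulation, not Apache's actual request processing: the `TYPE_MAP` dictionary stands in for whatever the type map would select, and the `guarded` option is a hypothetical fix (rewrite only when the target .var file exists), roughly what a RewriteCond in front of the rule would have to achieve.

```python
import re

RULE = re.compile(r"^(.*)\.html$")

# Hypothetical negotiation result: which variant each .var file resolves to.
TYPE_MAP = {"index.var": "DE/index.html"}

def rewrite(path, guarded=False):
    """Apply the .html -> .var rule; if guarded, only when the .var target exists."""
    target = RULE.sub(r"\1.var", path)
    if guarded and target not in TYPE_MAP:
        return path
    return target

def serve(path, guarded=False, max_passes=5):
    """Return every name the request passes through, in order."""
    trace = [path]
    for _ in range(max_passes):
        rewritten = rewrite(path, guarded)
        if rewritten != path:
            trace.append(rewritten)
        if rewritten not in TYPE_MAP:
            break  # nothing left to negotiate; this is the final name
        path = TYPE_MAP[rewritten]
        trace.append(path)
    return trace

print(serve("index.html"))
print(serve("index.html", guarded=True))
```

With the unconditional rule the trace ends at DE/index.var, which is the nonexistent file from the comment. With the guard the second rewrite is skipped and the trace ends at DE/index.html.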
A: 

The . matches everything but newlines.
Inside of a character class, the ^ means "not".
The + means one or more of the preceding character class.

So when you write ([^.]+), that says "match one or more newlines". So unless you have a URL composed of newlines followed by ".html", this will not work.

^([^A-Z]+)\.html$ works because it matches one or more characters that are not uppercase letters. If you have any uppercase letters before the ".html" in your URL, this one will fail too.

Tim Pietzcker's suggestion is correct: just use ^(.*)\.html$, keeping in mind that this won't work in the odd case that you have newlines in your URL.

In the odd case that you actually have URLs with newlines in them, you can use ^([\d\D]+)\.html$, which will match digits and non-digits (i.e. everything) up until the ".html".

Nick
Ok ... interesting. Two things confuse me: (1) My understanding from regex is that each expression tries to gobble the longest match. So how does the expression ^(.*)\.html$ function? It seems to me that .* should swallow ".html". To then match .html after, it would have to retrace its steps? (2) Are you quite sure that "." inside a character class [] retains the meaning you ascribe (which, I agree, it has outside such a class)? If so, I also tried [^\.]+ with no more joy. Surely, the \ should have escaped the regular meaning?
Ollie2893
PS: Incidentally, "^(.*)\.html$ $1.var" also fails. Before you think I am looking at a fundamental failure, "^(index)\.html$ $1.var" works (for target index.html).
Ollie2893
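Both questions raised in the comments above can be checked empirically. The following is a minimal Python sketch, on the assumption that Python's re module behaves like mod_rewrite's PCRE engine for these constructs; the variable names are just for illustration.

```python
import re

# (1) Greedy quantifiers do retrace their steps: .* first consumes the whole
# string, then gives characters back until \.html$ can match.
greedy_capture = re.match(r"^(.*)\.html$", "index.html").group(1)

# (2) Inside a character class a dot is a literal dot, so [^.] means
# "any character except a dot", and escaping it changes nothing.
plain_matches_word = re.fullmatch(r"[^.]+", "index") is not None
plain_rejects_dot = re.fullmatch(r"[^.]+", "a.b") is None
escaped_matches_word = re.fullmatch(r"[^\.]+", "index") is not None
escaped_rejects_dot = re.fullmatch(r"[^\.]+", "a.b") is None

# Outside a class, an unescaped dot matches any character except a newline.
dot_skips_newline = re.fullmatch(r".", "\n") is None

print(greedy_capture, plain_matches_word, plain_rejects_dot,
      escaped_matches_word, escaped_rejects_dot, dot_skips_newline)
```

Under that assumption, ^([^.]+)\.html$ and ^(.*)\.html$ both match "index.html" as patterns, which is consistent with the failure coming from the repeated rewrite described in the earlier comments rather than from the regex itself.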