I have a website that will have millions of pages in a directory. I'd like to store those files on-disk in a bunch of subdirectories based on the first characters of the page name.

For example http://mysite.com/hugedir/somefile.html

would be stored in /var/www/html/hugedir/s/o/m/e/f/ile.html

That is fairly trivial to do with a RewriteRule like so:

RewriteRule ^hugedir/(.)(.)(.)(.)(.)(.*)\.html   /hugedir/$1/$2/$3/$4/$5/$6.html
RewriteRule ^hugedir/(.)(.)(.)(.)(.*)\.html      /hugedir/$1/$2/$3/$4/$5.html
RewriteRule ^hugedir/(.)(.)(.)(.*)\.html         /hugedir/$1/$2/$3/$4.html
RewriteRule ^hugedir/(.)(.)(.*)\.html            /hugedir/$1/$2/$3.html
RewriteRule ^hugedir/(.)(.*)\.html               /hugedir/$1/$2.html
RewriteRule ^hugedir/(.*)\.html                  /hugedir/$1.html
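For placing the files on disk in the first place, the same layout can be computed offline. This is only a rough Python sketch of the scheme described above, not something Apache runs: the function name `sharded_path`, the `depth` parameter, and the choice to always leave at least one character for the file's own name are my assumptions.

```python
import posixpath


def sharded_path(name, depth=5, root="/var/www/html/hugedir"):
    """Return the on-disk path for a page name such as 'somefile.html'."""
    stem, ext = posixpath.splitext(name)
    # Assumption: never consume the whole stem, so the file keeps a name.
    levels = min(depth, max(len(stem) - 1, 0))
    # One directory per leading character, the remainder becomes the file.
    parts = list(stem[:levels])
    leaf = stem[levels:] + ext
    return posixpath.join(root, *parts, leaf)


print(sharded_path("somefile.html"))
# /var/www/html/hugedir/s/o/m/e/f/ile.html
```

Short names simply get fewer directory levels, e.g. `abc.html` ends up under `a/b/c.html`.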

However, the file name may contain hyphens or other non-standard characters and I'd really like to avoid having a directory named with a strange character. Ideally, I'd like to have a list of 'approved' characters and either eliminate or transform the unapproved characters to an underscore.

Can anybody think of a way to do that? Or something else equivalent? Part of the requirement is that these be physical files on disk and that they not be parsed by a scripting language.

A: 

Apache mod_rewrite lets you delegate the mapping to an external program through a RewriteMap of type prg: (search the documentation for "External Rewriting Program"). You could write it in Perl, for example.

For example:

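#!/usr/bin/perl
use strict;
use warnings;

$| = 1;    # unbuffered output: mod_rewrite expects one reply line per lookup
while (my $file = <STDIN>) {
    chomp $file;
    my $dir = $file . "____";                  # pad so short names still yield four characters
    $dir =~ tr/a-zA-Z0-9/_/c;                  # turn every non-approved character into an underscore
    $dir =~ s!^(.)(.)(.)(.).*!$1/$2/$3/$4!;    # first four characters become nested directories
    print "$dir/$file\n";                      # the original name is kept as the file name
}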
leonbloy
A: 

By transforming characters into underscores, you'll run into problems with collisions. For example, --a and -=a would both be transformed into _/_/a.
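To make the collision concrete, here is a small Python sketch. It is illustration only: the approved set (ASCII letters and digits) and the helper names `dir_prefix` and `full_path` are my assumptions. It also shows that the collision only affects the directory part if the full original name is kept as the file name, which is what the Perl script in the other answer does.

```python
import re

# Assumption: only ASCII letters and digits count as "approved".
UNAPPROVED = re.compile(r"[^A-Za-z0-9]")


def dir_prefix(name, depth=3):
    """Directory part built from the first `depth` characters, sanitized."""
    cleaned = UNAPPROVED.sub("_", name[:depth])
    return "/".join(cleaned)


def full_path(name, depth=3):
    """Sanitized directories, with the untouched original name as the file."""
    return dir_prefix(name, depth) + "/" + name


print(dir_prefix("--a"), dir_prefix("-=a"))   # _/_/a _/_/a
print(full_path("--a"), full_path("-=a"))     # _/_/a/--a _/_/a/-=a
```

Both names land in the same directory, but they only overwrite each other if the file name itself is also derived from the sanitized characters.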

A better way of dealing with the problem would be to escape the characters using RewriteMap and the built-in escape function:

RewriteMap escape int:escape
RewriteRule hugedir/(.*)\.html /hugedir/${escape:$1}.html
RewriteRule hugedir/(.)(.*)\.html /hugedir/${escape:$1}/${escape:$2}.html
David Wolever
Collisions would not be a problem if the first characters (escaped) are used to build the directories but are also kept in the filename. For example, RewriteRule ^hugedir/(.)(.)(.*)\.html /hugedir/$1/$2/$1$2$3.html
leonbloy
Ah, very smart. I hadn't considered that.
David Wolever
The RewriteMap seems promising, but doesn't "escape" produce the hex encoding? (e.g. $ => %24)
leonbloy
I realized that collisions might be a problem, but I'm willing to accept that as an expense of having a more manageable directory.
Brandon
Yes, it would convert them into their hex encodings. But, AFAIK, that should be portable across everything. If you don't like having a '%' in the dirname, though, then you could go with leonbloy's collision avoidance suggestion, or just a script (also as leonbloy suggested).
David Wolever
However, last time I wrote a system like this, I hashed the file names and then stored them in `hash[0:4]/hash[4:8]/filename` to avoid the problem of having one directory get overloaded (for example, if you're storing pictures, the `DCIM_` path might quickly fill up).
David Wolever
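A minimal Python sketch of that `hash[0:4]/hash[4:8]/filename` layout, for reference. The comment above does not say which hash was used, so SHA-1 here is purely an assumption, as is the helper name `hashed_path`.

```python
import hashlib


def hashed_path(filename):
    """Spread files over two directory levels taken from a hash of the name."""
    # Assumption: SHA-1 hex digest; any stable hash would work the same way.
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    # First four hex digits, next four hex digits, then the untouched name.
    return "/".join([digest[0:4], digest[4:8], filename])


print(hashed_path("DCIM_0001.jpg"))
print(hashed_path("DCIM_0002.jpg"))
```

Similar names such as the `DCIM_` files end up in unrelated directories, and the directory names only ever contain hex digits. As far as I know mod_rewrite has no built-in hash function, so serving these straight from Apache would probably need an external map program to compute the same digest at request time.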