In a simple webapp I need to map URLs to filenames or filepaths.

This app has a requirement that it can depend only on modules in the core Perl distribution (5.6.0 and later). The problem is that filename length on most filesystems is limited to 255 characters. Another limit is about 32k subdirectories in a single folder.

My solution:

my $MAXPATHLEN = 255;   # filesystem filename length limit

my $filename = $url;

if (length($filename) > $MAXPATHLEN) {                          # filename longer than 255 chars
    my $part1 = substr($filename, 0, $MAXPATHLEN - 13);         # first 242 chars
    my $part2 = crypt(0, substr($filename, $MAXPATHLEN - 13));  # 13-char hash of the rest
    $filename = $part1 . $part2;
}
$filename =~ s!/!_!g; # escape directory separator

Is it reliable? How can it be improved?

+4  A: 

crypt on most platforms will ignore anything after the first 8 characters of input. Given your requirements, I would suggest Digest::MD5.
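A rough sketch of that, keeping the shape of the original snippet (Digest::MD5 isn't in the 5.6 core, which is what the update below addresses):

use Digest::MD5 qw(md5_base64);

my $MAXPATHLEN = 255;
my $filename   = $url;
if (length($filename) > $MAXPATHLEN) {
    my $part1 = substr($filename, 0, $MAXPATHLEN - 22);           # first 233 chars
    my $part2 = md5_base64(substr($filename, $MAXPATHLEN - 22));  # 22-char digest of the rest
    $filename = $part1 . $part2;
}
$filename =~ s!/!_!g;   # '/' can appear in base64 output, so keep this substitution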

Update: Given the new 5.6.0 requirement, look up a hashing algorithm and implement it to get a number, then base64 encode it (manually, since MIME::Base64 also isn't core until 5.7.3). A quick way to do so would be to just copy the md5_base64 subroutine from Digest::Perl::MD5 on CPAN (and the other subroutines and constants there that it calls/uses).
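As a rough sketch of what that could look like (md5_perl() here stands in for the digest routine copied over from Digest::Perl::MD5; it is assumed, not shown):

sub digest_base64 {
    my ($text) = @_;
    my $digest = md5_perl($text);      # 16 raw bytes from the copied MD5 code (assumed helper)
    my @alpha  = ('A' .. 'Z', 'a' .. 'z', '0' .. '9', '+', '/');
    my $bits   = unpack 'B*', $digest;                 # 128 bits as a string of 0s and 1s
    $bits .= '0' x ((6 - length($bits) % 6) % 6);      # pad to a multiple of 6 bits
    my $out = '';
    $out .= $alpha[ oct "0b$1" ] while $bits =~ /(.{6})/g;
    return $out;                                       # 22 chars, like md5_base64()
}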

ysth
UPD: this code must work in Perl 5.6 as well. It seems that Digest::MD5 is not shipped with 5.6.
eugene y
sigh. readers note the original post said 5.8.8 and above :)
ysth
Could you just save Digest::Perl::MD5 to your include path? It's not ideal but you have some fairly strict rules anyway. http://cpansearch.perl.org/src/DELTA/Digest-Perl-MD5-1.8/lib/Digest/Perl/MD5.pm
C4H5As
@i.: I've given up suggesting the obvious workarounds to those who say "core modules only". There are tons of helpful posts about adjusting the include path or using PAR, etc., to get around all the possible problems, and the repetition gets a little old.
ysth
@ysth: Well, I was not asking how to include some module from CPAN in my project.
eugene y
@eugene y: I know; including a module from cpan would be the easy solution; that's why I told you instead to either a) look up an algorithm and implement it or b) copy about 20 lines from a cpan module into your code
ysth
@ysth: Why do you suggest base64 encoding instead of hex, which should be quicker judging by the source of _encode_hex() and _encode_base64()?
eugene y
hex would be fine too, but base64 uses just 22 characters instead of 32, allowing more of the original url to be preserved.
ysth
@ysth: thanks for the useful advice.
eugene y
you are welcome
ysth
A: 

For simplicity I'd try breaking the URL into its (logical) constituent parts so you end up with a nice neat directory structure that maps to the URL:

/
/http
/https
/http/com
/http/com/google
/http/com/stackoverflow
/http/com/stackoverflow/questions
/http/com/stackoverflow/questions/2173839
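A rough sketch of that mapping, using only core Perl (the URI module isn't core, so a simple regex stands in for proper URL parsing; query strings and fragments are ignored):

sub url_to_path {
    my ($url) = @_;
    my ($scheme, $host, $path) = $url =~ m{^(\w+)://([^/?#]+)([^?#]*)}
        or return;
    my @host_parts = reverse split /\./, $host;   # stackoverflow.com -> com, stackoverflow
    my @path_parts = grep { length } split m{/}, $path;
    return join '/', '', $scheme, @host_parts, @path_parts;
}

print url_to_path('http://stackoverflow.com/questions/2173839'), "\n";
# prints: /http/com/stackoverflow/questions/2173839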

This would probably make good sense if you're processing a large variety of different domains & websites but I haven't seen your sample data so I can't tell.

If you're likely to run into collisions with this (or any) style of URL mapping then try treating the file system as a hash structure. You could consider the root directory as a hash (with anywhere from 32k to 255^255 buckets, depending on the system) and place files directly in there. How you deal with collisions will depend on the volume of data & likelihood of occurrence.
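If you go that route, one hypothetical layout is two levels of two-character bucket directories derived from a digest of the URL (digest_hex() and $root_dir below are assumptions, e.g. a digest routine pieced together as in the other answer):

my $digest = digest_hex($url);                                        # assumed helper, e.g. 32 hex chars
my $bucket = substr($digest, 0, 2) . '/' . substr($digest, 2, 2);     # 256 * 256 = 65536 buckets
my $path   = "$root_dir/$bucket/$digest";                             # $root_dir = storage root (assumed)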

C4H5As
This directory structure won't solve the 255-char filename limit or the 32k subdirectory limit. Hashing seems to be the simplest way.
eugene y