In a simple webapp I need to map URLs to filenames or filepaths.

This app has a requirement that it can depend only on modules in the core Perl distribution (5.6.0 and later). The problem is that filename length on most filesystems is limited to 255 characters. Another limit is about 32k subdirectories in a single folder.

My solution:

my $MAXPATHLEN = 255;   # filesystem filename length limit

my $filename = $url;

if (length($filename) > $MAXPATHLEN) {                          # filename longer than 255 chars
    my $part1 = substr($filename, 0, $MAXPATHLEN - 13);         # first 242 chars
    my $part2 = crypt(0, substr($filename, $MAXPATHLEN - 13));  # 13-char hash of the rest
    $filename = $part1 . $part2;
}
$filename =~ s!/!_!g; # escape directory separator

Is it reliable? How can it be improved?

+4  A: 

crypt on most platforms will ignore anything after the first 8 characters of input. Given your requirements, I would suggest Digest::MD5.
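A rough sketch of that, keeping the shape of the original snippet (Digest::MD5 isn't in the 5.6 core, which is what the update below addresses):

use Digest::MD5 qw(md5_base64);

my $MAXPATHLEN = 255;
my $filename   = $url;
if (length($filename) > $MAXPATHLEN) {
    my $part1 = substr($filename, 0, $MAXPATHLEN - 22);           # first 233 chars
    my $part2 = md5_base64(substr($filename, $MAXPATHLEN - 22));  # 22-char digest of the rest
    $filename = $part1 . $part2;
}
$filename =~ s!/!_!g;   # '/' can appear in base64 output, so keep this substitution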

Update: Given the new 5.6.0 requirement, look up a hashing algorithm and implement it to get a number, then base64 encode it (manually, since MIME::Base64 also isn't core until 5.7.3). A quick way to do so would be to just copy the md5_base64 subroutine from Digest::Perl::MD5 on CPAN (and the other subroutines and constants there that it calls/uses).
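As a rough sketch of what that could look like (md5_perl() here stands in for the digest routine copied over from Digest::Perl::MD5; it is assumed, not shown):

sub digest_base64 {
    my ($text) = @_;
    my $digest = md5_perl($text);      # 16 raw bytes from the copied MD5 code (assumed helper)
    my @alpha  = ('A' .. 'Z', 'a' .. 'z', '0' .. '9', '+', '/');
    my $bits   = unpack 'B*', $digest;                 # 128 bits as a string of 0s and 1s
    $bits .= '0' x ((6 - length($bits) % 6) % 6);      # pad to a multiple of 6 bits
    my $out = '';
    $out .= $alpha[ oct "0b$1" ] while $bits =~ /(.{6})/g;
    return $out;                                       # 22 chars, like md5_base64()
}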

ysth
UPD: this code must work in Perl 5.6 as well. It seems that Digest::MD5 is not shipped with 5.6.
eugene y
sigh. readers note the original post said 5.8.8 and above :)
ysth
Could you just save Digest::Perl::MD5 to your include path? It's not ideal but you have some fairly strict rules anyway. http://cpansearch.perl.org/src/DELTA/Digest-Perl-MD5-1.8/lib/Digest/Perl/MD5.pm
C4H5As
@i.: I've given up suggesting the obvious workarounds to those who say "core modules only". There are tons of helpful posts about adjusting the include path or using PAR, etc., to get around all the possible problems, and the repetition gets a little old.
ysth
@ysth: Well, I was not asking how to include some module from CPAN in my project.
eugene y
@eugene y: I know; including a module from cpan would be the easy solution; that's why I told you instead to either a) look up an algorithm and implement it or b) copy about 20 lines from a cpan module into your code
ysth
@ysth: Why do you suggest base64 encoding instead of hex, which should be quicker judging by the source of _encode_hex() and _encode_base64()?
eugene y
hex would be fine too, but base64 uses just 22 characters instead of 32, allowing more of the original url to be preserved.
ysth
@ysth: thanks for the useful advice.
eugene y
you are welcome
ysth
A: 

For simplicity I'd try breaking the URL into its (logical) constituent parts so you end up with a nice neat directory structure that maps to the URL:

/
/http
/https
/http/com
/http/com/google
/http/com/stackoverflow
/http/com/stackoverflow/questions
/http/com/stackoverflow/questions/2173839
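A rough sketch of that mapping, using only core Perl (the URI module isn't core, so a simple regex stands in for proper URL parsing; query strings and fragments are ignored):

sub url_to_path {
    my ($url) = @_;
    my ($scheme, $host, $path) = $url =~ m{^(\w+)://([^/?#]+)([^?#]*)}
        or return;
    my @host_parts = reverse split /\./, $host;   # stackoverflow.com -> com, stackoverflow
    my @path_parts = grep { length } split m{/}, $path;
    return join '/', '', $scheme, @host_parts, @path_parts;
}

print url_to_path('http://stackoverflow.com/questions/2173839'), "\n";
# prints: /http/com/stackoverflow/questions/2173839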

This would probably make good sense if you're processing a large variety of different domains & websites but I haven't seen your sample data so I can't tell.

If you're likely to run into collisions with this (or any) style of URL mapping then try treating the file system as a hash structure. You could consider the root directory as a hash (with anywhere from 32k to 255^255 buckets, depending on the system) and place files directly in there. How you deal with collisions will depend on the volume of data & likelihood of occurrence.
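If you go that route, one hypothetical layout is two levels of two-character bucket directories derived from a digest of the URL (digest_hex() and $root_dir below are assumptions, e.g. a digest routine pieced together as in the other answer):

my $digest = digest_hex($url);                                        # assumed helper, e.g. 32 hex chars
my $bucket = substr($digest, 0, 2) . '/' . substr($digest, 2, 2);     # 256 * 256 = 65536 buckets
my $path   = "$root_dir/$bucket/$digest";                             # $root_dir = storage root (assumed)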

C4H5As
This directory structure won't solve the 255-char filename limit or the 32k subdirectory limit. Hashing seems to be the simplest way.
eugene y