views:

299

answers:

4

I have a Perl DBM hash containing a list of URLs that I want to pick randomly from to load balance spidering sites. As a result I want to pick a key at random, or select the nth element (so I can randomise n).

I'm aware this goes against the concept of a hash, but is this possible?

NOTE: missed a valuable point that hash size will be too large to load all the keys to randomly select.

+1  A: 

Of course, it is possible. First, get a list of the keys. Then, randomize the list, using shuffle from List::Util.

Then, loop over the keys.

If there are too many keys (so keeping them all in a list and shuffling is not possible), just remember that you are using tied hashes. Just use each to iterate over key value pairs.

The order will be deterministic but AFAIK, it will not be alphabetical or order of insertion. That, by itself, might be able to get you what you want.

Sinan Ünür
Thanks for the advice. Unfortunately my hash is too big to load all the keys.
PaulMdx
PaulMdx - `each` returns a key and a value, iteratively over the hash. So, you can deal with them one at a time.
daotoad
+2  A: 

Picking a random element from an array is simpler so you can use keys(%foo) to get the array of keys and pick randomly from that.

I believe this will return a random element $x from an array:

$x = $array[rand @array];

If you want to shuffle an array, consider List::Util::shuffle. See http://search.cpan.org/perldoc/List::Util#shuffle_LIST

dreeves
There is already `List::Util::shuffle`. See http://search.cpan.org/perldoc/List::Util#shuffle_LIST
Sinan Ünür
There's no need to write your own `shuffle` when the one from List::Util is more robust (and faster).
friedo
Thanks! I'm editing my answer now.
dreeves
Thanks for the advice it's a spot on solution where hash size is manageable. Is there an alternative for hashes too large to load all the keys?
PaulMdx
+3  A: 

I don't think any of the DBM packages have an API for retrieving a random key, or for retrieving keys by index number. You can look up a particular key, or you can read through all the keys in whatever order the database chooses to return them in (which may change if the database is modified, and may or may not be "random" enough for whatever you want to do).

You could read through all the keys and pick one, but that would require reading the entire database each time (or at least a sizable chunk of it), and that's probably too slow.

I think you'll need to rearrange your data structure.

  1. You could use a real SQL database (like SQLite), so you could look up rows both by a sequential row number and by URL. This would be the most flexible.

  2. You could use a sequential integer as the key for your DBM file. That would make picking a random one easy, but you could no longer look up entries by URL.

  3. You could use two DBM files: the one you have now and a second keyed by sequential integer with the URL as value. (Actually, since URLs don't look like integers, you could store both sets of records in the same DBM file, but that would complicate any code that uses each.) This would use twice the disk space, and would make inserting/removing entries a bit more complicated. You'd probably be better off with approach #1, unless you can't install SQLite for some reason.

cjm
related question for 'real SQL databases': http://stackoverflow.com/questions/19412/how-to-request-a-random-row-in-sql
plusplus
A: 

You could use DBM::Deep instead of a traditional DB file to keep your data.

tie %hash, "DBM::Deep", {
    file => "foo.db",
    locking => 1,
    autoflush => 1
};

# $hash{keys} = [ ... ]
# $hash{urls} = { ... } <- same as your current DB file.

my $like_old = $hash{urls}; # a ref to a hash you can use like your old hashref.
my $count = @{$hash{keys}};

With that you can pull out random values as needed.

daotoad