tags:

views:

426

answers:

6

I had an earlier question that received the following response from the noted Perl expert, Perl author and Perl trainer brian d foy:

[If] you're looking for a fixed sequence of characters at the end of each filename. You want to know if that fixed sequence is in a list of sequences that interest you. Store all the extensions in a hash and look in that hash:
    my( $extension ) = $filename =~ m/\.([^.]+)$/;
    if( exists $hash{$extension} ) { ... }
You don't need to build up a regular expression, and you don't need to go through several possible regex alternations to check every extension you have to examine.

Thanks for the advice brian.

What I now want to know is what is the best practice in a case like the above. Should one only define the keys, which is all I need to achieve what's described above, or should one always define a value as well?

+4  A: 

You can't create a hash key without a value. The value can be undef but it will be there. How else would you construct a hash. Or was your question regarding whether the value can be undef? In which case I would say that the value you store there (undef, 1, 0...) is entirely up to you. If a lot of folks are using it then you probably want to store some true value though incase some one else uses if ($hash{$extension}) {...} instead of exists because they weren't paying attention.

Jeremy Wall
You can certainly create a hash key without a value. That's why that nifty defined() operator exists. :)
brian d foy
perl -le '@hash{ qw(a b c ) } = (); print keys %hash'
brian d foy
+1  A: 

undef is a value.

Of course, stuff like that is always depndent on what you are currently doing. But $foo{bar} is just a variable like $bar and I don't see any reason why either one should not be undef every now and then.

PS: That's why exists exists.

innaM
Actually, undef is the absence of value. It's what you get before you assign anything.
brian d foy
@brian d foy: I can understand your position but I tend to agree with Manni's perspective. An undefined value is still a value and Perl's undefined is unusually well-defined. :P
Michael Carman
+4  A: 

It's usually preferable to set a defined value for every key. The idiomatic value (when you don't care about the value) is 1.

my %hash = map { $_ => 1 } @array;

Doing it this way makes the code the uses the hash slightly simpler because you can use $hash{key} as a Boolean value. If the value can be undefined you need to use the more verbose exists $hash{key} instead.

That said, there are situations where a value of undef is desirable. For example: imagine that you're parsing C header files to extract preprocessor symbols. It would be logical to store these in a hash of name => value pairs.

#define FOO 1
#define BAR

In Perl, this would map to:

my %symbols = ( FOO => 1, BAR => undef);

In C a #define defines a symbol, not a value -- "defined" in C is mapped to "exists" in Perl.

Michael Carman
The problem with just testing `$hash{$key}` is that if `$key` does not exist in the hash, you just modified the hash. The bugs that can result from this are epic.
daotoad
@daotoad: Modified how? Autovivification doesn't occur for simple retrieval (but would if you did something like $hash{$key}{subkey}).
Michael Carman
+3  A: 

As others have said, the idiomatic solution for a hashset (a hash that only contains keys, not values) is to use 1 as the value because this makes the testing for existence easy. However, there is something to be said for using undef as the value. It will force the users to test for existence with exists which is marginally faster. Of course, you could test for existence with exists even when the value is 1 and avoid the inevitable bugs from users who forget to use exists.

Chas. Owens
I use exists() because I think of this as a set operation rather than a dictionary lookup.
brian d foy
Exactly, it is just a better way of doing it (faster and clearer), but the problem comes from people who are used to the idiom where they can check the value. That is why I say the safest thing is to use 1 as the value, but use exists.
Chas. Owens
Using exists() because it's faster is almost certainly a case of improper optimization. Whether or not it's clearer is debatable; not everyone has brian's clarity of thought and attention to nuance. That said, +1 for the combination of precise expression (using exists for testing) but tolerant coding (using 1 for values).
Michael Carman
@Michael Carman When choosing between two options that seem roughly equal in value determining which one is faster as a tie breaker seems reasonable. I wouldn't suggest benchmarking every idiom and using the faster one. For instance, I often prefer to use a map over a for loop for initializing arrays. I have benchmarked it before and the for loop with a push is slightly faster, but the clarity and conciseness of the map outweighs the minor (and possibly version dependent) performance gain.
Chas. Owens
Given all that has been said, is a value of 1 better than some other value? For instance if the purpose of the hash is to determine whether something should be included, does it make sense to use a value of "include" or does that not accomplish anything that a value of 1 can't except waste memory?
Dr. Faust
Dr. Faust You could easily use "include", but since the value is never used (except in the case where someone forgets to use exists) it would mean more typing for little reason.
Chas. Owens
@Dr. Faustus: Any true value would work but by using 1 it's clearer that all you really care about is Boolean truth since that's the normal value for "true." If you use anything other than 0 or 1 it implies that the value is something other than a Boolean.
Michael Carman
That's a very point Michael - Thanks.
Dr. Faust
Don't set your values to 1. The problem with using 1 is that you silently add keys to the hash when you test for a value. If this is used in a code base that also tests existence you have just created one hell of an intermittent run-time bug. Use undef and test for existence. Consider locking your hashes.
daotoad
Yes. My application is just testing the existence of keys. So your are saying I should have their values set to undef.
Dr. Faust
@daotoad: Hash retrieval does not trigger autovivification. It's using the value (e.g. by dereferencing or incrementing it) that causes it to autovivify. Running perl -e "%h = qw(a 1 b 2); print $h{x}; print keys %h" prints "ab"
Michael Carman
@daotoad Autovivification is a separate issue, if (exists $foo{bar}{baz}) {} autovivifies a bar key in %foo even though you are using exists.
Chas. Owens
@Fred Zeppelin No, you should use 1 as the values, but you should also use the exists function rather than trusting the value to be true.
Chas. Owens
+1  A: 

Storing '1' in a Set-hash Considered Harmful

I know using Considered Harmful is considered harmful, but this is bad; almost as bad as unrestrained goto usage.

Ok, I've harped on this in a few comments, but I think I need a full response to demonstrate the issue.

Lets say we have a daemon process that provides back-end inventory control for a shop that sells widgets.

my @items = qw(
    widget
    thingy
    whozit
    whatsit
);

my @items_in_stock = qw(
    widget
    thingy
);

my %in_stock;
my @in_stock(@items_in_stock) = (1) x @items_in_stock;  #initialize all keys to 1

sub Process_Request {
    my $request = shift;

    if( $request eq REORDER ) {
        Reorder_Items(\@items, \%in_stock);
    }
    else { 
        Error_Response( ILLEGAL_REQUEST );
    }
}

sub Reorder_Items{
   my $items = shift;
   my $in_stock =  shift;

   # Order items we do not have in-stock.
   for my $item ( @$items ) {

       Reorder_Item( $item ) 
           if not exists $in_stock->{$item};
   }

}

The tool is great, it automatically keeps items in stock. Very nice. Now, the boss asks for automatically generated catalogs of in-stock items. So we modify Process_Request() and add catalog generation.

sub Process_Request {
    my $request = shift;

    if( $request eq REORDER ) {
        Reorder_Items(\@items, \%in_stock);
    }
    if( $request eq CATALOG ) {
        Build_Catalog(\@items, \%in_stock);
    }
    else { 
        Error_Response( ILLEGAL_REQUEST );
    }
}

sub Build_Catalog {
    my $items = shift;
    my $in_stock = shift;

    my $catalog_response = '';
    foreach my $item ( @$items ) {
        $catalog_response .= Catalog_Item($item)
            if $in_stock->{$item};
    }

    return $catalog_response;
}

In testing Build_Catalog() works fine. Hooray, we go live with the app.

Oops. For some reason nothing is being ordered, the company runs out of stock of everything.

The Build_Catalog() routine adds keys to the %in_stock, so Reorder_Items() now sees everything as in stock and never makes an order.

Using Hash::Util's lock_hash can help prevent accidental hash modification. If we locked the %in_stock before calling 'Build_Catalog()` we would have gotten a fatal error and would never have gone live with the bug.

In summary, it is best to test existence of keys rather than truth of your set-hash values. If you are using existence as a signifier, don't set your values to '1' because that will mask bugs and make them harder to track. Using lock_hash can help catch these problems.

If you must check for the truth of the values, do so in every case.

daotoad
+1  A: 

Using undef as a value in hash is more memory efficient than storing 1.