Is it preferable, when assigning to a hash of just keys (where the values aren't really needed), to say:

$hash{$new_key} = "";

Or to say:

$hash{$new_key} = 1;

One necessitates that you check for a key with exists, the other allows you to say either:

if (exists $hash{$some_key})

or

if ($hash{$some_key})

I would think that assigning a 1 would be better, but are there any problems with this? Does it even matter?

+13  A: 

It depends on whether you need the key to exist or to have a true value. Test for the thing you need. If you are using a hash merely to see if something is in a list, exists() is the way to go. If you are doing something else, checking the value might be the way to go.
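A minimal sketch of the difference between the two tests, assuming a hypothetical hash `%set` that holds one key stored with `""` and one stored with `1`:

```perl
use strict;
use warnings;

# Hypothetical set: one key stored with "", one with 1.
my %set = ( empty => "", one => 1 );

for my $key (qw(empty one missing)) {
    my $exists = exists $set{$key} ? "yes" : "no";
    my $true   = $set{$key}        ? "yes" : "no";
    print "$key: exists=$exists true=$true\n";
}
```

The key stored with `""` passes the exists test but fails the truth test, so the truth test only works as a membership check when every stored value is true.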

brian d foy
Alright, so one's not really preferred over another? Just curious.
Joe
Preferred for what? It depends on what you want to do.
brian d foy
Well, what I was doing when I thought of the question was checking to see if I'd seen something before. So if I hadn't, then add it to the hash. I'd always seen this done as $hash{$key} = "", but earlier today someone else did the same thing as $hash{$key} = 1. I was just looking for a good, idiomatic answer. Perhaps my question should have been more specific.
Joe
I do like assigning "1" and then just checking for truthfulness because it's less typing :)
mpeters
I like assigning 1 and then checking for existence because it's slightly more forgiving. If I assign empty string by mistake, it still works. If I check for truth instead of existence by mistake, it still works. Only if I make both mistakes will it break. This is a case of preferring robustness over correctness, which I guess comes down to your needs and personal style.
Adam Bellaire
+2  A: 

As the prior answer says, it depends on what you are trying to achieve; if you are just trying to get (for instance) unique values from some set (whose elements then form the keys), you can just use exists (could also help to catch duplicates if you check for exists first before assigning a value).

Without knowing the application, it's difficult to be more specific.

+3  A: 

Assume you actually needed to check existence of keys, but you wrote code that checks for truth. It checks for truth throughout your program in various places. Then it suddenly appears that you misunderstood something and you should actually store a mapping from your keys to string values; the strings should be used in the same dataflow as you've already implemented.

And the strings can be empty!

Hence you should either refactor your program or create another hash, because truth checks no longer check existence. That wouldn't happen if you checked for existence from the very beginning.

(edited coz dunno why got voted down.)

Pavel Shved
Good point. So it just depends on what you're trying to do...
Joe
+10  A: 

When the values aren't needed, you'll often see this idiom:

my %exists;
$exists{$_}++ for @list;

Which has the effect of setting each value to a true count (1 when the key appears only once).
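A small sketch of what that loop leaves in the hash, using made-up sample data in `@list`. A key that occurs more than once ends up with its count rather than 1:

```perl
use strict;
use warnings;

my @list = qw(apple pear apple fig apple);   # sample data, assumed

my %exists;
$exists{$_}++ for @list;

# Each value is the number of times the key occurred in @list.
print "$_ => $exists{$_}\n" for sort keys %exists;
```

Every stored value is at least 1, so both an exists test and a truth test work on the result.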

Chris Simmons
Well, it sets it to the count of the times the key appears in @list.
brian d foy
Is this the actual "perl idiom"? I'd seen a few before in the Camel, but didn't remember if this was one of them.
Joe
Yeah, brian d foy (almost said Brian - THAT would have been a Perl faux pas!) is right, you get the count of how many times the key appears in @list. I consider this an idiom - I believe this is mentioned in the Perl Cookbook on the section on finding the intersection of lists (but I could be wrong - I don't have it in front of me at the moment).
Chris Simmons
This is a nice idiom IMHO, though if memory serves correctly an earlier version of Perl (maybe 5.6?) would produce warnings under `use strict;`, complaining that you were manipulating an undefined variable on the 1st increment of any key.
j_random_hacker
I don't think 5.6 did that - perhaps earlier (my company was using 5.6 extensively up until last year).
Chris Simmons
In recent perlsyn docs it says: "When used as a number, 'undef' is treated as 0; [...] If you enable warnings, you'll be notified of an uninitialized value whenever you treat 'undef' as a string or a number. Well, usually. [...] Operators such as '++', '--', '+=', '-=', and '.=', that operate on undefined left values such as: my $a; $a++; are also always exempt from such warnings." Don't know how far back this behavior goes, but maybe someone can look through older versions of the perlsyn docs to find out...
Michael Krebs
Or this idiom: `my %exists; @exists{@list} = ();` or this idiom: `my %seen; ... unless ( $seen{$item}++ ) { print "first time for $item" }`
ysth
+5  A: 

If you're trying to save memory (which generally only matters if you have a very large hash), you can use undef as the value and just test for its existence. Undef is implemented as a singleton, so thousands of undefs are all just pointers to the same value. Setting each value to the empty string or 1 would allocate a different scalar value for each element.

my %exists;
@exists{@list} = ();
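A short sketch of what the slice assignment produces, with assumed sample data. The keys are created, but their values are undef, so only an exists test reports membership:

```perl
use strict;
use warnings;

my @list = qw(red green blue);   # sample data, assumed

my %exists;
@exists{@list} = ();             # every key is created with an undef value

my $is_member  = exists  $exists{green} ? 1 : 0;   # key is present
my $is_defined = defined $exists{green} ? 1 : 0;   # but its value is undef
my $is_true    = $exists{green}         ? 1 : 0;   # so a truth test fails

print "exists=$is_member defined=$is_defined true=$is_true\n";
```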

In light of your later comment about your intended use, this is the idiom I've seen and used many times:

my %seen;
while (<>) {
    next if $seen{$_}++; # false the first time, true every successive time
    ...process line...
}
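The same `%seen` test is often written with grep to remove duplicates from a list. A sketch, where `@lines` is a stand-in for the input that `<>` would supply:

```perl
use strict;
use warnings;

my @lines = ( "a\n", "b\n", "a\n", "c\n", "b\n" );   # stand-in for <> input

my %seen;
# $seen{$_}++ returns the old value, which is false the first time a
# line is met, so the negation keeps only first occurrences.
my @unique = grep { !$seen{$_}++ } @lines;

print @unique;
```

The order of first occurrences is preserved, and `%seen` still holds the per-line counts afterwards.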
KingPong
I like this. But is `()` the same as `undef`?
Joe
It accomplishes the same thing in this list context because there are more elements on the left side than on the right. What's happening here is the same thing that would happen if you wrote "($foo,$bar) = (1);". $foo gets 1 and $bar gets undef. Conceptually, Perl extends the right side of the assignment with enough "ghost" undefs to fill the left side.That being said, I would almost always use the $exists{$_}++ idiom -- just to save myself some debugging time. It's easier to not have to remember whether $exists{$foo} might be extant but undefined.
KingPong
+2  A: 

* Update: * Sinan points out that my cautious approach to hash element creation is dated and not an issue on newer Perls. I've edited my post below, and added some new thoughts on the matter.

The problem with just testing for truth is that you can modify the hash with the crufty old version of Perl that I learnt on. This code is safe with Perl 5.8:

my %foo = ();

if( $foo{bar} ) {
   print "never happens";
}

print keys %foo;

This is the bad part of the mixed blessing of auto-vivification (over all I like auto-viv, but this is where it hurts).

In many situations, this is no big deal. But it is a potential issue to be aware of. I address this in my code by locking any hash that must remain unmodified.
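The post does not say which locking mechanism is used. A sketch assuming the core Hash::Util module (available since Perl 5.8), with a hypothetical `%config` hash:

```perl
use strict;
use warnings;
use Hash::Util qw(lock_hash unlock_hash);

my %config = ( host => 'localhost', port => 8080 );   # hypothetical data
lock_hash(%config);            # keys and values are now read-only

# Storing under a key that is not already in the hash dies ...
my $stored = eval { $config{prot} = 9090; 1 };
print "new key rejected\n" unless $stored;

# ... and so does changing the value of an existing key.
my $changed = eval { $config{port} = 9090; 1 };
print "value change rejected\n" unless $changed;

unlock_hash(%config);          # lift the restriction again
```

If only the key set needs protecting, `lock_keys` from the same module leaves the values writable.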

In practice I wind up always doing an exists test before a boolean test as well.

if( exists $foo{bar} and $foo{bar} ) {
    # hash is not modified due to short circuit
}

The same kind of alteration of data structures can occur with arrays. If you access $foo[2000], then the array will be extended. So it can be a good idea to test for existence before you accidentally extend an array. In practice this has been much less of an issue than the corresponding hash behavior. <-- The irony here is that you can only use exists on an array on perls 5.6 and newer, where presumably this problem has been fixed.

If I need to go digging into data structures, I use Data::Diver. It automatically checks existence at each level in the structure to prevent accidental alteration of your data structure.

The most important thing is to be consistent within each script/program. The easiest way to run into problems is to test for existence here, but truth there. Especially if you are accessing the same hash for both sets of tests.

Final thoughts on my update regarding autovivification: A flurry of research showed several things. I should have tested my code before posting--by failing to do so, I spread misinformation, which I apologize for. I also discovered that there are still some sneaky issues with autovivification lingering--enough that there is an open todo item to make things right. So, while it may be wrong-headed, old-fashioned and dumb, I will continue to explicitly take steps to control autovivification and restrict it to occurring only when I want it to occur. FWIW, autovivification is a great thing when it works. I think special casing if to prevent autoviv is the right thing to do--it gets rid of the need for a lot of extra code, but I wish I could find some docs that detailed that behavior.

daotoad
+1, good point, sometimes autovivification doesn't DWIM. Hash locking is interesting but I notice it is for Perl >= 5.8.
j_random_hacker
`if ( $foo{bar} )` will not auto-vivify `$foo{bar}` -- at least with recent perls. I do not know if the behavior was different at some point in time.
Sinan Ünür
Actually Sinan is right. The bottom level *does not* autovivify (and I think that has always been the case); however multilevel accesses (e.g. `if ($foo{bar}{baz}{qux})`) will autovivify all but the bottom level (i.e. `$foo{bar}{baz}` would be created in this case).
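A quick sketch that shows both behaviours on a fresh hash:

```perl
use strict;
use warnings;

my %foo;

# A plain single-level read does not create the key.
if ( $foo{bar} ) { }
my $after_flat = scalar keys %foo;

# A multilevel read creates every level except the last.
if ( $foo{bar}{baz}{qux} ) { }
my $has_bar = exists $foo{bar}           ? 1 : 0;
my $has_baz = exists $foo{bar}{baz}      ? 1 : 0;
my $has_qux = exists $foo{bar}{baz}{qux} ? 1 : 0;

print "flat=$after_flat bar=$has_bar baz=$has_baz qux=$has_qux\n";
```

Guarding each level with exists before descending avoids creating the intermediate hashes.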
j_random_hacker
Excellent and thorough update! And if being wrong-headed, old-fashioned and dumb is wrong then I don't want to be right. :)
j_random_hacker
+2  A: 

See also Autovivification : What is it and why do I care?.

Sinan Ünür
A: 

I usually check for defined values. That's the middle case that you're leaving out. Not quite "truth" not quite "exists" either. (Mostly, but not quite.)

Now in theory, the more general way is exists, as in

if ( exists $hash{$key} ) { return 'strawberry'; }

This covers the case where the key exists and the value is 0, or when the key has been assigned undef. The key just needs to exist to pass this test.
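To make the three tests concrete, a sketch with hypothetical keys covering undef, 0, the empty string, 1, and a key that was never stored:

```perl
use strict;
use warnings;

# Hypothetical values covering the interesting cases.
my %hash = ( undef_val => undef, zero => 0, empty => '', one => 1 );

my %result;
for my $key (qw(undef_val zero empty one missing)) {
    $result{$key} = join ' ',
        ( exists  $hash{$key} ? 'exists'  : '-' ),
        ( defined $hash{$key} ? 'defined' : '-' ),
        (         $hash{$key} ? 'true'    : '-' );
    printf "%-10s %s\n", $key, $result{$key};
}
```

Only exists distinguishes a key stored with undef from a key that was never stored; defined treats the two alike, and truth additionally rejects 0 and the empty string.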

However, I have rarely found the need to test the existence of a key.

  1. Hashes are often part of a defined API, and if you're processing them, you have some idea of the range of values that can be stored. The configuration item will be looking for specific things; and as unordered parameter keys, subroutines will be looking for specific things.

  2. I find the idea of an "infinite table" a very flexible concept. And exists x <=> defined x works for that. Every conceivable value is "set" in the table, but only a finite number of keys are defined, the rest are considered to be undefined.

    As a result, usually though, unless a value is defined in a hash, I don't care what it is. I consider it a false value. Storing undef and not storing anything at all are equivalent in most things that I write. This is further motivated by the item below.

  3. Most of the time that I might need to know if a key is in the table, I need to use it for something else. First I store the value locally, and then test it for a defined value.

     my $value = $hash{$key};
     if ( defined $value ) { 
         push @valid_values, $value;
     }
    

    If I could be sure that there was some local common-subexpression optimization between the lookup for exists and the lookup to use the value, then I wouldn't be so picky about this. But I don't like to retrieve from a hash more than once. So I 1) cache the value and 2) check it--every time.

That said, I can tighten the criteria if I know that the value should not be 0, such as in a lookup or a parameter table. So I sometimes test for truth. But I also can tighten up the test for anything, anyway.

     if ( ( $hash{$key} || '' ) =~ m/^(?:Bears|Lions|Packers|Vikings)$/ ) { 
         $nfc_north++;
     }
  1. Of course an operating principle here is that defined works for "unlimited" tables. Where every conceivable value is "set" in the table, but only a finite number of keys are defined.

There is a case where you might be working on a totally anonymous hash. But then, what's your interest in the keys that can't be satisfied with keys or values? Even if you're making an all-purpose hash "convenience function", it's better not to concern yourself with the existence of particular keys, so that you stay totally neutral to what somebody else has stored there.

Axeman