views: 4411
answers: 8

Just what the title says. I am surprised by how few results a Google search turns up for this question! What I want is the equivalent of Python dictionaries, but in bash (and hence it should work across OS X, Ubuntu, and other major Linux distributions).

A: 

The list of lists approach?
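
A minimal sketch of what that could look like in bash 3, using two parallel indexed arrays (the mput/mget helper names here are just illustrative, not part of the original suggestion):

# two parallel arrays standing in for a map (bash 3 compatible)
keys=()
values=()

mput() {                                  # mput KEY VALUE
    keys[${#keys[@]}]=$1
    values[${#values[@]}]=$2
}

mget() {                                  # mget KEY -> prints VALUE
    local i
    for i in "${!keys[@]}"; do
        if [ "${keys[$i]}" = "$1" ]; then
            printf '%s\n' "${values[$i]}"
            return
        fi
    done
}

mput France Paris
mput Spain Madrid
mget Spain    # Madrid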

Tom
+4  A: 
hput () {
  # stores the value in a variable named hash<key>
  eval hash"$1"='$2'
}

hget () {
  # expands the variable named hash<key>
  eval echo '${hash'"$1"'}'
}
hput France Paris
hput Netherlands Amsterdam
hput Spain Madrid
echo `hget France` and `hget Netherlands` and `hget Spain`


$ sh hash.sh
Paris and Amsterdam and Madrid
DigitalRoss
It's always funny to see that people still don't care about, or understand, arbitrary code injection.
lhunath
Sigh, that seems unnecessarily insulting and it's inaccurate anyway. One would not put input validation, escaping, or encoding (see, I actually do know) in the guts of the hash table, but rather in a wrapper and as soon as possible after input.
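
For example, a hypothetical wrapper (hput_safe is not part of the answer above, just an illustration) could whitelist keys before they ever reach hput:

hput_safe() {
    case "$1" in
        *[!a-zA-Z0-9_]*|"")
            echo "refusing unsafe key: $1" >&2; return 1 ;;
    esac
    hput "$1" "$2"
}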
DigitalRoss
+3  A: 

You can further modify the hput()/hget() interface so that you have named hashes as follows:

hput() {
    # stores the value in a variable named <map><key>
    eval "$1""$2"='$3'
}

hget() {
    # expands the variable named <map><key>
    eval echo '${'"$1$2"'}'
}

and then

hput capitols France Paris
hput capitols Netherlands Amsterdam
hput capitols Spain Madrid
echo `hget capitols France` and `hget capitols Netherlands` and `hget capitols Spain`

This lets you define other maps that don't conflict (e.g., 'rcapitols', which does country lookup by capital city), as sketched below. But either way, I think you'll find that this is all pretty terrible, performance-wise.
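
For instance, the reverse map could be built with the same helpers (just a sketch):

hput rcapitols Paris France
hput rcapitols Amsterdam Netherlands
echo `hget rcapitols Paris`    # France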

If you really want fast hash lookup, there's a terrible, terrible hack that actually works really well: write your key/value pairs out to a temporary file, one per line, then use 'grep "^$key"' to pull them back out, piping through cut or awk or sed or whatever to retrieve the values.

Like I said, it sounds terrible, and it sounds like it ought to be slow and do all sorts of unnecessary IO, but in practice it is very fast (disk cache is awesome, ain't it?), even for very large hash tables. You have to enforce key uniqueness yourself, etc. Even if you only have a few hundred entries, the file/grep combo is going to be quite a bit faster than the eval-based approach - in my experience several times faster. It also eats less memory.

Here's one way to do it:

hinit() {
    rm -f /tmp/hashmap.$1                 # start with an empty map
}

hput() {
    echo "$2 $3" >> /tmp/hashmap.$1       # append a "key value" line
}

hget() {
    grep "^$2 " /tmp/hashmap.$1 | awk '{ print $2 }'
}

hinit capitols
hput capitols France Paris
hput capitols Netherlands Amsterdam
hput capitols Spain Madrid

echo `hget capitols France` and `hget capitols Netherlands` and `hget capitols Spain`
Al P.
A: 

To get a little more performance, remember that grep can stop when it finds the nth match; in this case n would be 1.

grep --max-count=1 ... or grep -m 1 ...
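
Applied to the file-backed hget from the answer above, that would look something like:

hget() {
    grep -m 1 "^$2 " /tmp/hashmap.$1 | awk '{ print $2 }'
}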

bozon
+1  A: 

Prior to bash 4 there is no good way to use associative arrays in bash. Your best bet is to use an interpreted language that actually has support for such things, like awk. On the other hand, bash 4 does support them.

As for less-good ways in bash 3, here is a reference that might help: http://mywiki.wooledge.org/BashFAQ/006
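
For what it's worth, here is a quick sketch of the capitals example using awk's native associative arrays, invoked from the shell (all names are illustrative):

awk 'BEGIN {
    capitals["France"]      = "Paris"
    capitals["Netherlands"] = "Amsterdam"
    capitals["Spain"]       = "Madrid"
    print capitals["France"], "and", capitals["Netherlands"], "and", capitals["Spain"]
}'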

kojiro
A: 

Bash 4 natively supports this feature. You declare an associative array by doing:

declare -A animals

You can fill it up with elements using the normal array assignment operator:

animals=( ["moo"]="cow" )

Or combine the declaration and the assignment in one step:

declare -A animals=( ["moo"]="cow" )

Then use them just like normal arrays:

echo "${animals["moo"]}"
for sound in "${!animals[@]}"; do echo "$sound - ${animals["$sound"]}"; done

In bash 3, you don't have associative arrays. Do not use eval to emulate them. You must avoid eval like the plague, because it is the plague of shell scripting.

First and foremost: Consider upgrading to bash 4. Seriously. The future is now, stop living in the past and suffering from it by forcing stupid broken and ugly hacks on your code.

If you have some silly excuse why you "can't upgrade", declare is a far safer option. It does not evaluate data as bash code like eval does, and as such it does not allow arbitrary code injection quite so easily.

Let's prepare the answer by introducing the concepts:

First, indirection (seriously; never use this unless you're mentally ill or have some other bad excuse for writing hacks).

$ animals_moo=cow; sound=moo; i="animals_$sound"; echo "${!i}"
cow

Secondly, declare:

$ sound=moo; animal=cow; declare "animals_$sound=$animal"; echo "$animals_moo"
cow

Bring them together:

# Set a value:
declare "array_$index=$value"

# Get a value:
arrayGet() { 
    local array=$1 index=$2
    local i="${array}_$index"
    printf '%s' "${!i}"
}

Let's use it:

$ sound=moo
$ animal=cow
$ declare "animals_$sound=$animal"
$ arrayGet animals "$sound"
cow

Note: declare cannot be used for this inside a function. Any use of declare inside a bash function makes the variable it creates local to the scope of that function, meaning we can't set or modify global values with it.
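
A quick illustration of that caveat (the names are only for demonstration):

setAnimal() { declare "animals_$1=$2"; }    # declare inside a function creates a local
setAnimal moo cow
echo "${animals_moo-unset}"                 # prints "unset" - the global was never set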

Once more. Upgrade to bash 4. If you can't, consider awk before doing ugly hacks as described above. And definitely stay the heck away from eval hackery.

lhunath
A: 

Two things: you can use memory instead of /tmp on any 2.6 kernel by using /dev/shm (on Red Hat; other distros may vary). Also, hget can be reimplemented using read as follows:

function hget {
    while read key idx; do
        if [ "$key" = "$2" ]; then
            echo "$idx"
            return
        fi
    done < /dev/shm/hashmap.$1
}
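
For completeness, the matching hinit/hput would just point at /dev/shm instead of /tmp (a sketch based on the earlier answer):

hinit() {
    rm -f /dev/shm/hashmap.$1
}

hput() {
    echo "$2 $3" >> /dev/shm/hashmap.$1
}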

In addition, assuming that all keys are unique, the return short-circuits the read loop and avoids having to read through all the entries. If your implementation can have duplicate keys, simply leave out the return. This saves the expense of reading, and of forking both grep and awk. Using /dev/shm for both implementations yielded the following with time hget on a 3-entry hash, searching for the last entry:

Grep/Awk:

hget() {
    grep "^$2 " /dev/shm/hashmap.$1 | awk '{ print $2 }'
}

time echo $(hget FD oracle)

3

real 0m0.011s
user 0m0.002s
sys 0m0.013s

Read/echo:

time echo $(hget FD oracle)

3

real 0m0.004s
user 0m0.000s
sys 0m0.004s

Over multiple invocations I never saw less than a 50% improvement. Since both implementations use /dev/shm, the difference can be attributed to the overhead of forking grep and awk.

jrichard
A: 

I really liked Al P.'s answer, but I wanted uniqueness enforced cheaply, so I took it one step further: use a directory. There are some obvious limitations (directory file limits, invalid file names), but it should work for most cases.

hinit() {
    rm -rf /tmp/hashmap.$1
    mkdir -p /tmp/hashmap.$1
}

hput() {
    printf '%s' "$3" > "/tmp/hashmap.$1/$2"    # one file per key; the file holds the value
}

hget() {
    cat "/tmp/hashmap.$1/$2"
}

hkeys() {
    ls -1 /tmp/hashmap.$1
}

hinit ids

for (( i = 0; i < 10000; i++ )); do
    hput ids "key$i" "value$i"
done

for (( i = 0; i < 10000; i++ )); do
    printf '%s\n' $(hget ids "key$i") > /dev/null
done

It also performs a tad bit better in my tests.

$ time bash hash.sh 
real    0m46.500s
user    0m16.767s
sys     0m51.473s

$ time bash dirhash.sh 
real    0m35.875s
user    0m8.002s
sys     0m24.666s

Just thought I'd pitch in. Cheers!

Mauldus