views:

294

answers:

2

Given function f1 which receives n String arguments, what would be considered better ,in terms of runtime performance, random key generation strategy for memcache?

Our Memcache client does internal md5sum hashing on the keys it gets:

   public class MemcacheClient {  
       public Object get(String key) {
            String md5 = Md5sum.md5(key)
            // Talk to memcached to get the Serialization... 
            return memcached(md5);
       }
   }

My usage scenarios are:

First option

    public static String f1(String s1, String s2, String s3, String s4) {
         String key = s1 +  s2 + s3 + s4;
         return get(key);
    }

Second option

    /**
     * Calculate hash from Strings
     *
     * @param objects vararg list of String's
     *
     * @return calculated md5sum hash
     */
    public static String stringHash(Object... strings) {
        if(strings == null) 
            throw new NullPointerException("D'oh! Can't calculate hash for null");

        MD5 md5sum = new MD5();

//      if(prevHash != null)
//          md5sum.Update(prevHash);

        for(int i = 0; i < strings.length; i++) {
            if(strings[i] != null) {
                md5sum.Update("_"); 
                md5sum.Update(strings[i].toString()); // Convert to String...
                md5sum.Update("_");

            } else {
                // If object is null, allow minimum entropy  by hashing it's position
                md5sum.Update("_");
                md5sum.Update(i);
                md5sum.Update("_");
            }
        }

        return md5sum.asHex();
    }


    public static String f1(String s1, String s2, String s3, String s4) {
         String key = stringHash(s1, s2, s3, s4);
         return get(key);
    }

Note that the possible problem with the second option is that we are doing second md5sum (in the memcache client) on an already md5sum'ed digest result.

Thanks for reading, Maxim.

-- Edit Used MD5 utility source

+1  A: 

"Better" in what sense? Why would you think the second option is "better"? It does more string concatentations, more MD5 hashes and just generally seems much less efficient than the first...

Dean Harding
I disagree. On the first option I'm doing string concatenation, meaning I'm copying memory without creating another instance of string that will be GC'ed. On the second option I'm doing md5sum's which is cheap mathematical operation. For "Better" i meant runtime BigO performance.
Maxim Veksler
Please correct "without" to "thus", it should read "...copying memory thus creating another instance of..."
Maxim Veksler
I've edited the question to explain "better" better. Thank you for the comment.
Maxim Veksler
@Maxim: you're doing *more* string concatenations in the second version: `"_" + strings[i] + "_"` that's two per string!
Dean Harding
@codeka: any JIT worth its salt will optimize that away.
Ants Aasma
@Ants: perhaps, but `a + b + c + d` would also be optimizied by the compiler to whatever the java equivalent of `string.Concat(a, b, c, d)` is.
Dean Harding
@codeka I have fixed to code to avoid second option string concatenation. After this patch would you agree that the second option is more runtime efficient then the first? Think about a,b,c,d each 1000 byte size, option1 would create additional 4000 byte array (String's concat) where as second option would digest them all. Isn't that a better solution?
Maxim Veksler
@Maxim: Maybe, you'd need to use a profiler to know for sure though. Personally, I would still prefer the first version for it's simplicity.
Dean Harding
+1  A: 

Just nitpicking, but you probably don't want random key generation, the key generation should be deterministic, but should generate a uniform distribution in the key space.

If you consider only accidental collisions, then the first approach is almost fine. You should prefix the strings with their length so you don't get collisions when a substring moves from one param to another. Given md5's pretty good avalanche properties that will ensure that accidental collisions are rare enough to be ignored.

But be careful with MD5 if you process user input, it has known collision attacks. If an untrusted user can pick some arbitrary bytes for the function parameters and returning a wrong result can have security implications, then you have a security hole. For instance if you use this to cache authorization info, an attacker could work out two sets of parameters that hash to a single value. One would access something public and the other accesses a protected service. Now just request authorization with the first set, get the authorization cached and then access the protected service with the other set, receiving a green light from the cached authorization.

Ants Aasma