I need the MD5 sums of 3 million strings or so in a bash script on Ubuntu. 3 million strings -> 3 million MD5 hashes. The trivial implementation takes about 0.005 seconds per string, which works out to over 4 hours. What faster alternatives exist? Is there a way to pump groups of strings into md5sum?

#time md5sum running 100 times on short strings
#each iteration is ~0.494s/100 = 0.005s
time (for i in {0..99}; do md5sum <(echo $i); done) > /dev/null

real    0m0.494s
user    0m0.120s
sys     0m0.356s

A good solution will include a bash/Perl script that takes a list of strings from stdin and outputs a list of their MD5 hashes.

+2  A: 

I'm without a machine to test it on right now, but is md5sum <<< "$i" any faster than md5sum <(echo $i)? The <<< syntax would avoid the overhead of forking a sub-process for the echo and would pass $i directly to md5sum on standard input.

John Kugelman
Nice -- it shaves about a third off on my box.
Dirk Eddelbuettel
Yup, on my box too. Good catch!
Oren Trutner
+6  A: 

It's not hard to do in C (or Perl or Python) using any of the many md5 implementations -- at its heart md5 is a hash function that maps a byte string to a fixed 16-byte digest.

So just write an outer program that reads your 3 million strings and feeds them one by one to the md5 implementation of your choice. That way you have one program startup rather than 3 million, and that alone will save you time.

FWIW, in one project I used the md5 implementation (in C) by Christophe Devine; there is OpenSSL's as well, and I am sure CPAN has a number of them for Perl too.

Edit: OK, couldn't resist. The md5 implementation I mentioned is e.g. inside this small tarball. Take the file md5.c and replace the (#ifdef'ed out) main() at the bottom with this:

int main( int argc, char *argv[] ) {
    FILE *f;
    int j;
    md5_context ctx;
    unsigned char buf[1000];
    unsigned char md5sum[16];

    if( argc < 2 ) {
        fprintf( stderr, "usage: %s file\n", argv[0] );
        return( 1 );
    }
    if( ! ( f = fopen( argv[1], "rb" ) ) ) {
        perror( "fopen" );
        return( 1 );
    }

    /* %999s bounds the read to buf's capacity (999 chars + NUL);
       fscanf stops at whitespace, so each token is hashed separately */
    while( fscanf( f, "%999s", (char*) buf ) == 1 ) {
        md5_starts( &ctx );
        md5_update( &ctx, buf, (uint32) strlen( (char*) buf ) );
        md5_finish( &ctx, md5sum );

        /* print the 16-byte digest as 32 hex characters */
        for( j = 0; j < 16; j++ ) {
            printf( "%02x", md5sum[j] );
        }
        printf( " <- %s\n", buf );
    }
    fclose( f );
    return( 0 );
}

Build a simple standalone program from it, e.g.:

/tmp$ gcc -Wall -O3 -o simple_md5 simple_md5.c

and then you get this:

# first, generate 300,000 numbers in a file (using 'r' from littler, a scripting front end for R)
/tmp$ r -e'for (i in 1:300000) cat(i,"\n")' > foo.txt

# illustrate the output
/tmp$ ./simple_md5 foo.txt | head
c4ca4238a0b923820dcc509a6f75849b <- 1
c81e728d9d4c2f636f067f89cc14862c <- 2
eccbc87e4b5ce2fe28308fd9f2a7baf3 <- 3
a87ff679a2f3e71d9181a67b7542122c <- 4
e4da3b7fbbce2345d7772b0674a318d5 <- 5
1679091c5a880faf6fb5e6087eb1b2dc <- 6
8f14e45fceea167a5a36dedd4bea2543 <- 7
c9f0f895fb98ab9159f51fd0297e236d <- 8
45c48cce2e2d7fbdea1afc51c7c6ad26 <- 9
d3d9446802a44259755d38e6d163e820 <- 10

# let the program rip over it, suppressing stdout
/tmp$ time (./simple_md5 foo.txt > /dev/null)

real    0m1.023s
user    0m1.008s
sys     0m0.012s
/tmp$

So that's about a second for 300,000 (short) strings.

Dirk Eddelbuettel
On my box this is about 3 orders of magnitude faster than the bash/md5sum loop. It isn't as concise, but the speed can't be argued with. For future readers, note that fscanf reads until it hits whitespace; if you need to process full lines, consider fgets or similar, and consider whether the terminating newline character belongs in the hash. I'll award the answer to this one before someone goes and writes a GPU-accelerated MD5. At 1000x faster, my bottlenecks have moved elsewhere. Thanks Dirk, kread, John, gbacon, and others.
Oren Trutner
+1  A: 

To get better performance you would probably need to use a different program, or create a C program that calls one of the publicly available md5 hash APIs.

Another option is to spawn multiple md5sum calls at once to take advantage of multiple cores. On each pass through the loop you could spawn 8 calls, the first 7 ending in & (to run them in the background). With 4-8 cores available, this could speed things up by up to a factor of 8.

Andrew
I am not sure about the factor of 8 -- you also need to transport the input to the cores, and this problem is probably more I/O-bound than CPU-bound, so I think you get a sublinear speedup. But in general, yes, this is 'embarrassingly parallel' and could be approached this way for a still-decent speedup.
Dirk Eddelbuettel
+3  A: 
perl -MDigest::MD5=md5_hex -lpe '$_ = md5_hex $_'
Greg Bacon
Thanks for the concise answer; it takes about 200% longer than the md5sum loop in the question though.
Oren Trutner
Did you fork one perl process per line of input or supply all lines on the standard input of a single perl process?
Greg Bacon
Ah, I did fork a perl process per line. When piping lines to a single perl process, I get a x3 speed increase. Thanks!
Oren Trutner
+3  A: 
#~/sw/md5$ time (for i in {0..99}; do md5sum <(echo $i); done) > /dev/null

real    0m0.220s
user    0m0.084s
sys     0m0.160s
#~/sw/md5$ time (python test.py `for i in {0..99}; do echo $i; done`) > /dev/null

real    0m0.041s
user    0m0.024s
sys 0m0.012s

The Python code is five times as fast for this small sample; for larger samples the difference is much bigger, since the per-string process spawns disappear. For 1k samples it is 0.033s versus 2.3s :) The script is:

#!/usr/bin/env python
import hashlib, sys

# print the MD5 hex digest of each command-line argument (Python 2)
for arg in sys.argv[1:]:
  print hashlib.md5(arg).hexdigest()
kread
It's perhaps 2 orders of magnitude faster for longer lists -- thanks! I didn't realize the python solution could be so concise. The hashes come out different though; perhaps a newline character is or isn't included. I'll look into that. Thanks again!
Oren Trutner
It is concise, but it takes its arguments from the command-line which will not scale for 3,000,000 strings.
Dirk Eddelbuettel
Actually, when modifying the Python code to read from stdin instead of the command line, I get speeds approaching or exceeding(!) the C solution. It seems to be fast enough to move the bottleneck to I/O. While I've awarded the answer to the C solution, if you're reading this in the future you might want to try the Python approach first. To read from stdin, change the loop to: "for line in sys.stdin: print hashlib.md5(line).hexdigest()"
Oren Trutner
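
For future readers, here is that stdin-reading variant as a minimal self-contained sketch (modernized to Python 3 here, which is an assumption beyond the Python 2 answer above; the rstrip controls whether the trailing newline is part of the hash, which is the digest difference noted in the earlier comment):

#!/usr/bin/env python3
import hashlib
import sys

# Hash every line read from stdin in a single process.
for line in sys.stdin:
    # rstrip drops the trailing newline so only the string itself is hashed;
    # keep the newline if you need to match `md5sum <(echo $i)` exactly.
    print(hashlib.md5(line.rstrip('\n').encode()).hexdigest())

Run it as e.g. python3 md5_stdin.py < foo.txt (the script name is hypothetical); this avoids both the per-string process spawns and the command-line length limit mentioned above.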
It would also be pretty easy in Python 2.6 or later to use the multiprocessing module to scale this out to more processors. Create a Queue object for inputs and one for results, spawn off a few processes (perhaps based on the number of processors, or a command-line switch, etc.), and have all the subprocesses consume the inputs, compute their hashes, and feed the results back to the master through the results queue.
Jim Dennis
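
That queue-based layout can be sketched as follows; this is a rough illustration under stated assumptions (Python 3 rather than 2.6, a chunk size of 10,000 picked arbitrarily to amortize queue overhead, and the trailing newline stripped before hashing), not a tuned implementation:

#!/usr/bin/env python3
import hashlib
import multiprocessing as mp
import sys

def worker(inq, outq):
    # Pull (index, chunk) pairs from the input queue until the None sentinel.
    for idx, chunk in iter(inq.get, None):
        digests = [hashlib.md5(s.rstrip('\n').encode()).hexdigest()
                   for s in chunk]
        outq.put((idx, digests))

if __name__ == '__main__':
    nproc = mp.cpu_count()
    inq, outq = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(inq, outq)) for _ in range(nproc)]
    for p in procs:
        p.start()

    # Feed stdin to the workers in chunks (size is arbitrary), tagging each
    # chunk with its index so the output order can be restored afterwards.
    nchunks, chunk = 0, []
    for line in sys.stdin:
        chunk.append(line)
        if len(chunk) == 10000:
            inq.put((nchunks, chunk))
            nchunks += 1
            chunk = []
    if chunk:
        inq.put((nchunks, chunk))
        nchunks += 1
    for _ in procs:
        inq.put(None)  # one stop sentinel per worker

    # Drain the results queue, then print the digests in input order.
    results = dict(outq.get() for _ in range(nchunks))
    for i in range(nchunks):
        print('\n'.join(results[i]))
    for p in procs:
        p.join()

As Dirk notes above, the job may well be I/O-bound, so expect a sublinear speedup in the number of cores.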