views: 97
answers: 2
Hi,

I need to examine the output of a certain script 1000s of times on a unix platform and check if any of it has changed from before.

I've been doing this:

(script_stuff) | md5sum

and storing this value. I don't actually need MD5 specifically, JUST a simple hash function whose result I can compare against a stored value to see if it's changed. It's okay if there's an occasional false positive.

Is there anything better than md5sum that runs faster and still generates a fairly usable hash value? The script itself generates only a few lines of text - maybe 10-20 on average, up to 100 or so at most.

I had a look at http://stackoverflow.com/questions/1961752/fast-md5sum-on-millions-of-strings-in-bash-ubuntu - that's wonderful, but I can't compile a new program. Need a system utility... :(


Additional "background" details:

I've been asked to monitor the DNS records of a set of 1000 or so domains and immediately call certain other scripts if there has been any change. I intend to run a dig xyz +short command, hash its output and store that, and then check it against a previously stored value. Any change will trigger the other script; otherwise it just goes on. Right now, we're planning on using cron for a set of these 1000, but could think completely differently for "seriously heavy" usage - ~20,000 or so.

I have no idea what the use of such a system would be, I'm just doing this as a job for someone else...
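As a rough shell sketch of the loop described above (the has_changed helper, the state-file layout, and names like domains.txt and on_change.sh are my own illustration, not part of the question):

```shell
#!/bin/sh
# Reads stdin, compares its checksum against the one stored in the state
# file given as $1; returns 0 (changed, state updated) or 1 (unchanged).
# cksum is used here, but md5sum would drop in the same way.
has_changed() {
    state=$1
    new=$(cksum)
    old=$(cat "$state" 2>/dev/null)
    [ "$new" = "$old" ] && return 1
    printf '%s\n' "$new" >"$state"
    return 0
}

# In the real cron job, something along the lines of:
#   while read -r domain; do
#       dig "$domain" +short | has_changed "state/$domain" &&
#           ./on_change.sh "$domain"    # hypothetical trigger script
#   done <domains.txt
```

Note that on the very first run every domain counts as "changed" (there is no stored value yet), so the trigger script would fire once per domain unless you seed the state files first.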

+3  A: 

How big is the output you're checking? A hundred lines max. I'd just save the entire original file then use cmp to see if it's changed. Given that a hash calculation will have to read every byte anyway, the only way you'll get an advantage from a checksum type calculation is if the cost of doing it is less than reading two files of that size.

And cmp won't give you any false positives or negatives :-)

pax> echo hello >qq1.txt
pax> echo goodbye >qq2.txt
pax> cp qq1.txt qq3.txt
pax> cmp qq1.txt qq2.txt >/dev/null
pax> echo $?
1
pax> cmp qq1.txt qq3.txt >/dev/null
pax> echo $?
0
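As a sketch, the save-and-cmp approach might look like this (saved_output.txt and the script_stuff stand-in are illustrative names, not from the question):

```shell
#!/bin/sh
script_stuff() { echo hello; }          # stand-in for the real script

new=$(mktemp)
script_stuff >"$new"
# cmp -s is silent; a missing saved copy also counts as "changed".
if cmp -s "$new" saved_output.txt 2>/dev/null; then
    rm -f "$new"                        # unchanged, discard the new copy
else
    echo "output changed"
    mv "$new" saved_output.txt          # keep the new copy for next time
fi
```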

Based on your question update:

I've been asked to monitor the DNS records of a set of 1000 or so domains and immediately call certain other scripts if there has been any change. I intend to run a dig xyz +short command, hash its output and store that, and then check it against a previously stored value. Any change will trigger the other script; otherwise it just goes on. Right now, we're planning on using cron for a set of these 1000, but could think completely differently for "seriously heavy" usage - ~20,000 or so.

I'm not sure you need to worry too much about the file I/O. The following script executed dig microsoft.com +short 5000 times, first with file I/O and then with output to /dev/null (switching between the two by changing which line is commented out).

#!/bin/bash
rm -rf qqtemp
mkdir qqtemp
((i = 0))
while [[ $i -ne 5000 ]] ; do
        #dig microsoft.com +short >qqtemp/microsoft.com.$i
        dig microsoft.com +short >/dev/null
        ((i = i + 1))
done

The elapsed times at 5 runs each are:

File I/O  |  /dev/null
----------+-----------
    3:09  |  1:52
    2:54  |  2:33
    2:43  |  3:04
    2:49  |  2:38
    2:33  |  3:08

After removing the outliers and averaging, the results are 2:49 for the file I/O and 2:45 for the /dev/null. The time difference is four seconds for 5000 iterations, only 1/1250th of a second per item.

However, since one pass over the 5000 takes up to three minutes, that's the maximum time it will take to detect a problem (a minute and a half on average). If that's not acceptable, you need to move away from bash to another tool.

Given that a single dig only takes about 0.012 seconds, you should theoretically do 5000 in sixty seconds assuming your checking tool takes no time at all. You may be better off doing something like this in Perl and using an associative array to store the output from dig.

Perl's semi-compiled nature means that it will probably run substantially faster than a bash script and Perl's fancy stuff will make the job a lot easier. However, you're unlikely to get that 60-second time much lower just because that's how long it takes to run the dig commands.

paxdiablo
Thanks pax - but I'm thinking the huge number of disk calls would slow down the program further. Do you agree?
RubiCon10
What's the huge number you're talking about exactly? Your question mentioned a hundred lines of text. Unless they're very _long_ lines, you're not talking about much I/O at all.
paxdiablo
This script will be running every 2-3 minutes or so on a set of 5000 data sets (which will be changing independently). If the output isn't recorded/compared in that time, I miss a necessary alert.
RubiCon10
Okay, @RubiCon10, based on this and your last question, it's probably best if you step back and tell us _what_ you're trying to achieve rather than _how_. By pre-supposing the solution, you're tying your hands. Classic example: if you only want to see which files are changing, you can use the last modification time and forget about their hashes altogether. A script to do this on 5,000 files can be blindingly fast.
paxdiablo
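That mtime idea could be sketched like this (the .last_sweep marker file is my own convention for illustration, not something from the comment):

```shell
#!/bin/sh
# Print files under $1 modified since the previous sweep, then reset the
# marker. On the first sweep nothing is reported; the marker is created.
changed_since_last_sweep() {
    dir=$1
    marker=$dir/.last_sweep
    if [ -f "$marker" ]; then
        find "$dir" -type f -newer "$marker" ! -name .last_sweep
    fi
    touch "$marker"
}
```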
There are no files here... the files were what your solution was generating. My previous question IS linked to this one because md5sum generates a " -" suffix in its output for each call (in that question, I mistakenly thought it was a '*' character). I'll add more details to my question...
RubiCon10
A: 

The cksum utility calculates a non-cryptographic CRC checksum.
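It drops straight into the same pipeline shape as md5sum from the question; its output is a CRC followed by a byte count (the sample input below is illustrative):

```shell
# Same shape as (script_stuff) | md5sum, just with cksum instead:
printf 'some output\n' | cksum     # prints "<crc> 12" - CRC, then byte count
```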

caf
Wow - this clicks for me, caf - it shaved 8 seconds off a test set of 200 (the only change was replacing md5sum with cksum)! Great! I wasn't even aware of such a tool!!
RubiCon10