views: 206

answers: 7
The following code makes a list of names and 'numbers' and gives each person a random age between 15 and 90.

#!/bin/bash

file=$1
n=$2

# if number is zero exit
if [ "$n" -eq "0" ]
then
    exit 0
fi

echo "Generating list of $n people."

for i in `seq 1 $n`;
do
    let "NUM=($RANDOM%75)+15"
    echo "name$i $NUM (###)###-####" >> $file
done

echo "List generated."

With it I'm attempting to make a list of 1M names. It's slow, which I expected, but it was so slow that I lost patience and tried 10K names instead. That was slow too, but it finished in a few seconds.

The reason I'm generating the names is to sort them. What surprised me is that when I sorted the list of 10K names it was instant.

How can this be?

Is there something that's making this go ungodly slow? Both the sorting and the generating access files, so how can the sorting be faster? Is my random-number math in the list generator what's slowing it down?

Here's my sorting script.

#!/bin/sh
#first argument is list to be sorted, second is output file
tr -s ' ' < "$1" | sort -n -k2 > "$2"
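For reference, `sort -n -k2` sorts numerically starting at the second whitespace-separated field. A minimal illustration (the sample lines are made up):

```shell
# Sort three sample lines numerically by the second field (the age)
printf 'name1 40\nname2 15\nname3 89\n' | sort -n -k2
# name2 15
# name1 40
# name3 89
```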
+3  A: 

Try this for your main loop:

seq 1 $n | while read i
do
    let "NUM=($RANDOM%75)+15"
    echo "name$i $NUM (###)###-####"
done > $file

This lets the seq and the loop run in parallel instead of waiting for seq to finish before the loop starts. It will be faster on multiple cores/CPUs but slightly slower on a single core.

And I agree with the others here: Does it have to be bash?

Edit: added chaos' suggestion to keep the file open rather than opening it for append once per name.

pgs
+3  A: 

Don't know if it's the whole story, but re-opening the file to append to it for every name can't be helping anything. Doing the whole thing in any context where you can keep an open file handle to write to should help a lot.
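A minimal sketch of the difference (bash; `out.txt` is a throwaway name):

```shell
# Slow pattern: '>>' re-opens and closes out.txt once per iteration
for i in 1 2 3; do
    echo "line $i" >> out.txt
done
rm -f out.txt

# Fast pattern: redirect the whole loop once; the file stays open throughout
for i in 1 2 3; do
    echo "line $i"
done > out.txt
```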

chaos
I think that this is the biggest performance improvement. Appending to files is expensive!
James Thompson
+5  A: 

Using the shell to generate random numbers like this isn't really what it was designed to do. You'll likely be better off coding something to generate random numbers from a uniform distribution in another language, like Fortran, Perl or C.

In your code, one thing that's going to be very slow is generating a sequence of numbers from 1..1e6 and handing them all to the for loop. That's likely very wasteful, but you should profile if you want to be sure. As chaos points out, appending to the file is also likely to be very costly!

In Python, you can do something like this:

#!/usr/bin/python
import random

count = 1
print ' '.join(['name', 'age'])
while count <= 1000000:
    age = random.randrange(15, 90)
    name = 'name' + str(count)
    count = count + 1
    print ' '.join([name, str(age)])

Running that on my laptop takes ~10 seconds. Assigning the seq from 1 to 1000000 alone takes ~10 seconds; when you add the random number generation, your script takes over three minutes on the same machine. I got frustrated just as you did and played around with the script to try to make it faster. Here's my shortened version of your code that I'm playing with:

for x in `seq 1 10000`; do
   let "NUM=($RANDOM%75)+15"
   echo $NUM >> test.txt
done

Running this takes about 5.3s:

$ time ./test.sh
real    0m5.318s
user    0m1.305s
sys     0m0.675s

Removing the file appending and simply redirecting STDOUT to a single file gives the following script:

for x in `seq 1 10000`; do
   let "NUM=($RANDOM%75)+15"
   echo $NUM
done

Running this takes about half a second:

$ time ./test.sh > test.txt
real    0m0.516s
user    0m0.449s
sys     0m0.067s

The slowness of your program is at least partly due to appending to that file. Curiously, when I tried to swap the seq call for a C-style for loop, I didn't notice any speedup.

James Thompson
+5  A: 
for i in `seq 1 $n`

Yikes! This is generating 1,000,000 arguments to the for loop. That seq call will take a long, long, long time. Try

for ((i = 1; i <= n; i++))

Notice the lack of dollar signs, by the way. Peculiarly, the var++ syntax requires you to omit the dollar sign from the variable name. Elsewhere you may use or omit them: it could be i <= n or $i <= $n, either one. The way I figure, you should omit dollar signs entirely in let, declare, and for ((x; y; z)) statements. See the ARITHMETIC EVALUATION section of the bash man page for the full explanation.
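A small runnable sketch of that rule (the variable names are arbitrary):

```shell
n=3
sum=0
# Inside (( )) you can refer to variables without the dollar sign
for ((i = 1; i <= n; i++)); do
    (( sum += i ))
done
echo "$sum"    # 1+2+3 = 6
```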

John Kugelman
Good point. An earlier incarnation of my script had this type of loop, but since I'm still learning the syntax and all, I replaced it because something wasn't right. In your second loop, are the '$'s necessary? Like "for ((i=1; $i <= $n; $i++))". When should you use the '$' and when shouldn't you?
Victor
You can omit them within arithmetic context (for ((, ((, $((, let, array subscripts, ...) if they do what you want. You MUST use them if the value contains characters that affect how the base is detected: a value like "0090" will be parsed as octal because of the leading zeros, so you need to write "10#$foo" to force base 10.
TheBonsai
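To illustrate the base-prefix point (the value "0090" is taken from the comment above):

```shell
foo="0090"
# Plain $((foo)) would fail here: the leading zero makes bash parse the
# value as octal, and 9 is not a valid octal digit
# ("value too great for base").
result=$(( 10#$foo ))    # the 10# prefix forces base 10
echo "$result"           # 90
```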
In my tests with a no-op loop body, using seq was about twice as fast as a C-style for loop. You're trading using lots of memory for having to process the increment and test with every iteration.
Dennis Williamson
@Dennis - I noticed that too in my tests, and was surprised by it. The answer might be different on memory-limited systems, but apparently the increment and test are far more expensive than I would have believed based on experience in other languages.
James Thompson
+2  A: 

(I have a feeling you may not like this answer, but you technically didn't specify the answer had to remain in bash! :P)

It's common to rapidly develop something in a prototyping language and then possibly switch to another language (often C) as needed. Here's a very similar program in Python for you to compare:

#!/usr/bin/python
import sys
import random

def main(args=None):
    args = args or []
    if len(args) == 1:
        # default first parameter
        args = ["-"] + args
    if len(args) != 2:
        sys.stderr.write("error: invalid parameters\n")
        return 1
    n = int(args[1])
    output = sys.stdout if args[0] == "-" else open(args[0], "a")

    for i in xrange(1, n + 1):
        num = random.randint(15, 89)
        output.write("name%s %s (###)###-####\n" % (i, num))

    sys.stderr.write("List generated.\n")  # see note below

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))

Note: Using stdout only for "real output" (and stderr for status notifications) allows this program to run in parallel with others, piping data directly from the stdout of one to the stdin of another. (It's possible with special files in *nix too, but it's just easier if you can use stdout.) Example:

$ ./rand_names.py 1000000 | sort -n -k2 > output_file

And it should be fast enough:

$ time ./rand_names.py 1000000 > /dev/null
List generated.

real    0m16.393s
user    0m15.108s
sys     0m0.171s
Roger Pate
+4  A: 

I suspect the '>> $file' is the source of your problem. On my system your script takes several seconds to generate 10000 names. If I remove the $file argument, write to stdout instead, and redirect the whole thing to a file, it takes under a second.

$ time ./gen1.sh n1.txt 10000
Generating list of 10000 people.
List generated.

real    0m7.552s
user    0m1.355s
sys     0m1.886s

$ time ./gen2.sh 10000 > n2.txt

real    0m0.806s
user    0m0.576s
sys     0m0.140s

John Nilsson
+3  A: 

Not a new answer, just new code.

This is, IMHO, a good middle ground between nice and efficient code (as efficient as you can get in Bash; it IS slow, it's a shell...):

for ((i=1;i<=n;i++));
do
  echo "name$i $((NUM=(RANDOM%75)+15)) (###)###-####"
done > "$file"

An alternative, not using a classic counter loop:

i=1
while ((i<=n)); do
  echo "name$((i++)) $((NUM=(RANDOM%75)+15)) (###)###-####"
done > "$file"

Both are about the same speed.

The fixes are the same as mentioned by all the others:

  • do not frequently close and re-open the file
  • use shell arithmetics
  • ah yes, and use QUOTES, but that's for sanity, not for speed
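On the quoting point, a tiny example of what unquoted expansions can do (the filename here is hypothetical):

```shell
file="name list.txt"        # a filename containing a space

printf 'data\n' > "$file"   # quoted: one file named "name list.txt"

# Unquoted ( > $file ) bash would word-split the target and report an
# "ambiguous redirect" error instead of creating the intended file.
```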
TheBonsai
Wow, thanks for that. I didn't know that you could do the file output after 'done'. I was really hoping I could keep my scripts the way they were as far as how they take arguments. That really helps.
Victor