
I have a few directories that contain a lot of files. As some of them are approaching 600k files, they have become a major pain to handle. Just listing the files is slowly becoming a major bottleneck in the applications processing them.

The files are named like this: id_date1_date2.gz. I've decided to split each directory into several smaller ones, based on the first part of the filename, the "id".

Since the same id may show up in a large number of files, and the same id already shows up in several directories, I need to keep track of which file ids have been copied, and from which dirs. Otherwise I'd end up doing the same copying an insane number of times, or missing id X when copying from dir Y because it was already copied from dir Z.

I've written a script to accomplish this, with some debugging left in:

#!/bin/bash
find /marketdata -maxdepth 2 -type d | grep "[0-9]\.[0-9][0-9][0-9]$" | sort | #head -n2 | tail -n1 |
while read baseDir; do
    cd $baseDir;
    echo $baseDir > tmpFile;
    find . -type f | grep -v "\.\/\." | #sort | head -n4 |
    while read file; do
        name=$(awk 'BEGIN {print substr("'"$file"'", 3,index("'"$file"'", "_")-3 )}');

        dirkey=${baseDir//[\/,.]/_}"_"$name;
        if [ "${copied[$dirkey]}" != "true" ]; then
            echo "Copying $baseDir/$name with:";
            echo mkdir -p $(sed 's/data/data4/' tmpFile)/$name;
            #mkdir -p $(sed 's/data/data4/' tmpFile)/$name;
            oldName=$baseDir/$name"_*";
            echo cp $oldName "$(sed 's/data/data4/' tmpFile)/$name/";
            #cp $oldName "$(sed 's/data/data4/' tmpFile)/$name/";
            echo "Setting $dirkey to true";
            copied[$dirkey]="true";
        else
            echo "$dirkey: ${copied[$dirkey]}"
            sleep 1
        fi
    done;

    rm tmpFile;
done

The problem here is that the value of every key in copied seems to become true after the very first copy, so my handling of bash arrays is probably the issue.
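In hindsight, here's a minimal illustration of the likely cause (the key names are made up): without declare -A, bash treats copied as an indexed array and evaluates each subscript as an arithmetic expression, and an unset identifier evaluates to 0, so every key lands in the same slot.

unset copied
copied[_marketdata_a_1_234_id1]="true"     # subscript is an unset name, evaluates to 0
echo "${copied[_marketdata_a_1_234_id2]}"  # different key, same slot 0: prints "true"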

Some progress: I tried writing each key to a file, and on each iteration I read that file back in instead. This is obviously really ugly, but it looks like it accomplishes my goal. It could become extremely slow once I've processed a few thousand ids. Will update later.
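Roughly, the workaround looks like this (the file location is illustrative, and I check membership with grep rather than re-reading the file into an array):

keyFile=/tmp/copied_keys                  # illustrative location
touch "$keyFile"
if ! grep -qxF "$dirkey" "$keyFile"; then
    # ... perform the copy as before ...
    echo "$dirkey" >> "$keyFile"
fi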

For someone else who may find this in the future, here's the final script:

declare -A copied

find /marketdata -maxdepth 2 -type d -name "[0-9].[0-9][0-9][0-9]" | sort | #head -n3 | tail -n1 |
while read -r baseDir; do
    cd "$baseDir" || continue;
    find . -type f | grep -v "\.\/\." | sort | #head -n100 |
    while read -r file; do
        length=$(expr index "$file" "_");   # 1-based position of the first "_"
        name=${file:2:$((length - 3))};     # the id between "./" and the first "_"

        dirkey=${baseDir//[\/,.]/_}"_"$name;
        if [ "${copied[$dirkey]}" != "true" ]; then
            echo "Copying ${baseDir}/${name} to ${baseDir//data/data4}/$name";
            mkdir -p "${baseDir//data/data4}/$name";
            oldName="${baseDir}/${name}_*";
            cp -n $oldName "${baseDir//data/data4}/${name}/";   # unquoted so the glob expands
            copied[$dirkey]="true";
        fi
    done;
done

No awk, no sed, better quoting, no temporary files written to disk, and less grep. I'm not sure whether the dirkey hack is still necessary now that the associative array is working properly, and I don't entirely understand why I need the oldName var.
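As far as I can tell, oldName only exists so that the glob stays unquoted when cp expands it; quoting the literal parts and leaving the * bare achieves the same thing without the intermediate variable:

cp -n "${baseDir}/${name}_"* "${baseDir//data/data4}/${name}/"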

A: 

The -n option to cp is very useful in situations like this: it means you don't have to worry about whether a file already exists in the destination.

-n, --no-clobber
    do not overwrite an existing file (overrides a previous -i option)

This basically makes the duplicate-work case you describe go away: you can split your concerns into moving all the files, and moving only those files that haven't been moved before.
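For example, re-running the same copy becomes harmless (paths are illustrative):

cp -n /marketdata/a/1.234/id1_* /marketdata4/a/1.234/id1/   # a second run skips files that already exist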

Paul Rubel
Thanks, I've added that to the script. While this does improve the situation, it still seems rather ugly: cp still has to do several thousand checks on individual files. Then again, I don't know whether doing the checking in bash would be much faster.
Claes
+1  A: 

If the value in $dirkey contains alphabetic characters, you'll have to use an associative array, which isn't available before Bash 4. If you're using Bash 4 and the keys are alphanumeric rather than purely numeric, add the following at the top of your script:

declare -A copied

Additional comments:

You're using parameter expansion in some places and sed in others; parameter expansion could be used in (perhaps) all cases.
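For instance, the sed calls in the first script can become expansions on the variable itself (newDir is just an illustrative name):

newDir=$(sed 's/data/data4/' tmpFile)   # before: forks sed and reads a temp file
newDir=${baseDir/data/data4}            # after: pure parameter expansion, no subprocess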

Instead of quoting like $var"literal"$var, I would recommend "${var}literal${var}"; in cases where the literal cannot be ambiguously interpreted as part of the variable name, you can omit the braces: "literal$var".

Use variable passing with awk instead of the complex "'" quoting: awk -v awkvar="$shellvar" '{print awkvar}'.
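Applied to the extraction in the question, that would look something like:

name=$(awk -v f="$file" 'BEGIN { print substr(f, 3, index(f, "_") - 3) }')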

Calling external executables in a loop can slow things down quite a lot, especially when each call handles only one value (or line of data) at a time. The sed commands that I mentioned are examples of this. Your awk command may also be convertible to parameter expansion form.
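Something like this (a sketch) extracts the id with no external processes at all:

name=${file#./}     # strip the leading "./"
name=${name%%_*}    # keep everything before the first "_"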

GNU find has a regex feature that you could use instead of grep.
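A sketch of what that could look like; note that -regex matches against the entire path, which is why a trailing $ isn't needed:

find /marketdata -maxdepth 2 -type d -regextype posix-extended \
    -regex '.*/[0-9]\.[0-9]{3}'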

All variables that contain filenames should be quoted.

Dennis Williamson
I use bash 4 but hadn't used associative arrays before; the declare was new to me and seems to have made all the difference, thank you! I will correct my variables to use proper quoting. I actually tried -v with awk initially, but it failed to work for reasons I couldn't figure out. I will see about replacing that ugly sed too. I wasn't aware I could use regex with find; I can't quite get it to work either, but if I drop the $, -name accepts my regex. Your post has been most informative. Thank you again.
Claes