I have a few directories that contain a lot of files. As some of them are approaching 600k files, they have become a major pain to handle. Just listing the files is slowly becoming a major bottleneck in the applications processing them.
The files are named like this: id_date1_date2.gz. I've decided to split each directory into several smaller ones, based on the first part of the name, the "id".
Since the same id may show up in a large number of files, and the same id already shows up in several directories, I need to keep track of which file ids have been copied, and from which dirs. Otherwise I'd end up repeating the same copy an insane number of times, or skipping id X in dir Y because it had already been copied from dir Z.
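For reference, the id part can be pulled out of a name like id_date1_date2.gz with plain parameter expansion (a minimal sketch; the sample filename is made up):

```shell
#!/bin/bash
# Strip everything from the first "_" onwards to keep only the id.
file="abc123_20120101_20120131.gz"   # hypothetical example name
id=${file%%_*}
echo "$id"   # prints "abc123"
```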
I've written a script to accomplish this, with some debugging output included:
#!/bin/bash
find /marketdata -maxdepth 2 -type d | grep "[0-9]\.[0-9][0-9][0-9]$" | sort | #head -n2 | tail -n1 |
while read baseDir; do
    cd $baseDir;
    echo $baseDir > tmpFile;
    find . -type f | grep -v "\.\/\." | #sort | head -n4 |
    while read file; do
        name=$(awk 'BEGIN {print substr("'"$file"'", 3, index("'"$file"'", "_")-3)}');
        dirkey=${baseDir//[\/,.]/_}"_"$name;
        if [ "${copied[$dirkey]}" != "true" ]; then
            echo "Copying $baseDir/$name with:";
            echo mkdir -p $(sed 's/data/data4/' tmpFile)/$name;
            #mkdir -p $(sed 's/data/data4/' tmpFile)/$name;
            oldName=$baseDir/$name"_*";
            echo cp $oldName "$(sed 's/data/data4/' tmpFile)/$name/";
            #cp $oldName "$(sed 's/data/data4/' tmpFile)/$name/";
            echo "Setting $dirkey to true";
            copied[$dirkey]="true";
        else
            echo "$dirkey: ${copied[$dirkey]}"
            sleep 1
        fi
    done;
    rm tmpFile;
done
The problem here is that the value of every key in copied seems to become true after the very first copy, so my handling of bash arrays is probably the issue.
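The behaviour can be reproduced in isolation: without declare -A, bash treats the subscript as an arithmetic expression, so every string key evaluates to 0 and all keys share one element (a minimal demo, needs bash 4+):

```shell
#!/bin/bash
copied[foo]="a"    # no declare -A: "foo" is evaluated arithmetically -> index 0
copied[bar]="b"    # also index 0, so this overwrites the previous value
echo "${copied[foo]} ${copied[bar]}"   # prints "b b"

declare -A assoc   # with declare -A, subscripts are real string keys
assoc[foo]="a"
assoc[bar]="b"
echo "${assoc[foo]} ${assoc[bar]}"     # prints "a b"
```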
Some progress: I tried writing each key to a file and, on each iteration, reading that file back into an array instead. This is obviously really ugly, but it looks like it accomplishes my goal. It could become extremely slow once I've processed a few thousand ids. Will update later.
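Roughly, that workaround looked like this (a sketch; /tmp/copied.keys and the key value are made up for illustration). grep has to rescan the whole state file for every lookup, which is why it degrades as keys pile up:

```shell
#!/bin/bash
state=/tmp/copied.keys            # assumed location of the state file
touch "$state"
dirkey="_marketdata_1_234_abc123" # hypothetical key
if ! grep -qxF "$dirkey" "$state"; then
    # ... perform the copy here ...
    printf '%s\n' "$dirkey" >> "$state"   # remember the key for next time
fi
```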
For someone else who may find this in the future, here's the final script. The fix was declaring copied as an associative array with declare -A: without it, bash evaluates the string subscript arithmetically, so every key collapsed to index 0 and all keys shared one element.
declare -A copied
find /marketdata -maxdepth 2 -type d -name "[0-9]\.[0-9][0-9][0-9]" | sort | #head -n3 | tail -n1 |
while read -r baseDir; do
    cd "$baseDir" || continue;
    find . -type f | grep -v "\.\/\." | sort | #head -n100 |
    while read -r file; do
        length=$(expr index "$file" "_");
        name=${file:2:$((length - 3))};
        dirkey=${baseDir//[\/,.]/_}"_"$name;
        if [ "${copied[$dirkey]}" != "true" ]; then
            echo "Copying ${baseDir}/${name} to ${baseDir//data/data4}/${name}";
            mkdir -p "${baseDir//data/data4}/${name}";
            oldName="${baseDir}/${name}_*";
            cp -n $oldName "${baseDir//data/data4}/${name}/";
            copied[$dirkey]="true";
        fi
    done;
done
No awk, no sed, better quoting, no temporary files written to disc, less grep. I'm not sure whether the dirkey hack is still necessary now that the associative array is working properly, and I don't entirely understand why I need the oldName variable.
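On the oldName question: the variable holds a glob pattern, not a filename, and the pattern only expands to matching files when the variable is used unquoted at the cp call. A quick demo (using a throwaway directory created with mktemp):

```shell
#!/bin/bash
dir=$(mktemp -d)                  # throwaway directory for the demo
touch "$dir/X_1.gz" "$dir/X_2.gz"
pat="$dir/X_*"
echo $pat      # unquoted: the shell expands the glob to both matching files
echo "$pat"    # quoted: stays the literal pattern, no expansion
rm -rf "$dir"
```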