views: 77
answers: 6

I have a tar.gz file about 13GB in size. It contains about 1.2 million documents. When I untar it, all of these files sit in one single directory, and any read from this directory takes ages. Is there any way I can split the files from the tar into multiple new folders?

E.g., I would like to create new folders named [1,2,...], each containing 1000 files.

A: 

You can look at the man page and see if there are options like that. Worst comes to worst, just extract the files you need (maybe using --exclude) and put them into your folders.
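
For instance, a minimal sketch (the .log pattern is hypothetical, just to illustrate --exclude):

tar -xzf archive.tar.gz --exclude='*.log'   # extract everything except .log files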

ghostdog74
+1  A: 
  • Obtain the filename list with --list
  • Create files containing subsets of the filenames with grep
  • Untar only those files using --files-from

Thus:

tar --list --file=archive.tar > allfiles.txt   # list every member name
grep '^1' allfiles.txt > files1.txt            # e.g. names starting with "1"
tar -xvf archive.tar --files-from=files1.txt   # extract only that subset
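
To generate the per-folder lists automatically instead of by grep pattern, a rough sketch using GNU split (the chunk_ prefix and the numbered folder names are my own choices, and note that each pass re-reads the archive):

tar --list --file=archive.tar > allfiles.txt
split -l 1000 -d -a 4 allfiles.txt chunk_   # chunk_0000, chunk_0001, ...
n=1
for f in chunk_*
do
    mkdir "$n"
    tar -xf archive.tar -C "$n" --files-from="$f"
    n=$((n+1))
done
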
Sjoerd
A: 

tar doesn't provide that capability directly. It only restores its files into the same structure from which it was originally generated.

Can you modify the source directory to create the desired structure there and then tar the tree? If not, you could untar the files as they are and then post-process the resulting directory with a script that moves the files into the desired arrangement. Given the number of files, this will take some time, but at least it can be done in the background.
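
As a rough sketch of such a post-processing script (the flat/ source directory and the 0-based folder names are assumptions):

cd flat                             # the directory holding all the extracted files
i=0
dir=0
mkdir $dir
for f in *
do
    [ -f "$f" ] || continue         # skip the numbered folders created below
    mv "$f" $dir/
    i=$((i+1))
    if [ $i -eq 1000 ]              # start a new folder every 1000 files
    then
        i=0
        dir=$((dir+1))
        mkdir $dir
    fi
done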

sizzzzlerz
+1  A: 

This is a quick and dirty solution, but it does the job in Bash without using any temporary files.

i=0                                 # file counter
dir=0                               # folder name counter
mkdir $dir
tar -tzvf YOURFILE.tar.gz |
cut -d ' ' -f12 |                   # get the filenames contained in the archive
while IFS= read -r filename
    do
        tar -C $dir -xvzf YOURFILE.tar.gz "$filename"
        i=$((i+1))
        if [ $i -eq 1000 ]          # start a new folder after every 1000 files
        then
            i=0                     # reset the file counter
            dir=$((dir+1))
            mkdir $dir
        fi
    done

Same as a one-liner:

i=0; dir=0; mkdir $dir; tar -tzvf YOURFILE.tar.gz | cut -d ' ' -f12 | while IFS= read -r filename; do tar -C $dir -xvzf YOURFILE.tar.gz "$filename"; i=$((i+1)); if [ $i -eq 1000 ]; then i=0; dir=$((dir+1)); mkdir $dir; fi; done

Depending on your shell settings, the "cut -d ' ' -f12" part, which retrieves the last column (the filename) of tar's contents output, could cause a problem, and you would have to modify it.
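
One way around that fragility, for what it's worth: dropping -v makes tar print only the member names, so the cut stage becomes unnecessary:

tar -tzf YOURFILE.tar.gz |          # -t without -v lists bare names, one per line
while IFS= read -r filename
do
    echo "$filename"                # substitute the extraction loop body from above
done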

It worked with 1000 files, but since you have 1.2 million documents in the archive, consider testing this with something smaller first.
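
To test on something smaller, you could build a throwaway archive first (all names here are arbitrary):

mkdir testsrc
touch testsrc/doc{0001..3000}       # 3000 empty stand-in files
tar -czf TEST.tar.gz -C testsrc .   # then run the script above against TEST.tar.gz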

lecodesportif
Thanks all. lecodesportif's solution was the most ready-made for my need!
MovieYoda
+1  A: 

If you have GNU tar, you might be able to make use of the --checkpoint and --checkpoint-action options. I have not tested this, but I'm thinking of something like:

# UNTESTED
cd /base/dir
mkdir $(printf "dest%u\n" {0..1500})    # probably more than you need
ln -s dest0 linkname                    # tar always extracts through this symlink
tar -C linkname ... --checkpoint=1000 \
        --checkpoint-action='sleep=1' \
        --checkpoint-action='exec=ln -snf dest%u linkname' ...
Dennis Williamson