
Hi,

I have around 20000 files coming from the output of some program, and their names follow the format:

data1.txt
data2.txt
...
data99.txt
data100.txt
...
data999.txt
data1000.txt
...
data20000.txt

I would like to write a script that takes a number N as its input argument and concatenates the files in blocks of N, so if N=5 it would create the following new files:

data_new_1.txt: it would contain (concatenated) data1.txt to data5.txt (like cat data1.txt data2.txt ...> data_new_1.txt )

data_new_2.txt: it would contain (concatenated) data6.txt to data10.txt
.....

I wonder what you think would be the best approach to do this: bash, Python, or something else like awk or Perl?

By "best approach" I mean the one with the simplest code.

Thanks

A: 

Since this can easily be done in any shell I would simply use that.

This should do it:

#!/bin/sh
FILES=$1
FILENO=1

# Note: relies on the shell's glob order; see the comments below about
# data19.txt sorting before data2.txt.
for i in data[0-9]*.txt; do
    cat "$i" >> "data_new_${FILENO}.txt"
    FILES=`expr $FILES - 1`
    if [ $FILES -eq 0 ]; then
        # Block is full: move on to the next output file and reset the counter.
        FILENO=`expr $FILENO + 1`
        FILES=$1
    fi
done

Python version:

#!/usr/bin/env python

import os
import sys

if __name__ == '__main__':
    files_per_file = int(sys.argv[1])

    i = 0
    while True:
        i += 1
        source_file = 'data%d.txt' % i
        if os.path.isfile(source_file):
            # Files 1..N go into data_new_1.txt, N+1..2N into data_new_2.txt, ...
            dest_file = 'data_new_%d.txt' % ((i - 1) // files_per_file + 1)
            # Append mode; the handles aren't stored anywhere, so CPython
            # closes them as soon as each statement finishes (see the comments).
            file(dest_file, 'a').write(file(source_file).read())
        else:
            break
WoLpH
I am sure both languages will be limited by the speed of IO.
Joe Koberg
Spawning 20,000 `cat` processes in the bash version surely can't _accelerate_ it compared to a Python version working in 1 process;-).
Alex Martelli
This program has a bug: `data[0-9]*.txt` will match `data19.txt` before `data2.txt`.
Jason Orendorff
yes Jason, you are totally right
Werner
@Jason: How it's sorted would depend on your shell, perhaps it should be sorted first indeed :) @Alex: very true... I hadn't thought of the cat overhead...
WoLpH
Yep, the bug's in both versions, as they both use globbing.. In my Python version, and @sorpigal's bash version, integer counters are used in lieu of globbing, to bypass the problem.
Alex Martelli
@WoLpH, globbing guarantees alphabetical sorting in both bash and Python's glob module.
Alex Martelli
You're correct Alex, I've replaced globbing by a simple range in the Python version, although it will break horribly in case of missing files ;)
WoLpH
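For readers who would rather keep globbing than switch to a counter, here is a minimal sketch of sorting the glob matches numerically; the numbered_files helper is purely illustrative and not part of any answer above:

import glob
import re

def numbered_files(pattern='data[0-9]*.txt'):
    # Sort by the number embedded in each name, so data2.txt
    # comes before data19.txt instead of after it.
    return sorted(glob.glob(pattern),
                  key=lambda name: int(re.search(r'\d+', name).group()))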
Hi. I see that you never close any files, and you reopen data_new files multiple times. Does that matter? Is python smart enough to keep track of what files are open, flush the buffers that need to be flushed, etc.?
MiniQuark
Yes, the Python garbage collector will take care of that. Since the file handles aren't stored, they are closed automatically as soon as they've been used.
WoLpH
I see your point, but I thought that the GC wasn't guaranteed to run at any specific moment in time. Are you sure that the file is guaranteed to be closed before the next iteration in the for loop takes place?
MiniQuark
There's an interesting discussion about this question here: http://stackoverflow.com/questions/1832528/is-close-necessary-when-using-iterator-on-a-python-file-object
MiniQuark
There is indeed no guarantee that it's closed at the time you'd like; however, with code similar to this in test scripts I haven't had problems in the past yet. Although that could have been pure luck of course :)
WoLpH
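A side note for readers who would rather not rely on CPython's reference counting at all: the same copy step can be written with deterministic closing. This is only a sketch; the append_file name is made up here, and the nested with blocks keep it compatible with Python 2.6 (and 2.5 with the from __future__ import with_statement line):

def append_file(source_file, dest_file):
    # Both handles are closed the moment their with blocks exit,
    # no matter when the garbage collector runs.
    with open(source_file, 'rb') as src:
        with open(dest_file, 'ab') as dst:
            dst.write(src.read())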
+1  A: 

Best in what sense? Bash can do this quite well, but it may be harder for you to write a good bash script if you are more familiar with another scripting language. Do you want to optimize for something specific?

That said, here's a bash implementation:

 declare blocksize=5      # N; could also be taken from $1
 declare i=1
 declare blockstart=1
 declare blockend=$blocksize
 declare -a fileset
 while [ -f data${i}.txt ] ; do
         fileset=("${fileset[@]}" data${i}.txt)
         i=$(($i + 1))
         if [ $i -gt $blockend ] ; then
                  # Output is named after the first file in the block
                  # (data_new_1.txt, data_new_6.txt, ...).
                  cat "${fileset[@]}" > data_new_${blockstart}.txt
                  fileset=() # clear
                  blockstart=$(($blockstart + $blocksize))
                  blockend=$(($blockend + $blocksize))
         fi
 done
 # Flush whatever is left over if the file count isn't a multiple of blocksize.
 if [ ${#fileset[@]} -gt 0 ] ; then
         cat "${fileset[@]}" > data_new_${blockstart}.txt
 fi

EDIT: I see you now say "Best" == "Simplest code", but what's simple depends on you. For me Perl is simpler than Python, for some Awk is simpler than bash. It depends on what you know best.

EDIT again: inspired by dtmilano, I've changed mine to use cat once per blocksize, so now cat will be called 'only' 4000 times.

Sorpigal
yes, exactly. but in this case your code is rather simple, thanks
Werner
+4  A: 

Here's a Python (2.6) version (if you have Python 2.5, add a first line that says

from __future__ import with_statement

and the script will also work)...:

import sys

def main(N):
    rN = range(N)
    # iin is the number of the first input file in each block,
    # iout the zero-based number of the output file.
    for iout, iin in enumerate(xrange(1, 99999, N)):
        with open('data_new_%s.txt' % (iout + 1), 'w') as out:
            for di in rN:
                # A missing input file means we've run out of data files.
                try: fin = open('data%s.txt' % (iin + di), 'r')
                except IOError: return
                out.write(fin.read())
                fin.close()

if __name__ == '__main__':
    if len(sys.argv) > 1:
        N = int(sys.argv[1])
    else:
        N = 5
    main(N)

As you can see from other answers & comments, opinions on performance differ. Some believe that Python's startup (and module imports) will make this slower than bash, but the import part at least is bogus: sys, the only module needed, is built in, requires no "loading", and adds essentially negligible overhead. I suspect the repeated fork/exec of cat is what may slow bash down, while others think that I/O will dominate anyway, making the two solutions equivalent. You'll have to benchmark with your own files, on your own system, to settle the question.

Alex Martelli
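Since opinions differ, a rough way to settle it is to time both scripts on the real data. Here is a sketch, assuming the shell and Python versions were saved as concat.sh and concat.py (both names are placeholders):

import subprocess
import time

# Placeholder script names; point these at wherever you saved each version.
for cmd in (['sh', 'concat.sh', '5'], ['python', 'concat.py', '5']):
    start = time.time()
    subprocess.call(cmd)
    # Remember to remove the data_new_*.txt files between runs.
    print '%s took %.1f seconds' % (cmd[1], time.time() - start)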
Your `fin.read()` wouldn't happen to load everything into memory, now would it? :)
vladr
Not "everything", of course -- just the contents of the current data file. If you're dealing with individual files bigger than a gigabyte, or whatever you can comfortably load in memory (many gigabytes on a well-RAM-equipped 64-bit server), it's easy to constrain the buffer to whatever you need (1GB, 16GB, whatever), see the `shutil` standard library module for the lazy way, or just code out the three-lines loop;-).
Alex Martelli
It's the three-line loop I was aiming at. ;) By "everything", of course, I meant "everything" in `fin`. :)
vladr
@Vlad, `while True:\n data=fin.read(BUFSIZ)\n if not data:break\n out.write(data)` (weak formatting, but that's what you have to do in SO comments;-). When dealing with 20,000 files, though, I wouldn't worry about them being more than a few GB each, so wouldn't bother w/this loop;-).
Alex Martelli
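Spelled out, the chunked-copy loop Alex sketches in that last comment looks like this; copy_chunked is just an illustrative name, and shutil.copyfileobj from the standard library does the same job:

def copy_chunked(fin, out, bufsiz=16 * 1024 * 1024):
    # Copy fin into out one chunk at a time, so no source file
    # is ever held in memory in full.
    while True:
        data = fin.read(bufsiz)
        if not data:
            break
        out.write(data)

In the answer's main(), the out.write(fin.read()) line would then become copy_chunked(fin, out).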
A: 

Let's say you have a simple script that concatenates files and keeps a counter for you, like the following:

#!/usr/bin/bash
COUNT=0
if [ -f counter ]; then
  COUNT=`cat counter`
fi
COUNT=$((COUNT + 1))
echo $COUNT > counter
# Concatenate whatever files were passed in into the next numbered output file.
cat "$@" > $COUNT.data

Then a command line will do it (GNU find, sort and xargs assumed; sort -V keeps data2.txt ahead of data19.txt):

find . -name "data[0-9]*.txt" -type f -print0 | sort -z -V | xargs -0 -n 5 path_to_the_script
Codism
+1  A: 

I like this one, which saves on spawning processes: only one cat per block.

#! /bin/bash

N=5 # block size
S=1 # start
E=20000 # end

for n in $(seq $S $N $E)
do
    CMD="cat "
    i=$n
    while [ $i -lt $((n + N)) ]
    do
        CMD+="data$((i++)).txt "
    done
    $CMD > data_new_$((n / N + 1)).txt
done
dtmilano
A: 

Simple enough?

make_cat.py

limit = 20000   # total number of input files
n = 5
for i in xrange( 0, (limit+n-1)//n ):
     names = [ "data{0}.txt".format(j) for j in range(i*n+1, i*n+n+1) ]
     print "cat {0} >data_new_{1}.txt".format( " ".join(names), i+1 )

Script

python make_cat.py | sh
S.Lott
why do you have to make calls to the shell? You can do everything in Python.
ghostdog74
"The best approach I mean in terms of simplest code." I think that's a perfectly rotten definition of "best". That's why I wrote such a peculiar piece of code. I think it meets the definition. And demonstrates that "simplest code" is not always desirable.
S.Lott
+1  A: 

How about a one-liner? :)

ls data[0-9]*txt|sort -nk1.5|awk 'BEGIN{rn=5;i=1}{while((getline _<$0)>0){print _ >"data_new_"i".txt"}close($0)}NR%rn==0{i++}'
ghostdog74
Hi, I realize now that your approach is the best. Just one thing: for more than 20000 files I get errors; how can you use your script with xargs? Thanks
Werner
`ls file* |xargs -n1 | sort...`
ghostdog74