
Hi,

I have around 20000 files coming from the output of some program, and their names follow the format:

data1.txt
data2.txt
...
data99.txt
data100.txt
...
data999.txt
data1000.txt
...
data20000.txt

I would like to write a script that takes a number N as its input argument and concatenates the files in blocks of N, so if N=5 it would create the following new files:

data_new_1.txt: it would contain (concatenated) data1.txt to data5.txt (like cat data1.txt data2.txt ...> data_new_1.txt )

data_new_2.txt: it would contain (concatenated) data6.txt to data10.txt
.....

I wonder what you think would be the best approach to do this: bash, Python, or something else like awk or Perl?

By "best approach" I mean the one with the simplest code.

Thanks

A: 

Since this can easily be done in any shell I would simply use that.

This should do it:

#!/bin/sh
FILES=$1
FILENO=1

# Note: relies on the shell's glob order; see the comments below about
# data19.txt sorting before data2.txt.
for i in data[0-9]*.txt; do
    cat "$i" >> "data_new_${FILENO}.txt"
    FILES=`expr $FILES - 1`
    if [ $FILES -eq 0 ]; then
        # Block is full: move on to the next output file and reset the counter.
        FILENO=`expr $FILENO + 1`
        FILES=$1
    fi
done

Python version:

#!/usr/bin/env python

import os
import sys

if __name__ == '__main__':
    files_per_file = int(sys.argv[1])

    i = 0
    while True:
        i += 1
        source_file = 'data%d.txt' % i
        if os.path.isfile(source_file):
            # Files 1..N go into data_new_1.txt, N+1..2N into data_new_2.txt, ...
            dest_file = 'data_new_%d.txt' % ((i - 1) // files_per_file + 1)
            # Append mode; the handles aren't stored anywhere, so CPython
            # closes them as soon as each statement finishes (see the comments).
            file(dest_file, 'a').write(file(source_file).read())
        else:
            break
WoLpH
I am sure both languages will be limited by the speed of IO.
Joe Koberg
Spawning 20,000 `cat` processes in the bash version surely can't _accelerate_ it compared to a Python version working in 1 process;-).
Alex Martelli
This program has a bug: `data[0-9]*.txt` will match `data19.txt` before `data2.txt`.
Jason Orendorff
yes Jason, you are totally right
Werner
@Jason: How it's sorted would depend on your shell, perhaps it should be sorted first indeed :) @Alex: very true... I hadn't thought of the cat overhead...
WoLpH
Yep, the bug's in both versions, as they both use globbing.. In my Python version, and @sorpigal's bash version, integer counters are used in lieu of globbing, to bypass the problem.
Alex Martelli
@WoLpH, globbing guarantees alphabetical sorting in both bash and Python's glob module.
Alex Martelli
You're correct Alex, I've replaced globbing by a simple range in the Python version, although it will break horribly in case of missing files ;)
WoLpH
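For readers who would rather keep globbing than switch to a counter, here is a minimal sketch of sorting the glob matches numerically; the numbered_files helper is purely illustrative and not part of any answer above:

import glob
import re

def numbered_files(pattern='data[0-9]*.txt'):
    # Sort by the number embedded in each name, so data2.txt
    # comes before data19.txt instead of after it.
    return sorted(glob.glob(pattern),
                  key=lambda name: int(re.search(r'\d+', name).group()))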
Hi. I see that you never close any files, and you reopen data_new files multiple times. Does that matter? Is python smart enough to keep track of what files are open, flush the buffers that need to be flushed, etc.?
MiniQuark
Yes, the Python garbage collector will take care of that. Since the file handles aren't stored, they are closed automatically as soon as they've been used.
WoLpH
I see your point, but I thought that the GC wasn't guaranteed to run at any specific moment in time. Are you sure that the file is guaranteed to be closed before the next iteration in the for loop takes place?
MiniQuark
There's an interesting discussion about this question here: http://stackoverflow.com/questions/1832528/is-close-necessary-when-using-iterator-on-a-python-file-object
MiniQuark
There is indeed no guarantee that it's closed at the time you'd like; however, with code similar to this in test scripts I haven't had problems in the past yet. Although that could have been pure luck of course :)
WoLpH
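A side note for readers who would rather not rely on CPython's reference counting at all: the same copy step can be written with deterministic closing. This is only a sketch; the append_file name is made up here, and the nested with blocks keep it compatible with Python 2.6 (and 2.5 with the from __future__ import with_statement line):

def append_file(source_file, dest_file):
    # Both handles are closed the moment their with blocks exit,
    # no matter when the garbage collector runs.
    with open(source_file, 'rb') as src:
        with open(dest_file, 'ab') as dst:
            dst.write(src.read())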
+1  A: 

Best in what sense? Bash can do this quite well, but it may be harder for you to write a good bash script if you are more familiar with another scripting language. Do you want to optimize for something specific?

That said, here's a bash implementation:

 declare blocksize=5      # N; could also be taken from $1
 declare i=1
 declare blockstart=1
 declare blockend=$blocksize
 declare -a fileset
 while [ -f data${i}.txt ] ; do
         fileset=("${fileset[@]}" data${i}.txt)
         i=$(($i + 1))
         if [ $i -gt $blockend ] ; then
                  # Output is named after the first file in the block
                  # (data_new_1.txt, data_new_6.txt, ...).
                  cat "${fileset[@]}" > data_new_${blockstart}.txt
                  fileset=() # clear
                  blockstart=$(($blockstart + $blocksize))
                  blockend=$(($blockend + $blocksize))
         fi
 done
 # Flush whatever is left over if the file count isn't a multiple of blocksize.
 if [ ${#fileset[@]} -gt 0 ] ; then
         cat "${fileset[@]}" > data_new_${blockstart}.txt
 fi

EDIT: I see you now say "Best" == "Simplest code", but what's simple depends on you. For me Perl is simpler than Python, for some Awk is simpler than bash. It depends on what you know best.

EDIT again: inspired by dtmilano, I've changed mine to use cat once per blocksize, so now cat will be called 'only' 4000 times.

Sorpigal
yes, exactly. but in this case your code is rather simple, thanks
Werner
+4  A: 

Here's a Python (2.6) version (if you have Python 2.5, add a first line that says

from __future__ import with_statement

and the script will also work)...:

import sys

def main(N):
    rN = range(N)
    # iin is the number of the first input file in each block,
    # iout the zero-based number of the output file.
    for iout, iin in enumerate(xrange(1, 99999, N)):
        with open('data_new_%s.txt' % (iout + 1), 'w') as out:
            for di in rN:
                # A missing input file means we've run out of data files.
                try: fin = open('data%s.txt' % (iin + di), 'r')
                except IOError: return
                out.write(fin.read())
                fin.close()

if __name__ == '__main__':
    if len(sys.argv) > 1:
        N = int(sys.argv[1])
    else:
        N = 5
    main(N)

As you can see from other answers & comments, opinions on performance differ. Some believe that Python's startup (and module imports) will make this slower than bash, but the import part at least is bogus: sys, the only module needed, is built in, requires no "loading", and adds essentially negligible overhead. I suspect the repeated fork/exec of cat is what may slow bash down, while others think that I/O will dominate anyway, making the two solutions equivalent. You'll have to benchmark with your own files, on your own system, to settle the question.

Alex Martelli
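Since opinions differ, a rough way to settle it is to time both scripts on the real data. Here is a sketch, assuming the shell and Python versions were saved as concat.sh and concat.py (both names are placeholders):

import subprocess
import time

# Placeholder script names; point these at wherever you saved each version.
for cmd in (['sh', 'concat.sh', '5'], ['python', 'concat.py', '5']):
    start = time.time()
    subprocess.call(cmd)
    # Remember to remove the data_new_*.txt files between runs.
    print '%s took %.1f seconds' % (cmd[1], time.time() - start)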
Your `fin.read()` wouldn't happen to load everything into memory, now would it? :)
vladr
Not "everything", of course -- just the contents of the current data file. If you're dealing with individual files bigger than a gigabyte, or whatever you can comfortably load in memory (many gigabytes on a well-RAM-equipped 64-bit server), it's easy to constrain the buffer to whatever you need (1GB, 16GB, whatever), see the `shutil` standard library module for the lazy way, or just code out the three-lines loop;-).
Alex Martelli
It's the three-line loop I was aiming at. ;) By "everything", of course, I meant "everything" in `fin`. :)
vladr
@Vlad, `while True:\n data=fin.read(BUFSIZ)\n if not data:break\n out.write(data)` (weak formatting, but that's what you have to do in SO comments;-). When dealing with 20,000 files, though, I wouldn't worry about them being more than a few GB each, so wouldn't bother w/this loop;-).
Alex Martelli
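Spelled out, the chunked-copy loop Alex sketches in that last comment looks like this; copy_chunked is just an illustrative name, and shutil.copyfileobj from the standard library does the same job:

def copy_chunked(fin, out, bufsiz=16 * 1024 * 1024):
    # Copy fin into out one chunk at a time, so no source file
    # is ever held in memory in full.
    while True:
        data = fin.read(bufsiz)
        if not data:
            break
        out.write(data)

In the answer's main(), the out.write(fin.read()) line would then become copy_chunked(fin, out).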
A: 

Let's say you have a simple script that concatenates files and keeps a counter for you, like the following:

#!/usr/bin/bash
COUNT=0
if [ -f counter ]; then
  COUNT=`cat counter`
fi
COUNT=$((COUNT + 1))
echo $COUNT > counter
# Concatenate whatever files were passed in into the next numbered output file.
cat "$@" > $COUNT.data

Then a command line will do it (GNU find, sort and xargs assumed; sort -V keeps data2.txt ahead of data19.txt):

find . -name "data[0-9]*.txt" -type f -print0 | sort -z -V | xargs -0 -n 5 path_to_the_script
Codism
+1  A: 

I like this one, which saves on spawning processes: only one cat per block.

#! /bin/bash

N=5 # block size
S=1 # start
E=20000 # end

for n in $(seq $S $N $E)
do
    CMD="cat "
    i=$n
    while [ $i -lt $((n + N)) ]
    do
        CMD+="data$((i++)).txt "
    done
    $CMD > data_new_$((n / N + 1)).txt
done
dtmilano
A: 

Simple enough?

make_cat.py

limit = 20000   # total number of input files
n = 5
for i in xrange( 0, (limit+n-1)//n ):
     names = [ "data{0}.txt".format(j) for j in range(i*n+1, i*n+n+1) ]
     print "cat {0} >data_new_{1}.txt".format( " ".join(names), i+1 )

Script

python make_cat.py | sh
S.Lott
why do you have to make calls to the shell? You can do everything in Python.
ghostdog74
"The best approach I mean in terms of simplest code." I think that's a perfectly rotten definition of "best". That's why I wrote such a peculiar piece of code. I think it meets the definition. And demonstrates that "simplest code" is not always desirable.
S.Lott
+1  A: 

How about a one-liner? :)

ls data[0-9]*txt|sort -nk1.5|awk 'BEGIN{rn=5;i=1}{while((getline _<$0)>0){print _ >"data_new_"i".txt"}close($0)}NR%rn==0{i++}'
ghostdog74
Hi, I realize now that your approach is the best. Just one thing: for more than 20000 files I get errors; how can you use your script with xargs? Thanks
Werner
`ls file* |xargs -n1 | sort...`
ghostdog74