A common task I have to perform is an SQL-like JOIN on two text files, i.e. creating a new file from the "left hand" and "right hand" files by joining on an identifier column they share. Variations such as outer joins are sometimes required.

Of course I could write a simple script to do this in a generic way, but is there a Python module, built-in or installable, that can do this? Something that can handle huge files would be ideal.

EDIT:

  • I'm aware of PyTables, but is that the simplest solution for flat text files?
  • By "huge files" I mean that sometimes the "left hand" file is too large to be stored in memory.
  • The lack (so far) of a Python answer worries me. Am I using the wrong tool/paradigm for this? The reason I asked for a Python lib is to make it easy to add other transformations on each line (validating identifiers etc).
+1  A: 

[wild idea]

Will these files fit into your system's memory and still leave enough free? In that case you can load them into tables using SQLite and then join them to your heart's content using proper SQL.

[/wild idea]

Update

Scratch that. The OP has said that one of the files is too large to be stored in memory. See the answer by @Dave Kirby: SQLite can also be used with an on-disk database.

Manoj Govindan
A: 

If you are using a Unix-like system or Cygwin then take a look at the join command. It may do exactly what you are asking.

[26] % join --help
Usage: join [OPTION]... FILE1 FILE2
For each pair of input lines with identical join fields, write a line to
standard output.  The default join field is the first, delimited
by whitespace.  When FILE1 or FILE2 (not both) is -, read standard input.

  -a FILENUM        print unpairable lines coming from file FILENUM, where
                      FILENUM is 1 or 2, corresponding to FILE1 or FILE2
  -e EMPTY          replace missing input fields with EMPTY
  -i, --ignore-case ignore differences in case when comparing fields
  -j FIELD          equivalent to `-1 FIELD -2 FIELD'
  -o FORMAT         obey FORMAT while constructing output line
  -t CHAR           use CHAR as input and output field separator
  -v FILENUM        like -a FILENUM, but suppress joined output lines
  -1 FIELD          join on this FIELD of file 1
  -2 FIELD          join on this FIELD of file 2
      --help     display this help and exit
      --version  output version information and exit

Unless -t CHAR is given, leading blanks separate fields and are ignored,
else fields are separated by CHAR.  Any FIELD is a field number counted
from 1.  FORMAT is one or more comma or blank separated specifications,
each being `FILENUM.FIELD' or `0'.  Default FORMAT outputs the join field,
the remaining fields from FILE1, the remaining fields from FILE2, all
separated by CHAR.

Important: FILE1 and FILE2 must be sorted on the join fields.

Report bugs to <[email protected]>.
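
To show why join insists on sorted input, here is a rough Python sketch of the same merge-join idea. It is not how join itself is implemented; merge_join is a made-up name, and the sketch assumes both inputs are already sorted on their first field in Python's string ordering (which can differ from the locale ordering that sort and join use). Only the inner join is covered.

```python
import itertools

def merge_join(left_lines, right_lines, sep=None):
    """Inner-join two iterables of lines, both pre-sorted on their first field.

    Hypothetical helper for illustration; sep=None splits on whitespace.
    """
    def keyed(lines):
        # Split each non-blank line into fields, then group consecutive
        # rows that share the same first field (the join key).
        rows = (line.rstrip("\n").split(sep) for line in lines if line.strip())
        return itertools.groupby(rows, key=lambda fields: fields[0])

    done = (None, None)
    left, right = keyed(left_lines), keyed(right_lines)
    lkey, lgroup = next(left, done)
    rkey, rgroup = next(right, done)
    while lgroup is not None and rgroup is not None:
        if lkey < rkey:
            # Left key has no partner: skip it (inner join).
            lkey, lgroup = next(left, done)
        elif lkey > rkey:
            rkey, rgroup = next(right, done)
        else:
            # Same key on both sides: emit every left/right combination.
            right_rows = list(rgroup)
            for lrow in lgroup:
                for rrow in right_rows:
                    yield [lkey] + lrow[1:] + rrow[1:]
            lkey, lgroup = next(left, done)
            rkey, rgroup = next(right, done)
```

Because each side is read strictly front to back, only the rows for the current key are held in memory, which is what makes this approach workable for huge files once they are sorted.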

If you want something more sophisticated or you absolutely have to do it in Python then consider reading the files into an in-memory SQLite database. You then have the full power of SQL to merge and manipulate the data.

Edit: I just read that the files are too big to fit in memory. You can still use SQLite, but create a temporary on-disk database instead.
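
A minimal sketch of the on-disk variant, using only the standard library's sqlite3 module. The function names (load, sqlite_join), the table names and the file layout are my own assumptions: each line is taken to be "key<TAB>rest of line", with the join key in the first column. Treat it as a starting point rather than a finished tool.

```python
import os
import sqlite3
import tempfile

def load(conn, table, path, sep="\t"):
    # Hypothetical helper: store each line as (key, rest-of-line).
    conn.execute(f"CREATE TABLE {table} (key TEXT, rest TEXT)")
    with open(path, encoding="utf-8") as fh:
        rows = (line.rstrip("\n").partition(sep)[::2] for line in fh if line.strip())
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    # Index the key so the join does not have to scan the whole table per row.
    conn.execute(f"CREATE INDEX {table}_key ON {table} (key)")
    conn.commit()

def sqlite_join(left_path, right_path, out_path, outer=False, sep="\t"):
    # The database lives in a temporary file, so the data need not fit in RAM.
    fd, db_path = tempfile.mkstemp(suffix=".sqlite")
    os.close(fd)
    conn = sqlite3.connect(db_path)
    try:
        load(conn, "l", left_path, sep)
        load(conn, "r", right_path, sep)
        kind = "LEFT JOIN" if outer else "JOIN"
        query = (f"SELECT l.key, l.rest, r.rest FROM l {kind} r "
                 "ON l.key = r.key ORDER BY l.rowid")
        with open(out_path, "w", encoding="utf-8") as out:
            for key, lrest, rrest in conn.execute(query):
                # Unmatched left rows (outer join) get an empty right-hand part.
                out.write(sep.join([key, lrest, rrest or ""]) + "\n")
    finally:
        conn.close()
        os.remove(db_path)
```

Calling sqlite_join("left.txt", "right.txt", "joined.txt", outer=True) would write a left outer join; the file names here are placeholders. Neither input needs to be sorted, and the per-line loop in load is a natural place to validate or transform identifiers before they are inserted.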

Dave Kirby
I considered join, but it has the limitation that the files have to be sorted.
pufferfish
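
If only the "left hand" file is too large for memory and the "right hand" file fits, one way around the sorting requirement is a hash join: index the smaller file in a dict and stream the larger one past it. A rough sketch under those assumptions (hash_join is a made-up name, and lines are assumed to be "key<TAB>rest of line"):

```python
from collections import defaultdict

def hash_join(left_lines, right_lines, sep="\t", outer=False):
    """Join a streamed left input against a right input held in memory."""
    # Build an in-memory index of the (smaller) right-hand file.
    index = defaultdict(list)
    for line in right_lines:
        if not line.strip():
            continue
        key, _, rest = line.rstrip("\n").partition(sep)
        index[key].append(rest)

    # Stream the (huge) left-hand file one line at a time.
    for line in left_lines:
        if not line.strip():
            continue
        key, _, rest = line.rstrip("\n").partition(sep)
        matches = index.get(key)
        if matches:
            for right_rest in matches:
                yield sep.join([key, rest, right_rest])
        elif outer:
            # Left outer join: keep unmatched left rows with an empty right part.
            yield sep.join([key, rest, ""])
```

Open file objects can be passed directly as left_lines and right_lines, so the left file is never loaded whole, and any per-line validation can be added inside the second loop.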