ansaurus

Question

SQL-like JOIN on two text files in Python, is there a built-in way?

Answer 1

+1 A:

[wild idea]

Will these files fit into your system's memory with room to spare? If so, you can load them into tables using SQLite and then join them to your heart's content using proper SQL.

[/wild idea]

Update

~~Scratch it. The OP has said that one of the files is too large to be stored in memory.~~ See this answer by @Dave Kirby: SQLite can also be used with an on-disk database.

Manoj Govindan 2010-09-02 11:47:57

Answer 2

A:

If you are using a Unix-like system or Cygwin, take a look at the join command; it may do exactly what you are asking.

[26] % join --help
Usage: join [OPTION]... FILE1 FILE2
For each pair of input lines with identical join fields, write a line to
standard output.  The default join field is the first, delimited
by whitespace.  When FILE1 or FILE2 (not both) is -, read standard input.

  -a FILENUM        print unpairable lines coming from file FILENUM, where
                      FILENUM is 1 or 2, corresponding to FILE1 or FILE2
  -e EMPTY          replace missing input fields with EMPTY
  -i, --ignore-case ignore differences in case when comparing fields
  -j FIELD          equivalent to `-1 FIELD -2 FIELD'
  -o FORMAT         obey FORMAT while constructing output line
  -t CHAR           use CHAR as input and output field separator
  -v FILENUM        like -a FILENUM, but suppress joined output lines
  -1 FIELD          join on this FIELD of file 1
  -2 FIELD          join on this FIELD of file 2
      --help     display this help and exit
      --version  output version information and exit

Unless -t CHAR is given, leading blanks separate fields and are ignored,
else fields are separated by CHAR.  Any FIELD is a field number counted
from 1.  FORMAT is one or more comma or blank separated specifications,
each being `FILENUM.FIELD' or `0'.  Default FORMAT outputs the join field,
the remaining fields from FILE1, the remaining fields from FILE2, all
separated by CHAR.

Important: FILE1 and FILE2 must be sorted on the join fields.

Report bugs to <[email protected]>.
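
If the same behaviour is needed from Python, the core of what join does is a merge join over two sorted inputs. The following is only a minimal sketch under stated assumptions, not a reimplementation of the tool: `merge_join` and `keyed_groups` are hypothetical helper names, the join field is assumed to be the first field of each line, and both inputs are assumed to be sorted in plain code-point order (what `LC_ALL=C sort` would produce).

```python
from itertools import groupby


def keyed_groups(lines, sep=None):
    """Group consecutive lines that share the same first field.

    Hypothetical helper: yields (key, [remaining fields of each line]).
    With sep=None, fields are split on runs of whitespace.
    """
    rows = (line.rstrip("\n").split(sep) for line in lines if line.strip())
    for key, group in groupby(rows, key=lambda row: row[0]):
        yield key, [row[1:] for row in group]


def merge_join(left_lines, right_lines, sep=None):
    """Inner-join two iterables of lines that are sorted on their first field.

    Only one group of equal keys per input is held in memory at a time,
    so file objects can be passed in directly.
    """
    left = keyed_groups(left_lines, sep)
    right = keyed_groups(right_lines, sep)
    l = next(left, None)
    r = next(right, None)
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = next(left, None)       # left key has no partner, skip it
        elif l[0] > r[0]:
            r = next(right, None)      # right key has no partner, skip it
        else:
            # Matching keys: emit every left/right combination.
            for lrow in l[1]:
                for rrow in r[1]:
                    yield [l[0]] + lrow + rrow
            l = next(left, None)
            r = next(right, None)
```

Because the function only advances through each input once, it depends entirely on the sort order; unsorted input silently produces missing rows rather than an error.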

If you want something more sophisticated, or you absolutely have to do it in Python, then consider reading the files into an in-memory SQLite database; you then have the full power of SQL to merge and manipulate the data.

Edit: I just read that the files are too big to fit in memory. You can still use SQLite, but create a temporary on-disk database.
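
A rough sketch of the on-disk approach using Python's standard `sqlite3` module might look like this. The file layout (two delimiter-separated columns, key first), the table names `a` and `b`, and the helpers `load_table` and `join_files` are all assumptions made for illustration. Passing an empty string to `sqlite3.connect` asks SQLite for a private temporary on-disk database that is removed when the connection closes.

```python
import csv
import sqlite3


def load_table(conn, table, path, delimiter="\t"):
    """Stream a two-column delimited file into a new table.

    Hypothetical helper: rows are fed to SQLite straight from the
    csv reader, so the file is never held in memory as a whole.
    """
    conn.execute(f"CREATE TABLE {table} (key TEXT, value TEXT)")
    with open(path, newline="") as f:
        conn.executemany(
            f"INSERT INTO {table} VALUES (?, ?)",
            csv.reader(f, delimiter=delimiter),
        )
    conn.commit()


def join_files(path_a, path_b, delimiter="\t"):
    """Yield (key, value_a, value_b) for every pair of rows with equal keys."""
    # "" = temporary on-disk database, deleted automatically on close.
    conn = sqlite3.connect("")
    try:
        load_table(conn, "a", path_a, delimiter)
        load_table(conn, "b", path_b, delimiter)
        # An index on the join column keeps the lookup from scanning b per row.
        conn.execute("CREATE INDEX b_key ON b (key)")
        query = (
            "SELECT a.key, a.value, b.value "
            "FROM a JOIN b ON a.key = b.key "
            "ORDER BY a.key, a.value, b.value"
        )
        for row in conn.execute(query):
            yield row
    finally:
        conn.close()
```

Unlike the join command, this does not require the input files to be sorted, and swapping `JOIN` for `LEFT JOIN` gives an outer join without any other changes.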

Dave Kirby 2010-09-02 12:33:19

I considered join, but it has the limitation that both files have to be sorted.

pufferfish 2010-09-05 11:20:42
