views: 806
answers: 6

Hi - I have to process text files 10-20GB in size of the format: field1 field2 field3 field4 field5

I would like to parse the data from field2 of each line into one of several files; the file a given line's data gets pushed into is determined, line by line, by the value in field4. There are 25 different possible values in field4 and hence 25 different files the data can get parsed into.

I have tried using Perl (slow) and awk (faster but still slow) - does anyone have any suggestions or pointers toward alternative approaches?

FYI, here is the awk code I was trying to use; note that I had to resort to going through the large file 25 times because I wasn't able to keep 25 files open at once in awk:

chromosomes=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25)
for chr in ${chromosomes[@]}
do

awk < my_in_file_here -v pat="$chr" '{if ($4 == pat) for (i = $2; i <= $2+52; i++) print i}' >> my_out_file_"$chr".query 

done
+6  A: 

Do you know why it's slow? It's because you are processing that big file 25 times with the outer shell for loop!

awk '
# one pass: field 4 names the output file, so the big file is read only once
$4 <= 25 {
    for (i = $2; i <= $2 + 52; i++){
        print i >> "my_out_file_"$4".query"
    }
}' bigfile
You've got an unmatched quote in your output file spec.
Dennis Williamson
Also, it increments `pat` for each line, so it stops doing anything after the 26th line. You need nested `for` loops instead.
Dennis Williamson
Thanks for catching that. As for your second comment, I will wait for the OP to show his input file before making changes; the pat variable may be incremented using specific patterns in the file that signify the next iteration.
You never defined 'pat'. I think you want `print i >> "my_out_file_"$4".query"`
SiegeX
Thanks for spotting that; fixed.
+13  A: 

With Perl, open the files during initialization and then match the output for each line to the appropriate file:

#! /usr/bin/perl

use warnings;
use strict;

my @values = (1..25);

# open all 25 output files up front, keyed by the field4 value
my %fh;
foreach my $chr (@values) {
  my $path = "my_out_file_$chr.query";
  open my $fh, ">", $path
    or die "$0: open $path: $!";

  $fh{$chr} = $fh;
}

while (<>) {
  chomp;
  my($a,$b,$c,$d,$e) = split " ", $_, 5;

  # field4 ($d) selects the handle; the `for` modifier aliases $_ to each
  # number in $b .. $b+52, so this prints the 53 generated values
  print { $fh{$d} } "$_\n"
    for $b .. $b+52;
}
Greg Bacon
+1  A: 

There are times when awk is not the answer.

There are also times when scripting languages are not the answer, when you are just plain better off biting the bullet and dragging down your copy of K&R and hacking some C code.

If your operating system implements pipes using concurrent processes and interprocess communication, as opposed to big temp files, here is what you might be able to do: write an awk script that reformats each line to put the selector field at the beginning, in a format easily readable with scanf(); write a C program that opens the 25 files and distributes lines among them; and then pipe the awk script's output into the C program.
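For illustration, here is a minimal sketch of the C half under those assumptions. The "selector base" input format (as if the awk stage ran something like awk '{print $4, $2}'), the my_out_file_N.query names, and the decision to do the base..base+52 expansion in C are all inventions for the example, not code from the answer:

#include <stdio.h>

#define NFILES 25

int main(void)
{
    /* hypothetical distributor: reads "selector base" pairs on stdin and
       appends the numbers base .. base+52 to the file chosen by selector */
    FILE *out[NFILES + 1] = { 0 };   /* index 1..25; slot 0 unused */
    char name[64];
    int sel;
    long base;

    for (int i = 1; i <= NFILES; i++) {
        snprintf(name, sizeof name, "my_out_file_%d.query", i);
        out[i] = fopen(name, "a");
        if (out[i] == NULL) {
            perror(name);
            return 1;
        }
    }

    /* single pass: every line selects one already-open file */
    while (scanf("%d %ld", &sel, &base) == 2) {
        if (sel >= 1 && sel <= NFILES)
            for (long i = base; i <= base + 52; i++)
                fprintf(out[sel], "%ld\n", i);
    }

    for (int i = 1; i <= NFILES; i++)
        fclose(out[i]);
    return 0;
}

Whether this beats a one-pass awk or Perl solution would have to be measured; the win, if any, comes from the cheap scanf() parse and from reading the big file exactly once.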

John R. Strohm
-1: when large I/O is involved, C doesn't help. He'd be better off working on the algorithm (e.g. reading the file once instead of 25 times) than switching languages!
Davide
+6  A: 

Here is a solution in Python. I have tested it on a small fake file I made up. I think this will be acceptably fast even for a large file, because most of the work will be done by C code inside of Python. And I think this is a pleasant and easy-to-understand program; I prefer Python to Perl.

import sys

s_usage = """\
Usage: csplit <filename>
Splits input file by columns, writes column 2 to file based on chromosome from column 4."""

if len(sys.argv) != 2 or sys.argv[1] in ("-h", "--help", "/?"):
    sys.stderr.write(s_usage + "\n")
    sys.exit(1)


# replace these with the actual patterns, of course
lst_pat = [
    'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
    'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
    'u', 'v', 'w', 'x', 'y'
]


d = {}
for s_pat in lst_pat:
    # build a dictionary mapping each pattern to an open output file
    d[s_pat] = open("my_out_file_" + s_pat, "wt")

if False:
    # if the patterns are unsuitable for filenames (contain '*', '?', etc.) use this:
    for i, s_pat in enumerate(lst_pat):
        # build a dictionary mapping each pattern to an output file
        d[s_pat] = open("my_out_file_" + str(i), "wt")

for line in open(sys.argv[1]):
    # split a line into words, and unpack into variables.
    # use '_' for a variable name to indicate data we don't care about.
    # s_data is the data we want, and s_pat is the pattern controlling the output
    _, s_data, _, s_pat, _ = line.split()
    # use s_pat to get to the file handle of the appropriate output file, and write data.
    d[s_pat].write(s_data + "\n")

# close all the output file handles.
for key in d:
    d[key].close()

EDIT: Here's a little more information about this program, since it seems you will be using it.

All of the error handling is implicit. If an error happens, Python will "raise an exception" which will terminate processing. For example, if one of the files fails to open, this program will stop executing and Python will print a backtrace showing which line of code caused the exception. I could have wrapped the critical parts with a "try/except" block, to catch errors, but for a program this simple, I didn't see any point.

It's subtle, but there is a check that each line of the input file has exactly five words. When this code unpacks a line, it does so into five variables. ("_" is a legal variable name, but there is a convention in the Python community to use it for variables you don't actually care about.) Python will raise an exception if there are not exactly five words on the input line to unpack into the five variables. If your input file can sometimes have four words on a line, or six or more, you could modify the program not to raise an exception; change the main loop to this:

for line in open(sys.argv[1]):
    lst = line.split()
    d[lst[3]].write(lst[1] + "\n")

This splits the line into words, and then assigns the whole list of words to a single variable, lst. So that line of code doesn't care how many words are on the line. The next line then indexes into the list to get the values out. Since Python list indexes start at 0, the second word is lst[1] and the fourth word is lst[3]. As long as there are at least four words in the list, that line of code won't raise an exception either.

And of course, if the fourth word on the line is not in the dictionary of file handles, Python will raise an exception for that too. That would stop processing. Here is some example code for how to use a "try/except" block to handle this:

for line in open(sys.argv[1]):
    lst = line.split()
    try:
        d[lst[3]].write(lst[1] + "\n")
    except KeyError:
        sys.stderr.write("Warning: illegal line seen: " + line)

Good luck with your project.

EDIT: @larelogio pointed out that this code doesn't match the AWK code. The AWK code has an extra for loop that I do not understand. Here is Python code to do the same thing:

for line in open(sys.argv[1]):
    lst = line.split()
    n = int(lst[1])
    for i in range(n, n+53):
        d[lst[3]].write(str(i) + "\n")

And here is another way to do it. This might be a little faster, but I have not tested it so I am not certain.

for line in open(sys.argv[1]):
    lst = line.split()
    n = int(lst[1])
    s = "\n".join(str(i) for i in range(n, n+53))
    d[lst[3]].write(s + "\n")

This builds a single string with all the numbers to write, then writes them in one chunk. This may save time compared to calling .write() 53 times.

steveha
steveha - many thanks for this - I have never tried Python, but this works well - I am still seeing ~ two hours or so to process this file, which is somewhat similar to the time it was taking awk. I guess that I may have to take the plunge and follow John Strohm's answer, which is to try C - really out of my expertise, but I'll give it a go over xmas.
Andrew
1ch1g0 - thanks for this - I had gone the route of going through the file so many times because I wasn't able to keep 26 files open at once - awk was throwing an error; perhaps I should have tried gawk or similar. Happy holidays.
Andrew
steveha - an update - I spoke too soon - your python solution actually runs in about 30 mins - this will work just fine for me, I really appreciate the help.
Andrew
I'm glad I was able to help. I updated my answer with some additional information; I hope you will find the additional information useful. Good luck with your project, and happy holidays!
steveha
Is this program producing the correct output? It seems that it will differ from the original proposal in awk.
larelogio
Hmmm... I was working from the text at the top, explaining the problem. Looking at the AWK code, I'm now very confused. What is that for loop doing? Why is it printing extra numbers? Well, what the heck, I'll edit my answer and provide code matching the AWK.
steveha
@steveha: That loop is responsible for some of the processing time and for most of the output size. :(
larelogio
@larelogio, presumably the files generated by this program will be processed more later. If you could generate the 53 sequential values as part of the later processing, it might streamline the process. But if it makes sense to generate them from this, well, that's easy enough and I provided code. Happy holidays everyone!
steveha
+1  A: 

Sounds like you're on your way, but I just wanted to mention memory-mapped I/O as being a huge help when working with gigantic files. There was a time when I had to parse a 0.5GB binary file with Visual Basic 5 (yes)... importing the CreateFileMapping API allowed me to parse the file (and create a several-gig "human-readable" file) in minutes. And it only took a half hour or so to implement.

Here's a link describing the API on Microsoft platforms, though I'm sure MMIO is available on just about any platform: MSDN
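For what it's worth, here is a minimal sketch of the POSIX analogue (mmap) of the CreateFileMapping approach described above; it only counts lines in the mapping, standing in for whatever per-line parsing the real job would do, and it skips niceties like the empty-file case:

#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* map the whole file; the kernel pages it in on demand, so the scan
       below touches the bytes directly with no explicit read() calls */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    long lines = 0;
    for (off_t i = 0; i < st.st_size; i++)
        if (p[i] == '\n')
            lines++;
    printf("%ld lines\n", lines);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}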

Good luck!

A: 

There are some precalculations that may help.

For example, you can precalculate the output for each value of your field2, assuming it takes only 25 distinct values, like field4:

my %tx = map {
    my $tx = '';
    for my $tx1 ($_ .. $_ + 52) { $tx .= "$tx1\n" }
    $_ => $tx;
} (1 .. 25);

Later, when writing, you can do `print {$fh{$pat}} $tx{$base};`

larelogio