Hi, I have a longitudinal data set, generated by a computer simulation, that can be represented by the following tables (the 'var' columns are variables):

time subject var1 var2 var3
t1   subjectA  ...
t2   subjectB  ...

and

subject   name
subjectA  nameA
subjectB  nameB

However, the simulation writes its output in a format similar to the following:

time t1 
  description
subjectA nameA
  var1 var2 var3
subjectB nameB
  var1 var2 var3
time t2
  description
subjectA nameA
  var1 var2 var3
subjectB nameB
  var1 var2 var3
...(and so on)

I have been using a (Python) script to process this output into a flat text file so that I can import it into R, Python, or SQL, or awk/grep it to extract information. An example of the type of information desired from a single query (in SQL notation, after the data is converted to a table) is shown below:

SELECT var1, var2, var3 FROM datatable WHERE subject='subjectB'

I wonder if there is a more efficient solution, as each of these data files can be ~100 MB (and I have hundreds of them), and creating the flat text file is time-consuming and takes up additional hard-drive space with redundant information. Ideally, I would interact with the original data set directly to extract the information I desire, without creating the extra flat text file... Is there a simpler awk/perl solution for such tasks? I'm quite proficient at text processing in Python, but my skills in awk are rudimentary and I have no working knowledge of Perl; I wonder whether these or other domain-specific tools can provide a better solution.

Thanks!

Postscript: Wow, thanks to all! I am sorry that I cannot accept everyone's answers.

@FM: thanks. My Python script resembles your code, minus the filtering step, but your organization is clean.

@PP: I thought I was already proficient in grep, but apparently not! This is very helpful... though grepping becomes difficult when mixing the 'time' into the output (which I failed to include as a possible extraction scenario in my example; that's my bad).

@ghostdog74: This is just fantastic... but modifying the line to get 'subjectA' was not straightforward (I'll be reading up more on awk in the meantime and hopefully I'll grok it later).

@weismat: Well stated.

@S.Lott: This is extremely elegant and flexible. I was not asking for a python(ic) solution, but this fits cleanly into the parse, filter, and output framework suggested by PP, and is flexible enough to accommodate a number of different queries to extract different types of information from this hierarchical file.

Again, I am grateful to everyone - thanks so much.

+2  A: 

If all you want is var1, var2, var3 for a particular subject, then you could try the following command:


  grep -A 1 'subjectB' file

The -A 1 option instructs grep to print the matched line plus one line of trailing context (and in this case the variables come on the line after the subject).
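For instance, run against the sample layout above, GNU grep would print each match plus its following line, with a -- marker separating the match groups:


  subjectB nameB
    var1 var2 var3
  --
  subjectB nameB
    var1 var2 var3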

You might also want to anchor the subject search to the beginning of the line so the name cannot match elsewhere (e.g. grep -A 1 '^subjectB' file). Note that the ^ anchor works in grep's default basic regular expressions, so the -E (extended regex) option is not needed for this.

The output will then consist of the subject line and the variable line you want. You may want to hide the subject line:


  grep -A 1 'subjectB' file |grep -v 'subjectB'

And you may wish to process the variable line, e.g. to make it comma-separated:


  grep -A 1 'subjectB' file |grep -v 'subjectB' |perl -pe 's/^\s+//; s/ +/,/g'

(The leading indentation is stripped first; without that, replacing spaces with commas would turn the indent into stray leading commas.)

PP
+1  A: 

If you are lazy and have enough RAM, you could work from a RAM disk instead of the file system, at least while you need the files.
I do not think that Perl or awk will be faster than Python if you are just recoding your current algorithm into a different language.
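For example (a minimal sketch, assuming a typical Linux system, where /dev/shm is usually a RAM-backed tmpfs mount; the file name is hypothetical):

import shutil

# Copy the data file to a tmpfs (RAM-backed) mount so that repeated
# scans read from memory rather than disk.
shutil.copy('run01_output.dat', '/dev/shm/run01_output.dat')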

weismat
+1  A: 
awk '/time/{f=0}/subjectB/{f=1;next}f' file

This sets a flag f when a subjectB line is seen (the next skips printing the subject line itself), prints the following lines while the flag is set, and clears the flag at each new time line. It works in this layout because subjectB is the last subject in each time block; for a subject in the middle of a block, you would also need to clear the flag at the next subject line.
ghostdog74
+2  A: 

The best option would be to modify the computer simulation to produce rectangular output. Assuming you can't do that, here's one approach:

In order to be able to use the data in R, SQL, etc. you need to convert it from hierarchical to rectangular one way or another. If you already have a parser that can convert the entire file into a rectangular data set, you are most of the way there. The next step is to add additional flexibility to your parser, so that it can filter out unwanted data records. Instead of having a file converter, you'll have a data extraction utility.

The example below is in Perl, but you can do the same thing in Python. The general idea is to maintain a clean separation between (a) parsing, (b) filtering, and (c) output. That way, you have a flexible environment, making it easy to add different filtering or output methods, depending on your immediate data-crunching needs. You can also set up the filtering methods to accept parameters (either from command line or a config file) for greater flexibility.

use strict;
use warnings;

read_file($ARGV[0], \&check_record);

sub read_file {
    my ($file_name, $check_record) = @_;
    open(my $file_handle, '<', $file_name) or die $!;
    # A data structure to hold an entire record.
    my $rec = {
        time => '',
        desc => '',
        subj => '',
        name => '',
        vars => [],
    };
    # A code reference to get the next line and do some cleanup.
    my $get_line = sub {
        my $line = <$file_handle>;
        return unless defined $line;
        chomp $line;
        $line =~ s/^\s+//;
        return $line;
    };
    # Start parsing the data file.
    while ( defined(my $line = $get_line->()) ){
        next unless length $line;  # skip blanks, which would otherwise end the loop
        if ($line =~ /^time (\w+)/){
            $rec->{time} = $1;
            $rec->{desc} = $get_line->();
        }
        else {
            ($rec->{subj}, $rec->{name}) = $line =~ /(\w+) +(\w+)/;
            $rec->{vars} = [ split / +/, $get_line->() ];

            # OK, we have a complete record. Now invoke our filtering
            # code to decide whether to export record to rectangular format.
            $check_record->($rec);
        }
    }
}

sub check_record {
    my $rec = shift;
    # Just an illustration. You'll want to parameterize this, most likely.
    write_output($rec)
        if  $rec->{subj} eq 'subjectB'
        and $rec->{time} eq 't1'
    ;
}

sub write_output {
    my $rec = shift;
    print join("\t", 
        $rec->{time}, $rec->{subj}, $rec->{name},
        @{$rec->{vars}},
    ), "\n";
}
FM
+1: From past experience, I know that parsing large files into hashes can consume a *lot* of memory. I have to say that this solution will probably be hard to better wrt memory-miserliness...
Zaid
+2  A: 

This is what Python generators are all about.

def read_as_flat( someFile ):
    # Assumes subjectNameSet, a set of the known subject identifiers
    # (e.g. {'subjectA', 'subjectB'}), is defined elsewhere in the module.
    line_iter = iter(someFile)
    time_header = None
    for line in line_iter:
        words = line.split()
        if not words:
            continue
        if words[0] == 'time':
            time_header = words[1:]                # the "time" line, e.g. ['t1']
            description = next(line_iter).strip()  # the description line
            time_header.append(description)
        elif words[0] in subjectNameSet:
            data = next(line_iter).split()         # the variable line
            yield time_header + data

You can use this like a standard Python iterator

for time, description, var1, var2, var3 in read_as_flat( someFile ):
    ...  # process each flattened record
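To reproduce the original SQL-style query (including the time, which the grep approach makes awkward), add a filtering and an output step on top of the generator. A minimal sketch, assuming read_as_flat is defined in the same script and the data file and subject are given on the command line:

import sys

file_name, subject = sys.argv[1], sys.argv[2]

# Restricting subjectNameSet to a single subject plays the role of the
# WHERE clause: SELECT var1, var2, var3 ... WHERE subject='subjectB'
subjectNameSet = {subject}

with open(file_name) as someFile:
    for time, description, var1, var2, var3 in read_as_flat(someFile):
        print('\t'.join([time, var1, var2, var3]))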
S.Lott
I like, very elegant.
mythz