Hi, I have a longitudinal data set, generated by a computer simulation, that can be represented by the following tables (the 'var' columns are variables):

time subject var1 var2 var3
t1   subjectA  ...
t2   subjectB  ...

and

subject   name
subjectA  nameA
subjectB  nameB

However, the simulation writes its output in a format similar to the following:

time t1 
  description
subjectA nameA
  var1 var2 var3
subjectB nameB
  var1 var2 var3
time t2
  description
subjectA nameA
  var1 var2 var3
subjectB nameB
  var1 var2 var3
...(and so on)

I have been using a (Python) script to process this output into a flat text file so that I can import it into R, Python, or SQL, or awk/grep it to extract information. An example of the type of information desired from a single query (in SQL notation, after the data is converted to a table) is shown below:

SELECT var1, var2, var3 FROM datatable WHERE subject='subjectB'

I wonder if there is a more efficient solution, as each of these data files can be ~100 MB (and I have hundreds of them), and creating the flat text file is time-consuming and takes up additional hard-drive space with redundant information. Ideally, I would interact with the original data set directly to extract the information I desire, without creating the extra flat text file... Is there a simpler awk/perl solution for such tasks? I'm quite proficient at text processing in Python, but my skills in awk are rudimentary and I have no working knowledge of Perl; I wonder whether these or other domain-specific tools can provide a better solution.

Thanks!

Postscript: Wow, thanks to all! I am sorry that I cannot accept everyone's answers.

@FM: thanks. My Python script resembles your code, minus the filtering step, but your organization is clean.

@PP: I thought I was already proficient in grep, but apparently not! This is very helpful... though grepping becomes difficult when mixing the 'time' into the output (which I failed to include as a possible extraction scenario in my example; that's my bad).

@ghostdog74: This is just fantastic... but modifying the line to get 'subjectA' was not straightforward (I'll be reading up more on awk in the meantime and hopefully I'll grok it later).

@weismat: Well stated.

@S.Lott: This is extremely elegant and flexible. I was not asking for a python(ic) solution, but this fits cleanly into the parse, filter, and output framework suggested by PP, and is flexible enough to accommodate a number of different queries to extract different types of information from this hierarchical file.

Again, I am grateful to everyone - thanks so much.

+2  A: 

If all you want is var1, var2, var3 for a particular subject, then you could try the following command:


  grep -A 1 'subjectB' file

The -A 1 option instructs grep to print the matched line plus one line of trailing context (and in this case the variables come on the line after the subject).
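For instance, run against the sample layout above, GNU grep would print each match plus its following line, with a -- marker separating the match groups:


  subjectB nameB
    var1 var2 var3
  --
  subjectB nameB
    var1 var2 var3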

You might also want to anchor the subject search to the beginning of the line so the name cannot match elsewhere (e.g. grep -A 1 '^subjectB' file). Note that the ^ anchor works in grep's default basic regular expressions, so the -E (extended regex) option is not needed for this.

The output will then consist of the subject line and the variable line you want. You may want to hide the subject line:


  grep -A 1 'subjectB' file |grep -v 'subjectB'

And you may wish to process the variable line, e.g. to make it comma-separated:


  grep -A 1 'subjectB' file |grep -v 'subjectB' |perl -pe 's/^\s+//; s/ +/,/g'

(The leading indentation is stripped first; without that, replacing spaces with commas would turn the indent into stray leading commas.)

PP
+1  A: 

If you are lazy and have enough RAM, you could work from a RAM disk instead of the file system, at least while you need the files.
I do not think that Perl or awk will be faster than Python if you are just recoding your current algorithm into a different language.
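For example (a minimal sketch, assuming a typical Linux system, where /dev/shm is usually a RAM-backed tmpfs mount; the file name is hypothetical):

import shutil

# Copy the data file to a tmpfs (RAM-backed) mount so that repeated
# scans read from memory rather than disk.
shutil.copy('run01_output.dat', '/dev/shm/run01_output.dat')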

weismat
+1  A: 
awk '/time/{f=0}/subjectB/{f=1;next}f' file

This sets a flag f when a subjectB line is seen (the next skips printing the subject line itself), prints the following lines while the flag is set, and clears the flag at each new time line. It works in this layout because subjectB is the last subject in each time block; for a subject in the middle of a block, you would also need to clear the flag at the next subject line.
ghostdog74
+2  A: 

The best option would be to modify the computer simulation to produce rectangular output. Assuming you can't do that, here's one approach:

In order to be able to use the data in R, SQL, etc. you need to convert it from hierarchical to rectangular one way or another. If you already have a parser that can convert the entire file into a rectangular data set, you are most of the way there. The next step is to add additional flexibility to your parser, so that it can filter out unwanted data records. Instead of having a file converter, you'll have a data extraction utility.

The example below is in Perl, but you can do the same thing in Python. The general idea is to maintain a clean separation between (a) parsing, (b) filtering, and (c) output. That way, you have a flexible environment, making it easy to add different filtering or output methods, depending on your immediate data-crunching needs. You can also set up the filtering methods to accept parameters (either from command line or a config file) for greater flexibility.

use strict;
use warnings;

read_file($ARGV[0], \&check_record);

sub read_file {
    my ($file_name, $check_record) = @_;
    open(my $file_handle, '<', $file_name) or die $!;
    # A data structure to hold an entire record.
    my $rec = {
        time => '',
        desc => '',
        subj => '',
        name => '',
        vars => [],
    };
    # A code reference to get the next line and do some cleanup.
    my $get_line = sub {
        my $line = <$file_handle>;
        return unless defined $line;
        chomp $line;
        $line =~ s/^\s+//;
        return $line;
    };
    # Start parsing the data file.
    while ( defined(my $line = $get_line->()) ){
        next unless length $line;  # skip blanks, which would otherwise end the loop
        if ($line =~ /^time (\w+)/){
            $rec->{time} = $1;
            $rec->{desc} = $get_line->();
        }
        else {
            ($rec->{subj}, $rec->{name}) = $line =~ /(\w+) +(\w+)/;
            $rec->{vars} = [ split / +/, $get_line->() ];

            # OK, we have a complete record. Now invoke our filtering
            # code to decide whether to export record to rectangular format.
            $check_record->($rec);
        }
    }
}

sub check_record {
    my $rec = shift;
    # Just an illustration. You'll want to parameterize this, most likely.
    write_output($rec)
        if  $rec->{subj} eq 'subjectB'
        and $rec->{time} eq 't1'
    ;
}

sub write_output {
    my $rec = shift;
    print join("\t", 
        $rec->{time}, $rec->{subj}, $rec->{name},
        @{$rec->{vars}},
    ), "\n";
}
FM
+1: From past experience, I know that parsing large files into hashes can consume a *lot* of memory. I have to say that this solution will probably be hard to better wrt memory-miserliness...
Zaid
+2  A: 

This is what Python generators are all about.

def read_as_flat( someFile ):
    # Assumes subjectNameSet, a set of the known subject identifiers
    # (e.g. {'subjectA', 'subjectB'}), is defined elsewhere in the module.
    line_iter = iter(someFile)
    time_header = None
    for line in line_iter:
        words = line.split()
        if not words:
            continue
        if words[0] == 'time':
            time_header = words[1:]                # the "time" line, e.g. ['t1']
            description = next(line_iter).strip()  # the description line
            time_header.append(description)
        elif words[0] in subjectNameSet:
            data = next(line_iter).split()         # the variable line
            yield time_header + data

You can use this like a standard Python iterator

for time, description, var1, var2, var3 in read_as_flat( someFile ):
    ...  # process each flattened record
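To reproduce the original SQL-style query (including the time, which the grep approach makes awkward), add a filtering and an output step on top of the generator. A minimal sketch, assuming read_as_flat is defined in the same script and the data file and subject are given on the command line:

import sys

file_name, subject = sys.argv[1], sys.argv[2]

# Restricting subjectNameSet to a single subject plays the role of the
# WHERE clause: SELECT var1, var2, var3 ... WHERE subject='subjectB'
subjectNameSet = {subject}

with open(file_name) as someFile:
    for time, description, var1, var2, var3 in read_as_flat(someFile):
        print('\t'.join([time, var1, var2, var3]))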
S.Lott
I like, very elegant.
mythz