Hi, I have a longitudinal data set generated by a computer simulation that can be represented by the following tables ('var' are variables):
time subject var1 var2 var3
t1 subjectA ...
t2 subjectB ...
and
subject name
subjectA nameA
subjectB nameB
However, the file generated writes a data file in a format similar to the following:
time t1
description
subjectA nameA
var1 var2 var3
subjectB nameB
var1 var2 var3
time t2
description
subjectA nameA
var1 var2 var3
subjectB nameB
var1 var2 var3
...(and so on)
I have been using a (python) script to process this output data into a flat text file so that I can import it into R, python, SQL, or awk/grep it to extract information - an example of the type of information desired from a single query (in SQL notation, after the data is converted to a table) is shown below:
SELECT var1, var2, var3 FROM datatable WHERE subject='subjectB'
I wonder if there is a more efficient solution as each of these data files can be ~100MB each (and I have hundreds of them) and creating the flat text file is time-consuming and takes up additional hard drive space with redundant information. Ideally, I would interact with the original data set directly to extract the information that I desire, without creating the extra flat text file... Is there an awk/perl solution for such tasks that is simpler? I'm quite proficient at text-processing in python but my skills in awk are rudimentary and I have no working knowledge of perl; I wonder if these or other domain-specific tools can provide a better solution.
Thanks!
Postscript: Wow, thanks to all! I am sorry that I cannot choose everyone's answers @FM: thanks. My Python script resembles your code without the filtering step. But your organization is clean. @PP: I thought I was already proficient in grep but apparently not! This is very helpful... but I think grepping becomes difficult when mixing the 'time' into the output (which I failed to include as a possible extraction scenario in my example! That's my bad). @ghostdog74: This is just fantastic... but modifying the line to get 'subjectA' was not straightforward... (though I'll be reading up more on awk in the meantime and hopefully I'll grok later). @weismat: Well stated. @S.Lott: This is extremely elegant and flexible - I was not asking for a python(ic) solution but this fits in cleanly with the parse, filter, and output framework suggested by PP, and is flexible enough to accommodate a number of different queries to extract different types of information from this hierarchical file.
Again, I am grateful to everyone - thanks so much.