I have to deal with text files in a motley selection of formats. Here's an example (columns A and B are tab-delimited):

A B
a Name1=Val1, Name2=Val2, Name3=Val3
b Name1=Val4, Name3=Val5
c Name1=Val6, Name2=Val7, Name3=Val8

The files may or may not have headers, may mix delimiting schemes, may have columns containing name/value pairs as above, and so on.
I often have an ad-hoc need to extract data from such files in various ways. For example, from the above data I might want the value associated with Name2 wherever it is present, i.e.:

A B
a Val2
c Val7

What tools/techniques are there for performing such manipulations as one-line commands, using the above as an example but extensible to other cases?

+1  A: 

You have all the standard Unix command-line tools available from the bash shell, for example grep, cut, sed and awk. You can also use Perl or Ruby for more complex things.
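
As a rough illustration of chaining those basic tools, here is a minimal sketch that lists only the values of one key. The helper name `values_of` is made up for this example, and it assumes a grep that supports `-o` (GNU, BSD and BusyBox grep do; it is not in POSIX). Note that it drops column A, so it only fits when the values alone are wanted.

```shell
# Hypothetical helper: print the values of one key found anywhere on stdin.
values_of() {
  # grep -o keeps only the matching name=value fragments, one per line;
  # cut then drops the name and keeps everything after the first "=".
  grep -o "$1=[^,]*" | cut -d= -f2-
}

# Usage with two rows of the question's sample data (tab between columns).
printf 'a\tName1=Val1, Name2=Val2, Name3=Val3\nb\tName1=Val4, Name3=Val5\n' | values_of Name2
```

For the two rows above this prints `Val2`, since only the first row carries Name2.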

auramo
A: 

From what I've seen I'd start with Awk for this sort of thing and then if you need something more complex, I'd progress to Python.
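
For the example in the question, an Awk approach might look like the sketch below. It assumes the two columns are separated by a tab and that the pairs in column B are separated by a comma plus optional spaces; the function name `extract_pair` is invented for illustration, and the script can be collapsed onto one line if a true one-liner is wanted.

```shell
# Hypothetical helper: read the table on stdin and print the header, then
# column A plus the value of the requested key for each row that has it.
extract_pair() {
  awk -F'\t' -v key="$1" '
    NR == 1 { print; next }            # pass the header line through unchanged
    {
      n = split($2, pairs, ", *")      # break column B into name=value items
      for (i = 1; i <= n; i++) {
        split(pairs[i], kv, "=")       # kv[1] is the name, kv[2] the value
        if (kv[1] == key) print $1 "\t" kv[2]
      }
    }'
}

# Usage with the sample data from the question.
printf 'A\tB\na\tName1=Val1, Name2=Val2, Name3=Val3\nb\tName1=Val4, Name3=Val5\nc\tName1=Val6, Name2=Val7, Name3=Val8\n' |
  extract_pair Name2
```

This prints the header followed by the rows for a (Val2) and c (Val7); row b is skipped because it has no Name2. Comparing the name exactly, rather than pattern-matching the whole line, avoids accidental hits on keys that merely end in Name2.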

Onorio Catenacci
A: 

I would use sed:

   # print section of file between two regular expressions (inclusive)
   sed -n '/Iowa/,/Montana/p'             # case sensitive
Cetra
A: 

Supplementary: How would you use your suggested tool to solve the above problem as a one-liner?

Hobbo
A: 

Since you have Cygwin, I'd go with Perl. It's the easiest to learn (check out the O'Reilly book Learning Perl) and widely applicable.

+1  A: 

I don't like sed too much, but it works for such things...

var="Name2";sed -n "1p;s/\([^ ]*\) .*$var=\([^ ,]*\).*/\1 \2/p" < filename

Gives you

 A       B
 a Val2
 c Val7
Weidenrinde
A: 

I would use Perl. Write a small module (or more than one) for dealing with the different formats. You could then run Perl one-liners using that library. For example, it might look something like this:

perl -e 'use Parser;' -e 'parser("in.input")->get("Name2");'

Don't quote me on the syntax, but that's the general idea. Abstract the task at hand so that you can think in terms of what you need to do, not how you need to do it. Ruby would be another option; it tends to have a cleaner syntax, but either language would work.

deterb