views:

289

answers:

6

# To fix the symptom

How can you sum up the following columns effectively?

Column 1

1
3
3
...

Column 2

2323
343
232
...

This should give me

Expected result

2324
346
235
...

I have the columns in two files.


# Initial situation

I use sometimes too many curly brackets such that I have used one more this { than this } in my files. I am trying to find where I have used the one unnecessary curly bracket. I have used the following steps in getting the data

Find commands

 find . * -exec grep '{' {} + > /tmp/1
 find . * -exec grep '}' {} + > /tmp/2

AWK commands

 awk -F: '{ print $2 }' /tmp/1 > /tmp/11
 awk -F: '{ print $2 }' /tmp/2 > /tmp/22

The column are in the files /tmp/11 and /tmp/22.

I repeat a lot of similar commands in my procedure. This suggests me that this is not the right way.

Please, suggests me any way such as Python, Perl or any Unix tool which can decrease the number of steps.

+10  A: 

Using python:

totals = [ int(i)+int(j) for i, j in zip ( open(fname1), open(fname2) ) ]
SilentGhost
That is Python :)
Python was mentioned in the question as OK for a solution.
unwind
You are right, I've skipped the tags.
It would help if you told people what it was, you can't just type that on the commandline, I only got that it was Python because it used a list comprehension.
Chas. Owens
I added one missing bracket to the code. - I have not yet managed to get the code to work. **Do you need any library to get it work?** - I get this error `coercing to Unicode: need string or buffer, int foundWARNING: Failure executing file:` when running the command with my filenames.
Masi
Worked beautifully for me using Python 2.5. No library required.
Fred Larson
+3  A: 

You can avoid the intermediate steps by just using a command that do the counts and the comparison at the same time:

find . -type f -exec perl -nle 'END { print $ARGV if $h{"{"} != $h{"}"} } $h{$_}++ for /([}{])/g' {}\;

This calls the Perl program once per file, the Perl program counts the number of each type curly brace and prints the name of the file if they counts don't match.

You must be careful with the /([}{]])/ section, find will think it needs to do the replacement on {} if you say /([{}]])/.

WARNING: this code will have false positives and negatives if you are trying to run it against source code. Consider the following cases:

balanced, but curlies in strings:

if ($s eq '{') {
    print "I saw a {\n"
}

unbalanced, but curlies in strings:

while (1) {
   print "}";

You can expand the Perl command by using B::Deparse:

perl -MO=Deparse -nle 'END { print $ARGV if $h{"{"} != $h{"}"} } $h{$_}++ for /([}{])/g'

Which results in:

BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
    chomp $_;
    sub END {
        print $ARGV if $h{'{'} != $h{'}'};
    }
    ;
    ++$h{$_} foreach (/([}{])/g);
}

We can now look at each piece of the program:

BEGIN { $/ = "\n"; $\ = "\n"; }

This is caused by the -l option. It sets both the input and output record separators to "\n". This means anything read in will be broken into records based "\n" and any print statement will have "\n" appended to it.

LINE: while (defined($_ = <ARGV>)) {
}

This is created by the -n option. It loops over every file passed in via the commandline (or STDIN if no files are passed) reading each line of those files. This also happens to set $ARGV to the last file read by <ARGV>.

chomp $_;

This removes whatever is in the $/ variable from the line that was just read ($_), it does nothing useful here. It was caused by the -l option.

sub END {
    print $ARGV if $h{'{'} != $h{'}'};
}

This is an END block, this code will run at the end of the program. It prints $ARGV (the name of the file last read from, see above) if the values stored in %h associated with the keys '{' and '}' are equal.

++$h{$_} foreach (/([}{])/g);

This needs to be broken down further:

/
    (    #begin capture
    [}{] #match any of the '}' or '{' characters
    )    #end capture
/gx

Is a regex that returns a list of '{' and '}' characters that are in the string being matched. Since no string was specified the $_ variable (which holds the line last read from the file, see above) will be matched against. That list is fed into the foreach statement which then runs the statement it is in front of for each item (hence the name) in the list. It also sets $_ (as you can see $_ is a popular variable in Perl) to be the item from the list.

++h{$_}

This line increments the value in $h that is associated with $_ (which will be either '{' or '}', see above) by one.

Chas. Owens
+10  A: 

If c1 and c2 are youre files, you can do this:

$ paste c1 c2 | awk '{print $1 + $2}'

Or (without AWK):

$ paste c1 c2 | while read i j; do echo $(($i+$j)); done
+1  A: 

In Python (or Perl, Awk, &c) you can reasonably do it in a single stand-alone "pass" -- I'm not sure what you mean by "too many curly brackets", but you can surely count curly use per file. For example (unless you have to worry about multi-GB files), the 10 files using most curly braces:

import heapq
import os
import re

curliest = dict()

for path, dirs, files in os.walk('.'):
  for afile in files:
    fn = os.path.join(path, afile)
    with open(fn) as f:
      data = f.read()
      braces = data.count('{') + data.count('}')
    curliest[fn] = bracs

top10 = heapq.nlargest(10, curlies, curliest.get)
top10.sort(key=curliest.get)
for fn in top10:
  print '%6d %s' % (curliest[fn], fn)
Alex Martelli
A: 

Reply to Lutz'n answer

My problem was finally solved by this commnad

paste -d: /tmp/1 /tmp/2 | awk -F: '{ print $1 "\t" $2 - $4 }'
Masi
A: 

your problem can be solved with just 1 awk command...

awk '{getline i<"file1";print i+$0}'  file2
ghostdog74