tags:

views:

396

answers:

7

I have a set of 4 massive CSV files that I need to modify. What I need to do is match this expression /^(.*),,/ copy the atom then prepend it to every subsequent line until the atom is matched again. Then I need to rinse and repeat until the end of the file (each file has approx 25k lines in it). Finally I need to go back through and remove the first atom.

I would like to use sed for this if it's possible. I tried doing it with vim but couldn't get the regex right. Any help would be greatly appreciated. An example is illustrated below:

Before:

0917,,
,882-1273,1
,95F 9475,1
,276-080,1
,40K 0080,1
,275-690A,1
,TX-2311,3
,TX-3351,4
,B-07432,1
,B-6901,1
,23-753,1
,02F 4307,1
,5.1K QBK-ND,1
,0944-026,1
,0944-027,1
,0944-004,1
,0944-056,1
,0944-057,1
,0944-082,1
,0944-024,1
,0944-025,1
,0944-102,4
,LOR 102,1
0918,,
,CJ1085,1
,1352-152,4
,DMS3102A-18-,1
,6-32 KEP,7
,6-32 X 3/4,4
,6-32X1/2,4
,1251-102,8
,Oct-32,4
,10-32 SAE,8

After:

0917,882-1273,1
0917,95F 9475,1
0917,276-080,1
0917,40K 0080,1
0917,275-690A,1
0917,TX-2311,3
0917,TX-3351,4
0917,B-07432,1
0917,B-6901,1
0917,23-753,1
0917,02F 4307,1
0917,5.1K QBK-ND,1
0917,0944-026,1
0917,0944-027,1
0917,0944-004,1
0917,0944-056,1
0917,0944-057,1
0917,0944-082,1
0917,0944-024,1
0917,0944-025,1
0917,0944-102,4
0917,LOR 102,1
0918,CJ1085,1
0918,1352-152,4
0918,DMS3102A-18-,1
0918,6-32 KEP,7
0918,6-32 X 3/4,4
0918,6-32X1/2,4
0918,1251-102,8
0918,Oct-32,4
0918,10-32 SAE,8
+1  A: 

Since syntax of sed is cryptic and you don't use day by day, not talking about your colleagues, this code will be hard to maintain. Perl/awk solution is preferred.

Any way, here is the best sed manual I ever saw.

Good luck Dima

dimba
Doesn't answer the question. "RTFM" is not an answer. "Use another language" is not an answer. "Use another language, and here is how you can do it" IS an answer. If you suggest Perl, then give a Perl example of how to do it.
A. Levy
As you see dear A. Levy, all answers uses high level languages. Guess why? sed maybe cool tool, but if you not using it aggressively in day by day basis, something more sophisticated than search and replace (s///) is not welcomed.When question is asked it's legitimate to raise question "Why you need it?", "Maybe there's another way to solve it?" etc.This is my personal opinion and if you're not accept it than its your own problem.
dimba
+1  A: 

Perl might be easier:

#!/usr/bin/perl -w

$filename = $ARGV[0];
open FILE, "<", $filename or die $!;

while (<FILE>) {
    if(/^(.*),,/) {
     $prefix = $_;
     $prefix =~ s/,//g;
     $prefix =~ s/\s+//g;
     next; 
    }
    s/^,/$prefix,/g;
    print $_;
}

close FILE;
Phil Vollhardt
Don't forget `use strict;` prefer lexical filehandles; why are you capturing and not using the captured substring; `.*` will match the empty string etc etc
Sinan Ünür
OK. i'm a little new to perl. Thank you for the advice. I see your's seems much simpler. I'll look through it. Thanks again
Phil Vollhardt
+3  A: 

As I pointed out with your previous question on the same topic, I find Perl to be easier:

#!/usr/bin/perl

use strict;
use warnings;

my $prefix = q{};

while ( <> ) {
    last unless /\S/;
    if ( /^(.+),,$/ ) {
        $prefix = $1;
        next;
    }
    print $prefix, $_;
}
Sinan Ünür
figured i'd give it another go without the "Community Wiki". thanks again for pointing out the error.
-1: you are assuming that the CSV file format is simple, but it's not
Stefano Borini
@Stefano no, all existing variations on CSV are not simple, but the format given by the OP is. The purpose of the program seems to be to process data produced according to some spec, not arbitrary spreadsheets produced by a bazillion users. This is a totally bogus downvote.
Sinan Ünür
I changed the captured part to `.+` to avoid matching empty strings. You might want to choose a pattern that corresponds better to your specs such as `[A-Za-z0-9]`. Is the length of the prefix always 4?
Sinan Ünür
I agree with your point. Indeed I reread the post in more details and the specs that bsisco gave, and you are right. I converted my downvote in a +1 vote. Your solution works equally well as mine, and you focused only on the specs. Actually, I invite bsisco to grant you the correct answer tickmark, as technically you are doing exactly what is asked, no more, no less (and we know how important it is in software development).
Stefano Borini
+1  A: 

The program (python)

import csv
infile=file("in","r")
outfile=file("out","w")
reader = csv.reader(infile , dialect='excel')
writer = csv.writer(outfile , dialect='excel')
current_header=""
for inrow in reader:
    if len(inrow[0].strip()) != 0:
        current_header = inrow[0]
        continue

    writer.writerow([current_header]+inrow[1:])

infile.close()
outfile.close()
print "done"

The input

0917,,
,882-1273,1
,95F 9475,1
,276-080,1
,40K 0080,1
,275-690A,1
,TX-2311,3
,TX-3351,4
,B-07432,1
,B-6901,1
,23-753,1
,02F 4307,1
,5.1K QBK-ND,1
,0944-026,1
,0944-027,1
,0944-004,1
,0944-056,1
,0944-057,1
,0944-082,1
,0944-024,1
,0944-025,1
,0944-102,4
,LOR 102,1
0918,,
,CJ1085,1
,1352-152,4
,DMS3102A-18-,1
,6-32 KEP,7
,6-32 X 3/4,4
,6-32X1/2,4
,1251-102,8
,Oct-32,4
,10-32 SAE,8

The output

0917,882-1273,1
0917,95F 9475,1
0917,276-080,1
0917,40K 0080,1
0917,275-690A,1
0917,TX-2311,3
0917,TX-3351,4
0917,B-07432,1
0917,B-6901,1
0917,23-753,1
0917,02F 4307,1
0917,5.1K QBK-ND,1
0917,0944-026,1
0917,0944-027,1
0917,0944-004,1
0917,0944-056,1
0917,0944-057,1
0917,0944-082,1
0917,0944-024,1
0917,0944-025,1
0917,0944-102,4
0917,LOR 102,1
0918,CJ1085,1
0918,1352-152,4
0918,DMS3102A-18-,1
0918,6-32 KEP,7
0918,6-32 X 3/4,4
0918,6-32X1/2,4
0918,1251-102,8
0918,Oct-32,4
0918,10-32 SAE,8

Have fun

Stefano Borini
that's awesome!!!...thank you so much. it worked perfectly
Not in sed, but it at least solves the problem.
A. Levy
much faster and shorter too i'm sure....i never even thought to use python!
I don't think it's shorter. the perl is shorter, so I guess that awk is short as well. Sinan pointed out that your specs are very definite. My solution is more general, as the python csv module keeps into account the diverse dialects of csv. Use a python solution if your developers are more accustomed with python and you expect strange csv datasets. Use the perl solution if your developers are more accustomed to perl and your format is and will always stay as you specified.
Stefano Borini
With Perl, use either http://search.cpan.org/perldoc/Text::CSV or http://search.cpan.org/perldoc/Text::XSV for arbitrary character separated file formats
Sinan Ünür
@Stefano...Good point. I'm pretty much the only one here that gets stuff like this dropped on me. I'm learning (and LOVING) python so that's the direction I tend to lean. thanks again.
A: 

Here is a solution with awk:

awk -F, '{ if ($1 != "") prefix=$1; else printf "%s%s\n", prefix,$0 }' myfile.csv
mouviciel
A: 

Here's a full featured example, written in Perl, that uses the new features of Perl 5.10;

#!/usr/bin/perl
use strict;
use warnings;

use feature qw'switch say';

my $append;

while( <> ){
  given( $_ ){

    when( /^$/ ){
      # handle empty line
      say STDERR '#';
    }

    # handle lines that start with "#"
    when( /^\s*[#](.*)/s ){
      print STDERR '# comment:', $1;
    }

    # handle lines that end with two commas
    when( /(.+),,\s*$/ ){
      $append = $1;
    }

    # handle lines that start with a comma
    when( /^,/ ){
      die unless defined $append;
      print $append, $_;
    }
  }
}

Input

0917,,
,882-1273,1
,95F 9475,1
,276-080,1

,40K 0080,1
,275-690A,1
,TX-2311,3
# ignore
 # ignore this too
,TX-3351,4
,B-07432,1
,B-6901,1
,23-753,1
,02F 4307,1
,5.1K QBK-ND,1
,0944-026,1
,0944-027,1
,0944-004,1
,0944-056,1
,0944-057,1
,0944-082,1
,0944-024,1
,0944-025,1
,0944-102,4
,LOR 102,1
0918,,
,CJ1085,1
,1352-152,4
,DMS3102A-18-,1
,6-32 KEP,7
,6-32 X 3/4,4
,6-32X1/2,4
,1251-102,8
,Oct-32,4
,10-32 SAE,8

Output

#
# comment: ignore
# comment: ignore this too
0917,882-1273,1
0917,95F 9475,1
0917,276-080,1
0917,40K 0080,1
0917,275-690A,1
0917,TX-2311,3
0917,TX-3351,4
0917,B-07432,1
0917,B-6901,1
0917,23-753,1
0917,02F 4307,1
0917,5.1K QBK-ND,1
0917,0944-026,1
0917,0944-027,1
0917,0944-004,1
0917,0944-056,1
0917,0944-057,1
0917,0944-082,1
0917,0944-024,1
0917,0944-025,1
0917,0944-102,4
0917,LOR 102,1
0918,CJ1085,1
0918,1352-152,4
0918,DMS3102A-18-,1
0918,6-32 KEP,7
0918,6-32 X 3/4,4
0918,6-32X1/2,4
0918,1251-102,8
0918,Oct-32,4
0918,10-32 SAE,8
Brad Gilbert
A: 

Here's a sed solution. It's not the cleanest, and I'm sure there's a better way to transliterate the two lines, but I found this interesting. (In reality, I would go with a Perl solution, but am posting this for the novelty, and would like to see improvements.)

sed -e '/,,/{s/,,//; h; d;}' -e 'G' -e 's/\(.*\)\n\(.*\)/\2\1/'
William Pursell
because CSV files have distinct data structure, such as fixed delimiters. therefore, it will be easier, to use fields rather than regex to manipulate data.
ghostdog74
@ghostdog74: is that a vote down for my answer? That down vote should go to the question, not my response!
William Pursell