This question has two parts: one for "single line match" and one for "multi-line region matching". I also have a semi-working solution; I am looking for something more robust and elegant.

  1. Single Line Match: I would like to duplicate each line of an input file so that the second line is a regex modification of the first. For example:

File.txt

YY BANANA, YYZ, ABC YHZ YY1
YY APPLE , YYZ, ABC YHZ YY1
YY ORANGE, YYZ, ABC YHZ YY1
YZ GRAPE , YZZ, ABC YHZ YZ1

Would BECOME:

YY BANANA, YYZ, ABC YHZ YY1
XY BANANA, XYZ, ABC YHZ XY1
YY APPLE , YYZ, ABC YHZ YY1
XY APPLE , XYZ, ABC YHZ XY1
YY ORANGE, YYZ, ABC YHZ YY1
XY ORANGE, XYZ, ABC YHZ XY1
YZ GRAPE , YZZ, ABC YHZ YZ1
XZ GRAPE , XZZ, ABC YHZ XZ1

Keep in mind that the real file is large, and that the example of YY -> XY and YZ -> XZ is exactly correct. In other words, in my file YY, YH, YZ, Y1, Y2, Y3 are the symbols that I would like to change to XY, XH, XZ, X1, X2, X3.

I have done something in Perl that is very raw (linked below as a starting point to show what I was thinking), but the script I wrote is not elegant or general and requires multiple passes over the file.

My raw stab, in Perl: http://www.quantprinciple.com/invest/index.php/docs/tipsandtricks/perl-sed-awk/conditional-duplicate/

Usage of my raw stab:

MatchDuplicate.pl  INPUT.txt YY XY > INPUT2.txt
MatchDuplicate.pl  INPUT2.txt YH XH > INPUT3.txt
MatchDuplicate.pl  INPUT3.txt Y1 X1 > INPUT4.txt
MatchDuplicate.pl  INPUT4.txt Y2 X2 > INPUT5.txt

INPUT5.txt is used...

  2. Multi Line Match: exactly the same as above, but each "record" of the input spans multiple lines:

File.txt

< some starting marker...startRecord:>
data
data
YY data
YY BANANA, YYZ, ABC YHZ YY1
<some ending record marker>
< some starting marker...startRecord:>
data
data
YY data
YY APPLE , YYZ, ABC YHZ YY1
<some ending record marker>
< some starting marker...startRecord:>
data
data
YY data
YY ORANGE, YYZ, ABC YHZ YY1
<some ending record marker>
< some starting marker...startRecord:>
data
data
YZ data
YZ GRAPE , YZZ, ABC YHZ YZ1
<some ending record marker>

Would BECOME:

< some starting marker...startRecord:>
data
data
YY data
YY BANANA, YYZ, ABC YHZ YY1
<some ending record marker>
< some starting marker...startRecord:>
data
data
XY data
XY BANANA, XYZ, ABC YHZ XY1
<some ending record marker>
< some starting marker...startRecord:>
data
data
YY data
YY APPLE , YYZ, ABC YHZ YY1
<some ending record marker>
< some starting marker...startRecord:>
data
data
XY data
XY APPLE , XYZ, ABC YHZ XY1
<some ending record marker>
< some starting marker...startRecord:>
data
data
YY data
YY ORANGE, YYZ, ABC YHZ YY1
<some ending record marker>
< some starting marker...startRecord:>
data
data
XY data
XY ORANGE, XYZ, ABC YHZ XY1
<some ending record marker>
< some starting marker...startRecord:>
data
data
YZ data
YZ GRAPE , YZZ, ABC YHZ YZ1
<some ending record marker>
< some starting marker...startRecord:>
data
data
XZ data
XZ GRAPE , XZZ, ABC YHZ XZ1
<some ending record marker>

My Raw Stab: http://www.quantprinciple.com/invest/index.php/docs/tipsandtricks/perl-sed-awk/multi-line-conditional-duplicate/

+2  A: 

For 1:

while (<>) {
    print;                                  # original line (it still carries its newline)
    print if s/$pattern/$replacement/;      # modified copy, only if the substitution matched
}

Add file handles and other boilerplate as appropriate.

EDIT: Let's go for something a bit more general then.

First, we'll parse out our command-line arguments, and put our replacements into a hash:

my $filename = shift @ARGV;
my %patterns;
while (@ARGV) {
    my $pattern = shift @ARGV;
    my $replacement = shift @ARGV;
    $patterns{$pattern} = $replacement;
}
@ARGV = ($filename);    # so that <> below reads the file rather than STDIN

Then for each line in the file, we'll output the line verbatim, and then see if it matches any of our patterns.

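while (my $line = <>) {
    print $line;
    for my $pattern (sort keys %patterns) {
        next unless $line =~ /^$pattern/;
        # substitute on a copy so that one pattern's changes do not feed into the next
        (my $copy = $line) =~ s/$pattern/$patterns{$pattern}/g;
        print $copy;
    }
}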
Anon.
This is more elegant than what I am currently doing; however, it does not solve the "real" issue that I was posing. I want to programmatically determine both $pattern and $replacement based on the problem scope and requirements. With this solution, I will still need a complex conditional to handle YY -> XY, YH -> XH, etc.
Q Boiler
Using this in conjunction with answer 2 is highly effective. Thanks.
Q Boiler
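On the comment above about deriving $pattern and $replacement programmatically: if the symbols really are just YY, YH, YZ, Y1, Y2 and Y3 as the question states, one possible approach is to capture the leading symbol from each line and build the replacement from it, which needs only a single pass. The following is a minimal sketch under that assumption; the helper name x_variant and the sample lines are illustrative only and do not come from the linked scripts.

```perl
use strict;
use warnings;

# Hypothetical helper: return the "X" copy of a line, or undef when the
# line does not start with one of the assumed symbols (YY YH YZ Y1 Y2 Y3).
sub x_variant {
    my ($line) = @_;
    return unless $line =~ /^(Y([YHZ123]))\b/;
    my ($src, $dst) = ($1, "X$2");              # e.g. YY -> XY
    (my $copy = $line) =~ s/\Q$src\E/$dst/g;    # replace every occurrence of that symbol
    return $copy;
}

# Sample lines taken from the question; "data" has no leading symbol.
my @sample = (
    "YY BANANA, YYZ, ABC YHZ YY1\n",
    "YZ GRAPE , YZZ, ABC YHZ YZ1\n",
    "data\n",
);

for my $line (@sample) {
    print $line;
    my $copy = x_variant($line);
    print $copy if defined $copy;
}
```

For a real file, the for loop over @sample could be replaced with while (my $line = <>) so the script reads the file named on the command line.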
+1  A: 

If the end-of-record marker is the same for all records, you can set the $/ variable so that <FILE> will read in one record at a time.

$/ = "<some ending record marker>\n";   # input record separator
while (<FILE>) {
    print $_;
    # $_ is a multi-line string, so use the /m modifier (and /g to change every match)
    print $_ if s/$pattern/$replacement/mg;
}
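A self-contained sketch of this record-at-a-time idea, combined with the "duplicate only if the record has a line starting with the symbol" rule from the question. The function name duplicate_records, the in-memory filehandle, and the sample record are assumptions made so the example runs on its own; a real script would open the input file instead.

```perl
use strict;
use warnings;

# Hypothetical helper: emit every record, followed by a modified copy
# when some line inside the record starts with $src.
sub duplicate_records {
    my ($text, $end_marker, $src, $dst) = @_;
    my $out = '';
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = $end_marker;                      # <$fh> now returns one whole record
    while (my $record = <$fh>) {
        $out .= $record;
        # /m lets ^ match at the start of each line within the record
        if ($record =~ /^\Q$src\E\b/m) {
            (my $copy = $record) =~ s/\Q$src\E/$dst/g;
            $out .= $copy;
        }
    }
    close $fh;
    return $out;
}

my $end   = "<some ending record marker>\n";
my $input = "< some starting marker...startRecord:>\n"
          . "data\n"
          . "YY data\n"
          . "YY BANANA, YYZ, ABC YHZ YY1\n"
          . $end;

print duplicate_records($input, $end, 'YY', 'XY');
```

Note that the substitution runs over the whole record, so it would also change the symbol if it happened to appear inside the marker lines.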
mobrule
Embedding this in answer 1 makes the multi-line case work elegantly and effectively.
Q Boiler
+2  A: 

This will solve your 1st question:

use strict;
use warnings;

die "usage..." unless @ARGV == 3;
my ($file, $src, $dst) = @ARGV;

open my $fh, '<', $file or die "Can not open $file: $!";
while (<$fh>) {
    print;
    if (/^$src\b/) {
        s/$src/$dst/g;
        print;
    }
}
close $fh;

Looking at your linked scripts... you could easily convert your block comments to POD so that they effectively become a manpage for your code. Then you could use Pod::Usage to get usage info when the user does something stupid.

toolic
This is very powerful, and it worked elegantly for my first question, but combining answers 1 and 2 makes for a very useful and elegant Perl script. Thanks for the input...
Q Boiler
Thanks for the POD tip. That is very useful as even I will be forgetting usage in less than 1 year.
Q Boiler