views:

102

answers:

6

I have several text files, that were once tables in a database, which is now disassembled. I'm trying to reassemble them, which will be easy, once I get them into a usable form. The first file, "keys.text" is just a list of labels, inconsistently formatted. Like:

Sa 1 #
Sa 2
U 328 #*

It's always letter(s), [space], number(s), [space], and sometime symbol(s). The text files that match these keys are the same, then followed by a line of text, also separated, or delimited, by a SPACE.

Sa 1 # Random line of text follows.
Sa 2 This text is just as random.
U 328 #* Continuing text...

What I'm trying to do in the code below, is match the key from "keys.text", with the same key in the .txt files, and put a tab between the key, and the text. I'm sure I'm overlooking something very basic, but the result I'm getting, looks identical to the source .txt file.

Thanks in advance for any leads or assistance!

#!/usr/bin/perl

use strict;
use warnings;
use diagnostics;
open(IN1, "keys.text");

my $key;

# Read each line one at a time
while ($key = <IN1>) {

# For each txt file in the current directory
foreach my $file (<*.txt>) {
  open(IN, $file) or die("Cannot open TXT file for reading: $!");
  open(OUT, ">temp.txt") or die("Cannot open output file: $!");

  # Add temp modified file into directory 
  my $newFilename = "modified\/keyed_" . $file;
  my $line;

  # Read each line one at a time
  while ($line = <IN>) {

     $line =~ s/"\$key"/"\$key" . "\/t"/;
     print(OUT "$line");

  }
  rename("temp.txt", "$newFilename");
 }   
}

EDIT: Just to clarify, the results should retain the symbols from the keys as well, if there are any. So they'd look like:

Sa 1 #      Random line of text follows.
Sa 2        This text is just as random.
U 328 #*    Continuing text...
+1  A: 

The regex seems quoted rather oddly to me. Wouldn't

$line =~ s/$key/$key\t/;

work better?

Also, IIRC, <IN1> will leave the newline on the end of your $key. chomp $key to get rid of that.

And don't put parentheses around your print args, esp when you're writing to a file handle. It looks wrong, whether it is or not, and distracts people from the real problems.

cHao
A: 

if Perl is not a must, you can use this awk one liner

$ cat keys.txt
Sa 1 #
Sa 2
U 328 #*

$ cat mytext.txt
Sa 1 # Random line of text follows.
Sa 2 This text is just as random.
U 328 #* Continuing text...

$ awk 'FNR==NR{ k[$1 SEP $2];next }($1 SEP $2 in k) {$2=$2"\t"}1 ' keys.txt mytext.txt
Sa 1     # Random line of text follows.
Sa 2     This text is just as random.
U 328    #* Continuing text...
ghostdog74
A: 

Using split rather than s/// makes the problem straightforward. In the code below, read_keys extracts the keys from keys.text and records them in a hash.

Then for all files named on the command line, available in the special Perl array @ARGV, we inspect each line to see whether it begins with a key. If not, we leave it alone, but otherwise insert a TAB between the key and the text.

Note that we edit the files in-place thanks to Perl's handy -i option:

-i[extension]

specifies that files processed by the <> construct are to be edited in-place. It does this by renaming the input file, opening the output file by the original name, and selecting that output file as the default for print statements. The extension, if supplied, is used to modify the name of the old file to make a backup copy …

The line split " ", $_, 3 separates the current line into exactly three fields. This is necessary to protect whitespace that's likely to be present in the text portion of the line.

#! /usr/bin/perl -i.bak

use warnings;
use strict;

sub usage { "Usage: $0 text-file\n" }

sub read_keys {
  my $path = "keys.text";
  open my $fh, "<", $path
    or die "$0: open $path: $!";

  my %key;
  while (<$fh>) {
    my($text,$num) = split;
    ++$key{$text}{$num} if defined $text && defined $num;
  }

  wantarray ? %key : \%key;
}

die usage unless @ARGV;
my %key = read_keys;

while (<>) {
  my($text,$num,$line) = split " ", $_, 3;
  $_ = "$text $num\t$line" if defined $text &&
                              defined $num &&
                              $key{$text}{$num};
  print;
}

Sample run:

$ ./add-tab input

$ diff -u input.bak input
--- input.bak   2010-07-20 20:47:38.688916978 -0500
+++ input   2010-07-20 21:00:21.119531937 -0500
@@ -1,3 +1,3 @@
-Sa 1 # Random line of text follows.
-Sa 2 This text is just as random.
-U 328 #* Continuing text...
+Sa 1   # Random line of text follows.
+Sa 2   This text is just as random.
+U 328  #* Continuing text...
Greg Bacon
A: 

Fun answers:

$line =~ s/(?<=$key)/\t/;

Where (?<=XXXX) is a zero-width positive lookbehind for XXXX. That means it matches just after XXXX without being part of the match that gets substituted.

And:

$line =~ s/$key/$key . "\t"/e;

Where the /e flag at the end means to do one eval of what's in the second half of the s/// before filling it in.

Important note: I'm not recommending either of these, they obfuscate the program. But they're interesting. :-)

eruciform
A: 

How about doing two separate slurps of each file. For the first file you open the keys and create a preliminary hash. For the second file then all you need to do is add the text to the hash.

use strict;
use warnings;

my $keys_file = "path to keys.txt";
my $content_file = "path to content.txt";
my $output_file = "path to output.txt";

my %hash = ();

my $keys_regex = '^([a-zA-Z]+)\s*\(d+)\s*([^\da-zA-Z\s]+)';

open my $fh, '<', $keys_file or die "could not open $key_file";
while(<$fh>){
    my $line = $_;
    if ($line =~ /$keys_regex/){
        my $key = $1;
        my $number = $2;
        my $symbol = $3;
        $hash{$key}{'number'} = $number;
        $hash{$key}{'symbol'} = $symbol;
    }
}
close $fh;

open my $fh, '<', $content_file or die "could not open $content_file";
while(<$fh>){
    my $line = $_;
    if ($line =~ /^([a-zA-Z]+)/){
        my $key = $1;
// strip content_file line from keys/number/symbols to leave text
        line =~ s/^$key//;
        line =~ s/\s*$hash{$key}{'number'}//;
        line =~ s/\s*$hash{$key}{'symbol'}//;
        $line =~ s/^\s+//g;
        $hash{$key}{'text'} = $line;
    }
}
close $fh;

open my $fh, '>', $output_file or die "could not open $output_file";
for my $key (keys %hash){
    print $fh $key . " " . $hash{$key}{'number'} . " " . $hash{$key}{'symbol'} . "\t" . $hash{$key}{'text'} . "\n";
}
close $fh;

I haven't had a chance to test it yet and the solution seems a little hacky with all the regex but might give you an idea of something else you can try.

stocherilac
A: 

This looks like the perfect place for the map function in Perl! Read in the entire text file into an array, then apply the map function across the entire array. The only other thing you might want to do is use the quotemeta function to escape out any possible regular expressions in your keys.

Using map is very efficient. I also read the keys into an array in order to not have to keep opening and closing the keys file in my loop. It's an O^2 algorithm, but if your keys aren't that big, it shouldn't be too bad.

#! /usr/bin/env perl

use strict;
use vars;
use warnings;

open (KEYS, "keys.text")
    or die "Cannot open 'keys.text' for reading\n";
my @keys = <KEYS>;
close (KEYS);

foreach my $file (glob("*.txt")) {
    open (TEXT, "$file")
        or die "Cannot open '$file' for reading\n";
    my @textArray = <TEXT>;
    close (TEXT);

    foreach my $line (@keys) {
        chomp $line;
        map($_ =~ s/^$line/$line\t/, @textArray);
    }
    open (NEW_TEXT, ">$file.new") or
        die qq(Can't open file "$file" for writing\n);

    print TEXT join("\n", @textArray) . "\n";
close (TEXT);
}
David W.
Comment on my own answer:Instead of looping to write the file, you could do a join and write the whole thing at once.I'll edit my command to show.
David W.