Here is what I am trying to do:

I want to read a text file into an array of strings. I want the string to terminate when the file reads in a certain character (mainly ; or |).

For example, the following text

Would you; please
hand me| my coat?

would be put away like this:

$string[0] = 'Would you;';
$string[1] = ' please hand me|';
$string[2] = ' my coat?';

Could I get some help on something like this?

A: 

Something along the lines of

$text = <INPUTFILE>;

@string = split(/[;|]/, $text);

should do the trick more or less.

Edit: I've changed "/;!/" to "/[;|]/".

AAT
If you're going to do it that way, I think you have to use some form of "slurping" the file in, since <FILE> usually processes it line-by-line. See, for instance, http://www.perl.com/pub/a/2003/11/21/slurp.html
Devin Ceartas
Don't forget to do a `s/\n/ /g`.
RegDwight
That does not work at all like the OP wants. Give it a try. It splits on the 2-character sequence `;!`, which does not occur in the input.
toolic
@toolic - quite right, I've changed the pattern. Doesn't address the issue of returning the terminators but you've got to leave something for the questing mind to work on...
AAT
+3  A: 

One way is to inject another character, like \n, whenever your special character is found, then split on the \n:

use warnings;
use strict;
use Data::Dumper;

while (<DATA>) {
    chomp;
    s/([;|])/$1\n/g;
    my @string = split /\n/;
    print Dumper(\@string);
}

__DATA__
Would you; please hand me| my coat?

Prints out:

$VAR1 = [
          'Would you;',
          ' please hand me|',
          ' my coat?'
        ];

UPDATE: The original question posed by James showed the input text on a single line, as shown in __DATA__ above. Because the question was poorly formatted, others edited the question, breaking the 1 line into 2. Only James knows whether 1 or 2 lines was intended.

toolic
Nice answer. Note that in `s/([;|])/$1\n/g;` the parentheses capture the matched character and `$1` puts it back into the output, so the terminator is kept.
mctylr
This introduces \n as a third record separator.
darch
@darch: First, the only `\n` is removed with `chomp`. Second, `\n` is injected for every special character using `s///g`. Third, all injected `\n` are removed by `split`. If you see a problem with this method, please elaborate. It is one way to solve the problem posed in the original question.
toolic
The `\n` is chomped away. But by the time this has happened, you've already used it as a record separator with the `<>` on the previous line. To see exactly what I mean, insert a newline after `please` in your code and run the script. You'll notice that you wind up with four chunks rather than three.
darch
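darch's point can be checked with a small sketch. It assumes the two-line version of the input, held in an in-memory filehandle; the variable names ($input, $fh, @chunks) are illustrative and not taken from the answer above. It collects every chunk the line-by-line loop produces:

```perl
use strict;
use warnings;

# Two-line input, as in the edited question (assumed here for illustration).
my $input = "Would you; please\nhand me| my coat?\n";
open(my $fh, '<', \$input) or die $!;

my @chunks;
while (<$fh>) {
    chomp;
    s/([;|])/$1\n/g;            # mark the end of each terminated field
    push @chunks, split /\n/;   # the line break has already ended a record
}
close $fh;

print scalar(@chunks), " chunks\n";
print "[$_]\n" for @chunks;
```

With this input the loop yields four chunks, because ' please' and 'hand me|' come from different reads and are never joined.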
+1  A: 

I prefer @toolic's answer because it deals with multiple separators very easily.

However, if you wanted to overly complicate things, you could always try:

#!/usr/bin/perl

use strict; use warnings;

my @contents = ('');

while ( my $line = <DATA> ) {
    last unless $line =~ /\S/;
    $line =~ s{$/}{ };
    if ( $line =~ /^([^|;]+[|;])(.+)$/ ) {
        $contents[-1] .= $1;
        push @contents, $2;
    }
    else {
        $contents[-1] .= $line;   # no terminator on this line: append all of it
    }
}

print "[$_]\n" for @contents;

__DATA__
Would you; please
hand me| my coat?
Sinan Ünür
A: 

Let Perl do half the work for you by setting $/ (the input record separator) to vertical bar, and then extract semicolon-separated fields:

#!/usr/bin/perl

use warnings;
use strict;

my @string;

*ARGV = *DATA;

$/ = "|";
while (<>) {
  s/\n+$//;
  s/\n/ /g;
  push @string => $1 while s/^(.*;)//;
  push @string => $_;
}

for (my $i = 0; $i < @string; ++$i) {
  print "\$string[$i] = '$string[$i]';\n";
}

__DATA__
Would you; please
hand me| my coat?

Output:

$string[0] = 'Would you;';
$string[1] = ' please hand me|';
$string[2] = ' my coat?';
Greg Bacon
+1 Nice and effective way.
Hynek -Pichi- Vychodil
+6  A: 

This will do it. The trick to using split while preserving the token you're splitting on is to use a zero-width lookbehind assertion (a "lookback" match): split(/(?<=[;|])/, ...).

Note: mctylr's answer (currently the top rated) isn't actually correct -- it will split fields on newlines, b/c it only works on a single line of the file at a time.

gbacon's answer using the input record separator ($/) is quite clever--it's both space and time efficient--but I don't think I'd want to see it in production code. Putting one split token in the record separator and the other in the split strikes me as a little too unobvious (you have to fight that with Perl ...) which will make it hard to maintain. I'm also not sure why he's deleting multiple newlines (which I don't think you asked for?) and why he's doing that only for the end of '|'-terminated records.

# open file for reading, die with error message if it fails
open(my $fh, '<', 'data.txt') || die $!; 

# set file reading to slurp (whole file) mode (note that this affects all 
# file reads in this block)
local $/ = undef; 

my $string = <$fh>; 

# convert all newlines into spaces, not specified but as per example output
$string =~ s/\n/ /g; 

# split string on ; or |, using a zero-width lookback match (?<=) to preserve char
my (@strings) = split(/(?<=[;|])/, $string); 
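
A self-contained variant of the above, as a sketch only: it reads from an in-memory filehandle instead of data.txt (an assumption, so that it runs without an external file), trims the final newline first, and prints the result in the question's format.

```perl
use strict;
use warnings;

# Stand-in for data.txt: the sample text from the question (assumed content).
my $data = "Would you; please\nhand me| my coat?\n";
open(my $fh, '<', \$data) or die $!;

# Slurp the whole input; the do-block limits the change to $/.
my $string = do { local $/; <$fh> };
close $fh;

$string =~ s/\n+\z//;   # drop the final newline so no trailing space appears
$string =~ s/\n/ /g;    # remaining newlines become spaces

# Split after each ; or | without consuming it.
my @strings = split /(?<=[;|])/, $string;

print "\$string[$_] = '$strings[$_]';\n" for 0 .. $#strings;
```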
curveship
My solution (to which mctylr responded) is correct for the input which was provided in the original question. The question was later modified, changing the input. Also the question, in my opinion, is ambiguous: does the OP want a single array for the entire file, or an array for every line of the file? James should clarify.
toolic
Oops, sorry, toolic, you're right -- I meant your reply, not mctylr's! And I came in after the change was made to the input. One limitation of using a lookback match, that your code doesn't have: lookback matches have to be constant width. It works fine here -- ; and | are both just 1 char -- but if James has a token with a different width, say "//" that he also wants to split on, then your technique (replacing the varied tokens with a single one) will work better.
curveship
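
To illustrate the variable-width case, here is a sketch that also treats "//" as a terminator. That third token and the sample text are hypothetical; neither is in the question. It uses the marker-injection idea with "\0" as the marker, on the assumption that the text never contains a NUL byte:

```perl
use strict;
use warnings;

# Sample text with a two-character terminator "//" added (hypothetical input).
my $text = "Would you; please// hand me| my coat?";

# Append a marker after every terminator, whatever its width...
(my $marked = $text) =~ s{(//|[;|])}{$1\0}g;

# ...then split on the marker, which is always one character wide.
my @strings = split /\0/, $marked;

print "[$_]\n" for @strings;
```

Because the split pattern is only the marker, the terminators can be any mix of widths without running into the lookback restriction.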