ansaurus

Question

With Perl, how do I read records from a file with two possible record separators?

Answer 1

A:

Something along the lines of

$text = <INPUTFILE>;

@string = split(/[;!]/, $text);

should do the trick more or less.

Edit: I've changed "/;!/" to "/[;!]/".

AAT 2010-02-12 02:14:56

If you're going to do it that way, I think you have to use some form of "slurping" the file in, since <FILE> usually processes it line-by-line. See, for instance, http://www.perl.com/pub/a/2003/11/21/slurp.html

Devin Ceartas 2010-02-12 02:18:51
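
As a rough illustration of the slurping suggestion, here is a minimal sketch. It assumes the two-line sample input used in the other answers (with `;` and `|` as the separators, rather than the `!` in the pattern above) and reads it from an in-memory filehandle so that it runs without a data file; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical sample input, held in memory so the sketch runs without a file.
my $data = "Would you; please\nhand me| my coat?\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

# Undefining $/ inside the do block makes the read return the whole file.
my $text = do { local $/; <$fh> };
close $fh;

$text =~ s/\n+\z//;   # drop the trailing newline(s)
$text =~ s/\n/ /g;    # join the remaining lines with spaces

# Split on either separator; the separators themselves are discarded.
my @records = split /[;|]/, $text;
print "[$_]\n" for @records;
```

Because plain `split /[;|]/` throws the separators away, this only covers the reading half of the problem; keeping the terminators needs one of the techniques in the later answers.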

Don't forget to do a `s/\n/ /g`.

RegDwight 2010-02-12 02:19:13

That does not work at all like the OP wants. Give it a try. It splits on the 2-character sequence `;!`, which does not occur in the input.

toolic 2010-02-12 02:27:12

@toolic - quite right, I've changed the pattern. Doesn't address the issue of returning the terminators but you've got to leave something for the questing mind to work on...

AAT 2010-02-12 02:30:31

AAT 2010-02-12 02:36:29

Answer 2

+3 A:

One way is to inject another character, like \n, whenever your special character is found, then split on the \n:

use warnings;
use strict;
use Data::Dumper;

while (<DATA>) {
    chomp;
    s/([;|])/$1\n/g;
    my @string = split /\n/;
    print Dumper(\@string);
}

__DATA__
Would you; please hand me| my coat?

Prints out:

$VAR1 = [
          'Would you;',
          ' please hand me|',
          ' my coat?'
        ];

UPDATE: The original question posed by James showed the input text on a single line, as shown in __DATA__ above. Because the question was poorly formatted, others edited the question, breaking the 1 line into 2. Only James knows whether 1 or 2 lines was intended.

toolic 2010-02-12 02:17:34

Nice answer. Notice the line `s/([;|])/$1\n/g;`: the `$1` puts whatever the parentheses ("()") matched back into the output, so the separator character is kept.

mctylr 2010-02-12 02:40:37

This introduces \n as a third record separator.

darch 2010-02-12 17:52:03

@darch: First, the only `\n` is removed with `chomp`. Second, `\n` is injected for every special character using `s///g`. Third, all injected `\n` are removed by `split`. If you see a problem with this method, please elaborate. It is one way to solve the problem posed in the original question.

toolic 2010-02-12 18:02:51

The `\n` is chomped away. But by the time this has happened, you've already used it as a record separator with the `<>` on the previous line. To see exactly what I mean, insert a newline after `please` in your code and run the script. You'll notice that you wind up with four chunks rather than three.

darch 2010-02-12 22:40:49
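
The experiment described in that last comment can be sketched as follows. This is an adaptation of the Answer 2 loop, not the original script: it collects the pieces from every line into one array (`@chunks`, a name chosen here) and reads a hypothetical two-line input from an in-memory filehandle.

```perl
use strict;
use warnings;

# Hypothetical two-line input: a newline has been inserted after "please".
my $data = "Would you; please\nhand me| my coat?\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

my @chunks;
while (my $line = <$fh>) {             # <$fh> has already split on "\n" here
    chomp $line;
    $line =~ s/([;|])/$1\n/g;          # mark each separator with a newline
    push @chunks, split /\n/, $line;   # split on the injected newlines
}
close $fh;

print scalar(@chunks), " chunks\n";
print "[$_]\n" for @chunks;
```

With this input the loop yields four chunks, because `please` and `hand me|` arrive on separate reads and are never joined.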

Answer 3

+1 A:

I prefer @toolic's answer because it deals with multiple separators very easily.

However, if you wanted to overly complicate things, you could always try:

#!/usr/bin/perl

use strict; use warnings;

my @contents = ('');

while ( my $line = <DATA> ) {
    last unless $line =~ /\S/;
    $line =~ s{$/}{ };
    if ( $line =~ /^([^|;]+[|;])(.+)$/ ) {
        $contents[-1] .= $1;
        push @contents, $2;
    }
    else {
        $contents[-1] .= $line;    # no separator on this line: append all of it
    }
}

print "[$_]\n" for @contents;

__DATA__
Would you; please
hand me| my coat?

Sinan Ünür 2010-02-12 02:20:51

Answer 4

A:

Let Perl do half the work for you by setting $/ (the input record separator) to vertical bar, and then extract semicolon-separated fields:

#!/usr/bin/perl

use warnings;
use strict;

my @string;

*ARGV = *DATA;

$/ = "|";
while (<>) {
  s/\n+$//;
  s/\n/ /g;
  push @string => $1 while s/^(.*;)//;
  push @string => $_;
}

for (my $i = 0; $i < @string; ++$i) {
  print "\$string[$i] = '$string[$i]';\n";
}

__DATA__
Would you; please
hand me| my coat?

Output:

$string[0] = 'Would you;';
$string[1] = ' please hand me|';
$string[2] = ' my coat?';

Greg Bacon 2010-02-12 03:51:39

+1 Nice and effective way.

Hynek -Pichi- Vychodil 2010-02-12 09:56:48

Answer 5

+6 A:

This will do it. The trick to using split while preserving the token you're splitting on is to use a zero-width lookbehind match: split(/(?<=[;|])/, ...).

Note: mctylr's answer (currently the top rated) isn't actually correct -- it will split fields on newlines, b/c it only works on a single line of the file at a time.

gbacon's answer using the input record separator ($/) is quite clever--it's both space and time efficient--but I don't think I'd want to see it in production code. Putting one split token in the record separator and the other in the split strikes me as a little too unobvious (you have to fight that with Perl ...) which will make it hard to maintain. I'm also not sure why he's deleting multiple newlines (which I don't think you asked for?) and why he's doing that only for the end of '|'-terminated records.

# open file for reading, die with error message if it fails
open(my $fh, '<', 'data.txt') || die $!; 

# set file reading to slurp (whole file) mode (note that this affects all 
# file reads in this block)
local $/ = undef; 

my $string = <$fh>; 

# convert all newlines into spaces, not specified but as per example output
$string =~ s/\n/ /g; 

# split string on ; or |, using a zero-width lookbehind match (?<=) to preserve char
my (@strings) = split(/(?<=[;|])/, $string);

curveship 2010-02-12 05:08:40

My solution (to which mctylr responded) is correct for the input which was provided in the original question. The question was later modified, changing the input. Also the question, in my opinion, is ambiguous: does the OP want a single array for the entire file, or an array for every line of the file? James should clarify.

toolic 2010-02-12 13:34:52

Oops, sorry, toolic, you're right -- I meant your reply, not mctylr's! And I came in after the change was made to the input. One limitation of using a lookbehind match that your code doesn't have: lookbehind matches have to be constant width. It works fine here -- ; and | are both just 1 char -- but if James has a token with a different width, say "//", that he also wants to split on, then your technique (replacing the varied tokens with a single one) will work better.

curveship 2010-02-12 15:01:44
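
For separators of different widths, one possible alternative to the constant-width restriction mentioned above is to put the separators in a capturing group, so that `split` returns them as list elements, and then glue each one back onto the text before it. This is a different technique from either answer, shown only as a sketch; the extra `//` separator and the sample string are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical input that adds a two-character separator, "//".
my $string = "Would you; please hand me| my coat// now?";

# A capturing group makes split return the separators as list elements too.
my @parts = split m{(;|\||//)}, $string;

# Re-attach each separator to the text that precedes it.
my @records;
while (@parts) {
    my $text = shift @parts;
    my $sep  = @parts ? shift @parts : '';   # the last piece may have no separator
    push @records, $text . $sep;
}

print "[$_]\n" for @records;
```

The alternation can list tokens of any length, at the cost of an extra pass to reassemble the records.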
